pgraph: Fix GraphCmp function #430

kkuehlz · 2018-12-24T14:07:50Z

The previous algorithm would fail on the newly added tests due to the
fact that it would check for duplicate vertices. The newer algorithm
doesn't use any fancy isomorphism algorithms. It exploits the fact that
nodes can be expected to have the same label in the two graphs being
compared. This allows us to create a sorted list of vertices based on
the label of a node and the labels of its neighbors.

Once the two graphs are in sorted order, we simply use the vertexCmpFn
on vertices at the same index, and run the edgeCmpFn on each of the
vertices' respective edges.

Resolves #199

The previous algorithm would fail on the newly added tests due to the fact that it would check for duplicate vertices. The newer algorithm doesn't use any fancy isomorphism algorithms. It exploits the fact that nodes can be expected to have the same label in the two graphs being compared. This allows us to create a sorted list of vertices based on the label of a node and the labels of its neighbors. Once the two graphs are in sorted order, we simply use the vertexCmpFn on vertices at the same index, and run the edgeCmpFn on each of the vertices' respective edges. Resolves purpleidea#199

purpleidea

Very glad to see this, thanks!

Biggest issue is unrelated changes happening in the same commit. I'm generally pretty okay with this (most aren't) but the unofficial rule is that it's for cases when the patch doesn't touch critical or tricky code, or that it's trivial or easy to review. In this case, since there aren't too many tests in this space, and it's not an extremely well-understood part of the code, please split the unrelated changes up :)

Also, VertexSlice might already do some of the things you want. Not sure. This would definitely benefit from an improved doc comment on the complex methods.

Thanks!

purpleidea · 2018-12-26T19:33:27Z

pgraph/pgraph.go

@@ -197,16 +197,6 @@ func (g *Graph) FindEdge(v1, v2 Vertex) Edge {
 return edge
 }

-// Vertices returns a randomly sorted slice of all vertices in the graph.


Why did you move this function? (Look at the diff)

purpleidea · 2018-12-26T19:34:41Z

pgraph/pgraph.go

+// EdgeSlice is a linear list of edges. It can be sorted.
+type EdgeSlice []Edge
+
+func (es EdgeSlice) Len() int { return len(es) }


These need the standard comments for golint to pass. You can use the ones from the existing VertexSlice.

purpleidea · 2018-12-26T19:37:21Z

pgraph/pgraph.go

 // VerticesSorted returns a sorted slice of all vertices in the graph.
 // The order is sorted by String() to avoid the non-determinism in the map type.
 func (g *Graph) VerticesSorted() []Vertex {
- var vertices []Vertex


Not sure if this change is really necessary, but if it is, let's do it in a separate commit, since it has nothing to do with this commit which is not trivial, so we don't want to combine them if there's ever something we learn we need to revert or change in this commit.

purpleidea · 2018-12-26T19:38:01Z

pgraph/pgraph.go

@@ -246,17 +255,67 @@ func (vs VertexSlice) Less(i, j int) bool { return vs[i].String() < vs[j].String
 // Sort is a convenience method.
 func (vs VertexSlice) Sort() { sort.Sort(vs) }

+// VertexIntenseSlice is a linear list of vertices. It can be sorted.
+type VertexIntenseSlice struct {


Why is this "intense" ?

Sorry, I couldn't think of a good name. Basically, it's intense because it sorts vertices based on the String() value. If two vertices have the same value, then it looks at the labels of neighbors to try and differentiate them.

Suppose we have two vertices in a graph

---------- |v1 -> v2| |v1 -> v3| ----------

Then the first v1 gets sorted because String(v2) < String(v3).

The issue with my approach is, if we are to allow duplicates vertex names, what is to say one cannot create a graph entirely of the same vertex labels? Now the ordering imposed by one's vertex label and that of it's neighbors is not unique. So we can't impose any ordering on the two graphs, and this degrades into the traditional graph isomorphism problem.

I will try to write some tests to make it fail in this case. There are probably some other cases I'm not accounting for too.

I need to double check where the vertex name comes from. Assuming I remember correctly that it's a combination of the res type string and the res name, then we're good. Although we might need to double check this in the engine when receiving the graph.

This actually goes back to an early design goal which I don't remember if I enforced with code yet: graphs with two compatible resources in the languages, should be allowed. I forget if this is allowed at the language level or engine level and where it should be taken care of :/

Okay, so after some researching I realized I implemented a naive version of Weisfeiler-Lehman, checking only the first hop (or a node's immediate neighbors). The good news is these algorithms, while not perfect, tend to do well in practical circumstances.

I've included psoudecode below of the intended approach, so we can agree before I write code this time. It will hopefully aid in understanding and make the review process much easier.

def create_hashes(G) -> Gi.hash_label <- G.label foreach Gi foreach Gi in sort_by_hash_labels(G): foreach neigh in Gi: neigh.hash_label <- hash(Gi.hash_label + neigh.hash_label) def isomorphic(G1, G2) -> if count(G1.vertices) ≠ count(G2.vertices) then ret false if count(G1.edges) ≠ count(G2.edges) then ret false hG1, hG2 <- create_hashes(G1), create_hashes(G2) sG1, sG2 <- sort_by_hash_labels(hG1), sort_by_hash_labels(hG2) foreach (Gi, Gj) in (sG1, sG2): if Gi.hash_label ≠ G.hash_label then ret false ret true

Since every neighbor gets a hash assignment based on its neighbor's labels, we get unique hashes that span the entire reachable graph from each node. If interested, read more here and here. As the first article mentions, this approach can fail on highly symmetric graphs, but it has the benefit of being much more understandable and quite fast.

can fail on highly symmetric graphs

This sounds like a blocker right there... :/ If it always worked on DAG's, then that's okay, however the article mentions chains as being problematic too... What do you think?

Unsure. Note that the highly symmetric implies a large majority of the graph consists of duplicate labels. If we are modeling resources, it's not often we'll see a large amount of duplicates, but I may be wrong.

The second article claims infinitesimally small collision likelihood, and the author ran it on over 10k samples. The algorithm is a little more involved than my pseudocode, but still much less complex and much faster than any of the exponential algorithms that guarantee isomorphism in all cases. I am still happy to implement vf2. If we could get some edge case graphs of what a resource could realistically look like, I think the algorithm in the second article might work well. If it doesn't meet our needs, then we use one of the more involved algorithms. It's also important to note that the runtime of vf2 is horrible since isomorphism is NP, and if we are comparing resource graphs to check if ought to be reapplied, in some cases simply reapplying the resource could be faster than checking for isomorphism. Those are my thoughts. I'll leave the choice of algorithm to your discretion.

Okay I did a bunch of research, worked on some tests, and hopefully have some answers. Sorry I wasn't better prepared earlier. One of the downsides of trying to "move fast" with few resources on a new project :P

We can't choose any algorithm that could fail even if that's an infinitesimal probability. It's just not clean.

Even if something is NP-complete, doesn't mean that it won't be able to complete the execution in a reasonable amount of time for the size of graphs that we have. Whether our graphs will have 100's, 1000's, 10,000's of vertices, I don't know, but I'm optimistic that we can perform the needed computations fast enough.

As you might have already seen, there are multiple DAG's in the code base. (I count 3 atm.) The function graph, and the resource graph are the two main ones that we need to be careful about.

In the resource graph (that the engine runs) we currently allow duplicate resources, but this is a bug. We should consider either error-ing there and/or de-duplicating equivalent resources. This is an issue that can be seen if we use the yaml front-end eg:

--- graph: mygraph resources: file: - name: file0 path: "/tmp/f0" content: | i am f0 state: exists - name: file0 path: "/tmp/f0" content: | i am f1 state: exists edges: []

This shouldn't work, but it does :/ Badly. They fight. Oops.

This doesn't happen in mcl code, because it's clever enough to de-duplicate output!

The function graph however can have identical (same .String() values, different pointers of course) vertices. We then ultimately run interpret on that AST, and get the resource graph. At this stage we ensure that we don't have duplicate resources.

So it actually seems that GraphCmp is very useful for tests (where we compare a function graph with identical vertices) but not actually used anywhere in the production code. The most similar function to GraphCmp is GraphSync, which fortunately doesn't need to deal with duplicate vertices since this is only used in the engine at the moment where this is supposed to be unique :)

So this patch is useful, but only for tests atm. I'm going to update #199 with some more info. Sorry about being a disorganized maintainer here. If you'd like to update it with the fixes, I'd love to merge it, but it's probably not worth spending days on.

purpleidea · 2018-12-26T19:38:26Z

pgraph/pgraph.go

+}
+
+func (vs VertexIntenseSlice) Swap(i, j int) {
+ v := vs.vertices


no need to shorten, if you really want to, then rename the struct field (it's private anyways)

purpleidea · 2018-12-26T19:40:28Z

pgraph/pgraph.go

+ }
+
+ // Labels are equal, time to check neighbor labels
+ neighSet1 := vs.g.GraphVertices(v[i])


If you're doing this, then I'm assuming you have a graph in the "intense" struct, which matches the vertex list stored outside as well? So a redundant copy?? Which makes me think this whole thing should just be a method on Graph instead? Or perhaps I'm misunderstanding the idea here...

No copies. It iterates over vertices and uses neighbors to compare, as described above. the neighSet just has the neighbors of vertex v[i].

But you have two versions of Vertexes:

In the struct as field vs

In the struct as field graph.adjacency (which can be accessed by methods if need be)

purpleidea · 2018-12-26T19:41:06Z

pgraph/pgraph.go

-func (g *Graph) GraphCmp(graph *Graph, vertexCmpFn func(Vertex, Vertex) (bool, error), edgeCmpFn func(Edge, Edge) (bool, error)) error {
- if graph == nil || g == nil {
- if graph != g {
+func (g *Graph) GraphCmp(graphOther *Graph, vertexCmpFn func(Vertex, Vertex) (bool, error), edgeCmpFn func(Edge, Edge) (bool, error)) error {


Unnecessary rename, especially if it happens in the same commit.

purpleidea · 2018-12-26T19:43:08Z

pgraph/pgraph.go

 return fmt.Errorf("base graph has %d edges, while input graph has %d", e1, e2)
 }

- var m = make(map[Vertex]Vertex) // g to graph vertex correspondence


The churn in this function will be easier for me to review, once unrelated changes are split into separate commits (or unnecessary renaming is done in a later patch in the future.)

kkuehlz force-pushed the graphcmp branch from 8db28e1 to be71e62 Compare December 24, 2018 14:10

kkuehlz mentioned this pull request Dec 26, 2018

[LOVE] Add a cpu count fact #398

Closed

purpleidea requested changes Dec 26, 2018

View reviewed changes

purpleidea force-pushed the master branch 2 times, most recently from 7e7feab to 16d7c6a Compare February 15, 2019 14:27

purpleidea force-pushed the master branch 2 times, most recently from 9726b48 to 01d4226 Compare May 6, 2019 11:56

purpleidea force-pushed the master branch from 73f82c8 to 4d18741 Compare May 4, 2021 09:30

purpleidea force-pushed the master branch 4 times, most recently from 2319f4f to 04f5ba6 Compare September 12, 2022 01:44

purpleidea force-pushed the master branch 4 times, most recently from fe2313c to 271a94e Compare December 4, 2023 02:35

purpleidea force-pushed the master branch from b08136d to 78e59a9 Compare January 22, 2024 20:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pgraph: Fix GraphCmp function #430

pgraph: Fix GraphCmp function #430

kkuehlz commented Dec 24, 2018

purpleidea left a comment

purpleidea Dec 26, 2018

purpleidea Dec 26, 2018

purpleidea Dec 26, 2018

purpleidea Dec 26, 2018

kkuehlz Dec 26, 2018 •

edited

Loading

kkuehlz Dec 26, 2018 •

edited

Loading

purpleidea Dec 26, 2018

kkuehlz Dec 27, 2018 •

edited

Loading

purpleidea Dec 27, 2018

kkuehlz Dec 27, 2018 •

edited

Loading

purpleidea Dec 29, 2018

purpleidea Dec 26, 2018

purpleidea Dec 26, 2018

kkuehlz Dec 26, 2018

purpleidea Dec 26, 2018

purpleidea Dec 26, 2018

purpleidea Dec 26, 2018

pgraph: Fix GraphCmp function #430

Are you sure you want to change the base?

pgraph: Fix GraphCmp function #430

Conversation

kkuehlz commented Dec 24, 2018

purpleidea left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kkuehlz Dec 26, 2018 • edited Loading

Choose a reason for hiding this comment

kkuehlz Dec 26, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kkuehlz Dec 27, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kkuehlz Dec 27, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kkuehlz Dec 26, 2018 •

edited

Loading

kkuehlz Dec 26, 2018 •

edited

Loading

kkuehlz Dec 27, 2018 •

edited

Loading

kkuehlz Dec 27, 2018 •

edited

Loading