-
-
Notifications
You must be signed in to change notification settings - Fork 307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pgraph: Fix GraphCmp function #430
base: master
Are you sure you want to change the base?
Conversation
The previous algorithm would fail on the newly added tests due to the fact that it would check for duplicate vertices. The newer algorithm doesn't use any fancy isomorphism algorithms. It exploits the fact that nodes can be expected to have the same label in the two graphs being compared. This allows us to create a sorted list of vertices based on the label of a node and the labels of its neighbors. Once the two graphs are in sorted order, we simply use the vertexCmpFn on vertices at the same index, and run the edgeCmpFn on each of the vertices' respective edges. Resolves purpleidea#199
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very glad to see this, thanks!
Biggest issue is unrelated changes happening in the same commit. I'm generally pretty okay with this (most aren't) but the unofficial rule is that it's for cases when the patch doesn't touch critical or tricky code, or that it's trivial or easy to review. In this case, since there aren't too many tests in this space, and it's not an extremely well-understood part of the code, please split the unrelated changes up :)
Also, VertexSlice might already do some of the things you want. Not sure. This would definitely benefit from an improved doc comment on the complex methods.
Thanks!
@@ -197,16 +197,6 @@ func (g *Graph) FindEdge(v1, v2 Vertex) Edge { | |||
return edge | |||
} | |||
|
|||
// Vertices returns a randomly sorted slice of all vertices in the graph. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you move this function? (Look at the diff)
// EdgeSlice is a linear list of edges. It can be sorted. | ||
type EdgeSlice []Edge | ||
|
||
func (es EdgeSlice) Len() int { return len(es) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These need the standard comments for golint to pass. You can use the ones from the existing VertexSlice.
// VerticesSorted returns a sorted slice of all vertices in the graph. | ||
// The order is sorted by String() to avoid the non-determinism in the map type. | ||
func (g *Graph) VerticesSorted() []Vertex { | ||
var vertices []Vertex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this change is really necessary, but if it is, let's do it in a separate commit, since it has nothing to do with this commit which is not trivial, so we don't want to combine them if there's ever something we learn we need to revert or change in this commit.
@@ -246,17 +255,67 @@ func (vs VertexSlice) Less(i, j int) bool { return vs[i].String() < vs[j].String | |||
// Sort is a convenience method. | |||
func (vs VertexSlice) Sort() { sort.Sort(vs) } | |||
|
|||
// VertexIntenseSlice is a linear list of vertices. It can be sorted. | |||
type VertexIntenseSlice struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this "intense" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I couldn't think of a good name. Basically, it's intense because it sorts vertices based on the String()
value. If two vertices have the same value, then it looks at the labels of neighbors to try and differentiate them.
Suppose we have two vertices in a graph
----------
|v1 -> v2|
|v1 -> v3|
----------
Then the first v1
gets sorted because String(v2) < String(v3)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with my approach is, if we are to allow duplicates vertex names, what is to say one cannot create a graph entirely of the same vertex labels? Now the ordering imposed by one's vertex label and that of it's neighbors is not unique. So we can't impose any ordering on the two graphs, and this degrades into the traditional graph isomorphism problem.
I will try to write some tests to make it fail in this case. There are probably some other cases I'm not accounting for too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to double check where the vertex name comes from. Assuming I remember correctly that it's a combination of the res type string and the res name, then we're good. Although we might need to double check this in the engine when receiving the graph.
This actually goes back to an early design goal which I don't remember if I enforced with code yet: graphs with two compatible resources in the languages, should be allowed. I forget if this is allowed at the language level or engine level and where it should be taken care of :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so after some researching I realized I implemented a naive version of Weisfeiler-Lehman, checking only the first hop (or a node's immediate neighbors). The good news is these algorithms, while not perfect, tend to do well in practical circumstances.
I've included psoudecode below of the intended approach, so we can agree before I write code this time. It will hopefully aid in understanding and make the review process much easier.
def create_hashes(G) ->
Gi.hash_label <- G.label foreach Gi
foreach Gi in sort_by_hash_labels(G):
foreach neigh in Gi:
neigh.hash_label <- hash(Gi.hash_label + neigh.hash_label)
def isomorphic(G1, G2) ->
if count(G1.vertices) ≠ count(G2.vertices) then ret false
if count(G1.edges) ≠ count(G2.edges) then ret false
hG1, hG2 <- create_hashes(G1), create_hashes(G2)
sG1, sG2 <- sort_by_hash_labels(hG1), sort_by_hash_labels(hG2)
foreach (Gi, Gj) in (sG1, sG2):
if Gi.hash_label ≠ G.hash_label then ret false
ret true
Since every neighbor gets a hash assignment based on its neighbor's labels, we get unique hashes that span the entire reachable graph from each node. If interested, read more here and here. As the first article mentions, this approach can fail on highly symmetric graphs, but it has the benefit of being much more understandable and quite fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can fail on highly symmetric graphs
This sounds like a blocker right there... :/ If it always worked on DAG's, then that's okay, however the article mentions chains as being problematic too... What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure. Note that the highly symmetric implies a large majority of the graph consists of duplicate labels. If we are modeling resources, it's not often we'll see a large amount of duplicates, but I may be wrong.
The second article claims infinitesimally small collision likelihood, and the author ran it on over 10k samples. The algorithm is a little more involved than my pseudocode, but still much less complex and much faster than any of the exponential algorithms that guarantee isomorphism in all cases. I am still happy to implement vf2. If we could get some edge case graphs of what a resource could realistically look like, I think the algorithm in the second article might work well. If it doesn't meet our needs, then we use one of the more involved algorithms. It's also important to note that the runtime of vf2 is horrible since isomorphism is NP, and if we are comparing resource graphs to check if ought to be reapplied, in some cases simply reapplying the resource could be faster than checking for isomorphism. Those are my thoughts. I'll leave the choice of algorithm to your discretion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I did a bunch of research, worked on some tests, and hopefully have some answers. Sorry I wasn't better prepared earlier. One of the downsides of trying to "move fast" with few resources on a new project :P
-
We can't choose any algorithm that could fail even if that's an infinitesimal probability. It's just not clean.
-
Even if something is NP-complete, doesn't mean that it won't be able to complete the execution in a reasonable amount of time for the size of graphs that we have. Whether our graphs will have 100's, 1000's, 10,000's of vertices, I don't know, but I'm optimistic that we can perform the needed computations fast enough.
-
As you might have already seen, there are multiple DAG's in the code base. (I count 3 atm.) The function graph, and the resource graph are the two main ones that we need to be careful about.
In the resource graph (that the engine runs) we currently allow duplicate resources, but this is a bug. We should consider either error-ing there and/or de-duplicating equivalent resources. This is an issue that can be seen if we use the yaml
front-end eg:
---
graph: mygraph
resources:
file:
- name: file0
path: "/tmp/f0"
content: |
i am f0
state: exists
- name: file0
path: "/tmp/f0"
content: |
i am f1
state: exists
edges: []
This shouldn't work, but it does :/ Badly. They fight. Oops.
This doesn't happen in mcl
code, because it's clever enough to de-duplicate output!
The function graph however can have identical (same .String()
values, different pointers of course) vertices. We then ultimately run interpret
on that AST, and get the resource graph. At this stage we ensure that we don't have duplicate resources.
So it actually seems that GraphCmp
is very useful for tests (where we compare a function graph with identical vertices) but not actually used anywhere in the production code. The most similar function to GraphCmp
is GraphSync
, which fortunately doesn't need to deal with duplicate vertices since this is only used in the engine at the moment where this is supposed to be unique :)
So this patch is useful, but only for tests atm. I'm going to update #199 with some more info. Sorry about being a disorganized maintainer here. If you'd like to update it with the fixes, I'd love to merge it, but it's probably not worth spending days on.
} | ||
|
||
func (vs VertexIntenseSlice) Swap(i, j int) { | ||
v := vs.vertices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to shorten, if you really want to, then rename the struct field (it's private anyways)
} | ||
|
||
// Labels are equal, time to check neighbor labels | ||
neighSet1 := vs.g.GraphVertices(v[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're doing this, then I'm assuming you have a graph in the "intense" struct, which matches the vertex list stored outside as well? So a redundant copy?? Which makes me think this whole thing should just be a method on Graph instead? Or perhaps I'm misunderstanding the idea here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No copies. It iterates over vertices and uses neighbors to compare, as described above. the neighSet
just has the neighbors of vertex v[i]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But you have two versions of Vertexes:
- In the struct as field vs
- In the struct as field graph.adjacency (which can be accessed by methods if need be)
func (g *Graph) GraphCmp(graph *Graph, vertexCmpFn func(Vertex, Vertex) (bool, error), edgeCmpFn func(Edge, Edge) (bool, error)) error { | ||
if graph == nil || g == nil { | ||
if graph != g { | ||
func (g *Graph) GraphCmp(graphOther *Graph, vertexCmpFn func(Vertex, Vertex) (bool, error), edgeCmpFn func(Edge, Edge) (bool, error)) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary rename, especially if it happens in the same commit.
return fmt.Errorf("base graph has %d edges, while input graph has %d", e1, e2) | ||
} | ||
|
||
var m = make(map[Vertex]Vertex) // g to graph vertex correspondence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The churn in this function will be easier for me to review, once unrelated changes are split into separate commits (or unnecessary renaming is done in a later patch in the future.)
7e7feab
to
16d7c6a
Compare
9726b48
to
01d4226
Compare
2319f4f
to
04f5ba6
Compare
fe2313c
to
271a94e
Compare
The previous algorithm would fail on the newly added tests due to the
fact that it would check for duplicate vertices. The newer algorithm
doesn't use any fancy isomorphism algorithms. It exploits the fact that
nodes can be expected to have the same label in the two graphs being
compared. This allows us to create a sorted list of vertices based on
the label of a node and the labels of its neighbors.
Once the two graphs are in sorted order, we simply use the vertexCmpFn
on vertices at the same index, and run the edgeCmpFn on each of the
vertices' respective edges.
Resolves #199