Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pgraph: Fix GraphCmp function #430

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kkuehlz
Copy link
Contributor

@kkuehlz kkuehlz commented Dec 24, 2018

The previous algorithm would fail on the newly added tests due to the
fact that it would check for duplicate vertices. The newer algorithm
doesn't use any fancy isomorphism algorithms. It exploits the fact that
nodes can be expected to have the same label in the two graphs being
compared. This allows us to create a sorted list of vertices based on
the label of a node and the labels of its neighbors.

Once the two graphs are in sorted order, we simply use the vertexCmpFn
on vertices at the same index, and run the edgeCmpFn on each of the
vertices' respective edges.

Resolves #199

The previous algorithm would fail on the newly added tests due to the
fact that it would check for duplicate vertices. The newer algorithm
doesn't use any fancy isomorphism algorithms. It exploits the fact that
nodes can be expected to have the same label in the two graphs being
compared. This allows us to create a sorted list of vertices based on
the label of a node and the labels of its neighbors.

Once the two graphs are in sorted order, we simply use the vertexCmpFn
on vertices at the same index, and run the edgeCmpFn on each of the
vertices' respective edges.

Resolves purpleidea#199
Copy link
Owner

@purpleidea purpleidea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very glad to see this, thanks!

Biggest issue is unrelated changes happening in the same commit. I'm generally pretty okay with this (most aren't) but the unofficial rule is that it's for cases when the patch doesn't touch critical or tricky code, or that it's trivial or easy to review. In this case, since there aren't too many tests in this space, and it's not an extremely well-understood part of the code, please split the unrelated changes up :)

Also, VertexSlice might already do some of the things you want. Not sure. This would definitely benefit from an improved doc comment on the complex methods.

Thanks!

@@ -197,16 +197,6 @@ func (g *Graph) FindEdge(v1, v2 Vertex) Edge {
return edge
}

// Vertices returns a randomly sorted slice of all vertices in the graph.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you move this function? (Look at the diff)

// EdgeSlice is a linear list of edges. It can be sorted.
type EdgeSlice []Edge

func (es EdgeSlice) Len() int { return len(es) }
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need the standard comments for golint to pass. You can use the ones from the existing VertexSlice.

// VerticesSorted returns a sorted slice of all vertices in the graph.
// The order is sorted by String() to avoid the non-determinism in the map type.
func (g *Graph) VerticesSorted() []Vertex {
var vertices []Vertex
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this change is really necessary, but if it is, let's do it in a separate commit, since it has nothing to do with this commit which is not trivial, so we don't want to combine them if there's ever something we learn we need to revert or change in this commit.

@@ -246,17 +255,67 @@ func (vs VertexSlice) Less(i, j int) bool { return vs[i].String() < vs[j].String
// Sort is a convenience method.
func (vs VertexSlice) Sort() { sort.Sort(vs) }

// VertexIntenseSlice is a linear list of vertices. It can be sorted.
type VertexIntenseSlice struct {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this "intense" ?

Copy link
Contributor Author

@kkuehlz kkuehlz Dec 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I couldn't think of a good name. Basically, it's intense because it sorts vertices based on the String() value. If two vertices have the same value, then it looks at the labels of neighbors to try and differentiate them.

Suppose we have two vertices in a graph

----------
|v1 -> v2|
|v1 -> v3|
----------

Then the first v1 gets sorted because String(v2) < String(v3).

Copy link
Contributor Author

@kkuehlz kkuehlz Dec 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with my approach is, if we are to allow duplicates vertex names, what is to say one cannot create a graph entirely of the same vertex labels? Now the ordering imposed by one's vertex label and that of it's neighbors is not unique. So we can't impose any ordering on the two graphs, and this degrades into the traditional graph isomorphism problem.

I will try to write some tests to make it fail in this case. There are probably some other cases I'm not accounting for too.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to double check where the vertex name comes from. Assuming I remember correctly that it's a combination of the res type string and the res name, then we're good. Although we might need to double check this in the engine when receiving the graph.

This actually goes back to an early design goal which I don't remember if I enforced with code yet: graphs with two compatible resources in the languages, should be allowed. I forget if this is allowed at the language level or engine level and where it should be taken care of :/

Copy link
Contributor Author

@kkuehlz kkuehlz Dec 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so after some researching I realized I implemented a naive version of Weisfeiler-Lehman, checking only the first hop (or a node's immediate neighbors). The good news is these algorithms, while not perfect, tend to do well in practical circumstances.

I've included psoudecode below of the intended approach, so we can agree before I write code this time. It will hopefully aid in understanding and make the review process much easier.

def create_hashes(G) ->
  Gi.hash_label <- G.label foreach Gi
  
  foreach Gi in sort_by_hash_labels(G):
    foreach neigh in Gi:
      neigh.hash_label <- hash(Gi.hash_label + neigh.hash_label)

def isomorphic(G1, G2) ->
  if count(G1.vertices) ≠ count(G2.vertices) then ret false
  if count(G1.edges) ≠ count(G2.edges) then ret false

  hG1, hG2 <- create_hashes(G1), create_hashes(G2)  
  sG1, sG2 <- sort_by_hash_labels(hG1), sort_by_hash_labels(hG2)

  foreach (Gi, Gj) in (sG1, sG2):
    if Gi.hash_labelG.hash_label then ret false
  ret true

Since every neighbor gets a hash assignment based on its neighbor's labels, we get unique hashes that span the entire reachable graph from each node. If interested, read more here and here. As the first article mentions, this approach can fail on highly symmetric graphs, but it has the benefit of being much more understandable and quite fast.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can fail on highly symmetric graphs

This sounds like a blocker right there... :/ If it always worked on DAG's, then that's okay, however the article mentions chains as being problematic too... What do you think?

Copy link
Contributor Author

@kkuehlz kkuehlz Dec 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure. Note that the highly symmetric implies a large majority of the graph consists of duplicate labels. If we are modeling resources, it's not often we'll see a large amount of duplicates, but I may be wrong.

The second article claims infinitesimally small collision likelihood, and the author ran it on over 10k samples. The algorithm is a little more involved than my pseudocode, but still much less complex and much faster than any of the exponential algorithms that guarantee isomorphism in all cases. I am still happy to implement vf2. If we could get some edge case graphs of what a resource could realistically look like, I think the algorithm in the second article might work well. If it doesn't meet our needs, then we use one of the more involved algorithms. It's also important to note that the runtime of vf2 is horrible since isomorphism is NP, and if we are comparing resource graphs to check if ought to be reapplied, in some cases simply reapplying the resource could be faster than checking for isomorphism. Those are my thoughts. I'll leave the choice of algorithm to your discretion.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I did a bunch of research, worked on some tests, and hopefully have some answers. Sorry I wasn't better prepared earlier. One of the downsides of trying to "move fast" with few resources on a new project :P

  1. We can't choose any algorithm that could fail even if that's an infinitesimal probability. It's just not clean.

  2. Even if something is NP-complete, doesn't mean that it won't be able to complete the execution in a reasonable amount of time for the size of graphs that we have. Whether our graphs will have 100's, 1000's, 10,000's of vertices, I don't know, but I'm optimistic that we can perform the needed computations fast enough.

  3. As you might have already seen, there are multiple DAG's in the code base. (I count 3 atm.) The function graph, and the resource graph are the two main ones that we need to be careful about.

In the resource graph (that the engine runs) we currently allow duplicate resources, but this is a bug. We should consider either error-ing there and/or de-duplicating equivalent resources. This is an issue that can be seen if we use the yaml front-end eg:

---
graph: mygraph
resources:
  file:
  - name: file0
    path: "/tmp/f0"
    content: |
      i am f0
    state: exists
  - name: file0
    path: "/tmp/f0"
    content: |
      i am f1
    state: exists
edges: []

This shouldn't work, but it does :/ Badly. They fight. Oops.

This doesn't happen in mcl code, because it's clever enough to de-duplicate output!

The function graph however can have identical (same .String() values, different pointers of course) vertices. We then ultimately run interpret on that AST, and get the resource graph. At this stage we ensure that we don't have duplicate resources.

So it actually seems that GraphCmp is very useful for tests (where we compare a function graph with identical vertices) but not actually used anywhere in the production code. The most similar function to GraphCmp is GraphSync, which fortunately doesn't need to deal with duplicate vertices since this is only used in the engine at the moment where this is supposed to be unique :)

So this patch is useful, but only for tests atm. I'm going to update #199 with some more info. Sorry about being a disorganized maintainer here. If you'd like to update it with the fixes, I'd love to merge it, but it's probably not worth spending days on.

}

func (vs VertexIntenseSlice) Swap(i, j int) {
v := vs.vertices
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to shorten, if you really want to, then rename the struct field (it's private anyways)

}

// Labels are equal, time to check neighbor labels
neighSet1 := vs.g.GraphVertices(v[i])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're doing this, then I'm assuming you have a graph in the "intense" struct, which matches the vertex list stored outside as well? So a redundant copy?? Which makes me think this whole thing should just be a method on Graph instead? Or perhaps I'm misunderstanding the idea here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No copies. It iterates over vertices and uses neighbors to compare, as described above. the neighSet just has the neighbors of vertex v[i].

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you have two versions of Vertexes:

  1. In the struct as field vs
  2. In the struct as field graph.adjacency (which can be accessed by methods if need be)

func (g *Graph) GraphCmp(graph *Graph, vertexCmpFn func(Vertex, Vertex) (bool, error), edgeCmpFn func(Edge, Edge) (bool, error)) error {
if graph == nil || g == nil {
if graph != g {
func (g *Graph) GraphCmp(graphOther *Graph, vertexCmpFn func(Vertex, Vertex) (bool, error), edgeCmpFn func(Edge, Edge) (bool, error)) error {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary rename, especially if it happens in the same commit.

return fmt.Errorf("base graph has %d edges, while input graph has %d", e1, e2)
}

var m = make(map[Vertex]Vertex) // g to graph vertex correspondence
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The churn in this function will be easier for me to review, once unrelated changes are split into separate commits (or unnecessary renaming is done in a later patch in the future.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[LOVE] Implement graph isomorphism
2 participants