Storing very large graphs on disk / streaming graph partitioning algorithms

graph, graph-databases, performance

Suppose that I have a very large undirected, unweighted graph (starting at hundreds of millions of vertices, ~10 edges per vertex), non-distributed and processed by a single thread only, and that I want to do breadth-first searches on it. I expect them to be I/O-bound, so I need a disk page layout that is good for BFS; disk space is not an issue. The searches can start at any vertex with equal probability. Intuitively, that means minimizing the number of edges between vertices on different disk pages, which is a graph partitioning problem.

The graph itself looks like spaghetti: think of a random set of points randomly interconnected, with some bias towards shorter edges.

The problem is, how does one partition a graph this large? The available graph partitioners I have found only work with graphs that fit into memory. I could not find descriptions or implementations of any streaming graph partitioning algorithms.

Or maybe there is an alternative to partitioning the graph for getting a disk layout that works well with BFS?

Right now, as an approximation, I use the fact that the vertices have spatial coordinates attached to them and put the vertices on disk in Hilbert sort order. This way spatially close vertices land on the same page, but the presence or absence of an edge between them is completely ignored. Can I do better?
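
For reference, a minimal sketch of that layout step, assuming 2D non-negative integer coordinates on a 2^order-by-2^order grid (the vertex tuples below are illustrative, not my real data format):

    def hilbert_index(n, x, y):
        """Map (x, y) on an n-by-n grid (n a power of two) to its Hilbert-curve index."""
        d = 0
        s = n // 2
        while s > 0:
            rx = 1 if (x & s) > 0 else 0
            ry = 1 if (y & s) > 0 else 0
            d += s * s * ((3 * rx) ^ ry)
            # Rotate/reflect the quadrant so the curve stays continuous.
            if ry == 0:
                if rx == 1:
                    x = n - 1 - x
                    y = n - 1 - y
                x, y = y, x
            s //= 2
        return d

    # Illustrative input: (vertex_id, x, y) tuples with coordinates below 2**16.
    vertices = [(0, 3, 7), (1, 60000, 42), (2, 5, 5)]
    n = 1 << 16
    layout = sorted(vertices, key=lambda v: hilbert_index(n, v[1], v[2]))
    # Vertices are then written to disk pages in 'layout' order.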

As an alternative, I could split the graph into pieces using the Hilbert sort order for vertices, partition the subgraphs, stitch them back together, and accept poor partitioning on the seams.
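
A sketch of that splitting idea, reusing the Hilbert order from above; partition_in_memory is only a stand-in for any in-memory partitioner (METIS, Chaco, ...), not a real API:

    def partition_in_chunks(vertices_in_hilbert_order, chunk_size, partition_in_memory):
        """Split the Hilbert-ordered vertex list into chunks that fit in memory,
        partition each chunk independently, and concatenate the page lists.
        Edges crossing chunk boundaries (the "seams") are simply ignored."""
        pages = []
        for start in range(0, len(vertices_in_hilbert_order), chunk_size):
            chunk = vertices_in_hilbert_order[start:start + chunk_size]
            pages.extend(partition_in_memory(chunk))  # -> list of page-sized vertex groups
        return pages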

Some things I have looked into already:

  1. How to store a large directed unweighted graph with billions of nodes and vertices
  2. http://neo4j.org/ – I found zero information on how it does graph layout on disk

Partitioning implementations (unless I'm mistaken, all of them need to fit the graph into memory):

  1. http://glaros.dtc.umn.edu/gkhome/views/metis
  2. http://www.sandia.gov/~bahendr/chaco.html
  3. http://staffweb.cms.gre.ac.uk/~c.walshaw/jostle/
  4. http://www.cerfacs.fr/algor/Softs/MESHPART/

EDIT: added info on what the graph looks like and that BFS can start anywhere.
EDIT: added the idea of partitioning subgraphs.

Best Answer

No algorithm truly needs to "fit into memory"--you can always page things in and out as needed. But you do want to avoid having the computation take unreasonably long--and global graph partitioning in the generic case is an NP-complete problem, which is "unreasonably long" for most problems that do not even fit in memory.

Fortunately, you want to do breadth-first searches, which means that you want a format where breadth-first is the easy computation. I don't know of any algorithms offhand that do this, but you can construct your own breadth-first layout if you're willing to allow a bit of extra disk space.

If the edges are not biased towards local interactions, then disentangling the graph will be difficult. If they are biased towards local interactions, then I suggest an algorithm like the following (a rough code sketch follows the list):

  • Pick a random set of vertices as starting points from throughout the entire data set.
  • For each vertex, collect all neighboring vertices (takes one sweep through the data set).
  • For each set of neighboring vertices, collect the set of neighbors-of-neighbors and rank them according to how many edges connect to them. If you don't have space in a page to store them all, keep the most-connected vertices. If you do have space to save them all, you may wish to throw away the least useful ones (e.g. if the ratio of (fraction of edges kept within a page) to (fraction of vertices needing storage) drops "too low"--where "too low" will depend on how much breadth your searches really need, and whether you can do any pruning and so on--then don't include those in the neighborhood).
  • Repeat the process of collecting and ranking neighbors until your neighborhood is full (e.g. it fills some page size that suits you). Then check for vertices that appear in more than one of the neighborhoods grown from the randomly chosen starts. If only a few vertices are shared, remove them from one neighborhood or the other, whichever breaks fewer edges. If many vertices are shared, keep the neighborhood with the better (vertices in neighborhood / broken edges) ratio and throw the other away.
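
Here is a rough Python sketch of that loop, under some assumptions: the graph is exposed through a neighbors(v) callback (in practice this would be the sweep through the on-disk edge list), a page holds page_size vertices, and the overlap handling between competing neighborhoods is simplified to skipping vertices that are already assigned:

    import random

    def grow_neighborhood(seed, neighbors, page_size, assigned):
        """Grow one page-sized neighborhood around a seed by repeatedly pulling in
        the frontier vertices with the most edges into the current set."""
        hood = {seed}
        frontier = set(neighbors(seed)) - assigned
        while frontier and len(hood) < page_size:
            # Rank frontier vertices by how many edges connect them to the neighborhood.
            ranked = sorted(frontier,
                            key=lambda v: sum(1 for u in neighbors(v) if u in hood),
                            reverse=True)
            for v in ranked:
                if len(hood) >= page_size:
                    break
                hood.add(v)
            # Next frontier: neighbors-of-neighbors not yet taken anywhere.
            frontier = {u for v in hood for u in neighbors(v)} - hood - assigned
        return hood

    def build_pages(all_vertices, neighbors, page_size, seeds_per_round):
        """Cover the graph with neighborhoods, one batch of random seeds at a time."""
        assigned = set()
        pages = []
        while len(assigned) < len(all_vertices):
            unassigned = [v for v in all_vertices if v not in assigned]
            for seed in random.sample(unassigned, min(seeds_per_round, len(unassigned))):
                if seed in assigned:
                    continue
                hood = grow_neighborhood(seed, neighbors, page_size, assigned)
                assigned |= hood
                pages.append(hood)
        return pages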

Now you have some local neighborhoods that are approximately locally optimal, in the sense that breadth-first searches tend to stay inside them. If your breadth-first search prunes off unproductive branches pretty effectively, then this is probably good enough. If not, you probably want adjacent neighborhoods to cluster.

If you don't need adjacent neighborhoods to cluster too much, you set aside the vertices you've grouped into neighborhoods and repeat the process on the remaining data until all vertices are accounted for. You change each vertex identifier to (vertex, neighborhood), and you're done: when following edges, you know exactly which page to grab, and most of them will be nearby given the construction.
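
As a sketch of what that relabelling buys: if every edge endpoint is stored as a (neighborhood, local_id) pair, the BFS always knows which page an edge leads to. load_page here is a hypothetical function that reads one neighborhood's page from disk.

    from collections import deque

    def bfs(start, load_page):
        """BFS over a paged graph. Vertex ids are (neighborhood, local_id) pairs;
        load_page(neighborhood) -> {local_id: [(neighborhood, local_id), ...]}."""
        visited = {start}
        queue = deque([start])
        cache = {}  # small page cache; by construction most edges stay on the current page
        while queue:
            hood, local = queue.popleft()
            if hood not in cache:
                cache[hood] = load_page(hood)  # the only disk I/O in the loop
            for nbr in cache[hood][local]:
                if nbr not in visited:
                    visited.add(nbr)
                    queue.append(nbr)
        return visited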

If you do need adjacent neighborhoods to cluster, then you'll need to keep track of your growing neighborhoods. You repeat the previous process (pick at random, grow neighborhoods), but now rank neighbors both by how many edges they satisfy within the neighborhood and by what fraction of their edges that leave the neighborhood land in an existing group. You might need weighting factors, but something like

score = (# edges within) - (# neighborhoods outside) - (# neighborhoodless edges outside)

would probably do the trick.
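
One possible reading of that score as code, assuming "# neighborhoods outside" counts the distinct existing neighborhoods the candidate's outgoing edges touch, with all weighting factors left at 1 (both are assumptions the formula doesn't pin down):

    def score(candidate, neighbors, current_hood, hood_of):
        """Score a candidate for inclusion in the neighborhood being grown.
        hood_of maps a vertex to its neighborhood id, or None if unassigned."""
        edges_within = 0
        outside_hoods = set()
        unassigned_outside = 0
        for u in neighbors(candidate):
            if u in current_hood:
                edges_within += 1
            elif hood_of.get(u) is not None:
                outside_hoods.add(hood_of[u])
            else:
                unassigned_outside += 1
        return edges_within - len(outside_hoods) - unassigned_outside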

Now, this is not globally or even locally optimal, but this or something very much like it should give a nicely locally-connected structure, and should let you produce a covering set of neighborhoods that have relatively high interconnectivity.

Again, it depends on whether your breadth-first search prunes branches or not. If it does, the inexpensive thing to do is to maximize local interconnectivity. If it doesn't, the thing to do is to minimize external connectivity--and in that case, I'd suggest just collecting breadth-first sets up to some size and saving those (with duplication at the edges of the sets--you're not badly limited by hard drive space, are you?).
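
A sketch of that last option: run a bounded BFS from each uncovered vertex, save the visited set as one page, and accept that vertices near page boundaries get duplicated across pages (trading disk space for fewer cross-page edges); all names are illustrative.

    from collections import deque

    def bfs_page(seed, neighbors, page_size):
        """Collect a breadth-first set of up to page_size vertices around seed."""
        page = {seed}
        queue = deque([seed])
        while queue and len(page) < page_size:
            v = queue.popleft()
            for u in neighbors(v):
                if u not in page:
                    page.add(u)
                    queue.append(u)
                    if len(page) >= page_size:
                        break
        return page

    def cover_with_bfs_pages(all_vertices, neighbors, page_size):
        """Give every vertex a home page; pages may overlap, which is the
        intentional duplication at the set boundaries."""
        covered = set()
        pages = []
        for v in all_vertices:
            if v not in covered:
                page = bfs_page(v, neighbors, page_size)
                pages.append(page)
                covered |= page
        return pages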
