Algorithm Execution – How and Where to Run Algorithms on Large Datasets?

algorithms, cloud computing, large-scale-project

I would like to run the PageRank algorithm on a graph with 4 000 000 nodes and around 45 000 000 edges.

Currently I use the neo4j graph database and a classic relational database (Postgres), and for software projects I mostly use C# and Java.

Does anyone know the best way to perform a PageRank computation on such a graph? Is there a way to modify the PageRank algorithm so it can run on a home computer or a single server (48 GB RAM), or is there a useful cloud service I could push the data and the algorithm to and then retrieve the results from?

The project is still at the research stage, so if I do use a cloud service, I would prefer a provider that doesn't require much administration and setup work; I'd rather focus on just running the algorithm once and getting the results, without much administrative overhead.

Best Answer

The nice thing about the PageRank algorithm is that it can be solved iteratively in a distributed way within the MapReduce framework. However, the working data for PageRank on ~5M nodes and ~50M edges should fit comfortably in 4 GB of RAM, never mind 48 GB.

Specifically, you don't need to keep all of the data for each web page in memory -- instead, digest your input database so that the working data for the PageRank solver refers to nodes by index. Even with no particular effort at optimization, each node should take no more than 32 bytes and each edge no more than 16 bytes (roughly 4M × 32 B ≈ 128 MB for nodes plus 45M × 16 B ≈ 720 MB for edges), for an in-memory footprint of well under 1 GB.
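As a rough illustration of that "digest" step (a sketch, not from the original answer): assign every external node ID a dense integer index the first time you see it, for example with a hash map, and store all edges in flat arrays keyed by those indices. The 64-bit ID type and the helper name index_of are assumptions; adapt them to your schema.

#include <unordered_map>
#include <cstdint>

// external node ID -> dense index (assumed 64-bit IDs)
std::unordered_map<std::int64_t, int> node_index;

int index_of(std::int64_t external_id) {
    auto it = node_index.find(external_id);
    if (it != node_index.end()) return it->second;  // already seen
    int idx = (int)node_index.size();               // next free dense index
    node_index.emplace(external_id, idx);
    return idx;
}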

A demonstration example of the in-memory data structure itself, in C++/STL:

#include <vector>
#include <algorithm>
#include <cassert>

std::vector<float> old_rank, new_rank;   // rankings for each node
std::vector<int> end_edge;               // index one past the final edge of each node
std::vector<int> edge_dest;              // destination node index for each edge
std::vector<float> edge_weight;          // fractional weight for each edge

...

void pagerank_iteration(float base_value, float scale_value) {
    std::fill(new_rank.begin(), new_rank.end(), 0.0f);  // reset accumulators
    int edge = 0;  // current edge index; edges are stored grouped by source node
    for (int node = 0; node < (int)end_edge.size(); ++node) {  // loop over source nodes
        while (edge < end_edge[node]) {                        // loop over edges of current node
            int dest_node = edge_dest[edge];
            new_rank[dest_node] += edge_weight[edge] * old_rank[node];
            ++edge;
        }
    }
    assert(edge == (int)edge_dest.size());  // every edge visited exactly once

    for (int node = 0; node < (int)new_rank.size(); ++node) {  // add scale/offset
        new_rank[node] = base_value + scale_value * new_rank[node];
    }
}
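For completeness, a driver loop around pagerank_iteration() might look roughly like the sketch below. With edge weights set to 1/out-degree, base_value is the teleport term (1 - d)/N and scale_value is the damping factor d. The damping factor of 0.85, the tolerance, and the iteration cap are illustrative assumptions, and the sketch glosses over dangling nodes (pages with no outgoing edges).

#include <cmath>    // std::fabs
#include <utility>  // std::swap

void run_pagerank(int num_nodes) {
    const float damping = 0.85f;                             // assumed damping factor d
    const float base_value = (1.0f - damping) / num_nodes;   // teleport term (1 - d) / N
    old_rank.assign(num_nodes, 1.0f / num_nodes);            // start from the uniform distribution
    new_rank.assign(num_nodes, 0.0f);

    for (int iter = 0; iter < 100; ++iter) {                 // iteration cap (assumption)
        pagerank_iteration(base_value, damping);
        float delta = 0.0f;                                  // total L1 change this iteration
        for (int node = 0; node < num_nodes; ++node)
            delta += std::fabs(new_rank[node] - old_rank[node]);
        std::swap(old_rank, new_rank);                       // new ranks feed the next pass
        if (delta < 1e-6f) break;                            // converged (tolerance is an assumption)
    }
    // after the loop, the most recent ranks are in old_rank
}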

Running on a single PC is easier than running on a cloud service, because you don't need a network-capable framework (although it might be a good idea to keep that possibility in mind). A problem of the relatively small size you describe can be solved easily with an ad-hoc single-threaded program, and if you want more cores you can either run Hadoop locally or roll your own MapReduce using threads or inter-process communication.
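If you do want more than one core without pulling in a framework, one simple approach (a sketch under the same global arrays as above, not part of the original answer) is to give each thread a contiguous block of source nodes and a private accumulator, then sum the accumulators at the end; this avoids data races on new_rank. The even block partitioning and the explicit thread count are assumptions.

#include <thread>

void parallel_iteration(float base_value, float scale_value, int num_threads) {
    const int num_nodes = (int)old_rank.size();
    // one private accumulator per thread, so no locking is needed during the scatter
    std::vector<std::vector<float>> partial(num_threads,
                                            std::vector<float>(num_nodes, 0.0f));
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t]() {
            int begin = (int)((long long)num_nodes * t / num_threads);       // this thread's node range
            int end   = (int)((long long)num_nodes * (t + 1) / num_threads);
            int edge  = (begin == 0) ? 0 : end_edge[begin - 1];              // first edge of the range
            for (int node = begin; node < end; ++node) {
                while (edge < end_edge[node]) {
                    partial[t][edge_dest[edge]] += edge_weight[edge] * old_rank[node];
                    ++edge;
                }
            }
        });
    }
    for (auto& w : workers) w.join();

    for (int node = 0; node < num_nodes; ++node) {   // reduce, then apply base/scale as before
        float sum = 0.0f;
        for (int t = 0; t < num_threads; ++t) sum += partial[t][node];
        new_rank[node] = base_value + scale_value * sum;
    }
}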
