Distributed Systems – Algorithms for Data Structures

algorithmsbig datadata structures

The hash table data structure can be easily spread across multiple machines with a simple algorithm to distribute the keys:

machine_to_query = item_key % machine_count

When you want to read and write key value pairs, you use the key to work out which machine stores the data, then you talk to that machine. If you want a count of the total number of items, you'd need to request the count from each server and add them up.

What algorithms exist for efficiently managing data structures where the data is partitioned across multiple machines? Distributed algorithms, not parallel algorithms.

How might something like a sorted array work in a distributed fashion? Efficiently.

Best Answer

I don't know about published books that have this kind of thing, but there are some real world examples you could look at. Scala has a http://www.scala-lang.org/api/current/index.html#scala.collection.parallel.immutable.package">Parallel Immutable Collections package. They have a few hash-backed things, but also a vector (implemented as a shallow tree - http://xuwei-k.github.com/scala-library-sxr/scala-library-2.10.0-M1/scala/collection/parallel/immutable/ParVector.scala.html">source code available) and a sequence.

I think collections are being rewritten in Java 8 as part of http://openjdk.java.net/projects/lambda/">Project Lambda so you could look into that too. I expected source code to be available somewhere, but I can't find it after a brief search. I think one key element (which I think you assume in your question) is that having a collection do its own concurrency management is a big win. Instead of iterating over collections externally where every user has to manage the concurrency, the collection does some kind of map() or reduce() operation where it is passed a function that operates on each item or filters items and the collection manages its concurrency internally.

I think most of these use a divide-and-conquer approach, sending the divisions to various processors. You might Google http://en.wikipedia.org/wiki/Amdahl%27s_law">Ahmdal's Law as a starting point because it governs the maximum possible performance gain from running any algorithm over multiple processors. Also map-reduce and big-data.

Related Topic