Design – How is intermediate data organized in MapReduce

design, functional programming

From what I understand, each mapper outputs an intermediate file. The intermediate data (data contained in each intermediate file) is then sorted by key.

Then, a reducer is assigned a key by the master. The reducer reads from the intermediate file containing the key and then calls reduce using the data it has read.

But in detail, how is the intermediate data organized? Can the data corresponding to a key be held in multiple intermediate files? What happens when there is too much data corresponding to one key to be held by a single file?

In short, how do intermediate partitions differ from intermediate files and how are these differences dealt with in the implementation?

Best Answer

You might have missed one step. MapReduce works like this:

Map -> Shuffle -> Sort -> Reduce

More details about how the Shuffle & Sort step works: https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort

But in detail, how is the intermediate data organized?

It's just the output of the mapper: the key/value pairs it emitted. The actual shuffle and sort is a separate step.
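As a rough illustration of how one mapper's output might be laid out, here is a minimal in-memory sketch. It assumes a hash partitioner, so each mapper splits its output into one bucket per reducer and keeps every bucket sorted by key. The names `partition_for`, `run_mapper` and `word_count_map` are made up for this example and are not part of any MapReduce API; real implementations write these buckets to local disk rather than holding them in a dict.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_reducers: int) -> int:
    """Pick the reducer for a key (hypothetical helper).

    crc32 is used instead of Python's built-in hash() because hash() of a
    str is randomized per process, and every mapper must agree on the result.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_mapper(records, map_fn, num_reducers):
    """Run map_fn over this mapper's input split.

    Returns {partition number: list of (key, value) pairs sorted by key}.
    """
    partitions = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            partitions[partition_for(key, num_reducers)].append((key, value))
    # Sorting each bucket here lets the reduce side do a cheap merge later.
    return {p: sorted(pairs) for p, pairs in partitions.items()}


def word_count_map(line):
    """Example map function: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1


if __name__ == "__main__":
    print(run_mapper(["a b a", "b c"], word_count_map, num_reducers=2))
```

Under this assumption an "intermediate partition" is the slice of a mapper's output destined for one reducer, while an "intermediate file" is simply where a mapper stores its slices.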

Can the data corresponding to a key be held in multiple intermediate files?

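Yes. Different mappers can each produce data for the same key, so that data can be spread across several intermediate files. Each reducer is still guaranteed to get all the data for a key, because the shuffle step routes every record with that key to the same reducer and the sort step then groups those records together.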
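The reduce side of that guarantee can be sketched as follows. This is a simplified, in-memory illustration, not how any particular framework implements it: it assumes each mapper's output is a dict from partition number to a key-sorted list of pairs, and `run_reducer` is a hypothetical name. The reducer takes its partition from every mapper, merges the sorted runs, and groups by key.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def run_reducer(mapper_outputs, partition, reduce_fn):
    """Merge one partition from every mapper and reduce it key by key.

    mapper_outputs: list of {partition number: sorted list of (key, value)}.
    """
    # "Shuffle": collect this reducer's slice from each mapper's output.
    runs = [output.get(partition, []) for output in mapper_outputs]
    # "Sort": the runs are already sorted, so a lazy k-way merge is enough.
    merged = heapq.merge(*runs)
    results = []
    for key, group in groupby(merged, key=itemgetter(0)):
        # Values are handed over as a lazy iterator, so a key with many
        # values does not have to be materialized as a list first.
        values = (value for _, value in group)
        results.append((key, reduce_fn(key, values)))
    return results


# Two mappers both emitted "apple" into partition 0.
mapper_a = {0: [("apple", 1), ("cat", 1)], 1: [("bee", 1)]}
mapper_b = {0: [("apple", 1), ("apple", 1)], 1: [("bee", 1), ("dog", 1)]}

print(run_reducer([mapper_a, mapper_b], 0, lambda key, values: sum(values)))
# prints [('apple', 3), ('cat', 1)]
```

Because the merge and the grouping are both lazy, the values for a key are streamed to `reduce_fn` rather than collected up front, which is one reason a key spread over many files is not a problem in itself.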

What happens when there is too much data corresponding to one key to be held by a single file?

It would not actually matter much. Each mapper gets a limited subset of the input, and each mapper's output is roughly proportional in size to its input, so no single intermediate file has to hold all the data for a key.
