Java – Hadoop and Object Reuse, Why

hadoop, java, performance

In Hadoop, the key and value objects passed to reducers are reused: the framework hands back the same instances on each iteration and only overwrites their contents. This is extremely surprising and hard to track down if you're not expecting it. Furthermore, the original issue-tracker entry for this "feature" doesn't offer any evidence that the change actually improved performance (unless I missed it).

It would speed up the system substantially if we reused the keys and values […] but I think it is worth doing.

This seems to run completely counter to this very popular answer. Is there any credence to the Hadoop developer's claim? Is there something "special" about Hadoop that would invalidate the notion that object creation is cheap?
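For context, here is a minimal sketch of the behaviour I mean. The `Reducer`, `Text`, and `Context` types are the standard Hadoop MapReduce API; the class name and the idea of collecting values into a list are made up purely for illustration:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer that buffers its values before writing them out.
public class CollectValuesReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<Text> seen = new ArrayList<>();
        for (Text value : values) {
            // The surprising part: Hadoop hands back the *same* Text instance
            // on every iteration and only changes its contents. Storing the
            // reference directly would leave `seen` full of duplicates of the
            // last value.
            // seen.add(value);          // broken: every element ends up equal
            seen.add(new Text(value));   // correct: make a defensive copy
        }
        for (Text value : seen) {
            context.write(key, value);
        }
    }
}
```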

Best Answer

If you read the article you linked, it says

Running a simple unit test on your desktop machine should highlight that creating 1x10^6 new String objects with random byte content is slower than using a single Text object and calling the set method to configure the underlying byte contents

Well, that is self-evident. Creating a million new String objects is always going to be slower than reusing a single mutable object and overwriting its contents (the same reason a StringBuilder beats repeated string concatenation); everyone knows that. But this may be a straw man; last time I checked, you still need an individual string for each key in a collection.
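To make that comparison concrete, a naive sketch of the kind of micro-benchmark the article alludes to. It assumes `org.apache.hadoop.io.Text` from hadoop-common is on the classpath, uses made-up buffer sizes, and does no JIT warm-up, so it only illustrates the shape of the test, not real numbers:

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

import org.apache.hadoop.io.Text;

public class AllocationSketch {
    public static void main(String[] args) {
        Random random = new Random(42);
        byte[] buffer = new byte[64];
        long sink = 0; // keeps the JIT from discarding the work entirely

        // Variant 1: one million fresh String objects with random contents.
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            random.nextBytes(buffer);
            sink += new String(buffer, StandardCharsets.UTF_8).length();
        }
        long freshNanos = System.nanoTime() - start;

        // Variant 2: a single Text instance, re-populated via set().
        Text reused = new Text();
        start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            random.nextBytes(buffer);
            reused.set(buffer);
            sink += reused.getLength();
        }
        long reusedNanos = System.nanoTime() - start;

        System.out.printf("new String: %d ms, Text.set: %d ms (sink=%d)%n",
                freshNanos / 1_000_000, reusedNanos / 1_000_000, sink);
    }
}
```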

If his argument is that allocating a million new strings to make a copy of the collection is expensive, well, yes it is. Strings are reference types, after all; you could just store references to the original strings.
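In plain Java terms, copying a collection of strings only copies references, not character data; a throwaway example (names are mine, not from the question):

```java
import java.util.ArrayList;
import java.util.List;

public class ReferenceCopy {
    public static void main(String[] args) {
        List<String> original = List.of("alpha", "beta", "gamma");

        // Copying the collection copies references only; no new String
        // objects (and no character data) are allocated here.
        List<String> copy = new ArrayList<>(original);

        System.out.println(copy.get(0) == original.get(0)); // true: same object
    }
}
```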

I guess we'll have to wait for him to complete his benchmarks.
