In Hadoop, objects passed to reducers are reused. This is extremely surprising and hard to track down if you're not expecting it. What's more, the original issue tracker entry for this "feature" offers no evidence (that I could find) that the change actually improved performance:
> It would speed up the system substantially if we reused the keys and values […] but I think it is worth doing.
This seems completely counter to this very popular answer. Is there some credence to the Hadoop developer's claim? Is there something "special" about Hadoop that would invalidate the notion of object creation being cheap?
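To make the surprise concrete, here is a minimal sketch of the reuse pattern, outside Hadoop itself: a hypothetical iterator that hands back the same mutable holder object on every call to `next()`, the way Hadoop's reducer value iterator does. The `Holder` class and `reusingIterator` method are illustrative, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {
    // Stand-in for a Hadoop Writable: a mutable container that gets overwritten.
    static final class Holder {
        int value;
        Holder set(int v) { this.value = v; return this; }
    }

    // One shared object is mutated in place and returned on every next() call.
    static Iterator<Holder> reusingIterator(int[] data) {
        Holder shared = new Holder();
        return new Iterator<Holder>() {
            int i = 0;
            public boolean hasNext() { return i < data.length; }
            public Holder next() { return shared.set(data[i++]); }
        };
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};

        // Pitfall: storing the returned references keeps three pointers
        // to the SAME object, which now holds only the last value.
        List<Holder> stored = new ArrayList<>();
        Iterator<Holder> it = reusingIterator(data);
        while (it.hasNext()) stored.add(it.next());
        System.out.println(stored.get(0).value + " " + stored.get(1).value); // 3 3

        // Fix: copy the value out before the holder is mutated again.
        List<Integer> copied = new ArrayList<>();
        it = reusingIterator(data);
        while (it.hasNext()) copied.add(it.next().value);
        System.out.println(copied); // [1, 2, 3]
    }
}
```

If you append the values to a list without copying them, every element silently ends up equal to the last one, which is exactly the hard-to-track-down behavior described above.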
Best Answer
If you read the article you linked, it says
Well, that is self-evident. Creating a million new strings is always going to be slower than using a StringBuilder to manipulate a single string; everyone knows that. But this may be a straw man; last time I checked, you still needed an individual string for each key in a collection.
If his argument is that allocating a million new strings to make a copy of the collection is expensive, well, yes, it would be — except that you don't have to. Strings are reference types; copying the collection only copies references to the original strings.
I guess we'll have to wait for him to complete his benchmarks.