Java Multithreading – Multi-Processing Approach to Listing Word Occurrences in a Text File

java, map, multithreading

A classic problem: read the words from a text file and list the number of occurrences of each unique word in the file.

I solved the problem using a hash map, but how could the performance be improved? I tried reading multiple lines of the file from several threads, but reading itself becomes a bottleneck, and a shared HashMap risks race conditions. Using a ConcurrentHashMap would also create contention. What would be an ideal multithreaded approach?

Best Answer

Supposing you can efficiently split your file into blocks (for instance, groups of lines), you can assign some blocks to each thread and build a hashmap per thread. As soon as two threads have finished, you can merge their hashmaps into a single new hashmap (hashmaps under this merge operation form a monoid: merging is associative, with the empty map as identity), and proceed until you obtain a single final hashmap counting the words of the entire file.
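The idea above can be sketched with a plain ExecutorService: each task counts one block into its own private HashMap (no sharing, no locks), and the main thread merges the finished maps. The block contents here are hypothetical stand-ins for chunks of a real file.

```java
import java.util.*;
import java.util.concurrent.*;

public class WordCount {
    // Count the words of one block into a thread-local map.
    static Map<String, Long> countBlock(List<String> lines) {
        Map<String, Long> local = new HashMap<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty()) local.merge(word, 1L, Long::sum);
        return local;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input: two blocks standing in for file chunks.
        List<List<String>> blocks = List.of(
            List.of("the quick brown fox", "the lazy dog"),
            List.of("the fox jumps", "over the dog"));

        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<Map<String, Long>>> futures = new ArrayList<>();
        for (List<String> block : blocks)
            futures.add(pool.submit(() -> countBlock(block)));

        // Merge each per-thread map into the final result.
        Map<String, Long> total = new HashMap<>();
        for (Future<Map<String, Long>> f : futures)
            f.get().forEach((k, v) -> total.merge(k, v, Long::sum));
        pool.shutdown();

        System.out.println(total.get("the")); // "the" occurs 4 times
    }
}
```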

You have to tune some parameters to find the best tradeoff between fine granularity and efficiency: number of threads, number of blocks, etc.

A probably suboptimal but straightforward implementation is to wait until all the hashmaps are built, then merge them all at once. An unchecked attempt in Java 8:

Function<String, Map<String, Long>> countWords = block -> {
   Map<String, Long> ret = new HashMap<>();
   for (String word : block.split(" ")) {
      ret.merge(word, 1L, Long::sum);  // 1L: the map values are Long, not int
   }

   return ret;
};

BinaryOperator<Map<String, Long>> combine = (m1, m2) -> {
   Map<String, Long> m3 = new HashMap<>(m1);
   m2.forEach((k, v) -> m3.merge(k, v, Long::sum));

   return m3;
};

Stream<String> blocks = file.getLineBlocks().parallelStream();
Stream<Map<String, Long>> counts = blocks.map(countWords);
Map<String, Long> count = counts.reduce(new HashMap<>(), combine);
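If you are on Java 8 anyway, the map-then-merge pipeline can also be expressed with a built-in collector: Collectors.groupingByConcurrent accumulates all parallel substreams into one ConcurrentMap, so no explicit combine step is needed. A minimal sketch, using a hard-coded Stream in place of Files.lines(path) on a real file:

```java
import java.util.*;
import java.util.stream.*;

public class StreamWordCount {
    public static void main(String[] args) {
        // Sample lines standing in for the contents of a real file.
        Stream<String> lines = Stream.of("to be or not", "to be").parallel();

        // Split lines into words and count them concurrently.
        Map<String, Long> counts = lines
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .collect(Collectors.groupingByConcurrent(w -> w, Collectors.counting()));

        System.out.println(counts.get("to") + " " + counts.get("be"));
    }
}
```

Whether this beats the explicit block-merge version depends on the file size and the cost of splitting; for CPU-bound counting of a file that fits in memory, the collector is usually the simpler choice.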