Ruby File Handling – Multithreading a CSV with Output

file handling, ruby

I have a Ruby script that has maxed out one core of my server's Xeon processor for the last two hours. Since it is using only one of the four available cores, I want to rewrite it to take advantage of all four.

I can use the .each_slice(n) method on the array that contains my data, but I'm curious what the best/most efficient way to write this data to a file would be. It seems that I have a few options:

  1. Pass the file object to the functions called from Thread.new (I assume this is legal in Ruby?) and have them write as they see fit.

  2. Have each function store the results in arrays, return the array on completion and let the main program then write to the disk.

  3. Pass the same array object to each thread and have them all append to it, then sort it.

My assumption is that although option 2 would probably be more memory intensive, it will be by far the most efficient method. Is there a different way to accomplish what I'm doing?

Best Answer

You didn't mention whether the order of the output matters. If it really doesn't matter, I'd use a single thread-safe Queue to receive results from multiple reader/processor threads. A separate thread would then be in charge of reading from the Queue and writing to the output file.
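The queue-plus-writer arrangement could look roughly like the sketch below. It is only an illustration under assumptions: `process_row`, `write_unordered`, the worker count, and the sample data are hypothetical stand-ins, not part of the original script. Keep in mind that on CRuby the global VM lock keeps threads from running Ruby code in parallel, so CPU-bound work may not speed up there; the pattern itself is the same on other implementations such as JRuby.

```ruby
require "csv"
require "tempfile"

# Hypothetical stand-in for the real per-row work.
def process_row(row)
  row.map { |value| value * 2 }
end

# Processes `rows` in worker threads and funnels every result through one
# Queue to a single writer thread. Output order is NOT guaranteed.
def write_unordered(rows, path, worker_count: 4)
  queue = Queue.new # thread-safe FIFO from Ruby's core library

  writer = Thread.new do
    CSV.open(path, "w") do |csv|
      # Queue#pop returns nil once the queue is closed and drained.
      while (result = queue.pop)
        csv << result
      end
    end
  end

  slice_size = [(rows.size / worker_count.to_f).ceil, 1].max
  workers = rows.each_slice(slice_size).map do |slice|
    Thread.new do
      slice.each { |row| queue << process_row(row) }
    end
  end

  workers.each(&:join)
  queue.close # tells the writer that no more results are coming
  writer.join
end

# Sample data, assumed for illustration only.
rows = (1..20).map { |i| [i, i + 1] }
out = Tempfile.new(["unordered", ".csv"])
write_unordered(rows, out.path)
```

Because only the writer thread ever touches the file, no explicit locking is needed around the output.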

If order matters, then your option #2 seems like a good idea.
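A minimal sketch of option #2 follows, assuming a hypothetical `process_row` in place of the real work and made-up sample data. Each thread returns its own array, and the main program collects the arrays in slice order before writing, so the output order matches the input order.

```ruby
require "csv"
require "tempfile"

# Hypothetical stand-in for the real per-row work.
def process_row(row)
  row.map { |value| value * 2 }
end

# Splits `rows` into slices, processes each slice in its own thread, and
# returns the results in the original input order.
def process_in_order(rows, worker_count: 4)
  slice_size = [(rows.size / worker_count.to_f).ceil, 1].max
  threads = rows.each_slice(slice_size).map do |slice|
    Thread.new { slice.map { |row| process_row(row) } }
  end
  # Thread#value joins the thread and returns the result of its block,
  # so iterating the threads in creation order preserves row order.
  threads.flat_map(&:value)
end

# Sample data, assumed for illustration only.
rows = (1..20).map { |i| [i, i + 1] }
results = process_in_order(rows)

# Only the main thread writes, so no locking is needed around the file.
out = Tempfile.new(["ordered", ".csv"])
CSV.open(out.path, "w") do |csv|
  results.each { |result| csv << result }
end
```

The cost is the one the question anticipates: all results stay in memory until every thread has finished.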

My concern with your options #1 and #3 is that, unless you are very careful with the locking around the shared output object (file pointer or array), you end up in 'undefined behavior' territory.
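If option #1 is used anyway, the locking in question might look like the sketch below: every write to the shared file goes through one Mutex. The names `process_row` and `write_with_lock` and the sample data are assumptions made for illustration.

```ruby
require "csv"
require "tempfile"

# Hypothetical stand-in for the real per-row work.
def process_row(row)
  row.map { |value| value * 2 }
end

# Lets every worker thread write to the same File object, with a Mutex
# ensuring that only one thread writes at a time. Order is NOT guaranteed.
def write_with_lock(rows, path, worker_count: 4)
  lock = Mutex.new
  slice_size = [(rows.size / worker_count.to_f).ceil, 1].max

  File.open(path, "w") do |file|
    threads = rows.each_slice(slice_size).map do |slice|
      Thread.new do
        slice.each do |row|
          # Build the full CSV line outside the lock; hold the lock
          # only for the write itself.
          line = CSV.generate_line(process_row(row))
          lock.synchronize { file.write(line) }
        end
      end
    end
    # Join inside the block so the file stays open until all writes finish.
    threads.each(&:join)
  end
end

# Sample data, assumed for illustration only.
rows = (1..20).map { |i| [i, i + 1] }
out = Tempfile.new(["locked", ".csv"])
write_with_lock(rows, out.path)
```

Writing one complete line per locked section keeps rows from interleaving mid-line, though the rows themselves can still arrive in any order.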
