Java – Processing Large Files Concurrently

Tags: concurrency, file handling, java

So at a high level my use case is as follows –

I periodically (every 24 hours) receive a very large file (its size can vary
from a few MB to tens of GB) which I need to process within 24 hours. The
processing involves reading each record, applying some business logic and
updating a database with the record.

The current solution is a single-threaded version which:

  1. initially reads the entire file into memory, that is, it reads each line and constructs a POJO, so it essentially builds one big List
  2. then iterates over the List, applies the business logic to each POJO, and saves it to the database

This works for small files with fewer than 10 million records. But as the system scales we are getting more load, i.e. larger files (occasionally with more than 100 million records). In this scenario we see timeouts, that is, we are unable to process the entire file within 24 hours.

So I am planning to add some concurrency here.

A simple solution would be:

  1. Read the entire file into memory (creating a POJO for each record, as we do currently), or read each record one by one and create the POJO
  2. Spawn threads to concurrently process these POJOs.

This solution seems simple; the only downside I see is that the file parsing might take time, since it is single-threaded (RAM is not a concern, I use a fairly large EC2 instance). A rough sketch of what I have in mind is below.
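Something along these lines is what I am picturing; parseRecord and applyBusinessLogic are just placeholders for our actual parsing and business code:

```java
// Rough sketch of the "simple" approach: the main thread reads and parses
// each record, a fixed pool of workers applies the business logic.
// parseRecord and applyBusinessLogic stand in for our real code.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SimpleConcurrentProcessor {

    public static void main(String[] args) throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        try (BufferedReader reader = Files.newBufferedReader(Path.of(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                Object pojo = parseRecord(line);              // single-threaded parsing
                pool.submit(() -> applyBusinessLogic(pojo));  // concurrent processing
            }
        }

        pool.shutdown();
        pool.awaitTermination(24, TimeUnit.HOURS);
    }

    private static Object parseRecord(String line) { /* build the POJO */ return line; }

    private static void applyBusinessLogic(Object pojo) { /* update the database */ }
}
```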

Another solution would be to:

  1. Somehow break the file into multiple sub-files
  2. Process each file in parallel

This seems slightly complicated since I would have to break the file up into multiple smaller files.

Any input or suggestions on these approaches would be welcome.

Best Answer

The most likely efficient way to do this is:

  • Have a single thread that reads the input file. Hard disks are at their fastest when reading sequentially.
  • Do not read it into memory all at once! That is a huge waste of memory which could be put to much better use to speed up the processing!
  • Instead, have this single thread read a bundle of entries (maybe 100, maybe 1000; this is a tuning parameter) at a time and submit them to a thread to process (a sketch follows this list). If each line represents a record, the reading thread can defer all the parsing (other than looking for newlines) to the processing threads. But even if not, it is very unlikely that the parsing of records is your bottleneck.
  • Do the thread handling through a fixed-size thread pool; choose its size to be the number of CPU cores on the machine, or maybe a bit more.
  • If your database is an SQL database, make sure the individual threads access the database through a connection pool, do all their DB updates for one bundle of entries in a single transaction, and use batch inserts (the second sketch below shows this part).
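A minimal sketch of that layout, assuming each line is one record; the bundle size of 1,000 and the processBundle hook are illustrative, not prescriptive:

```java
// One reader thread groups raw lines into bundles and hands each bundle to a
// fixed-size worker pool. Parsing and DB work happen on the workers.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BundledFileProcessor {

    private static final int BUNDLE_SIZE = 1_000;   // tuning parameter

    public static void main(String[] args) throws IOException, InterruptedException {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        try (BufferedReader reader = Files.newBufferedReader(Path.of(args[0]))) {
            List<String> bundle = new ArrayList<>(BUNDLE_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                bundle.add(line);
                if (bundle.size() == BUNDLE_SIZE) {
                    submit(pool, bundle);                 // hand off a full bundle
                    bundle = new ArrayList<>(BUNDLE_SIZE);
                }
            }
            if (!bundle.isEmpty()) {
                submit(pool, bundle);                     // last, partial bundle
            }
        }

        pool.shutdown();
        pool.awaitTermination(24, TimeUnit.HOURS);
    }

    private static void submit(ExecutorService pool, List<String> bundle) {
        // Parsing is deferred to the worker; the reader only looked for newlines.
        pool.submit(() -> processBundle(bundle));
    }

    private static void processBundle(List<String> lines) {
        // Parse each line, apply the business logic, then write the whole
        // bundle to the database in one transaction (see the next sketch).
    }
}
```

Note that with the executor's default unbounded queue the reader can get far ahead of the workers; if that ever becomes a memory problem, a bounded queue or a Semaphore around submit keeps the number of in-flight bundles small.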
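And a sketch of the database side of each bundle, assuming a javax.sql.DataSource backed by a connection pool (e.g. HikariCP); the table, the columns and the Record type are made up for illustration:

```java
// One connection from the pool, one transaction, one batch per bundle.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class BundleWriter {

    private final DataSource dataSource;

    public BundleWriter(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Stand-in for the parsed POJO. */
    public record Record(long id, String payload) {}

    public void writeBundle(List<Record> bundle) throws SQLException {
        String sql = "INSERT INTO records (id, payload) VALUES (?, ?)";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false);           // one transaction per bundle
            for (Record r : bundle) {
                ps.setLong(1, r.id());
                ps.setString(2, r.payload());
                ps.addBatch();                   // queue, don't execute yet
            }
            ps.executeBatch();                   // send the whole bundle at once
            conn.commit();
        }
    }
}
```

Writing the whole bundle as one batch inside one transaction means one commit per 1,000 records instead of one per record, which is usually the difference between the database keeping up and not.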

You might want to use Spring Batch for this, as it will guide you towards doing the right thing. But it is somewhat overengineered and hard to use.

Keep in mind that all of this might still be futile if the DB becomes your bottleneck, which it very easily can be: SQL databases are notoriously bad at dealing with concurrent updates, and it might require quite a bit of fine-tuning to avoid lock contention and deadlocks.