So at a high level my use case is as follows –
I periodically (every 24 hours) receive a very large file (its size can vary
from MBs to tens of GBs) which I need to process within 24 hours. The
processing involves reading a record, applying some business logic and
updating a database with the record.
The current solution is a single-threaded version which
- initially reads the entire file into memory, that is, it reads each line and constructs a POJO, so it essentially builds one big List
- then iterates over the List, applies the business logic to each POJO, and saves them to the database
This works for small files with fewer than 10 million records. But as the systems scale we are getting more load, i.e. larger files (occasionally with more than 100 million records). In this scenario we see timeouts; that is, we are unable to process the entire file within 24 hours.
So I am planning to add some concurrency here.
A simple solution would be to –
- Read the entire file into memory (creating a POJO for each record, as we do currently), or read and parse one record at a time
- Spawn threads to process these POJOs concurrently
This solution seems simple; the only downside I see is that parsing the file might take time, since it is single-threaded (RAM is not a concern, as I use a fairly big EC2 instance).
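The first approach can avoid the single big List entirely: one thread streams the file line by line and hands records to a pool of workers through a bounded queue. The sketch below assumes a generic `Consumer<String>` stands in for the business logic and POJO parsing; names and pool sizes are illustrative, not a recommendation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Sketch: a single reader thread streams records to a fixed worker pool,
// so the whole file never has to sit in memory at once.
public class ConcurrentFileProcessor {

    public static void processFile(Path file, int workers, Consumer<String> recordHandler)
            throws IOException, InterruptedException {
        // A bounded queue applies back-pressure: when workers fall behind,
        // CallerRunsPolicy makes the reader thread process a record itself
        // instead of filling the heap with pending tasks.
        ExecutorService pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(10_000),
                new ThreadPoolExecutor.CallerRunsPolicy());

        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String record = line; // parse the line into your POJO here
                pool.submit(() -> recordHandler.accept(record));
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }
    }
}
```

With this shape the single-threaded parse and the concurrent processing overlap, so parsing time is hidden behind the slower database work rather than preceding it.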
Another solution would be to –
- Somehow break the file into multiple sub-files
- Process each file in parallel
This seems slightly more complicated, since I would have to split the file up into multiple smaller files.
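The split step itself is a single sequential pass; something like the following sketch, where the chunk size and file naming are illustrative assumptions:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch: stream the big file once and start a new sub-file every
// `linesPerChunk` records, so each chunk can then be processed in parallel.
public class FileSplitter {

    public static List<Path> split(Path input, Path outDir, int linesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line = reader.readLine();
            int chunkNo = 0;
            while (line != null) {
                Path chunk = outDir.resolve("chunk-" + chunkNo++ + ".txt");
                try (BufferedWriter writer = Files.newBufferedWriter(chunk)) {
                    // Copy up to linesPerChunk lines into the current sub-file.
                    for (int i = 0; i < linesPerChunk && line != null; i++) {
                        writer.write(line);
                        writer.newLine();
                        line = reader.readLine();
                    }
                }
                chunks.add(chunk);
            }
        }
        return chunks;
    }
}
```

Note that the split pass reads and rewrites the whole file once, so it adds I/O that the streaming approach avoids.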
Any input or suggestions on these approaches would be welcome.
Best Answer
The most likely efficient way to do this is a pipeline: read and parse the file on a single thread, process the resulting records on a pool of worker threads, and write to the database in batches.
You might want to use Spring Batch for this, as it will guide you towards doing the right thing, but it is somewhat over-engineered and can be hard to use.
Keep in mind that all of this might still be futile if the database becomes your bottleneck, which it very easily can: SQL databases are notoriously bad at dealing with concurrent updates, and it may take quite a bit of fine-tuning to avoid lock contention and deadlocks.
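One way to sidestep a lot of that contention is to route records to workers by a hash of their primary key, so a given row is only ever updated by one thread and two workers never block each other on the same lock. The record keys and the partitioning helper below are illustrative assumptions, not part of the original design.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: deterministic key-based partitioning so each worker owns a
// disjoint set of rows, reducing row-lock contention and deadlocks.
public class KeyPartitioner {

    /** Maps a record key to one of `workers` partitions, always the same one. */
    public static int partitionFor(String key, int workers) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(key.hashCode(), workers);
    }

    /** Groups record keys so each worker's batch touches a disjoint key set. */
    public static Map<Integer, List<String>> partition(List<String> keys, int workers) {
        return keys.stream().collect(Collectors.groupingBy(k -> partitionFor(k, workers)));
    }
}
```

Combined with batched writes (e.g. JDBC's `addBatch()`/`executeBatch()`), this tends to move the bottleneck back to raw database throughput rather than lock waits.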