Spring Batch – Usefulness of Transactions in Batch Processing


I understand what a transaction is in a web application: you have groups of database interactions that have to succeed or fail together, so that the database is always in a coherent state.

But why is a framework like Spring Batch built around transactions? My group of committed records is not a logical group (its size is set with the commit-interval property): it's not a problem if one record fails and the others succeed. And conversely, if my commit-interval is 100, what's the point of rolling back 100 independent operations when one fails?
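For the record, a step like the ones I mean looks roughly like this in Java config, where chunk(100) is the equivalent of commit-interval="100" in XML; the reader, writer, and the Fragment item type are just placeholders:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class StepConfig {

    // Placeholder item type for the XML fragments being inserted.
    public static class Fragment {}

    @Bean
    public Step insertStep(StepBuilderFactory steps,
                           ItemReader<Fragment> reader,
                           ItemWriter<Fragment> writer) {
        return steps.get("insertStep")
                // chunk(100) == commit-interval="100": one transaction per 100 items
                .<Fragment, Fragment>chunk(100)
                .reader(reader)
                .writer(writer)
                .build();
    }
}
```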

If my question is not clear enough, let's take an example: I have a job with a few steps, and each main step parses some XML files and inserts the fragments into the database. What would happen if all these steps ran without transactions, with any failed read/process/write resulting in a caught exception that just produces some logs? What am I losing by doing so?

The best answer I've found so far is: batch transactions are not logical transactions but a matter of performance, and you can't process items in chunks without transactions. Is that true?

I understand how to set up transactions and I've already written a few jobs; my question is not about "how" but about "why".

Best Answer

First, transactional databases typically do not allow you to write data "without transactions": any kind of write access is enclosed in a transaction. From the database's point of view, a write operation is just a write operation whose data integrity has to be ensured; whether it comes from a batch process, an OLTP process, or a mixture of both does not matter.
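A minimal JDBC sketch of that point, assuming an H2 in-memory database as a stand-in: even without any explicit transaction demarcation, the driver runs each statement in its own implicit transaction and commits it immediately (auto-commit).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AutoCommitDemo {
    public static void main(String[] args) throws Exception {
        // H2 in-memory DB is just an example; any JDBC driver behaves the same way.
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo", "sa", "");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE item(id BIGINT PRIMARY KEY, payload VARCHAR(100))");
            // Auto-commit is true by default: this INSERT runs inside an
            // implicit transaction that the driver commits as soon as it returns.
            st.executeUpdate("INSERT INTO item VALUES (1, 'no explicit transaction, still transactional')");
            System.out.println("auto-commit = " + con.getAutoCommit()); // prints true
        }
    }
}
```

Grouping several writes into one transaction then simply means switching auto-commit off and issuing an explicit commit, as the next example shows.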

Furthermore, it is not hard to imagine a batch process where you have to insert data into two tables with a master-child relationship between them. You want to make sure that when a master record is inserted, all of its related child records are inserted as well, and that when the insert of a child record fails, the master record is not inserted at all. So as opposed to the case where you insert data into only one table and each INSERT is an atomic operation, you now have to treat a group of INSERTs as one atomic operation.
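A sketch of that group-of-INSERTs-as-one-unit idea in plain JDBC; the table and column names are invented for the example:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

public class MasterChildInsert {

    // Inserts one master row and all of its child rows as a single atomic
    // unit: either everything is committed, or nothing is.
    static void insertAtomically(Connection con, long masterId, String name,
                                 Map<Long, String> children) throws SQLException {
        con.setAutoCommit(false); // take over transaction demarcation
        try {
            try (PreparedStatement master = con.prepareStatement(
                    "INSERT INTO master(id, name) VALUES (?, ?)")) {
                master.setLong(1, masterId);
                master.setString(2, name);
                master.executeUpdate();
            }
            try (PreparedStatement child = con.prepareStatement(
                    "INSERT INTO child(id, master_id, value) VALUES (?, ?, ?)")) {
                for (Map.Entry<Long, String> c : children.entrySet()) {
                    child.setLong(1, c.getKey());
                    child.setLong(2, masterId);
                    child.setString(3, c.getValue());
                    child.executeUpdate();
                }
            }
            con.commit(); // master and children become visible together
        } catch (SQLException e) {
            con.rollback(); // a failed child INSERT also undoes the master
            throw e;
        }
    }
}
```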

For such a process, you can still insert 100 master records, together with all of their related child records, within the same transaction. But you definitely have to avoid placing the COMMIT at the wrong place, between the INSERT of a master record and the INSERTs of its related child records. Your COMMITs have to respect the logical relationships in your data.
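In Spring Batch terms, you get this for free when each item handed to the writer is a whole master object carrying its children, because the chunk transaction only ever commits after complete items have been written. A sketch, using the pre-5.0 ItemWriter signature; the Order/OrderLine types and table names are invented for the example:

```java
import java.util.List;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.JdbcTemplate;

// Each item is a complete master object carrying its children, so the
// chunk transaction can never commit between a master and its children.
public class OrderWriter implements ItemWriter<Order> {

    private final JdbcTemplate jdbc;

    public OrderWriter(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Override
    public void write(List<? extends Order> orders) {
        // Runs inside the chunk's transaction; Spring Batch commits only
        // after this method has returned for the whole chunk.
        for (Order order : orders) {
            jdbc.update("INSERT INTO orders(id, customer) VALUES (?, ?)",
                    order.id(), order.customer());
            for (OrderLine line : order.lines()) {
                jdbc.update("INSERT INTO order_line(order_id, sku, qty) VALUES (?, ?, ?)",
                        order.id(), line.sku(), line.qty());
            }
        }
    }
}

// Invented domain types for the sketch.
record OrderLine(String sku, int qty) {}
record Order(long id, String customer, List<OrderLine> lines) {}
```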

Another scenario: if your batch process needs to produce a detailed log of which INSERTs failed and which succeeded, you may have to enclose each logically atomic group of inserts in a transaction of its own. That may be true even if you only have a single table to fill. You can optimize this by first trying to insert your data in chunks of 100 and, if a chunk fails, retrying its records one by one; but at least for that second step, you need to cut your transactions along logically correct boundaries, as in the sketch below.
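A plain-JDBC sketch of that chunk-then-retry pattern; the table name is invented:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ChunkedInserter {

    // Fast path: one transaction per chunk of records. Slow path, on any
    // failure: roll back and retry record by record, one transaction each,
    // so every rejected record can be logged without losing the good ones.
    static void insertChunk(Connection con, List<String> payloads) throws SQLException {
        con.setAutoCommit(false);
        try {
            for (String p : payloads) {
                insert(con, p);
            }
            con.commit(); // one COMMIT for the whole chunk
        } catch (SQLException chunkFailure) {
            con.rollback(); // discard the partially written chunk
            for (String p : payloads) {
                try {
                    insert(con, p);
                    con.commit(); // one COMMIT per record
                } catch (SQLException e) {
                    con.rollback();
                    System.err.println("Rejected record: " + p + " (" + e.getMessage() + ")");
                }
            }
        }
    }

    private static void insert(Connection con, String payload) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO item(payload) VALUES (?)")) {
            ps.setString(1, payload);
            ps.executeUpdate();
        }
    }
}
```

Spring Batch's fault-tolerant chunk processing follows the same pattern: when an item in a chunk throws a skippable exception, the chunk's transaction is rolled back and the items are re-processed one at a time, each in its own transaction, to isolate and log the offending item.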
