Java REST API – How to Check for Duplicates Before Inserting Records

database, iterator, java, rest, validation

I am developing an application in Java to parse and upload records from a CSV to an online database, via a REST API.

While I know for sure that there are no duplicate records in each CSV file, I cannot be sure that each CSV file has only been processed once (*see clarification below), so I need to check for duplicates before inserting.

[CLARIFICATION] I cannot implement a solution by checking that each CSV file has only been processed once. The CSV files contain bank transaction records downloaded from a bank. Therefore I know that each individual CSV file does not contain duplicates. However, multiple CSV files could be downloaded for the same date range, or for overlapping date ranges, etc. – so I need to check for duplicates at the transaction level rather than the file level.

Unfortunately I have no control over the back-end database, and I can only use the methods available via the API. This means the usual solutions using SQL (e.g. this question) are not suitable.

Methods available from API (http://help.moneytrackin.com/index.php/REST_API):

  • listTransactions

  • editTransaction

  • insertTransaction

Methods available but probably not relevant:

  • listProjects

  • listWriteProjects

  • getBalance

  • getTags

  • newProject

  • deleteProject

  • listTagTransactions

  • deleteTransaction

  • listCurrencies

  • userData

It is not a huge database: only four columns, and a few thousand records.

It seems like my only option is to iterate over each record to be inserted and compare it to each record from the database:

get ListOfRecordsInDb from database using listRecords(). Store in HashMap,
    local database or similar data structure??

for each record to be inserted,
    iterate over ListOfRecordsInDb, checking none of them match 
       the record to be inserted
    if no match found, insert record
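
For concreteness, here is a rough Java sketch of that approach. Transaction, fetchAllTransactions(), parseCsv() and insertViaApi() are placeholders for my own parsing and API-wrapper code, not real API calls:

// Rough sketch of the naive approach above; Transaction, fetchAllTransactions(),
// parseCsv() and insertViaApi() are placeholders for my own parsing and API-wrapper code.
List<Transaction> existing = fetchAllTransactions(); // parsed from the listTransactions XML
List<Transaction> toInsert = parseCsv(csvFile);      // records read from the CSV file

for (Transaction candidate : toInsert) {
    boolean alreadyInDb = false;
    for (Transaction inDb : existing) {
        if (inDb.equals(candidate)) {   // relies on a sensible equals() on Transaction
            alreadyInDb = true;
            break;
        }
    }
    if (!alreadyInDb) {
        insertViaApi(candidate);        // calls insertTransaction
    }
}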

This seems very inefficient. Are there any other options? If not, what is the most efficient way to compare thousands of records, using Java?


Answers to comments/questions:

What happens if you call insertTransaction with a transaction that
already exists? Does it duplicate it or does it fail?

The transaction is successfully inserted as a duplicate

Does the CSV file have an "id" column?

No. The available columns are Date, Description, Amount and Balance. The combination of these makes each record unique, so I could potentially create an ID based on these.
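
For example, a synthetic ID could be derived by hashing a concatenation of the four fields. This is just a sketch, treating all four fields as plain strings:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: derive a synthetic ID from the four fields whose combination
// makes a record unique (all treated as plain strings here).
static String syntheticId(String date, String description, String amount, String balance)
        throws NoSuchAlgorithmException {
    String key = date + "|" + description + "|" + amount + "|" + balance;
    byte[] digest = MessageDigest.getInstance("SHA-256")
                                 .digest(key.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
        hex.append(String.format("%02x", b));
    }
    return hex.toString();
}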

Does listRecords() allow pagination, or can it only return all of the
records?

It can only return all of the records at once, in XML format.

Best Answer

I cannot be sure that each CSV file has only been processed once...

You might want to start by tackling this. If I am reading this right, the crux of your problem doesn't appear to be individual duplicate transactions (since you mentioned "I know for sure that there are no duplicate records in each CSV file"), but preventing duplicate processing of each file.

Hence, you could add some state logic to your Java application that tracks whether a file has been processed, by computing and storing each file's checksum, e.g. its MD5 hash. Once you see a matching checksum, there's a good chance the file has been processed before; you can then perform further verification, such as comparing the number of lines or other identifying characteristics of the file.
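
A minimal sketch of that idea, assuming the CSV files are accessible locally and that the set of checksums is persisted somewhere between runs:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Minimal sketch: remember the MD5 of every CSV file already processed and
// skip any file whose checksum has been seen before. In practice the set
// would be persisted (file, embedded DB, ...) between runs.
class ProcessedFileTracker {
    private final Set<String> processedChecksums = new HashSet<>();

    String md5Of(Path file) throws IOException, NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Returns true if this file's checksum was already recorded.
    boolean alreadyProcessed(Path file) throws IOException, NoSuchAlgorithmException {
        return !processedChecksums.add(md5Of(file));
    }
}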

Extending this idea further: if the same transaction can appear across different CSV files, then your only other option, besides updating the database schema to handle duplicate records properly, is to store all the processed transactions locally within your Java application. If there can be multiple instances of your application (either on the same computer or across a network), then you'll need either yet another centralized database to handle this, or some distributed data grid... at which point the better option is still to go back to the drawing board and improve your existing database schema.

edit

To flip things around: if changing the database schema to handle duplicates nicely is entirely out of the question, the other things to evaluate are how much data your Java application will need to process at any given time, and how fast the connection is between the database and your application.

On the lower end, say your application is processing only 10 records per file, averaging one file an hour. The network connection is very good, say almost as good as accessing a locally-hosted database. In this case, I don't think there's much of a performance impact from having to query all the records.

On the extreme end, your application is expected to read thousand-line transaction files every 10 seconds, and the network connection is extremely bad, say taking a minute to query all the records. In that case you are far more concerned with processing the files quickly, and that gives you a stronger case for proposing a change to the database schema. :)

So, assuming all is fine in the lower-end case, what would be an efficient way of comparing a relatively large data set with a smaller input set for duplicates? I'd suggest unmarshalling the XML payload you get into a HashSet. I also hope you have a Transaction domain class with properly implemented hashCode() and equals() methods. A potential Java 8 solution would then be:

// assuming your database records are unmarshalled into currentSet, a HashSet<Transaction>
inputSet.stream()
        .filter(tx -> !currentSet.contains(tx))
        .forEach(tx -> { /* these are the new records to send to the database */ });
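
If it helps, a bare-bones Transaction class along those lines (field names assumed from the question: date, description, amount, balance, all kept as strings for simplicity) could look like this:

import java.util.Objects;

// Sketch of a Transaction domain class; field names are assumed from the
// question (Date, Description, Amount, Balance). equals()/hashCode() use
// all four fields, since their combination makes a record unique.
final class Transaction {
    private final String date;
    private final String description;
    private final String amount;
    private final String balance;

    Transaction(String date, String description, String amount, String balance) {
        this.date = date;
        this.description = description;
        this.amount = amount;
        this.balance = balance;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Transaction)) return false;
        Transaction other = (Transaction) o;
        return date.equals(other.date)
                && description.equals(other.description)
                && amount.equals(other.amount)
                && balance.equals(other.balance);
    }

    @Override
    public int hashCode() {
        return Objects.hash(date, description, amount, balance);
    }
}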

Also, the elephant in the room: concurrent insertions. Will there be any? If so, how do you intend to handle them?