How to Keep Data in Sync with 3rd Party API – Architecture Guide

api, architecture, websites

I'm currently facing a difficult problem regarding re-architecting a legacy webshop, and I would really appreciate some insight.

The system depends on a 3rd party API for its product/variant data; the API returns a JSON "page" containing an array of products.

Currently, to keep everything in sync, the process is to query this API once in a while to get all the products for a supplier and then update a local MySQL database.
I cannot just diff the changes made to a product, since the API provides no reliable information about product updates.

Edit: let's say I update the data twice a week. In order to check whether the data changed, I need to query my database for each product and then compare field by field. I tried it and it takes forever. The API has a last_updated_at field on every record, but it doesn't reflect when a field changed, only that a new indexation occurred for the whole supplier. Unfortunately, more often than not, this indexation occurs even though no information has actually changed.

This process is written in procedural PHP and is quite heavy, since it depends on json_decode and some MySQL inserts/updates inside foreach loops (one for each level of the JSON). Some of the documents can easily weigh more than 20 MB, a supplier can have 1000+ products / 10000+ variants, and we are syncing 50+ suppliers.

I would like to move away from this model toward some kind of continuous update process.

I don't think I strictly need MySQL for storing products, since I have to re-normalize the data to fit it anyway. At first I was thinking about fetching a single product as the front-end requests it and caching the response, then using MySQL only for the order/quote part. Unfortunately I can't do something like GET /product/ID; the API only accepts POST requests with a request body, so caching seems more complicated.

Anyway, to cut it short: every time I think of something, I am stopped in my tracks by some hurdle that makes me question my approach. I seem to go round in circles without making any headway.

What would you do in my position?

Best Answer

There are a few different concerns here. First, with the current model, you might be able to work around your current issues using a hash function. You take the data you get from the supplier as a whole (i.e. the original JSON, or whatever) and run something like MD5 or SHA-1* on it. You then store this hash on the record along with the data.

Now, when you check the supplier's site again, you calculate the hash. If it matches, the data is the same as before and you don't need to change the DB. If there are fields in the supplier response like unique codes or timestamps, you will need to remove or mask them before doing the hash calculation.
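As a rough sketch of what that could look like in PHP (stripVolatileFields, upsertProduct, the product_hashes table, and the $pdo/$apiResponseBody variables are all illustrative assumptions, not part of your existing system):

```php
<?php
// Hash-based change detection: a sketch, not the existing sync code.

// Drop fields that change on every indexation (e.g. last_updated_at)
// so they don't invalidate the hash.
function stripVolatileFields(array $product): array
{
    unset($product['last_updated_at']);
    return $product;
}

function productHash(array $product): string
{
    // Assumes the API returns keys in a stable order; if not,
    // sort keys recursively before encoding.
    return md5(json_encode(stripVolatileFields($product)));
}

$page = json_decode($apiResponseBody, true);

foreach ($page['products'] as $product) {
    $hash = productHash($product);

    // Assumed schema: product_hashes(product_id PRIMARY KEY, hash)
    $stmt = $pdo->prepare('SELECT hash FROM product_hashes WHERE product_id = ?');
    $stmt->execute([$product['id']]);

    if ($stmt->fetchColumn() === $hash) {
        continue; // unchanged: skip the expensive insert/update path
    }

    upsertProduct($pdo, $product); // your existing per-product write logic
    $pdo->prepare('REPLACE INTO product_hashes (product_id, hash) VALUES (?, ?)')
        ->execute([$product['id'], $hash]);
}
```

With this in place, the field-by-field comparison from your edit disappears entirely: you only touch the product tables when the hash actually differs.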

For the redesign

There are many caching solutions that don't require using GETs, but I think you wish to use some built-in functionality of a web server, perhaps? If that's necessary, you could build a proxy web API that accepts GET requests and makes the POST call to the external server. This layer would also be a good place to implement your cache. You could even use this layer directly in your front-end interface; just don't rely on the client side to relay the price to your order-processing layer.
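Here is a minimal sketch of such a proxy endpoint, assuming a hypothetical external URL and request body shape, with a naive file-based cache standing in for whatever cache you end up choosing:

```php
<?php
// GET-to-POST proxy with a simple file cache.
// The external URL, body shape, and TTL below are illustrative assumptions.

$productId = $_GET['id'] ?? null;
if ($productId === null) {
    http_response_code(400);
    exit('Missing id');
}

$cacheFile = sys_get_temp_dir() . '/product_' . md5($productId) . '.json';
$ttl = 3600; // serve cached responses for up to one hour

// Serve from cache while it is fresh; GET semantics make this safe.
if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
    header('Content-Type: application/json');
    readfile($cacheFile);
    exit;
}

// Otherwise perform the POST call the 3rd party API requires.
$ch = curl_init('https://api.example-supplier.com/products/search');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => json_encode(['product_id' => $productId]),
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_RETURNTRANSFER => true,
]);
$response = curl_exec($ch);
curl_close($ch);

file_put_contents($cacheFile, $response);
header('Content-Type: application/json');
echo $response;
```

The point is only the shape: the front-end (or a standard HTTP cache in front of this endpoint) sees a cacheable GET on /product?id=..., while the POST detail stays hidden behind the proxy.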

*MD5 and SHA-1 should not be used for anything related to security, but for something like this they are perfectly adequate.