Is it possible to protect a JSON API from scraping using random tokens?

Tags: json, security

I have built a RESTful JSON API for an online store using Laravel.

I now wish to create an AngularJS app to run the front-end web application. Product prices in my store need to update every second, so Angular needs to fetch the product JSON from the server once per second and refresh the HTML.
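For context, the polling loop itself is trivial. Here is a minimal sketch in plain TypeScript rather than AngularJS; the `/api/products` endpoint and the `render` callback are placeholder names, not part of my actual app:

```typescript
// Minimal polling loop in plain TypeScript (not AngularJS).
// "/api/products" and render() are hypothetical placeholders.
function pollPrices(render: (products: unknown) => void): void {
  setInterval(async () => {
    try {
      const res = await fetch("/api/products");
      if (res.ok) {
        render(await res.json()); // hand the fresh JSON to the view layer
      }
    } catch {
      // Network hiccup: skip this tick and retry on the next one.
    }
  }, 1000); // once per second, as described above
}
```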

I want to somehow protect this publicly accessible JSON data from scrapers and bots repeatedly hitting my API to harvest my product and pricing data.

My thoughts:

Each JSON API response contains a random use-once token (I am aware of the need to prevent collisions).

This random token is injected via AngularJS into a hidden field within the HTML page. AngularJS reads the hidden token before re-requesting the API and sends it in the header of the next GET request.

A JSON response is only provided when the correct token is present in the header. Once a token is used, it expires; any attempt to reuse an expired token results in that IP address being locked out for a period of time, or flagged for an administrator to investigate.
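A rough sketch of the server side of this scheme, in framework-agnostic TypeScript rather than Laravel. The in-memory store stands in for a real cache, and the names (`issueToken`, `validateToken`) and the lockout duration are purely illustrative:

```typescript
import { randomBytes } from "node:crypto";

// In-memory stand-ins for a real cache (e.g. Redis); all names illustrative.
const liveTokens = new Set<string>();        // issued but not yet spent
const lockedIps = new Map<string, number>(); // ip -> lockout expiry (ms epoch)
const LOCKOUT_MS = 15 * 60 * 1000;           // example: 15-minute lockout

// Mint a fresh use-once token to embed in each JSON response.
function issueToken(): string {
  const token = randomBytes(32).toString("hex"); // 256 bits: collisions are a non-issue
  liveTokens.add(token);
  return token;
}

// Check the token sent in the request header. A valid token is consumed on
// first use; reuse or forgery locks the caller's IP out, as described above.
function validateToken(token: string, ip: string): boolean {
  if (Date.now() < (lockedIps.get(ip) ?? 0)) return false; // still locked out
  if (liveTokens.delete(token)) return true;               // valid: consume and allow
  lockedIps.set(ip, Date.now() + LOCKOUT_MS);              // expired or bogus token
  return false;
}
```

In production the token set would also need a time-based expiry so abandoned tokens do not accumulate.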

Is there a better way of doing the above? Are there any existing solutions to this problem? Considering that I am paying for the product pricing feed in the first place, I really don't want some schmuck getting it from me for free just because I exposed an API!

Whilst I understand that there is probably no way to be 100% certain that nobody can scrape my JSON, there must be something relatively simple I can do to protect my feed from dumb bots / scrapers that simply use curl to steal data.

Best Answer

The short answer is: there is no adequate way to fully protect publicly accessible data from copying.

Cheap, lossless copying is one of the main advantages of digital data. Anyone who can access your data is able to copy it, and it is not easy to take that ability away.

The nonce trick you suggest makes scraping a little more complex, but not impossible, and not even much harder. A bot can evaluate anything a web browser can evaluate, and in fact modern bots usually do: they run a headless web browser (like PhantomJS) and "see" pages exactly as a user sees them. The most advanced bots emulate mouse clicks and randomize delays between actions, making them very hard to distinguish from humans.
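To make that concrete, here is a hedged sketch using Puppeteer (a maintained headless-browser library; PhantomJS itself is discontinued). The store URL and the `.price` selector are assumptions:

```typescript
import puppeteer from "puppeteer";

// Scrape by driving the real front-end: the browser runs the AngularJS app,
// so every use-once token is fetched and spent exactly as intended.
async function scrape(): Promise<string[]> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://shop.example.com"); // hypothetical store URL

  // Read whatever the app rendered; ".price" is an assumed selector.
  const prices = await page.evaluate(() =>
    Array.from(document.querySelectorAll(".price")).map(el => el.textContent ?? "")
  );

  await browser.close();
  return prices;
}
```

Note that the nonce scheme costs this scraper nothing: the real front-end code fetches and spends each token exactly as a legitimate browser would.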

And if your data is really public (there is no authentication) then there is no strong way to protect it from bots, although you can make their life a little harder. It is always a confrontation between the shop's owner and the bot's owner: the shop's owner tries to make the bot more complex and expensive by making the data harder to extract, while the bot's developer tries to make the bot more sophisticated. This continues until the bot or the protection mechanism becomes too costly for someone's business.

You can use several tricks: data obfuscation, CAPTCHAs, nonces, and heuristics to detect human activity. These will filter out most mass spiders that were not developed specifically for your web site. But if someone targets your shop and develops a scraper specifically for it, it is likely that you cannot protect yourself from them.
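As one concrete example of such a heuristic, a simple fixed-window per-IP rate limiter might look like the sketch below; the window and threshold values are picked arbitrarily:

```typescript
// Fixed-window per-IP rate limiter; thresholds are arbitrary examples.
const WINDOW_MS = 60 * 1000; // 1-minute window
const MAX_HITS = 120;        // ~2 requests/second, matching the 1s polling

const hits = new Map<string, { windowStart: number; count: number }>();

function allowRequest(ip: string): boolean {
  const now = Date.now();
  const entry = hits.get(ip);

  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    hits.set(ip, { windowStart: now, count: 1 }); // start a fresh window
    return true;
  }

  entry.count += 1;
  return entry.count <= MAX_HITS; // reject once the window budget is spent
}
```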

So I think you should go to the light side and minimize costs by making your JSON API as simple and straightforward as possible.