Magento – Scraping data from Magento without privileged access or trust

api, extensions

There appear to be a number of ways of scraping product data from a Magento site, but all seem to have their upsides and downsides.

We deal with sites that have little to no technical resource but have given us permission to scrape their product catalog. There appear to be three ways of doing this, none of which really works:

  • Manual web scraping – developer intensive, requires updating when the theme changes.
  • Magento Web API – requires setting up an API user, too technical for many users.
  • Magento Plugin – too technical for many users, exposes sensitive business data so many companies won't do this.

Are we missing something? Is there a better alternative, or are there ways of changing any of the above 3 to be better for scraping?

For example, is it possible to provide a link to a 'one-click setup' process for API access? Shopify does this nicely using OAuth and permission scopes, so we can give our partners a link that grants us read-only access to just their product catalog, in a way that non-technical users can manage.

Best Answer

I'm not sure why a Magento plugin in and of itself would be too technical, especially if the client is instructed to install it via Magento Connect. The plugin could build an accessible XML feed, which you could then retrieve via HTTP without worrying about a changing theme layer.
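
To give a rough idea of the consuming side of such a feed, here is a minimal sketch in Python. The feed layout (`<products>`, `<product>`, `<sku>`, `<name>`, `<price>`) and the function names are assumptions made up for illustration; a real plugin would define its own structure and URL.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical feed layout: a real plugin would define its own element names.
SAMPLE_FEED = """<products>
  <product>
    <sku>ABC-1</sku>
    <name>Blue Mug</name>
    <price>9.50</price>
  </product>
  <product>
    <sku>ABC-2</sku>
    <name>Red Mug</name>
    <price>11.00</price>
  </product>
</products>
"""


def fetch_feed(url):
    """Download the raw feed over HTTP (the URL is whatever the plugin exposes)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def parse_feed(xml_text):
    """Turn the feed (str or bytes) into a list of dicts, one per <product>."""
    root = ET.fromstring(xml_text)
    products = []
    for node in root.iter("product"):
        products.append({
            "sku": node.findtext("sku", default="").strip(),
            "name": node.findtext("name", default="").strip(),
            "price": float(node.findtext("price", default="0")),
        })
    return products
```

Something like `parse_feed(fetch_feed(feed_url))` would then give you the catalog as plain dictionaries, independent of whatever theme the store is running.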

I don't think this is the one-click answer you're looking for, but an alternative solution could be to have clients upload a custom script that you provide.

That script could be run via cron, and would perform periodic dumps of specified DB tables (i.e. no tables which contain 'sensitive business data').

Each dump could be retrieved via SSH/SFTP if you have that access, or from a public-facing folder or by email if not. Setting up a cron task via cPanel would be pretty easy for the average user.
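
As a sketch of what that script might look like, the Python below builds a `mysqldump` command restricted to a whitelist of catalog tables. The table names are the ones a stock Magento 1 schema typically uses and are an assumption here (a store may use a table prefix or a different version); the database name, credentials file and output path are hypothetical.

```python
import subprocess

# Assumed catalog tables from a stock Magento 1 schema; adjust per store.
# Anything not listed here (orders, customers, ...) is refused.
ALLOWED_TABLES = [
    "catalog_product_entity",
    "catalog_product_entity_varchar",
    "catalog_product_entity_decimal",
    "catalog_product_entity_text",
    "catalog_category_product",
]


def build_dump_command(database, tables, defaults_file):
    """Return the mysqldump argument list, rejecting non-whitelisted tables."""
    for table in tables:
        if table not in ALLOWED_TABLES:
            raise ValueError("table not whitelisted: " + table)
    return [
        "mysqldump",
        # Must be the first option; keeps the password off the command line.
        "--defaults-extra-file=" + defaults_file,
        "--single-transaction",
        database,
        *tables,
    ]


def dump_tables(database, defaults_file, out_path):
    """Run the dump and write the SQL to out_path."""
    command = build_dump_command(database, ALLOWED_TABLES, defaults_file)
    with open(out_path, "wb") as out_file:
        subprocess.run(command, stdout=out_file, check=True)
```

The client would then schedule a call to `dump_tables(...)` from cron, for example with an entry like `0 3 * * * /usr/bin/python3 /home/user/dump_catalog.py` (path hypothetical), and you would pick up the resulting file from wherever it is written.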

That would give you the most complete dataset, although not without its glaring downsides.

As a side note, an XPath parser is an elegant tool for web scraping, and could be implemented in a way that is mostly theme-agnostic if it comes to that.
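
One way to get closer to theme-agnostic, sketched below, is to query semantic attributes rather than theme-specific classes. This assumes the store's theme emits schema.org microdata (`itemprop` attributes), which not every theme does. The sample uses Python's standard-library ElementTree, which only supports a subset of XPath and only parses well-formed markup; for real-world HTML you would likely need a more lenient parser with full XPath support, such as lxml.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page. The wrapper classes stand in for
# theme-specific markup that the queries below deliberately ignore.
SAMPLE_PAGE = """<html><body>
<div class="theme-specific-wrapper">
  <div itemscope="itemscope" itemtype="http://schema.org/Product">
    <h1 itemprop="name">Blue Mug</h1>
    <span class="price-box"><span itemprop="price">9.50</span></span>
    <meta itemprop="sku" content="ABC-1" />
  </div>
</div>
</body></html>"""


def extract_product(page_text):
    """Collect name, price and sku by itemprop, wherever they sit in the page."""
    root = ET.fromstring(page_text)
    product = {}
    for prop in ("name", "price", "sku"):
        # Match any element carrying this itemprop, regardless of tag or class.
        node = root.find(".//*[@itemprop='" + prop + "']")
        if node is None:
            continue
        # <meta> tags carry the value in "content"; other tags carry it as text.
        value = node.get("content") or (node.text or "")
        product[prop] = value.strip()
    return product
```

Because the queries never mention a tag name or CSS class, a theme change that keeps the microdata intact should not break the extraction.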
