Mocking a Site for Scraper Testing – Preferred Approach

acceptance-testing, web-scraping

As per the title. At the moment I'm using Selenium and Python, but the same applies to any other scraping solution.

I'm wondering:

  1. which of the options outlined below are optimal/recommended/best practices
  2. if there are existing solutions/helper libraries, and which keywords to look them up by.

To stay objective, "optimal/recommended/best practices" means "widely used and/or promoted/endorsed by high-profile projects in the niche."

I couldn't find any Selenium-related or general-purpose material on this topic, despite spending about a day of net time searching around, which probably means I'm lacking some critical piece(s) of information.


The basic operations when scraping are (see the sketch after this list):

  • searching for an element (by CSS selector/XPath, or by hand for things those aren't capable of)
  • interacting with an element (inputting text, clicking)
  • reading element data
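
For concreteness, here is a minimal sketch of those three operations with Selenium 4 in Python; the URL and selectors are hypothetical placeholders:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("https://example.com/search")

    # 1. Search for an element (CSS selector here; XPath works the same way)
    box = driver.find_element(By.CSS_SELECTOR, "input[name='q']")

    # 2. Interact with it: input text, then click the submit button
    box.send_keys("test query")
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

    # 3. Read element data
    first_result = driver.find_element(By.CSS_SELECTOR, ".result a")
    print(first_result.text, first_result.get_attribute("href"))
finally:
    driver.quit()
```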

And the call chain goes like this:

(Test code ->) User code -> Framework (selenium) -> Browser (web driver) -> Site


So, there are 3 hops here that I could mock. Each one poses challenges:

  • Mock the site: launch a local HTTP server and direct the browser there
    • I'd have to reimplement the scraped site's interface in web technologies
  • Mock the browser (e.g. populate HtmlUnit, an in-process browser engine, with predefined HTML at appropriate moments)
    • Much simpler, but I'd still need to emulate state transitions/reactions to actions somehow
  • Mock the framework calls (see the sketch after this list)
    • The truest to the unit-testing philosophy, and the least work
    • However, I'm worried that it's too restrictive: I can find the same element by various means, whereas a mock object can only accept one very specific course of action, since it lacks the sophistication to check whether, e.g., some other selector would produce the same result.
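
To illustrate that last option, a sketch of mocking the framework calls with unittest.mock; scrape_title stands in for a hypothetical piece of user code:

```python
from unittest.mock import MagicMock
from selenium.webdriver.common.by import By

def scrape_title(driver):
    """Hypothetical user code under test."""
    driver.get("https://example.com/item/42")
    return driver.find_element(By.CSS_SELECTOR, "h1.title").text

def test_scrape_title():
    driver = MagicMock()
    driver.find_element.return_value.text = "Expected Title"

    assert scrape_title(driver) == "Expected Title"
    # This is the restriction described above: the test pins down one exact
    # lookup; an equivalent XPath query would need its own stub and assertion.
    driver.find_element.assert_called_once_with(By.CSS_SELECTOR, "h1.title")
```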

There are also two options for what content to provide: either

  • provide the site's original content as produced for a test query, compiling it into some sort of self-contained package
    • labor-intensive and error-prone; or
  • provide the bare minimum to satisfy the tested algorithm (see the sketch after this list)
    • much simpler, but it would fail for other possible algorithms that would succeed against the real site
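
For the bare-minimum option, the fixture can be as small as the one selector the algorithm actually uses. A sketch, with a hypothetical .price selector; the file:// URL lets Selenium load it without any server:

```python
import pathlib
import tempfile

# Only what a scraper doing find_element(By.CSS_SELECTOR, ".price") needs;
# a scraper that locates the price via its surrounding table would fail here
# even though it works on the real site.
MINIMAL_PAGE = "<html><body><span class='price'>19.99</span></body></html>"

def minimal_fixture_url() -> str:
    """Write the fixture to a temp file and return a file:// URL for driver.get()."""
    path = pathlib.Path(tempfile.mkdtemp()) / "item.html"
    path.write_text(MINIMAL_PAGE)
    return path.as_uri()
```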

One last concern is the fact that a site is effectively a state machine. I'm not sure which will be more useful:

  • implement the complete state machine, probably as some kind of specification, and set/check its states in the tests (see the sketch after this list)
    • very labor-intensive without some kind of library that reduces the work to writing a formal specification; or
  • simply validate the action sequences
    • which doesn't seem to actually test the code against anything: it merely reiterates what the code does
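
A sketch of the first option at its simplest: the specification is a transition table mapping (state, action) to (next state, page content). All states, actions, and markup here are hypothetical:

```python
# Transition table: (current state, action) -> (next state, resulting HTML).
TRANSITIONS = {
    ("search_page", "submit_query"): (
        "results_page",
        "<html><body><a class='result' href='/item/1'>item 1</a></body></html>",
    ),
    ("results_page", "open_first_result"): (
        "detail_page",
        "<html><body><h1 class='title'>item 1</h1></body></html>",
    ),
}

class FakeSite:
    """Serves predefined HTML and advances state in response to actions."""

    def __init__(self):
        self.state = "search_page"
        self.html = "<html><body><form id='search'></form></body></html>"

    def act(self, action):
        self.state, self.html = TRANSITIONS[(self.state, action)]
        return self.html
```

A test can then both drive the fake site and assert on its state attribute, which is the "set/check its states" part; an unexpected (state, action) pair raises KeyError, catching action sequences the real site would reject.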

Update to address an expressed concern:

I'm scraping a 3rd-party site — which can and will change without notice one day. So, I'm fine with testing against "the site's interface as it was at the time of writing" — to quickly check if a code change broke the scraper's internal logic.

Best Answer

You can go crazy mocking every detail, but that's probably not feasible once you have many complicated test cases.

Where possible, it is better to record a complete real-world test input, redact it to get rid of irrelevant details, and then run it through your complete scraping engine. How you represent the site and how you replay it depend on the fidelity you need for these tests. E.g., this might be very difficult if the site is expected to make Ajax requests to multiple domains.

For example, you might get away with simply storing the HTML of a page you want to scrape. In extreme cases, you would want to log and replay all HTTP requests the site makes, i.e. the site makes a request and you replay a response recorded from the live site.

In all cases, the thing you assert is that your scraper extracted the correct data from the page. How it gets there is going to be secondary.
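
A sketch of the lowest-fidelity version of this: serve snapshots recorded once from the live site over a local HTTP server, run the real scraper against them, and assert only on the extracted data. The snapshot directory, the extract_item scraper, and the expected values are all hypothetical:

```python
import threading
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

from selenium import webdriver
from selenium.webdriver.common.by import By

def extract_item(url):
    """The scraper under test; a hypothetical minimal Selenium example."""
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return {
            "title": driver.find_element(By.CSS_SELECTOR, "h1.title").text,
            "price": driver.find_element(By.CSS_SELECTOR, ".price").text,
        }
    finally:
        driver.quit()

def serve_snapshots(directory):
    """Serve recorded HTML files from `directory` on an ephemeral local port."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    server = HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}"

def test_extracts_item_from_snapshot():
    server, base_url = serve_snapshots("tests/snapshots")
    try:
        item = extract_item(f"{base_url}/item_42.html")
        # Assert on the extracted data, not on how the scraper found it.
        assert item == {"title": "Widget", "price": "19.99"}
    finally:
        server.shutdown()
```

For the high-fidelity end (logging and replaying all HTTP requests), useful search keywords are "HTTP record and replay" and "VCR cassettes"; in the Python ecosystem, libraries such as vcrpy and betamax do this for Python-level HTTP clients, though requests made by a real browser would need a proxy-level recorder instead.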

The advantage of these high-level testing methods is that

  • they are quite realistic,
  • and once the test suite is set up, adding another test case doesn't require much effort.

The disadvantage is that these tests are somewhat slow – much faster than doing actual requests to live sites, but still slower than targeted unit tests of your scraper.

Over time, you can grow a corpus of realistic test cases. If a test case stops being useful (e.g. because the live site changed), you can always throw it out.
