Web Data Parsing – How to Get Data from a Webpage Efficiently

dataparsing

Recently I've learned that using a regex to parse the HTML of a website to get the data you need isn't the best course of action.

So my question is simple: what, then, is the best, most efficient, and generally stable way to get this data?

I should note that:

  • There are no APIs
  • There is no other source where I can get the data from (no databases, feeds and such)
  • There is no access to the source files (the data comes from public websites)
  • Let's say the data is normal text, displayed in a table in an HTML page

I'm currently using Python for my project, but a language-independent solution or tips would be nice.

As a side question: How would you go about it when the webpage is constructed by Ajax calls?

EDIT:

In the case of HTML parsing, I know that there is no truly stable way to get the data. As soon as the page changes, your parser is done for. What I mean by stable in this case is an efficient way to parse the page that always hands me the same results (for the same set of data, obviously), provided that the page does not change.

Best Answer

Well, here are my 2 cents:

If there is no AJAX involved, or it can be worked around easily, 'fix' the HTML to XHTML (using HTML Tidy, for example), then use XPath instead of regular expressions to extract the information.
In a well-structured web page, the logically separate pieces of information sit in different <div>s, or whatever other tag, which means you can easily find the right information with a simple XPath expression. This is also great because you can test the expression in, say, Chrome's console or Firefox's developer console and verify that it works before writing a single line of other code.
This approach also has a very high signal-to-noise ratio, since the expressions that select the relevant information are usually one-liners. They are also much easier to read than regular expressions, and XPath was designed for exactly this purpose.
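
To make this concrete, here is a minimal sketch in Python of the XPath step, assuming the page has already been cleaned up into well-formed XHTML. The sample markup, the `prices` id and the `extract_rows` name are made up for illustration. It uses only the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath; a fuller XPath engine may be needed for more complex expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already tidied into well-formed XHTML.
# Real XHTML usually declares the XHTML namespace on <html>; in that case
# the tag names in the expression need a namespace prefix and a namespace map.
XHTML = """<html>
  <body>
    <div id="prices">
      <table>
        <tr><th>Item</th><th>Price</th></tr>
        <tr><td>Apple</td><td>1.20</td></tr>
        <tr><td>Pear</td><td>0.95</td></tr>
      </table>
    </div>
  </body>
</html>"""


def extract_rows(xhtml):
    """Return the data rows of the table inside <div id="prices">."""
    root = ET.fromstring(xhtml)
    rows = []
    # One expression selects every row of the table we care about.
    for tr in root.findall(".//div[@id='prices']/table/tr"):
        cells = [(td.text or "").strip() for td in tr.findall("td")]
        if cells:  # the header row has only <th> cells, so it is skipped
            rows.append(cells)
    return rows


rows = extract_rows(XHTML)
print(rows)
```

The whole selection logic lives in one expression, so if the page layout changes, that is usually the only line that needs updating.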

If there is AJAX and serious JavaScript involved in the page, embed a browser component in the application, use its DOM to trigger the events you need, and use XPath to extract the information. There are plenty of good embeddable browser components out there, most of which use real-world browsers under the hood. That is a good thing, since a web page might be invalid (X)HTML but still render well in all major browsers (actually, most pages eventually end up this way).
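
As a rough sketch of this second approach in Python: the code below uses Selenium WebDriver to drive a real browser. That is my substitution for an "embedded browser component", not something named above, and it assumes the third-party `selenium` package plus Firefox and its driver are installed. The `table_cells` and `scrape` names and the timeout value are made up for illustration.

```python
def table_cells(driver, xpath):
    """Return the stripped text of every element matching `xpath` in the live DOM."""
    # "xpath" is the locator-strategy string that Selenium exposes as By.XPATH.
    return [el.text.strip() for el in driver.find_elements("xpath", xpath)]


def scrape(url, xpath, timeout=10):
    """Load `url` in a real browser, wait for the content, then extract it."""
    # Imported lazily so that table_cells() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        # Wait until the script-generated content is actually present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, xpath))
        )
        return table_cells(driver, xpath)
    finally:
        driver.quit()  # always close the browser, even if the wait times out
```

Calling something like `scrape(url, "//table//td")` would then return the cell texts once the scripts have filled in the table. The explicit wait matters: reading the DOM right after the page loads may happen before the AJAX calls have finished.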
