Python – Extracting info from html using PHP(XPath), PHP/Python(Regexp) or Python(XPath)

Tags: html, PHP, python, regex, xpath

I have more than 40,000 HTML documents from which I need to extract information. I tried PHP with Tidy (because most files are not well-formed), DOMDocument, and XPath, but it is extremely slow. I have been advised to use regular expressions, but the HTML files are not marked up semantically (table-based layout, with meaningless tags and classes used everywhere), and I don't know where to start.

Out of curiosity, is using regular expressions (PHP or Python) faster than using Python's XPath library? And is the XPath library for Python generally faster than PHP's counterpart?

Best Answer

If speed is a requirement, have a look at lxml. lxml is a Pythonic binding for the libxml2 and libxslt C libraries. Using the C libraries is much faster than any pure PHP or Python implementation.
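As a rough illustration of that suggestion, here is a minimal sketch of parsing badly formed, table-based markup with lxml and querying it with XPath. The sample markup, the class names, and the extract_pairs helper are hypothetical and only stand in for the real documents. lxml is a third-party package (pip install lxml).

```python
from lxml import html  # third-party: pip install lxml

# Hypothetical sample: table layout, meaningless classes, unclosed <td>/<tr> tags.
SAMPLE = """
<table>
  <tr><td class="x1">Name<td class="x2">Widget
  <tr><td class="x1">Price<td class="x2">9.99
</table>
"""


def extract_pairs(markup):
    """Return a {label: value} dict built from two-cell table rows."""
    # lxml's HTML parser is lenient and repairs the unclosed tags itself,
    # so no separate Tidy pass is used in this sketch.
    tree = html.fromstring(markup)
    pairs = {}
    for row in tree.xpath("//tr"):
        # Select cells by position rather than by the meaningless class names.
        cells = [cell.text_content().strip() for cell in row.xpath("./td")]
        if len(cells) == 2:
            pairs[cells[0]] = cells[1]
    return pairs


result = extract_pairs(SAMPLE)
print(result)
```

For real files, the XPath expressions would have to be adapted to whatever positional structure the tables actually have; the point is only that parsing and querying both run inside the C libraries.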

There are some impressive benchmarks from Ian Bicking:

In Conclusion

I knew lxml was fast before I started these benchmarks, but I didn’t expect it to be quite this fast.

Parsing results: [chart from Ian Bicking's benchmark post; image not available]

Related Solutions

Python – How to execute a program or call a system command

Use the subprocess module in the standard library:

import subprocess
subprocess.run(["ls", "-l"])

The advantage of subprocess.run over os.system is that it is more flexible: you can capture stdout and stderr, get the "real" status code, and handle errors better.
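A minimal sketch of that flexibility, assuming Python 3.7 or later (capture_output and text were added in 3.7). The child command here is just the current Python interpreter printing a line, chosen as a portable stand-in for a command such as ls.

```python
import subprocess
import sys

# Run a child process and capture its output instead of letting it
# go straight to the terminal. sys.executable is the running interpreter.
completed = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    capture_output=True,  # collect stdout and stderr
    text=True,            # decode bytes to str
)

print(completed.returncode)      # exit status of the child process
print(completed.stdout.strip())  # captured standard output
print(repr(completed.stderr))    # captured standard error
```

Passing check=True as well would make subprocess.run raise subprocess.CalledProcessError on a non-zero exit status, which is one way to get the stricter error handling.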

Even the documentation for os.system recommends using subprocess instead:

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.

On Python 3.4 and earlier, use subprocess.call instead of .run:

subprocess.call(["ls", "-l"])

Html – Adding HTML entities using CSS content

You have to use the escaped Unicode code point:

.breadcrumbs a:before {
  content: '\0000a0';
}

More info: http://www.evotech.net/blog/2007/04/named-html-entities-in-numeric-order/
