How to prevent XML::XPath from fetching a DTD while processing an XML file

dtd perl xml

My XML (a.xhtml) starts like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...

My code starts like this:

use XML::XPath;
use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => "a.xhtml");
my $nodeset = $xp->find('/html/body//table');

It's very slow, and it turns out that it spends a lot of time getting the DTD (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd).

Is there a way to explicitly declare an HTTP proxy server for the Perl XML:: family of modules? I would rather not modify the original a.xhtml document, for example to make it point at a local copy of the DTD.

Best Answer

XML::XPath is based on XML::Parser. There is an option in XML::Parser to NOT use LWP to resolve external entities (such as DTDs), and XML::XPath lets you pass in an XML::Parser object to use as the parser.

So you can write this:

use XML::Parser;

my $p  = XML::Parser->new(NoLWP => 1);   # do not use LWP to resolve external entities
my $xp = XML::XPath->new(parser => $p, filename => "a.xhtml");

Note that in this case you will lose all entities except numeric ones and the five predefined ones (&gt;, &lt;, &amp;, &apos; and &quot;). The parser will not complain, but the others will disappear silently (try including &alpha; in the table and printing it, for example).

As a matter of fact you probably should not use XML::XPath, which is not actively maintained.

Try XML::LibXML if you have no problem installing libxml2. Its interface is very similar to that of XML::XPath, as both implement the DOM, and XML::LibXML is also much more powerful than XML::XPath, and faster to boot. If you want an expat/XML::Parser based module, then you might want to have a look at XML::Twig (that's blatant self-promotion as I am the author of the module, sorry). Also, for HTML or dodgy XHTML, you can use HTML::TreeBuilder, which, with the addition of HTML::TreeBuilder::XPath (also by me), supports XPath.
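For what it's worth, here is a minimal sketch of the XML::LibXML route. It is not taken from the question's real file: the inline document is a made-up stand-in for a.xhtml, and it assumes the XHTML declares the usual http://www.w3.org/1999/xhtml default namespace, which is why the XPath uses a prefix (x, an arbitrary choice) registered through XML::LibXML::XPathContext. The load_ext_dtd and no_network parser options are what keep the parser from going after the DTD.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical stand-in for a.xhtml, inlined so the sketch is self-contained.
my $xhtml = <<'XHTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>t</title></head>
  <body><div><table><tr><td>cell</td></tr></table></div></body>
</html>
XHTML

# load_ext_dtd => 0 : do not load the external DTD subset
# no_network   => 1 : forbid network access while parsing
my $doc = XML::LibXML->load_xml(
    string       => $xhtml,
    load_ext_dtd => 0,
    no_network   => 1,
);

# Assumption: the document is in the XHTML namespace, so the elements
# must be addressed through a registered prefix.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( x => 'http://www.w3.org/1999/xhtml' );

my @tables = $xpc->findnodes('/x:html/x:body//x:table');
print scalar(@tables), " table(s) found\n";
```

To run this against the real file, swap string => $xhtml for location => "a.xhtml". If the document has no xmlns declaration, the unprefixed path from the question (/html/body//table) should be used instead.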