Java – DTD download error while parsing XHTML document in XOM

dtdjavaxom

I am trying to parse an HTML document with the doctype declared to use
the transitional dtd as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

When I do Builder.build on the document, I get the following exception:

  java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
       at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1305)
       at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
       at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
       at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)
       at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
       at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
       at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
       at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
       at nu.xom.Builder.build(Builder.java:1127)
       at nu.xom.Builder.build(Builder.java:1019)

If I remove the doc type declaration, it parses just fine. I can
successfully download the dtd from my browser, which tells me that the
url is valid. I don't want to remove the doc type declaration. Is
there a way tell the builder not to download the dtd or provide it
with an alternate dtd?

Best Answer

This solves the problem:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document document = factory.newDocumentBuilder().parse(is);

Related Solutions

R – How to prevent XML::XPath from fetching a DTD while processing an XML file

XML::XPath is based on XML::Parser. There is an option in XML::Parser to NOT use LWP to resolve external entities (such as DTDs). And XML::XPath lets you pass an XML::Parser objetc, to use as the parser.

So you can write this:

my $p = XML::Parser->new( NoLWP => 1);
my $xp= XML::XPath->new( parser => $p, filename => "a.xhtml");

Note that in this case you will loose all entities except numerical ones and the default ones (>, <, &, ' and "). The parser will not complain, but they will disappear silently (try including α in the table and printing it for example).

As a matter of fact you probably should not use XML::XPath, which is not actively maintained.

Try XML::LibXML, if you have no problem with installing libxml2, its interface is very similar to XML::XPath as they both implement the DOM. XML::LibXML is also much more powerful than XML::XPath, and faster to boot. If you want an expat/XML::Parser based module, they you might want to have a look at XML::Twig (that's blatant self-promotion as I am the author of the module, sorry). Also for HTML/dodgy XHTML, you can use HTML::TreeBuilder, which, with the addition of HTML::TreeBuilder::XPath (also by me), supports XPath.

.net – An error has occurred opening extern DTD (w3.org, xhtml1-transitional.dtd). 503 Server Unavailable

The answer is, I have to provide my own XmlResolver. I don't think this is built-in to .NET 3.5. That's baffling. It's also baffling that it has taken me this long to stumble onto this problem. It's also baffling that I couldn't find someone else who solved this problem already?

Ok, so.. the XmlResolver. I created a new class, derived from XmlResolver and over-rode three key things: Credentials (set), ResolveUri and GetEntity.

public sealed class XhtmlResolver : XmlResolver
{
    public override System.Net.ICredentials Credentials
    {
        set { throw new NotSupportedException();}
    }

    public override object GetEntity(Uri absoluteUri, string role, Type t)
    {
       ...
    }

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
      ...
    }
}

The documentation on this stuff is pretty skimpy, so I'll tell you what I learned. The operation of this class is like so: the XmlReader will call ResolveUri first, then, given a resolved Uri, will then call GetEntity. That method is expected to return an object of Type t (passed as a param). I have only seen it request a System.IO.Stream.

My idea is to embed local copies of the DTD and its dependencies for XHTML1.0 into the assembly, using the csc.exe /resource option, and then retrieve the stream for that resouce.

private System.IO.Stream GetStreamForNamedResource(string resourceName)
{
    Assembly a = Assembly.GetExecutingAssembly();
    return  a.GetManifestResourceStream(resourceName);
}

Pretty simple. This gets called from GetEntity().

But I can improve on that. Instead of embedding the DTDs in plaintext, I gzipped them first. Then modify the above method like so:

private System.IO.Stream GetStreamForNamedResource(string resourceName)
{
    Assembly a = Assembly.GetExecutingAssembly();
    return  new System.IO.Compression.GZipStream(a.GetManifestResourceStream(resourceName), System.IO.Compression.CompressionMode.Decompress);
}

That code opens the stream for an embedded resource, and returns a GZipStream configured for decompression. The reader gets the plaintext DTD.

What I wanted to do is resolve only URIs for DTDs from Xhtml 1.0. So I wrote the ResolveUri and GetEntity to look for those specific DTDs, and respond affirmatively only for them.

For an XHTML document with the DTD statement, the flow is like this;

XmlReader calls ResolveUri with the public URI for the XHTML DTD, which is "-//W3C//DTD XHTML 1.0 Transitional//EN". If the XmlResolver can resolve, it should return... a valid URI. If it cannot resolve, it should throw. My implementation just throws for the public URI.
XmlReader then calls ResolveUri with the System Identifier for the DTD, which in this case is "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd". In this case, the XhtmlResolver returns a valid Uri.
XmlReader then calls GetEntity with that URI. XhtmlResolver grabs the embedded resource stream and returns it.

The same thing happens for the dependencies - xhtml_lat1.ent, and so on. In order for the resolver to work, all those things need to be embedded.

And yes, if the Resolver cannot resolve a URI, it is expected to throw an Exception. This isn't officially documented as far as I could see. It seems a bit surprising. (An egregious violation of the principle of least astonishment). If instead, ResolveUri returns null, the XmlReader will call GetEntity on the null URI, which .... ah, is hopeless.

This works for me. It should work for anyone who does XML processing on XHTML from .NET. If you want to use this in your own applications, grab the DLL. The zip includes full source code. Licensed under the MS Public License.

You can plug it into your XML apps that fiddle with XHTML. Use it like this:

// for an XmlDocument...
System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
doc.XmlResolver = new Ionic.Xml.XhtmlResolver();
doc.Load(xhtmlFile);

// for an XmlReader...
var xmlReaderSettings = new XmlReaderSettings
    {
        ProhibitDtd = false,
        XmlResolver = new XhtmlResolver()
    };
using (var stream = File.OpenRead(fileToRead))
{
    XmlReader reader = XmlReader.Create(stream, xmlReaderSettings);
    while (reader.Read())
    {
     ...
    }

Best Answer

Related Solutions

R – How to prevent XML::XPath from fetching a DTD while processing an XML file

.net – An error has occurred opening extern DTD (w3.org, xhtml1-transitional.dtd). 503 Server Unavailable

Related Topic