R – ArrayIndexOutOfBoundsException in xerces parsing

arrays, xerces, xml

I do not know where the problem is… Help and Thanks!

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8192

at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:543)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1742)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1619)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1657)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1740)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2930)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:807)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:277)
at myPackage.MainClass.main(MainClass.java:39)

In the main class, the code skeleton is as follows:

SAXParserFactory sf = SAXParserFactory.newInstance();
SAXParser sax = sf.newSAXParser();
sax.parse("english.xml", new DefaultElementHandler("page") {
    public void processElement(Element element) {
        // process the element
    }
});

The XML file is huge (4 GB) and full of text. I need to parse the file and process the text.

For now, the processing part does nothing; I just wanted to print the elements to the console. Then I got the out-of-bounds exception…

Best Answer

I know this post is ten years old, but I am answering this because this Stack Overflow post is the top result on Google and anyone else who runs into this might need a fix, as I did just today.

Yes, it is a bug in Xerces and as of March 2020 it is STILL NOT FIXED. It is relatively straightforward to work around, however.

The bug has nothing to do with the file size. Xerces has issues with certain 4-byte UTF-8 character sequences. It has been patched multiple times over the years (https://bugs.openjdk.java.net/browse/JDK-8080085).

Depending on your platform, your Java environment may assume a default encoding of UTF-16. When Xerces hits one of these 4-byte sequences on a UTF-16 platform, you get the exception trace shown.
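To check whether a given file contains any such sequences at all, a minimal sketch like the following could help. It assumes the file is valid UTF-8, where the bytes 0xF0 through 0xF4 only ever appear as the first byte of a 4-byte sequence. The class and method names are made up for illustration, and "english.xml" is just the file name from the question.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Xerces or the JDK.
public class FourByteUtf8Scanner {

    /**
     * Counts bytes in the range 0xF0..0xF4. In valid UTF-8 these are
     * exactly the lead bytes of 4-byte sequences (code points above U+FFFF).
     */
    public static long countFourByteLeads(InputStream in) throws IOException {
        long count = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b >= 0xF0 && b <= 0xF4) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file name used in the question; pass a path to override.
        String path = args.length > 0 ? args[0] : "english.xml";
        // Buffered so that a multi-gigabyte file is not read one syscall per byte.
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(path)))) {
            System.out.println("4-byte UTF-8 sequences found: " + countFourByteLeads(in));
        }
    }
}
```

If the count is zero, the file contains no 4-byte sequences, so the cause of the exception is probably something else.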

Fortunately, this is easy to fix. One easy fix the bug report suggests is to convert all 4-byte UTF-8 characters in the input file into numeric character entities. The other "more correct" way is to explicitly specify your encoding... even if it was already specified in your XML declaration, specify it anyway as part of your input stream.

For example, if you are accessing Xerces via SAX, do not call SAXParser.parse(filename, handler) the way most tutorials show. Instead, create your own InputStream and wrap it in an InputSource, like so:

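// Needs java.io.*, java.nio.charset.StandardCharsets and org.xml.sax.InputSource
final SAXParser saxParser = factory.newSAXParser();
// Decode the bytes ourselves so the parser never has to guess the encoding;
// try-with-resources closes the underlying file stream when parsing ends.
try (Reader reader = new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8)) {
    InputSource is = new InputSource(reader);
    is.setEncoding("UTF-8");
    saxParser.parse(is, handler);
}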
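For the first workaround (converting the 4-byte characters into numeric character entities), a rough sketch of a preprocessing step might look like this. The class name and the command-line arguments are hypothetical. It assumes the affected characters only occur in element text or attribute values: a reference such as &#x1F600; is not expanded inside CDATA sections or comments, so the replacement would change their content there.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical preprocessing helper, not taken from the bug report.
public class SupplementaryCharEscaper {

    /** Copies in to out, replacing each code point above U+FFFF with a &#x...; reference. */
    public static void escape(Reader in, Writer out) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (Character.isHighSurrogate(ch)) {
                int next = in.read();
                if (next != -1 && Character.isLowSurrogate((char) next)) {
                    // A surrogate pair in Java corresponds to a 4-byte sequence in UTF-8.
                    int codePoint = Character.toCodePoint(ch, (char) next);
                    out.write("&#x" + Integer.toHexString(codePoint).toUpperCase() + ";");
                    continue;
                }
                // Unpaired surrogate (should not occur in decoded UTF-8): copy as-is.
                out.write(c);
                if (next != -1) {
                    out.write(next);
                }
            } else {
                out.write(c);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input XML file, args[1] = output file (both hypothetical paths).
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            escape(in, out);
        }
    }
}
```

This streams the file rather than loading it into memory, which matters for an input of several gigabytes. The rewritten file then contains no 4-byte sequences, at the cost of one extra pass over the data.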

Hope this helps someone!
