Java – How to parse XML files too big to fit in memory

design, java, xml

I'm trying to process XML files that are too big to fit in memory. They range in size anywhere from dozens of megabytes to over 120GB. My initial attempt had me reading the files as plain text, in chunks of a few thousand characters at a time, and looking for individual complete XML tags in the small String chunks:

import java.io.FileReader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

    try (FileReader fileReader = new FileReader(file)) {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc;

        int charsToReadAtOnce = 1000;
        char[] readingArray = new char[charsToReadAtOnce];
        StringBuilder buffer = new StringBuilder(charsToReadAtOnce * 3);

        int charsRead;
        while ((charsRead = fileReader.read(readingArray, 0, charsToReadAtOnce)) != -1) {
            //append only the characters actually read; the last chunk is usually short
            buffer.append(readingArray, 0, charsRead);

            //this throws unless the buffer happens to hold a complete document
            doc = builder.parse(new InputSource(new StringReader(buffer.toString())));

            //see if the buffer contains a complete XML tag
            //if so, save useful info and manually clear it
        }
    } catch (ALL THE EXCEPTIONS...

This was getting messy and complicated fast, with a lot of edge cases like tags over 1000 characters long and nested start and end tags to ignore. Instead of pressing on, I want to use a less painful algorithm but can't come up with a really good one. Does Java have a more appropriate way to handle massive XML files like these? While asking this question, I came across Read a zipped xml with .NET. I think something like that, but for Java, might work for me, but I don't know if it exists?

Best Answer

Use a streaming API (such as SAX, see https://docs.oracle.com/javase/tutorial/jaxp/sax/) rather than a DOM API. The former processes tags as they occur, while the latter builds the entire document model in memory, which is a non-starter for a 120GB file. See also https://stackoverflow.com/q/6828703/744133
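
A minimal SAX sketch, assuming the useful info lives in hypothetical <record> elements (the element name and the class name BigXmlScan are placeholders, not anything from the question): the handler receives callbacks as the parser streams through the document, so memory use stays roughly constant regardless of file size.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class BigXmlScan {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                text.setLength(0); // reset the character buffer for each new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may arrive across several calls
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("record".equals(qName)) { // hypothetical element of interest
                    // save the useful info here; nothing else is kept in memory
                    System.out.println("record: " + text);
                }
            }
        };

        // stream the file through the parser; works the same for 10MB or 120GB
        try (BufferedInputStream in =
                 new BufferedInputStream(new FileInputStream(args[0]))) {
            parser.parse(in, handler);
        }
    }
}

Since SAXParser.parse accepts any InputStream, wrapping the FileInputStream in a java.util.zip.GZIPInputStream (or a ZipInputStream positioned at an entry) would cover the zipped-XML case the question mentions. StAX (javax.xml.stream.XMLStreamReader) is another streaming option if a pull-style API is easier to fit into your code than SAX callbacks.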