Reviewing Object-Oriented Parser Models in Java

androidjavaparsingregular expressions

First time I make use of Code Review. At the moment I'm creating an Android application and the company I'm developing it for, has a specific file format that can be used to give text based data specific formatting rules. See it as an rich text format especially build and designed for their use.

The File format to work with

The file format was or at least is in my opinion designed to be used for pure formatting reasons and is actually not very well designed for data access. The Android application I'm writing depends very strongly on data access, that being said I had to design a smart parser which is capable of transforming the formatted text to some kind of data model which could still have the formatting included.

A typical file could exist out of 600 lines of text having a total of 60000 characters. And could look like this:

\id sit amet
\ide UTF-8
\rem File version: 2013-06-18
\rem Copyright 1995-2012
\t Lorem ipsum
\h 1
\st Nunc at urna
\p
\v 1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
\v 2 Suspendisse ac turpis quis diam elementum cursus sit amet non urna.
\pq Vestibulum vulputate \nd dui\nd* eget pharetra adipiscing.
\pq Suspendisse viverra metus quis feugiat rutrum.
\v 3 Nunc at urna vehicula, luctus quam convallis, adipiscing metus.
\v 4 Integer sit amet neque condimentum, laoreet neque ut, feugiat tellus.
\h 2
\v 1 Nam commodo orci et magna eleifend imperdiet.
\v 2 Cras a lorem ac nunc interdum lacinia.
\v 3 Integer vestibulum mauris id scelerisque lacinia.
\st Ut molestie nibh
\p
\v 4 Duis aliquam \nd diam\nd* nec nisl hendrerit posuere.
\v 5 Donec eget sapien id purus sollicitudin elementum non eget arcu.
\b
\v 6 Fusce auctor justo vitae aliquam luctus.
\v 7 Phasellus eget nisl at dui dignissim pharetra.
\v 8 Mauris iaculis mi at augue mattis, eu bibendum nunc placerat.

The \h marker and the \v marker are the most important markers they are used to identify the data using the integer behind it. The other makers are mostly irrelevant (like the \id, \ide, \rem markers), and some like \pq are used to append their data to the line above it but with some formatting applied to it. And some markers are "in-text" like the \nd marker.

Of course there are a lot of rules how the tags can be used in a valid file, but it's totally irrelevant for the review I think. (Or is there someone willing to read 90 pages?)

The structure of the parser

So I came up with the following (Object oriented) parsing model:

The parser has to be:

  • Able to separate DATA from FORMATTING
  • Fast
  • separate the data and the formatting so both can be easily accessed
  • Able to parse a subset of data or a specific range of data from a file

A selection of data is often referred as: File – Level – Sub-level(actual data)

Where the file is just a file to parse, the level an integer which can be found in the file by the \h marker and the Sub-level just like Level with the \v maker.

The parser I wrote returns an object which holds two lists – one with all the formatting and one with all the data.

public class ParseResult {
    protected ArrayList<FileData> data = new ArrayList<FileData>();
    protected ArrayList<FileFormat> format = new ArrayList<FileFormat>();
}

The FileData object and the FileFormat object can be extended to create numerous different formats and data objects. This is the structure I use in the parser I wrote:

For data:

FileData  
    -FileDataLocation extends FileData  
        -FileDataText extends FileDataLocation  
        -FileDataThing extends FileDataLocation  
    -FileDataAnotherThing extends FileData  

For formatting:

FileFormat  
    -FileFormatLevel extends FileFormat  
    -FileFormatTitle extends FileFormat  
    -FileFormatSubLevel extends FileFormat  
    -FileFormatSpecialTokens extends FileFormat 

To use data from the parsed file I use the data ArrayList. To print the data formatted I loop trough the format ArrayList which is sorted so that the beginning of the file would be at the 0 index of the list and the end of the parsed file the last index of the list.

Each formatting object has a reference to a data object in the data list. So that when using the formatting objects the data is always accessible. The formatting objects only hold information what peace of data the format should be applied to and what kind of formatting.

Parsing method

When parsing a file the parser reads the file line by line and constantly keeps track of the \h and \v markers, and starts capturing data in the ParseResult object when a specific combination of the markers is found (Lets call that a location).

To check what type of marker the parser is dealing with I scan the line until I find the first whitespace. When a whitespace has been found a substring can be made from the start of the line to the location of the whitespace.

Comparing the substring with a list of known makers tells the parser what to do with the line.

After the marker is found the line is processed using regex. This is an example regex for a line with the \v marker:

\v\s(([0-9]{1,3})(?:-([0-9]{1,3}))?)\s{1}(.+)[\n\r$]*

Review?

The performance of the parser is not bad at all, the largest files take about 600ms to process completely (9000 lines, 280000 characters).

Things I would like to know

  • Data is now (after parsing) stored in just a list, searching for a specific peace of data is not very well optimized. Are there other structures I could use, which I can search using file(string) - level(integer) - sub-level(integer) so I can quickly get a specific data object.
  • When looping trough the formatting list I must check what instance an object belongs to, and then cast the object so I can use the specific methods and functions. Is there a better, other, alternative method?
  • The file format is pretty basic, should I consider removing all regex functions and parse it whole by just regular code?

Best Answer

Data is now (after parsing) stored in just a list, searching for a specific peace of data is not very well optimized. Are there other structures I could use, which I can search using file(string) - level(integer) - sub-level(integer) so I can quickly get a specific data object.

Assuming that you want to search by File, Level, Sub-Level only, you have a clear hierarchical structure in that description, it sounds like you can divide one ArrayList into multiple steps. You could even make each hierarchy level a class itself to make it more clear, for example:

class SomeFileData {
   List<LevelData> levels;
}

class LevelData {
   List<SubLevelData> sublevels;
}

class SubLevelData {
   // Probably similar to your current `FileData` implementation
}

And if you need to save multiple files at once, you can use a Map with the file name as key.

class LotsOfFileData {
    Map<String, SomeFileData> files; 
}

Using this hierarchy, you can easily get file "mydata.dat", level 3, sublevel 4.

LotsOfFileData allData;
allData.getFile("mydata.dat").getLevel(3).getSublevel(4)

Each step in the get process looks at the current class' list or map to retrieve the data.

By the way, declare variables according to their interface, not their implementation. Declare your ArrayLists as List to allow for easy change of implementation.