C++ – Preventing a parser from turning into a (seemingly) god-sized object

anti-patternsc

So I have a program whose purpose is to take text files and parse them into a binary format that an embedded system understands. However, the text format I've inherited that I need to parse is sufficiently complex enough that after refactoring the main parse routine I'm left with a class with more than 50 methods that almost all look something like parseChannel, parseWCommand, parseVCommand, parsePCommand, parseLoop, parseHex, parseInt, etc. etc. etc.

Needless to say, the class, from its declaration, looks huge and daunting.

However, the methods to interface with the class are extremely simple, just parse (compile the text, figure out its compiled size) and link (fix the pointers once the location in memory is known, return the finalized raw binary data). Really, to the user the class has virtually no other reason for existence besides those two functions, but there's so much seemingly useless fluff in the class declaration that it's hard to even see what it's supposed to do. There's a similar situation going on with the rather massive number of data members that, again, are useless to everything but the class's internal methods that need them to talk to each other, though I don't know if that's as much of an issue.

I've considered making a separate class solely for parsing that's used within the parse method, but it seems strange to me to build an entirely separate class that's only used in a single method in a single class. It seems a bit…superfluous, I guess? And I don't even know if that's attacking the right problem.

I guess in the end, here's what I'm asking:

  1. Is this seemingly-huge class actually a problem? I don't think it's strictly a "god object", but it superficially looks like one.
  2. If it is a problem, what are the best method(s) to fix it?

Best Answer

Ditch the OOP

The member functions probably don’t need to be. You can have the Parser class provide a few primitive operations (accept character, backtrack, &c.) and implement the rest of the parser as free functions taking a Parser reference. They can live in the source file (under an anonymous namespace) thus keeping the header nice and minimal.

In this sense, Parser is really a minimal wrapper around a stream state.

Also, it helps to factor out combinators such as many, choice, and so on, to avoid writing error-prone and verbose manual loops.

Sketch

In Parse.h, your entire public API.

unique_ptr<const Program> parse(istream&);

Your parser state can be entirely private to Parse.cpp.

struct Parser {

  Parser(istream& stream) : stream(stream) {}

  bool accept(const char expected) {
    char actual;
    if (!stream.get(actual)) return false;
    if (expected == actual) return true;
    stream.seekg(-1, ios::cur);
    return false;
  }

  template<class F, class O>
  bool accept_if(F predicate, O output) {
    char actual;
    if (!stream.get(actual)) return false;
    if (predicate(actual)) {
      *output++ = actual;
      return true;
    }
    return false;
  }

  void expect(const char expected) { if (!accept(expected)) throw ...; }

  void push_mark() { marks.push_back(stream.tellg()); }
  void pop_mark() { stream.seekg(marks.back()); drop_mark(); }
  void drop_mark() { marks.pop_back(); }

private:
  istream& stream;
  vector<istream::pos_type> marks;
};

The public API implementation just forwards to the start production of a grammar with a new parser.

unique_ptr<const Program> parse(istream& stream) {
  return parse_program(Parser(stream));
}

Output iterators are a fairly convenient way to produce multiple results.

template<class F, class O>
bool many_if(Parser& parser, F predicate, O output) {
  bool success;
  while (parser.accept_if(predicate, output)) success = true;
  return success;
}

Pointers have the advantage of a conversion to bool.

unique_ptr<const Expression> number(Parser& parser) {
  string digits;
  auto append = back_inserter(digits);
  return many_if(isdigit, append) ? make_unique<Number>(digits) : nullptr;
}

And so on. The real strengths of C++ to leverage when writing parsers are:

  • Generic programming
  • Iterators
  • Streams

Objects, not so much.