An elegant way to split text into words combined with adjacent punctuation and determine which punctuation mark it is

parsing, text processing

Firstly, I realize that the question title is about as terrible as the sample code I'll post below, so please bear with me while I explain the problem more clearly. If you have a better idea for the title, be my guest and edit it.

Imagine a long plain text. It consists of words separated by punctuation marks and/or spaces. I need to convert it into a list of items, each holding a word together with the punctuation that separates it from the next word. The twist is that I also need to determine which punctuation mark it is (or which one comes last, if there are several in a row). So I need to turn the text into a collection of structures:

{
   wordFollowedByPunctuation: String;
   punctuationMark: PunctuationType; // E. g. {Point, Comma, Colon, Space, ...}
}

If all the punctuation marks were single characters, this would be easy, since single-pass character-wise parsing would do. I have a working, albeit awful, C++ prototype (using Qt's QString and QChar for Unicode support).

Here's the TextFragment structure – I'll be converting the text into a collection of these:

struct TextFragment
{
    enum Delimiter {
        Space,
        Comma,
        Point,
        ExclamationMark,
        QuestionMark,
        Dash,
        Colon,
        Semicolon,
        Ellipsis,
        Bracket,
        Newline
    };

    inline TextFragment(const QString& text, Delimiter delimiter) : _text(text), _delimiter(delimiter) {}

    const QString _text;
    const Delimiter _delimiter;
};

And here's the actual parsing:

const QString text = readText(device);

    struct Delimiter {
        QChar delimiterCharacter;
        TextFragment::Delimiter delimiterType;

        inline bool operator< (const Delimiter& other) const {
            return delimiterCharacter < other.delimiterCharacter;
        }
    };

    static const std::set<Delimiter> delimiters {
        {' ', TextFragment::Space},
        {'.', TextFragment::Point},
        {':', TextFragment::Colon},
        {';', TextFragment::Semicolon},
        {',', TextFragment::Comma},
        // TODO: dash should be ignored unless it has an adjacent space!
        {'-', TextFragment::Dash},
        // TODO:
        // {"...", TextFragment::Ellipsis},
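        // Non-ASCII characters don't fit into a plain char literal, so use code points.
        {QChar(0x22EF), TextFragment::Ellipsis}, // '⋯' midline horizontal ellipsis
        {QChar(0x2026), TextFragment::Ellipsis}, // '…' horizontal ellipsis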
        {'!', TextFragment::ExclamationMark},
        {'\n', TextFragment::Newline},
        {'?', TextFragment::QuestionMark},

        {')', TextFragment::Bracket},
        {'(', TextFragment::Bracket},
        {'[', TextFragment::Bracket},
        {']', TextFragment::Bracket},
        {'{', TextFragment::Bracket},
        {'}', TextFragment::Bracket}
    };

    std::vector<TextFragment> fragments;

    QString buffer;
    bool wordEnded = false;
    TextFragment::Delimiter lastDelimiter = TextFragment::Space;
    for (QChar ch: text)
    {
        if (ch == '\r')
            continue;

        const auto it = delimiters.find({ch, TextFragment::Space});
        if (it == delimiters.end()) // Not a delimiter
        {
            if (wordEnded) // This is the first letter of a new word
            {
                fragments.emplace_back(buffer, lastDelimiter);
                wordEnded = false;
                buffer = ch;
            }
            else
                buffer += ch;
        }
        else // This is a delimiter. Append it to the current word.
        {
            lastDelimiter = it->delimiterType;
            wordEnded = true;
            buffer += ch;
        }
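    }

    // Flush the last word: the loop above only emits a fragment when the next word starts.
    // If the text does not end with a delimiter, lastDelimiter still refers to the previous word.
    if (!buffer.isEmpty())
        fragments.emplace_back(buffer, lastDelimiter);

    return fragments;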

Here we have a state machine that tries hard not to look like one, and is worse for that. It works, but it has a bigger problem than coding style: some delimiters are multi-character, and it simply can't handle them. One example is an ellipsis consisting of three dots ("..."), which I want to tell apart from a single dot. Another is that I want to distinguish between a hyphen and a dash. A hyphen separates parts of a compound word, e.g. "up-to-date", and as far as I'm concerned it's not a punctuation mark. A dash, on the other hand, is: "Joe — and his trusty mutt — was always welcome." There is a dedicated dash character, but in plain non-Unicode texts both are commonly represented by a hyphen ("-"). The only way I see to tell them apart is to treat only "space+hyphen+space" or "space+hyphen+other punctuation mark" as a dash.
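To make that dash rule concrete, here is a minimal sketch. It uses plain std::string instead of QString to stay self-contained, and both the name isDashAt and the list of characters allowed after a dash are made up for illustration:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the '-' at position i should count as a
// dash (punctuation) rather than a hyphen inside a compound word.
// Rule sketched here: a space before, and a space or another punctuation mark after.
bool isDashAt(const std::string& text, std::size_t i) {
    if (i >= text.size() || text[i] != '-')
        return false;

    // Assumed set of characters that may follow a dash; extend as needed.
    static const std::string following = " .,:;!?";

    const bool spaceBefore = i > 0 && text[i - 1] == ' ';
    const bool delimiterAfter =
        i + 1 < text.size() && following.find(text[i + 1]) != std::string::npos;
    return spaceBefore && delimiterAfter;
}
```

With this, the hyphens in "up-to-date" are left alone, while the one in "Joe - and his trusty mutt" is reported as a dash.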

Example: the sentence

If only he had tried... well, it doesn't matter now.

should result in

{"If ", Space},
{"only ", Space},
{"he ", Space},
{"had ", Space},
{"tried... ", Ellipsis},
{"well, ", Comma},
{"it ", Space},
{"doesn't ", Space},
{"matter ", Space},
{"now.", Point}

The way I see it, I need some sort of exhaustive parsing instead of my greedy prototype (and, naturally, separators themselves should be represented by strings, not characters). What's the simplest way to do that? Can I use regular expressions for this (I'm terrible with them)?
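For what it's worth, a regular expression can at least do the splitting part. The following is only a sketch under assumptions: it uses std::regex on a plain std::string (so ASCII delimiters only; for QString, QRegularExpression would be the Qt counterpart), and the function name splitWithRegex and the delimiter set are invented for the example:

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns (word, trailing delimiter run) pairs. The delimiter set is an
// assumption for this sketch; '-' is deliberately left out so that
// hyphenated words stay in one piece.
std::vector<std::pair<std::string, std::string>> splitWithRegex(const std::string& text) {
    // Group 1: a run of non-delimiter characters (the word).
    // Group 2: the possibly empty run of delimiters that follows it.
    static const std::regex chunk(R"(([^ .,:;!?\n]+)([ .,:;!?\n]*))");

    std::vector<std::pair<std::string, std::string>> result;
    for (std::sregex_iterator it(text.begin(), text.end(), chunk), end; it != end; ++it)
        result.emplace_back((*it)[1].str(), (*it)[2].str());
    return result;
}
```

Each delimiter run is short, so classifying it afterwards (checking for "..." before checking for a single ".") is a simple string comparison rather than a parsing problem.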

Best Answer

A hyphen is punctuation too. Compare "we need to fix this up - to date we haven't bothered" with "fix the errors up-to and including Tuesday". Relying on spaces won't help you with sloppy typists.

Typically, though, you handle your text a single character at a time, and once you locate the start of a multi-character punctuation mark, you process the subsequent text at that point. For example, when you find a '.', you read ahead to determine whether the next two characters are also '.', in which case you combine the three into a single 'ellipsis' punctuation mark. The problem becomes one of reading ahead into the stream and, if you do not consume the subsequent characters, putting them back into the stream for the main processing loop to work with. This kind of problem is why input streams such as std::istream have functions like putback() and peek().
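The read-ahead idea could look roughly like this. It is a sketch, not a drop-in replacement for the Qt prototype: it works on a std::istream of plain chars, handles only space, comma, point and ellipsis, and the names Delim, Fragment, readDelimiter and splitFragments are invented here. It also assumes that a space following another punctuation mark should not override it, which is what the expected output in the question suggests.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

enum class Delim { Space, Comma, Point, Ellipsis };

struct Fragment {
    std::string text;  // the word plus its trailing delimiters
    Delim delim;       // the delimiter that classifies this fragment
};

bool isDelimiter(int ch) { return ch == ' ' || ch == ',' || ch == '.'; }

// Reads one delimiter; the caller guarantees the next character is one.
// A '.' triggers read-ahead: two more dots turn it into an ellipsis.
Delim readDelimiter(std::istream& in, std::string& consumed) {
    const char ch = static_cast<char>(in.get());
    consumed += ch;
    if (ch == ' ') return Delim::Space;
    if (ch == ',') return Delim::Comma;

    // ch is '.': look ahead for two more dots.
    if (in.peek() == '.') {
        in.get();  // tentatively take the second dot
        if (in.peek() == '.') {
            in.get();
            consumed += "..";
            return Delim::Ellipsis;
        }
        in.putback('.');  // only two dots: hand the second one back
    }
    return Delim::Point;
}

std::vector<Fragment> splitFragments(std::istream& in) {
    std::vector<Fragment> fragments;
    std::string buffer;
    Delim last = Delim::Space;
    bool wordEnded = false;

    while (in.peek() != std::char_traits<char>::eof()) {
        if (isDelimiter(in.peek())) {
            const Delim d = readDelimiter(in, buffer);
            // Assumption: a space after other punctuation does not override it.
            if (!(wordEnded && d == Delim::Space))
                last = d;
            wordEnded = true;
        } else {
            if (wordEnded) {  // first character of a new word
                fragments.push_back({buffer, last});
                buffer.clear();
                wordEnded = false;
            }
            buffer += static_cast<char>(in.get());
        }
    }
    if (!buffer.empty())  // flush the final word
        fragments.push_back({buffer, last});
    return fragments;
}
```

Feeding it a std::istringstream with the sample sentence from the question gives ten fragments, with "tried... " classified as an ellipsis and "well, " as a comma. The same structure extends to the dash case: on '-', peek at the neighbours before deciding whether it is a delimiter at all.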