C# – hand coding a parser

For all you compiler gurus: I wanna write a recursive descent parser, and I wanna do it with just code. No generating lexers and parsers from some other grammar, and don't tell me to read the Dragon Book; I'll come around to that eventually.

I wanna get into the gritty details of implementing a lexer and parser for a reasonably simple language, say CSS. And I wanna do this right.

This will probably end up being a series of questions but right now I'm starting with a lexer. Tokenization rules for CSS can be found here.

I find myself writing code like this (hopefully you can infer the rest from this snippet):

public CssToken ReadNext()
{
    int val;
    while ((val = _reader.Read()) != -1)
    {
        var c = (char)val;
        switch (_stack.Top)
        {
            case ParserState.Init:
                if (c == ' ')
                {
                    continue; // ignore
                }
                else if (c == '.')
                {
                    _stack.Transition(ParserState.SubIdent, ParserState.Init);
                }
                break;

            case ParserState.SubIdent:
                if (c == '-')
                {
                    _token.Append(c);
                }
                _stack.Transition(ParserState.SubNMBegin);
                break;

What is this called, and how far off am I from something reasonably well understood? I'm trying to strike a balance between efficiency and ease of use. Using a stack to implement some kind of state machine is working quite well, but I'm unsure how to continue like this.

What I have is an input stream from which I can read one character at a time. I don't do any lookahead right now; I just read the character and then, depending on the current state, try to do something with it.

I'd really like to get into the mindset of writing reusable snippets of code. The Transition method is currently meant to do that: it pops the current state off the stack and then pushes its arguments in reverse order. That way, when I write Transition(ParserState.SubIdent, ParserState.Init), it "calls" a subroutine SubIdent which, when complete, returns to the Init state.
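The stack behaviour described above can be sketched roughly as follows. This is Python rather than C# purely for brevity, and the class name `StateStack`, the `pop` method, and the string state names are assumptions made for illustration, not the actual `_stack` implementation from the snippet.

```python
class StateStack:
    """Hypothetical stand-in for the _stack object in the question."""

    def __init__(self, initial):
        self._states = [initial]

    @property
    def top(self):
        # The state that handles the next character.
        return self._states[-1]

    def transition(self, *states):
        # Pop the current state, then push the arguments in reverse order,
        # so the first argument ends up on top and the rest act as
        # "return addresses" beneath it.
        self._states.pop()
        for state in reversed(states):
            self._states.append(state)

    def pop(self):
        # Assumed way for a sub-state to "return" to the state beneath it.
        return self._states.pop()


stack = StateStack("Init")
stack.transition("SubIdent", "Init")   # "call" SubIdent, return to Init
stack.transition("SubNMBegin")         # replace SubIdent with SubNMBegin
```

After these two calls the top of the stack is `SubNMBegin` with `Init` waiting beneath it, which is the call-and-return behaviour the paragraph describes.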

The parser will be implemented in much the same way. Currently, having everything in a single big method like this allows me to easily return a token when I find one, but it also forces me to keep everything in that one big method. Is there a nice way to split these tokenization rules into separate methods?

Best Answer

What you're writing is called a pushdown automaton. This is usually more power than you need to write a lexer, and it's certainly excessive if you're writing a lexer for a modern language like CSS. A recursive descent parser is close in power to a pushdown automaton, but recursive descent parsers are much easier to write and to understand. Most parser generators generate pushdown automata.

Lexers are almost always written as finite state machines, i.e., like your code except without the "stack" object. Finite state machines are closely related to regular expressions (actually, they're provably equivalent to one another). When designing such a lexer, one usually starts with the regular expressions and uses them to create a deterministic finite automaton, with some extra code in the transitions to record the beginning and end of each token.
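As a rough illustration of that idea, here is a minimal table-driven sketch in Python. The two token kinds (identifier and integer), the state names, and the helpers `char_class` and `tokenize` are all assumptions chosen for the example; they are not the CSS tokenization rules. The DFA corresponds to the regular expressions `[A-Za-z][A-Za-z0-9]*` and `[0-9]+`, and the scanner records where each token begins and ends.

```python
def char_class(c):
    # Map a character to the input alphabet of the DFA.
    if "a" <= c <= "z" or "A" <= c <= "Z":
        return "letter"
    if "0" <= c <= "9":
        return "digit"
    return "other"


# (state, character class) -> next state; a missing entry means "stop here".
TRANSITIONS = {
    ("start", "letter"): "ident",
    ("start", "digit"): "int",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
    ("int", "digit"): "int",
}

# Accepting states and the token kind each one produces.
ACCEPTING = {"ident": "Identifier", "int": "Integer"}


def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        state = "start"
        begin = pos  # record where the token starts
        while pos < len(text):
            nxt = TRANSITIONS.get((state, char_class(text[pos])))
            if nxt is None:
                break
            state = nxt
            pos += 1
        if state not in ACCEPTING:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((ACCEPTING[state], text[begin:pos]))  # pos is the token end
    return tokens
```

Note that there is no stack anywhere: the only memory is the current state plus the start offset of the token being read.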

There are tools to do this. The lex tool and its descendants are well known and have been translated into many languages. The ANTLR toolchain also has a lexer component. My preferred tool is ragel on platforms that support it. There is little benefit to writing a lexer by hand most of the time, and the code generated by these tools will probably be faster and more reliable.

If you do want to write your own lexer by hand, good ones often look something like this:

function readToken() // note: returns only one token each time
    while !eof
        c = peekChar()
        if c in A-Za-z
            return readIdentifier()
        else if c in 0-9
            return readInteger()
        else if c in ' \n\r\t\v\f'
            nextChar()
        ...
    return EOF

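function readIdentifier()
    ident = ""
    while !eof
        c = peekChar() // peek first, so the character after the identifier is not consumed
        if c in A-Za-z0-9
            ident.append(nextChar())
        else
            break
    // also reached at eof, so an identifier at the end of input is not lost
    return Token(Identifier, ident)
    // or maybe: return Identifier(ident)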
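For reference, a runnable Python rendering of the same shape might look like the sketch below. The class name `Lexer`, the tuple-based tokens, and the helper `_read_while` are assumptions made for this example. It peeks before consuming inside the identifier and integer loops, so the character that ends a token stays in the input for the next call.

```python
import io
import string

LETTERS = string.ascii_letters
DIGITS = string.digits
WHITESPACE = " \n\r\t\v\f"


class Lexer:
    def __init__(self, text):
        self._reader = io.StringIO(text)
        self._peeked = None  # one character of lookahead, None if not read yet

    def peek_char(self):
        # Look at the next character without consuming it ("" means EOF).
        if self._peeked is None:
            self._peeked = self._reader.read(1)
        return self._peeked

    def next_char(self):
        # Consume and return the next character.
        c = self.peek_char()
        self._peeked = None
        return c

    def read_token(self):
        # Returns exactly one token per call.
        while True:
            c = self.peek_char()
            if c == "":
                return ("EOF", "")
            if c in LETTERS:
                return ("Identifier", self._read_while(LETTERS + DIGITS))
            if c in DIGITS:
                return ("Integer", self._read_while(DIGITS))
            if c in WHITESPACE:
                self.next_char()  # skip whitespace and keep looking
                continue
            raise ValueError(f"unexpected character {c!r}")

    def _read_while(self, allowed):
        # Consume characters for as long as they belong to `allowed`.
        chars = []
        while True:
            c = self.peek_char()
            if c == "" or c not in allowed:
                break
            chars.append(self.next_char())
        return "".join(chars)
```

Each token kind gets its own small method or helper, which is one way to avoid the single big switch from the question.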

Then you can write your parser as a recursive descent parser. Don't try to combine the lexer and parser stages into one; it leads to a total mess of code. (According to the Parsec author, it's slower, too.)
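To show how the two stages stay separate, here is a small Python sketch of a recursive descent parser sitting on top of a token list. The toy grammar (`rule := IDENT '{' declaration* '}'`, `declaration := IDENT ':' IDENT ';'`) is a made-up, heavily simplified subset and not real CSS, and `lex`, `Parser`, and `ParseError` are hypothetical names. The lexer here uses a regular expression only to keep the example short.

```python
import re

# Stand-in lexer: identifiers and the four punctuation characters only.
TOKEN = re.compile(r"\s*(?:([A-Za-z][A-Za-z0-9-]*)|([{}:;]))")


class ParseError(Exception):
    pass


def lex(text):
    tokens = []
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ParseError(f"unexpected character at offset {pos}")
        if m.group(1) is not None:
            tokens.append(("IDENT", m.group(1)))
        else:
            tokens.append((m.group(2), m.group(2)))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self._tokens = tokens
        self._pos = 0

    def _peek(self):
        # Kind of the next token, or "EOF" when the input is exhausted.
        if self._pos < len(self._tokens):
            return self._tokens[self._pos][0]
        return "EOF"

    def _expect(self, kind):
        if self._peek() != kind:
            raise ParseError(f"expected {kind}, got {self._peek()}")
        text = self._tokens[self._pos][1]
        self._pos += 1
        return text

    # One method per grammar rule.
    def parse_stylesheet(self):
        rules = []
        while self._peek() != "EOF":
            rules.append(self.parse_rule())
        return rules

    def parse_rule(self):
        selector = self._expect("IDENT")
        self._expect("{")
        declarations = {}
        while self._peek() == "IDENT":
            name, value = self.parse_declaration()
            declarations[name] = value
        self._expect("}")
        return (selector, declarations)

    def parse_declaration(self):
        name = self._expect("IDENT")
        self._expect(":")
        value = self._expect("IDENT")
        self._expect(";")
        return name, value
```

The parser never looks at individual characters; it only asks for the kind of the next token, which is what keeps the two stages from tangling.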

What are the syntax errors?

PHP belongs to the C-style and imperative programming languages. It has rigid grammar rules, which it cannot recover from when encountering misplaced symbols or identifiers. It can't guess your coding intentions.

Most important tips

There are a few basic precautions you can always take:

Use proper code indentation, or adopt any lofty coding style. Readability prevents irregularities.
Use an IDE or editor for PHP with syntax highlighting, which also helps with parentheses/bracket balancing.
Read the language reference and examples in the manual. Twice, to become somewhat proficient.

How to interpret parser errors

A typical syntax error message reads:

Parse error: syntax error, unexpected T_STRING, expecting ';' in file.php on line 217

This lists the possible location of a syntax mistake. See the mentioned file name and line number.

A moniker such as T_STRING explains which symbol the parser/tokenizer ultimately couldn't process. This isn't necessarily the cause of the syntax mistake, however.

It's important to look into previous code lines as well. Often syntax errors are just mishaps that happened earlier. The error line number is just where the parser conclusively gave up processing.

Solving syntax errors

There are many approaches to narrow down and fix syntax hiccups.

Open the mentioned source file. Look at the mentioned code line.
- For runaway strings and misplaced operators, this is usually where you find the culprit.
- Read the line left to right and imagine what each symbol does.
More regularly you need to look at preceding lines as well.
- In particular, missing ; semicolons are usually missing at the end of the previous line/statement. (At least from the stylistic viewpoint.)
- If { code blocks } are incorrectly closed or nested, you may need to investigate even further up the source code. Use proper code indentation to simplify that.
Look at the syntax colorization!
- Strings and variables and constants should all have different colors.
- Operators +-*/. should be tinted distinctly as well. Otherwise they might be in the wrong context.
- If you see string colorization extend too far or too short, then you have found an unescaped or missing closing " or ' string marker.
- Two same-colored punctuation characters next to each other can also mean trouble. Operators usually stand alone, except for ++, --, or parentheses following an operator. Two strings/identifiers directly following each other are incorrect in most contexts.
Whitespace is your friend. Follow any coding style.
Break up long lines temporarily.
- You can freely add newlines between operators or constants and strings. The parser will then report a more precise line number for parsing errors. Instead of looking at one very lengthy line of code, you can isolate the missing or misplaced syntax symbol.
- Split up complex if statements into distinct or nested if conditions.
- Instead of lengthy math formulas or logic chains, use temporary variables to simplify the code. (More readable = fewer errors.)
- Add newlines between:
  1. The code you can easily identify as correct,
  2. The parts you're unsure about,
  3. And the lines which the parser complains about.
  Partitioning up long code blocks really helps to locate the origin of syntax errors.
Comment out offending code.
- If you can't isolate the problem source, start to comment out (and thus temporarily remove) blocks of code.
- As soon as you have gotten rid of the parsing error, you have found the problem source. Look more closely there.
- Sometimes you want to temporarily remove complete function/method blocks. (In case of unmatched curly braces and wrongly indented code.)
- When you can't resolve the syntax issue, try to rewrite the commented out sections from scratch.
As a newcomer, avoid some of the confusing syntax constructs.
- The ternary ? : condition operator can compact code and is useful indeed. But it doesn't aid readability in all cases. Prefer plain if statements while unversed.
- PHP's alternative syntax (if:/elseif:/endif;) is common for templates, but arguably less easy to follow than normal { code } blocks.
The most prevalent newcomer mistakes are:
- Missing semicolons ; for terminating statements/lines.
- Mismatched string quotes for " or ' and unescaped quotes within.
- Forgotten operators, in particular for the string . concatenation.
- Unbalanced ( parentheses ). Count them in the reported line. Are there an equal number of them?
Don't forget that solving one syntax problem can uncover the next.
- If you make one issue go away, but another crops up in some code below, you're mostly on the right path.
- If after editing a new syntax error crops up in the same line, then your attempted change was possibly a failure. (Not always though.)
Restore a backup of previously working code, if you can't fix it.
- Adopt a source code versioning system. You can always view a diff of the broken and last working version. Which might be enlightening as to what the syntax problem is.
Invisible stray Unicode characters: In some cases, you need to use a hexeditor or different editor/viewer on your source. Some problems cannot be found just from looking at your code.
- Try grep --color -P -n "[\x80-\xFF]" file.php as the first measure to find non-ASCII symbols.
- In particular, BOMs, zero-width spaces, non-breaking spaces, and smart quotes regularly find their way into source code.
Take care of which type of linebreaks are saved in files.
- PHP just honors \n newlines, not \r carriage returns.
- This is occasionally an issue for MacOS users (even on OS X for misconfigured editors).
- It often only surfaces as an issue when single-line // or # comments are used. Multiline /*...*/ comments seldom disturb the parser when linebreaks get ignored.
If your syntax error does not transmit over the web: It happens that you have a syntax error on your machine. But posting the very same file online does not exhibit it anymore. Which can only mean one of two things:
- You are looking at the wrong file!
- Or your code contained invisible stray Unicode (see above). You can easily find out: Just copy your code back from the web form into your text editor.
Check your PHP version. Not all syntax constructs are available on every server.
- php -v for the command line interpreter
- <?php phpinfo(); for the one invoked through the webserver.
Those aren't necessarily the same. In particular when working with frameworks, you will need them to match up.
Don't use PHP's reserved keywords as identifiers for functions/methods, classes or constants.
Trial-and-error is your last resort.
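For the invisible stray Unicode characters mentioned in the list above, a small script can serve as an alternative when grep is not available. The following Python sketch is only an illustration; the function name `find_non_ascii` is made up for this example. It reports the line, column, and value of every byte outside the ASCII range.

```python
def find_non_ascii(data: bytes):
    """Return (line, column, byte value) for every byte >= 0x80 in data."""
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte >= 0x80:
                hits.append((line_no, col, byte))
    return hits


# Example input: a PHP snippet where "smart quotes" replaced plain quotes.
sample = b"<?php\necho \xe2\x80\x9chi\xe2\x80\x9d;\n"
for line_no, col, byte in find_non_ascii(sample):
    print(f"line {line_no}, column {col}: 0x{byte:02X}")
```

To check a real file, read it in binary mode, for instance `find_non_ascii(open("file.php", "rb").read())`, so that no decoding hides the offending bytes.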

If all else fails, you can always google your error message. Syntax symbols aren't as easy to search for (Stack Overflow itself is indexed by SymbolHound though). Therefore it may take looking through a few more pages before you find something relevant.

Further guides:

PHP Debugging Basics by David Sklar
Fixing PHP Errors by Jason McCreary
PHP Errors – 10 Common Mistakes by Mario Lurig
Common PHP Errors and Solutions
How to Troubleshoot and Fix your WordPress Website
A Guide To PHP Error Messages For Designers - Smashing Magazine

White screen of death

If your website is just blank, then typically a syntax error is the cause. Enable error display with:

error_reporting = E_ALL
display_errors = 1

Set these in your php.ini generally, or via .htaccess for mod_php, or even in .user.ini with FastCGI setups.

Enabling it within the broken script is too late because PHP can't even interpret/run the first line. A quick workaround is crafting a wrapper script, say test.php:

<?php
   error_reporting(E_ALL);
   ini_set("display_errors", 1);
   include("./broken-script.php");

Then invoke the failing code by accessing this wrapper script.

It also helps to enable PHP's error_log and look into your webserver's error.log when a script crashes with HTTP 500 responses.

Scala: list.flatten: no implicit argument matching parameter type (Any) => Iterable[Any] was found

If you are expecting to be able to "flatten" List(1, 2, List(3,4), 5) into List(1, 2, 3, 4, 5), then you need something like:

implicit def any2iterable[A](a: A) : Iterable[A] = Some(a)

Along with:

val list: List[Iterable[Int]] = List(1, 2, List(3,4), 5) // providing type of list 
                                                         // causes implicit 
                                                         // conversion to be invoked

println(list.flatten( itr => itr )) // List(1, 2, 3, 4, 5)

EDIT: the following was in my original answer until the OP clarified his question in a comment on Mitch's answer

What are you expecting to happen when you flatten a List[Int]? Are you expecting the function to sum the Ints in the List? If so, you should be looking at the new aggregation functions in 2.8.x:

val list = List(1, 2, 3)
println( list.sum ) //6