Understanding Context Free Grammar using a simple C code

ccompilergrammarparsingprogramming-languages

I'm trying to understand the difference between terminal and non-terminal values in a real language. I wasn't able to find enough examples on real language CFGs on the internet, most examples are abstract. Assume we have the following

int main(){
   int a = 5;
   return a + 6;
}

Are the following statements true?

Terminals: int, (, ), {, }, 5, return, +, 6, ;

Non-terminals: main, a

Best Answer

Terminal and non-terminal symbols are an aspect of a grammar, not of a language. In a BNF grammar (which describes a context-free language), the nonterminals are the symbols “on the left”.

For example, a simple C-ish grammar might be (using BNF + regex-like quantifiers):

<program> ::= <function>*
<function> ::= <type> <identifier> '(' ')' <block>
<block> ::= '{' <statement>* '}'
<statement> ::= <statement_define>
<statement> ::= <statement_return>
<statement_return> ::= 'return' <expr> ';'
<statement_define> ::= <type> <identifier> '=' <expr> ';'
<expr> ::= <expr> '+' <term>
<expr> ::= <term>
<term> ::= <identifier>
<term> ::= <literal>

As defined, the symbols program, function, block, statement, statement_return, statement_define, expr, and term are non-terminals. They can be substituted by their right-hand side. In contrast, the symbols type, identifier, (, ), {, }, return, ;, +, and literal are terminals because they are not defined in the grammar. They form the “alphabet” that this grammar operates on.

In practice the above grammar is incomplete because some symbols have not been defined, so a separate parser (called a tokenizer, scanner, or lexer) would be responsible for recognizing them.

A grammar can be used to describe the structure of a given input. For example, a tokenizer might turn your source code into a token stream

type:int, identifier:main, (, ), {,
  type:int, identifier:a, =, literal:5, ;,
  return, identifier:a, +, literal:6, ;,
}

which the parser would turn into an abstract syntax tree based on the grammar. Here:

program
+ function
  + type: int
  + identifier: main
  + '('
  + ')'
  + block
    + '{'
    + statement
      + statement_define
        + type: int
        + identifier: a
        + '='
        + expr
          + term
            + literal: 5
        + ';'
      + statement_return
        + 'return'
        + expr
          + term
            + identifier: a
          + '+'
          + term
            + literal: 6
        + ';'
    + '}'

We can also use the grammar to generate source code, by substituting non-terminal symbols. For example:

<program>

<function>

<type> <identifier> ( ) <block>

<type> <identifier> ( ) { <statement>* }

<type> <identifier> ( ) { <statement_return> }

<type> <identifier> ( ) { return <expr>; }

<type> <identifier> ( ) { return <term>; }

<type> <identifier> ( ) { return <literal>; }

At that point no non-terminal symbols (as defined by the grammar) remain.