Understanding Context Free Grammar using a simple C code

c compiler grammar parsing programming-languages

I'm trying to understand the difference between terminal and non-terminal values in a real language. I wasn't able to find enough examples of CFGs for real languages on the internet; most examples are abstract. Assume we have the following:

int main(){
   int a = 5;
   return a + 6;
}

Are the following statements true?

Terminals: int, (, ), {, }, 5, return, +, 6, ;

Non-terminals: main, a

Best Answer

Terminal and non-terminal symbols are an aspect of a grammar, not of a language. In a BNF grammar (which describes a context-free language), the non-terminals are the symbols that appear “on the left” of a rule.

For example, a simple C-ish grammar might be (using BNF + regex-like quantifiers):

<program> ::= <function>*
<function> ::= <type> <identifier> '(' ')' <block>
<block> ::= '{' <statement>* '}'
<statement> ::= <statement_define>
<statement> ::= <statement_return>
<statement_return> ::= 'return' <expr> ';'
<statement_define> ::= <type> <identifier> '=' <expr> ';'
<expr> ::= <expr> '+' <term>
<expr> ::= <term>
<term> ::= <identifier>
<term> ::= <literal>

As defined, the symbols program, function, block, statement, statement_return, statement_define, expr, and term are non-terminals. They can be substituted by their right-hand side. In contrast, the symbols type, identifier, (, ), {, }, return, =, ;, +, and literal are terminals because they are not defined in the grammar. They form the “alphabet” that this grammar operates on.
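To make the classification rule concrete, here is a minimal Python sketch. The `RULES` list and the `classify` helper are my own illustrative names, and the `*` quantifiers from the grammar are dropped because only the symbol names matter here. Anything that appears on a left-hand side is a non-terminal; anything that only ever appears on a right-hand side is a terminal.

```python
# Each rule is (left-hand side, right-hand side symbols).
# Hypothetical encoding of the grammar above, with the '*' quantifiers dropped.
RULES = [
    ("program", ["function"]),
    ("function", ["type", "identifier", "(", ")", "block"]),
    ("block", ["{", "statement", "}"]),
    ("statement", ["statement_define"]),
    ("statement", ["statement_return"]),
    ("statement_return", ["return", "expr", ";"]),
    ("statement_define", ["type", "identifier", "=", "expr", ";"]),
    ("expr", ["expr", "+", "term"]),
    ("expr", ["term"]),
    ("term", ["identifier"]),
    ("term", ["literal"]),
]


def classify(rules):
    """Split the grammar's symbols into (non_terminals, terminals)."""
    non_terminals = {lhs for lhs, _ in rules}
    used = {symbol for _, rhs in rules for symbol in rhs}
    # A terminal is any symbol that is used but never defined.
    return non_terminals, used - non_terminals


non_terminals, terminals = classify(RULES)
print(sorted(non_terminals))
print(sorted(terminals))
```

Note that main and a from the question show up in neither set: they are just the text of two identifier tokens.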

In practice the above grammar is incomplete because type, identifier, and literal have not been defined, so a separate parser (called a tokenizer, scanner, or lexer) would be responsible for recognizing them.
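As a rough illustration of that tokenizer stage, here is a small regex-based sketch in Python. The token classes and patterns in `TOKEN_SPEC` are assumptions chosen to cover only the snippet from the question (a real C lexer recognizes far more), and `tokenize` is a hypothetical helper name.

```python
import re

# Hypothetical token classes, just enough for the example program.
# Order matters: keywords must be tried before the general identifier pattern.
TOKEN_SPEC = [
    ("type", r"\bint\b"),
    ("keyword", r"\breturn\b"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("literal", r"\d+"),
    ("punct", r"[(){}=+;]"),
    ("skip", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (kind, text) pairs for the given source text."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "skip":
            continue  # whitespace never reaches the parser
        if kind in ("punct", "keyword"):
            kind = text  # punctuation and keywords are their own terminal symbol
        tokens.append((kind, text))
    return tokens


tokens = tokenize("int main(){\n   int a = 5;\n   return a + 6;\n}")
for kind, text in tokens:
    print(kind, text)
```

The kinds this produces (type, identifier, literal, and the literal punctuation) are exactly the terminals of the grammar.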

A grammar can be used to describe the structure of a given input. For example, a tokenizer might turn your source code into a token stream

type:int, identifier:main, (, ), {,
  type:int, identifier:a, =, literal:5, ;,
  return, identifier:a, +, literal:6, ;,
}

which the parser would turn into a parse tree based on the grammar. Here:

program
+ function
  + type: int
  + identifier: main
  + '('
  + ')'
  + block
    + '{'
    + statement
      + statement_define
        + type: int
        + identifier: a
        + '='
        + expr
          + term
            + literal: 5
        + ';'
    + statement
      + statement_return
        + 'return'
        + expr
          + expr
            + term
              + identifier: a
          + '+'
          + term
            + literal: 6
        + ';'
    + '}'
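For the curious, a parser for this grammar can be sketched as a small recursive-descent parser in Python. This is one possible approach, not the only one; the `Parser` class, the `TOKENS` list, and the tuple-based tree shape are all my own assumptions. One wrinkle: the rule `<expr> ::= <expr> '+' <term>` is left-recursive, which a naive recursive-descent parser cannot follow directly, so the sketch uses a loop that builds the same left-leaning tree.

```python
# Token stream for the example program, as (kind, text) pairs (hypothetical shape).
TOKENS = [
    ("type", "int"), ("identifier", "main"), ("(", "("), (")", ")"), ("{", "{"),
    ("type", "int"), ("identifier", "a"), ("=", "="), ("literal", "5"), (";", ";"),
    ("return", "return"), ("identifier", "a"), ("+", "+"), ("literal", "6"), (";", ";"),
    ("}", "}"),
]


class Parser:
    """One method per non-terminal; nodes are (name, [children]), leaves are tokens."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind!r}, found {self.peek()!r}")
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def program(self):
        functions = []
        while self.peek() is not None:
            functions.append(self.function())
        return ("program", functions)

    def function(self):
        return ("function", [self.expect("type"), self.expect("identifier"),
                             self.expect("("), self.expect(")"), self.block()])

    def block(self):
        children = [self.expect("{")]
        while self.peek() != "}":
            children.append(self.statement())
        children.append(self.expect("}"))
        return ("block", children)

    def statement(self):
        if self.peek() == "return":
            inner = ("statement_return",
                     [self.expect("return"), self.expr(), self.expect(";")])
        else:
            inner = ("statement_define",
                     [self.expect("type"), self.expect("identifier"),
                      self.expect("="), self.expr(), self.expect(";")])
        return ("statement", [inner])

    def expr(self):
        # Loop instead of left recursion; each '+' wraps the tree built so far.
        node = ("expr", [self.term()])
        while self.peek() == "+":
            node = ("expr", [node, self.expect("+"), self.term()])
        return node

    def term(self):
        kind = "identifier" if self.peek() == "identifier" else "literal"
        return ("term", [self.expect(kind)])


tree = Parser(TOKENS).program()
print(tree)
```

Following the grammar literally, `a + 6` comes out as an expr whose first child is itself an expr wrapping the term for a.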

We can also use the grammar to generate source code, by substituting non-terminal symbols. For example:

<program>

<function>

<type> <identifier> ( ) <block>

<type> <identifier> ( ) { <statement>* }

<type> <identifier> ( ) { <statement_return> }

<type> <identifier> ( ) { return <expr>; }

<type> <identifier> ( ) { return <term>; }

<type> <identifier> ( ) { return <literal>; }

At that point no non-terminal symbols (as defined by the grammar) remain.
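The same substitution process can be sketched mechanically in Python. The `GRAMMAR` dictionary and the `derive` helper are illustrative names of my own, and for brevity the sketch gives program and block exactly one child instead of handling the `*` quantifier. Each step replaces the leftmost non-terminal with one of its right-hand sides, chosen by index.

```python
# Non-terminal -> list of alternative right-hand sides (hypothetical encoding,
# with the '*' quantifiers simplified to exactly one occurrence).
GRAMMAR = {
    "program": [["function"]],
    "function": [["type", "identifier", "(", ")", "block"]],
    "block": [["{", "statement", "}"]],
    "statement": [["statement_define"], ["statement_return"]],
    "statement_return": [["return", "expr", ";"]],
    "statement_define": [["type", "identifier", "=", "expr", ";"]],
    "expr": [["expr", "+", "term"], ["term"]],
    "term": [["identifier"], ["literal"]],
}


def derive(start, choices):
    """Rewrite the leftmost non-terminal once per choice; return every form."""
    form = [start]
    steps = [form]
    for choice in choices:
        # Position of the leftmost symbol that still has a rule.
        index = next(i for i, symbol in enumerate(form) if symbol in GRAMMAR)
        form = form[:index] + GRAMMAR[form[index]][choice] + form[index + 1:]
        steps.append(form)
    return steps


# Choices: program, function, block, statement -> return, return rule,
# expr -> term, term -> literal.
steps = derive("program", [0, 0, 0, 1, 0, 1, 1])
for form in steps:
    print(" ".join(form))
```

The last line printed contains only symbols that have no rule in `GRAMMAR`, which is exactly what makes them terminals.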

Related Solutions

Are all languages basically the same

The basics of most procedural languages are pretty much the same.

They offer:

Scalar data types: usually boolean, integers, floats and characters
Compound data types: arrays (strings are special case) and structures
Basic code constructs: arithmetic over scalars, array/structure access, assignments
Simple control structures: if-then, if-then-else, while, for loops
Packages of code blocks: functions, procedures with parameters
Scopes: areas in which identifiers have specific meanings

If you understand this, you have a good grasp of 90% of the languages on the planet. What makes these languages slightly more difficult to understand is the incredible variety of odd syntax that people use to say the same basic things. Some use terse notation involving odd punctuation (APL being an extreme). Some use lots of keywords (COBOL being an excellent representative). That doesn't matter much. What does matter is whether the language is complete enough by itself to do complex tasks without causing you to tear your hair out. (Try coding some serious string hacking in Windows DOS shell script: it is Turing-capable but really bad at everything.)

More interesting procedural languages offer

Nested or lexical scopes, namespaces
Pointers allowing one entity to refer to another, with dynamic storage allocation
Packaging of related code: packages, objects with methods, traits
More sophisticated control: recursion, continuations, closures
Specialized operators: string and array operations, math functions

The libraries that come with a language, or that are easily accessible from its development tools, are technically a property of the ecosystem rather than of the language itself. Having a wide range of library facilities simplifies and speeds up writing applications, simply because one doesn't have to reinvent what the libraries do. While Java and C# are widely thought to be good languages in and of themselves, what makes them truly useful are the huge libraries that come with them and the easily obtainable extension libraries.

The languages which are harder to understand are the non-procedural ones:

Purely functional languages, with no assignments or side effects
Logic languages, such as Prolog, in which symbolic computation and unification occur
Pattern matching languages, in which you specify shapes that are matched to the problem, and often actions are triggered by a match
Constraint languages, which let you specify relations and automatically solve equations
Hardware description languages, in which everything executes in parallel
Domain-specific languages, such as SQL, Colored Petri Nets, etc.

There are two major representational styles for languages:

Text-based, in which identifiers name entities and information flows are encoded implicitly in formulas that use the identifiers to name the entities (Java, APL, ...)
Graphical, in which entities are drawn as nodes, and relations between entities are drawn as explicit arcs between those nodes (UML, Simulink, LabView)

The graphical languages often allow textual sublanguages as annotations in nodes and on arcs. Odder graphical languages recursively allow graphs (with text :) in nodes and on arcs. Really odd graphical languages allow annotation graphs to point to graphs being annotated.

Most of these languages are based on a very small number of models of computation:

The lambda calculus (basis for Lisp and all functional languages)
Post systems (or string/tree/graph rewriting techniques)
Turing machines (state modification and selection of new memory cells)

Given the focus by most of industry on procedural languages and complex control structures, you are well served if you learn one of the more interesting languages in this category well, especially if it includes some type of object-orientation.

I highly recommend learning Scheme, in particular from a really wonderful book: Structure and Interpretation of Computer Programs. This describes all these basic concepts. If you know this stuff, other languages will seem pretty straightforward except for goofy syntax.

Language Design – Real-World Use Cases for Chomsky Type-I Grammar

Good question. Although, as mentioned in the comments, very many programming languages are context-sensitive, that context-sensitivity is often resolved not in the parsing phase but in later phases: a superset of the language is parsed using a context-free grammar, and some of the resulting parse trees are filtered out later.

However, that does not mean that those languages aren't context-sensitive, so here are some examples:

Haskell allows you to define functions that are used as operators, and also to define the precedence and associativity of those operators. In other words, you can't build the correct parse tree for an operator expression like:

a @@ b @@ c ## d ## e

unless you've already parsed the precedence/associativity declarations for @@ and ##:

infixr 8 @@
infixr 6 ##
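To see how much the tree depends on those declarations, here is a small precedence-climbing sketch in Python. It is not how GHC does it; `parse` and the fixity-table format are my own assumptions. The same token sequence is parsed twice: once with the declarations above, and once with a made-up alternative table.

```python
def parse(tokens, fixity):
    """Parse alternating operands and operators into nested (op, left, right) tuples.

    fixity maps operator -> (precedence, associativity), where associativity
    is "l" or "r". This table format is an assumption made for the sketch.
    """
    pos = 0

    def climb(min_prec):
        nonlocal pos
        left = tokens[pos]
        pos += 1
        while pos < len(tokens):
            op = tokens[pos]
            prec, assoc = fixity[op]
            if prec < min_prec:
                break
            pos += 1
            # Left-associative operators must not absorb an equal-precedence
            # operator on their right; right-associative ones may.
            right = climb(prec + 1 if assoc == "l" else prec)
            left = (op, left, right)
        return left

    return climb(0)


tokens = "a @@ b @@ c ## d ## e".split()

# The declarations from the answer: infixr 8 @@ and infixr 6 ##.
declared = parse(tokens, {"@@": (8, "r"), "##": (6, "r")})

# A hypothetical alternative: infixl 6 @@ and infixl 8 ##.
alternative = parse(tokens, {"@@": (6, "l"), "##": (8, "l")})

print(declared)
print(alternative)
```

The two results group the operands differently, even though the token sequence is identical, so the shape of the tree cannot be read off the expression alone.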

A second example is Bencode, a data language that prefixes content with its length:

<length>:<contents>

The issue with this format is that it's pretty much impossible to parse without something context-sensitive, because the only way to figure out the "field" sizes is by ... parsing the string.
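A minimal Python sketch shows why: the reader has to interpret the length it has just read before it knows where the contents end. This only handles Bencode's length-prefixed byte strings, not its integers, lists, or dictionaries, and `read_strings` is a hypothetical helper name.

```python
def read_strings(data):
    """Decode consecutive <length>:<contents> byte strings from data."""
    result = []
    pos = 0
    while pos < len(data):
        colon = data.index(b":", pos)  # raises ValueError if no ':' follows
        length = int(data[pos:colon])  # the number just read decides what comes next
        start = colon + 1
        end = start + length
        if end > len(data):
            raise ValueError("declared length runs past the end of the input")
        result.append(data[start:end])
        pos = end
    return result


# The ':' inside "a:b" is content, not a separator; only the length tells us so.
decoded = read_strings(b"4:spam3:a:b")
print(decoded)
```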

A third example is XML, assuming arbitrary tag names are allowed: every opening tag must have a closing tag with the same name:

<hi>
 <bye>
 the closing tag has to match bye
 </bye>
</hi> <!-- has to match "hi" -->
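The matching requirement is usually checked with a stack of open tag names, as in this Python sketch. It is deliberately simplified: the `TAG_RE` pattern and the `tags_balanced` helper are my own assumptions, and attributes, self-closing tags, and everything else real XML allows are ignored.

```python
import re

# Matches <name> and </name> only; comments such as <!-- ... --> do not match.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w-]*)>")


def tags_balanced(text):
    """Return True if every opening tag is closed by a tag of the same name."""
    stack = []
    for closing, name in TAG_RE.findall(text):
        if not closing:
            stack.append(name)  # remember which name must be closed next
        elif not stack or stack.pop() != name:
            return False  # closing tag without a matching open tag
    return not stack  # anything left on the stack was never closed


good = '<hi>\n <bye>\n text\n </bye>\n</hi> <!-- has to match "hi" -->'
bad = "<hi><bye>text</hi></bye>"
print(tags_balanced(good), tags_balanced(bad))
```

The stack has to store the names themselves, because the set of possible tag names is unbounded.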
