Regular vs Context Free Grammars

automatacontext-free-grammarregular-language

I'm studying for my computing languages test, and there's one idea I'm having problems wrapping my head around.

I understood that regular grammars are simpler and cannot contain ambiguity, but can't do a lot of tasks that are required for programming languages. I also understood that context-free grammars allow ambiguity, but allow for some things necessary for programming languages (like palindromes).

What I'm having trouble with is understanding how I can derive all of the above by knowing that regular grammar nonterminals can map to a terminal or a nonterminal followed by a terminal or that a context-free nonterminal maps to any combination of terminals and nonterminals.

Can someone help me put all of this together?

Best Answer

Regular grammar is either right or left linear, whereas context free grammar is basically any combination of terminals and non-terminals. Hence you can see that regular grammar is a subset of context-free grammar.

So for a palindrome for instance, is of the form,

S->ABA
A->something
B->something

You can clearly see that palindromes cannot be expressed in regular grammar since it needs to be either right or left linear and as such cannot have a non-terminal on both side.

Since regular grammars are non-ambiguous, there is only one production rule for a given non-terminal, whereas there can be more than one in the case of a context-free grammar.

Related Solutions

What programming languages are context-free

What programming languages are context-free? [...]

My gut tells me that functional languages might be context-free [...]

The short version: There are hardly any real-world programming languages that are context-free in any meaning of the word. Whether a language is context-free or not has nothing to do with it being functional. It is simply a matter of how complex the language rules and features are to parse.

Here's a CFG for the imperative language Brainfuck:

Program → Instr Program | ε
Instr → '+' | '-' | '>' | '<' | ',' | '.' | '[' Program ']'

And here's a CFG for the functional SKI combinator calculus:

Program → E
E → 'S' E E E
E → 'K' E E
E → 'I'
E → '(' E ')'

These CFGs recognize all valid programs of the two languages because they're so simple.

The longer version: Usually, context-free grammars (CFGs) are only used to roughly specify the syntax of a language. One must distinguish between syntactically correct programs and programs that compile/evaluate correctly. Most commonly, compilers split language analysis into syntax analysis that builds and verifies the general structure of a piece of code, and semantic analysis that verifies the meaning of the program.

If by "context-free language" you mean "... for which all programs compile", then the answer is: hardly any. Languages that fit this bill hardly have any rules or complicated features, like the existence of variables, whitespace-sensitivity, a type system, or any other context: Information defined in one place and relied upon in another.

If, on the other hand, "context-free language" only means "... for which all programs pass syntax analysis", the answer is a matter of how complex the syntax alone is. There are many syntactic features that are hard or impossible to describe with a CFG alone. Some of these are overcome by adding additional state to parsers for keeping track of counters, lookup tables, and so on.

Examples of syntactic features that are not possible to express with a CFG:

Indentation- and whitespace-sensitive languages like Python and Haskell. Keeping track of arbitrarily nested indentation levels is essentially context-sensitive and requires separate counters for the indentation level; both how many spaces that are used for each level and how many levels there are.

Allowing only a fixed level of indentation using a fixed amount of spaces would work by duplicating the grammar for each level of indentation, but in practice this is inconvenient.
The C Typedef Parsing Problem says that C programs are ambiguous during lexical analysis because it cannot know from the grammar alone if something is a regular identifier or a typedef alias for an existing type.

The example is:
```
  typedef int my_int;
  my_int x;
```
At the semicolon, the type environment needs to be updated with an entry for my_int. But if the lexer has already looked ahead to my_int, it will have lexed it as an identifier rather than a type name.

In context-free grammar terms, the X → ... rule that would trigger on my_int is ambiguous: It could be either one that produces an identifier, or one that produces a typedef'ed type; knowing which one relies on a lookup table (context) beyond the grammar itself.
Macro- and template-based languages like Lisp, C++, Template Haskell, Nim, and so on. Since the syntax changes as it is being parsed, one solution is to make the parser into a self-modifying program. See also Is C++ context-free or context-sensitive?
Often, operator precedence and associativity are not expressed directly in CFGs even though it is possible. For example, a CFG for a small expression grammar where ^ binds tighter than ×, and × binds tighter than +, might look like this:
```
  E → E ^ E
  E → E × E
  E → E + E
  E → (E)
  E → num
```
This CFG is ambiguous, however, and is often accompanied by a precedence / associativity table saying e.g. that ^ binds tightest, × binds tighter than +, that ^ is right-associative, and that × and + are left-associative.

Precedence and associativity can be encoded into a CFG in a mechanical way such that it is unambiguous and only produces syntax trees where the operators behave correctly. An example of this for the grammar above:
```
  E₀ → EA E₁
  EA → E₁ + EA
  EA → ε
  E₁ → EM E₂
  EM → E₂ × EM
  EM → ε
  E₂ → E₃ EP
  EP → ^ E₃ EP
  E₃ → num
  E₃ → (E₀)
```
But ambiguous CFGs + precedence / associativity tables are common because they're more readable and because various types of LR parser generator libraries can produce more efficient parsers by eliminating shift/reduce conflicts instead of dealing with an unambiguous, transformed grammar of a larger size.

In theory, all finite sets of strings are regular languages, and so all legal programs of bounded size are regular. Since regular languages are a subset of context-free languages, all programs of bounded size are context-free. The argument continues,

While it can be argued that it would be an acceptable limitation for a language to allow only programs of less than a million lines, it is not practical to describe a programming language as a regular language: The description would be far too large.
— Torben Morgensen's Basics of Compiler Design, ch. 2.10.2

The same goes for CFGs. To address your sub-question a little differently,

Which programming languages are defined by a context-free grammar?

Most real-world programming languages are defined by their implementations, and most parsers for real-world programming languages are either hand-written or uses a parser generator that extends context-free parsing. It is unfortunately not that common to find an exact CFG for your favourite language. When you do, it's usually in Backus-Naur form (BNF), or a parser specification that most likely isn't purely context-free.

Examples of grammar specifications from the wild:

Context-free grammars versus context-sensitive grammars

An important detail here is that grammars do not accept strings; they generate strings. Grammars are descriptions of languages that provide a means for generating all possible strings contained in the language. In order to tell if a particular string is contained in the language, you would use a recognizer, some sort of automaton that processes a given string and says "yes" or "no."

A context-free grammar (CFG) is a grammar where (as you noted) each production has the form A → w, where A is a nonterminal and w is a string of terminals and nonterminals. Informally, a CFG is a grammar where any nonterminal can be expanded out to any of its productions at any point. The language of a grammar is the set of strings of terminals that can be derived from the start symbol.

A context-sensitive grammar (CSG) is a grammar where each production has the form wAx → wyx, where w and x are strings of terminals and nonterminals and y is also a string of terminals. In other words, the productions give rules saying "if you see A in a given context, you may replace A by the string y." It's an unfortunate that these grammars are called "context-sensitive grammars" because it means that "context-free" and "context-sensitive" are not opposites, and it means that there are certain classes of grammars that arguably take a lot of contextual information into account but aren't formally considered to be context-sensitive.

To determine whether a string is contained in a CFG or a CSG, there are many approaches. First, you could build a recognizer for the given grammar. For CFGs, the pushdown automaton (PDA) is a type of automaton that accepts precisely the context-free languages, and there is a simple construction for turning any CFG into a PDA. For the context-sensitive grammars, the automaton you would use is called a linear bounded automaton (LBA).

However, these above approaches, if treated naively, are not very efficient. To determine whether a string is contained in the language of a CFG, there are far more efficient algorithms. For example, many grammars can have LL(k) or LR(k) parsers built for them, which allows you to (in linear time) decide whether a string is contained in the grammar. All grammars can be parsed using the Earley parser, which in O(n³) can determine whether a string of length n is contained in the grammar (interestingly, it can parse any unambiguous CFG in O(n²), and with lookaheads can parse any LR(k) grammar in O(n) time!). If you were purely interested in the question "is string x contained in the language generated by grammar G?", then one of these approaches would be excellent. If you wanted to know how the string x was generated (by finding a parse tree), you can adapt these approaches to also provide this information. However, parsing CSGs is, in general, PSPACE-complete, so there are no known parsing algorithms for them that run in worst-case polynomial time. There are some algorithms that in practice tend to run quickly, though. The authors of Parsing Techniques: A Practical Guide (see below) have put together a fantastic page containing all sorts of parsing algorithms, including one that parses context-sensitive languages.

If you're interested in learning more about parsing, consider checking out the excellent book "Parsing Techniques: A Practical Guide, Second Edition" by Grune and Jacobs, which discusses all sorts of parsing algorithms for determining whether a string is contained in a grammar and, if so, how it is generated by the parsing algorithm.

Best Answer

Related Solutions

What programming languages are context-free

Context-free grammars versus context-sensitive grammars

Related Topic