CMSC 28000 — Lecture 13

Grammars

At last, we're ready to step beyond the comfortable world of regular languages and finite automata. Despite having sold this course to you as a course about computation and machines, we're going to be taking a different tack at first and saving the machine model for later. Instead, we'll be looking at grammars.

Like finite automata, grammars were developed in a different context than computation. As you might guess, formal grammars were devised as a way to formally specify the grammatical structure of natural languages, and were introduced by Noam Chomsky in 1956 as phrase-structure grammars. But while the modern treatment of formal grammars was developed relatively recently, the notion of a formal grammar can be traced back to the Indian subcontinent, as far back as Pāṇini's grammar of Sanskrit in the 4th century BCE.

Here's a simple grammar for English taken from Jurafsky and Martin (2020): \begin{align*} S & \to NP \enspace VP \\ NP & \to Pronoun \mid ProperNoun \mid Det \enspace Nominal \\ Nominal & \to Nominal \enspace Noun \mid Noun \\ VP & \to Verb \mid Verb \enspace NP \mid Verb \enspace NP \enspace PP \mid Verb \enspace PP \\ PP &\to Preposition \enspace NP \end{align*}
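To see the grammar in action, consider the sentence "I prefer a morning flight", a running example in Jurafsky and Martin. Assuming lexical productions such as $Pronoun \to \textit{I}$, $Verb \to \textit{prefer}$, $Det \to \textit{a}$, and $Noun \to \textit{morning} \mid \textit{flight}$, which the fragment above omits, the grammar generates the sentence as follows, where each $\Rightarrow$ replaces one variable using a production: \begin{align*} S &\Rightarrow NP \enspace VP \Rightarrow Pronoun \enspace VP \Rightarrow \textit{I} \enspace VP \Rightarrow \textit{I} \enspace Verb \enspace NP \Rightarrow \textit{I prefer} \enspace NP \\ &\Rightarrow \textit{I prefer} \enspace Det \enspace Nominal \Rightarrow \textit{I prefer a} \enspace Nominal \Rightarrow \textit{I prefer a} \enspace Nominal \enspace Noun \\ &\Rightarrow \textit{I prefer a} \enspace Noun \enspace Noun \Rightarrow \textit{I prefer a morning} \enspace Noun \Rightarrow \textit{I prefer a morning flight} \end{align*}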

However, despite their origins as a tool to define the grammatical structure of natural language, where they're most commonly found now is in defining the grammatical structure of programming languages. For example, the Python Language Reference describes the language using snippets that look like


    for_stmt ::=  "for" target_list "in" starred_list ":" suite
                  ["else" ":" suite]
    

This says that a for_stmt is a statement that begins with the literal string for, followed by a string that follows the rules of a target_list, followed by the literal string in, followed by a string that follows the rules of starred_list, followed by the literal string :, followed by a block of code defined by suite. The square brackets mark the else clause as optional: a for_stmt may end with the literal string else, another :, and another suite. Here are the rules for suite and stmt_list:


    suite         ::=  stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT
    stmt_list     ::=  simple_stmt (";" simple_stmt)* [";"]
    

In Section 10 of the reference, we can find the full grammar specification: the formally specified grammar from which the CPython parser is built.
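To make the for_stmt production concrete, here is a small, self-contained snippet (the variable names are just for illustration) that matches it, including the optional else clause:


    haystack = [2, 3, 5, 7, 11]
    target = 9
    # The for/in/":"/suite part is required by for_stmt; the trailing
    # else clause is the optional ["else" ":" suite] part, which runs
    # only if the loop finishes without hitting break.
    for needle in haystack:
        if needle == target:
            print("found it")
            break
    else:
        print("not found")
    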

I mentioned before that in the process of lexical analysis, regular languages are only able to tell us what kinds of tokens appear, but not whether they're in the correct order. The phase which enforces the correct order of tokens, or checks whether a string of tokens is "grammatically correct", is called parsing. As you might expect, parsers parse strings and determine their grammatical validity, and as you might guess, parsers are built on context-free grammars.

Now is a good time to have a look at the Chomsky hierarchy. The hierarchy is due to Chomsky, and categorizes languages by the complexity of the grammars that generate them. An important point about the hierarchy is that each class is contained in the next, more powerful class. This makes sense: every language that can be generated by a right linear grammar, for example, can also be generated by a context-free grammar, since every right linear grammar already is one.

Grammar                        Language class           Machine model
Type-0 (Unrestricted)          Recursively enumerable   Turing machines
Type-1 (Context-sensitive)     Context-sensitive        Linear-bounded automata
Type-2 (Context-free)          Context-free             Pushdown automata
Type-3 (Right/left linear)     Regular                  Finite automata

Interestingly, the hierarchy, although intended for grammars, happens to map very nicely onto automata models.

Context-free grammars

A context-free grammar (CFG) is a 4-tuple $G = (V, \Sigma, P, S)$, where

- $V$ is a finite set of variables (or nonterminals),
- $\Sigma$ is a finite alphabet of terminal symbols, disjoint from $V$,
- $P$ is a finite set of productions of the form $A \to \alpha$, where $A \in V$ and $\alpha \in (V \cup \Sigma)^*$, and
- $S \in V$ is the start variable.

Usually, it's enough when specifying the grammar to give the set of productions, since any variables and terminals that are used will appear as part of a production.

Let's take a look at an example.

Let $G = (V,\Sigma,P,S)$, where $V = \{S\}$, $\Sigma = \{a,b\}$, $S$ is the start variable, and $P$ contains the following productions: \begin{align*} S &\to aSb \\ S &\to \varepsilon \end{align*}

Just like we think of states in finite automata as describing some property of the strings that reach them, we should think of variables as describing some property of the strings that get generated by them.

Our grammar in this case is quite simple: it consists of a single variable $S$, so the language of the grammar $G$ consists of strings that are generated via the variable $S$. We also notice that the production rule $S \to aSb$ refers to itself—that is, it's recursive. A plain reading of this is that strings generated by $S$ are strings generated by $S$ but with an $a$ on the left and a $b$ on the right.

As we've seen plenty of times, such a definition can go on forever, so we have another rule: $S \to \varepsilon$. So we have two cases, one recursive and one that's not, and we can arrive at the conclusion that strings generated by $S$ are either empty (our base case) or of the form $awb$, where $w$ is itself generated by $S$. It turns out that this grammar generates the language $\{a^k b^k \mid k \geq 0\}$.
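For instance, here is a derivation of $aabb$, obtained by applying $S \to aSb$ twice and then finishing with $S \to \varepsilon$: \[ S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aabb. \]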

One might ask why these are called context-free languages and grammars. The name has to do with how the grammar works: for any production $A \to \alpha$, and any string $\beta A \gamma$, we can always apply the production $A \to \alpha$. The context of $A$, whatever is around $A$, has no bearing on what rule gets applied, only that the variable is present. Consider, for example, the following definition for context-sensitive grammars.

A grammar $G = (V, \Sigma, P, S)$ is said to be context-sensitive if every production in $P$ is of the form $\alpha B \gamma \to \alpha \beta \gamma$, for $\alpha, \gamma \in (V \cup \Sigma)^*$, $\beta \in (V \cup \Sigma)^+$, and $B \in V$.

A consequence of this definition is that a rule of the form $\alpha B \gamma \to \alpha \beta \gamma$ means that replacing $B$ with $\beta$ is only allowed within the context $(\alpha,\gamma)$.
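As a small illustration, consider a hypothetical production $aBc \to abbc$, so that $\alpha = a$, $\gamma = c$, and $\beta = bb$. In the string $aaBcc$, this rule applies, since $B$ occurs between an $a$ and a $c$, and rewriting yields $aabbcc$. But the rule tells us nothing about a string like $cBa$, where $B$ appears in the wrong context.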

Let's go through a few more examples.

Recall that palindromes are strings $w$ such that $w = w^R$. We claim that the following grammar generates the language of palindromes over the alphabet $\{a,b\}$. \[S \to aSa \mid bSb \mid a \mid b \mid \varepsilon\] To see why that is, we can view palindromes recursively. A palindrome is either

- the empty string $\varepsilon$,
- a single symbol, $a$ or $b$, or
- a string $cwc$, where $c \in \{a,b\}$ and $w$ is itself a palindrome.

These cases correspond exactly to the productions of the grammar.

Does this definition work? Consider a string $cwc$, where $w$ is a palindrome—so $w = w^R$. If we apply the definition of reversal, we have \[(cwc)^R = (wc)^R c = c^R w^R c = cw^Rc = cwc.\]

Here's a more practical example, as in something that is probably of interest to you if you're a CS major. We claim that the following grammar generates strings of balanced brackets \[S \to \mathtt [ S \mathtt ] \mid SS \mid \varepsilon\] Again, it helps to think recursively based on the variable. There are three cases:

- the empty string $\varepsilon$ is balanced,
- if $w$ is balanced, then so is $\mathtt[ w \mathtt]$, and
- if $w_1$ and $w_2$ are balanced, then so is the concatenation $w_1 w_2$.

These cases correspond exactly to the productions $S \to \varepsilon$, $S \to \mathtt[ S \mathtt]$, and $S \to SS$.
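To connect this back to parsing, here is a small Python sketch (not part of the lecture material) of a recursive-descent recognizer for this language. One caveat: the grammar above is ambiguous, which makes it awkward for recursive descent, so the sketch instead uses the equivalent grammar $S \to \mathtt[ S \mathtt] S \mid \varepsilon$, which generates the same language with one symbol of lookahead:


    # Recursive-descent recognizer for balanced brackets, based on the
    # equivalent grammar  S -> "[" S "]" S | epsilon.
    def parse_S(s: str, i: int) -> int:
        """Consume a string generated by S starting at position i and
        return the position just past it; raise if a ']' is missing."""
        if i < len(s) and s[i] == "[":
            i = parse_S(s, i + 1)        # S -> "[" S ...
            if i >= len(s) or s[i] != "]":
                raise ValueError("expected ']'")
            return parse_S(s, i + 1)     # ... "]" S
        return i                         # S -> epsilon

    def balanced(s: str) -> bool:
        try:
            return parse_S(s, 0) == len(s)
        except ValueError:
            return False

    assert balanced("") and balanced("[[]][]")
    assert not balanced("][") and not balanced("[")
    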

One thing you might have noticed with some of these examples is that they implicitly use the regular operations:

- a production with multiple alternatives, like $S \to aSa \mid bSb$, takes the union of the languages generated by each alternative,
- placing symbols side by side on the right-hand side of a production, as in $S \to aSb$, is concatenation, and
- a recursive production together with a base case, like $S \to SS \mid \varepsilon$, generates repetition, just like the Kleene star.

We can use these observations to conclude that the context-free languages are closed under the regular operations.

The class of context-free languages is closed under the regular operations.

Let $G_1 = (V_1,\Sigma,P_1,S_1)$ and $G_2 = (V_2, \Sigma,P_2,S_2)$ be context-free grammars, renaming variables if necessary so that $V_1 \cap V_2 = \emptyset$. We can construct a new grammar for each regular operation by introducing a new start variable $S \notin V_1 \cup V_2$:

- Union: $(V_1 \cup V_2 \cup \{S\}, \Sigma, P_1 \cup P_2 \cup \{S \to S_1 \mid S_2\}, S)$ generates $L(G_1) \cup L(G_2)$.
- Concatenation: $(V_1 \cup V_2 \cup \{S\}, \Sigma, P_1 \cup P_2 \cup \{S \to S_1 S_2\}, S)$ generates $L(G_1) L(G_2)$.
- Kleene star: $(V_1 \cup \{S\}, \Sigma, P_1 \cup \{S \to SS_1 \mid \varepsilon\}, S)$ generates $L(G_1)^*$.
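For example, taking $G_1$ to be our earlier grammar for $\{a^k b^k \mid k \geq 0\}$ and $G_2$ to be the palindrome grammar, with start variables renamed to $S_1$ and $S_2$, the union construction yields \begin{align*} S &\to S_1 \mid S_2 \\ S_1 &\to aS_1b \mid \varepsilon \\ S_2 &\to aS_2a \mid bS_2b \mid a \mid b \mid \varepsilon \end{align*}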

To prove this formally, we would need to go through an inductive argument; this makes a nice exercise.