CMSC 28000 — Lecture 12

Context-Free Languages

At last, we're ready to step beyond the comfortable world of regular languages and finite automata. Despite having sold this course to you as a course about computation and machines, we're going to be taking a different tack at first and saving the machine model for later. Instead, we'll be looking at grammars.

Like finite automata, context-free grammars were developed in a context other than computation. As you might guess, formal grammars were designed as a way to formally specify the grammatical structure of natural language, and were introduced by Noam Chomsky in 1956 as phrase-structure grammars. But while the modern treatment of formal grammars is relatively recent, the notion of a formal grammar can be traced back to the Indian subcontinent, as far back as Pāṇini's grammar of Sanskrit in the 4th century BCE.

Here's a simple grammar for English taken from Jurafsky and Martin (2020): \begin{align*} S & \to NP \enspace VP \\ NP & \to Pronoun \mid ProperNoun \mid Det \enspace Nominal \\ Nominal & \to Nominal \enspace Noun \mid Noun \\ VP & \to Verb \mid Verb \enspace NP \mid Verb \enspace NP \enspace PP \mid Verb \enspace PP \\ PP &\to Preposition \enspace NP \end{align*}
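For instance, here is a sequence of rewrites in this grammar, stopping at the parts of speech (read $\Rightarrow$ as "rewrites to"; we'll define derivations precisely later in this lecture): \begin{align*} S &\Rightarrow NP \enspace VP \\ &\Rightarrow Pronoun \enspace VP \\ &\Rightarrow Pronoun \enspace Verb \enspace NP \\ &\Rightarrow Pronoun \enspace Verb \enspace Det \enspace Nominal \\ &\Rightarrow Pronoun \enspace Verb \enspace Det \enspace Noun \end{align*} Together with lexicon rules that rewrite the parts of speech into English words, say $Pronoun \to \text{I}$, $Verb \to \text{prefer}$, $Det \to \text{a}$, and $Noun \to \text{flight}$ (our choice of words here, not necessarily Jurafsky and Martin's), this produces sentences like "I prefer a flight."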

However, despite their origins as a tool for defining the grammatical structure of natural language, where context-free grammars are most commonly found now is in defining the grammatical structure of programming languages. For example, the Java Language Specification for Java SE 13 gives a complete grammar for Java, and it's in Chapter 19 that we get the good stuff. Here's a snippet, mainly dealing with all of the ways you can write a for statement.
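Since the spec itself isn't reproduced here, what follows is a simplified paraphrase of those productions in the arrow notation we use in this course (the JLS writes them in its own BNF-style notation, and we've collapsed some of its intermediate variables; a trailing $?$ marks an optional item): \begin{align*} ForStatement &\to BasicForStatement \mid EnhancedForStatement \\ BasicForStatement &\to \texttt{for} \; \texttt{(} \; ForInit? \; \texttt{;} \; Expression? \; \texttt{;} \; ForUpdate? \; \texttt{)} \; Statement \\ EnhancedForStatement &\to \texttt{for} \; \texttt{(} \; VariableDeclaration \; \texttt{:} \; Expression \; \texttt{)} \; Statement \end{align*}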

The JLS's syntax is not exactly the same as ours, but the idea is the same.

I mentioned before that in the process of lexical analysis, regular languages are only able to tell us what kinds of tokens appear, but not whether they're in the correct order. The phase which enforces the correct order of tokens, or determines whether a string of tokens is "grammatically correct", is called parsing. As you might expect, parsers parse strings and determine their grammatical validity, and as you might guess, parsers are built on context-free grammars.

Now is a good time to have a look at the Chomsky hierarchy, due to Chomsky, which categorizes languages by the complexity of their grammars. An important point about the hierarchy is that each class is contained in the next, more powerful class. This makes sense: every language that can be generated by a right linear grammar, say, can also be generated by a more powerful grammar.

Grammar                      Language class           Machine model
Type-0 (Unrestricted)        Recursively enumerable   Turing machines
Type-1 (Context-sensitive)   Context-sensitive        Linear-bounded automata
Type-2 (Context-free)        Context-free             Pushdown automata
Type-3 (Right/left linear)   Regular                  Finite automata

Interestingly, the hierarchy, although intended for grammars, happens to map very nicely onto automata models.

A context-free grammar (CFG) is a 4-tuple $G = (V, \Sigma, P, S)$, where

- $V$ is a finite set of variables (also called nonterminals),
- $\Sigma$ is a finite alphabet of terminal symbols, disjoint from $V$,
- $P$ is a finite set of productions, each of the form $A \to \alpha$ with $A \in V$ and $\alpha \in (V \cup \Sigma)^*$, and
- $S \in V$ is the start variable.

Usually, when specifying a grammar, it's enough to give the set of productions, since any variables and terminals that are used will appear as part of some production.

Let's take a look at an example.

Let $G = (V,\Sigma,P,S)$, where $V = \{S\}$, $\Sigma = \{a,b\}$, $S$ is the start variable, and $P$ contains the following productions: \begin{align*} S &\to aSb \\ S &\to \varepsilon \end{align*}
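For example, here is a derivation of the word $aabb$ in $G$: $$S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aabb,$$ where the final step applies the production $S \to \varepsilon$.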

It's not too hard to see that the grammar $G$ generates the language $\{a^k b^k \mid k \geq 0\}$. This is not very different from an inductive definition: a word is generated by $G$ if and only if it is empty or of the form $awb$ for a word $w$ that is generated by $G$. And so we can see that grammars are recursive structures, in the same way that many other objects in computer science are.
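To make this recursive view concrete, here is a minimal sketch of a membership test for this particular language, mirroring the inductive definition above (the function name generated is our own, and this is an illustration rather than anything we'll rely on later):

```python
def generated(w: str) -> bool:
    """Decide whether w is generated by the grammar S -> aSb | epsilon:
    w is generated iff it is empty (production S -> epsilon), or it is
    a u b for some generated word u (production S -> aSb)."""
    if w == "":
        return True
    if w.startswith("a") and w.endswith("b"):
        return generated(w[1:-1])
    return False

assert generated("aabb")
assert not generated("abab")
```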

We also probably already have an intuitive understanding of how grammars are supposed to generate a language: we simply keep applying rules, transforming the string, until we end up with a word that contains no variables. Of course, we would like to formalize this notion, which we do in the following series of definitions.
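As a rough illustration of this process, here is a sketch of our own (not a standard algorithm) that enumerates generated words by breadth-first search, assuming variables are single uppercase letters; since a grammar can generate infinitely many words, we cap the output:

```python
from collections import deque

def enumerate_words(productions, start="S", limit=5):
    """Enumerate words generated by a grammar by breadth-first search
    over strings of variables and terminals: repeatedly replace the
    leftmost variable using each possible production, and collect the
    strings that contain no variables."""
    words, queue = [], deque([start])
    while queue and len(words) < limit:
        form = queue.popleft()
        # Find the leftmost variable, if any.
        i = next((j for j, c in enumerate(form) if c.isupper()), None)
        if i is None:
            words.append(form)  # no variables left: a generated word
            continue
        for rhs in productions[form[i]]:
            queue.append(form[:i] + rhs + form[i + 1:])
    return words

# The grammar S -> aSb | epsilon from above:
print(enumerate_words({"S": ["aSb", ""]}))
# ['', 'ab', 'aabb', 'aaabbb', 'aaaabbbb']
```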

Let $G = (V,\Sigma,P,S)$ be a context-free grammar. A string over $V \cup \Sigma$ is called a sentential form. For sentential forms $\alpha$ and $\beta$, we write $\alpha \Rightarrow_G \beta$, and say that $\alpha$ derives $\beta$ in one step, if we can write $\alpha = \alpha_1 A \alpha_2$ and $\beta = \alpha_1 \gamma \alpha_2$ for some production $A \to \gamma$ in $P$. We write $\alpha \Rightarrow_G^* \beta$, and say that $\alpha$ derives $\beta$, if $\beta$ can be obtained from $\alpha$ by a sequence of zero or more derivation steps; in other words, $\Rightarrow_G^*$ is the reflexive, transitive closure of $\Rightarrow_G$. When the grammar is clear from context, we drop the subscript and simply write $\Rightarrow$ and $\Rightarrow^*$.

These definitions allow us to say precisely what it means for a word to be "generated" by a grammar.

Let $G = (V,\Sigma,P,S)$ be a context-free grammar. The language generated by $G$ is $$L(G) = \{w \in \Sigma^* \mid S \Rightarrow_G^* w\}.$$ A language $L$ is a context-free language if there exists a context-free grammar $G$ such that $L(G) = L$.

One might ask why these are called context-free languages and grammars. The name has to do with how the grammar works: for any production $A \to \alpha$ and any sentential form $\beta A \gamma$, we can always apply the production $A \to \alpha$, no matter what $\beta$ and $\gamma$ are. The context of $A$, whatever is around $A$, has no bearing on whether a rule can be applied; all that matters is that the variable is present. Contrast this with the following definition of context-sensitive grammars.

A grammar $G = (V, \Sigma, P, S)$ is said to be context-sensitive if every production in $P$ is of the form $\alpha B \gamma \to \alpha \beta \gamma$, for $\alpha, \gamma \in (V \cup \Sigma)^*$, $\beta \in (V \cup \Sigma)^+$, and $B \in V$.

A consequence of this definition is that a rule of the form $\alpha B \gamma \to \alpha \beta \gamma$ means that replacing $B$ with $\beta$ is only allowed within the context $(\alpha,\gamma)$.
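For a concrete instance of this schema (a toy rule of our own, purely for illustration), take $\alpha = a$, $\beta = b$, and $\gamma = c$: the production $$aBc \to abc$$ allows $B$ to be rewritten as $b$ only when it appears between an $a$ and a $c$, whereas the context-free production $B \to b$ could be applied no matter what surrounds $B$.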

Let's return to our example from above and show, via the following proposition, that the language $\{a^k b^k \mid k \geq 0\}$ is context-free.

Let $G$ be the following context-free grammar \[S \to aSb \mid \varepsilon.\] Then $L(G) = \{a^k b^k \mid k \geq 0\}$.

Let $L = \{a^k b^k \mid k \geq 0\}$. First, we show that $L \subseteq L(G)$. Consider a word $x \in L$; we will show that $x$ can be generated by $G$ by induction on $|x|$.

We begin with the base case, $|x| = 0$, which means that $x = \varepsilon$. Since $S \to \varepsilon$ is a production of $G$, we have $S \Rightarrow \varepsilon = x$, so $x \in L(G)$.

Now for our inductive step, we consider $|x| \geq 1$ and thus $x \neq \varepsilon$. For our induction hypothesis, we suppose that for any $w \in L$ with $|w| < |x|$, we have $w \in L(G)$. That is, $S \Rightarrow^* w$.

Since $x \in L$ and $|x| > 0$, we have $x = a^n b^n$ for some $n \geq 1$. This means that $w = a^{n-1} b^{n-1} \in L$ and $x = awb$. Since $|w| < |x|$, we have $w \in L(G)$ by our induction hypothesis, and thus there is a derivation $S \Rightarrow^* w$. Then we can use this to derive $x$ by $$S \Rightarrow aSb \Rightarrow^* awb = x.$$ Thus, we have $x \in L(G)$, and $L \subseteq L(G)$ as desired.

We'll finish the other direction next time.