CMSC 28000 — Lecture 14

We have an intuitive understanding of how grammars are supposed to generate a language: start with the start variable $S$ and keep on applying as many rules as we like, transforming the string until we end up with a string that contains no variables. We will formalize this idea through the following series of definitions.

Let $G = (V,\Sigma,P,S)$ be a context-free grammar.

Consider the following grammar $G$, which generates strings of balanced brackets: \[S \to \mathtt [ S \mathtt ] S \mid S \mathtt [ S \mathtt ] \mid \varepsilon.\] The string $\mathtt{[][[]]}$ has, among others, the following three derivations in $G$:
\[S \Rightarrow \mathtt{[}S\mathtt{]}S \Rightarrow \mathtt{[]}S \Rightarrow \mathtt{[][}S\mathtt{]}S \Rightarrow \mathtt{[][[}S\mathtt{]}S\mathtt{]}S \Rightarrow \mathtt{[][[]}S\mathtt{]}S \Rightarrow \mathtt{[][[]]}S \Rightarrow \mathtt{[][[]]},\]
\[S \Rightarrow \mathtt{[}S\mathtt{]}S \Rightarrow \mathtt{[}S\mathtt{][}S\mathtt{]}S \Rightarrow \mathtt{[}S\mathtt{][}S\mathtt{]} \Rightarrow \mathtt{[}S\mathtt{][[}S\mathtt{]}S\mathtt{]} \Rightarrow \mathtt{[}S\mathtt{][[}S\mathtt{]]} \Rightarrow \mathtt{[}S\mathtt{][[]]} \Rightarrow \mathtt{[][[]]},\]
\[S \Rightarrow S\mathtt{[}S\mathtt{]} \Rightarrow \mathtt{[}S\mathtt{]}S\mathtt{[}S\mathtt{]} \Rightarrow \mathtt{[]}S\mathtt{[}S\mathtt{]} \Rightarrow \mathtt{[][}S\mathtt{]} \Rightarrow \mathtt{[][[}S\mathtt{]}S\mathtt{]} \Rightarrow \mathtt{[][[]}S\mathtt{]} \Rightarrow \mathtt{[][[]]}.\]

An interesting observation is that the first two derivations correspond to the same parse tree, while the third, which is a different leftmost derivation, corresponds to a different parse tree.

What this suggests is that derivations of a particular string need not be unique, nor need they all have the same length, even if we restrict ourselves to leftmost derivations. Furthermore, they need not all yield the same parse tree. However, each parse tree corresponds to exactly one leftmost derivation.

With these definitions in hand, we can now define what it means for a string to be "generated" by a grammar.

Let $G = (V,\Sigma,P,S)$ be a context-free grammar. The language generated by $G$ is $$L(G) = \{w \in \Sigma^* \mid S \Rightarrow_G^* w\}.$$ A language $L$ is a context-free language if there exists a context-free grammar $G$ such that $L(G) = L$.
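The definition of $L(G)$ can be explored mechanically: since every word has a leftmost derivation, a breadth-first search over sentential forms that always expands the leftmost variable finds every word of $L(G)$ up to a given length. The following Python sketch is our own illustration (the dictionary encoding of grammars, the function name, and the pruning bound are not from the notes); it is run on the bracket grammar from earlier.

```python
from collections import deque

def generate(rules, start="S", max_len=4):
    """Enumerate the words of L(G) of length at most max_len by breadth-first
    search over sentential forms, always expanding the leftmost variable.
    Variables are single uppercase letters; "" denotes epsilon."""
    words, seen = set(), set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form in seen:
            continue
        seen.add(form)
        # Locate the leftmost variable, if any.
        i = next((j for j, c in enumerate(form) if c.isupper()), None)
        if i is None:
            if len(form) <= max_len:
                words.add(form)
            continue
        # Terminals are never erased, so a form with too many terminals
        # can only derive words that are too long; prune it.
        if sum(1 for c in form if not c.isupper()) > max_len:
            continue
        for rhs in rules[form[i]]:
            queue.append(form[:i] + rhs + form[i + 1:])
    return words

# The bracket grammar S -> [S]S | S[S] | epsilon from earlier.
brackets = {"S": ["[S]S", "S[S]", ""]}
print(sorted(generate(brackets), key=lambda w: (len(w), w)))
# ['', '[]', '[[]]', '[][]']
```

The pruning step is what guarantees termination: every production of this grammar either erases a variable or adds terminals, so only finitely many sentential forms survive the terminal-count bound.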

Let's return to our earlier example and show, via the following proposition, that the language $\{a^k b^k \mid k \geq 0\}$ is context-free.

Let $G$ be the following context-free grammar \[S \to aSb \mid \varepsilon.\] Then $L(G) = \{a^k b^k \mid k \geq 0\}$.

Let $L = \{a^k b^k \mid k \geq 0\}$. First, we show that $L \subseteq L(G)$. Consider a word $x \in L$; we will show that $x$ can be generated by $G$ by induction on $|x|$.

We begin with the base case, $|x| = 0$, which means that $x = \varepsilon$. Since $S \to \varepsilon$ is a production in $G$, we have $S \Rightarrow \varepsilon = x$, so $x \in L(G)$.

Now for our inductive step, we consider $|x| \geq 1$ and thus $x \neq \varepsilon$. For our induction hypothesis, we suppose that for any $w \in L$ with $|w| < |x|$, we have $w \in L(G)$. That is, $S \Rightarrow^* w$.

Since $x \in L$ and $|x| > 0$, we have $x = a^n b^n$ for some $n \geq 1$. Consider the word $w = a^{n-1} b^{n-1} \in L$. Since $|w| < |x|$, we have $w \in L(G)$ by our induction hypothesis, and thus there is a derivation $S \Rightarrow^* w$. We can extend this to a derivation of $x$: $$S \Rightarrow aSb \Rightarrow^* awb = x.$$ Thus, we have $x \in L(G)$ and $L \subseteq L(G)$ as desired.

Now, we consider the other direction, $L(G) \subseteq L$. Let $x \in L(G)$. We will show that $x \in L$ by induction on $k$, the number of steps in a derivation of $x$.

For our base case, we have $k = 1$. The only production whose right-hand side contains no variables is $S \to \varepsilon$, so a one-step derivation yields $x = \varepsilon$. Since $\varepsilon \in L$, the base case holds.

Now we consider $k \geq 2$ and assume that every word of $L(G)$ with a derivation of fewer than $k$ steps is in $L$. Since $k \geq 2$, the first step of the derivation cannot be $S \to \varepsilon$, so it must be $S \Rightarrow aSb$. Thus $x = awb$, where $S \Rightarrow^* w$ in $k - 1$ steps. Since $k - 1 < k$, our induction hypothesis gives $w \in L$, so $w = a^n b^n$ for some $n \geq 0$. Then $x = awb = a(a^n b^n)b = a^{n+1} b^{n+1} \in L$, and $L(G) \subseteq L$ as desired.

Thus, $L = L(G)$.
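The first direction of this proof is entirely constructive: to derive $a^k b^k$, apply $S \to aSb$ exactly $k$ times and then $S \to \varepsilon$. A short Python sketch (the helper function is our own illustration, not part of the notes) mirrors that construction and records each sentential form.

```python
def derive(k):
    """Apply S -> aSb exactly k times, then S -> epsilon, recording every
    sentential form along the way; the result is a derivation of a^k b^k."""
    forms = ["S"]
    for _ in range(k):
        forms.append(forms[-1].replace("S", "aSb"))
    forms.append(forms[-1].replace("S", ""))
    return forms

print(" => ".join(derive(3)))
# S => aSb => aaSbb => aaaSbbb => aaabbb
```

Note that the derivation has $k + 1$ steps, matching the second direction of the proof: a word derived in $k + 1$ steps is $a^k b^k$.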

Chomsky Normal Form

One challenging aspect of reasoning about grammars is that the only constraint on a production rule $A \to \alpha$ is that a single variable appears on the left-hand side, while $\alpha$ can be any finite string over $V \cup \Sigma$. Compounding the issue is that productions can be applied nondeterministically. Thinking back to finite automata or derivatives, we were able to say something clear about the result of reading a particular word. Here, there seem to be many different ways of arriving at the same word.

When we run into situations like these, we typically want to find some sort of standard form which has some structure that we can rely on. For CFGs, this idea is captured in the Chomsky normal form. Unsurprisingly, this normal form and many of its properties are due to Chomsky (1959). That this one is named after Chomsky also implies that there are other normal forms (namely, Greibach normal form).

A context-free grammar $G$ is in Chomsky normal form if every production of $G$ has one of the following forms:

  1. $A \to BC$, where $A,B,C \in V$ and $B,C \neq S$,
  2. $A \to a$, for $A \in V$ and $a \in \Sigma$,
  3. $S \to \varepsilon$, where $S$ is the start variable.
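These three rule shapes are easy to check mechanically. Here is a small Python sketch (the dictionary encoding of grammars, with single uppercase letters as variables and lowercase letters as terminals, is our own convention, as is the sample CNF grammar for $\{a^k b^k \mid k \geq 0\}$):

```python
def is_cnf(rules, start="S"):
    """Return True if every production fits one of the three CNF shapes:
    A -> BC (with B, C variables other than the start), A -> a,
    or S -> epsilon (encoded as "")."""
    for lhs, rhss in rules.items():
        for rhs in rhss:
            if rhs == "":
                if lhs != start:          # only S -> epsilon is allowed
                    return False
            elif len(rhs) == 1:
                if not rhs.islower():     # A -> a with a a terminal
                    return False
            elif len(rhs) == 2:
                if not rhs.isupper() or start in rhs:  # A -> BC, B, C != S
                    return False
            else:
                return False              # right-hand side too long
    return True

# One possible CNF grammar for {a^k b^k : k >= 0}: T generates a^n b^n, n >= 1.
cnf = {"S": ["", "AB", "AY"], "Y": ["TB"],
       "T": ["AB", "AY"], "A": ["a"], "B": ["b"]}
print(is_cnf(cnf))                  # True
print(is_cnf({"S": ["aSb", ""]}))   # False: the right-hand side aSb has length 3
```

The second call shows that our earlier grammar $S \to aSb \mid \varepsilon$ is not in CNF, even though its language is context-free.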

Ironically, Chomsky normal form has a number of variants. The one here allows for $\varepsilon$ to be generated by the grammar. Kozen's presentation disallows $\varepsilon$ from being in the language. This is not a huge issue, but it does mean that the rules can be slightly different across presentations.

One very useful property of CNF grammars is that every derivation of a non-empty string $w$ via a grammar in CNF has exactly $2|w| - 1$ steps.

Let $G$ be a context-free grammar in Chomsky normal form. Then for any non-empty word $w \in L(G)$, a derivation of $w$ in $G$ has $2|w| - 1$ steps.

We will show something more general: for any variable $A$ of $G$ and any non-empty word $w$ with $A \Rightarrow^* w$, every derivation $A \Rightarrow^* w$ has exactly $2|w| - 1$ steps. We will show this by induction on $|w|$. For the base case, we have $|w| = 1$, and the derivation must consist of the single step $A \to w$: a first step of the form $A \to BC$ would lead to a word of length at least two, since neither $B$ nor $C$ can derive $\varepsilon$ (the only $\varepsilon$-rule is $S \to \varepsilon$, and $S$ never appears on a right-hand side). Thus, we have a derivation of length 1, and $1 = 2 \cdot 1 - 1$, so the base case holds.

Now we consider $|w| > 1$. Our induction hypothesis is that for all words $w'$ with $A \Rightarrow^* w'$ and $|w'| < |w|$, every derivation of $w'$ has length $2|w'| - 1$. Since $|w| > 1$, the first step of the derivation must use a rule $A \to BC$. This means we can write $w = w_1 w_2$, where $B \Rightarrow^* w_1$ and $C \Rightarrow^* w_2$, and $w_1, w_2$ are both non-empty. Since both $w_1$ and $w_2$ are shorter than $w$, by the induction hypothesis their derivations have lengths $2|w_1| - 1$ and $2|w_2| - 1$. The derivation of $w$ therefore has length \[(2|w_1| - 1) + (2|w_2| - 1) + 1 = 2(|w_1| + |w_2|) - 1 = 2|w| - 1.\]
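We can corroborate this count empirically: enumerate the leftmost derivations of a grammar in CNF and record, for each word, the set of derivation lengths observed. The Python sketch below is our own illustration; the CNF grammar for $\{a^k b^k \mid k \geq 0\}$ it uses is one possible choice, not taken from the notes.

```python
from collections import deque

def derivation_lengths(rules, start="S", max_len=6):
    """For a grammar in CNF, map each word of length at most max_len to the
    set of step counts over all of its leftmost derivations."""
    lengths = {}
    queue = deque([(start, 0)])
    while queue:
        form, steps = queue.popleft()
        # In CNF, each symbol of a sentential form derives at least one
        # terminal (only S -> epsilon erases, and S is never on a
        # right-hand side), so long forms cannot derive short words.
        if len(form) > max_len:
            continue
        i = next((j for j, c in enumerate(form) if c.isupper()), None)
        if i is None:
            lengths.setdefault(form, set()).add(steps)
            continue
        for rhs in rules[form[i]]:
            queue.append((form[:i] + rhs + form[i + 1:], steps + 1))
    return lengths

cnf = {"S": ["", "AB", "AY"], "Y": ["TB"],
       "T": ["AB", "AY"], "A": ["a"], "B": ["b"]}
lengths = derivation_lengths(cnf)
print(all(ls == {2 * len(w) - 1} for w, ls in lengths.items() if w))  # True
```

Every non-empty word of length up to 6 is derived in exactly $2|w| - 1$ steps, no matter which leftmost derivation is used; only $\varepsilon$, derived in one step via $S \to \varepsilon$, falls outside the formula.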