CMSC 28000 — Lecture 13

To finish off the proof from last class, we'll show $L(G) \subseteq L$.

Let $x \in L(G)$. We will show $x \in L$ by induction on $k$, the number of steps in the derivation of $x$.

For our base case, we have $k = 1$. The only one-step derivation of a word in this grammar is $S \Rightarrow \varepsilon$, so $x = \varepsilon$. Since $\varepsilon \in L$, the base case holds.

Now we consider $k \geq 2$ and assume that every word in $L(G)$ with a derivation of fewer than $k$ steps is in $L$. Since $k \geq 2$, the first step in the derivation of $x$ must be $S \Rightarrow aSb$, so $x = awb$ for some word $w$ that is derived from $S$ in $k-1$ steps. By our induction hypothesis, $w \in L$, and thus $w = a^n b^n$ for some $n \geq 0$. Then $x = awb = a(a^n b^n)b = a^{n+1} b^{n+1} \in L$, and so $L(G) \subseteq L$ as desired.

Thus, $L = L(G)$.
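As a quick sanity check (my own illustration, not part of the proof; all names below are mine), here is a short Python sketch that enumerates every word derivable from $S \to aSb \mid \varepsilon$ within a bounded number of steps and confirms that each has the form $a^n b^n$.

```python
import re

# Grammar: S -> aSb | eps
def words_up_to(max_steps):
    """Enumerate every terminal word derivable from S in at most
    max_steps derivation steps."""
    found = set()
    def expand(sentential, steps):
        if "S" not in sentential:
            found.add(sentential)
            return
        if steps == max_steps:
            return
        expand(sentential.replace("S", "aSb", 1), steps + 1)  # S -> aSb
        expand(sentential.replace("S", "", 1), steps + 1)     # S -> eps
    expand("S", 0)
    return found

for w in sorted(words_up_to(6), key=len):
    # every derivable word should have the form a^n b^n
    assert re.fullmatch(r"a*b*", w) and w.count("a") == w.count("b")
    print(repr(w))
```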

Parse trees and ambiguity

As we've seen from our definition of context-free grammars, we never need to specify a particular order in which to apply production rules or to replace variables. However, it's a good idea to have some systematic approach to applying productions, and to study the kinds of derivations that such an approach produces.

We say that a leftmost derivation is a derivation in which the leftmost variable is expanded at each step.

Let's look at an example of a leftmost derivation. First, consider the following grammar: \[S \to 0S1S \mid 1S0S \mid \varepsilon\] Then a leftmost derivation in this grammar is \[S \Rightarrow 0S1S \Rightarrow 01S0S1S \Rightarrow 010S1S \Rightarrow 0101S \Rightarrow 0101.\]
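To make the bookkeeping explicit, here is a small Python sketch of mine (not from the lecture; all names are my own) that replays this leftmost derivation by always replacing the leftmost variable:

```python
GRAMMAR = {"S": ["0S1S", "1S0S", ""]}  # S -> 0S1S | 1S0S | eps

def leftmost_step(sentential, body):
    """Replace the leftmost variable occurrence with the given body."""
    i = min(sentential.index(v) for v in GRAMMAR if v in sentential)
    assert body in GRAMMAR[sentential[i]]
    return sentential[:i] + body + sentential[i + 1:]

# Replay S => 0S1S => 01S0S1S => 010S1S => 0101S => 0101
sentential = "S"
for body in ["0S1S", "1S0S", "", "", ""]:
    sentential = leftmost_step(sentential, body)
    print(sentential or "ε")
```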

We can define rightmost derivations similarly. Since, as mentioned above, the order in which variables are replaced doesn't matter, every word generated by a CFG has a leftmost derivation. As it turns out, leftmost derivations have an interesting correspondence with parse trees, which are a way to visualize how a word is generated by a grammar.

Given a grammar $G$, a tree is a parse tree for $G$ if it satisfies the following conditions:

  1. every internal node is labeled with a variable of $G$,
  2. every leaf is labeled with a terminal, a variable, or $\varepsilon$, and a leaf labeled $\varepsilon$ must be the only child of its parent,
  3. if an internal node is labeled $A$ and its children are labeled $X_1, X_2, \dots, X_k$ from left to right, then $A \to X_1 X_2 \cdots X_k$ is a production of $G$.

Any subtree of a parse tree rooted at an internal node is also a parse tree; a subtree consisting of a single leaf is not.

The yield of a parse tree is the concatenation of the symbols at the leaves of the tree from left to right. A full parse tree is one whose root is labeled with the start variable and whose leaves are all labeled with terminals. Such a tree yields a word generated by the grammar. The following is the parse tree for the derivation we saw earlier.

As alluded to before, there is a bijection between the set of parse trees of a word and the set of its leftmost derivations. It isn't difficult to see that performing a depth-first traversal of a parse tree allows us to recover a leftmost derivation. However, even with the notion of leftmost derivations, having so much flexibility in the production rules and how to apply them makes it difficult to reason about grammars. For instance, a parse tree admits derivations other than the leftmost one, such as, in this case, \[S \Rightarrow 0S1S \Rightarrow 0S1 \Rightarrow 01S0S1 \Rightarrow 010S1 \Rightarrow 0101.\]
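To see the traversal claim concretely, here is a sketch of mine in which nested tuples are one hypothetical encoding of the first parse tree for 0101: collecting the production body used at each internal node in preorder (a depth-first, left-to-right traversal) recovers exactly the leftmost derivation, and reading the leaves left to right gives the yield.

```python
# A parse tree node is (variable, children); a leaf is a terminal string.
# ("S", []) represents an application of S -> eps.
# This is the tree for the leftmost derivation of 0101 given earlier.
TREE = ("S", ["0", ("S", ["1", ("S", []), "0", ("S", [])]), "1", ("S", [])])

def tree_yield(node):
    """Concatenate the symbols at the leaves from left to right."""
    if isinstance(node, str):
        return node
    _, children = node
    return "".join(tree_yield(c) for c in children)

def preorder_bodies(node, out):
    """Collect the production bodies used at internal nodes in preorder.
    Applying them to the leftmost variable replays the leftmost derivation."""
    if isinstance(node, str):
        return
    _, children = node
    out.append("".join(c if isinstance(c, str) else c[0] for c in children))
    for c in children:
        preorder_bodies(c, out)

bodies = []
preorder_bodies(TREE, bodies)
print(bodies)            # ['0S1S', '1S0S', '', '', ''] — the leftmost derivation
print(tree_yield(TREE))  # 0101
```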

To make matters worse, we can't even guarantee that any given word has a single parse tree. For example, here's another possible derivation of the same word using the grammar above. \[S \Rightarrow 0S1S \Rightarrow 01S \Rightarrow 010S1S \Rightarrow 0101S \Rightarrow 0101.\]

And here is its associated parse tree.

Notice that this derivation is also a leftmost derivation. In this case, we say the grammar is ambiguous.

A grammar $G$ is ambiguous if there is a word $w \in L(G)$ with more than one leftmost derivation (or, equivalently, parse tree). Otherwise, the grammar is unambiguous.
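For the grammar above, a brute-force search (a quick sketch of mine, with pruning so that it terminates) confirms that 0101 has exactly two leftmost derivations:

```python
GRAMMAR = {"S": ["0S1S", "1S0S", ""]}  # S -> 0S1S | 1S0S | eps

def count_leftmost(sentential, target):
    """Count the leftmost derivations of target from the given sentential
    form, pruning branches that can no longer reach the target."""
    i = sentential.find("S")
    if i == -1:
        return 1 if sentential == target else 0
    # terminals to the left of the first variable are permanent, so they
    # must be a prefix of the target; the number of terminals only grows
    if not target.startswith(sentential[:i]):
        return 0
    if len(sentential.replace("S", "")) > len(target):
        return 0
    return sum(count_leftmost(sentential[:i] + body + sentential[i + 1:], target)
               for body in GRAMMAR["S"])

print(count_leftmost("S", "0101"))  # 2 — the grammar is ambiguous
```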

It's important to note that the definition here says that ambiguity is a property of the grammar, not the language. This may lead you to wonder whether there's a way to come up with an unambiguous grammar for any given context-free language. Sadly, the answer is no: there are context-free languages for which every grammar is ambiguous. Here is an example, which I will give without proof: \[\{a^k b^k c^m \mid k,m \geq 0\} \cup \{a^k b^m c^m \mid k,m \geq 0\}.\] To see this, think about what the parse trees for $a^n b^n c^n$ might look like. We call such languages inherently ambiguous.

We say that a language is inherently ambiguous if there is no unambiguous grammar that generates it.

Chomsky Normal Form

All of this is to say that the many ways of deriving the same word present challenges for reasoning about grammars. Thinking back to finite automata or derivatives, we were able to say clearly what the result of reading a particular word would be. Here, there seem to be many different ways of arriving at the same word.

When we run into situations like these, we typically want to find some standard form with structure we can rely on. For CFGs, this idea is captured by the Chomsky normal form. Unsurprisingly, this normal form and many of its properties are due to Chomsky (1959). That this one is named after Chomsky also suggests that there are other normal forms (notably, Greibach normal form).

A context-free grammar $G$ is in Chomsky normal form if every production of $G$ has one of the following forms:

  1. $A \to BC$, where $A,B,C \in V$ and $B,C \neq S$,
  2. $A \to a$, for $A \in V$ and $a \in \Sigma$,
  3. $S \to \varepsilon$, where $S$ is the start variable.

Ironically, Chomsky normal form has a number of variants. The one here allows for $\varepsilon$ to be generated by the grammar. Kozen's presentation disallows $\varepsilon$ from being in the language. This is not a huge issue, but it does mean that the rules can be slightly different across presentations.
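To make the definition concrete, here is a small Python sketch of mine that checks the three production forms, applied to a hypothetical CNF grammar for $\{a^n b^n \mid n \geq 0\}$ (the variables $T$ and $X$ are my own naming):

```python
def is_cnf(productions, variables, start="S"):
    """Check that every production has one of the three CNF forms:
    A -> BC (B, C variables other than the start), A -> a (a terminal),
    or S -> eps for the start variable only."""
    for head, body in productions:
        if body == "":                                   # form 3: S -> eps
            if head != start:
                return False
        elif len(body) == 1:                             # form 2: A -> a
            if body in variables:
                return False
        elif len(body) == 2:                             # form 1: A -> BC
            if any(x not in variables or x == start for x in body):
                return False
        else:
            return False
    return True

# A (hypothetical) CNF grammar for {a^n b^n : n >= 0}:
#   S -> AT | AB | eps,  T -> XB,  X -> AT | AB,  A -> a,  B -> b
prods = [("S", "AT"), ("S", "AB"), ("S", ""), ("T", "XB"),
         ("X", "AT"), ("X", "AB"), ("A", "a"), ("B", "b")]
print(is_cnf(prods, {"S", "T", "X", "A", "B"}))  # True
```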

One very useful property of CNF grammars is that every parse tree for a non-empty string $w \in L(G)$ has exactly $2|w| - 1$ nodes labeled with variables and exactly $|w|$ leaf nodes. In other words, every derivation of a non-empty string $w$ via a grammar in CNF has exactly $2|w| - 1$ steps.

Let $G$ be a context-free grammar in Chomsky normal form. Then for any non-empty word $w \in L(G)$, a derivation of $w$ in $G$ has $2|w| - 1$ steps.

We will show something more general: for any variable $A$ of $G$ and any non-empty word $w$ with $A \Rightarrow^* w$, every derivation of $w$ from $A$ has $2|w| - 1$ steps. We proceed by induction on $|w|$. For the base case, we have $|w| = 1$, and the derivation must consist of the single step $A \Rightarrow w$ via a rule $A \to w$: a first step of the form $A \to BC$ would lead to a word of length at least two, since neither $B$ nor $C$ is the start variable and so neither can derive $\varepsilon$. Thus the derivation has length $1$, and $1 = 2 \cdot 1 - 1$, so the base case holds.

Now we consider $|w| > 1$. Our induction hypothesis is that for every variable $A'$ and every word $w'$ with $A' \Rightarrow^* w'$ and $|w'| < |w|$, every derivation of $w'$ from $A'$ has length $2|w'| - 1$. Since $|w| > 1$, the first step of the derivation must use a rule $A \to BC$. This means we can write $w = w_1 w_2$, where $B \Rightarrow^* w_1$, $C \Rightarrow^* w_2$, and $w_1, w_2$ are both non-empty. Since both $w_1$ and $w_2$ are shorter than $w$, by the induction hypothesis their derivations have lengths $2|w_1| - 1$ and $2|w_2| - 1$. Then the derivation of $w$ has length \[(2|w_1| - 1) + (2|w_2| - 1) + 1 = 2(|w_1| + |w_2|) - 1 = 2|w| - 1.\]
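For a concrete instance, take the hypothetical CNF grammar for $\{a^n b^n \mid n \geq 0\}$ sketched above: the word $aabb$ has the leftmost derivation \[S \Rightarrow AT \Rightarrow aT \Rightarrow aXB \Rightarrow aABB \Rightarrow aaBB \Rightarrow aabB \Rightarrow aabb,\] which takes exactly $7 = 2 \cdot 4 - 1$ steps, as claimed.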