CMSC 28000 — Lecture 15

Chomsky Normal Form, continued

The fact that this is described as a normal form suggests that every context-free grammar has an equivalent grammar in CNF, and therefore, every context-free language has a grammar in CNF.

Let $L \subseteq \Sigma^*$ be a context-free language. Then there exists a CFG $G$ in Chomsky normal form such that $L(G) = L$.

We will begin with an arbitrary CFG $H$ and perform the following algorithm to construct an equivalent CFG $G$ in CNF.

  1. Add a new start variable $S_0$ and production $S_0 \to S$. This allows us to guarantee that the starting variable of $G$ never occurs on the right hand side of a production.
  2. Add variables $A_a$ for each terminal $a \in \Sigma$, add the productions $A_a \to a$, and replace each occurrence of $a$ on the right hand side of every other production with $A_a$.
  3. Split up productions of the form $A \to B_1 \cdots B_m$ with $m \geq 3$. We do this by replacing each such production with the productions $A \to B_1 A_1$, $A_1 \to B_2 A_2$, $\dots$, $A_{m-2} \to B_{m-1} B_m$, where $A_1, \dots, A_{m-2}$ are new variables.
  4. Eliminate productions of the form $A \to \varepsilon$ for $A \neq S_0$. For each occurrence of such an $A$ on the right hand side of a production, we add a copy of that production with that occurrence of $A$ removed, repeating until no new $\varepsilon$-productions appear.
  5. Eliminate unit productions, those of the form $A \to B$ where $B$ is a variable. We remove each such production and, for every production $B \to \alpha$, we add the production $A \to \alpha$.

This completes the construction.
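Before moving on, it may help to see one of these steps in code. Here is a minimal Python sketch of step 3, under some representational conventions of our own (not the lecture's): productions are stored as pairs of a variable and a tuple of right hand side symbols, and fresh() is a hypothetical helper that mints a new variable name on each call.

    def binarize(productions, fresh):
        """Split A -> B_1 ... B_m (m >= 3) into two-symbol productions."""
        new = set()
        for A, rhs in productions:
            # Peel off the leftmost symbol until at most two remain.
            while len(rhs) > 2:
                X = fresh()  # a brand-new variable, e.g. A_1, A_2, ...
                new.add((A, (rhs[0], X)))
                A, rhs = X, rhs[1:]
            new.add((A, rhs))
        return new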

Most of the procedures described above are fairly straightforward. The only two that pose some real questions are the final two: the elimination of $\varepsilon$-productions and unit productions. A constructive proof that these two transformations work is given in Lecture 21 of Kozen.
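Continuing with the same representation as the sketch above, here is an illustrative Python implementation of these two steps; again, this is a sketch of the idea, not code from the lecture.

    from itertools import combinations

    def eliminate_epsilon(productions, start):
        """Eliminate A -> epsilon for every A except the start variable."""
        # First, find the nullable variables: those that can derive epsilon.
        nullable, changed = set(), True
        while changed:
            changed = False
            for A, rhs in productions:
                if A not in nullable and all(s in nullable for s in rhs):
                    nullable.add(A)
                    changed = True
        # Keep every variant of every production obtained by deleting some
        # subset of its nullable symbols, discarding epsilon-productions
        # except possibly on the start variable.
        new = set()
        for A, rhs in productions:
            spots = [i for i, s in enumerate(rhs) if s in nullable]
            for r in range(len(spots) + 1):
                for drop in combinations(spots, r):
                    rest = tuple(s for i, s in enumerate(rhs) if i not in drop)
                    if rest or A == start:
                        new.add((A, rest))
        return new

    def eliminate_units(variables, productions):
        """Eliminate unit productions A -> B, where B is a single variable."""
        # reach[A] = all B such that A derives B using only unit productions.
        reach = {A: {A} for A in variables}
        changed = True
        while changed:
            changed = False
            for A in variables:
                for B, rhs in productions:
                    if (B in reach[A] and len(rhs) == 1
                            and rhs[0] in variables and rhs[0] not in reach[A]):
                        reach[A].add(rhs[0])
                        changed = True
        # A inherits every non-unit production of each B it unit-reaches.
        return {(A, rhs)
                for A in variables
                for B, rhs in productions
                if B in reach[A] and not (len(rhs) == 1 and rhs[0] in variables)}

Applied to the grammar of the worked example below (after the first three steps), these two functions reproduce steps 4 and 5 of the walkthrough.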

We also note that the order in which these transformations are applied can vary, but a poor choice of order can create much more work. For example, eliminating $\varepsilon$-productions before ensuring that all productions have at most two variables on the right hand side can create exponentially many productions, as the example below shows. Observe also that some steps can create new unit productions, so eliminating unit productions is best left as the final transformation.
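For a concrete instance of this blowup, consider the following small grammar (an illustrative example, not one from the lecture): \[ S \to A_1 A_2 \cdots A_m, \qquad A_i \to a \mid \varepsilon \quad \text{for } 1 \leq i \leq m. \] Eliminating $\varepsilon$-productions while the long production is still intact forces us to add a production $S \to A_{i_1} \cdots A_{i_k}$ for every nonempty subset $\{i_1 < \cdots < i_k\}$ of the positions, a total of $2^m - 1$ productions. Splitting the long production into two-symbol productions first and then eliminating $\varepsilon$-productions yields only $O(m)$ productions instead.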

Consider the grammar $G$, defined by $$ S \to \mathtt ( S \mathtt ) S \mid \varepsilon $$ Here, $L(G)$ is the language of well-balanced parentheses.

  1. Applying the first step, adding a new start variable, we obtain \begin{align*} S_0 &\to S \\ S &\to \mathtt (S\mathtt )S \mid \varepsilon \end{align*}
  2. Next, we replace the terminals with new variables and their productions: \begin{align*} S_0 &\to S \\ S &\to ASBS \mid \varepsilon \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}
  3. Then, we split up productions with more than two variables on the right hand side: \begin{align*} S_0 &\to S \\ S &\to AC_1 \mid \varepsilon \\ C_1 &\to SC_2 \\ C_2 &\to BS \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}
  4. Now, we remove $\varepsilon$-rules. \begin{align*} S_0 &\to S \mid \varepsilon \\ S &\to AC_1 \\ C_1 &\to SC_2 \mid C_2 \\ C_2 &\to BS \mid B \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}
  5. Finally, we remove unit productions. \begin{align*} S_0 &\to AC_1 \mid \varepsilon \\ S &\to AC_1 \\ C_1 &\to SC_2 \mid BS \mid \mathtt ) \\ C_2 &\to BS \mid \mathtt ) \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}
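As a quick sanity check, the final grammar still generates the string $\mathtt{()}$: \[ S_0 \Rightarrow AC_1 \Rightarrow \mathtt( C_1 \Rightarrow \mathtt{()}. \] Note that this derivation takes exactly $2 \cdot 2 - 1 = 3$ steps, a pattern that will become important in the next section.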

As you might guess from the construction, grammars in CNF can be quite large. One point of interest is that the order in which the CNF transformations are applied can affect the blowup in the number of rules. The classical presentation due to Hopcroft and Ullman that is usually taught can lead to exponentially many rules, whereas the order we've presented constructs an equivalent grammar in CNF of size $O(n^2)$, where $n$ is the size of the original grammar. Another note is that grammars in CNF can be ambiguous. Since inherently ambiguous context-free languages exist, any grammar in CNF that generates such a language must also be ambiguous.

Recognition for grammars

As presented, it's not clear that there's a fast way to determine whether a string is generated by a given grammar $G$. If $G$ is in CNF, then every nonempty word $w \in L(G)$ has a derivation of length exactly $2|w| - 1$: there are $|w| - 1$ applications of productions of the form $A \to BC$ to produce $|w|$ variables, followed by $|w|$ applications of productions of the form $A \to a$. This gives us an easy but bad algorithm: enumerate every derivation of length $2|w| - 1$ and check whether any of them derives $w$. Of course, this gives us exponentially many derivations to check.

However, we can exploit the predictable structure of a CNF grammar by considering each substring. We know that every terminal symbol is generated directly by a production of the form $A \to a$, and that every other production has exactly two variables on its right hand side.

This lends itself to a dynamic programming algorithm, and such an algorithm was independently proposed by three different individuals, which gives the algorithm its name: the Cocke–Younger–Kasami (CYK) algorithm.

    \begin{algorithmic}
    \PROCEDURE{cyk}{$G = (V,\Sigma,P,S), w=a_1 a_2 \cdots a_n$}
        \FOR{$i$ from 1 to $n$}
            \STATE $T[i,i] = \{A \in V \mid A \to a_i \in P\}$
        \ENDFOR
        \FOR{$m$ from 1 to $n-1$}
            \FOR{$i$ from 1 to $n-m$}
                \STATE $j = i+m$
                \STATE $T[i,j] = \emptyset$
                \FOR{$k$ from $i$ to $j-1$}
                    \STATE $T[i,j] = T[i,j] \cup \{A \in V \mid A \to BC \in P \wedge B \in T[i,k] \wedge C \in T[k+1,j]\}$
                \ENDFOR
            \ENDFOR
        \ENDFOR
    \RETURN $S \in T[1,n]$
    \ENDPROCEDURE
    \end{algorithmic}
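To make the table-filling concrete, here is a minimal runnable Python sketch of the procedure, using the same (assumed) representation of productions as in the earlier sketches.

    def cyk(grammar, start, w):
        """Decide whether w is in L(G) for a grammar G in CNF.

        grammar: set of (A, rhs) pairs; in CNF, rhs is a single terminal
        (a,), a pair of variables (B, C), or, for the start variable
        only, the empty tuple.
        """
        n = len(w)
        if n == 0:
            # Only the production S -> epsilon can generate the empty string.
            return (start, ()) in grammar
        # T[i, j] holds the variables that derive a_i ... a_j (1-indexed).
        T = {}
        for i in range(1, n + 1):
            T[i, i] = {A for A, rhs in grammar if rhs == (w[i - 1],)}
        for m in range(1, n):             # m = j - i, the entry's "width"
            for i in range(1, n - m + 1):
                j = i + m
                T[i, j] = set()
                for k in range(i, j):     # split as a_i..a_k | a_{k+1}..a_j
                    T[i, j] |= {A for A, rhs in grammar
                                if len(rhs) == 2
                                and rhs[0] in T[i, k]
                                and rhs[1] in T[k + 1, j]}
        return start in T[1, n]

Entries of width $m$ only consult entries of smaller width, which the loop order guarantees have already been filled.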

Given a CFG $G$ in Chomsky Normal Form and string $w$ of length $n$, we can determine whether $w \in L(G)$ in $O(n^3 |G|)$ time.

Since this is a dynamic programming algorithm, it's worth considering what the underlying recursion is doing. For each substring $x$ of $w$, we build the set of variables $A$ from which there is a derivation $A \Rightarrow^* x$. The obvious base case is the substrings of length 1, which are handled directly by the productions $A \to a$ for $a \in \Sigma$.

What then is our inductive case? We consider a string $w$ with $|w| \geq 2$. If it has a derivation from a variable $A$, we know that the derivation must first apply a production of the form $A \to BC$. Then we know that $A \Rightarrow^* w$ if and only if $B \Rightarrow^* u$ and $C \Rightarrow^* v$ for some factorization $w = uv$ into nonempty strings $u$ and $v$.

This gives us the following recurrence. \[T(w) = \begin{cases} \{A \in V \mid A \to w \in P\} & \text{if $|w| = 1$}, \\ \bigcup\limits_{\substack{w = uv \\ u, v \neq \varepsilon}} \{A \in V \mid A \to BC \in P \wedge B\in T(u) \wedge C \in T(v)\} & \text{otherwise}. \end{cases}\]

We can make this more explicit for the algorithm by referring to $u$ and $v$ as indexed substrings of $w = a_1 \cdots a_n$, so that $T(i,j)$ is the set of variables that derive $a_i \cdots a_j$. \[T(i,j) = \begin{cases} \{A \in V \mid A \to a_i \in P\} & \text{if $i=j$}, \\ \bigcup\limits_{k=i}^{j-1} \{A \in V \mid A \to BC \in P \wedge B\in T(i,k) \wedge C \in T(k+1,j)\} & \text{otherwise}. \end{cases}\]

This algorithm performs $O(n^3)$ table updates, since it has three nested loops: for each substring length $\ell$ from 1 to $n$, we consider each substring of length $\ell$ (of which there are $O(n)$), and all of the ways to split this substring into two (again, $O(n)$). Each update scans the productions of the grammar, which gives the $O(n^3 |G|)$ bound stated above. This puts us into polynomial time, which is much better than our "try every derivation" approach. Of course, there are still some issues with this.

First, as a side note, Leslie Valiant showed in 1975 that computing the CYK table can be reduced to Boolean matrix multiplication. At the time most current formal languages books were printed, the best matrix multiplication algorithm ran in roughly $O(n^{2.376})$ time, due to Coppersmith and Winograd in 1990. That bound stood until about 2010 (i.e. 20 years later), when a series of small improvements began, the most recent of which, due to Alman and Vassilevska Williams in 2021, clocks in at $O(n^{2.37286})$.

However, the same goes the other way around: matrix multiplication can't do much better than whatever algorithm we muster up for parsing context-free grammars, because Lillian Lee showed in 2002 that you can solve Boolean matrix multiplication by turning it into context-free grammar parsing, so a fast parser would yield a fast matrix multiplication algorithm. The consequence of this is that we don't really know whether parsing can be done much faster, say in $O(n^2)$ time.

The more immediate problem is that $O(n^3)$ is still really big, if you think about your typical large software codebase, the amount of time it takes to build a project, and how mad everyone gets when you break the build. Ideally, we would like something that runs in linear time; in other words, something where we can essentially read our input once and parse it correctly without having to go back over it. This can't be achieved for all context-free languages, in part because some languages are inherently ambiguous, so instead, attention is restricted to various subclasses of context-free languages.

Let's walk through an example of the CYK algorithm.

Consider the following grammar $G$ in CNF. \begin{align*} S &\rightarrow BC \mid b \\ A &\rightarrow BC \mid b \\ B &\rightarrow DC \mid BB \mid a \\ C &\rightarrow BA \mid b \\ D &\rightarrow CA \mid c \end{align*}

We want to determine whether the word $cabab$ is generated by $G$. We will store our results in a table. First, we consider every substring of length 1 and the productions that could have generated each substring. We record which variables are on the left hand side of these productions.

\[\begin{array}{c|ccccc}
 & c & a & b & a & b \\
 & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & D & & & & \\
2 & & B & & & \\
3 & & & S,A,C & & \\
4 & & & & B & \\
5 & & & & & S,A,C
\end{array}\]

Next, we consider substrings of length 2. A variable generates a substring of length 2 if it has a production whose two right hand side variables generate the two symbols. For example, entry $(1,2)$ has no variables that can generate it, because there is no production of the form $X \rightarrow DB$. However, the substring $ab$ is generated: its two symbols are generated by $B$ and by $S$, $A$, or $C$ respectively, so the candidate right hand sides are $BS$, $BA$, and $BC$, and there are productions for two of these: $S \rightarrow BC$, $A \rightarrow BC$, and $C \rightarrow BA$.
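In the notation of the recurrence $T(i,j)$ from earlier, this computation is \[ T(2,3) = \{A \in V \mid A \to BC \in P \wedge B \in T(2,2) \wedge C \in T(3,3)\} = \{S, A, C\}. \]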

\[\begin{array}{c|ccccc}
 & c & a & b & a & b \\
 & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & D & \emptyset & & & \\
2 & & B & S,A,C & & \\
3 & & & S,A,C & \emptyset & \\
4 & & & & B & S,A,C \\
5 & & & & & S,A,C
\end{array}\]

Next, we consider strings of length 3. Here, we have to do a bit more work, because a string of length 3, say $xyz$, can be split into two pieces in two ways: as $xy$ and $z$, or as $x$ and $yz$. We must check both possibilities. For example, the string $cab$ can be split either into $ca$ and $b$ or into $c$ and $ab$. In the first case, $ca$ has no derivation, but the second case gives us a valid derivation, since $ab$ can be generated by $S$, $A$, or $C$, and so we have $B \Rightarrow DC \Rightarrow^* cab$.
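In the notation of the recurrence, \[ T(1,3) = \bigcup_{k=1}^{2} \{A \in V \mid A \to BC \in P \wedge B \in T(1,k) \wedge C \in T(k+1,3)\} = \{B\}, \] where the $k=1$ term contributes $B$ via the production $B \to DC$, and the $k=2$ term contributes nothing since $T(1,2) = \emptyset$.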

\[\begin{array}{c|ccccc}
 & c & a & b & a & b \\
 & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & D & \emptyset & B & & \\
2 & & B & S,A,C & \emptyset & \\
3 & & & S,A,C & \emptyset & D \\
4 & & & & B & S,A,C \\
5 & & & & & S,A,C
\end{array}\]

We continue with strings of length 4. Again, we have to consider all of the ways to split up a string of length 4 into two substrings.

\[\begin{array}{c|ccccc}
 & c & a & b & a & b \\
 & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & D & \emptyset & B & B & \\
2 & & B & S,A,C & \emptyset & D \\
3 & & & S,A,C & \emptyset & D \\
4 & & & & B & S,A,C \\
5 & & & & & S,A,C
\end{array}\]

Finally, we finish with the string of length 5, our original string $w$. We have constructed a table that tells us which variables can generate each substring of $w$. To determine whether $w \in L(G)$, we simply need to check whether the entry corresponding to $w$ itself (i.e., $w[1..5]$, the entry $(1,5)$) contains the start variable $S$. We see that it does, so $w \in L(G)$.

\[\begin{array}{c|ccccc}
 & c & a & b & a & b \\
 & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & D & \emptyset & B & B & S,A,C \\
2 & & B & S,A,C & \emptyset & D \\
3 & & & S,A,C & \emptyset & D \\
4 & & & & B & S,A,C \\
5 & & & & & S,A,C
\end{array}\]
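As a sanity check, we can feed this grammar and string to the hypothetical cyk sketch from earlier, using the same representation of productions:

    G = {("S", ("B", "C")), ("S", ("b",)),
         ("A", ("B", "C")), ("A", ("b",)),
         ("B", ("D", "C")), ("B", ("B", "B")), ("B", ("a",)),
         ("C", ("B", "A")), ("C", ("b",)),
         ("D", ("C", "A")), ("D", ("c",))}

    print(cyk(G, "S", "cabab"))  # True: S appears in entry (1, 5)
    print(cyk(G, "S", "ca"))     # False: entry (1, 2) is empty

The sets it computes along the way are exactly the entries of the table above.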