CMSC 28000 — Lecture 15

Chomsky Normal Form, continued

The fact that this is described as a normal form hints that every context-free grammar has an equivalent grammar in CNF, and therefore, every context-free language is generated by some grammar in CNF.

As an aside, Chomsky Normal Form is not the only normal form for grammars. Another one covered in the textbook is Greibach normal form, defined by Sheila Greibach in 1965. Production rules in Greibach normal form must be of the form $A \to aB_1 \cdots B_m$, where $a \in \Sigma$ and $A, B_1, \dots, B_m \in V$. We will not be covering Greibach normal form.

Let $L \subseteq \Sigma^*$ be a context-free language. Then there exists a CFG $G$ in Chomsky normal form such that $L(G) = L$.

We will begin with an arbitrary CFG $H$ and perform the following algorithm to construct an equivalent CFG $G$ in CNF.

  1. START: Add a new start variable $S_0$ and production $S_0 \to S$. This allows us to guarantee that the starting variable of $G$ never occurs on the right hand side of a production.
  2. TERM: Add a variable $A_a$ for each terminal $a \in \Sigma$, add the production $A_a \to a$, and replace every occurrence of $a$ in a right hand side of length at least two with $A_a$.
  3. BIN: Split up productions of the form $A \to B_1 \cdots B_m$ with $m \geq 3$. We do this by replacing each such production with the productions $A \to B_1 A_1$, $A_1 \to B_2 A_2, \dots, A_{m-2} \to B_{m-1} B_m$, where the $A_i$ for $1 \leq i \leq m-2$ are new variables.
  4. DEL: Eliminate productions of the form $A \to \varepsilon$ for $A \neq S_0$. For each production whose right hand side contains such an $A$, we add a copy of the production with that occurrence of $A$ removed, and then delete $A \to \varepsilon$ itself.
  5. UNIT: Eliminate productions of the form $A \to B$, where $B$ is a variable. For each such production, and for all productions $B \to \alpha$, we add the production $A \to \alpha$, repeating until no unit productions remain.

This completes the construction.

Most of the procedures described above are fairly straightforward. The only two that pose some real questions are the final two: the elimination of $\varepsilon$-productions and of unit productions. A constructive proof that these two transformations work is given by Kozen in Chapter 21.
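To give a flavor of what those proofs formalize, here is a small Python sketch (our own illustration, not Kozen's) of the core computations behind DEL and UNIT: the set of nullable variables, and the pairs of variables connected by chains of unit productions. Both are fixed-point computations; the encoding of productions as (variable, right-hand-side tuple) pairs is an assumption made just for this sketch.

    def nullable_variables(productions):
        """Compute the variables A with A =>* eps; this is the heart of DEL.

        productions: a set of pairs (A, rhs), where rhs is a tuple of symbols.
        """
        nullable = {A for (A, rhs) in productions if rhs == ()}
        changed = True
        while changed:
            changed = False
            for (A, rhs) in productions:
                # A becomes nullable once every symbol on its right hand side is.
                if A not in nullable and all(X in nullable for X in rhs):
                    nullable.add(A)
                    changed = True
        return nullable

    def unit_pairs(productions, variables):
        """Compute the pairs (A, B) with A =>* B via unit productions, for UNIT."""
        pairs = {(A, A) for A in variables}
        changed = True
        while changed:
            changed = False
            for (A, B) in list(pairs):
                for (X, rhs) in productions:
                    # If A =>* B and B -> C is a unit production, then A =>* C.
                    if X == B and len(rhs) == 1 and rhs[0] in variables:
                        if (A, rhs[0]) not in pairs:
                            pairs.add((A, rhs[0]))
                            changed = True
        return pairs

DEL then adds, for each production, copies with nullable occurrences deleted, and UNIT replaces each $A \to B$ by $A \to \alpha$ for every unit pair $(A, B)$ and non-unit production $B \to \alpha$.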

We also note that these transformations can be applied in other orders, but some orders create more work. For example, eliminating $\varepsilon$-productions (DEL) before ensuring that all productions have at most two variables on the right hand side (BIN) can create exponentially many productions, as illustrated below. Observe also that some of the transformations (DEL, for instance) can create unit productions, so eliminating unit productions (UNIT) is best left as the final transformation.
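To see the danger concretely, here is a small example of our own (not from the references). Suppose the grammar contains $$ A \to B_1 B_2 \cdots B_m, \qquad B_i \to b \mid \varepsilon \quad \text{for each } i. $$ Applying DEL before BIN forces us to add a copy of the long production for every nonempty subset of the $B_i$'s that survives, giving $2^m - 1$ productions. Applying BIN first leaves only right hand sides of length at most two, and DEL then adds at most two new productions per rule, for $O(m)$ productions in total.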

Consider the grammar $G$, defined by $$ S \to \mathtt ( S \mathtt ) S \mid \varepsilon $$ Here, $L(G)$ is the language of well-balanced parentheses.

  1. Applying the first step (START), adding a new start variable, we obtain \begin{align*} S_0 &\to S \\ S &\to \mathtt (S\mathtt )S \mid \varepsilon \end{align*}
  2. Next, we replace terminals with new variables (TERM). \begin{align*} S_0 &\to S \\ S &\to ASBS \mid \varepsilon \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}
  3. Then, we split up productions with more than two variables on the right hand side (BIN). \begin{align*} S_0 &\to S \\ S &\to AC_1 \mid \varepsilon \\ C_1 &\to SC_2 \\ C_2 &\to BS \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}
  4. Now, we remove $\varepsilon$-rules (DEL). \begin{align*} S_0 &\to S \mid \varepsilon \\ S &\to AC_1 \\ C_1 &\to SC_2 \mid C_2 \\ C_2 &\to BS \mid B \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}
  5. Finally, we remove unit productions (UNIT). \begin{align*} S_0 &\to AC_1 \mid \varepsilon \\ S &\to AC_1 \\ C_1 &\to SC_2 \mid BS \mid \mathtt ) \\ C_2 &\to BS \mid \mathtt ) \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}
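As a quick sanity check, the resulting grammar still generates the string $\mathtt (\mathtt )$: $$ S_0 \Rightarrow AC_1 \Rightarrow \mathtt ( C_1 \Rightarrow \mathtt (\mathtt ) $$ using the productions $S_0 \to AC_1$, $A \to \mathtt ($, and $C_1 \to \mathtt )$.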

As you might guess from the construction, grammars in CNF can be quite large. One point of interest is that the order in which the CNF transformations are applied can affect the blowup in the number of rules. The classical presentation due to Hopcroft and Ullman that is usually taught can lead to exponentially many rules. The order we've presented constructs an equivalent grammar in CNF of size $O(n^2)$, where $n$ is the size of the original grammar. This observation was discussed as part of a larger investigation into the complexity of the CNF procedure by Lange and Leiß in 2009.

Recognition for grammars

As presented, it's not clear that there's a fast way to determine whether a string is generated by a given grammar $G$. For an arbitrary CFG, it's not clear where to even begin. If our grammar is in CNF, however, we can bound the length of any derivation of a string $w$: each step applies either a rule $A \to BC$, which lengthens the sentential form by one (this happens $|w| - 1$ times), or a rule $A \to a$, which produces one terminal (this happens $|w|$ times), so every derivation of $w$ has exactly $2|w| - 1$ steps. This gives us an easy but bad algorithm: check every derivation of length $2|w| - 1$ and see whether any of them derives $w$. Of course, this gives us exponentially many derivations to check.

However, we can exploit the predictable structure of a CNF grammar to come up with a systematic way of building a derivation. We know that every terminal symbol is generated directly by some production, and that every other production has exactly two variables on its right hand side. This means that every derivation of a string longer than a single symbol can be broken up into derivations of exactly two shorter strings.

This naturally gives us some recursive structure to work with. Suppose we have a grammar $G = (V, \Sigma, P, S)$ and a string $w$, and we would like to determine whether $w \in L(G)$. Recall that this means that $S \Rightarrow^* w$, but as usual, we'll ask the broader question of whether there exists a variable $A$ such that $A \Rightarrow^* w$. We consider this based on the length of $w$.

For each string, what we'll do is remember the set of variables $A$ such that $A \Rightarrow^* w$. Let's call this set $T(w)$. Our discussion above gives us the following recurrence. \[T(w) = \begin{cases} \{A \in V \mid A \to w \in P\} & \text{if $|w| = 1$}, \\ \bigcup\limits_{\substack{w = uv \\ u, v \neq \varepsilon}} \{A \mid A \to BC \in P \wedge B\in T(u) \wedge C \in T(v)\} & \text{if $|w| > 1$}. \end{cases}\]

We can make this more explicit, for the algorithm, by referring to $u$ and $v$ as indexed substrings of $w = a_1 \cdots a_n$. \[T(i,j) = \begin{cases} \{A \in V \mid A \to a_i \in P \} & \text{if $i=j$}, \\ \bigcup\limits_{k=i}^{j-1} \{A \mid A \to BC \in P \wedge B\in T(i,k) \wedge C \in T(k+1,j)\} & \text{if $i < j$}. \end{cases}\]

This recurrence can be evaluated bottom-up over substrings of increasing length, so what we have is a dynamic programming algorithm. This algorithm is called the Cocke-Younger-Kasami (CYK) algorithm, named after the three individuals who independently proposed it.

    \begin{algorithmic}
    \PROCEDURE{cyk}{$G = (V,\Sigma,P,S), w=a_1 a_2 \cdots a_n$}
        \FOR{$i$ from 1 to $n$}
            \STATE $T[i,i] = \{A \in V \mid A \to a_i \in P\}$
        \ENDFOR
        \FOR{$m$ from 1 to $n-1$}
            \FOR{$i$ from 1 to $n-m$}
                \STATE $j = i+m$
                \STATE $T[i,j] = \emptyset$
                \FOR{$k$ from $i$ to $j-1$}
                    \STATE $T[i,j] = T[i,j] \cup \{A \in V \mid A \to BC \in P \wedge B \in T[i,k] \wedge C \in T[k+1,j]\}$
                \ENDFOR
            \ENDFOR
        \ENDFOR
    \RETURN $S \in T[1,n]$
    \ENDPROCEDURE
    \end{algorithmic}
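The pseudocode translates almost directly into Python. The representation of the grammar below (a set of terminal rules $A \to a$ and a set of binary rules $A \to BC$) is a choice made for this sketch, not the only reasonable encoding; we also assume $w$ is nonempty, since $\varepsilon \in L(G)$ can be checked directly by looking for the production $S \to \varepsilon$.

    def cyk(terminal_rules, binary_rules, start, w):
        """Decide whether a CNF grammar generates the nonempty string w.

        terminal_rules: a set of pairs (A, a) for productions A -> a
        binary_rules:   a set of triples (A, B, C) for productions A -> B C
        """
        n = len(w)
        # T[i][j] holds the variables generating w[i..j]; 0-indexed here,
        # unlike the 1-indexed pseudocode above.
        T = [[set() for _ in range(n)] for _ in range(n)]
        for i in range(n):
            T[i][i] = {A for (A, a) in terminal_rules if a == w[i]}
        for m in range(1, n):             # m = j - i, the width of the entry
            for i in range(n - m):
                j = i + m
                for k in range(i, j):     # split into w[i..k] and w[k+1..j]
                    for (A, B, C) in binary_rules:
                        if B in T[i][k] and C in T[k + 1][j]:
                            T[i][j].add(A)
        return start in T[0][n - 1]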

Let's walk through an example of the CYK algorithm.

Consider the following grammar $G$ in CNF. \begin{align*} S &\rightarrow BC \mid b \\ A &\rightarrow BC \mid b \\ B &\rightarrow DC \mid BB \mid a \\ C &\rightarrow BA \mid b \\ D &\rightarrow CA \mid c \end{align*}

We want to determine whether the word $cabab$ is generated by $G$. We will store our results in a table. First, we consider every substring of length 1 and the productions that could have generated each substring. We record which variables are on the left hand side of these productions.

\begin{tabular}{c|ccccc}
  & $c$ & $a$ & $b$ & $a$ & $b$ \\
  & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & $D$ & & & & \\
2 & & $B$ & & & \\
3 & & & $S,A,C$ & & \\
4 & & & & $B$ & \\
5 & & & & & $S,A,C$ \\
\end{tabular}

Next, we consider substrings of length 2. A variable can generate a substring of length 2 exactly when it has a production whose two right hand side variables generate the substring's two symbols. For example, entry $1,2$ (the substring $ca$) doesn't have any variables that can generate it, because there is no production of the form $X \rightarrow DB$. However, the substring $ab$ at entry $2,3$ would be generated by a production with $BS$, $BA$, or $BC$ on the right hand side, and such productions exist: $S \rightarrow BC$, $A \rightarrow BC$, and $C \rightarrow BA$.

\begin{tabular}{c|ccccc}
  & $c$ & $a$ & $b$ & $a$ & $b$ \\
  & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & $D$ & $\emptyset$ & & & \\
2 & & $B$ & $S,A,C$ & & \\
3 & & & $S,A,C$ & $\emptyset$ & \\
4 & & & & $B$ & $S,A,C$ \\
5 & & & & & $S,A,C$ \\
\end{tabular}

Next, we consider substrings of length 3. Here, we have to do a bit more work, because a string of length 3, say $xyz$, can be split into two pieces in two ways: as $xy$ and $z$, or as $x$ and $yz$. We must check both possibilities. For example, the string $cab$ can be generated either via $ca$ and $b$ or via $c$ and $ab$. In the first case, $ca$ has no derivation, but the second case gives us a valid derivation: $ab$ can be generated by $S$, $A$, or $C$, and since $B \rightarrow DC$ is a production with $D \Rightarrow c$ and $C \Rightarrow^* ab$, we have $B \Rightarrow DC \Rightarrow^* cab$.

\begin{tabular}{c|ccccc}
  & $c$ & $a$ & $b$ & $a$ & $b$ \\
  & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & $D$ & $\emptyset$ & $B$ & & \\
2 & & $B$ & $S,A,C$ & $\emptyset$ & \\
3 & & & $S,A,C$ & $\emptyset$ & $D$ \\
4 & & & & $B$ & $S,A,C$ \\
5 & & & & & $S,A,C$ \\
\end{tabular}

We continue with strings of length 4. Again, we have to consider all of the ways to split up a string of length 4 into two substrings.

\begin{tabular}{c|ccccc}
  & $c$ & $a$ & $b$ & $a$ & $b$ \\
  & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & $D$ & $\emptyset$ & $B$ & $B$ & \\
2 & & $B$ & $S,A,C$ & $\emptyset$ & $D$ \\
3 & & & $S,A,C$ & $\emptyset$ & $D$ \\
4 & & & & $B$ & $S,A,C$ \\
5 & & & & & $S,A,C$ \\
\end{tabular}

Finally, we finish with the single substring of length 5, our original string $w$. We have constructed a table that tells us which variables can generate each substring of $w$. To determine whether $w \in L(G)$, we simply need to see whether the entry corresponding to $w$ itself (i.e. $w[1..5]$) contains the start variable $S$. We see that it does, so $w \in L(G)$.

\begin{tabular}{c|ccccc}
  & $c$ & $a$ & $b$ & $a$ & $b$ \\
  & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & $D$ & $\emptyset$ & $B$ & $B$ & $S,A,C$ \\
2 & & $B$ & $S,A,C$ & $\emptyset$ & $D$ \\
3 & & & $S,A,C$ & $\emptyset$ & $D$ \\
4 & & & & $B$ & $S,A,C$ \\
5 & & & & & $S,A,C$ \\
\end{tabular}
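Running the Python sketch from earlier on this grammar agrees with the table, using the same illustrative encoding of productions as pairs and triples:

    terminal_rules = {("S", "b"), ("A", "b"), ("B", "a"), ("C", "b"), ("D", "c")}
    binary_rules = {("S", "B", "C"), ("A", "B", "C"), ("B", "D", "C"),
                    ("B", "B", "B"), ("C", "B", "A"), ("D", "C", "A")}

    print(cyk(terminal_rules, binary_rules, "S", "cabab"))   # True
    print(cyk(terminal_rules, binary_rules, "S", "ca"))      # False: entry 1,2 is empty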

Since CYK is a dynamic programming algorithm, the running time is not complicated to figure out.

Given a CFG $G$ in Chomsky Normal Form and string $w$ of length $n$, we can determine whether $w \in L(G)$ in $O(n^3 |G|)$ time.

CYK performs $O(n^3)$ table updates, since it has three nested loops: for each substring length $\ell$ from 1 to $n$, it considers each substring of length $\ell$ (of which there are $O(n)$) and all of the ways to split that substring into two (again, $O(n)$). Each update scans the set of productions, which contributes the $O(|G|)$ factor. Alternatively, we can see that we are filling out an $n \times n$ table ($O(n^2)$ entries), where each entry requires a traversal of the associated row and column ($O(n)$), which gives the same $O(n^3)$ bound. This puts us into polynomial time, which is much better than our "try every derivation" approach.

As a side note, Leslie Valiant showed in 1975 that computing the CYK table can be reduced to Boolean matrix multiplication. At the time most current formal languages books were printed, the best known matrix multiplication algorithm ran in roughly $O(n^{2.376})$ time, due to Coppersmith and Winograd in 1990. This stood until about 2010 (i.e. 20 years later), when a series of small improvements began, the most recent of which is due to Vassilevska Williams, Xu, Xu, and Zhou earlier this year (2024), clocking in at $O(n^{2.371552})$.

However, the reduction also goes the other way around: Lillian Lee showed in 2002 that you can solve matrix multiplication by turning it into context-free grammar parsing, so matrix multiplication can't do any better than whatever parsing algorithm we muster up, and any significantly faster parser would yield a faster matrix multiplication algorithm. The consequence of this is that we don't really know if we can do parsing better than $O(n^2)$.

The more immediate problem is that $O(n^3)$ is still quite large; think about a typical large software codebase, the amount of time it takes to build a project, and how mad everyone gets when you break the build. Ideally, we would like something that runs in linear time; in other words, something where we can essentially read our input once and parse it correctly without having to go back over it. This requires restricting our attention to various subclasses of context-free languages.