CMSC 28000 — Lecture 14

Last time we gave an algorithm for transforming an arbitrary CFG into a grammar in Chomsky normal form. So what does this algorithm look like in action? Let's try it with an example.

Consider the grammar $G$, defined by $$ S \to ( S ) S \mid \varepsilon $$ Here, $L(G)$ is the language of well-balanced parentheses. Applying the first step, adding a new start variable, we obtain \begin{align*} S_0 &\to S \\ S &\to (S)S \mid \varepsilon \end{align*} Then eliminating the $\varepsilon$-production, we have \begin{align*} S_0 &\to S \mid \varepsilon \\ S &\to (S)S \mid () S \mid (S) \mid () \end{align*} Then, we eliminate unit productions to obtain \begin{align*} S_0 &\to (S)S \mid () S \mid (S) \mid () \mid \varepsilon \\ S &\to (S)S \mid () S \mid (S) \mid () \end{align*} Then, we introduce productions for each terminal, \begin{align*} S_0 &\to A_LSA_RS \mid A_LA_R S \mid A_LSA_R \mid A_LA_R \mid \varepsilon \\ S &\to A_LSA_RS \mid A_LA_R S \mid A_LSA_R \mid A_LA_R \\ A_L &\to ( \\ A_R &\to ) \end{align*} The final step is to ensure every production is broken up into the form $A \to BC$.
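Carrying out this last step, one possible result (the fresh variables $X_1, X_2, X_3$ are just one choice of names) is the grammar \begin{align*} S_0 &\to A_L X_1 \mid A_L X_2 \mid A_L X_3 \mid A_L A_R \mid \varepsilon \\ S &\to A_L X_1 \mid A_L X_2 \mid A_L X_3 \mid A_L A_R \\ X_1 &\to S X_2 \\ X_2 &\to A_R S \\ X_3 &\to S A_R \\ A_L &\to ( \\ A_R &\to ) \end{align*} which is in Chomsky normal form.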

As you might guess from the construction, grammars in CNF can be quite large. In fact, if you have a CFG of size $n$, an equivalent grammar in CNF can be as large as $O(n^2)$. This means that while CNF grammars have some nice properties for proofs, they don't seem so great for implementation. Another issue is that grammars in CNF can be ambiguous: since inherently ambiguous context-free languages exist, any grammar in CNF generating such a language must itself be ambiguous.

Recognition for grammars

As presented, it's not clear that there's a fast way to parse a string given a grammar $G$. If we have a word $w$ and we want to know whether it's generated by a grammar $G$ in CNF, then we know that any derivation of $w$ has length exactly $2|w| - 1$. This gives us an easy but bad algorithm: enumerate every derivation of length $2|w| - 1$ and check whether any of them produces $w$. Of course, this gives us exponentially many derivations to check.
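One way to see the $2|w| - 1$ count: in a parse tree for a word of length $n$ under a CNF grammar, there are $n$ leaves, each produced by a terminal production $A \to a$, and the productions of the form $A \to BC$ form a full binary tree with $n$ leaves, so there are $n - 1$ of them: \[ \underbrace{n}_{\text{steps } A \to a} + \underbrace{n - 1}_{\text{steps } A \to BC} = 2n - 1. \]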

However, we can exploit the predictable structure of a CNF grammar by considering each substring of $w$. For instance, we know that every terminal symbol of $w$ is generated directly by some production $A \to a$, and every other production has exactly two variables on its right-hand side, so any longer substring must split into two pieces, each generated by one of those two variables.

This lends itself to a dynamic programming algorithm, and such an algorithm was independently proposed by three different individuals, which gives the algorithm its name: the Cocke–Younger–Kasami (CYK) algorithm.

    \begin{algorithm}
    \begin{algorithmic}
    \PROCEDURE{cyk}{$G = (V,\Sigma,P,S), w=a_1 a_2 \cdots a_n$}
        \FOR{$i$ from 1 to $n$}
            \STATE $T[i,i] = \{A \in V \mid A \to a_i \in P\}$
        \ENDFOR
        \FOR{$m$ from 1 to $n-1$}
            \FOR{$i$ from 1 to $n-m$}
                \STATE $j = i+m$
                \STATE $T[i,j] = \emptyset$
                \FOR{$k$ from $i$ to $j-1$}
                    \STATE $T[i,j] = T[i,j] \cup \{A \in V \mid A \to BC \in P \wedge B \in T[i,k] \wedge C \in T[k+1,j]\}$
                \ENDFOR
            \ENDFOR
        \ENDFOR
    \RETURN $S \in T[1,n]$
    \ENDPROCEDURE
    \end{algorithmic}
    \end{algorithm}

Given a CFG $G$ in Chomsky Normal Form and string $w$ of length $n$, we can determine whether $w \in L(G)$ in $O(n^3 |G|)$ time.

This algorithm takes $O(n^3)$ time since it has three nested loops: for each substring length $\ell$ from 1 to $n$, we consider each substring of length $\ell$ (of which there are $O(n)$), and for each we consider all of the ways to split it into two pieces (again, $O(n)$); scanning the productions of $G$ at each split contributes the extra $|G|$ factor. This puts us into polynomial time, which is much better than our "try every derivation" approach. Of course, there are some issues with this.
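For concreteness, here is a small Python sketch of the procedure above. The dictionary representation of the grammar and the name cyk_parse are just choices made for this illustration, not anything fixed by the lecture.

    def cyk_parse(grammar, start, w):
        """CYK recognizer for a grammar in Chomsky normal form.

        `grammar` maps each variable to a list of right-hand sides: a terminal
        production A -> a is stored as the string "a", and a binary production
        A -> BC is stored as the pair ("B", "C").  Returns True exactly when
        `start` derives `w`.
        """
        n = len(w)
        if n == 0:
            # In CNF only the start variable may derive the empty string.
            return "" in grammar.get(start, [])
        # T[i][j] is the set of variables generating w[i..j] (0-indexed, inclusive).
        T = [[set() for _ in range(n)] for _ in range(n)]
        for i, a in enumerate(w):
            for A, rhss in grammar.items():
                if a in rhss:                      # productions A -> a_i
                    T[i][i].add(A)
        for m in range(1, n):                      # m = substring length - 1
            for i in range(n - m):
                j = i + m
                for k in range(i, j):              # split w[i..j] into w[i..k] and w[k+1..j]
                    for A, rhss in grammar.items():
                        for rhs in rhss:
                            if (isinstance(rhs, tuple)
                                    and rhs[0] in T[i][k]
                                    and rhs[1] in T[k + 1][j]):
                                T[i][j].add(A)
        return start in T[0][n - 1]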

First, as a side note, Leslie Valiant showed in 1975 that computing the CYK table can be reduced to matrix multiplication, which at the time that most current formal languages books were printed had a time complexity of roughly $O(n^{2.376})$, due to Coppersmith and Winograd in 1990. This stood until about 2010 (i.e. 20 years later), when a series of results by Stothers, Vassilevska Williams, and Le Gall improved the exponent to about $2.373$. These algorithms are not actually practical, but the connection to matrix multiplication gives a sense of where the barrier to faster parsing currently sits. The consequence of this is that we still don't really know whether general context-free parsing can be done in, say, $O(n^2)$ time.

The more immediate problem is that $O(n^3)$ is still really big, if you think about your typical large software codebase, the amount of time it takes to build a project, and how mad everyone gets when you break the build. Ideally, we would like something that runs in linear time; in other words, something where we can essentially read our input once and parse it correctly without having to go back over it. This can't be done for all context-free languages, in part because of the inherent ambiguity of some languages, so instead attention is restricted to various subclasses of context-free languages.

Let's walk through an example.

Consider the following grammar $G$ in CNF. \begin{align*} S &\rightarrow AB \mid b \\ A &\rightarrow CB \mid AA \mid a \\ B &\rightarrow AS \mid b \\ C &\rightarrow BS \mid c \end{align*}

We want to determine whether the word $cabab$ is generated by $G$. We will store our results in a table. First, we consider every substring of length 1 and the productions that could have generated each substring. We record which variables are on the left hand side of these productions.

\[
\begin{array}{c|ccccc}
 & c & a & b & a & b \\
 & 1 & 2 & 3 & 4 & 5 \\
\hline
1 & C & & & & \\
2 & & A & & & \\
3 & & & S,B & & \\
4 & & & & A & \\
5 & & & & & S,B
\end{array}
\]

Next, we consider substrings of length 2. A variable generates a substring of length 2 exactly when it has a production whose two right-hand-side variables generate the two length-1 pieces. For example, entry $1,2$ doesn't get any variables, because the only candidate right-hand side is $CA$ and there is no production of the form $X \rightarrow CA$. On the other hand, $ab$ can be generated by any production whose right-hand side is $AS$ or $AB$, and there are such productions: $S \rightarrow AB$ and $B \rightarrow AS$.

\[
\begin{array}{c|ccccc}
 & c & a & b & a & b \\
 & 1 & 2 & 3 & 4 & 5 \\
\hline
1 & C & \emptyset & & & \\
2 & & A & S,B & & \\
3 & & & S,B & \emptyset & \\
4 & & & & A & S,B \\
5 & & & & & S,B
\end{array}
\]

Next, we consider substrings of length 3. Here, we have to do a bit more work, because a string of length 3, say $xyz$, can be split either as $xy$ and $z$ or as $x$ and $yz$, and we must check both possibilities. For example, the string $cab$ can be split either as $ca$ and $b$ or as $c$ and $ab$. In the first case, $ca$ is generated by no variable, but the second case gives us a valid derivation: $ab$ can be generated by $S$ or $B$, and since $A \to CB \in P$ we have $A \Rightarrow CB \Rightarrow^* cab$.

\[
\begin{array}{c|ccccc}
 & c & a & b & a & b \\
 & 1 & 2 & 3 & 4 & 5 \\
\hline
1 & C & \emptyset & A & & \\
2 & & A & S,B & \emptyset & \\
3 & & & S,B & \emptyset & C \\
4 & & & & A & S,B \\
5 & & & & & S,B
\end{array}
\]

We continue with substrings of length 4. Again, we have to consider all of the ways (now three of them) to split a string of length 4 into two substrings. For example, entry $1,4$ gets $A$, since $caba$ splits as $cab$ and $a$ with $A \in T[1,3]$, $A \in T[4,4]$, and $A \to AA \in P$.

\[
\begin{array}{c|ccccc}
 & c & a & b & a & b \\
 & 1 & 2 & 3 & 4 & 5 \\
\hline
1 & C & \emptyset & A & A & \\
2 & & A & S,B & \emptyset & C \\
3 & & & S,B & \emptyset & C \\
4 & & & & A & S,B \\
5 & & & & & S,B
\end{array}
\]

Finally, we finish with the single substring of length 5, our original string $w$. We have now constructed a table that tells us which variables can generate each substring of $w$. To determine whether $w \in L(G)$, we simply need to check whether the entry corresponding to $w$ itself (i.e. $T[1,5]$) contains $S$. We see that it does, so $w \in L(G)$.

\[
\begin{array}{c|ccccc}
 & c & a & b & a & b \\
 & 1 & 2 & 3 & 4 & 5 \\
\hline
1 & C & \emptyset & A & A & S,B \\
2 & & A & S,B & \emptyset & C \\
3 & & & S,B & \emptyset & C \\
4 & & & & A & S,B \\
5 & & & & & S,B
\end{array}
\]
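To double-check the walkthrough, here is the same computation expressed with the cyk_parse sketch from earlier (again, the dictionary encoding of the grammar is just a choice made for this illustration).

    # The grammar from the worked example, in the representation assumed by
    # the cyk_parse sketch above.
    grammar = {
        "S": [("A", "B"), "b"],
        "A": [("C", "B"), ("A", "A"), "a"],
        "B": [("A", "S"), "b"],
        "C": [("B", "S"), "c"],
    }

    print(cyk_parse(grammar, "S", "cabab"))  # True:  S appears in the top-right entry T[1,5]
    print(cyk_parse(grammar, "S", "cab"))    # False: only A generates cab, per T[1,3]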

The Chomsky–Schützenberger Theorem

Consider the grammar $G$ defined by \[S \rightarrow (S) \mid SS \mid \varepsilon.\] Here, our terminal symbols are $\{(,)\}$. This is another grammar for the language of balanced parentheses. A detailed proof of this is given in Kozen, Chapter 20. We can extend this grammar to multiple sets of parentheses: \[S \rightarrow (_1S)_1 \mid (_2 S )_2 \mid \cdots \mid (_n S )_n \mid SS \mid \varepsilon.\]
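For instance, one derivation of $(())()$ in the single-pair grammar is \[ S \Rightarrow SS \Rightarrow (S)S \Rightarrow ((S))S \Rightarrow (())S \Rightarrow (())(S) \Rightarrow (())(). \]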

For each $n \geq 1$, the language of $n$ sets of balanced parentheses is also called a Dyck language, named for Walther von Dyck, known for his contributions to combinatorial group theory. Dyck languages happen to have some interesting combinatorial and algebraic properties. In the theory of context-free languages, the Dyck languages happen to provide an important characterization.

First, we need the notion of homomorphisms on strings.

A homomorphism (also called a morphism, particularly in combinatorics on words) is a function $h:\Sigma^* \to \Gamma^*$ such that $h(uv) = h(u)h(v)$ for all $u, v \in \Sigma^*$.

Note that this definition implies that $h(\varepsilon) = \varepsilon$ for every homomorphism $h$ and that it is enough to define $h$ for each symbol in $\Sigma$.

Consider the homomorphism $\varphi : \{a,b\}^* \to \{0,1\}^*$ defined by \[\varphi(a) = 0, \varphi(b) = 01.\] Then $\varphi(abaab) = \varphi(a) \varphi(b) \varphi(a) \varphi(a) \varphi(b) = 0010001$.
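As a tiny illustration in code (the function name and the dictionary encoding of $\varphi$ are just choices for this sketch):

    def apply_homomorphism(h, w):
        """Apply a homomorphism h, given by its values on single symbols, to a word w.

        Since a homomorphism is determined by the images of the individual symbols,
        the image of w is the concatenation of those images (and the empty word
        maps to the empty word).
        """
        return "".join(h[a] for a in w)

    phi = {"a": "0", "b": "01"}
    print(apply_homomorphism(phi, "abaab"))  # prints 0010001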

The basic idea behind morphisms is that they map words in $\Sigma^*$ to words in $\Gamma^*$, where $\Sigma$ and $\Gamma$ are alphabets, while preserving the structure of the word. This is really just the same thing as homomorphisms in algebra, recalling that the set of strings over $\Sigma$ is just the free monoid on $\Sigma$ (so string homomorphisms are really just monoid homomorphisms).

Two relatively straightforward things to prove about homomorphisms are the following:

The class of regular languages is closed under homomorphism.

By induction on regular expressions.

The class of context-free languages is closed under homomorphism.

By induction on context-free grammars.
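As a quick illustration of the grammar construction behind the second fact (this particular grammar is not from the lecture, just a standard example): applying a homomorphism to a CFG amounts to replacing every terminal in every right-hand side by its image. Applying the morphism $\varphi$ from the earlier example to the grammar $S \to aSb \mid \varepsilon$ for $\{a^n b^n \mid n \geq 0\}$ gives \[ S \to 0S01 \mid \varepsilon, \] which generates $\varphi(\{a^n b^n \mid n \geq 0\}) = \{0^n (01)^n \mid n \geq 0\}$.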

Using this idea, we can get the following result due to Chomsky and Schützenberger in 1963.

Every context-free language is a homomorphic image of the intersection of a Dyck language and a regular language.

Let $G = (V, \Sigma, P, S)$ be a context-free grammar in Chomsky normal form. We will construct a new grammar $G' = (V, \Gamma, P', S)$ in the following way. For each production $p \in P$, we introduce two pairs of brackets dedicated to $p$ and define a new production $p'$: \begin{align*} \text{if } p = A \to BC, \text{ then } p' &= A \to [^p_1\, B\, ]^p_1\, [^p_2\, C\, ]^p_2, \\ \text{if } p = A \to a, \text{ then } p' &= A \to [^p_1\, ]^p_1\, [^p_2\, ]^p_2. \end{align*}

That is, $\Gamma = \{[_1^p, ]_1^p, [_2^p, ]_2^p \mid p \in P\}$ and $P' = \{p' \mid p \in P\}$. So our new grammar has exactly as many production rules as before, but is over a new alphabet of size $4|P|$.

We've essentially encoded a grammar into a language of balanced parentheses. Let $\mathcal D_\Gamma$ be the language of balanced parentheses over the alphabet of our new grammar (essentially a Dyck language over $2|P|$ sets of parentheses). We observe that $L(G') \subseteq \mathcal D_\Gamma$, since it's pretty clear that our language contains only strings of balanced parentheses. However, $\mathcal D_\Gamma \not\subseteq L(G')$, since not every string of balanced parentheses will show up in our language.

We can pin down the differences between the two: every string derivable in $G'$ satisfies the following properties.

  1. Every $]^p_1$ is followed immediately by $[^p_2$.
  2. No $]^p_2$ is followed immediately by an opening bracket $[$ of any type.
  3. If $p = A \to BC$, then $[^p_1$ is followed immediately by $[_1^q$ for some production $q = B \to \alpha \in P$, and $[^p_2$ is followed immediately by $[_1^r$ for some production $r = C \to \beta \in P$.
  4. If $p = A \to a$, then $[_1^p$ is always followed immediately by $]^p_1$ and $[^p_2$ is always followed immediately by $]^p_2$.
  5. For all strings $x \in \Gamma^*$ such that $A \Rightarrow_{G'}^* x$, the first symbol of $x$ is $[^p_1$ for some production $p = A \to \alpha \in P$.

We observe that the set of strings described by each of these properties is regular (note that there's no claim here about these strings being balanced). Since these are all regular, their intersection must be regular, so we can define the regular language \[R_A = \{x \in \Gamma^* \mid \text{$x$ satisfies (1) through (5)}\}.\] You might wonder how many of these we have to construct, since $R_A$ depends on the variable $A$. But notice that any context-free grammar has only a finite number of variables, so we'll only need finitely many of these $R_A$'s.

We then have that $A \Rightarrow_{G'}^* w \iff w \in \mathcal D_\Gamma \cap R_A$.

All that's left to do is show that there's a homomorphism that we can define that will map words in $\mathcal D_\Gamma \cap R_S$ to $L(G)$. We will define the homomorphism $\varphi: \Gamma^* \to \Sigma^*$ as follows: \[ \varphi([^p_1) = \begin{cases} a & \text{if } p = A \to a \in P, \\ \varepsilon & \text{if } p = A \to BC \in P, \end{cases} \qquad \varphi(]^p_1) = \varphi([^p_2) = \varphi(]^p_2) = \varepsilon \text{ for all } p \in P. \]

We observe that $\varphi$, extended to fix each variable, sends the right-hand side of $p'$ to the right-hand side of $p$, and so $L(G) = \varphi(L(G')) = \varphi(\mathcal D_\Gamma \cap R_S)$.
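As a small sanity check of the construction (this tiny grammar is not from the lecture, just an illustration): take the CNF grammar with productions $p = S \to AB$, $q = A \to a$, and $r = B \to b$. The new productions are \begin{align*} p' &= S \to [^p_1\, A\, ]^p_1\, [^p_2\, B\, ]^p_2, \\ q' &= A \to [^q_1\, ]^q_1\, [^q_2\, ]^q_2, \\ r' &= B \to [^r_1\, ]^r_1\, [^r_2\, ]^r_2, \end{align*} so the word $ab \in L(G)$ corresponds to the bracket string \[ [^p_1\, [^q_1\, ]^q_1\, [^q_2\, ]^q_2\, ]^p_1\, [^p_2\, [^r_1\, ]^r_1\, [^r_2\, ]^r_2\, ]^p_2 \in L(G'), \] and applying $\varphi$ (which sends $[^q_1 \mapsto a$, $[^r_1 \mapsto b$, and every other bracket to $\varepsilon$) recovers $ab$.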

If you were not convinced by the claim that $A \Rightarrow_{G'}^* w \iff w \in \mathcal D_\Gamma \cap R_A$, here is part of the proof.

$A \Rightarrow_{G'}^* w \iff w \in \mathcal D_\Gamma \cap R_A$.

($\Leftarrow$) Let $x \in \mathcal D_\Gamma \cap R_A$. We will prove that there is a derivation $A \Rightarrow_{G'}^* x$ by induction on $|x|$. From the properties above, we have that $x = [^p_1 y ]^p_1 [^p_2 z ]^p_2$ for some strings $y,z \in \Gamma^*$ and production $p = A \to \alpha \in P$: property (5) says that $x$ begins with $[^p_1$ for such a $p$; since $x$ is balanced, the matching $]^p_1$ splits off a balanced string $y$; property (1) says $[^p_2$ comes next; and property (2) forces $x$ to end immediately after the matching $]^p_2$, since any balanced leftover would have to begin with an opening bracket. There are two cases, depending on the form of $\alpha$.

If $\alpha = a$ for some $a \in \Sigma$, then property (4) forces $y = z = \varepsilon$, so $x = [^p_1 ]^p_1 [^p_2 ]^p_2$ is exactly the right-hand side of $p'$ and $A \Rightarrow_{G'} x$ in one step.

If $\alpha = BC$, then $y$ and $z$ are balanced and, by property (3), $y$ begins with $[^q_1$ for some production $q$ with left-hand side $B$ and $z$ begins with $[^r_1$ for some production $r$ with left-hand side $C$; checking the remaining properties, $y \in \mathcal D_\Gamma \cap R_B$ and $z \in \mathcal D_\Gamma \cap R_C$. Since $|y|, |z| < |x|$, the induction hypothesis gives $B \Rightarrow_{G'}^* y$ and $C \Rightarrow_{G'}^* z$, and therefore $A \Rightarrow_{G'} [^p_1 B ]^p_1 [^p_2 C ]^p_2 \Rightarrow_{G'}^* x$.

Thus, there is a derivation $A \Rightarrow_{G'}^* x$.

The $(\Rightarrow)$ direction is left as an exercise.

The Chomsky–Schützenberger theorem tells us that every context-free language can be described by a regular language, a Dyck language, and a morphism. Intuitively, what this tells us is that what really separates regular languages from context-free languages is the ability to encode "nestedness", in the sense of brackets or trees.

While a Dyck language typically refers to a language of $n$ types of balanced parentheses, one can also interpret this as a language in which the alphabet symbols are matched or paired. For example, a language that balances $\{(,),[,]\}$ can be considered the same as a language that balances $\{a,\overline a, b, \overline b\}$. Some treatments of the Chomsky–Schützenberger theorem use this interpretation.

While we're probably familiar with Chomsky, Schützenberger is not quite as well known outside the formal languages community. Marcel-Paul Schützenberger was a French mathematician who was one of Chomsky's early collaborators, and together they proved many important results about context-free languages, including having a part in what we now call the Chomsky hierarchy. Besides his work with Chomsky, Schützenberger produced many influential results in formal languages and combinatorics on words, and many of his students became foundational figures in formal language theory in their own right.