CMSC 28000 — Lecture 19

A side note on determinism

At this point, we might want to play around with our model in the same way that we did with finite automata. One observation we've made so far is that we began with a nondeterministic machine model for the PDA. An obvious question is what happens when we try to restrict the machine to be deterministic. The next question is how to define this.

A deterministic pushdown automaton (DPDA) is a PDA $M = (Q,\Sigma,\Gamma,\delta,q_0,Z_0,F)$ with the additional conditions:

  1. for every $q \in Q$, $a \in \Sigma \cup \{\varepsilon\}$, and $A \in \Gamma$, the set $\delta(q,a,A)$ contains at most one element, and
  2. for every $q \in Q$ and $A \in \Gamma$, if $\delta(q,\varepsilon,A) \neq \emptyset$, then $\delta(q,a,A) = \emptyset$ for all $a \in \Sigma$.

The class of languages accepted by DPDAs by final state is called the class of deterministic context-free languages (DCFLs).

It's clear from this definition that a PDA can recognize any language accepted by a DPDA. But note that a DPDA is not simply a PDA with the codomain of the transition function restricted to $Q \times \Gamma^*$—we are allowed both to have undefined transitions (i.e. transitions that end up in $\emptyset$) as well as $\varepsilon$-transitions.

We should parse out what these restrictions imply. The first condition says that for any particular state, input symbol, and stack symbol, there is at most one possible transition. In this case, there is no ambiguity: the next move, if there is one, is uniquely determined by the state, input symbol, and stack symbol.

The second condition concerns $\varepsilon$-transitions. We allow $\varepsilon$-transitions because it is possible that we want to take a transition based only on the symbol on the top of the stack, without consuming any input. In this case, if we have an $\varepsilon$-transition on some state and stack symbol, we disallow any other transitions on the same state and stack symbol. Without this condition, we could sneak in some nondeterminism by making a choice between consuming an input symbol or not.
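To make these two conditions concrete, here is a minimal sketch in Python of a check that a transition table satisfies them. The dictionary encoding of $\delta$ is my own choice for illustration: keys are (state, symbol, stack top) triples, where the symbol "" stands for $\varepsilon$, and values are sets of (state, push string) pairs.

    # delta: dict mapping (state, symbol, stack_top) -> set of
    # (new_state, push_string) pairs; symbol "" encodes epsilon.
    def is_deterministic(delta, states, sigma, gamma):
        for q in states:
            for A in gamma:
                # Condition 1: at most one choice for each (q, a, A),
                # where a ranges over the input symbols and epsilon.
                for a in list(sigma) + [""]:
                    if len(delta.get((q, a, A), set())) > 1:
                        return False
                # Condition 2: an epsilon-transition on (q, A) rules out
                # every transition on (q, a, A) with a an input symbol.
                if delta.get((q, "", A)):
                    if any(delta.get((q, a, A)) for a in sigma):
                        return False
        return True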

Here is a PDA.

You might recognize it. This is the PDA for $\{a^n b^n \mid n \geq 0\}$. However, it is not a DPDA, since both $\delta(q_0,\varepsilon,Z_0) \neq \emptyset$ and $\delta(q_0,a,Z_0) \neq \emptyset$, which violates the second condition. However, this is easy to fix.

Now, we have a DPDA. A natural question to ask, just as with NFAs, is whether there is a way to determinize every PDA into a DPDA.
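As an illustration of how such a deterministic machine runs, here is a small Python sketch of a DPDA for $\{a^n b^n \mid n \geq 0\}$, accepting by final state. The state names and the exact transition table are my own choices for illustration and not necessarily those of the machine in the figure.

    # A DPDA for { a^n b^n : n >= 0 }, accepting by final state.
    # delta maps (state, symbol, stack_top) -> (new_state, push_string);
    # symbol "" is an epsilon-move, and the leftmost character of
    # push_string ends up on top of the stack.
    delta = {
        ("q0", "a", "Z"): ("q1", "AZ"),   # first a: push A over the bottom marker
        ("q1", "a", "A"): ("q1", "AA"),   # further a's: push another A
        ("q1", "b", "A"): ("q2", ""),     # first b: pop an A
        ("q2", "b", "A"): ("q2", ""),     # further b's: pop an A
        ("q2", "", "Z"): ("q3", "Z"),     # all A's matched: move to accept
    }
    accepting = {"q0", "q3"}              # q0 accepts the empty word

    def run(word):
        state, stack, i = "q0", ["Z"], 0  # stack top is the last element
        while True:
            top = stack[-1] if stack else None
            if i < len(word) and (state, word[i], top) in delta:
                state, push = delta[(state, word[i], top)]
                i += 1
            elif (state, "", top) in delta:     # epsilon-move
                state, push = delta[(state, "", top)]
            else:
                break
            stack.pop()
            stack.extend(reversed(push))        # leftmost symbol on top
        return i == len(word) and state in accepting

    assert run("") and run("ab") and run("aaabbb")
    assert not run("a") and not run("abb") and not run("ba")

Note that this transition table satisfies both determinism conditions: no key admits two moves, and the only $\varepsilon$-move, on $(q_2, Z)$, shares its state and stack symbol with no other transition.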

The DPDA model is due to Seymour Ginsburg and Sheila Greibach in 1966. This puts the development of the DPDA a few years after the standard PDA model. Unlike with finite automata, there are some significant differences between DCFLs and CFLs.

Here is one quick distinction. Recall our PDA construction that computed a leftmost derivation for any given CFG $G$ as part of our proof that PDAs and CFGs are equivalent. If we could construct an equivalent DPDA, this would suggest a deterministic way of computing a leftmost derivation for any word under any CFG. But the existence of inherently ambiguous languages suggests otherwise.

One important property that they establish is that DCFLs are closed under complement. We'll see shortly that this is not the case for context-free languages in general, so this is another property that implies that the classes of DCFLs and CFLs are different.

So what is the point of DCFLs if they're not as powerful as CFLs? As with many things in computer science, we can talk about tradeoffs. Recall that parsing with an arbitrary context-free grammar can be done in cubic time via CYK, and we don't expect to do significantly better, since Lee showed that substantially faster context-free parsing would yield substantially faster Boolean matrix multiplication. Along with CYK, we've also seen an example of using a PDA to parse with a grammar, when we showed that CFGs were equivalent to PDAs. In that case, we used a top-down method, starting with the start symbol $S$ and nondeterministically choosing productions to construct a leftmost derivation.
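As a refresher, here is a minimal sketch of a CYK recognizer for a grammar in Chomsky normal form. The dictionary encoding of the productions and the toy grammar in the usage lines are my own choices for illustration.

    # binary: dict mapping (B, C) -> set of variables A with a rule A -> B C
    # unary: dict mapping a terminal a -> set of variables A with a rule A -> a
    def cyk(word, binary, unary, start="S"):
        n = len(word)
        if n == 0:
            return False  # assume the grammar has no epsilon-productions
        # table[i][j] holds the variables deriving word[i : i + j + 1]
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, a in enumerate(word):
            table[i][0] = set(unary.get(a, set()))
        for span in range(2, n + 1):              # length of the substring
            for i in range(n - span + 1):         # start of the substring
                for k in range(1, span):          # split point
                    for B in table[i][k - 1]:
                        for C in table[i + k][span - k - 1]:
                            table[i][span - 1] |= binary.get((B, C), set())
        return start in table[0][n - 1]

    # A CNF grammar for { a^m b^m : m >= 1 }:
    # S -> AT | AB, T -> SB, A -> a, B -> b
    binary = {("A", "T"): {"S"}, ("A", "B"): {"S"}, ("S", "B"): {"T"}}
    unary = {"a": {"A"}, "b": {"B"}}
    assert cyk("aabb", binary, unary) and not cyk("abab", binary, unary)

The three nested loops over span, start position, and split point are where the cubic running time comes from.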

This is where DCFLs come in. DCFLs are a class of languages that have associated grammars that can be parsed in linear time. These are the $\mathrm{LR}(k)$ grammars, due to Donald Knuth in 1965.

The name here gives away our strategy. In $\mathrm{LR}(k)$, L stands for a left-to-right scan of the input, R stands for producing a rightmost derivation (in reverse), and $k$ is the number of symbols we're allowed to "look ahead" in order to decide what to do. Grammars with larger lookahead $k$ use the same ideas and can be parsed in a similar way.
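To see what a rightmost derivation has to do with a left-to-right scan, consider the toy grammar $S \to AB$, $A \to a$, $B \to b$ (my own small example). An LR-style shift-reduce parser reading the word $ab$ shifts $a$, reduces it to $A$, shifts $b$, reduces it to $B$, and finally reduces $AB$ to $S$. Reading this sequence of reductions backwards gives exactly the rightmost derivation $S \Rightarrow AB \Rightarrow Ab \Rightarrow ab$.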

Knuth gives the following pair of results which links $\mathrm{LR}(k)$ grammars and DCFLs.

If $G$ is an $\mathrm{LR}(k)$ grammar, then there is a deterministic pushdown automaton that accepts $L(G)$. If $L$ is a deterministic context-free language, then there is an $\mathrm{LR}(1)$ grammar that generates $L$.

Practically speaking, the grammars of most programming languages are $\mathrm{LR}(1)$ or close to it. YACC generates parsers for $\mathrm{LALR}(1)$ grammars, a class which sits in between $\mathrm{LR}(0)$ and $\mathrm{LR}(1)$.

The Pumping Lemma for context-free languages

Now, we discover why, several weeks ago, when we discussed the pumping lemma, I wrote "pumping lemma for regular languages". Yes, it is because we are going to be talking about the pumping lemma for context-free languages today and have our first taste of the world beyond context-free languages.

Let $L \subseteq \Sigma^*$ be a context-free language. Then there exists a positive integer $n$ such that for every word $w \in L$ with $|w| \geq n$, there exists a factorization $w = uvxyz$ with $u,v,x,y,z \in \Sigma^*$ such that

  1. $vy \neq \varepsilon$,
  2. $|vxy| \leq n$,
  3. $uv^i xy^i z \in L$ for all $i \in \mathbb N$.
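Before proving the lemma, it may help to check it against a language we know is context-free (a small example of my own): for $L = \{a^k b^k \mid k \geq 0\}$, any $n \geq 2$ works. Given a word $w = a^j b^j$ with $|w| \geq n$, take $u = a^{j-1}$, $v = a$, $x = \varepsilon$, $y = b$, and $z = b^{j-1}$. Then $vy = ab \neq \varepsilon$, $|vxy| = 2 \leq n$, and $uv^i xy^i z = a^{j-1+i} b^{j-1+i} \in L$ for all $i \in \mathbb N$.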

This pumping lemma is due to Bar-Hillel, Perles, and Shamir in 1961. As it turns out, there are a lot of language classes for which there are pumping lemmas (deterministic context-free languages, indexed languages, regular tree languages).

The reasoning behind the pumping lemma for context-free languages is in the same vein as the one for regular languages. Since the description of any context-free grammar is finite, for a sufficiently long word, some aspect of its derivation must be repeated. For regular languages, the analogous fact was that some state must show up more than once on a sufficiently long path through a DFA. For context-free languages, it will be the fact that some variable appears more than once on a path in a parse tree.

Since $L$ is context-free, there must be a CFG $G$ in Chomsky normal form that generates it. Let $m$ be the number of variables in $G$. Then the pumping length for $L$ will be $n = 2^m$.

Consider a word $w \in L$ with $|w| \geq n = 2^m$ and fix a parse tree $T$ for $w$. We saw in a problem set that there must be a path in $T$ from the root to a leaf with at least $m+1$ internal nodes.

Pick a longest such path and consider the last $m+1$ internal nodes along it, those closest to the leaf. These nodes are labelled by variables, and since there are only $m$ variables, some variable $X$ must appear at least twice among them.

Let $T_1$ and $T_2$ be the subtrees of $T$ rooted at two such occurrences of $X$, where $T_2$ is rooted at the occurrence closer to the leaf, so that $T_2$ is a proper subtree of $T_1$. Furthermore, since we chose a longest path and the root of $T_1$ is among the last $m+1$ internal nodes on it, every path from the root of $T_1$ to a leaf has at most $m+1$ internal nodes, and therefore $T_1$ has at most $2^m = n$ leaf nodes.

Let $x$ be the word generated by $T_2$, and let $v$ and $y$ be the words formed by the leaves of $T_1$ to the left and to the right of $T_2$, respectively, so that $T_1$ is a parse tree for the word $vxy$. Similarly, let $u$ and $z$ be the words formed by the leaves of $T$ to the left and to the right of $T_1$, so that $T$ is a parse tree for the word $w = uvxyz$.

Now we need to show that $uvxyz$ is a factorization for $w$ such that the conditions of the pumping lemma hold. First, to see that $|vxy| \leq n$, we just observe that $T_1$ can have at most $2^m = n$ leaf nodes.

Next, to see that $vy \neq \varepsilon$, we note that since both $T_1$ and $T_2$ are rooted at $X$, we can construct a parse tree in which $T_1$ is removed and replaced by $T_2$. This is a parse tree for $uxz$. However, recall that in a grammar in CNF, every parse tree for a given word has exactly the same size, since a word of length $\ell$ has a derivation of exactly $2\ell - 1$ steps. Since $T_2$ is a proper subtree of $T_1$, the new parse tree is strictly smaller than $T$, so we must have $uxz \neq uvxyz$ and therefore $vy \neq \varepsilon$.

Finally, to see that $uv^ixy^iz \in L$ for all $i \in \mathbb N$, we first note that we have already shown $uv^0 xy^0 z = uxz \in L$ above, and the case $i = 1$ is just $w$ itself. For $i \geq 2$, we can perform the opposite substitution: replacing $T_2$ with a copy of $T_1$ gives us a parse tree for $uv^2xy^2z$. We can then repeat this process any number of times to obtain a parse tree for $uv^ixy^iz$.
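Here is a tiny concrete instance of this surgery (my own example, using a grammar that is not in CNF but that makes the idea visible). Take the grammar $S \to aSb \mid ab$ and the word $w = aaabbb$, whose parse tree corresponds to the derivation $S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aaabbb$. The variable $S$ repeats along the path from the root, so take $T_1$ to be the subtree rooted at the middle occurrence and $T_2$ the subtree rooted at the innermost occurrence. This gives the factorization $u = a$, $v = a$, $x = ab$, $y = b$, $z = b$. Replacing $T_1$ by $T_2$ yields a parse tree for $uxz = aabb$, while replacing $T_2$ by a copy of $T_1$ yields one for $uv^2xy^2z = aaaabbbb$, and both words are indeed in $\{a^k b^k \mid k \geq 1\}$.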

Our strategy is very similar to the one we employed for regular languages. We assume that our language is context-free, and then show that we can find a word that cannot satisfy the conditions of the pumping lemma, by following our template from before.

  1. The adversary chooses a pumping constant $n > 0$.
  2. We choose a word $w \in L$ with $|w| \geq n$.
  3. The adversary chooses a factorization $w = uvxyz$ with $|vxy| \leq n$ and $vy \neq \varepsilon$.
  4. We choose $i \in \mathbb N$.

Then we win if $uv^ixy^iz \not \in L$.

However, there are a few more things to be mindful of when carrying out these proofs for context-free languages. The main difference between proofs using the pumping lemma for CFLs and the pumping lemma for regular languages is that there is much more we have to do to ensure that we've considered all possible decompositions. The other difference is that the factorization into five factors rather than just three makes explicitly defining the factors much more of a hassle.

Let $L = \{a^m b^m c^m \mid m \geq 0\}$. Then $L$ is not context-free.

Suppose that $L$ is context-free and let $n$ be the pumping length for $L$. Let $w = a^n b^n c^n$. Clearly, $w \in L$ and $|w| = 3n \geq n$. Now, we must consider all factorizations of $w = uvxyz$ such that $|vxy| \leq n$ and $vy \neq \varepsilon$. There are two cases to consider.

  1. $vxy = a^s b^t$ such that $0 < s+t \leq n$. Observe that since $|vxy| \leq n$, it cannot contain any $c$'s, while $vy \neq \varepsilon$ consists only of $a$'s and $b$'s. Therefore, choosing $i > 1$, we have that either $|uv^ixy^iz|_a > |uv^ixy^iz|_c$ or $|uv^ixy^iz|_b > |uv^ixy^iz|_c$ and thus $uv^ixy^iz \not \in L$. By the same argument, for $vxy = b^s c^t$ we arrive at the fact that either $|uv^ixy^iz|_b > |uv^ixy^iz|_a$ or $|uv^ixy^iz|_c > |uv^ixy^iz|_a$.
  2. $vxy = b^s$ such that $0 < s \leq n$ and $vy \neq \varepsilon$. Again, choosing $i > 1$, we have that $|uv^ixy^iz|_b > |uv^ixy^iz|_c$ and thus $uv^ixy^iz \not \in L$. An analogous argument holds for $vxy = a^s$ and $vxy = c^s$.

Thus, every factorization of $w = uvxyz$ fails to satisfy the pumping lemma, contradicting our assumption, and therefore $L$ is not context-free.
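As a sanity check on this kind of case analysis, here is a brute-force Python sketch (the function names and the tiny choice of $n$ are mine, purely for illustration) that plays the adversary game from above: it enumerates every factorization $w = uvxyz$ with $|vxy| \leq n$ and $vy \neq \varepsilon$ and verifies that each one fails to pump for some small $i$.

    def in_L(s):
        # membership test for L = { a^m b^m c^m : m >= 0 }
        m = len(s) // 3
        return s == "a" * m + "b" * m + "c" * m

    def refutes_pumping(w, n, in_lang, max_i=3):
        # For every factorization w = uvxyz with |vxy| <= n and vy != epsilon,
        # check that u v^i x y^i z leaves the language for some 0 <= i <= max_i.
        for p in range(len(w) + 1):                  # u = w[:p]
            limit = min(p + n, len(w))               # vxy must fit in w[p:limit]
            for q in range(p, limit + 1):            # v = w[p:q]
                for r in range(q, limit + 1):        # x = w[q:r]
                    for s in range(r, limit + 1):    # y = w[r:s]
                        u, v, x, y, z = w[:p], w[p:q], w[q:r], w[r:s], w[s:]
                        if not v and not y:
                            continue                 # need vy != epsilon
                        if all(in_lang(u + v * i + x + y * i + z)
                               for i in range(max_i + 1)):
                            return False             # this factorization pumps
        return True

    n = 3
    w = "a" * n + "b" * n + "c" * n
    print(refutes_pumping(w, n, in_L))               # prints True

Of course, a run like this for one small $n$ proves nothing on its own, since the pumping constant is not ours to choose, but it is a handy way to catch a missed case in a factorization analysis.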

Note that if we wanted to be more specific, we would have to list five different cases in the above proof. Luckily, each of these cases folds nicely into just two broad cases. All of this makes the choice of the word $w$ to be pumped much more important for avoiding a lot of work.

Let $L = \{ww \mid w \in \{a,b\}^* \}$. Then $L$ is not context-free.

Suppose $L$ is context-free and let $n$ be the pumping length for $L$. Let $w = a^n b^n a^n b^n$. Then $w = (a^n b^n)^2 \in L$ and $|w| = 4n \geq n$. Now, we must consider factorizations $w = uvxyz$ with $|vxy| \leq n$ and $vy \neq \varepsilon$. There are two cases.

  1. $vxy$ lies entirely in either the first or the second half of $w$. Suppose first that $vxy$ lies in the first half $a^n b^n$. Here it is convenient to pump down: taking $i = 0$ simply deletes the letters of $v$ and $y$ from this half, so $uxz = a^{n-s} b^{n-t} a^n b^n$ for some $s, t \geq 0$ with $1 \leq s+t \leq n$. One can check that a word of this form with $s + t \geq 1$ is never a square, so $uxz \not \in L$. The case where $vxy$ lies entirely in the second half of $w$ is symmetric.
  2. $vxy$ straddles both halves of $w$. Since $|vxy| \leq n$, $vxy$ lies entirely within the middle factor $b^n a^n$. Taking $i = 0$ again, we get $uxz = a^n b^{n-j} a^{n-k} b^n$ for some $j, k \geq 0$ with $1 \leq j+k \leq n$. A word of this form is a square only when $j = k = 0$, so $uxz \not \in L$.

Thus, every factorization of $w = uvxyz$ fails to satisfy the pumping lemma, contradicting our assumption, and therefore $L$ is not context-free.
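Assuming the brute-force sketch refutes_pumping from the previous example is in scope, the same check can be run here with an illustrative small $n$:

    def in_copy(s):
        # membership test for L = { ww : w in {a, b}* }
        h = len(s) // 2
        return len(s) % 2 == 0 and s[:h] == s[h:]

    n = 3
    w = ("a" * n + "b" * n) * 2
    print(refutes_pumping(w, n, in_copy))   # prints True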

This is the "copy" language that we saw when we were discussing "real" regular expressions vs "theoretical" regular expressions. Being able to identify "copies" within a word is something that we can do with POSIX regular expressions via backreferences. We can see here that backreferences actually take us even beyond the context-free languages. The Câmpeanu, Salomaa, and Yu paper that I mentioned earlier shows us that the class of languages that can be matched POSIX regular expressions is actually incomparable with the context-free languages: there are CFLs that can't be matched by POSIX regular expressions and there are POSIX regular expression languages that aren't context-free.

Again, note that in the above proof, if we were to write out each case exhaustively, we would end up with as many as seven cases to consider. Part of learning how to go about these proofs is understanding the process well enough to identify all of the different cases and to see which ones can be grouped together.

The pumping lemma for regular languages (and various other results) pointed to the idea that when we think about regular languages, we should think about finiteness and the property that regular languages are characterized by only being able to remember or keep track of finitely many things. The same idea applies when thinking about context-free languages. Recall from our discussions of the Chomsky-Schützenberger theorem the idea of properly nested brackets. One of the key ideas behind context-free languages is symmetry: notice that the language of strings $xx^R$ is context-free, while the language of strings $xx$ is not.
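For instance, the grammar $S \to aSa \mid bSb \mid \varepsilon$ generates $\{xx^R \mid x \in \{a,b\}^*\}$: the two halves are matched from the outside in, which is exactly the last-in-first-out order a stack can track, whereas the two halves of $xx$ must match position by position from left to right, which, as we have just proved, is beyond the reach of a single stack.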