CMSC 28000 — Lecture 16

Pushdown automata

So we know that CFGs can generate languages that aren't regular, but how exactly are context-free languages more powerful than regular languages? It's helpful to refer to a family of languages we've encountered briefly: the language of balanced parentheses. While we saw the grammar for balancing one type of parentheses, we can define a family of languages where we have, say, $n$ types of parentheses and we want all the strings of those parentheses to be properly balanced.

Such a family of languages is called the Dyck languages, for Walther von Dyck, known for his contributions to combinatorial group theory. Dyck languages happen to have some interesting combinatorial and algebraic properties. In formal language theory, the Dyck languages give an important characterization for the context-free languages.
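As a quick preview of the stack-based thinking this lecture develops, here is a membership test for a Dyck language with two bracket types; it is a sketch of our own (the particular bracket alphabet is our choice), not something from the lecture, but it uses exactly the stack discipline we will formalize below.

```python
# Membership test for the Dyck language over two bracket types.
# Opening brackets are pushed; each closing bracket must match the
# most recently pushed opener.

PAIRS = {")": "(", "]": "["}

def in_dyck(w):
    stack = []
    for c in w:
        if c in "([":
            stack.append(c)              # push an opening bracket
        elif c in PAIRS:
            if not stack or stack.pop() != PAIRS[c]:
                return False             # unmatched or mismatched closer
        else:
            return False                 # not a bracket symbol at all
    return not stack                     # accept iff every opener was closed

print(in_dyck("([])[]"))  # True
print(in_dyck("([)]"))    # False
```

Note that the string `([)]` is rejected even though it has equally many openers and closers of each type: balance alone is not enough, the nesting has to be respected.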

Theorem (Chomsky–Schützenberger). Every context-free language is a homomorphic image of the intersection of a Dyck language and a regular language.

The theorem is due to Chomsky and Schützenberger in 1963. A proof of the Chomsky–Schützenberger theorem can be found in Kozen, Supplementary Lecture G.

Recall that a homomorphism is a function $h:\Sigma^* \to \Gamma^*$ such that $h(uv) = h(u)h(v)$ for all $u, v \in \Sigma^*$. The basic idea behind morphisms is that they map words in $\Sigma^*$ to words in $\Gamma^*$, where $\Sigma$ and $\Gamma$ are alphabets, while preserving the structure of the word.
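As a small illustration (the particular letter-to-word mapping below is our own example), fixing $h$ on the single letters determines it on every word, since $h(uv) = h(u)h(v)$:

```python
# A homomorphism h : {a,b}* -> {0,1}* is determined by its values on
# single letters; applying it to a word just concatenates those images.

h = {"a": "01", "b": "1"}

def apply_h(w):
    return "".join(h[c] for c in w)

u, v = "ab", "ba"
print(apply_h(u + v) == apply_h(u) + apply_h(v))  # True: h preserves concatenation
print(apply_h("abba"))  # 011101
```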

The Chomsky–Schützenberger theorem tells us that every context-free language can be described by a regular language, a Dyck language, and a morphism. Since regular languages are closed under homomorphisms (you can try proving this), what this tells us is that what really separates regular languages from context-free languages is the ability to express "nestedness", in the sense of brackets or trees. We will explore this idea by returning to the notion of a machine.

Recall that a finite automaton is limited in two important ways: the input can only be read once and it only has a finite amount of memory. We will be augmenting the NFA to take care of the second problem, by giving it access to a stack with unlimited memory.

Recall that a stack is a data structure in which one can only add items to the top (a push) and remove the item currently at the top (a pop).

These are called stacks because of the analogy of piling up a stack of plates, books, papers, or to-do items. You add items to your metaphorical stack by putting stuff on top and you can only remove items from your stack by taking things off the top.

The idea is that when you read a symbol on the input tape, you also take a symbol off of the stack. The transition depends on three pieces of information: the symbol you read on the input tape, the current state, and the symbol on the top of the stack. These determine what the next state is and what gets put onto the stack. So although the machine now has access to unlimited memory, there are restrictions on how it is able to access this memory.

Going back to the Chomsky–Schützenberger characterization of context-free languages, we can see that a PDA is "almost" an NFA, since all that's changed is the addition of a stack. Of course, the stack is what gives our otherwise-NFA the ability to process nestedness, since stacks are last-in first-out structures.

Many of the fundamental properties of pushdown automata can be attributed to Schützenberger (1963) and Chomsky himself, though many of these discoveries were independently published by others (Oettinger 1961, Evey 1963).

Marcel-Paul Schützenberger was a French mathematician who was one of Chomsky's early collaborators and together they proved many important results about context-free languages. Besides his work with Chomsky, Schützenberger produced many influential results in formal languages and combinatorics on words.

Around this time, research into formal languages and automata theory was picking up and there were all sorts of investigations into modifications and restrictions on grammars and automata. The pushdown automaton is one result of these kinds of questions being pursued.

As an aside, one can think about what other sorts of crazy contraptions you can stick onto a finite automaton. Of course, this really amounts to the question of changing how we're allowed to access unlimited amounts of memory. Since our stack has infinite memory, it's clear that what's limiting its power is how we're allowed to access this memory. What happens if we add more restrictions? What happens if we loosen some restrictions?

A pushdown automaton (PDA) is a 7-tuple $M = (Q,\Sigma,\Gamma,\delta,q_0,Z_0,F)$, where

- $Q$ is a finite set of states,
- $\Sigma$ is the input alphabet,
- $\Gamma$ is the stack alphabet,
- $\delta : Q \times (\Sigma \cup \{\varepsilon\}) \times \Gamma \to 2^{Q \times \Gamma^*}$ is the transition function, where each set $\delta(q,a,X)$ is finite,
- $q_0 \in Q$ is the initial state,
- $Z_0 \in \Gamma$ is the initial stack symbol, and
- $F \subseteq Q$ is the set of accepting states.

There are a few features of the transition function that should be mentioned. First, the transition function is nondeterministic. Secondly, on a transition, the machine takes the top symbol off of the stack but can push multiple symbols onto the stack, as long as it pushes only finitely many.

Because of the added complexity of representing the stack, some like to treat $\delta$ as a relation instead, which is what Kozen does. In this view, $\delta \subseteq (Q \times (\Sigma \cup \{\varepsilon\}) \times \Gamma) \times (Q \times \Gamma^*)$ is a relation on $Q \times (\Sigma \cup \{\varepsilon\}) \times \Gamma$ and $Q \times \Gamma^*$. So a transition $$((q,a,X),(q',Y_1 Y_2 \cdots Y_k)) \in \delta$$ is equivalent to saying that $(q',Y_1 Y_2 \cdots Y_k) \in \delta(q,a,X)$, where $q,q' \in Q$, $a \in \Sigma \cup \{\varepsilon\}$, and $X, Y_1, \dots, Y_k \in \Gamma$. But, functions are really just special kinds of relations, so in the end, this is just a difference in notational preference (personally speaking, this is the first textbook I've seen that does this).

As with finite automata, we can enumerate the parts of the PDA.

We write transitions in the following way: $$\delta(q,a,X) = \{(q_1,\alpha_1), \dots, (q_n,\alpha_n)\}$$ where $q,q_1,\dots,q_n \in Q$, $a \in \Sigma \cup \{\varepsilon\}$, $X \in \Gamma$, and $\alpha_1,\dots,\alpha_n \in \Gamma^*$.

Since $\alpha_i$ is a string, it's important to note the order in which we push symbols onto the stack. If $\alpha_i = X_1 X_2 \cdots X_k$ where $X_j \in \Gamma$, $X_k$ gets pushed onto the stack first and after all of $\alpha_i$ is pushed onto the stack, $X_1$ is on the top of the stack. In other words, the top of the stack should look like we just took $\alpha_i$ and turned it 90 degrees so that $X_k$ is at the bottom and $X_1$ is at the top.
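To make the ordering concrete, here is a small sketch of our own, modelling the stack as a Python list whose first element is the top:

```python
# How a transition replaces the popped top symbol with a string alpha.
# The stack is a list with the top of the stack at index 0, so prepending
# alpha leaves alpha[0] on top, matching the convention in the text.

def apply_push(stack, alpha):
    """Pop the top symbol and push alpha so that alpha[0] ends up on top."""
    return list(alpha) + stack[1:]

# Replacing top symbol "A" with alpha = "XYZ": "X" ends up on top.
print(apply_push(["A", "Z0"], "XYZ"))  # ['X', 'Y', 'Z', 'Z0']
```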

As an example, we write one of the transitions from the above machine as $$\delta(q_0,a,Z_0) = \{(q_0,XZ_0)\}.$$

Of course, as with NFAs, listing the entire transition function explicitly can be quite cumbersome. Also, note that just like for NFAs, we can draw a state diagram for a PDA. The idea is mostly the same, but the difference is how to denote the changes to the stack. For a transition $(q',\alpha) \in \delta(q,a,X)$ we label the transition from $q$ to $q'$ by $a,X/\alpha$.

As with the NFA, if a transition is undefined, we assume that it goes to the empty set (since the image of the transition function is subsets of $Q \times \Gamma^*$). In practice, we treat this as the machine crashing. For a PDA, in addition to not having transitions defined on a state and symbol, we also take into consideration the symbol that we pop from the stack. One particular case of this is when we try to pop an element off of an empty stack (this is undefined, so we go to the empty set).

Here, we have a PDA that we claim recognizes the language $\{a^k b^k \mid k \geq 0\}$. In order to verify our claim, we'll need to formalize notions of what it means for a PDA to "recognize" or "accept" a language. And in order to do that, we'll need to formalize the notion of what it means for a word to move through a PDA.

Describing the computation of an NFA was simple: it's really just a path, and all we needed to know to describe where the NFA was in its computation was how far into the input string it had read and which state(s) it could be in. With the PDA, there is one more piece of information that we need to take into account: the stack.

The description of the state of a PDA is called the configuration of the PDA and it is a 3-tuple $(q,w,\gamma)$, where

- $q \in Q$ is the current state,
- $w \in \Sigma^*$ is the remaining input to be read, and
- $\gamma \in \Gamma^*$ is the current contents of the stack, with the leftmost symbol of $\gamma$ at the top.

Now, we want to describe transitions from one configuration to another in the PDA.

We define the relation $\vdash$ on configurations by $(q,aw, X\gamma) \vdash (q',w,\beta\gamma)$ if and only if $(q',\beta) \in \delta(q,a,X)$, where $a \in \Sigma \cup \{\varepsilon\}$. We then define $\vdash^*$ to be the reflexive and transitive closure of $\vdash$: that is, $C \vdash^* C'$ if and only if $C = C'$ or there is a sequence of configurations $C = C_1 \vdash C_2 \vdash \cdots \vdash C_n = C'$.

A computation in a PDA $P$ is a sequence of configurations $C_1, C_2, \dots, C_n$, where $C_1 \vdash C_2 \vdash \cdots \vdash C_n$.

We say a PDA $P$ accepts a word $x$ by final state if $(q_0,x,Z_0) \vdash^* (q,\varepsilon, \gamma)$ for some $q \in F$ and $\gamma \in \Gamma^*$. The language of a PDA $P$ is $$ L(P) = \{w \in \Sigma^* \mid \text{ $P$ accepts $w$ }\}.$$

Let's look at the PDA from earlier again.

We can describe how this PDA works informally. The PDA begins by reading $a$'s and for each $a$ that's read, an $X$ is put on the stack. Then, once a $b$ is read, the computation moves on to the next state. For every $b$ that is read, an $X$ is taken off the stack. The machine can only reach the final state if the stack has been emptied down to $Z_0$ and there is no more input to be read.

We can trace a computation on the word $aabb$ to see how this works. We begin in the initial configuration $(q_0,aabb,Z_0)$. Then our computation looks like this: $$(q_0,aabb,Z_0) \vdash (q_0,abb,XZ_0) \vdash (q_0,bb,XXZ_0) \vdash (q_1,b,XZ_0) \vdash (q_1,\varepsilon,Z_0) \vdash (q_2,\varepsilon,Z_0).$$
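This kind of trace can be carried out mechanically. Below is a small nondeterministic simulator for acceptance by final state, run on a dictionary encoding of the $\{a^k b^k \mid k \geq 0\}$ machine; the encoding, and the use of the single character `Z` for $Z_0$, are our own choices.

```python
from collections import deque

# Nondeterministic PDA simulator: breadth-first search over configurations
# (state, remaining input, stack), where the stack is a string whose first
# character is the top symbol.  "" in a transition key stands for epsilon.

def accepts(delta, start, Z0, finals, w):
    seen = set()
    queue = deque([(start, w, Z0)])
    while queue:
        q, rest, stack = queue.popleft()
        if (q, rest, stack) in seen:
            continue
        seen.add((q, rest, stack))
        if rest == "" and q in finals:
            return True  # all input read, accepting state reached
        if not stack:
            continue  # popping an empty stack is undefined: branch crashes
        X, below = stack[0], stack[1:]
        if rest:  # transitions that read an input symbol
            for (q2, alpha) in delta.get((q, rest[0], X), ()):
                queue.append((q2, rest[1:], alpha + below))
        for (q2, alpha) in delta.get((q, "", X), ()):  # epsilon transitions
            queue.append((q2, rest, alpha + below))
    return False

# The PDA for {a^k b^k | k >= 0} from the lecture:
delta = {
    ("q0", "a", "Z"): {("q0", "XZ")},
    ("q0", "a", "X"): {("q0", "XX")},
    ("q0", "b", "X"): {("q1", "")},
    ("q1", "b", "X"): {("q1", "")},
    ("q0", "", "Z"):  {("q2", "Z")},
    ("q1", "", "Z"):  {("q2", "Z")},
}
print(accepts(delta, "q0", "Z", {"q2"}, "aabb"))  # True
print(accepts(delta, "q0", "Z", {"q2"}, "aab"))   # False
```

On the word $aabb$ the simulator visits exactly the configurations in the trace above, along with the dead-end branches where the $\varepsilon$-transition to $q_2$ is taken too early.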

Note that the definition of the transition function prevents any other word from reaching the final state. For instance, $a$'s can't be read after moving to $q_1$, otherwise the machine will crash. Similarly, if a $b$ is read and there is no corresponding stack symbol, the machine crashes, and an $\varepsilon$-transition can be taken to reach the final state only once the stack has been emptied down to $Z_0$.

Here, we want to define a PDA that recognizes the language of marked palindromes, $$\{w \# w^R \mid w \in \{a,b\}^* \}.$$

The idea is quite simple: we can read $w$ and keep track of it on the stack until we read the marker $\#$. Once we do that, we just need to make sure that the rest of the word matches the reverse of $w$. Luckily, this is exactly how stacks work.

Like the example above, processing the latter part of the word depends on what is on the stack. If the lengths of the stack and the second half of the word don't match up, then the machine crashes. Similarly, if a symbol is read that doesn't match what's on the stack, then the machine crashes. Or, if more than one marker symbol is read, the machine crashes.

Again, we can trace the computation of a word $ab\#ba$, \begin{align*} (q_0,ab\#ba,Z_0) &\vdash (q_0,b\#ba,aZ_0) \vdash (q_0,\#ba,baZ_0) \\ &\vdash (q_1,ba,baZ_0) \vdash (q_1,a,aZ_0) \vdash (q_1,\varepsilon,Z_0) \\ &\vdash (q_2,\varepsilon,Z_0). \end{align*}
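Since this particular machine is deterministic, a simple loop suffices to simulate it. The encoding below, including the state names and the use of `Z` for $Z_0$, is a sketch of our own:

```python
# Simulation of the marked-palindrome PDA: push the first half onto the
# stack, switch modes at '#', then pop and match against the second half.
# The stack is a list whose first element is the top.

def accepts_marked_palindrome(w):
    state, stack = "q0", ["Z"]
    for c in w:
        if state == "q0":
            if c in "ab":
                stack.insert(0, c)     # push the first half onto the stack
            elif c == "#":
                state = "q1"           # marker seen: switch to matching mode
            else:
                return False
        else:  # state == "q1"
            if c in "ab" and stack and stack[0] == c:
                stack.pop(0)           # pop a matching symbol
            else:
                return False           # mismatch or a second '#': crash
    # Accept iff the marker was seen and only Z0 remains on the stack.
    return state == "q1" and stack == ["Z"]

print(accepts_marked_palindrome("ab#ba"))  # True
print(accepts_marked_palindrome("ab#ab"))  # False
```

Note that the word `#` by itself is accepted, corresponding to $w = \varepsilon$ in $w\#w^R$.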

Of course, one can make some modifications to this PDA to get a machine that recognizes the language of palindromes (which aren't marked).