CMSC 28000 — Lecture 8

Let's apply the Thompson construction that we got from the proof of Theorem 7.8 last class.

Consider regular expression $\mathbf r = \mathtt{(0+1)^* 000 (0+1)^*}$, which matches all binary strings that contain $000$ as a substring. To construct an NFA based on the Thompson construction, we begin with our basic NFAs for $0$ and $1$.

Then, we use these in the other constructions. For instance, we can construct the union, followed by the star.

Observe that this machine is quite large and you can already see some states and transitions that are superfluous. Or, put another way, you probably could've come up with a smaller machine right at the start just by thinking about it for maybe half a minute.

Despite this, we can observe that for a regular expression of length $n$, the Thompson construction gives us an NFA that has $O(n)$ states. To see this, we notice that each case of the construction only adds one or two states. Since each case corresponds to an atomic expression or an operation, we can see that each character in the expression is responsible for one or two states (we can also try to prove this formally).

The main downside of the Thompson construction is the use of $\varepsilon$-transitions. There are a number of algorithms that transform regular expressions without $\varepsilon$-transitions and still result in roughly $O(n)$ states.

Finite automata and regular expressions

Now, we complete the characterization by showing the other direction: that the language of any DFA is regular. We will do this by showing that we can take any DFA and construct a regular expression that recognizes the same language.

Let $A$ be a DFA. Then there exists a regular expression $\mathbf r$ such that $L(\mathbf r) = L(A)$.

Let $A = (Q,\Sigma,\delta,q_0,F)$ and without loss of generality, let $Q = \{1,2,\dots,n\}$. In other words, we assign each state an integer label from $1$ through $n$.

For each pair of states $p,q \in Q$ and integer $k \leq n$, we define the language $L_{p,q,k}$ to be the language of words for which there is a path from $p$ to $q$ in $A$ consisting only of states with label $\leq k$. In other words, if $w$ is in $L_{p,q,k}$, then for every factorization $w = uv$, we have $\delta(p,u) = r$ such that $r \leq k$ and $\delta(r,v) = q$.

We will show that every language $L_{p,q,k}$ can be defined by a regular expression $\mathbf r_{p,q,k}$. We will show this by induction on $k$. The idea is that we will begin by building regular expressions for paths between states $p$ and $q$ with restrictions on the states that such paths may pass through and inductively build larger expressions from these paths until we acquire an expression that describes all paths through the DFA with no restrictions.

We begin with $k=0$. Since states are numbered $1$ through $n$, we have three cases:

There is no path from $p$ to $q$, in which case we have $L_{p,q,0} = \emptyset$. Then $\mathbf r_{p,q,0} = \emptyset$.
There exists $a_1, a_2, \dots, a_m \in \Sigma$ with $\delta(p,a_i) = q$ for $1 \leq i \leq m$ in which case we have $L_{p,q} = \{a_1, a_2, \dots, a_m\}$ for each such transition. Then $\mathbf r_{p,q,0} = \mathtt{a_1 + a_2 + \cdots + a_m}$.
We have $p = q$, in which case, in addition to the above, we also have $\varepsilon \in L_{p,q,0}$, and we add $\varepsilon$ to our regular expression $\mathbf r_{p,q,0}$ as appropriate.

For our inductive step, we consider the language $L_{p,q,k}$ and assume that there exist regular expressions $\mathbf r_{r,s,k-1}$ for languages $L_{r,s,k-1}$ for all states $r,s \in Q$. We claim that we have the following: $$L_{p,q,k} = L_{p,q,k-1} \cup L_{p,k,k-1} (L_{k,k,k-1})^* L_{k,q,k-1}.$$ This language expresses one of two possibilities for words traveling on paths between $p$ and $q$.

To see this, we first consider when $w \in L_{p,q,k-1}$. In this case, $w$ does not travel through any state $k$ or greater between $p$ and $q$.

The other term of the union consists of words $w$ with $p,q$ paths that contain state $k$. Recall that the language $L_{p,k,k-1}$ is the set of words with paths from $p$ to $k$ restricted to states with label $\leq k-1$ and the language $L_{k,q,k-1}$ is the set of words with paths from $k$ to $q$ restricted to states with label $\leq k-1$. In between these two languages is $L_{k,k,k-1}$, which is the set of words with paths from $k$ back to $k$ without containing any state with label greater than $k-1$.

Since each of the above languages has a regular expression by our inductive hypothesis, we can have the following regular expression: $$\mathbf r_{p,q,k} = \mathbf r_{p,q,k-1} + \mathbf r_{p,k,k-1} (\mathbf r_{k,k,k-1})^* \mathbf r_{k,q,k-1}.$$

From this, we can conclude that there exists a regular expression $$ \mathbf r = \sum_{q \in F} \mathbf r_{q_0,q,n}.$$ Here, $\mathbf r$ is the regular expression for the language of all words for which there exist a path from $q_0$ to a final state $q \in F$. In other words, $\mathbf r$ is a regular expression such that $L(\mathbf r) = L(A)$.

This proof, which is due to McNaughton and Yamada in 1960, gives us a dynamic programming algorithm for computing a regular expression from a DFA. Notice that we have the recurrence \[\mathbf r_{p,q,k} = \begin{cases} \emptyset & \text{if $k = 0$ and $\delta(p,a) \neq q$ for all $a \in \Sigma$}, \\ \varepsilon & \text{if $k = 0$ and $p=q$}, \\ \mathtt{a_1 + \cdots + a_n} & \text{if $k = 0$ and $\delta(p,a_i) = q$ for $1 \leq i \leq n$}, \\ \mathbf r_{p,q,k-1} + \mathbf r_{p,k,k-1} (\mathbf r_{k,k,k-1})^* \mathbf r_{k,q,k-1} & \text{if $k \gt 0$}, \\ \end{cases}\] The trick that's used here may seem familiar to you if you've seen the Floyd-Warshall algorithm for computing all-pairs shortest paths, because it operates on the same idea of progressively constructing paths by constraining the allowed intermediary vertices. (Why are we able to use basically the same algorithm? Because both the shortest paths and DFA to regular expression problems are based on objects that form semirings)

This result, together with the result from last class prove Kleene's theorem, which says that the classes of languages recognized by DFAs and regular expressions are the same. This result is due to Kleene in 1951, and was later published in 1956.