CMSC 28000 — Lecture 8

Finite automata and regular expressions

Now, we complete the characterization by showing the other direction: that the language of any DFA is regular. We will do this by showing that we can take any DFA and construct a regular expression that recognizes the same language.

Let $A$ be a DFA. Then there exists a regular expression $\mathbf r$ such that $L(\mathbf r) = L(A)$.

Let $A = (Q,\Sigma,\delta,q_0,F)$ and, without loss of generality, let $Q = \{1,2,\dots,n\}$. For every pair of states $p,q \in Q$ and integer $0 \leq k \leq n$, we define the language $L_{p,q}^k$ to be the language of words that label a path from $p$ to $q$ in $A$ passing only through intermediate states numbered $\leq k$. In other words, $w \in L_{p,q}^k$ if and only if $\delta(p,w) = q$ and, for every factorization $w = uv$ with $u$ and $v$ nonempty, the intermediate state $r = \delta(p,u)$ satisfies $r \leq k$. Note that the endpoints $p$ and $q$ themselves are not restricted.

We will show, by induction on $k$, that every language $L_{p,q}^k$ can be defined by a regular expression $\mathbf r_{p,q}^k$. The idea is that we begin by building regular expressions for paths between states $p$ and $q$ with restrictions on the states that such paths may pass through, and inductively build larger expressions from these until we obtain an expression that describes all paths through the DFA with no restrictions.

We begin with $k=0$. Since states are numbered $1$ through $n$, no intermediate states are allowed at all, so a path from $p$ to $q$ consists of at most one transition. This gives us three cases:

- if $p \neq q$ and there is no symbol $a \in \Sigma$ with $\delta(p,a) = q$, then $L_{p,q}^0 = \emptyset$, so we take $\mathbf r_{p,q}^0 = \emptyset$;
- if $p \neq q$ and $a_1, \dots, a_m$ are the symbols with $\delta(p,a_i) = q$, then $\mathbf r_{p,q}^0 = \mathtt{a_1} + \cdots + \mathtt{a_m}$;
- if $p = q$, then the empty word also labels such a path, so $\mathbf r_{p,p}^0 = \varepsilon + \mathtt{a_1} + \cdots + \mathtt{a_m}$ (or just $\varepsilon$ if there are no self-loops on $p$).

For our inductive step, we consider the language $L_{p,q}^k$ and assume that for every pair of states there exists a regular expression $\mathbf r_{p,q}^{k-1}$ for the language $L_{p,q}^{k-1}$. Then we have $$L_{p,q}^k = L_{p,q}^{k-1} \cup L_{p,k}^{k-1} (L_{k,k}^{k-1})^* L_{k,q}^{k-1}.$$ This equation expresses the two possibilities for words labelling paths from $p$ to $q$.

First, if $w \in L_{p,q}^{k-1}$, then the path labelled by $w$ does not pass through state $k$ (or any greater state) between $p$ and $q$. The other term of the union consists of words $w$ whose paths do pass through $k$. The language $L_{p,k}^{k-1}$ is the set of words travelling from $p$ to $k$ with intermediate states restricted to $\leq k-1$, and the language $L_{k,q}^{k-1}$ is the set of words travelling from $k$ to $q$ with the same restriction. In between these two is $(L_{k,k}^{k-1})^*$, which accounts for any number of loops from $k$ back to $k$ that avoid states greater than $k-1$.

This gives us the regular expression $$\mathbf r_{p,q}^k = \mathbf r_{p,q}^{k-1} + \mathbf r_{p,k}^{k-1} (\mathbf r_{k,k}^{k-1})^* \mathbf r_{k,q}^{k-1}.$$ From this, we can conclude that there exists a regular expression $$ \mathbf r = \sum_{q \in F} \mathbf r_{q_0,q}^n.$$ Here, $\mathbf r$ is the regular expression for the language of all words for which there exists a path from $q_0$ to a final state $q \in F$. In other words, $\mathbf r$ is a regular expression such that $L(\mathbf r) = L(A)$.
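The construction above can be implemented directly as a triply nested loop. Here is a minimal Python sketch; all names and the representation are my own, not from the lecture. Regular expressions are built as plain strings, with `"0"` standing for $\emptyset$ and `"e"` for $\varepsilon$, and only a few easy simplifications are applied (the resulting expressions are correct but far from minimal).

```python
def union(r, s):
    """Combine two expressions with +, dropping the empty language."""
    if r == "0":
        return s
    if s == "0":
        return r
    return f"({r}+{s})"

def concat(r, s):
    """Concatenate two expressions, simplifying around emptyset and epsilon."""
    if r == "0" or s == "0":
        return "0"
    if r == "e":
        return s
    if s == "e":
        return r
    return f"({r}{s})"

def star(r):
    """Star an expression; emptyset* and epsilon* are both epsilon."""
    if r in ("0", "e"):
        return "e"
    return f"({r})*"

def dfa_to_regex(n, sigma, delta, start, final):
    """McNaughton-Yamada: states are 1..n, delta maps (state, symbol) -> state."""
    # Base case k = 0: no intermediate states, so at most one transition.
    r = {}
    for p in range(1, n + 1):
        for q in range(1, n + 1):
            expr = "e" if p == q else "0"
            for a in sigma:
                if delta[(p, a)] == q:
                    expr = union(expr, a)
            r[(p, q)] = expr
    # Inductive step: allow state k as an intermediate state.
    for k in range(1, n + 1):
        r_new = {}
        for p in range(1, n + 1):
            for q in range(1, n + 1):
                through_k = concat(r[(p, k)], concat(star(r[(k, k)]), r[(k, q)]))
                r_new[(p, q)] = union(r[(p, q)], through_k)
        r = r_new
    # Sum over final states.
    result = "0"
    for q in final:
        result = union(result, r[(start, q)])
    return result
```

Even on tiny DFAs the output grows quickly, which is one reason to prefer the hand method described below.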

This proof, which is due to McNaughton and Yamada in 1960, gives us an algorithm for computing a regular expression from a DFA. The trick used here may seem familiar if you've seen the Floyd-Warshall algorithm for computing all-pairs shortest paths, since it operates on the same idea of progressively constructing paths by relaxing the constraint on the allowed intermediate vertices.

This result, together with the result from last class, proves Kleene's theorem, which says that the class of languages recognized by DFAs and the class defined by regular expressions are the same. This result is due to Kleene in 1951, and was later published in 1956.

But let's think a bit more about the algorithm we just saw. It's well known that Floyd-Warshall, and by extension the McNaughton-Yamada algorithm, performs $O(n^3)$ operations (though here each operation combines regular expressions, which can themselves grow large). That's pretty nice for computers, but you probably don't want to be executing this by hand. Luckily, there is an alternative, which is fairly intuitive but which we won't prove formally.

Consider the following DFA, which is the complement of our "divisible by 3" DFA (i.e. the numbers that aren't divisible by 3). The procedure we're about to carry out is the state-elimination method, due to Brzozowski and McCluskey in 1963.

What we'll do is come up with regular expressions by systematically removing states and renaming the labels with regular expressions. Some texts go through the trouble of formally defining these as generalized finite automata, but that's not really necessary because we're just using these as a helpful way to keep track of the regular expressions that we're building and not using them as a real mathematical object.

To make things simpler, we begin by adding a new initial state and a new final state, connected to the automaton by $\varepsilon$-transitions as appropriate. Otherwise, we'd have to contend with what happens when we try to "eliminate" an initial or final state.

Basically, when we eliminate a state, we replace the transitions to and from the state with regular expressions that represent the strings that would've travelled through it. Consider a path from $1$ through $2$ and back to $1$. This is $0$ followed by any number of $1$s and $0$ again. Similarly, if we didn't go back, we'd be in a final state, so we need to consider that alternative. What this amounts to is figuring out regular expressions for all possible paths through the state we eliminate.

We do this again with all of the states of our original automaton. What we should be left with is a "finite automaton" with a single transition labelled by the regular expression for our language.
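The elimination step can also be sketched in code. In this minimal Python sketch (the representation and names are my own assumptions, not from the lecture), the "generalized automaton" is just a dictionary mapping state pairs to regex strings, with a missing entry standing for $\emptyset$ and `""` for an $\varepsilon$ label:

```python
def elim_union(r, s):
    """Union of two edge labels; None stands for the empty set."""
    if r is None:
        return s
    if s is None:
        return r
    return f"({r}+{s})"

def elim_concat(*parts):
    """Concatenate edge labels, dropping epsilon ("") labels."""
    return "".join(p for p in parts if p != "")

def eliminate(edges, states, start, final):
    """Brzozowski-McCluskey state elimination.

    edges: dict (p, q) -> regex string; start/final are the fresh initial
    and final states added before elimination begins.
    """
    for s in states:  # eliminate the original states one by one
        loop = edges.get((s, s))
        loop_star = f"({loop})*" if loop is not None else ""
        ins = [(p, r) for (p, q), r in edges.items() if q == s and p != s]
        outs = [(q, r) for (p, q), r in edges.items() if p == s and q != s]
        # Every path through s becomes a direct edge labelled in . loop* . out.
        for p, rin in ins:
            for q, rout in outs:
                path = elim_concat(rin, loop_star, rout)
                edges[(p, q)] = elim_union(edges.get((p, q)), path)
        edges = {(p, q): r for (p, q), r in edges.items() if s not in (p, q)}
    # Only the start -> final edge remains; its label is the regex.
    return edges.get((start, final))
```

For instance, eliminating the single state of the one-state DFA for $\mathtt a^*$ (with $\varepsilon$-edges from the new initial state and to the new final state) leaves a single edge labelled `(a)*`.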

This process is based on the idea in the proof of Theorem 7.1, constructing regular expressions for part of the machine and combining them, and with very much the same reasoning. The difference is that we're not doing this for all possible pairs and paths.

What are some immediate consequences of this result? One thing you may have noticed is that our regular expressions and the regular expressions you've used in practice (namely, POSIX regular expressions) have different sets of operators. Can we augment our regular expressions with more operators without changing their power? One easy way to do this is to make use of finite automata to prove closure properties for those operations: if you can construct a DFA for an operation, then you can add it to your regular expressions without any worry!
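For example, closure under intersection follows from the standard product construction: run both DFAs in parallel and accept when both do. A minimal Python sketch (the tuple representation of DFAs here is my own):

```python
from itertools import product

def intersect_dfas(d1, d2):
    """Product construction: each DFA is (states, sigma, delta, start, finals),
    with delta a dict mapping (state, symbol) -> state."""
    s1, sigma, t1, q1, f1 = d1
    s2, _, t2, q2, f2 = d2
    states = set(product(s1, s2))
    # Run both machines in lockstep on each symbol.
    delta = {((p, q), a): (t1[(p, a)], t2[(q, a)])
             for (p, q) in states for a in sigma}
    # Accept exactly when both components accept.
    finals = {(p, q) for (p, q) in states if p in f1 and q in f2}
    return states, sigma, delta, (q1, q2), finals

def accepts(dfa, w):
    """Run a DFA on a word and report acceptance."""
    _, _, delta, state, finals = dfa
    for a in w:
        state = delta[(state, a)]
    return state in finals
```

As a sanity check, the product of a "length divisible by 2" machine and a "length divisible by 3" machine over a one-letter alphabet accepts exactly the words of length divisible by 6.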

This sounds great, but it turns out that there are some features defined in POSIX that aren't regularity-preserving. In particular, backreferences allow regexes to recognize languages which are not regular. In other words, this feature means that standard Unix/POSIX regular expressions aren't actually regular! A formal study of the power of such "practical" regular expressions was done by Câmpeanu, Salomaa, and Yu in 2003. But what do non-regular languages look like, and how do we prove that a language isn't regular? This is something we'll see soon.
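To see this concretely, Python's `re` module supports backreferences, and a one-line pattern with `\1` matches the language $\{\mathtt a^n \mathtt b\, \mathtt a^n : n \geq 1\}$, which (as we'll soon be able to prove) is not regular:

```python
import re

# The backreference \1 forces the a's after the b to be an exact copy of the
# a's before it, so this matches a^n b a^n -- a language no DFA can recognize.
pattern = re.compile(r"(a+)b\1")

def matches(w):
    return pattern.fullmatch(w) is not None
```

A DFA would need to "count" the first block of $\mathtt a$'s with finitely many states, which is exactly the kind of impossibility argument we'll formalize later.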

Another question we can ask goes the other way: what happens if we take operators away, and what kinds of languages do we end up with? The most famous of these is the class of star-free regular languages: those regular languages that can be defined by regular expressions with concatenation, union, intersection, and complement, but not the Kleene star. Marcel-Paul Schützenberger, who will come up again later in the course, showed in 1965 that these are exactly the languages recognized by aperiodic monoids.

This result is related to a collection of problems on regular expressions called star height problems. The star height of a regular expression is its maximum depth of nested Kleene stars; an expression like $\mathtt{(a_1^* a_2^*)^*}$, whose stars are nested two deep, has star height 2. In 1963, Lawrence Eggan asked whether nested stars are ever truly necessary. The tricky thing about these kinds of problems is that just because your language can be expressed via an expression with nested stars doesn't mean that you need those stars. For instance, it's clear that $\mathtt{(a^*)^*} = \mathtt{a^*}$. A less obvious example is $\Sigma^* = \overline \emptyset$, which uses no stars at all. When complementation is allowed, as in the star-free expressions above, this measure is called generalized star height, and it is currently not known whether there exists a regular language with generalized star height greater than 1.

Now, you might say that since we allow complementation, this is not really fair, so we should go back to the basics with only union, concatenation, and star. Here, we actually get Eggan's original question, which was solved in 1966 by Françoise Dejean and Schützenberger, who proved that there exist regular languages with (restricted) star-height $n$ for every $n \in \mathbb N$. However, this led to another famous problem: coming up with an algorithm for computing the star-height of a language. This was eventually solved 25 years later by Kosaburo Hashiguchi in 1988. This algorithm turns out to be extremely complex, to the point of infeasibility; in 2005, Daniel Kirsten was able to improve the algorithm by bringing it down to $2^{2^{O(n)}}$ worst-case space complexity.

Often, we think of regular languages as easy, solved, and completely understood. However, there are still some old and surprising problems that are quite a bit harder than we tend to associate with regular languages.