CMSC 28000 — Lecture 6

Regular languages

Up until now, we've been dealing with the idea of languages that can be recognized by DFAs. Although we've seen different flavours of NFAs as well, we're still stuck with this notion of things that a DFA can recognize. But what makes a language simple enough that it can be recognized by a DFA? This isn't something that we can answer by looking only at the model of computation, so we'll step back and investigate other properties of these languages.

Recall that languages are sets, and that we can perform all the usual set operations on them to describe new languages. We also extended some string operations to languages. We will now single out a particular set of these operations as the regular operations.

Let $\Sigma$ be an alphabet and let $L,M \subseteq \Sigma^*$ be languages. The following operations are the regular operations:

  1. Union: $L \cup M = \{w \in \Sigma^* \mid w \in L \text{ or } w \in M \}.$
  2. Concatenation: $LM = \{uv \mid u \in L, v \in M \}$.
  3. Kleene star: $L^* = \bigcup\limits_{i \geq 0} L^i$.
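
As a quick sanity check, here is a minimal Python sketch of these operations, assuming languages are represented as finite sets of strings; since $L^*$ is infinite whenever $L$ contains a nonempty word, the star is truncated to a bound $n$, computing only $L^0 \cup L^1 \cup \cdots \cup L^n$.

    def union(L, M):
        """L ∪ M: strings in either language."""
        return L | M

    def concat(L, M):
        """LM: every way of gluing a string of L to a string of M."""
        return {u + v for u in L for v in M}

    def star(L, n):
        """L^0 ∪ L^1 ∪ ... ∪ L^n, a truncation of the Kleene star."""
        result = {""}          # L^0 = {ε}
        power = {""}
        for _ in range(n):
            power = concat(power, L)   # L^(i+1) = L^i L
            result |= power
        return result

    print(union({"a", "ab"}, {"b"}))   # {'a', 'ab', 'b'}
    print(concat({"a", "ab"}, {"b"}))  # {'ab', 'abb'}
    print(star({"b"}, 3))              # {'', 'b', 'bb', 'bbb'}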

So what's so special about these particular operations? These operations define what we call the class of regular languages.

The class of regular languages over an alphabet $\Sigma$ is defined inductively. For the base cases:

  1. $\emptyset$ is regular.
  2. $\{\varepsilon\}$ is regular.
  3. $\{a\}$ for $a \in \Sigma$ is regular.

Furthermore, if $L_1$ and $L_2$ are regular languages, then:

  1. $L_1 \cup L_2$ is regular.
  2. $L_1 L_2$ is regular.
  3. $L_1^*$ is regular.

No other languages are regular; that is, the class of regular languages is the smallest class of languages satisfying these conditions.

So why are these called regular languages? It's mostly because Kleene called them this when he first defined them, and no one has bothered to come up with a better name since.

That's not actually entirely true, because there is a better name that shows up from the algebraic perspective of formal languages. Recall that we can view our alphabet together with concatenation as a monoid, and the free monoid on our alphabet is just the set of strings over that alphabet. If we forget about focusing on free monoids and just consider arbitrary monoids, then the rational subsets of a monoid are those that can be built from the finite subsets using the rational operations: $\cup$, $\cdot$, and $*$ (here, $\cdot$ denotes the setwise monoid operation and $*$ its iteration, not necessarily concatenation and Kleene star).

From this, it makes a lot of sense to call regular languages rational languages instead, and there are many who do, particularly algebraic automata theorists and French mathematicians. But the prevailing terminology, especially among computer scientists, is going to be regular languages.

Regular expressions

Up to now, we've only considered finite automata as a way to describe a language. However, using finite automata as a description has its limitations, particularly for humans. While all of the finite automata we've seen so far are reasonably understandable, they are relatively small. Finite automata used in practical applications can contain hundreds, thousands, or even more states. Luckily, our definition of regular languages will give us a more succinct and friendlier way to describe a regular language: regular expressions.

A regular expression $\mathbf r$ over an alphabet $\Sigma$, together with its language $L(\mathbf r)$, is defined inductively by

  1. $\mathbf r = \mathtt{\emptyset}$; $L(\mathbf r) = \emptyset$,
  2. $\mathbf r = \mathtt{\varepsilon}$; $L(\mathbf r) = \{\varepsilon\}$,
  3. $\mathbf r = \mathtt{a}$ for $a \in \Sigma$; $L(\mathbf r) = \{a\}$,

and for regular expressions $\mathbf r_1$ and $\mathbf r_2$,

  1. $\mathbf r = \mathbf r_1+\mathbf r_2$; $L(\mathbf r) = L(\mathbf r_1) \cup L(\mathbf r_2)$,
  2. $\mathbf r = \mathbf r_1\mathbf r_2$; $L(\mathbf r) = L(\mathbf r_1) L(\mathbf r_2)$,
  3. $\mathbf r = \mathbf r_1^*$; $L(\mathbf r) = L(\mathbf r_1)^*$.

Furthermore, we establish an order of operations, as for typical arithmetic expressions, from highest to lowest precedence:

  1. Kleene star,
  2. concatenation,
  3. union,

and as usual, we can override this precedence within a particular expression by using parentheses.
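
For example, $\mathtt{ab^*+c}$ is parsed as $\mathtt{(a(b^*))+c}$, whose language is $\{a, ab, abb, \ldots\} \cup \{c\}$.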

This order, along with the notation of $\cdot$ and $+$ suggests that we can view regular expressions in a similar way to ordinary algebraic expressions. Interestingly enough, our definition suggests this in other ways. For example, we can view $\mathtt{\emptyset}$ as the additive identity ($\emptyset + \mathbf r = \mathbf r$) and $\mathtt{\varepsilon}$ as the multiplicative identity ($\varepsilon \cdot \mathbf r = \mathbf r$).

I want to highlight another bit of notation here. I use $\mathbf{boldface}$ to denote regular expressions and $\mathtt{monospace}$ to denote the symbols used in the regular expression. This is because there's often confusion about the difference between the regular expression itself and the language of the regular expression. A regular expression $\mathbf r$ over an alphabet $\Sigma$ should be considered a string over the alphabet $\{\mathtt a \mid a \in \Sigma\} \cup \{\mathtt (,\mathtt ),\mathtt \emptyset,\mathtt *,\mathtt +,\mathtt \varepsilon\}$, while the language of $\mathbf r$, $L(\mathbf r)$ is a set of words over $\Sigma$.

Consider the regular expression $\mathbf r = \mathtt{(a+b)^* b (a+b)(a+b)}$. The language of $\mathbf r$ is $L(\mathbf r)$, the language of strings over the alphabet $\{a,b\}$ with $b$ in the third from last position. So, $abbb \in L(\mathbf r)$. If we want to describe $L(\mathbf r)$ using the regular operations, we would have something like: \[L(\mathbf r) = (\{a\} \cup \{b\})^* \{b\} (\{a\} \cup \{b\}) (\{a\} \cup \{b\}).\]

We can see that this definition mirrors the definition of regular languages, and we can pretty easily conclude that regular expressions describe exactly the class of regular languages. Again, it's important to keep the two notions separate: a regular expression $\mathbf r$ is just a string, while $L(\mathbf r)$ is the set of words described by $\mathbf r$.

The term regular expression may sound familiar to many of you, particularly if you've ever needed to do a bunch of pattern matching. And your guess is correct: these are (mostly) the same thing. For the most part, the regular expressions used in Unix and various programming languages are based on our theoretically defined regular expressions. This leads to a few points of interest.

The first is that this implies a non-theoretical application for all of the stuff that we've been doing so far: pattern matching. If you want to search for some text, or more generally a pattern, then regular expressions are your tool of choice. But, and I hinted at this earlier, regular expressions are for us to read and write, while the computer turns them into finite automata and uses those for text processing.
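
As a quick illustration, here is the third-from-last example from earlier, written for Python's re module; its syntax differs a little from ours, with | playing the role of $\mathtt +$ and the character class [ab] abbreviating (a|b).

    import re

    # (a+b)* b (a+b)(a+b) in Python's syntax; fullmatch anchors the pattern
    # to the whole string, matching our notion of membership in L(r).
    pattern = re.compile(r"[ab]*b[ab][ab]")

    print(bool(pattern.fullmatch("abbb")))  # True:  third-from-last symbol is b
    print(bool(pattern.fullmatch("baaa")))  # False: third-from-last symbol is a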

Now, that may have been obvious because the name gave it away. However, there's a second classical application of regular languages that is similar: lexical analysis. Lexical analysis is the phase of the compiler that is responsible for breaking up strings of symbols into a sequence of tokens (which is why this is also sometimes called tokenization).

Tokens are discrete units that have some meaning. For instance, the string

if ( x > 2 ) {

could be tokenized as

[(KEYWORD, "if"), (SEPARATOR, "("), (IDENTIFIER, "x"), (OPERATOR, ">"), (LITERAL, "2"), (SEPARATOR, ")"), (SEPARATOR, "{")]

Now, what's important to keep in mind is that tokenization only gives meaning to the parts of the string, but it doesn't check whether the sequence of tokens actually makes "grammatical" sense. This is a sign of things to come.
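
To make this concrete, here is a sketch of a toy lexer in Python, where each token class is given by a regular expression; the class names and patterns are invented for illustration rather than taken from any particular compiler.

    import re

    # One named group per token class; alternatives are tried left to right.
    TOKEN_SPEC = [
        ("KEYWORD",    r"(?:if|else|while)\b"),
        ("IDENTIFIER", r"[A-Za-z_]\w*"),
        ("LITERAL",    r"\d+"),
        ("OPERATOR",   r"[><=+\-*/]"),
        ("SEPARATOR",  r"[(){};]"),
        ("SKIP",       r"\s+"),
    ]
    LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def tokenize(s):
        for m in LEXER.finditer(s):
            if m.lastgroup != "SKIP":  # discard whitespace
                yield (m.lastgroup, m.group())

    print(list(tokenize("if ( x > 2 ) {")))
    # [('KEYWORD', 'if'), ('SEPARATOR', '('), ('IDENTIFIER', 'x'),
    #  ('OPERATOR', '>'), ('LITERAL', '2'), ('SEPARATOR', ')'), ('SEPARATOR', '{')]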

Of course, there's an obvious analogy to natural language. In computational linguistics, regular languages provide a simple first approach to tokenizing natural language sentences, by identifying discrete words and possibly the kinds of words (nouns, verbs, etc.) they are. Again, such processing only attempts to provide meanings to chunks of strings, but doesn't enforce any particular rules about how these chunks should be arranged.

Here are a few more examples of regular expressions and the languages they describe.
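
For instance, $L(\mathtt{a^*b^*})$ is the set of strings consisting of zero or more $a$s followed by zero or more $b$s, $L(\mathtt{(ab)^*})$ is the set of strings obtained by repeating $ab$ zero or more times, and $L(\mathtt{(0+1)^*1})$ is the set of binary strings that end in $1$.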

Regular operations and finite automata

A surprising fact about regular languages and languages recognized by DFAs is that they happen to be exactly the same: the class of DFA languages is the class of regular languages. At first glance, there's no obvious reason why this should be the case. After all, finite automata and regular expressions are totally different formalisms that have no obvious connection. But as we delve further into the class of regular languages, we'll see a surprising number of connections. First, we will describe a way to construct a finite automaton, given a regular language/expression.

Let $L \subseteq \Sigma^*$ be a regular language. Then there exists a DFA $M$ with $L(M) = L$.

We will show this by structural induction. For convenience, we will actually construct $\varepsilon$-NFAs rather than DFAs; since we've already shown that $\varepsilon$-NFAs are equivalent to DFAs, this suffices.

First, for our base case, we can construct the following $\varepsilon$-NFAs for each of conditions (1–3):
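
Concretely, one natural choice is: for $\mathbf r = \mathtt{\emptyset}$, a single non-final initial state with no transitions; for $\mathbf r = \mathtt{\varepsilon}$, a single state that is both initial and final; and for $\mathbf r = \mathtt{a}$, an initial state with one $a$-transition into a final state.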

Since these are pretty simple, we won't formally prove that they're correct, but you can try on your own if you wish.

Now, we will show that we can construct an $\varepsilon$-NFA for each of conditions (4–6). For our inductive hypothesis, we let $L_A$ and $L_B$ be regular languages and assume that they can be accepted by DFAs $A$ and $B$, respectively.

Luckily, we've done a lot of the work up to this point.

First, we'll consider $L_A \cup L_B$. We've already seen how to construct a DFA for the union of the two languages by modifying the product construction. Here, we'll take the opportunity to describe an $\varepsilon$-NFA that does the same thing. Informally, we can make an NFA that looks like this:

Basically, we take our idea about running two finite automata in parallel literally. We start in the initial states of each machine and run them. If one of them accepts, then we accept.
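
Formally, one way to write this down: assuming $A = (Q_A,\Sigma,\delta_A,q_{0,A},F_A)$ and $B = (Q_B,\Sigma,\delta_B,q_{0,B},F_B)$ have disjoint state sets, we take \[A_\cup = (Q_A \cup Q_B \cup \{q_0\}, \Sigma, \delta', \{q_0\}, F_A \cup F_B),\] where $q_0$ is a new state, $\delta'(q_0,\varepsilon) = \{q_{0,A}, q_{0,B}\}$, and $\delta'$ otherwise behaves as $\delta_A$ on $Q_A$ and as $\delta_B$ on $Q_B$. A word is accepted by $A_\cup$ exactly when at least one of the two submachines accepts it.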

Next, we'll consider $L_A L_B$. This is the construction that we used in Theorem 6.4.

Finally, we consider $L_A^*$. For this, we'll break it out and prove it as its own theorem. Once we have proved it, we will have shown that we can construct $\varepsilon$-NFAs for $L_A \cup L_B$, $L_A L_B$, and $L_A^*$, and therefore we can construct a DFA that accepts any given regular language.

The class of DFA languages is closed under Kleene star.

Let $A = (Q_A,\Sigma,\delta,q_{0,A},F_A)$ be a DFA. We can construct an $\varepsilon$-NFA that looks like this:

We define $A'$ by $A' = (Q_A \cup \{q_0'\}, \Sigma, \delta', \{q_0'\}, \{q_0'\})$, where the transition function $\delta' : (Q_A \cup \{q_0'\}) \times (\Sigma \cup \{\varepsilon\}) \to 2^{Q_A \cup \{q_0'\}}$ is defined by \[\delta'(q,\sigma) = \begin{cases} \{q_{0,A}\} & \text{if } q = q_0' \text{ and } \sigma = \varepsilon, \\ \{\delta(q,\sigma)\} & \text{if } q \in Q_A \text{ and } \sigma \in \Sigma, \\ \{q_0'\} & \text{if } q \in F_A \text{ and } \sigma = \varepsilon, \\ \emptyset & \text{otherwise.} \end{cases}\]

We modify $A$ to get $A'$ as follows. First, we create a new state $q_0'$, which serves as both the initial state and the sole final state of $A'$; this guarantees that $\varepsilon$ is accepted, even if $\varepsilon \not\in L_A$. Secondly, for each state in $F_A$, instead of adding it to the set of final states of $A'$, we add an $\varepsilon$-transition from it to $q_0'$. While this is mostly cosmetic, it makes the construction clearer. Each accepting computation of $A'$ must begin and end at $q_0'$, making zero or more traversals through $A$, and each traversal of $A$ reads a word accepted by $A$. Hence, each word accepted by $A'$ is a concatenation of zero or more words accepted by $A$ and thus belongs to $L_A^*$. Conversely, every concatenation of words accepted by $A$ can be traced through $A'$ in this way, so $L(A') = L_A^*$.

This specific construction is due to Ken Thompson in 1968 and is most commonly called the Thompson construction. Thompson was one of the early figures working on Unix and was, among other things, responsible for the inclusion of regular expressions in a lot of early Unix software like ed, making regexes ubiquitous among programmers. More recent work of his that you may be familiar with includes UTF-8 and the Go programming language.
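
Here is a compact Python sketch of the inductive constructions, in the spirit of Thompson's construction; the fragment representation (a start state, a single final state, and an edge list, with "" standing for $\varepsilon$) and all of the names here are mine, not Thompson's.

    from itertools import count

    _fresh = count()  # supply of fresh state names

    def _state():
        return next(_fresh)

    def nfa_symbol(a):
        """Fragment accepting exactly {a}: start --a--> final."""
        s, f = _state(), _state()
        return {"start": s, "final": f, "edges": [(s, a, f)]}

    def nfa_union(n1, n2):
        """Union: a new start with ε-edges into both fragments, and
        ε-edges from both old finals into a new final."""
        s, f = _state(), _state()
        edges = n1["edges"] + n2["edges"] + [
            (s, "", n1["start"]), (s, "", n2["start"]),
            (n1["final"], "", f), (n2["final"], "", f),
        ]
        return {"start": s, "final": f, "edges": edges}

    def nfa_concat(n1, n2):
        """Concatenation: an ε-edge gluing n1's final to n2's start."""
        edges = n1["edges"] + n2["edges"] + [(n1["final"], "", n2["start"])]
        return {"start": n1["start"], "final": n2["final"], "edges": edges}

    def nfa_star(n):
        """Star: a new state that is both start and final, with an ε-edge
        into the fragment and an ε-edge back from the fragment's final."""
        s = _state()
        edges = n["edges"] + [(s, "", n["start"]), (n["final"], "", s)]
        return {"start": s, "final": s, "edges": edges}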

Let's apply the algorithm that we get from this proof.

Consider the regular expression $\mathbf r = \mathtt{(0+1)^* 000 (0+1)^*}$, which matches all binary strings that contain $000$ as a substring. To construct an NFA based on the Thompson construction, we begin with our basic NFAs for $0$ and $1$.

Then, we use these in the other constructions. For instance, we can construct the union, followed by the star.

Then we can perform our concatenations.
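
In code, using the sketch from earlier, we would build each subexpression fresh, since fragments are not meant to be shared:

    # (0+1)* 000 (0+1)*, built bottom-up
    left  = nfa_star(nfa_union(nfa_symbol("0"), nfa_symbol("1")))
    mid   = nfa_concat(nfa_concat(nfa_symbol("0"), nfa_symbol("0")), nfa_symbol("0"))
    right = nfa_star(nfa_union(nfa_symbol("0"), nfa_symbol("1")))
    nfa   = nfa_concat(nfa_concat(left, mid), right)

    states = {p for (p, _, q) in nfa["edges"]} | {q for (p, _, q) in nfa["edges"]}
    print(len(states))  # 20 states, versus the 4-state DFA you could write by hand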

Observe that this machine is quite large and you can already see some states and transitions that are superfluous. Or, put another way, you probably could've come up with a smaller machine right at the start just by thinking about it for maybe half a minute.

Are there better algorithms out there for constructing NFAs? Yes! One such method is the position automaton, developed independently by Glushkov and by McNaughton and Yamada. This construction takes a more direct approach, based on the positions of the letters in the regular expression. The result is an NFA with exactly $n+1$ states, where $n$ is the number of occurrences of alphabet symbols in the regular expression.