CMSC 28000 — Lecture 7

Regular languages

So far, we have been focused on languages with respect to a model of computation. In particular, we've been giving our attention to finite automata, whether they're deterministic or not (and as we saw, the distinction doesn't matter when thinking about power). But what makes a language simple enough that it can be recognized by a DFA? This isn't something that we can answer by looking only at the model of computation, so we'll step back and investigate other properties of these languages.

Recall that languages are sets and that we can perform all the usual set operations on them to describe or construct new languages. We also extended some string operations to languages. We will now single out a specific set of these operations as the regular operations.

Let $\Sigma$ be an alphabet and let $L,M \subseteq \Sigma^*$ be languages. The following operations are the regular operations:

  1. Union: $L \cup M = \{w \in \Sigma^* \mid w \in L \text{ or } w \in M \}.$
  2. Concatenation: $LM = \{uv \mid u \in L, v \in M \}$.
  3. Kleene star: $L^* = \bigcup\limits_{i \geq 0} L^i$.
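
To see these operations concretely, here is a minimal Python sketch (an illustration, not part of the formal development), with languages represented as finite sets of strings. Since $L^*$ is infinite whenever $L$ contains a nonempty word, the star function below only computes the union of the first few powers, controlled by an illustrative max_power parameter.

def union(L, M):
    """Union of two languages, represented as Python sets of strings."""
    return L | M

def concat(L, M):
    """Concatenation: every word of L followed by every word of M."""
    return {u + v for u in L for v in M}

def star(L, max_power=3):
    """Bounded approximation of L*: the union of L^0, L^1, ..., L^max_power.
    (The true Kleene star is infinite whenever L contains a nonempty word.)"""
    result, power = {""}, {""}       # L^0 = {ε}, with ε as the empty string
    for _ in range(max_power):
        power = concat(power, L)     # L^(i+1) = L^i L
        result |= power
    return result

L = {"a"}
M = {"b", "bb"}
print(union(L, M))      # {'a', 'b', 'bb'}
print(concat(L, M))     # {'ab', 'abb'}
print(sorted(star(L)))  # ['', 'a', 'aa', 'aaa']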

So what's so special about these particular operations? In some sense, we can think of these as a minimal set of operations that lets us build "interesting" languages out of very simple pieces.

These operations define what we call the class of regular languages.

The class of regular languages over $\Sigma$ is defined inductively as follows:

  1. $\emptyset$ is regular.
  2. $\{\varepsilon\}$ is regular.
  3. $\{a\}$ for $a \in \Sigma$ is regular.

Furthermore, if $L_1$ and $L_2$ are regular languages, then:

  1. $L_1 \cup L_2$ is regular.
  2. $L_1 L_2$ is regular.
  3. $L_1^*$ is regular.

This is an interesting definition because it tells us how to construct languages. This is in contrast to when we were talking about DFAs and NFAs—there, we were primarily focused on how to recognize languages.

What exactly do these definitions tell us? The base cases tell us what we start with: atomic pieces like the empty string and single letters. The inductive cases tell us how we can put these pieces together.

But why are these called regular languages? It's mostly because Kleene called them this when he first defined them in 1956, and no one has bothered to come up with a better name since.

That's not actually entirely true, because there is a better name that shows up in the algebraic/French literature. Recall that the set of strings over our alphabet, together with concatenation, forms a monoid; in fact, it is exactly the free monoid on our alphabet. If we stop focusing on free monoids and consider arbitrary monoids, then the rational subsets of a monoid are those subsets built up from the finite subsets using the rational operations $\cup$, $\cdot$, and $*$ (here, $\cdot$ and $*$ are defined in terms of the monoid's binary operation, which is not necessarily concatenation).

From this, it makes a lot of sense to call regular languages rational languages instead, and there are many who do, particularly algebraic automata theorists and French mathematicians. But the prevailing terminology, especially among computer scientists, is going to be regular languages.

Regular expressions

Up to now, we've only considered finite automata as a way to describe a language. However, using finite automata as a description has its limitations, particularly for humans. All of the finite automata we've seen so far are reasonably understandable, but only because they are relatively small; finite automata used in practical applications can contain hundreds, thousands, or even more states. Luckily, our definition of regular languages will give us a more succinct and friendlier way to describe a regular language, via regular expressions.

A regular expression $\mathbf r$ over an alphabet $\Sigma$, together with its language $L(\mathbf r)$, is defined inductively by

  1. $\mathbf r = \mathtt{\emptyset}$; $L(\mathbf r) = \emptyset$,
  2. $\mathbf r = \mathtt{\varepsilon}$; $L(\mathbf r) = \{\varepsilon\}$,
  3. $\mathbf r = \mathtt{a}$ for $a \in \Sigma$; $L(\mathbf r) = \{a\}$,

and for regular expressions $\mathbf r_1$ and $\mathbf r_2$,

  1. $\mathbf r = \mathbf r_1+\mathbf r_2$; $L(\mathbf r) = L(\mathbf r_1) \cup L(\mathbf r_2)$,
  2. $\mathbf r = \mathbf r_1\mathbf r_2$; $L(\mathbf r) = L(\mathbf r_1) L(\mathbf r_2)$,
  3. $\mathbf r = \mathbf r_1^*$; $L(\mathbf r) = L(\mathbf r_1)^*$.
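
To see how this inductive definition behaves, here is a minimal Python sketch (the tuple encoding and the function name lang are illustrative choices, not standard notation) that enumerates the words of $L(\mathbf r)$ of length at most a bound $n$; each branch corresponds to one of the six cases above.

# Regular expressions as tagged tuples, mirroring the six cases above:
# ("empty",), ("eps",), ("sym", a), ("union", r1, r2), ("concat", r1, r2), ("star", r1)

def lang(r, n):
    """All words of L(r) of length at most n, following the inductive definition."""
    tag = r[0]
    if tag == "empty":
        return set()
    if tag == "eps":
        return {""}                      # ε is the empty string
    if tag == "sym":
        return {r[1]} if n >= 1 else set()
    if tag == "union":
        return lang(r[1], n) | lang(r[2], n)
    if tag == "concat":
        return {u + v for u in lang(r[1], n) for v in lang(r[2], n) if len(u + v) <= n}
    if tag == "star":
        base, words = lang(r[1], n), {""}
        while True:                      # keep appending words of the inner language
            new = {u + v for u in words for v in base if len(u + v) <= n} - words
            if not new:
                return words
            words |= new
    raise ValueError(f"unknown node {tag!r}")

# (a + b)* b : all words over {a, b} of length at most 3 that end in b
ab = ("union", ("sym", "a"), ("sym", "b"))
print(sorted(lang(("concat", ("star", ab), ("sym", "b")), 3)))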

Furthermore, we establish an order of operations, just like for typical arithmetic expressions, from highest to lowest precedence:

  1. Kleene star,
  2. concatenation,
  3. union,

and as usual, we can override this order in a particular expression by using parentheses.

This order, along with the notation of $\cdot$ and $+$, suggests that we can view regular expressions as algebraic expressions. In fact, the regular expressions (considered up to the notion of equivalence defined below) form a semiring. A semiring is a set equipped with two operations, which we refer to as "addition" and "multiplication", where addition is associative and commutative with an identity element, multiplication is associative with an identity element, multiplication distributes over addition, and the additive identity annihilates under multiplication. The regular expressions form a semiring because we take union to be addition with identity $\emptyset$ (i.e. $\mathbf r + \emptyset = \mathbf r$) and concatenation to be multiplication with identity $\varepsilon$ (i.e. $\mathbf r \cdot \varepsilon = \mathbf r$).

(If these are semirings, then what makes a ring? It is the requirement that every element has an additive inverse, i.e. for all $n$, there exists an element $-n$ such that $n + (-n) = 0$. An example of a ring that we are all familiar with is the integers.)

I want to highlight another bit of notation here. I use $\mathbf{boldface}$ to denote regular expressions and $\mathtt{monospace}$ to denote the symbols used in the regular expression. This is because there's often confusion about the difference between the regular expression itself and the language of the regular expression. A regular expression $\mathbf r$ over an alphabet $\Sigma$ should be considered a string over the alphabet $\{\mathtt a \mid a \in \Sigma\} \cup \{\mathtt (,\mathtt ),\mathtt \emptyset,\mathtt *,\mathtt +,\mathtt \varepsilon\}$, while the language of $\mathbf r$, $L(\mathbf r)$, is a set of words over $\Sigma$.

Sakarovitch, in Elements of Automata Theory, refers to this distinction as a bit of "tetrapyloctomy" but proceeds to point out that what we're dealing with is really the distinction between a syntactic object (the expression) and the semantic object (the language). As computer scientists, we are constantly having to navigate this distinction.

Consider the regular expression $\mathbf r = \mathtt{(a+b)^* b (a+b)(a+b)}$. The language of $\mathbf r$ is $L(\mathbf r)$, the language of strings over the alphabet $\{a,b\}$ with $b$ in the third from last position. So, $abbb \in L(\mathbf r)$. If we want to describe $L(\mathbf r)$ using the regular operations, we would have something like: \[L(\mathbf r) = (\{a\} \cup \{b\})^* \{b\} (\{a\} \cup \{b\}) (\{a\} \cup \{b\}).\]

We can see that this definition mirrors the definition of regular languages, and we can pretty easily conclude that regular expressions describe exactly the class of regular languages. Again, we should note that we need to separate the notion of a regular expression from the language of the regular expression—a regular expression $\mathbf r$ is just a string, while $L(\mathbf r)$ is the set of words described by the regular expression $\mathbf r$.

Why does this matter? As with finite automata, one language can have multiple representations or descriptions. This is particularly true for regular expressions, since they can be viewed algebraically.

Two regular expressions $\mathbf r$ and $\mathbf s$ over $\Sigma$ are equivalent if they describe the same language. That is, $\mathbf r \equiv \mathbf s$ if $L(\mathbf r) = L(\mathbf s)$.

This is the same kind of idea that you may have seen with logical equivalence: two logical formulas are said to be equivalent when they mean the same thing, but we don't say they're equal because they are not literally the same formula.

Here are some basic identities for regular expressions. We let $\mathbf r, \mathbf s, \mathbf t$ be regular expressions over $\Sigma$.
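
In particular, the following equivalences hold:

  1. $\mathbf r + (\mathbf s + \mathbf t) \equiv (\mathbf r + \mathbf s) + \mathbf t$ and $\mathbf r(\mathbf s \mathbf t) \equiv (\mathbf r \mathbf s)\mathbf t$ (associativity),
  2. $\mathbf r + \mathbf s \equiv \mathbf s + \mathbf r$ (commutativity of union),
  3. $\mathbf r + \emptyset \equiv \mathbf r$ and $\varepsilon \mathbf r \equiv \mathbf r \varepsilon \equiv \mathbf r$ (identity elements),
  4. $\emptyset \mathbf r \equiv \mathbf r \emptyset \equiv \emptyset$ (annihilation),
  5. $\mathbf r(\mathbf s + \mathbf t) \equiv \mathbf r \mathbf s + \mathbf r \mathbf t$ and $(\mathbf s + \mathbf t)\mathbf r \equiv \mathbf s \mathbf r + \mathbf t \mathbf r$ (distributivity),
  6. $\mathbf r(\mathbf s \mathbf r)^* \equiv (\mathbf r \mathbf s)^* \mathbf r$ (cyclicity).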

All of the identities except for cyclicity follow from regular expressions forming a semiring.

Here are a few more examples of some regular expressions and the languages they describe.
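
For instance, $\mathtt{a^*b^*}$ describes the words consisting of some number of $a$'s followed by some number of $b$'s; $\mathtt{(a+b)^*aa(a+b)^*}$ describes the words over $\{a,b\}$ that contain $aa$ as a substring; and $\mathtt{(aa+ab+ba+bb)^*}$ describes the words over $\{a,b\}$ of even length.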

The term regular expression may sound familiar to many of you, particularly if you've ever needed to do a bunch of pattern matching. And your guess is correct: these are (mostly) the same thing. For the most part, the regular expressions used in Unix and in various programming languages are based on our theoretically defined regular expressions. This leads to a few points of interest.

The first is that this implies a non-theoretical application for all of the stuff that we've been doing so far: pattern matching. If you want to search for some text, or more generally a pattern, then regular expressions are your tool of choice. But, and I hinted at this earlier, regular expressions are for us to read and write, while the computer turns them into finite automata and uses those for the actual text processing.
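
For instance, the expression $\mathbf r = \mathtt{(a+b)^* b (a+b)(a+b)}$ from earlier can be tried out with Python's re module (an illustrative choice of tool, not part of the formal development); the main syntactic difference is that the practical notation writes union as | rather than +.

import re

# The practical-regex analogue of (a+b)* b (a+b)(a+b); union is written "|".
pattern = re.compile(r"(a|b)*b(a|b)(a|b)")

# fullmatch checks that the entire string is in the language, not just some substring.
print(bool(pattern.fullmatch("abbb")))  # True: the third-from-last symbol is b
print(bool(pattern.fullmatch("aaab")))  # False: the third-from-last symbol is a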

Now, that may have been obvious because the name gave it away. However, there's a second classical application of regular languages that is similar: lexical analysis. Lexical analysis is the phase of the compiler that is responsible for breaking up strings of symbols into a sequence of tokens (which is why this is also sometimes called tokenization).

Tokens are discrete units that have some meaning. For instance, the string

if ( x > 2 ) {

could be tokenized as

[(KEYWORD, "if"), (SEPARATOR, "("), (IDENTIFIER, "x"), (OPERATOR, ">"), (LITERAL, "2"), (SEPARATOR, ")"), (SEPARATOR, "{")]

Now, what's important to keep in mind is that tokenization only gives meaning to the parts of the string, but it doesn't check whether the sequence of tokens actually makes "grammatical" sense. This is a sign of things to come.
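
To make this concrete, here is a minimal regex-based tokenizer sketch in Python; the token categories, patterns, and names (TOKEN_SPEC, tokenize) are illustrative choices that mirror the example above rather than any real language's lexical specification.

import re

# Each token category is given by a regular expression; order matters, since
# "if" should be matched as a KEYWORD before the IDENTIFIER pattern gets a chance.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:if|else|while)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("LITERAL",    r"\d+"),
    ("OPERATOR",   r"[<>=!]=?"),
    ("SEPARATOR",  r"[(){};]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Scan the input left to right, emitting (category, lexeme) pairs."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":            # throw away whitespace
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("if ( x > 2 ) {"))
# [('KEYWORD', 'if'), ('SEPARATOR', '('), ('IDENTIFIER', 'x'),
#  ('OPERATOR', '>'), ('LITERAL', '2'), ('SEPARATOR', ')'), ('SEPARATOR', '{')]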

Of course, there's an obvious analogy to natural language. In computational linguistics, regular languages provide a simple first approach to tokenizing natural language sentences, by identifying discrete words and possibly the kinds of words (nouns, verbs, etc.) they are. Again, such processing only attempts to provide meanings to chunks of strings, but doesn't enforce any particular rules about how these chunks should be arranged.

Regular operations and finite automata

A surprising fact about regular languages and languages recognized by DFAs is that they happen to be exactly the same—the class of DFA languages is the class of regular languages. At first glance, there's no obvious reason why this should be the case. After all, finite automata and regular expressions are totally different formalisms that have no obvious connection. But as we delve further into the class of regular languages, we'll see a surprising number of connections. First, we will describe a way to construct a finite automaton, given a regular language/expression.

Let $L \subseteq \Sigma^*$ be a regular language. Then there exists a DFA $M$ with $L(M) = L$.

We will show this by structural induction. Since we've already shown that $\varepsilon$-NFAs are equivalent to DFAs, we will actually just construct $\varepsilon$-NFAs for convenience.

First, for our base case, we can construct the following $\varepsilon$-NFAs for each of conditions (1–3):

Since these are pretty simple, we won't formally prove that they're correct, but you can try on your own if you wish.
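
As a stand-in for the pictures, here is one way to write these three machines down in Python. The representation is an illustrative choice of ours: an $\varepsilon$-NFA is a tuple (states, delta, starts, finals), where delta maps a (state, symbol) pair to a set of states and $\varepsilon$-transitions use the empty string as their symbol.

# ε-NFA as (states, delta, starts, finals); delta maps (state, symbol) to a set
# of states, and ε-transitions use "" as the symbol.

def nfa_empty():
    """Accepts ∅: a single non-final state with no transitions."""
    return ({0}, {}, {0}, set())

def nfa_epsilon():
    """Accepts {ε}: a single state that is both initial and final."""
    return ({0}, {}, {0}, {0})

def nfa_symbol(a):
    """Accepts {a}: an initial state with a single a-transition to a final state."""
    return ({0, 1}, {(0, a): {1}}, {0}, {1})

The same representation is reused in the construction sketches below.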

Now, we will show that we can construct an $\varepsilon$-NFA for each of conditions (4–6). For our inductive hypothesis, we let $L_A$ and $L_B$ be regular languages and assume that they can be accepted by DFAs $A$ and $B$, respectively.

Luckily, we've done a lot of the work up to this point.

First, we'll consider $L_A \cup L_B$. We've already seen how to construct a DFA for the union of the two languages by modifying the product construction. Here, we'll take the opportunity to describe an $\varepsilon$-NFA that does the same thing. Informally, we can make an NFA that looks like this:

Basically, we take our idea about running two finite automata in parallel literally. We start in the initial states of each machine and run them. If one of them accepts, then we accept.
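
Here is a sketch of that idea in code, using the same tuple representation as the base-case sketch above; relabeling the states of $B$ (here, by shifting integer state names) is just one way to keep the two state sets disjoint.

def nfa_union(A, B):
    """Disjoint union of two ε-NFAs: start in the initial states of both machines
    and run them in parallel; a word is accepted if either machine accepts it."""
    states_A, delta_A, starts_A, finals_A = A
    states_B, delta_B, starts_B, finals_B = B

    shift = max(states_A) + 1                 # relabel B's states to avoid clashes
    rel = lambda q: q + shift

    delta = {k: set(v) for k, v in delta_A.items()}
    for (q, a), targets in delta_B.items():
        delta[(rel(q), a)] = {rel(p) for p in targets}

    return (states_A | {rel(q) for q in states_B},
            delta,
            starts_A | {rel(q) for q in starts_B},
            finals_A | {rel(q) for q in finals_B})

# {a} ∪ {b}
A = ({0, 1}, {(0, "a"): {1}}, {0}, {1})
B = ({0, 1}, {(0, "b"): {1}}, {0}, {1})
print(nfa_union(A, B))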

Next, we'll consider $L_A L_B$. We can modify the NFA from Theorem 5.4 so that it uses $\varepsilon$-transitions instead.
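
In code, and under the same tuple representation, one standard way to realize this (a sketch, which may differ cosmetically from the Theorem 5.4 machine) is to add an $\varepsilon$-transition from every final state of $A$ to every initial state of $B$.

def nfa_concat(A, B):
    """Concatenation: keep A's initial states and B's final states, and glue the
    machines together with ε-transitions from each final state of A to each
    initial state of B."""
    states_A, delta_A, starts_A, finals_A = A
    states_B, delta_B, starts_B, finals_B = B

    shift = max(states_A) + 1                 # relabel B's states to avoid clashes
    rel = lambda q: q + shift

    delta = {k: set(v) for k, v in delta_A.items()}
    for (q, a), targets in delta_B.items():
        delta[(rel(q), a)] = {rel(p) for p in targets}
    for f in finals_A:
        delta.setdefault((f, ""), set()).update(rel(q) for q in starts_B)

    return (states_A | {rel(q) for q in states_B},
            delta,
            set(starts_A),
            {rel(q) for q in finals_B})

# {a} followed by {b} gives {ab}
A = ({0, 1}, {(0, "a"): {1}}, {0}, {1})
B = ({0, 1}, {(0, "b"): {1}}, {0}, {1})
print(nfa_concat(A, B))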

Finally, we consider $L_A^*$. For this, we'll break it out and prove it as its own theorem. Once we have proved it, we will have shown that we can construct $\varepsilon$-NFAs for $L_A \cup L_B$, $L_A L_B$, and $L_A^*$ and therefore, we can construct a DFA that accepts any given regular language.

The class of DFA languages is closed under Kleene star.

Let $A = (Q_A,\Sigma,\delta,q_{0,A},F_A)$ be a DFA. We can construct an $\varepsilon$-NFA that looks like this:

We define $A'$ by $A' = (Q_A \cup \{q_0'\}, \Sigma, \delta', \{q_0'\}, \{q_0'\})$, where the transition function $\delta' : (Q_A \cup \{q_0'\}) \times (\Sigma \cup \{\varepsilon\}) \to 2^{Q_A \cup \{q_0'\}}$ is defined by \[\delta'(q,a) = \{\delta(q,a)\} \text{ for } q \in Q_A \text{ and } a \in \Sigma, \qquad \delta'(q_0',\varepsilon) = \{q_{0,A}\}, \qquad \delta'(q,\varepsilon) = \{q_0'\} \text{ for } q \in F_A,\] with $\delta'$ taking the value $\emptyset$ everywhere else.

We modify $A$ to get $A'$ as follows. First, we create a new initial state $q_0'$ and also make it a final state, so that $\varepsilon$ is accepted even when $\varepsilon \not\in L_A$. Secondly, for each state in $F_A$, instead of adding it to the set of final states of $A'$, we add an $\varepsilon$-transition to the sole final state of $A'$, $q_0'$. While this is mostly cosmetic, it makes the construction clearer. Each word that is accepted by $A'$ must begin at $q_0'$ and must end at $q_0'$, making zero or more traversals through $A$, and each traversal of $A$ spells out a word accepted by $A$. Hence, each word accepted by $A'$ is a concatenation of zero or more words accepted by $A$ and is thus a word that belongs to $L_A^*$. Conversely, for any word $w = w_1 w_2 \cdots w_k$ with each $w_i \in L_A$, we can trace an accepting computation of $A'$ by following an accepting computation of $A$ on each $w_i$ and taking the $\varepsilon$-transition back to $q_0'$ after each one (and $\varepsilon$ itself is accepted at $q_0'$ without moving), so $L(A') = L_A^*$.
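
Here is a sketch of this construction in the same tuple representation as before, together with a small $\varepsilon$-NFA simulator so we can check it on a few words; the helper names (epsilon_closure, accepts, nfa_star) are illustrative.

def epsilon_closure(delta, S):
    """All states reachable from S using only ε-transitions."""
    stack, closure = list(S), set(S)
    while stack:
        q = stack.pop()
        for p in delta.get((q, ""), set()):
            if p not in closure:
                closure.add(p)
                stack.append(p)
    return closure

def accepts(nfa, w):
    """Simulate an ε-NFA (states, delta, starts, finals) on the word w."""
    states, delta, starts, finals = nfa
    current = epsilon_closure(delta, starts)
    for a in w:
        step = set()
        for q in current:
            step |= delta.get((q, a), set())
        current = epsilon_closure(delta, step)
    return bool(current & finals)

def nfa_star(A):
    """The construction from the proof: a fresh state q0' is both initial and final,
    q0' has an ε-transition to A's initial state(s), and every final state of A
    gets an ε-transition back to q0'."""
    states_A, delta_A, starts_A, finals_A = A
    q0 = max(states_A) + 1                         # the fresh state q0'
    delta = {k: set(v) for k, v in delta_A.items()}
    delta[(q0, "")] = set(starts_A)                # q0' --ε--> initial states of A
    for f in finals_A:
        delta.setdefault((f, ""), set()).add(q0)   # F_A --ε--> q0'
    return (states_A | {q0}, delta, {q0}, {q0})

# A accepts exactly {ab}; the construction should then accept (ab)*
A = ({0, 1, 2}, {(0, "a"): {1}, (1, "b"): {2}}, {0}, {2})
A_star = nfa_star(A)
print(accepts(A_star, ""))      # True: zero traversals of A
print(accepts(A_star, "abab"))  # True: two traversals of A
print(accepts(A_star, "aba"))   # False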

This specific construction is due to Ken Thompson in 1968 and is most commonly called the Thompson construction. Thompson was one of the early figures working on Unix and was, among other things, responsible for the inclusion of regular expressions in a lot of early Unix software like ed, making regexes ubiquitous among programmers. More recent work of his that you may be familiar with includes UTF-8 and the Go programming language.