CMSC 28000 — Lecture 9

Non-regular languages

As alluded to last class, the backreference feature found in Posix regular expressions is actually more powerful than finite automata can model. What are backreferences? Consider the following Posix regular expression.

\([0-9]*\) \1

The [0-9]* means match any string that consists of 0 or more digits 0 through 9—so any number. Note that this subexpression is surrounded by escaped parens. This means it will capture whatever is matched by this expression. We can then refer to a matched string by the reference \1.

What does this match? Note that this does not mean that \1 matches anything matched by the subexpression. Rather, \1 will match the exact string that was matched by the subexpression. In other words, 315 346 is not matched by this expression, but 315 315 is. This means that it is possible to "remember" what was read in an early part of a string and use it later. It is this ability that we claim is not possible to express using regular expressions.

But even with this additional power, these regular expressions still can't do things that we might hope, like parse HTML. And as mentioned in the discussion about lexical analysis and tokenization, regular expressions are insufficient for tasks like enforcing the order of a particular series of tokens.

It should have been pretty obvious from the themes of the course that we are eventually going to run into languages that we can't recognize using a finite automaton. From the perspective of computation, this means that there are problems that we can't compute using a DFA. Of course, this is obvious: after all, DFAs have a finite amount of memory.

However, before trying to figure out how we can modify finite automata to give them more power in order to recognize such non-regular languges, we're going to study what non-regular languages might look like. After all, while reaching for a fancier model of computation might solve our problems, it doesn't help us identify and understand what makes a non-regular language impossible for a finite automaton to recognize.

The Pumping Lemma

We'll describe a condition that all regular languages satisfy: the Pumping Lemma. First, I'll give you the statement of the lemma, and then reassure you that it's not really as bad as it looks.

Let $L \subseteq \Sigma^*$ be a regular language. Then there exists a positive integer $n$ such that for every string $w \in L$ with $|w| \geq n$, there exists a factorization $w = xyz$ with $x,y,z \in \Sigma^*$ such that

  1. $y \neq \varepsilon$,
  2. $|xy| \leq n$, and
  3. $xy^iz \in L$ for all $i \in \mathbb N$.

To anyone who isn't familiar with it already, the statement of the pumping lemma can be very intimidating, but the idea behind it is quite simple. Here, $n$ is often referred to as the pumping length of $L$ and it is a property of the regular language $L$. What the pumping lemma intends to say is that for a long enough string, there are some things that we know about it and other similar strings in $L$ if $L$ is regular.

Specifically, if $L$ is regular, then it is recognized by a DFA and the DFA only has so many states, say $n$. If we read a long enough string, with more than $n$ symbols, then it has to reach at least one of the $n$ states more than once some point. This means that there's a loop in the DFA and we should be able to repeat the part of the word that goes through the loop as much as we like, if it really is recognized by a DFA. So let's walk through this in more detail as we try to prove the lemma.

Let $A = (Q,\Sigma,\delta,q_0,F)$ be the DFA that recognizes $L$ and let $n = |Q|$ be the number of states of $A$. If there are no strings that are of length at least $n$, then we're done (But what does this say about the language $L$?).

So, we assume that there is a string in $L$ with length at least $n$, say $w = a_1 a_2 \cdots a_\ell$, with $a_i \in \Sigma$ and $1 \leq i \leq \ell = |w| \geq n$. Since $w \in L$, we know that $w$ is accepted by $A$, meaning that there is a sequence of states $p_0, p_1, \dots, p_\ell \in Q$ where $p_0 = q_0$, $p_\ell \in F$, and $\delta(p_i, a_{i+1}) = p_{i+1}$ for $0 \leq i \leq \ell - 1$.

Note that since $\ell \geq n$, we have a sequence of states $p_0, \dots, p_n$ with $n+1$ states. However, by our definition of $A$, the state set $Q$ only contains $n$ distinct elements. Therefore, at least one of the states is repeated in the sequence. In other words, we have $p_s = p_t$ for some $0 \leq s < t \leq n$.

Now factor $w$ into three words

  1. $x = a_1 \cdots a_s$,
  2. $y = a_{s+1} \cdots a_t$,
  3. $z = a_{t+1} \cdots a_\ell$.

Here, we see that $y$ is non-empty, as required by condition (1) of the lemma, since $|y| = t-s$ by our requirement that $s < t$. Then $xy$ satisfies condition (2) since $|xy| = t$ and we set $t \leq n$.

To see that this factorization of $w$ also satisfies condition (3), we recall the sequence of states $p_0, \dots, p_\ell$. Then it is clear that by reading $x$, $y$, and $z$, we have the following path through states of $A$, $$q_0 = p_0 \xrightarrow{x} p_s \xrightarrow{y} p_t \xrightarrow{z} p_\ell \in F.$$ But since $p_s = p_t$, it is clear that we have $$q_0 = p_0 \xrightarrow{x} p_s \xrightarrow{y^i} p_t \xrightarrow{z} p_\ell \in F$$ for all $i \in \mathbb N$.

Thus, we have shown that any string $w \in L$ of length greater than $n$ satisfies conditions (1-3).

It's important to note that what the Pumping lemma says is that all regular languages will satisfy these conditions. What it doesn't say is that if a language satisfies these conditions, it must be regular. Basically, the Pumping lemma does not give a characterization for the regular languages.

The pumping lemma is another result that was shown by Rabin and Scott in 1959 (along with the definition of NFAs and the subset construction that we saw earlier). As we'll see, next, this result seems to be a fairly important tool, since it has a name and everything. But it turns out that in the paper, this was a fairly minor characterization, sandwiched in between two other theorems. Hence, why this seemingly important result is only a lemma.

We usually refer to this as just "the pumping lemma" but this result is really the pumping lemma for regular languages. There are a bunch of similar results that work in the same way for many other language classes. We'll be seeing another one later on.

Using the Pumping lemma

Now, we're going to take a look at some examples of languages that are not regular and use the pumping lemma to prove that they're not regular. We'll see how to do this by first considering the contrapositive of the pumping lemma.

Let $L \subseteq \Sigma^*$ be a language. If for all $n \in \mathbb N$, there exists a word $w \in L$ such that for every factorization $w = xyz$ with $y \neq \varepsilon$ and $|xy| \leq n$, there exists $i \in \mathbb N$ such that $xy^iz \not \in L$, then $L$ is not a regular language.

Suppose we have a language $L$ that we suspect may not be regular and we would like to confirm our suspicion. Well, if it is regular, then it must satisfy the pumping lemma. What we will do is assume that $L$ is regular and show that it can't satisfy the pumping lemma, obtaining a contradiction to our assumption that $L$ was regular.

Let $L = \{a^k b^k \mid k \geq 0\}$. Then $L$ is not regular.

Suppose $L$ is regular and let $n > 0$ be the pumping length for $L$. Let $w = a^n b^n$. Since $|w| = 2n \geq n$, we can factor $w = xyz$ with $x = a^s$ and $y = a^t$ with $t > 0$ so $y \neq \varepsilon$ and $s+t \leq n$ to satisfy the conditions of the pumping lemma. Then we can write $w = a^s a^t a^{n-s-t} b^n$. To satisfy the conditions of the pumping lemma, we must have $a^s a^{i \cdot t} a^{n-s-t} b^n \in L$ for all $i \in \mathbb N$. However, $$s + i \cdot t + n - s - t = n + (i-1)t$$ which is not equal to $n$ when $i \neq 1$. Therefore $xy^iz \not\in L$ for $i \neq 1$ and $L$ does not satisfy the conditions of the pumping lemma. Thus, $L$ is not regular.

Here, we note that the choice of $w$ makes a difference in how easy the rest of the proof goes. It's important to go over what the pumping lemma says again: the conditions apply for any string with length greater than $n$. So, according to the contrapositive, we can choose any string we wish as long as it has length greater than $n$. An easy mistake to make is thinking that $w$ must have length exactly $n$.

The reason your choice of $w$ matters is because of the conditions on the factorization of $w$ into $xyz$. Notice that the pumping lemma says that there exists a factorization such that the conditions hold. Again, based on the contrapositive, in order to defeat the pumping lemma, you must show that for your chosen word $w$, every factorization of $w$ cannot satisfy the conditions of the pumping lemma.

Suppose we have the language $L = (ab)^*$, which is regular. For any $n > 0$, choose $w = (ab)^n \in L$. Now choose the factorization $x = a, y = b, z = (ab)^{n-1}$. Then $xy^i z = ab^i (ab)^{n-1}$, which is not in $L$ and we might mistakenly conclude that $L$ is not regular. However, choosing $x = \varepsilon, y = ab, z = (ab)^{n-1}$ gives us a factorization that satisfies the pumping lemma.

In the proof of Proposition 9.3, by choosing $w = a^n b^n$, we simplified the rest of the proof, since all possible factorizations $xyz$ had $y = a^t$ for some $t > 0$. On the other hand, if we had chosen, say, $w = a^{\frac n 2} b^{\frac n 2}$, then we would have had to show that the pumping lemma would not be satisfied for factorizations with $y = a^t$, $y = a^s b^t$, and $y = b^t$ for $s,t > 0$. Of course, it is possible to prove that the pumping lemma doesn't hold via $w = a^{\frac n 2} b^{\frac n 2}$, but it is a lot more work.

The contrapositive form of the pumping lemma gives us a proof template in the form of a quantifier game, where the adversary chooses the $\forall$s and we choose the $\exists$s. We then view a proof in the following way:

  1. The adversary chooses a pumping constant $n \gt 0$.
  2. We choose a word $w \in L$ with $|w| \geq n$.
  3. The adversary chooses a factorization $w = xyz$ with $|xy| \leq n$ and $y \neq \varepsilon$.
  4. We choose $i \in \mathbb N$.

Then we win if $xy^iz \not \in L$.

Let $L = \{a^p \mid \text{$p$ is a prime number} \}$. Then $L$ is not regular.

Suppose $L$ is regular and let $n > 0$ be the pumping length for $L$. Since there are infinitely many primes, for any possible pumping length $n$, we can always choose a prime $p \geq n$. Let $w = a^p$.

Since $|w| = p \geq n$, we can factor $w = xyz$ with $|xy| \leq n$ and $y \neq \varepsilon$ and we have $y = a^k$ with $1 \leq k \leq n$, satisfying the conditions of the pumping lemma. Then if $L$ is regular, we must have $a^{p+(i-1)k} \in L$ for every $i \in \mathbb N$. However, if $i = p+1$, we have $p+pk = (k+1)p$, which is not prime. Therefore, $a^{p+(i-1)k} \not \in L$ for $i = p+1$ and therefore $w$ does not satisfy the conditions of the pumping lemma. Thus, $L$ is not regular.

The non-regularity of the language of binary strings representing prime numbers was shown by Hartmanis and Shank in 1967. While they mention the pumping lemma, they actually go on to show something even stronger, that this language is not even context-free.

We can also make use of the fact that we know that certain operations, when applied to regular languages, guarantee that the result is also regular. By assuming that our languages are regular, we can obtain a contradiction.

Let $L = \{ a^i b^j \mid i \neq j \}$. Then $L$ is not regular.

Suppose $L$ is regular. Then $\overline L$ is regular, since regular languages are closed under complement. Consider the language $a^* b^*$. We have $$\overline L \cap a^* b^* = \{a^k b^k \mid k \geq 0\},$$ which we showed was non-regular above. However, $a^* b^*$ is a regular language and the class of regular languages is closed under intersection, a contradiction. Therefore, $L$ is not regular.

In this example, it is important to note that simply taking the complement would not have given us the result we wanted.

Note that the pumping lemma is not an if and only if statement. In other words, the pumping lemma doesn't characterize the class of regular languages. There are languages that satisfy the pumping lemma but are not regular. However, this is a good starting point into how we might think about giving necessary and sufficient conditions for regularity.