CMSC 28000 — Lecture 10

The fact that we can continue to compute derivatives of derivatives suggests an obvious extension of derivatives to words. Let $\frac{d}{d\varepsilon}\left(\mathbf r\right) = \mathbf r$, and for $w = ax$, where $a \in \Sigma$ and $x \in \Sigma^*$, let $\frac{d}{dw}\left(\mathbf r\right) = \frac{d}{dx}\left(\frac{d}{da}\left(\mathbf r\right)\right)$. This leads to the following very interesting observation.

Let $\mathbf r$ be a regular expression and let $w \in \Sigma^*$. Then $w \in L\left(\mathbf r\right)$ if and only if $\varepsilon \in L\left(\frac{d}{dw}\left(\mathbf r\right)\right)$.

We have $\varepsilon \in L\left(\frac{d}{dw}\left(\mathbf r\right)\right) \iff \varepsilon \in w^{-1} L\left(\mathbf r\right) \iff w \varepsilon \in L\left(\mathbf r\right) \iff w \in L\left(\mathbf r\right).$
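This observation is directly an algorithm for testing membership. Here is a minimal sketch in Python; `deriv` (the single-symbol derivative) and `nullable` (does $L\left(\mathbf r\right)$ contain $\varepsilon$?) are assumed helper functions, sketched a bit later in these notes.

```python
def deriv_word(r, w):
    # d/dw(r) for w = ax: peel off one symbol at a time, since
    # d/d(ax)(r) = d/dx(d/da(r)); for w = epsilon, r is returned unchanged
    for a in w:
        r = deriv(r, a)   # deriv: single-symbol derivative (assumed helper)
    return r

def matches(r, w):
    # w is in L(r) if and only if epsilon is in L(d/dw(r))
    return nullable(deriv_word(r, w))   # nullable: is epsilon in L(r)?
```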

Let's see where this idea takes us.

So we seem to have arrived at two regular expressions that we've seen before. In fact, Brzozowski shows that this is guaranteed to happen.

Every regular expression $\mathbf r$ has a finite number of distinct derivatives, up to the associativity, commutativity, and idempotence of union.

This has a few implications. First of all, to compute all the derivatives of a regular expression $\mathbf r$, we can do what we've done and just keep on computing them until we start getting the same expressions. Obviously, this is not the most efficient way to do this, but we're guaranteed the computation will stop.

But thinking back to the theoretical point of view, recall that these are sets of suffixes of words in our language. And when we consume prefixes of words belonging to these sets, we get different sets, but there are only finitely many of these sets. And finally, we know that those sets that contain $\varepsilon$ correspond to words in the language. This looks suspiciously like the following.

Unlike the Thompson construction we saw before or the position automaton I mentioned briefly, this directly gives us a DFA. However, if we want to use derivatives to compute an NFA, one way to do this was proposed by Antimirov, who defined partial derivatives of a regular expression in 1996. This method, like the position automaton mentioned earlier, produces an NFA with roughly as many states as the regular expression has symbols.

Here's another example.

Let's compute the derivatives for the regular expression $\mathbf r = \mathtt{\left(10^*1 + 0\right)^*}$. We have for $0$, \begin{align*} \frac{d}{d0}\left(\mathbf r\right) &= \frac{d}{d0}\left(\mathtt{10^*1+0}\right) \mathbf r \\ &= \left(\frac{d}{d0}\left(\mathtt{10^*1}\right) + \frac{d}{d0}\left(\mathtt 0\right)\right) \mathbf r \\ &= \left(\frac{d}{d0}\left(\mathtt{10^*1}\right) + \varepsilon\right) \mathbf r \\ &= \left(\frac{d}{d0}\left(\mathtt 1\right)\mathtt{0^*1} + \mathsf c\left(\mathtt 1\right)\frac{d}{d0}\left(\mathtt {0^*1}\right) + \varepsilon\right) \mathbf r \\ &= \left(\emptyset \mathtt{0^*1} + \emptyset \frac{d}{d0}\left(\mathtt{0^*1}\right) + \varepsilon\right) \mathbf r \\ &= \varepsilon \mathbf r \\ &= \mathbf r \end{align*} So, $\frac{d}{d0}\left(\mathbf r\right) = \mathbf r$, which makes sense: reading $0$ allows us to start again from the beginning of the regular expression, since we can just take another trip through the star. But for $1$, we have \begin{align*} \frac{d}{d1}\left(\mathbf r\right) &= \frac{d}{d1}\left(\mathtt{10^*1+0}\right) \mathbf r \\ &= \left(\frac{d}{d1}\left(\mathtt {10^*1}\right) + \frac{d}{d1}\left(\mathtt 0\right)\right) \mathbf r \\ &= \left(\frac{d}{d1}\left(\mathtt {10^*1}\right) + \emptyset\right) \mathbf r \\ &= \left(\frac{d}{d1}\left(\mathtt 1\right)\mathtt{0^*1} + \mathsf c\left(\mathtt 1\right)\frac{d}{d1}\left(\mathtt{0^*1}\right)\right) \mathbf r \\ &= \left(\varepsilon \mathtt{0^*1} + \emptyset \frac{d}{d1}\left(\mathtt{0^*1}\right)\right) \mathbf r \\ &= \mathtt{0^*1} \mathbf r \\ &= \mathtt{0^*1 \left(10^*1 + 0\right)^*} \end{align*} This says that if we read $1$, then we are partway through matching the first term inside the star, and we must finish the rest of that term before going for another round through the star. Let's call this new RE $\mathbf r_1$ and consider its derivatives. For $0$, we have \begin{align*} \frac{d}{d0}\left(\mathbf r_1\right) &= \frac{d}{d0}\left(\mathtt{0^*1} \mathbf r\right) \\ &= \frac{d}{d0}\left(\mathtt{0^*}\right)\mathtt 1 \mathbf r + \mathsf c\left(\mathtt{0^*}\right) \frac{d}{d0}\left(\mathtt 1\mathbf r\right) \\ &= \frac{d}{d0}\left(\mathtt{0^*}\right)\mathtt 1 \mathbf r + \varepsilon \frac{d}{d0}\left(\mathtt 1\right)\mathbf r + \mathsf c\left(\mathtt 1\right)\frac{d}{d0}\left(\mathbf r\right) \\ &= \frac{d}{d0}\left(\mathtt{0^*}\right)\mathtt 1 \mathbf r + \emptyset \mathbf r + \emptyset \frac{d}{d0}\left(\mathbf r\right) \\ &= \frac{d}{d0}\left(\mathtt{0^*}\right)\mathtt 1 \mathbf r \\ &= \frac{d}{d0}\left(\mathtt 0\right) \mathtt{0^* 1} \mathbf r \\ &= \varepsilon \mathtt{0^*1} \mathbf r \\ &= \mathtt{0^* 1} \mathbf r \\ &= \mathbf r_1 \end{align*} Again, reading $0$ doesn't change the regular expression, because we're still in the middle of reading $0$s.
Then for $1$, we have \begin{align*} \frac{d}{d1}\left(\mathbf r_1\right) &= \frac{d}{d1}\left(\mathtt{0^*1} \mathbf r\right) \\ &= \frac{d}{d1}\left(\mathtt{0^*}\right)\mathtt 1 \mathbf r + \mathsf c\left(\mathtt{0^*}\right) \frac{d}{d1}\left(\mathtt 1\mathbf r\right) \\ &= \frac{d}{d1}\left(\mathtt{0^*}\right)\mathtt 1 \mathbf r + \varepsilon \frac{d}{d1}\left(\mathtt 1\right)\mathbf r + \mathsf c\left(\mathtt 1\right)\frac{d}{d1}\left(\mathbf r\right) \\ &= \frac{d}{d1}\left(\mathtt{0^*}\right)\mathtt 1 \mathbf r + \varepsilon \mathbf r + \emptyset \frac{d}{d1}\left(\mathbf r\right) \\ &= \frac{d}{d1}\left(\mathtt{0^*}\right)\mathtt 1 \mathbf r + \mathbf r \\ &= \frac{d}{d1}\left(\mathtt 0\right) \mathtt{0^* 1} \mathbf r + \mathbf r \\ &= \emptyset \mathtt{0^* 1} \mathbf r + \mathbf r \\ &= \mathbf r \end{align*} These derivatives give us the following DFA, with two states: $\mathbf r$, the start state, which is accepting since $\varepsilon \in L\left(\mathbf r\right)$, and $\mathbf r_1$; on $0$, each state loops back to itself, and on $1$, each state moves to the other.

This method lends itself quite nicely to implementation in functional programming languages. In fact, Stuart Kurtz's CMSC 28000 course notes on derivatives pointed me to a paper detailing the use of derivatives in the regular expression implementations of Standard ML and Scheme (which some of you probably know now as Racket).
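In that spirit, here is a minimal self-contained sketch in Python (rather than ML or Scheme) of the whole pipeline: regular expressions as nested tuples, smart constructors that apply simplifications like $\emptyset + \mathbf r = \mathbf r$ and $\varepsilon \mathbf r = \mathbf r$, the single-symbol derivative, and the DFA construction that enumerates derivatives until no new ones appear. All of the names here are illustrative; for arbitrary expressions, the smart constructors would also need to normalize unions up to associativity, commutativity, and idempotence, as in the theorem above, to guarantee that only finitely many derivatives arise.

```python
EMPTY, EPS = ('empty',), ('eps',)

def sym(a): return ('sym', a)

def union(r, s):
    # smart constructor: simplify as we build, so repeated derivatives
    # collapse to previously seen expressions
    if r == EMPTY: return s
    if s == EMPTY: return r
    if r == s: return r
    return ('union', r, s)

def cat(r, s):
    if r == EMPTY or s == EMPTY: return EMPTY
    if r == EPS: return s
    if s == EPS: return r
    return ('cat', r, s)

def star(r):
    if r in (EMPTY, EPS): return EPS
    if r[0] == 'star': return r
    return ('star', r)

def nullable(r):
    # is epsilon in L(r)?  (a Boolean version of c(r) from lecture)
    tag = r[0]
    if tag == 'eps': return True
    if tag in ('empty', 'sym'): return False
    if tag == 'union': return nullable(r[1]) or nullable(r[2])
    if tag == 'cat': return nullable(r[1]) and nullable(r[2])
    return True  # star is always nullable

def deriv(r, a):
    tag = r[0]
    if tag in ('empty', 'eps'): return EMPTY
    if tag == 'sym': return EPS if r[1] == a else EMPTY
    if tag == 'union': return union(deriv(r[1], a), deriv(r[2], a))
    if tag == 'cat':
        d = cat(deriv(r[1], a), r[2])
        return union(d, deriv(r[2], a)) if nullable(r[1]) else d
    return cat(deriv(r[1], a), r)  # star: d/da(s*) = d/da(s) s*

def to_dfa(r, alphabet):
    # states are the derivatives of r; explore until no new ones appear
    states, delta, todo = {r}, {}, [r]
    while todo:
        q = todo.pop()
        for a in alphabet:
            p = deriv(q, a)
            delta[(q, a)] = p
            if p not in states:
                states.add(p)
                todo.append(p)
    return states, delta, r, {q for q in states if nullable(q)}

# the example above: r = (10*1 + 0)*
r = star(union(cat(sym('1'), cat(star(sym('0')), sym('1'))), sym('0')))
states, delta, start, accepting = to_dfa(r, '01')
print(len(states))  # 2: r itself and 0*1(10*1 + 0)*, just as computed above
```

With these definitions, the `deriv_word` and `matches` functions sketched earlier work as written.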

How I first learned about derivatives was as an undergrad stumbling upon the first version of Matt Might's post about his paper on using derivatives for parsing. It's quite accessible if you have some formal languages and algorithms knowledge. He also has his own implementation of a very simple regular expression matcher that uses derivatives in Scheme, which should be easy enough to follow if you've done some functional programming.

Non-regular languages

There is a theme emerging here: finite automata are characterized by the fact that they can only remember a fixed, finite amount of information, via their states. And now we've seen that regular expressions, which characterize the regular languages, also exhibit some notion of finiteness through derivatives, and these derivatives correspond to the states of a DFA.

This suggests that regular languages organize themselves around some property that is finite, and that for any language that isn't regular, this property must be infinite. We've run into some of these barriers when trying to construct finite automata and regular expressions, so it's time to investigate more formally what it is that we can't accomplish with regular languages.

Recall that Kleene's theorem let us show that certain operations can be added to the regular expression language without changing the power of regular expressions. In other words, we've shown a number of closure properties—that the regular languages are closed under certain operations. This is why POSIX regular expressions have so many more operators than just union, concatenation, and star.

This sounds great, but it turns out that there are some features defined in POSIX that aren't regularity-preserving. A formal study of the power of such "practical" regular expressions was done by Câmpeanu, Salomaa, and Yu in 2003. In particular, backreferences allow regexes to recognize languages which are not regular. What are backreferences? Consider the following POSIX regular expression.

\([0-9]*\) \1

The [0-9]* matches any string that consists of 0 or more of the digits 0 through 9, so any number. Note that this subexpression is surrounded by escaped parentheses. This means the expression will capture whatever is matched by this subexpression. We can then refer to the matched string by the reference \1.

What does this match? Note that this does not mean that \1 matches anything matched by the subexpression. Rather, \1 will match the exact string that was matched by the subexpression. In other words, 315 346 is not matched by this expression, but 315 315 is. This means that it is possible to "remember" what was read in an early part of a string and use it later. It is this ability that we claim is not possible to express using regular expressions.
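To see this concretely, here's a quick check using Python's re module. Note that the Perl-style syntax Python follows writes the capturing group with bare parentheses, whereas POSIX basic regular expressions use the escaped parentheses shown above.

```python
import re

# Perl-style rendering of the POSIX expression \([0-9]*\) \1
pattern = re.compile(r'([0-9]*) \1')

print(bool(pattern.fullmatch('315 315')))  # True: \1 matches the exact captured string
print(bool(pattern.fullmatch('315 346')))  # False: the second number differs from the first
```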

But even with this additional power, these regular expressions still can't do things that we might hope, like parse HTML. And as mentioned in the discussion about lexical analysis and tokenization, regular expressions are insufficient for tasks like enforcing the order of a particular series of tokens.

The Pumping Lemma

We'll describe a condition that captures the idea that all regular languages must have this "finiteness" property: the Pumping Lemma for Regular Languages.

Let $L \subseteq \Sigma^*$ be a regular language. Then there exists a positive integer $n$ such that for every string $w \in L$ with $|w| \geq n$, there exists a factorization $w = xyz$ with $x,y,z \in \Sigma^*$ such that

  1. $y \neq \varepsilon$,
  2. $|xy| \leq n$, and
  3. $xy^iz \in L$ for all $i \in \mathbb N$.

On first contact, the statement of the pumping lemma can be very intimidating with a ton of quantifiers, but the idea behind it is quite simple. Here, $n$ is often referred to as the pumping length of $L$ and it is considered an innate property of the regular language $L$. What the pumping lemma intends to say is that for a long enough string, there are some things that we know about it and other similar strings in $L$ if $L$ is regular.

Specifically, if $L$ is regular, then it is recognized by a DFA, and the DFA only has so many states, say $n$. If we read a long enough string, with at least $n$ symbols, then the DFA has to visit at least one of its $n$ states more than once at some point. This means that there's a loop in the DFA, and we should be able to repeat the part of the word that goes through the loop as much as we like, if the word really is recognized by the DFA. So let's walk through this in more detail as we try to prove the lemma.

Since $L$ is regular, there exists a DFA $A = (Q,\Sigma,\delta,q_0,F)$ that recognizes $L$. Let $n = |Q|$ be the number of states of $A$. If there are no strings in $L$ of length at least $n$, then we're done (but what does this say about the language $L$?).

So, we assume that there is a string in $L$ of length at least $n$, say $w = a_1 a_2 \cdots a_\ell$, with $a_i \in \Sigma$ for $1 \leq i \leq \ell$, where $\ell = |w| \geq n$. Since $w \in L$, we know that $w$ is accepted by $A$, meaning that there is a sequence of states $p_0, p_1, \dots, p_\ell \in Q$ where $p_0 = q_0$, $p_\ell \in F$, and $\delta(p_i, a_{i+1}) = p_{i+1}$ for $0 \leq i \leq \ell - 1$.

Note that since $\ell \geq n$, we have a sequence of states $p_0, \dots, p_n$ with $n+1$ states. However, by our definition of $A$, the state set $Q$ only contains $n$ distinct elements. Therefore, at least one of the states is repeated in the sequence. In other words, we have $p_s = p_t$ for some $0 \leq s \lt t \leq n$.

Now factor $w$ into three words

  1. $x = a_1 \cdots a_s$,
  2. $y = a_{s+1} \cdots a_t$,
  3. $z = a_{t+1} \cdots a_\ell$.

Here, we see that $y$ is non-empty, as required by condition (1) of the lemma, since $|y| = t - s \gt 0$ by our requirement that $s \lt t$. Then $xy$ satisfies condition (2) since $|xy| = t$ and we chose $t \leq n$.

To see that this factorization of $w$ also satisfies condition (3), we recall the sequence of states $p_0, \dots, p_\ell$. Then it is clear that by reading $x$, $y$, and $z$, we have the following path through states of $A$, $$q_0 = p_0 \xrightarrow{x} p_s \xrightarrow{y} p_t \xrightarrow{z} p_\ell \in F.$$ But since $p_s = p_t$, it is clear that we have $$q_0 = p_0 \xrightarrow{x} p_s \xrightarrow{y^i} p_t \xrightarrow{z} p_\ell \in F$$ for all $i \in \mathbb N$.

Thus, we have shown that any string $w \in L$ of length at least $n$ satisfies conditions (1) through (3).

It's important to note that what the pumping lemma says is that all regular languages satisfy these conditions. What it doesn't say is that a language satisfying these conditions must be regular; in fact, there are languages out there that do satisfy the pumping lemma, but aren't regular.

The pumping lemma is another result that was shown by Rabin and Scott in 1959 (along with the definition of NFAs, the product construction, and the subset construction that we saw earlier). As we'll see next, this result seems to be a fairly important tool, since it has a name and everything. But it turns out that in the paper, it was a fairly minor characterization, sandwiched between two other theorems, which is why this seemingly important result is only a lemma.

We usually refer to this as just "the pumping lemma" but this result is really the pumping lemma for regular languages. There are a bunch of similar results that work in the same way for many other language classes and this is now a standard tool in the theory of formal languages. We'll be seeing another one later on.

Using the Pumping Lemma

Now, we're going to take a look at some examples of languages that are not regular and use the pumping lemma to prove that they're not regular. We will be using contradiction to do this, which means that we'll need to negate that very heavily quantified statement. First, an example.

Let $L = \{a^k b^k \mid k \geq 0\}$. Then $L$ is not regular.

Suppose $L$ is regular and let $n > 0$ be the pumping length for $L$. Let $w = a^n b^n$. Since $|w| = 2n \geq n$, the pumping lemma gives us a factorization $w = xyz$ with $y \neq \varepsilon$ and $|xy| \leq n$. Since the first $n$ symbols of $w$ are all $a$'s, we must have $x = a^s$ and $y = a^t$ with $t > 0$ (so $y \neq \varepsilon$) and $s + t \leq n$. Then we can write $w = a^s a^t a^{n-s-t} b^n$. To satisfy the conditions of the pumping lemma, we must have $a^s a^{i \cdot t} a^{n-s-t} b^n \in L$ for all $i \in \mathbb N$. However, $$s + i \cdot t + n - s - t = n + (i-1)t,$$ which is not equal to $n$ when $i \neq 1$. So take $i = 2$. Then $xy^2z = a^{n+t}b^n$. Since $n+t \neq n$, $xy^2z \not \in L$. Therefore $xy^iz \not\in L$ for $i \neq 1$, and $L$ does not satisfy the conditions of the pumping lemma. Thus, $L$ is not regular.

Let's use this example to break down the choices that were made in the proof. This involves looking carefully at the quantification of the statement and its negation.

Clause by clause, the pumping lemma and its negation read as follows.

  1. "There exists a positive integer $n$": $\exists n,\ n \gt 0$; negated, $\forall n,\ n \gt 0$.
  2. "for every string $w \in L$ with $|w| \geq n$": $\forall w,\ w \in L \wedge |w| \geq n$; negated, $\exists w,\ w \in L \wedge |w| \geq n$.
  3. "there exists a factorization $w = xyz$ such that $y \neq \varepsilon$ and $|xy| \leq n$": $\exists x, y, z,\ w = xyz \wedge y \neq \varepsilon \wedge |xy| \leq n$; negated, $\forall x, y, z,\ w = xyz \wedge y \neq \varepsilon \wedge |xy| \leq n$.
  4. "and $xy^iz \in L$ for all $i \in \mathbb N$": $\forall i,\ xy^iz \in L$; negated, $\exists i,\ xy^iz \not \in L$.

The negated form of the pumping lemma gives us a proof template in the form of a quantifier game, where your opponent chooses the $\forall$s and you choose the $\exists$s. We then view a proof in the following way:

  1. The opponent chooses a pumping length $n \gt 0$.
  2. We choose a word $w \in L$ with $|w| \geq n$.
  3. The opponent chooses a factorization $w = xyz$ with $|xy| \leq n$ and $y \neq \varepsilon$.
  4. We choose $i \in \mathbb N$.

Then we win if $xy^iz \not \in L$.

Here, we note that because we get to choose $w$, the choice of $w$ can make a difference in how easily the rest of the proof goes. Notice that we can choose any string in the language that we wish, as long as it has length at least $n$. An easy mistake to make is thinking that $w$ must have length exactly $n$.

The reason the choice of $w$ matters is that it determines the possible factorizations of $w$ into $xyz$. In order to defeat the pumping lemma, you must show that for your chosen word $w$, no factorization of $w$ satisfies the conditions of the pumping lemma; the sketch below checks this exhaustively for small cases.
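Here is a small brute-force illustration in Python of what "every factorization" means for the language of Proposition 10.5: for $w = a^n b^n$, it tries every factorization $w = xyz$ with $|xy| \leq n$ and $y \neq \varepsilon$ and checks that choosing $i = 2$ wins the game each time. This is only an illustration for small $n$, not a proof, since a proof has to handle all $n$ at once.

```python
def in_L(s):
    # membership in L = { a^k b^k : k >= 0 }
    k = len(s) // 2
    return s == 'a' * k + 'b' * k

def i_equals_2_always_wins(n):
    w = 'a' * n + 'b' * n
    # the opponent's options: every factorization w = xyz
    # with y nonempty and |xy| <= n
    for xy_len in range(1, n + 1):
        for x_len in range(xy_len):
            x, y, z = w[:x_len], w[x_len:xy_len], w[xy_len:]
            if in_L(x + y * 2 + z):  # our move: i = 2 must pump out of L
                return False
    return True

print(all(i_equals_2_always_wins(n) for n in range(1, 8)))  # True
```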

Suppose we have the language $L = (ab)^*$, which is regular. For any $n \gt 0$, choose $w = (ab)^n \in L$. Now choose the factorization $x = a$, $y = b$, $z = (ab)^{n-1}$. Then for $i \neq 1$, $xy^i z = ab^i (ab)^{n-1}$ is not in $L$, and we might mistakenly conclude that $L$ is not regular. However, choosing $x = \varepsilon$, $y = ab$, $z = (ab)^{n-1}$ gives us a factorization that does satisfy the pumping lemma.

In the proof of Proposition 10.5, by choosing $w = a^n b^n$, we simplified the rest of the proof, since all possible factorizations $xyz$ had $y = a^t$ for some $t \gt 0$. On the other hand, if we had chosen, say, $w = a^{\frac n 2} b^{\frac n 2}$, then we would have had to show that the pumping lemma would not be satisfied for factorizations with $y = a^t$, $y = a^s b^t$, and $y = b^t$ for $s,t \gt 0$. Of course, it is possible to prove that the pumping lemma doesn't hold via $w = a^{\frac n 2} b^{\frac n 2}$, but it is a lot more work.

Finally, we can also make use of the fact that we know that certain operations, when applied to regular languages, guarantee that the result is also regular. By assuming that our languages are regular, we can obtain a contradiction.

Let $L = \{ a^i b^j \mid i \neq j \}$. Then $L$ is not regular.

Suppose $L$ is regular. Then $\overline L$ is regular, since regular languages are closed under complement. Now consider the regular language $a^* b^*$. We have $$\overline L \cap a^* b^* = \{a^k b^k \mid k \geq 0\},$$ which we showed above is not regular. But the class of regular languages is closed under intersection, so $\overline L \cap a^* b^*$ would have to be regular, a contradiction. Therefore, $L$ is not regular.

In this example, it is important to note that simply taking the complement would not have given us the result we wanted: $\overline L$ is not the language of strings $a^k b^k$, because it also contains many strings that are not of the form $a^i b^j$ at all. However, we can "filter out" those other strings by taking advantage of the fact that regular languages are closed under intersection.