CMSC 28000 — Lecture 11

Here's one more pumping lemma example before we move on.

Let $L = \{a^p \mid \text{$p$ is a prime number} \}$. Then $L$ is not regular.

Suppose $L$ is regular and let $n > 0$ be the pumping length for $L$. Since there are infinitely many primes, for any possible pumping length $n$, we can always choose a prime $p \geq n$. Let $w = a^p$.

Since $|w| = p \geq n$, the pumping lemma lets us factor $w = xyz$ with $|xy| \leq n$ and $y \neq \varepsilon$, so $y = a^k$ for some $1 \leq k \leq n$. The lemma then requires $xy^iz = a^{p+(i-1)k} \in L$ for every $i \geq 0$. But take $i = p+1$: the pumped string has length $p + pk = (k+1)p$, which is not prime, since $k+1 \geq 2$ and $p \geq 2$, so it is a product of two factors greater than 1. Therefore $a^{(k+1)p} \not\in L$, so $w$ fails the conditions of the pumping lemma, a contradiction. Thus, $L$ is not regular.
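The arithmetic at the heart of this argument can be checked mechanically. Here's a small Python sketch (the choice $p = 13$ is an arbitrary illustrative prime, not part of the proof) confirming that for every admissible split, pumping with $i = p+1$ produces a composite length:

```python
def is_prime(m):
    """Trial-division primality check."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

p = 13   # an illustrative prime, chosen with p >= n
n = 13   # a candidate pumping length

# For every admissible split w = xyz with |xy| <= n and y = a^k (k >= 1),
# pumping with i = p + 1 yields a string of length p + p*k = (k+1)*p.
assert is_prime(p)
for k in range(1, n + 1):
    pumped_length = p + p * k           # |x y^(p+1) z|
    assert pumped_length == (k + 1) * p
    assert not is_prime(pumped_length)  # composite: (k+1) >= 2 and p >= 2
```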

The non-regularity of the language of binary strings representing prime numbers was shown by Hartmanis and Shank in 1967. While they mention the pumping lemma, they go on to show something even stronger: this language is not even context-free.

The Myhill–Nerode Theorem

We've now seen that regular languages have a certain finiteness property, though so far we've only gestured at it. This property is clearly exhibited by the states of a finite automaton, and it comes out when we study the derivatives of a regular expression. It is present in the statement of the pumping lemma, and if a language fails to have it, we can conclude the language is not regular.

What is this property? It's time to try to clearly articulate it. We see that the states of finite automata and derivatives of regular expressions both try to "group" strings together and every string belongs to one of these groups. Suppose $L$ is a regular language. Then there's a DFA $A$ that accepts $L$ and there's a regular expression $\mathbf r$ that describes $L$. Then we know for any string $w \in \Sigma^*$, we always have that $\delta(q_0, w) \in Q$ and we can always compute the derivative $\frac{d}{dw} \mathbf r$. This suggests that we should start thinking of some kind of equivalence between these strings that is related to our language $L$.

Let's recall some definitions surrounding equivalence.

A relation $R$ over a set $S$ is a subset $R \subseteq S \times S$ and we write $x \mathrel R y$ if $(x,y) \in R$. A relation $R$ is an equivalence relation if it satisfies the following:

  1. Reflexivity. $\forall x, x \mathrel R x$.
  2. Symmetry. $\forall x,y, x \mathrel R y \rightarrow y \mathrel R x$.
  3. Transitivity. $\forall x,y,z, (x \mathrel R y) \wedge (y \mathrel R z) \rightarrow x \mathrel R z$.

An equivalence relation partitions $S$ into equivalence classes. An equivalence class with representative $x$ is $$[x] = \{y \in S \mid x \mathrel R y\}.$$ The index of a relation $R$ is the number of equivalence classes of $R$; $R$ is of finite index if it has finitely many equivalence classes. For two relations $R_1, R_2$ over $S$, we say $R_1$ is a refinement of (or refines) $R_2$ if $R_1 \subseteq R_2$ (i.e. $\forall x,y \in S, x \mathrel{R_1} y \rightarrow x \mathrel{R_2} y$). If $R_1$ refines $R_2$, then each equivalence class of $R_2$ is a union of some equivalence classes of $R_1$.

The equivalence relation that most of you should be familiar with is $\equiv_m$, or equivalence modulo $m$ for the integers. For instance, we know that $\equiv_2$ has two equivalence classes: the odd and even integers. This is a partition because every integer must be odd or even.

Another equivalence relation is $\equiv_4$, equivalence modulo 4, which has four equivalence classes $[0], [1], [2], [3]$ (so $\equiv_4$ has index 4). Observe that $\equiv_4$ is a refinement of $\equiv_2$. Why? Suppose $u \equiv_4 v \equiv_4 3$. Then they're clearly odd numbers, so this must mean that $u \equiv_2 v \equiv_2 1$. Similarly, if $u \equiv_4 v \equiv_4 2$, then they're even. But $\equiv_3$, equivalence modulo 3, is not a refinement of $\equiv_2$—we have $3 \equiv_3 6$ but $3 \not\equiv_2 6$.

In other words, we say that $\equiv_4$ refines $\equiv_2$ because it "breaks up" the equivalence classes into more equivalence classes: $[0]_2$ into $[0]_4$ and $[2]_4$, and $[1]_2$ into $[1]_4$ and $[3]_4$. The integers mod 4 are "finer" than the integers mod 2 because we are more "specific" (there are more equivalence classes) about the relationships between those integers.
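We can sanity-check these refinement claims on a finite sample of integers. This Python snippet (the range of test integers is an arbitrary choice) verifies that equivalence mod 4 refines equivalence mod 2, while equivalence mod 3 does not:

```python
# Check refinement on a sample range: u = v (mod m) iff (u - v) % m == 0.
S = range(-20, 21)

# Does u = v (mod 4) imply u = v (mod 2)?  Should be True.
refines_4_2 = all((u - v) % 2 == 0
                  for u in S for v in S if (u - v) % 4 == 0)

# Does u = v (mod 3) imply u = v (mod 2)?  Should be False (e.g. 3 and 6).
refines_3_2 = all((u - v) % 2 == 0
                  for u in S for v in S if (u - v) % 3 == 0)

print(refines_4_2)  # True
print(refines_3_2)  # False
```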

Now, let's define an equivalence relation based on a DFA.

Let $A = (Q, \Sigma, \delta, q_0, F)$ be a DFA. We define the relation $\sim_A$ for words $u,v \in \Sigma^*$ by $u \sim_A v$ if and only if $\delta(q_0,u) = \delta(q_0,v)$.

In other words, we can consider two words to be equivalent under $\sim_A$ if they reach the same state. Notice that a DFA naturally partitions all strings—every string gets taken to some state of the DFA. And the notion that the DFA considers two strings equivalent if they reach the same state makes sense: all it knows is that the two strings reached a particular state, but once a string enters the state, it doesn't know anything more about it, like how it got there.
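To make this concrete, here's a Python sketch using a toy DFA of my own choosing (two states, accepting strings with an even number of $a$'s; it does not appear in the lecture) that groups short strings into the blocks of $\sim_A$:

```python
from itertools import product

# A toy two-state DFA over {a, b}: state 0 = even number of a's (start,
# accepting), state 1 = odd. This is an assumed example for illustration.
delta = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1}
q0 = 0

def run(w):
    """Compute the extended transition function delta(q0, w)."""
    q = q0
    for c in w:
        q = delta[(q, c)]
    return q

# Group all strings of length <= 4 by the state they reach:
# these groups are exactly the (restrictions of the) blocks of ~_A.
classes = {}
for n in range(5):
    for w in map(''.join, product('ab', repeat=n)):
        classes.setdefault(run(w), []).append(w)

print(sorted(classes))  # at most |Q| blocks: here, states 0 and 1
```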

The relation $\sim_A$ is an equivalence relation on $\Sigma^*$.

  1. $\sim_A$ is reflexive: $\delta(q_0,x) = \delta(q_0,x)$ holds trivially, so $x \sim_A x$.
  2. $\sim_A$ is symmetric: $$x \sim_A y \iff \delta(q_0,x) = \delta(q_0,y) \iff \delta(q_0,y) = \delta(q_0,x) \iff y \sim_A x.$$
  3. $\sim_A$ is transitive: if $x \sim_A y$ and $y \sim_A z$, then $\delta(q_0,x) = \delta(q_0,y)$ and $\delta(q_0,y) = \delta(q_0,z)$, and therefore, $\delta(q_0,x) = \delta(q_0,z)$ and $x \sim_A z$.

Therefore, $\sim_A$ is an equivalence relation.

The relation $\sim_A$ has finite index.

Each equivalence class of $\sim_A$ corresponds to a distinct state of $A$ (the state reached by every word in the class), so the index of $\sim_A$ is at most $|Q|$, which is finite.

We say that a relation $\sim$ is right invariant if $\forall x,y,z, x \sim y \rightarrow xz \sim yz$.

The relation $\sim_A$ is right-invariant.

Suppose $x \sim_A y$ and let $z \in \Sigma^*$. Then \begin{align*} \delta(q_0,xz) &= \delta(\delta(q_0,x), z) \\ &= \delta(\delta(q_0,y), z) \\ &= \delta(q_0,yz) \end{align*} Therefore, $xz \sim_A yz$ and therefore $\sim_A$ is right-invariant.
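The same kind of toy setup can check right invariance exhaustively on short strings. Everything here (the two-state parity DFA, the length bound) is an assumed example, not from the lecture:

```python
from itertools import product

# Toy DFA: state = number of a's mod 2 (an assumed example).
delta = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1}

def run(w):
    """Extended transition function from the start state 0."""
    q = 0
    for c in w:
        q = delta[(q, c)]
    return q

words = [''.join(p) for n in range(4) for p in product('ab', repeat=n)]

# Right invariance: if x and y reach the same state, so do xz and yz
# for every suffix z (checked here for all strings of length <= 3).
assert all(run(x + z) == run(y + z)
           for x in words for y in words if run(x) == run(y)
           for z in words)
print("right invariance holds on the sample")
```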

The language $L(A)$ is the union of equivalence classes of $\sim_A$.

To show this, we show that if $x \sim_A y$, then $x \in L \iff y \in L$. That is, if $x$ and $y$ are in the same equivalence class, then either both are in the language or neither is. This means that if $x \in L$, then the entire equivalence class $[x]$ is contained in $L$. Suppose $x \sim_A y$. Then $$x \in L \iff \delta(q_0,x) \in F \iff \delta(q_0,y) \in F \iff y \in L.$$


The fact that an automaton naturally partitions all strings into equivalence classes is interesting, but it suggests that this should really be a property of the language. Why? Because a language can have many DFAs that accept it. All of these DFAs should induce similar kinds of partitions, so there should be a way to describe how strings are related with respect to the language itself; that is where an automaton gets its structure from. So we'll define a similar kind of relation based on the language rather than a particular automaton.

Let $L \subseteq \Sigma^*$ be a language. We define the relation $\equiv_L$ for words $x,y \in \Sigma^*$ by $x \equiv_L y$ if and only if $\forall z \in \Sigma^*, xz \in L \iff yz \in L$. $\equiv_L$ is the Myhill–Nerode equivalence relation.

Note that we can define $\equiv_L$ for any language, not just those that are regular. It's also important to note that the equivalence relation applies to all words, not just those in $L$.
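One way to explore $\equiv_L$ computationally is to approximate it: bucket words by their behavior on all suffixes up to a bounded length. The language below (strings with an even number of $a$'s) and the length bounds are illustrative assumptions, not from the lecture:

```python
from itertools import product

def in_L(w):
    """Membership in an illustrative regular language: even number of a's."""
    return w.count('a') % 2 == 0

# All strings over {a, b} of length <= 3, used both as test words and
# as candidate distinguishing suffixes.
words = [''.join(p) for n in range(4) for p in product('ab', repeat=n)]

def signature(x):
    """A word's behavior on the sampled suffixes; equal signatures
    approximate x =_L y (a true =_L check would need all suffixes)."""
    return tuple(in_L(x + z) for z in words)

print(len({signature(w) for w in words}))  # 2 classes: even vs odd a-count
```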

$\equiv_L$ is an equivalence relation.

  1. Reflexivity: for all $u \in \Sigma^*$, $xu \in L \iff xu \in L$, and therefore $x \equiv_L x$.
  2. Symmetry: \begin{align*} x \equiv_L y &\iff \forall u \in \Sigma^*, (xu \in L \iff yu \in L) \\ &\iff \forall u \in \Sigma^*, (yu \in L \iff xu \in L) \\ &\iff y \equiv_L x \\ \end{align*}
  3. Transitivity: Suppose $x \equiv_L y$ and $y \equiv_L z$. Then for all $u \in \Sigma^*$, $xu \in L \iff yu \in L$ and $yu \in L \iff zu \in L$. Thus, $xu \in L \iff zu \in L$ and therefore $x \equiv_L z$.


$\equiv_L$ is right-invariant.

Suppose that $x \equiv_L y$ and let $v \in \Sigma^*$; we show $xv \equiv_L yv$. For any $z \in \Sigma^*$, apply the definition of $\equiv_L$ with the suffix $vz$: $$(xv)z \in L \iff x(vz) \in L \iff y(vz) \in L \iff (yv)z \in L.$$ Since $z$ was arbitrary, $xv \equiv_L yv$.


What this leads to is the major result that we're interested in: a characterization of regular languages that directly invokes the finite property we've been looking for.

Let $L \subseteq \Sigma^*$ be a language. Then $L$ is regular if and only if the Myhill–Nerode relation $\equiv_L$ is of finite index. That is, $\equiv_L$ has finitely many equivalence classes.

Let's put a few things together. A DFA defines a right-invariant relation: strings that enter the same state are equivalent, and appending the same suffix to two equivalent strings takes them to the same state. So if $x \sim_A y$, then $xz \sim_A yz$, and $xz \in L$ iff $yz \in L$; that is, either both enter a final state or neither does. So there seems to be a connection between a DFA and the Myhill–Nerode relation.

We can also see how the Myhill–Nerode relation seems to say something very similar about two strings that happen to have the same derivative. Suppose for two strings $x$ and $y$ we have $\frac{d}{dx} \mathbf r = \frac{d}{dy} \mathbf r$. This says that the strings $z$ such that $xz \in L(\mathbf r)$ are exactly the same as for $yz \in L(\mathbf r)$.

So what's the finite property that the pumping lemma assumes of regular languages? It's really how many Myhill–Nerode equivalence classes they have. And so if a language has infinitely many Myhill–Nerode equivalence classes, it is not regular.
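As a closing sanity check, here's a sketch for the classic non-regular language $L = \{a^k b^k \mid k \geq 0\}$ (an illustrative choice, not a language analyzed above): the suffix $b^i$ distinguishes $a^i$ from $a^j$ whenever $i \neq j$, so $\equiv_L$ has infinitely many classes and $L$ cannot be regular.

```python
def in_L(w):
    """Membership in L = { a^k b^k : k >= 0 } (illustrative example)."""
    k = len(w) // 2
    return len(w) % 2 == 0 and w == 'a' * k + 'b' * k

# The suffix b^i puts a^i into L but keeps a^j (j != i) out of L,
# so a^0, a^1, ..., a^5 lie in pairwise distinct classes of =_L.
for i in range(6):
    for j in range(6):
        if i != j:
            z = 'b' * i                  # distinguishing suffix
            assert in_L('a' * i + z)
            assert not in_L('a' * j + z)
print("a^0, ..., a^5 are pairwise inequivalent under the Myhill-Nerode relation")
```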