CMSC 28000 — Lecture 8

Brzozowski Derivatives

Let's take a deeper dive into regular expressions. While we have a pretty good understanding of how a DFA processes a word, we've been comparatively vague about how to think about the same process when matching a word against a regular expression. For example, how would we talk about a word that has partially matched a regular expression, in the same way that we could think about a DFA that's halfway through reading a word?

To that end, we'll be discussing derivatives of regular expressions. These are derivatives in the sense that we'll be considering a regular expression after it has partially matched a word, or capturing how the expression changes as it matches a word.

First, we will need the following definition.

The quotient of a language $L$ by a word $w$ is the language $$w^{-1} L = \{x \in \Sigma^* \mid wx \in L\}.$$ Much like how ideas like concatenation, powers, and factors (subwords) invoke analogies to multiplication, the use of the term quotient in this case can be thought of in the same way. The quotient $w^{-1} L$ is what's left when we "factor out" $w$ from the front of the word.
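To make the quotient concrete, here is a small sketch in Python (the function name `quotient` is my own) that computes $w^{-1}L$ by brute force for a finite set of words:

```python
def quotient(w, L):
    """Compute the quotient w^{-1} L = {x : wx in L} of a finite set of words L."""
    return {x[len(w):] for x in L if x.startswith(w)}

# "Factoring out" prefixes from a small language:
L = {"ab", "aba", "ba"}
print(quotient("a", L))   # {'b', 'ba'}
print(quotient("ab", L))  # {'', 'a'}
```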

Note that we can take quotients on any language, not just those that are regular. However, a key observation that Janusz Brzozowski made in 1964 was that one can compute a quotient for a regular language via its regular expression. This is the genesis of the derivative.

First, we define a "helper" function to produce a regular expression depending on whether the language of a given regular expression contains $\varepsilon$.

Given a regular expression $\mathbf r$, we define a function $\delta$, which produces a regular expression, by $$\delta(\mathbf r) = \begin{cases} \varepsilon & \text{if $\varepsilon \in L(\mathbf r)$,} \\ \emptyset & \text{if $\varepsilon \not \in L(\mathbf r)$.} \end{cases}$$ We can apply this definition to each case of the definition for regular expressions to get $$\delta(\emptyset) = \emptyset, \qquad \delta(\varepsilon) = \varepsilon, \qquad \delta(a) = \emptyset \text{ for all } a \in \Sigma,$$

and for regular expressions $\mathbf r_1$ and $\mathbf r_2$, $$\delta(\mathbf r_1 + \mathbf r_2) = \delta(\mathbf r_1) + \delta(\mathbf r_2), \qquad \delta(\mathbf r_1 \mathbf r_2) = \delta(\mathbf r_1) \delta(\mathbf r_2), \qquad \delta(\mathbf r_1^*) = \varepsilon,$$ where each right-hand side simplifies to either $\varepsilon$ or $\emptyset$ via the identities $\varepsilon + \varepsilon = \varepsilon + \emptyset = \emptyset + \varepsilon = \varepsilon$, $\emptyset + \emptyset = \emptyset$, $\varepsilon\varepsilon = \varepsilon$, and $\emptyset \varepsilon = \varepsilon \emptyset = \emptyset\emptyset = \emptyset$.

Again, observe that the result of $\delta(\mathbf r)$ according to this definition is always a regular expression: it is always either $\varepsilon$ or $\emptyset$, depending on whether or not $\varepsilon$ is matched by $\mathbf r$. We use $\delta$ in our definition of the derivative as a sort of control, which activates certain terms depending on whether an expression matches $\varepsilon$.
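As a sketch of how $\delta$ might be implemented (the tuple encoding and the names `nullable` and `delta` are my own, not from any particular library):

```python
# Regular expressions encoded as tuples: ('empty',) for ∅, ('eps',) for ε,
# ('sym', a) for a ∈ Σ, ('alt', r, s) for r + s, ('cat', r, s) for rs,
# and ('star', r) for r*.

def nullable(r):
    """True exactly when ε ∈ L(r)."""
    tag = r[0]
    if tag in ('empty', 'sym'):
        return False
    if tag in ('eps', 'star'):
        return True
    if tag == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # 'cat'

def delta(r):
    """The helper δ: the expression ε if ε ∈ L(r), and ∅ otherwise."""
    return ('eps',) if nullable(r) else ('empty',)

print(delta(('star', ('sym', 'a'))))  # ('eps',)
print(delta(('sym', 'a')))            # ('empty',)
```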

So what exactly is a derivative? It is a regular expression. Suppose we have a regular expression $\mathtt{(a+b)ab(a+b)}$ and I read the first letter of a word, $b$. I want to know what the possible matches are after I have read/matched a $b$. For this expression, it is easy to tell: it is the words $aba$ and $abb$. But we can also use our regular expression to describe this set: it is the original expression without the first factor, $\mathtt{ab(a+b)}$. Now, this was simple to do for this expression, but we may want a more systematic way of computing these derivatives. The following definition tells us how.

Let $\mathbf r$ be a regular expression. Then the derivative of $\mathbf r$ with respect to $a \in \Sigma$, denoted $D_a(\mathbf r)$, is defined recursively by $$D_a(\emptyset) = \emptyset, \qquad D_a(\varepsilon) = \emptyset, \qquad D_a(a) = \varepsilon, \qquad D_a(b) = \emptyset \text{ for } b \in \Sigma \text{ with } b \neq a,$$

and for regular expressions $\mathbf r_1$ and $\mathbf r_2$, \begin{align*} D_a(\mathbf r_1 + \mathbf r_2) &= D_a(\mathbf r_1) + D_a(\mathbf r_2), \\ D_a(\mathbf r_1 \mathbf r_2) &= D_a(\mathbf r_1) \mathbf r_2 + \delta(\mathbf r_1) D_a(\mathbf r_2), \\ D_a(\mathbf r_1^*) &= D_a(\mathbf r_1) \mathbf r_1^*. \end{align*}

The claim here is that if I have a regular expression $\mathbf r$, applying this definition for a symbol $a \in \Sigma$ will give me a regular expression $D_a(\mathbf r)$ that describes all words $x$ such that $ax$ is matched by $\mathbf r$. Let's prove this.
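Before proving the claim, here is a sketch of the recursive definition in code, under a hypothetical tuple encoding of regular expressions (all names here are my own):

```python
# Regular expressions as tuples: ('empty',) for ∅, ('eps',) for ε,
# ('sym', a) for a ∈ Σ, ('alt', r, s) for r + s, ('cat', r, s) for rs,
# and ('star', r) for r*.

def nullable(r):
    """True exactly when ε ∈ L(r) (the language-level content of δ)."""
    tag = r[0]
    if tag in ('empty', 'sym'):
        return False
    if tag in ('eps', 'star'):
        return True
    if tag == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # 'cat'

def deriv(r, a):
    """D_a(r), case by case from the recursive definition."""
    tag = r[0]
    if tag in ('empty', 'eps'):
        return ('empty',)
    if tag == 'sym':
        return ('eps',) if r[1] == a else ('empty',)
    if tag == 'alt':
        return ('alt', deriv(r[1], a), deriv(r[2], a))
    if tag == 'cat':
        left = ('cat', deriv(r[1], a), r[2])
        # the δ(r1) D_a(r2) term contributes only when r1 is nullable
        return ('alt', left, deriv(r[2], a)) if nullable(r[1]) else left
    return ('cat', deriv(r[1], a), r)  # 'star'

# Deriving (a+b)ab(a+b) by b, then by the rest of "baba":
sa, sb = ('sym', 'a'), ('sym', 'b')
ab = ('alt', sa, sb)
r = ('cat', ab, ('cat', sa, ('cat', sb, ab)))
d = deriv(r, 'b')
for c in "aba":
    d = deriv(d, c)
print(nullable(d))  # True, since baba ∈ L((a+b)ab(a+b))
```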

Let $\mathbf r$ be a regular expression and let $a \in \Sigma$ be a symbol. Then, $$L(D_a(\mathbf r)) = a^{-1} L(\mathbf r) = \{x \in \Sigma^* \mid ax \in L(\mathbf r)\}.$$

We will prove this by induction on $\mathbf r$.

For the base cases, we have $$L(D_a(\emptyset)) = L(\emptyset) = \emptyset = a^{-1} \emptyset = a^{-1} L(\emptyset), \qquad L(D_a(\varepsilon)) = L(\emptyset) = \emptyset = a^{-1} \{\varepsilon\} = a^{-1} L(\varepsilon),$$ and for a symbol $b \in \Sigma$, $$L(D_a(b)) = \begin{cases} L(\varepsilon) = \{\varepsilon\} = a^{-1} \{b\} & \text{if $b = a$,} \\ L(\emptyset) = \emptyset = a^{-1} \{b\} & \text{if $b \neq a$.} \end{cases}$$

Now, let $\mathbf s$ and $\mathbf t$ be regular expressions and assume that for all $a \in \Sigma$, $L(D_a(\mathbf s)) = a^{-1}L(\mathbf s)$ and $L(D_a(\mathbf t)) = a^{-1}L(\mathbf t)$.

First, we have $\mathbf r = \mathbf s + \mathbf t$: \begin{align*} L(D_a(\mathbf s + \mathbf t)) &= L(D_a(\mathbf s)+ D_a(\mathbf t)) \\ &= L(D_a(\mathbf s)) \cup L(D_a(\mathbf t)) \\ &= a^{-1} L(\mathbf s) \cup a^{-1} L(\mathbf t) \\ &= \{x \in \Sigma^* \mid ax \in L(\mathbf s)\} \cup \{x \in \Sigma^* \mid ax \in L(\mathbf t)\} \\ &= \{x \in \Sigma^* \mid ax \in L(\mathbf s) \cup L(\mathbf t)\} \\ &= \{x \in \Sigma^* \mid ax \in L(\mathbf s + \mathbf t)\} \\ &= a^{-1} L(\mathbf s + \mathbf t) \end{align*}

Next, we have $\mathbf r = \mathbf s \mathbf t$. We will define $L(\mathbf s)^\circ = L(\mathbf s) - \{\varepsilon\}$. Note that $L(\mathbf s) = L(\mathbf s)^\circ \cup L(\delta(\mathbf s))$. \begin{align*} a^{-1} L(\mathbf s \mathbf t) &= \{x \in \Sigma^* \mid ax \in L(\mathbf s)L(\mathbf t)\} \\ &= \{x \in \Sigma^* \mid ax \in (L(\mathbf s)^\circ \cup L(\delta(\mathbf s))) L(\mathbf t)\} \\ &= \{x \in \Sigma^* \mid ax \in (L(\mathbf s)^\circ L(\mathbf t) \cup L(\delta(\mathbf s)) L(\mathbf t))\} \\ &= \{x \in \Sigma^* \mid ax \in L(\mathbf s)^\circ L(\mathbf t)\} \cup \{x \in \Sigma^* \mid ax \in L(\delta(\mathbf s)) L(\mathbf t)\} \\ &= \{uv \in \Sigma^* \mid au \in L(\mathbf s)^\circ, v \in L(\mathbf t)\} \cup a^{-1} L(\delta(\mathbf s)\mathbf t) \\ &= (a^{-1} L(\mathbf s)^\circ) L(\mathbf t) \cup a^{-1} L(\delta(\mathbf s)\mathbf t) \end{align*} The equality introducing $uv$ holds because every word of $L(\mathbf s)^\circ$ is nonempty, so the leading $a$ must be consumed by the $L(\mathbf s)^\circ$ factor. Here, we make the observation that $a^{-1} L(\mathbf s)^\circ = a^{-1} L(\mathbf s)$, since $a^{-1} \{\varepsilon\} = \emptyset$. Therefore, $(a^{-1} L(\mathbf s)^\circ) L(\mathbf t) = L(D_a(\mathbf s)\mathbf t)$. We also observe that $a^{-1} L(\delta(\mathbf s) \mathbf t) = L(\delta(\mathbf s) D_a(\mathbf t))$, since $\delta(\mathbf s)$ is either $\emptyset$ or $\varepsilon$. Putting this together, we get the following: \begin{align*} a^{-1} L(\mathbf s \mathbf t) &= (a^{-1} L(\mathbf s)^\circ) L(\mathbf t) \cup a^{-1} L(\delta(\mathbf s)\mathbf t) \\ &= L(D_a(\mathbf s) \mathbf t) \cup L(\delta(\mathbf s) D_a(\mathbf t)) \\ &= L(D_a(\mathbf s) \mathbf t + \delta(\mathbf s) D_a(\mathbf t)) \\ &= L(D_a(\mathbf s \mathbf t)) \end{align*}

Therefore, $L(D_a(\mathbf s \mathbf t)) = a^{-1}L(\mathbf s \mathbf t)$.

Finally, we have $\mathbf r = \mathbf s^*$. \begin{align*} L(D_a(\mathbf s^*)) &= L(D_a(\mathbf s)\mathbf s^*) \\ &= (a^{-1}L(\mathbf s)) L(\mathbf s^*) \\ &= \{xy \in \Sigma^* \mid ax \in L(\mathbf s), y \in L(\mathbf s^*)\} \\ &= \{w \in \Sigma^* \mid aw \in L(\mathbf s^*)\} \\ &= a^{-1} L(\mathbf s^*) \end{align*} For the second-to-last equality, note that any word $aw \in L(\mathbf s^*)$ is nonempty, so it factors as $aw = uv$ with $u \in L(\mathbf s)$ nonempty and $v \in L(\mathbf s^*)$; since $u$ is nonempty, it begins with $a$, so $u = ax$ with $ax \in L(\mathbf s)$ and $w = xv$.


In other words, if $\mathbf r$ is a regular expression, then $D_a(\mathbf r)$ is a regular expression representing the language of quotients $a^{-1} L(\mathbf r)$. We can define an obvious extension to words: let $D_\varepsilon(\mathbf r) = \mathbf r$, and for $w = ax$ with $a \in \Sigma$ and $x \in \Sigma^*$, let $D_w(\mathbf r) = D_{ax}(\mathbf r) = D_x(D_a(\mathbf r))$. This leads to the following very interesting observation.

Let $\mathbf r$ be a regular expression and let $w \in \Sigma^*$. Then $w \in L(\mathbf r)$ if and only if $\varepsilon \in L(D_w(\mathbf r))$.

We have $\varepsilon \in L(D_w(\mathbf r)) \iff \varepsilon \in w^{-1} L(\mathbf r) \iff w \varepsilon \in L(\mathbf r) \iff w \in L(\mathbf r).$
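This proposition gives a matching procedure: derive by each symbol of $w$ in turn, then test whether the final expression matches $\varepsilon$. A sketch, under the same hypothetical tuple encoding as the earlier sketches:

```python
# Tuples: ('empty',) for ∅, ('eps',) for ε, ('sym', a) for a ∈ Σ,
# ('alt', r, s) for r + s, ('cat', r, s) for rs, ('star', r) for r*.

def nullable(r):
    """True exactly when ε ∈ L(r)."""
    tag = r[0]
    if tag in ('empty', 'sym'):
        return False
    if tag in ('eps', 'star'):
        return True
    if tag == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # 'cat'

def deriv(r, a):
    """D_a(r), following the recursive definition."""
    tag = r[0]
    if tag in ('empty', 'eps'):
        return ('empty',)
    if tag == 'sym':
        return ('eps',) if r[1] == a else ('empty',)
    if tag == 'alt':
        return ('alt', deriv(r[1], a), deriv(r[2], a))
    if tag == 'cat':
        left = ('cat', deriv(r[1], a), r[2])
        return ('alt', left, deriv(r[2], a)) if nullable(r[1]) else left
    return ('cat', deriv(r[1], a), r)  # 'star'

def matches(r, w):
    """w ∈ L(r) iff ε ∈ L(D_w(r))."""
    for a in w:
        r = deriv(r, a)
    return nullable(r)

# a*b: any run of a's followed by a single b
r = ('cat', ('star', ('sym', 'a')), ('sym', 'b'))
print(matches(r, "aaab"))  # True
print(matches(r, "aba"))   # False
```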

Let's see where this idea takes us.

Let's compute the derivatives for the regular expression $\mathbf r = \mathtt{(0+1)^* 0}$. You may already recognize this as the language of binary strings that end in 0. This is admittedly pretty simple, but we'll see that the computations are quite long to do by hand. Let's start with $D_0(\mathbf r)$. \begin{align*} D_0(\mathbf r) &= D_0(\mathtt{(0+1)^*0}) \\ &= D_0(\mathtt{(0+1)^*}) \mathtt 0 + \delta(\mathtt{(0+1)^*})D_0(\mathtt 0) \\ &= D_0(\mathtt{(0+1)^*})\mathtt 0 + \varepsilon \varepsilon \\ &= D_0(\mathtt{(0+1)^*})\mathtt 0 + \varepsilon \\ &= D_0(\mathtt {0+1})\mathtt{(0+1)^*0} + \varepsilon \\ &= (D_0(\mathtt 0) + D_0(\mathtt 1))\mathtt{(0+1)^*0} + \varepsilon \\ &= \mathtt{(\varepsilon + \emptyset)(0+1)^*0 + \varepsilon} \\ &= \mathtt{\varepsilon (0+1)^*0 + \varepsilon} \\ &= \mathtt{(0+1)^*0 + \varepsilon} \end{align*} Here, the derivative tells us that if we read a $0$, we can either stop ($\varepsilon$) because we skipped the star term $\mathtt{(0+1)^*}$, so we have a word that matches our regular expression, or we can go back for another round through the star term. Let's call this regular expression $\mathbf r_1$. Before considering the derivatives of $\mathbf r_1$, let's consider $D_1(\mathbf r)$ first. \begin{align*} D_1(\mathbf r) &= D_1(\mathtt{(0+1)^*0}) \\ &= D_1(\mathtt{(0+1)^*})\mathtt 0 + \delta(\mathtt{(0+1)^*})D_1(\mathtt 0) \\ &= D_1(\mathtt{(0+1)^*})\mathtt 0+ \varepsilon \emptyset \\ &= D_1(\mathtt{(0+1)^*})\mathtt 0 \\ &= D_1(\mathtt{0+1})\mathtt{(0+1)^*0} \\ &= \mathtt{(\emptyset + \varepsilon)(0+1)^*0} \\ &= \mathtt{\varepsilon (0+1)^*0} \\ &= \mathtt{(0+1)^*0} \\ &= \mathbf r \end{align*} This derivative says that if we read a $1$, we can go through another round of the star term $\mathtt{(0+1)^*}$ but we haven't completed a match (i.e. $\varepsilon$ is not a possible match in this derivative). Now let's consider the derivatives on $\mathbf r_1$. We will see that our job will be slightly easier. 
\begin{align*} D_0(\mathbf r_1) &= D_0(\mathtt{(0+1)^*0 + \varepsilon}) \\ &= D_0(\mathtt{(0+1)^*0}) + D_0(\varepsilon) \\ &= D_0(\mathbf r) + \emptyset \\ &= \mathbf r_1 \end{align*} \begin{align*} D_1(\mathbf r_1) &= D_1(\mathtt{(0+1)^*0 + \varepsilon}) \\ &= D_1(\mathtt{(0+1)^*0}) + D_1(\varepsilon) \\ &= D_1(\mathbf r) + \emptyset \\ &= \mathbf r \end{align*}

So we seem to have arrived at two regular expressions that we've seen before. In fact, Brzozowski shows that this is guaranteed to happen.

Every regular expression $\mathbf r$ has a finite number of distinct derivatives, up to the associativity, commutativity, and idempotence of $+$.

This has a few implications. First of all, to compute all the derivatives of a regular expression $\mathbf r$, we can do what we've done and just keep on computing them until we stop getting new expressions (up to the identities above). Obviously, this is not the most efficient way to do this, but we're guaranteed the computation will stop.

But thinking back to the theoretical point of view, recall that these are sets of suffixes of words in our language. And when we consume prefixes of words belonging to these sets, we get different sets, but there are only finitely many of these sets. And finally, we know that those sets that contain $\varepsilon$ correspond to words in the language. This looks suspiciously like the following.

Unlike the Thompson construction we saw before or the position automaton I mentioned briefly, this directly gives us a DFA. However, if we want to use derivatives to compute an NFA, one way to do this was proposed by Antimirov, who, in 1996, defined partial derivatives of a regular expression. This method, like the position automaton that was mentioned earlier, produces an NFA of roughly the same size as the length of the regular expression.
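Here is a sketch of the derivative-to-DFA construction, again under the hypothetical tuple encoding; the `alt` and `cat` smart constructors do light simplification, standing in for the fuller normalization (associativity, commutativity, idempotence of $+$) needed to guarantee finitely many states in general:

```python
# Tuples: ('empty',) for ∅, ('eps',) for ε, ('sym', a) for a ∈ Σ,
# plus 'alt', 'cat', 'star' nodes built by the smart constructors below.

def nullable(r):
    tag = r[0]
    if tag in ('empty', 'sym'):
        return False
    if tag in ('eps', 'star'):
        return True
    if tag == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # 'cat'

def alt(r, s):
    """r + s, simplified: ∅ is a unit, and r + r = r."""
    if r == ('empty',): return s
    if s == ('empty',): return r
    if r == s: return r
    return ('alt', r, s)

def cat(r, s):
    """rs, simplified: ∅ annihilates, ε is a unit."""
    if r == ('empty',) or s == ('empty',): return ('empty',)
    if r == ('eps',): return s
    if s == ('eps',): return r
    return ('cat', r, s)

def deriv(r, a):
    tag = r[0]
    if tag in ('empty', 'eps'): return ('empty',)
    if tag == 'sym': return ('eps',) if r[1] == a else ('empty',)
    if tag == 'alt': return alt(deriv(r[1], a), deriv(r[2], a))
    if tag == 'cat':
        d = cat(deriv(r[1], a), r[2])
        return alt(d, deriv(r[2], a)) if nullable(r[1]) else d
    return cat(deriv(r[1], a), r)  # 'star'

def to_dfa(r, alphabet):
    """States are derivatives of r; transitions derive one symbol further."""
    states, trans, queue = {r}, {}, [r]
    while queue:
        s = queue.pop()
        for a in alphabet:
            t = deriv(s, a)
            trans[(s, a)] = t
            if t not in states:
                states.add(t)
                queue.append(t)
    return states, trans, {s for s in states if nullable(s)}

# (0+1)*0 has exactly two derivatives, giving a two-state DFA
r = ('cat', ('star', ('alt', ('sym', '0'), ('sym', '1'))), ('sym', '0'))
states, trans, accepting = to_dfa(r, "01")
print(len(states), len(accepting))  # 2 1
```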

Here's another example.

Let's compute the derivatives for the regular expression $\mathbf r = \mathtt{(10^*1 + 0)^*}$. We have for $0$, \begin{align*} D_0(\mathbf r) &= D_0(\mathtt{10^*1+0}) \mathbf r \\ &= (D_0(\mathtt{10^*1}) + D_0(\mathtt 0)) \mathbf r \\ &= (D_0(\mathtt{10^*1}) + \varepsilon) \mathbf r \\ &= (D_0(\mathtt 1)\mathtt{0^*1} + \delta(\mathtt 1)D_0(\mathtt {0^*1}) + \varepsilon) \mathbf r \\ &= (\emptyset \mathtt{0^*1} + \emptyset D_0(\mathtt{0^*1}) + \varepsilon) \mathbf r \\ &= \varepsilon \mathbf r \\ &= \mathbf r \end{align*} So, $D_0(\mathbf r) = \mathbf r$, which makes sense: reading $0$ allows us to start again from the beginning of the regular expression since we can just take the star. But for $1$, we have \begin{align*} D_1(\mathbf r) &= D_1(\mathtt{10^*1+0}) \mathbf r \\ &= (D_1(\mathtt {10^*1}) + D_1(\mathtt 0)) \mathbf r \\ &= (D_1(\mathtt {10^*1}) + \emptyset) \mathbf r \\ &= (D_1(\mathtt 1)\mathtt{0^*1} + \delta(\mathtt 1)D_1(\mathtt{0^*1})) \mathbf r \\ &= (\mathtt{\varepsilon 0^*1} + \emptyset D_1(\mathtt{0^*1})) \mathbf r \\ &= \mathtt{0^*1} \mathbf r \\ &= \mathtt{0^*1 (10^*1 + 0)^*} \end{align*} Here, this says that if we read $1$, then we are going to be partially matched to the first term inside the star, and we must finish the rest of the term before going for another round through the star. Let's call this new regular expression $\mathbf r_1$ and consider its derivatives.
For $0$, we have \begin{align*} D_0(\mathbf r_1) &= D_0(\mathtt{0^*1} \mathbf r) \\ &= D_0(\mathtt{0^*})\mathtt 1 \mathbf r + \delta(\mathtt{0^*}) D_0(\mathtt 1\mathbf r) \\ &= D_0(\mathtt{0^*})\mathtt 1 \mathbf r + \varepsilon (D_0(\mathtt 1)\mathbf r + \delta(\mathtt 1)D_0(\mathbf r)) \\ &= D_0(\mathtt{0^*})\mathtt 1 \mathbf r + \emptyset \mathbf r + \emptyset D_0(\mathbf r) \\ &= D_0(\mathtt{0^*})\mathtt 1 \mathbf r \\ &= D_0(\mathtt 0) \mathtt{0^*1} \mathbf r \\ &= \varepsilon \mathtt{0^*1} \mathbf r \\ &= \mathtt{0^*1} \mathbf r \\ &= \mathbf r_1 \end{align*} Again, we have that reading $0$ doesn't change the regular expression, because we're still in the middle of reading $0$s. Then for $1$, we have \begin{align*} D_1(\mathbf r_1) &= D_1(\mathtt{0^*1} \mathbf r) \\ &= D_1(\mathtt{0^*})\mathtt 1 \mathbf r + \delta(\mathtt{0^*}) D_1(\mathtt 1\mathbf r) \\ &= D_1(\mathtt{0^*})\mathtt 1 \mathbf r + \varepsilon (D_1(\mathtt 1)\mathbf r + \delta(\mathtt 1)D_1(\mathbf r)) \\ &= D_1(\mathtt{0^*})\mathtt 1 \mathbf r + \varepsilon \mathbf r + \emptyset D_1(\mathbf r) \\ &= D_1(\mathtt{0^*})\mathtt 1 \mathbf r + \mathbf r \\ &= D_1(\mathtt 0) \mathtt{0^*1} \mathbf r + \mathbf r \\ &= \emptyset \mathtt{0^*1} \mathbf r + \mathbf r \\ &= \mathbf r \end{align*} These derivatives give us the following DFA.
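As a sanity check on this example: $L(\mathbf r)$ here is the set of binary strings with an even number of $1$s, since each pass through the star consumes either a $0$ or a block $\mathtt{10^*1}$ containing two $1$s. A derivative-based matcher (same hypothetical tuple encoding as the earlier sketches) can be checked against that description exhaustively on short strings:

```python
from itertools import product

# Tuples: ('empty',) for ∅, ('eps',) for ε, ('sym', a) for a ∈ Σ,
# ('alt', r, s) for r + s, ('cat', r, s) for rs, ('star', r) for r*.

def nullable(r):
    tag = r[0]
    if tag in ('empty', 'sym'):
        return False
    if tag in ('eps', 'star'):
        return True
    if tag == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # 'cat'

def deriv(r, a):
    tag = r[0]
    if tag in ('empty', 'eps'):
        return ('empty',)
    if tag == 'sym':
        return ('eps',) if r[1] == a else ('empty',)
    if tag == 'alt':
        return ('alt', deriv(r[1], a), deriv(r[2], a))
    if tag == 'cat':
        left = ('cat', deriv(r[1], a), r[2])
        return ('alt', left, deriv(r[2], a)) if nullable(r[1]) else left
    return ('cat', deriv(r[1], a), r)  # 'star'

def matches(r, w):
    for a in w:
        r = deriv(r, a)
    return nullable(r)

# r = (10*1 + 0)*
zero, one = ('sym', '0'), ('sym', '1')
r = ('star', ('alt', ('cat', one, ('cat', ('star', zero), one)), zero))

# Compare against "even number of 1s" for all binary strings of length <= 6
ok = all(matches(r, w) == (w.count('1') % 2 == 0)
         for n in range(7)
         for w in (''.join(t) for t in product('01', repeat=n)))
print(ok)  # True
```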

This method lends itself quite nicely to implementation in functional programming languages. In fact, Stuart Kurtz's CMSC 28000 course notes on derivatives pointed me to a paper detailing the use of derivatives in the regular expression implementations of Standard ML and Scheme (which you probably now know as Racket).

How I first learned about derivatives was as an undergrad stumbling upon the first version of Matt Might's post about his paper on using derivatives for parsing. It's quite accessible if you have some formal languages and algorithms knowledge. He also has his own implementation of a very simple regular expression matcher that uses derivatives in Scheme, which should be easy enough to follow if you made it through CS 151 (or maybe don't look at it and try to implement it yourself).