MPCS 55001 — Lecture 11

Edit distance

A common problem in processing any data in the form of sequences (string data, streams, etc.) is determining the similarity of two sequences. A very common example of this nowadays is the AutoCorrect feature: on some level, your phone helpfully fixes your words based on some notion of similarity between two words. Intuitively, the words potato and tomato seem more similar to each other than the words eggplant and aubergine. But is there a way to quantify this?

Definition 11.1. An alphabet $\Sigma$ is a finite set of symbols and we define the set of words or strings over $\Sigma$ to be $\Sigma^*$. We denote by $\varepsilon$ the empty string, or the string of length 0. A distance on strings is a function $d: \Sigma^* \times \Sigma^* \to \mathbb R^+$ that satisfies the following:

  1. $d(x,y) = 0$ if and only if $x = y$,
  2. $d$ is symmetric; i.e. for all $x,y \in \Sigma^*, d(x,y) = d(y,x)$,
  3. $d$ satisfies the triangle inequality; i.e. for all $x,y,z \in \Sigma^*, d(x,z) \leq d(x,y) + d(y,z)$.

Such a function may also be called a metric in other contexts, and together with the set it operates on it defines a metric space.

The most common string distance is the edit distance or Levenshtein distance, due to Levenshtein in 1966, which is based on the idea of operations that transform one word into another.

Definition 11.2. An edit operation is a pair $(a,b)$ where $a, b \in \Sigma \cup \{\varepsilon\}$, where $a,b$ are not both $\varepsilon$. An edit operation $(a,b)$ with $a \neq b$ is called an error. There are three basic types of errors:

  1. insertions $(\varepsilon,b)$,
  2. deletions $(a,\varepsilon)$, and
  3. substitutions $(a,b)$.

An edit string is a string of edit operations. The cost of an edit string is the number of errors in the edit string. The edit distance $d(u,v)$ of two words $u$ and $v$ is the cost of the minimum cost edit string for $u$ and $v$.

From this definition, an edit string describes a series of operations that transforms one string into another. Of course, there could be multiple ways to do this, and the edit distance is the cost of the minimum-cost such edit string.

Example 11.3. Consider the word neighbourhood and the misspelling neighborhood. One possible edit string is \[(n,n) (e,e) (i,i) (g,g) (h,h) (b,b) (o,o) (u,r) (r,h) (h,o) (o,o) (o,d) (d,\varepsilon).\] This is hard to read, so we can write this vertically instead of horizontally and we get \[\frac nn \frac ee \frac ii \frac gg \frac hh \frac bb \frac oo \frac ur \frac rh \frac ho \frac oo \frac od \frac{d}{\varepsilon}.\] This edit string has 5 errors in it, but we can definitely find an edit string that gets us the same words with fewer errors, \[\frac nn \frac ee \frac ii \frac gg \frac hh \frac bb \frac oo \frac u{\varepsilon} \frac rr \frac hh \frac oo \frac oo \frac dd.\] So the edit distance of these words is 1, which matches our intuition that one of the words has an extra u in it.

The notion of an edit string makes for a very flexible way to define different kinds of distances on strings: for example, we can add gap penalties, allow transpositions as a single operation, and so on.

Problem 11.4. The edit distance problem is: Given two strings $u = a_1 \cdots a_m$ and $v = b_1 \cdots b_n$, find the edit distance $d(u,v)$.

Edit strings are more commonly called alignments in computational biology, which is why this problem is often also called the sequence alignment problem. The dynamic programming algorithm that we're going to see was rediscovered a few times within the span of a few years, and whom you attribute it to depends on your field.

Since I'm a formal language theorist, I knew this from Wagner and Fischer in 1974. If you're a computational biologist, you would know this as Needleman-Wunsch, from 1970 (this is the citation that Kleinberg and Tardos use). Even earlier was Vintsyuk's algorithm in 1968, which was introduced in the context of speech processing.

It is clear that we want to define our subproblems around the cost of the alignment. But how do we break down our subproblems? Luckily, we've set up our problem so that it is clear that all we need to consider is the final edit operation in our edit string. This maps onto one of four possibilities: it can be a non-error/identity operation (i.e. $(a,a)$) or one of the three errors (an insertion, deletion, or substitution).

Let $OPT(i,j)$ be the minimum cost of an alignment $E_{i,j}$ for strings $a_1 \cdots a_i$ and $b_1 \cdots b_j$. Ultimately, we want to compute $OPT(m,n)$. So consider $OPT(i,j)$ and consider the following possibilities for the final edit operation of $E_{i,j}$:

  1. $(\varepsilon,b_j)$ is an insertion. In this case, we incur the cost $d(\varepsilon,b_j)$ and the cost of an alignment on $a_1 \cdots a_i$ and $b_1 \cdots b_{j-1}$, since no symbols of $a_1 \cdots a_i$ were involved. This gives us $OPT(i,j) \geq d(\varepsilon,b_j) + OPT(i,j-1)$ and we have $OPT(i,j) \leq d(\varepsilon,b_j) + OPT(i,j-1)$ by choosing an optimal alignment for $a_1 \cdots a_i$ and $b_1 \cdots b_{j-1}$. Therefore, $OPT(i,j) = d(\varepsilon,b_j) + OPT(i,j-1)$.
  2. $(a_i,\varepsilon)$ is a deletion. In this case, we incur the cost $d(a_i,\varepsilon)$ and the cost of an alignment on $a_1 \cdots a_{i-1}$ and $b_1 \cdots b_j$, since no symbols of $b_1 \cdots b_j$ were involved. This gives us $OPT(i,j) \geq d(a_i,\varepsilon) + OPT(i-1,j)$ and we have $OPT(i,j) \leq d(a_i,\varepsilon) + OPT(i-1,j)$ by choosing an optimal alignment for $a_1 \cdots a_{i-1}$ and $b_1 \cdots b_j$. Therefore, $OPT(i,j) = d(a_i,\varepsilon) + OPT(i-1,j)$.
  3. $(a_i,b_j)$ is a substitution or an identity, depending on whether or not $a_i = b_j$. In this case, we incur the cost $d(a_i,b_j)$ together with the cost of the alignment on $a_1 \cdots a_{i-1}$ and $b_1 \cdots b_{j-1}$. This gives us $OPT(i,j) \geq d(a_i,b_j) + OPT(i-1,j-1)$ and we have $OPT(i,j) \leq d(a_i,b_j) + OPT(i-1,j-1)$ by choosing an optimal alignment for $a_1 \cdots a_{i-1}$ and $b_1 \cdots b_{j-1}$. Therefore, $OPT(i,j) = d(a_i,b_j) + OPT(i-1,j-1)$.

This gives us the following recurrence.

Proposition 11.5. $OPT(i,j)$ satisfies the following recurrence. \[OPT(i,j) = \begin{cases} 0 & \text{if $i = j = 0$,} \\ d(a_i,\varepsilon) + OPT(i-1,0) & \text{if $i \gt 0$ and $j = 0$,} \\ d(\varepsilon,b_j) + OPT(0,j-1) & \text{if $i = 0$ and $j \gt 0$,} \\ \min \begin{cases} d(a_i,\varepsilon) + OPT(i-1,j) \\ d(\varepsilon,b_j) + OPT(i,j-1) \\ d(a_i,b_j) + OPT(i-1,j-1) \end{cases} & \text{if $i,j \gt 0$.} \end{cases}\]

As mentioned earlier, if we wanted to compute the Levenshtein distance, we set the per-operation cost to $d(s,t) = 1$ for all $s \neq t$ (and $d(s,s) = 0$), so we could have just put 1s in the appropriate places in our recurrence. However, this formulation gives us the flexibility to do some fancier things like define different costs depending on whether the operation is an insertion, deletion, or substitution, or even define different costs for different pairs of symbols.
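To make the recurrence concrete, here is a small top-down sketch in Python (not from the lecture; names like `edit_distance` and `lev_cost` are just illustrative). It memoizes $OPT(i,j)$ directly and takes the per-operation cost as a parameter, so plugging in Levenshtein costs or something fancier is a one-line change.

    from functools import lru_cache

    def edit_distance(u, v, d):
        """Top-down evaluation of the recurrence in Proposition 11.5.

        d(s, t) is the per-operation cost; the empty string "" plays
        the role of epsilon.
        """
        @lru_cache(maxsize=None)
        def opt(i, j):
            if i == 0 and j == 0:
                return 0
            options = []
            if i > 0:                      # deletion (a_i, epsilon)
                options.append(d(u[i - 1], "") + opt(i - 1, j))
            if j > 0:                      # insertion (epsilon, b_j)
                options.append(d("", v[j - 1]) + opt(i, j - 1))
            if i > 0 and j > 0:            # substitution or identity (a_i, b_j)
                options.append(d(u[i - 1], v[j - 1]) + opt(i - 1, j - 1))
            return min(options)

        return opt(len(u), len(v))

    # Levenshtein costs: every error costs 1, identities cost 0.
    lev_cost = lambda s, t: 0 if s == t else 1
    print(edit_distance("neighbourhood", "neighborhood", lev_cost))  # 1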

This recurrence gives us the following algorithm.

    \begin{algorithm}
    \caption{Compute edit distance}
    \begin{algorithmic}
    \PROCEDURE{distance}{$u = a_1 \cdots a_m, v = b_1 \cdots b_n, d$}
        \STATE $C[0,0] \gets 0$
        \FOR{$i$ from $1$ to $m$}
            \STATE $C[i,0] \gets d(a_i,\varepsilon) + C[i-1,0]$
        \ENDFOR
        \FOR{$j$ from $1$ to $n$}
            \STATE $C[0,j] \gets d(\varepsilon,b_j) + C[0,j-1]$
        \ENDFOR
        \FOR{$i$ from $1$ to $m$}
            \FOR{$j$ from $1$ to $n$}
                \STATE $C[i,j] \gets \min\{d(a_i,b_j) + C[i-1,j-1], d(a_i,\varepsilon) + C[i-1,j], d(\varepsilon, b_j) + C[i,j-1]\}$
            \ENDFOR
        \ENDFOR
        \RETURN{$C[m,n]$}
    \ENDPROCEDURE
    \end{algorithmic}
    \end{algorithm}

Proposition 11.7. Algorithm 11.6 computes $OPT(i,j)$.

Proof. We will show that $C[i,j] = OPT(i,j)$ for $i \leq m$ and $j \leq n$ by induction on $(i,j)$. For $(i,j) = (0,0)$, we have $C[0,0] = 0 = OPT(0,0)$. Now consider $(i,j)$ with $i \gt 0$ or $j \gt 0$, and assume that $C[i',j'] = OPT(i',j')$ for all $(i',j')$ with $C[i',j']$ computed before $C[i,j]$. If $j = 0$, then $C[i,0] = d(a_i,\varepsilon) + C[i-1,0] = d(a_i,\varepsilon) + OPT(i-1,0) = OPT(i,0)$ by the inductive hypothesis and Proposition 11.5; the case $i = 0$ is symmetric. Otherwise, $i,j \gt 0$ and we have \[C[i,j] = \min\{d(a_i,b_j) + C[i-1,j-1], d(a_i,\varepsilon) + C[i-1,j], d(\varepsilon, b_j) + C[i,j-1]\}.\] By our inductive hypothesis, we have \[C[i,j] = \min\{d(a_i,b_j) + OPT(i-1,j-1), d(a_i,\varepsilon) + OPT(i-1,j), d(\varepsilon, b_j) + OPT(i,j-1)\}.\] Then by Proposition 11.5, we have $C[i,j] = OPT(i,j)$. $$\tag*{$\Box$}$$

Proposition 11.8. Algorithm 11.6 has a running time of $O(mn)$.

Proof. It is clear that the computation of each entry $C[i,j]$ is done in $O(1)$ time. The initialization loops take $O(m+n)$ time and the doubly nested loop fills $O(mn)$ entries, giving a total running time of $O(mn)$. $$\tag*{$\Box$}$$
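For comparison, here is a bottom-up Python sketch that mirrors Algorithm 11.6 (again, the function name is my own): it initializes the first row and column from the boundary cases of the recurrence and then fills the table $C$ one row at a time.

    def edit_distance_table(u, v, d):
        """Fill the (m+1) x (n+1) table C of Algorithm 11.6 bottom-up.

        The edit distance is C[m][n]; the whole table is returned so
        that an optimal alignment can be traced back later.
        """
        m, n = len(u), len(v)
        C = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):          # first column: delete a_1 ... a_i
            C[i][0] = d(u[i - 1], "") + C[i - 1][0]
        for j in range(1, n + 1):          # first row: insert b_1 ... b_j
            C[0][j] = d("", v[j - 1]) + C[0][j - 1]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                C[i][j] = min(
                    d(u[i - 1], v[j - 1]) + C[i - 1][j - 1],  # substitution/identity
                    d(u[i - 1], "") + C[i - 1][j],            # deletion
                    d("", v[j - 1]) + C[i][j - 1],            # insertion
                )
        return C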

Finally, we are faced with the question of how to recover our alignment from this computation. Let's consider an example.

Example 11.9. Consider a run of the algorithm on the words color and colours.

        ε   c   o   l   o   u   r   s
    ε   0   1   2   3   4   5   6   7
    c   1   0   1   2   3   4   5   6
    o   2   1   0   1   2   3   4   5
    l   3   2   1   0   1   2   3   4
    o   4   3   2   1   0   1   2   3
    r   5   4   3   2   1   1   1   2

Just like in our previous problems, we can recover the actual alignment that achieves the optimal cost from the decisions made in the recurrence. Each entry in the table is determined by the three entries immediately above it, to its left, and to its upper left, so starting from $C[m,n]$ we can trace back which of the three options produced each entry.
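Here is a sketch of that traceback in Python, reusing the hypothetical `edit_distance_table` and `lev_cost` helpers from the earlier sketches: starting from $C[m,n]$, at each step we move to whichever of the three neighboring entries accounts for the current value and record the corresponding edit operation.

    def recover_alignment(u, v, d):
        """Trace back through the table to recover one optimal edit string."""
        C = edit_distance_table(u, v, d)
        i, j = len(u), len(v)
        ops = []
        while i > 0 or j > 0:
            if i > 0 and j > 0 and C[i][j] == d(u[i - 1], v[j - 1]) + C[i - 1][j - 1]:
                ops.append((u[i - 1], v[j - 1]))   # substitution or identity
                i, j = i - 1, j - 1
            elif i > 0 and C[i][j] == d(u[i - 1], "") + C[i - 1][j]:
                ops.append((u[i - 1], ""))         # deletion
                i -= 1
            else:
                ops.append(("", v[j - 1]))         # insertion
                j -= 1
        return list(reversed(ops))

    # Example 11.9: for "color" and "colours" this recovers an alignment with
    # two insertions (a u and an s), matching C[5][7] = 2 in the table above.
    print(recover_alignment("color", "colours", lev_cost))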

As with a lot of other fundamental problems in computer science, the edit distance problem is one that people have been working on for a long time. In particular, the question of whether we can do better than $O(n^2)$ time has been open, and there was some recent progress that puts it in the category of probably solved.

In 2015, Backurs and Indyk showed that if we could compute the edit distance in under $O(n^2)$ time (more specifically, in $O(n^{2 - \varepsilon})$ for some constant $\varepsilon \gt 0$), then that would disprove the Strong Exponential Time Hypothesis, which most of us believe is true.

More practically speaking, there is also the question of space usage. One of the things that might bug you about this algorithm is that it takes a lot of space. Imagine trying to compute how similar two words are in English and having to construct a table like we did. Intuitively, that seems like a lot of space to use up, even if it is still polynomial.

This becomes a practical issue as we move from something maybe kind of silly like words to something far more substantial like DNA sequences. In that case, quadratic space goes from being an annoyance to an impracticality. An algorithm due to Hirschberg in 1975 remedies this and is able to compute the edit distance of two strings using $O(n)$ space. It uses a combination of divide and conquer, shortest paths, and dynamic programming to achieve this and is discussed in KT 7.6.
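To get a feel for where the space savings come from, here is a sketch (with the same illustrative cost convention as the earlier snippets) of the easy half of that idea: each row of the table depends only on the previous row, so the distance itself can already be computed with two rows of length $n+1$. What Hirschberg's divide-and-conquer refinement adds, as discussed in KT 7.6, is recovering the alignment itself within the same space bound.

    def edit_distance_linear_space(u, v, d):
        """Compute the edit distance value in O(n) space by keeping two rows.

        This only recovers the distance; recovering an optimal alignment in
        linear space is what Hirschberg's divide and conquer adds on top.
        """
        m, n = len(u), len(v)
        prev = [0] * (n + 1)
        for j in range(1, n + 1):                  # row 0: insertions only
            prev[j] = d("", v[j - 1]) + prev[j - 1]
        for i in range(1, m + 1):
            curr = [0] * (n + 1)
            curr[0] = d(u[i - 1], "") + prev[0]    # column 0: deletions only
            for j in range(1, n + 1):
                curr[j] = min(
                    d(u[i - 1], v[j - 1]) + prev[j - 1],  # substitution/identity
                    d(u[i - 1], "") + prev[j],            # deletion
                    d("", v[j - 1]) + curr[j - 1],        # insertion
                )
            prev = curr
        return prev[n]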