MPCS 55001 — Lecture 6

Hashing, continued

Based on our definition, it's not actually clear that universal families of hash functions even exist, much less that they can be effectively computed and that we can efficiently make a random choice for them. Luckily, it is possible to define such a family of functions. For this, we will need to break out some number theory.

First, we select a prime $p$ to be the size of our hash table. If we want to have a table of size $n$, then simply choose some prime $p \geq n$. We assume that all of our items are encoded as integers. To ensure that we have an encoding for each element in our universe $U$, we choose an encoding length $r$ such that $p^r \geq |U|$. Then each element $u \in U$ is encoded as an integer \[x = x_1 x_2 \cdots x_r\] where each $x_i \in \{0,1,\dots,p-1\}$. In other words, $x$ is an integer in base $p$ of length $r$. By our choice of $r$, there are enough encodings for every element in $U$.

To define a hash function, we let $\mathcal A = \{0,\dots,p-1\}^r$ be the set of all $r$-length base $p$ integers. These are the same integers as our encodings, so an element $a \in \mathcal A$ has the form $a = a_1 \cdots a_r$, where $a_i \in \{0,\dots,p-1\}$. Then we define functions \[h_a(x) = \left( \sum_{i=1}^r a_i x_i \right) \pmod p\] and take $H = \{h_a \mid a \in \mathcal A\}$ to be our family of functions.

First, we note that while the idea of choosing a function at random is still a bit strange, what we've really done here is define a bunch of functions where the computation is pretty much the same for all of them, but the thing that changes is our choice of $a$. Here, $a$ is called a salt. In other words, we can think of our family of functions as a function that's parametrized by a choice of the salt. So to choose a function uniformly at random from this family, all we have to do is choose a salt uniformly at random, which is much easier to conceive of.
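To make this concrete, here is a minimal Python sketch of the construction; the function names make_hash and encode and the particular parameter values are mine, not part of the formal definition.

    import random

    def encode(x, p, r):
        """Write the integer key x as r base-p digits (digit order doesn't affect the argument)."""
        digits = []
        for _ in range(r):
            digits.append(x % p)
            x //= p
        return digits

    def make_hash(p, r):
        """Choose a salt a = (a_1, ..., a_r) uniformly at random and return h_a."""
        a = [random.randrange(p) for _ in range(r)]
        def h_a(x):
            return sum(ai * xi for ai, xi in zip(a, encode(x, p, r))) % p
        return h_a

    # Example: a table of (prime) size p = 97 for keys up to 97**4 - 1.
    p, r = 97, 4
    h = make_hash(p, r)
    print(h(123456))  # some bucket in 0, ..., 96

Choosing a new function from the family is then just choosing a new salt, i.e. calling make_hash again.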

It remains to show that this family is universal. First, we'll remind ourselves of a very useful fact about the field of integers over a prime modulus.

If $p$ is a prime and $a \not\equiv 0 \pmod p$, then $a$ has a unique multiplicative inverse mod $p$.

The family of functions $H = \{h_a \mid a \in \mathcal A\}$ is a universal family of hash functions.

Let $x = x_1 \cdots x_r$ and $y = y_1 \cdots y_r$ be encodings for two distinct elements of $U$. We want to show that \[\Pr(h_a(x) = h_a(y)) \leq \frac 1 p.\] Since $x \neq y$, there exists an index $j$ at which $x_j \neq y_j$. Suppose that $h_a(x) = h_a(y)$. Then we have \begin{align*} h_a(x) &\equiv h_a(y) &\pmod p \\ h_a(x) - h_a(y) &\equiv 0 &\pmod p \\ \sum_{i=1}^r a_i x_i - \sum_{i=1}^r a_i y_i &\equiv 0 &\pmod p \\ \sum_{i=1}^r a_i (x_i - y_i) &\equiv 0 &\pmod p \\ \sum_{i\neq j} a_i (x_i - y_i) + a_j (x_j-y_j) &\equiv 0 &\pmod p \\ \sum_{i\neq j} a_i (x_i - y_i) &\equiv - a_j (x_j-y_j) &\pmod p \\ \sum_{i\neq j} a_i (x_i - y_i) &\equiv a_j (y_j-x_j) &\pmod p \end{align*} We let $k = y_j - x_j$ and let $M = \sum_{i\neq j} a_i (x_i - y_i)$. This gives us $a_j k \equiv M \pmod p$ from the above equation.

Now consider the choice of our salt $a$. Since $a$ was chosen uniformly at random, we can assume that each digit $a_i$ was chosen independently, uniformly at random. Fix the choices of $a_i$ for all $i \neq j$; this fixes the value of $M$. We then consider the probability of choosing $a_j$ such that $a_j k \equiv M \pmod p$. Since $k$ is nonzero mod $p$, it has a unique multiplicative inverse, so this congruence admits exactly one solution, $a_j \equiv M k^{-1} \pmod p$. Since there are $p$ equally likely choices for $a_j$, the probability that $a_j$ takes exactly this value is $\frac 1 p$. Therefore, \[\Pr(h_a(x) = h_a(y)) \leq \frac 1 p.\]
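To see this last step concretely, here is a quick Python check with small, made-up values of $p$, $k$, and $M$ (the numbers are purely illustrative):

    p = 7          # a small prime (illustrative)
    k, M = 3, 5    # k = y_j - x_j is nonzero mod p; M is the rest of the sum

    # The unique solution of a_j * k = M (mod p) is M * k^{-1} mod p.
    a_j = (M * pow(k, -1, p)) % p   # pow(k, -1, p) is the modular inverse (Python 3.8+)

    # Exactly one of the p possible digits causes a collision.
    print(a_j, [a for a in range(p) if (a * k) % p == M % p])  # prints: 4 [4]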

 

Dynamic Programming

Other than randomization, the algorithmic design paradigms we've seen already are natural approaches to problems. Greedy algorithms operate on the idea of applying a single rule to make choices. Divide and conquer algorithms operate on the idea of dividing a problem into subproblems recursively and combining the results. But both of these approaches have weaknesses. Greedy strategies are often too simple for many problems, while divide and conquer algorithms may not be able to give meaningful improvements for problems with large solution spaces.

Our next algorithm design approach will seek to address these weaknesses. Dynamic programming is an approach first defined by Bellman in the 1950s. The apocryphal story behind the name suggests that Bellman wanted to make it sound more exciting to trick his superiors at RAND into thinking that this wasn't math. The programming part of the name is more instructive: it refers to planning (as it would have been used in the '50s) rather than our modern interpretation of coding.

Interval scheduling, revisited

Recall the interval scheduling problem:

The interval scheduling problem is: Given a list of $n$ jobs $j_1, j_2, \dots, j_n$, where each job $j_i$ has a start time $s(i)$ and a finish time $f(i)$, find the largest subset of mutually compatible jobs.

We will consider the weighted variant of this problem, where each job $j_i$ is also assigned a weight $w(i) \gt 0$.

The weighted interval scheduling problem is: Given a list of $n$ jobs $j_1, j_2, \dots, j_n$, where each job $j_i$ has a start time $s(i)$, a finish time $f(i)$ and a weight $w(i)$, find the maximum weight subset of mutually compatible jobs.

Of course, here weight can be any metric you like: cash monies, importance, etc. In other words, we don't consider every job to be equally valuable and there are jobs that we would really like to get done more than others.

Consider the following instance.

The earliest finish time algorithm will choose the jobs on the bottom row, each of weight 1, which causes the job above, with weight 1000, to be blocked, since it is incompatible.

In fact, no greedy approach will work for this problem, so we need to think about how to come up with a different approach. First, it helps to fix the order of our jobs, so that they are in order of finish time, $f(1) \leq \cdots \leq f(n)$.

Something that helps in these kinds of problems is to think backwards. For instance, suppose we have an optimal solution. Which jobs are in the solution? Is the last job, with finish time $f(n)$, in our solution?

We will need some definitions.

Let $p(j)$ be the largest index $i \lt j$ such that job $i$ is compatible with job $j$. Let $\mathcal O_j$ be a maximum weight subset of mutually compatible jobs chosen from jobs $1$ through $j$, and denote its weight by $OPT(j)$.

What we want to compute is $OPT(n)$ by considering the possible subsets of maximum weight jobs on fewer jobs. Let's return to our question about the last job. If $n$ is in $\mathcal O_n$, then this means that it can't contain any jobs that aren't compatible with $n$. However, if $n$ isn't in $\mathcal O_n$, then we would have to ask the same question of job $n-1$.

Let's generalize this. Consider the following two possibilities for $j$ and $\mathcal O_j$.

  1. $j \not \in \mathcal O_j$: This means that $\mathcal O_j$ consists only of jobs chosen from $1$ through $j-1$. This means that $\mathcal O_j$ is a possible solution for choosing a maximum weight set of jobs out of jobs $1$ through $j-1$. Therefore, $OPT(j) \leq OPT(j-1)$.
  2. $j \in \mathcal O_j$: Then the weight $w(j)$ is included in $OPT(j)$. We also know that $\mathcal O_j$ will not contain jobs that are incompatible with $j$, which means that none of the jobs $p(j) +1, \dots, j-1$ will be in $\mathcal O_j$. But then $\mathcal O_j \setminus \{j\}$ is a possible solution for choosing a maximum weight set of jobs out of jobs $1$ through $p(j)$. Together with the weight of job $j$, we have $OPT(j) - w(j) \leq OPT(p(j))$, which gives us $OPT(j) \leq w(j) + OPT(p(j))$.

These two observations give us the following recurrence: \[OPT(j) \leq \begin{cases} 0 & \text{if $j=0$,} \\ \max\{OPT(j-1), w(j) + OPT(p(j))\} & \text{if $j \gt 0$.} \end{cases} \]

This is not quite enough to get to what we want. So far, we've made observations about how our optimal solution $\mathcal O_j$ is constructed, which gives us the bounds that we've obtained so far. However, we still need to explicitly argue what many of you might have noticed already: that the solutions we're piecing together are made from the optimal solutions for the corresponding subproblems. Again, there are two cases.

  1. Consider the optimal solution for jobs $1$ through $j-1$, the set $\mathcal O_{j-1}$, which has cost $OPT(j-1)$. Since $\mathcal O_{j-1}$ is a subset of jobs from $1$ through $j-1$, it is also a subset of jobs from $1$ through $j$. But this means that $\mathcal O_{j-1}$ is a possible subset of jobs of maximum weight. Therefore, $OPT(j-1) \leq OPT(j)$.
  2. Consider the optimal solution for jobs $1$ through $p(j)$, the set $\mathcal O_{p(j)}$, which has cost $OPT(p(j))$. Since every job in this set has index at most $p(j)$, all of them are compatible with job $j$, so I know I can safely add job $j$ to this set. So I can consider the set $\mathcal O_{p(j)} \cup \{j\}$, which now has weight $OPT(p(j)) + w(j)$. This set is a subset of mutually compatible jobs from $1$ through $j$, so it is a possible subset of jobs of maximum weight. Therefore, $OPT(p(j)) + w(j) \leq OPT(j)$.

These observations now give us the following recurrence: \[OPT(j) \geq \begin{cases} 0 & \text{if $j=0$,} \\ \max\{OPT(j-1), w(j) + OPT(p(j))\} & \text{if $j \gt 0$.} \end{cases} \]

Together with the previous recurrence, we can finally conclude \[OPT(j) = \begin{cases} 0 & \text{if $j=0$,} \\ \max\{OPT(j-1), w(j) + OPT(p(j))\} & \text{if $j \gt 0$.} \end{cases} \]

Of course to prove this formally, we would need to add all the dressings that come with an inductive proof. Luckily, our argument above really constitutes most of the inductive case of such a proof—we just need to add our inductive hypothesis, that for all $k \lt j$, the set $\mathcal O_k$ is the maximum weight subset of jobs $1, \dots, k$ with weight $OPT(k)$.

As with all recurrences, our result is most straightforwardly implemented as a recursive algorithm. Here, we assume that there is some bigger program that has already sorted the jobs and computed $p(j)$ and has it readily available.

    \begin{algorithmic}
    \PROCEDURE{max-sched}{$j$}
        \IF{$j = 0$}
            \RETURN{$0$}
        \ELSE
            \RETURN{$\max(w(j) +$ \CALL{max-sched}{$p(j)$}, \CALL{max-sched}{$j-1$})}
        \ENDIF
    \ENDPROCEDURE
    \end{algorithmic}

This is a pretty simple algorithm whose correctness you can prove using induction. Great! Except this algorithm has a worst-case running time of $O(2^n)$. Wow, terrible!

So let's take a look at what's going on here. Consider the following instance.

Now, let's consider the following tree of calls that the algorithm makes in computing $OPT(n)$.

Note that when calculating $OPT(6)$ in this instance, we need to make calls to $OPT(5)$ and $OPT(4)$, which in turn make their own recursive calls. The result is a very large tree, which I've cut off for space purposes, but it is something close to a full binary tree. Each node is a recursive call that does only a constant amount of work outside of its recursive calls, but there are exponentially many such nodes, which means the total running time approaches $O(2^n)$.

However, it seems like there's a lot of repeated work going on here. Maybe we can take advantage of this. There are two main ways to do this.

Memoization

The first is called memoization, which if you say out loud sounds like how you'd pronounce "uwu memorization". The idea behind memoization is that instead of recomputing the result of a recursive call every time it is needed, you store it the first time you compute it, and future calls can look up the result rather than recompute it.

By giving access to a global array $M$ that stores the result of $OPT(j)$ in $M[j]$, initialized with $M[0] = 0$ and $M[j] = \varnothing$ for $j \geq 1$, we make an easy modification to our original algorithm to memoize it.

    \begin{algorithmic}
    \PROCEDURE{max-sched}{$j$}
        \IF{$M[j] = \varnothing$}
            \STATE $M[j] \gets \max(w(j) +$ \CALL{max-sched}{$p(j)$}, \CALL{max-sched}{$j-1$})
        \ENDIF
        \RETURN{$M[j]$}
    \ENDPROCEDURE
    \end{algorithmic}

How much better is this algorithm?

Algorithm 6.9 has worst-case running time $O(n)$.

There are two possible behaviours for a call to max-sched($j$). Either it immediately returns $M[j]$ because it was already initialized, or it initializes $M[j]$ and makes two recursive calls. Therefore, we need to determine how many recursive calls this algorithm can make.

We observe that recursive calls are made only when $M[j]$ is uninitialized, and once they are made, $M[j]$ becomes initialized. The number of initialized entries of $M$ starts at one (just $M[0]$) and increases by one whenever a pair of recursive calls is made. This means there can be at most $2n$ recursive calls, which gives a running time of $O(n)$.
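As an aside, in Python the same memoization can be had with a caching decorator rather than an explicit array. Here is a minimal sketch, assuming the jobs are already sorted by finish time and that w and p are 1-indexed lookups supplied by the caller (these names are mine):

    from functools import lru_cache

    def max_sched(n, w, p):
        """w[j]: weight of job j; p[j]: largest index i < j compatible with j (both 1-indexed)."""
        @lru_cache(maxsize=None)
        def opt(j):
            if j == 0:
                return 0
            return max(w[j] + opt(p[j]), opt(j - 1))
        return opt(n)

For very large $n$, the recursion depth can itself become a problem, which is one more reason to prefer the iterative version below.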

Iterating over subproblems

Memoization was nice for augmenting the obvious but bad recursive algorithm that we came up with, but it is not really what we think of when we think of dynamic programming.

The key observation is that our initial view of the problem was top-down: we wanted to compute $OPT(n)$ and $OPT(n)$ was computed based on computing the subproblems $OPT(n-1)$ and $OPT(p(n))$. However, we can compute these subproblems from the other direction, bottom-up, starting from 0.

    \begin{algorithmic}
    \PROCEDURE{max-sched}{$n$}
        \STATE $M[0] \gets 0$
        \FOR{$j$ from $1$ to $n$}
            \STATE $M[j] \gets \max(w(j) + M[p(j)], M[j-1])$
        \ENDFOR
        \RETURN{$M[n]$}
    \ENDPROCEDURE
    \end{algorithmic}

We can prove this algorithm is correct by induction, similar to how we would for a recursive algorithm.

Algorithm 6.11 correctly computes $OPT(n)$.

We will prove that $M[n] = OPT(n)$ by induction on $n$. First, consider $n = 0$. Then $M[0] = 0 = OPT(0)$ by definition.

Now assume that $M[j] = OPT(j)$ for all $j \lt n$. By definition, we have \[OPT(n) = \max\{w(n) + OPT(p(n)), OPT(n-1)\}.\] By our inductive hypothesis, we have \[OPT(n) = \max\{w(n) + M[p(n)], M[n-1]\},\] which is exactly the value the algorithm stores in $M[n]$, and therefore $OPT(n) = M[n]$.

One thing to keep in mind when doing these proofs is to take care not to mix up $M[j]$ and $OPT(j)$. That is, the crux of the proof is to show that $M[j]$ is $OPT(j)$—we only substitute them once we make this assumption (via the inductive hypothesis). The proofs that we'll be doing later on will clearly denote the value of the solution we want by $OPT$, and involve showing that our program computes and stores this value as intended.

Algorithm 6.11 clearly runs in $O(n)$ time: it iterates through a loop $n$ times, with the work inside each iteration taking $O(1)$ time. The only other observation that is necessary is that each iteration only refers to computations that have already been performed in previous iterations.
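For completeness, here is a Python sketch of the bottom-up algorithm that also fills in the steps we assumed were done elsewhere: sorting by finish time and computing $p(j)$ with a binary search. The function name and the job representation are mine, and I assume each job has a strictly positive duration, i.e. $s(j) \lt f(j)$.

    from bisect import bisect_right

    def max_weight_schedule(jobs):
        """jobs: list of (start, finish, weight) tuples; returns OPT(n)."""
        jobs = sorted(jobs, key=lambda job: job[1])   # order by finish time
        finishes = [f for s, f, w in jobs]
        n = len(jobs)

        # p[j] = largest index i < j with f(i) <= s(j), or 0 if there is none.
        # Since finishes are sorted and s(j) < f(j), every job finishing by s(j)
        # has index less than j, so counting them gives exactly p(j).
        p = [0] * (n + 1)
        for j in range(1, n + 1):
            p[j] = bisect_right(finishes, jobs[j - 1][0])

        # Fill M bottom-up so that M[j] = OPT(j).
        M = [0] * (n + 1)
        for j in range(1, n + 1):
            M[j] = max(jobs[j - 1][2] + M[p[j]], M[j - 1])
        return M[n]

    # The earlier counterexample: two weight-1 jobs would block a weight-1000 job.
    print(max_weight_schedule([(0, 2, 1), (2, 4, 1), (1, 3, 1000)]))  # prints: 1000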

The approach that we've used for this problem will be the same one that we'll deploy for other problems that we solve using dynamic programming. The most crucial step is to treat your problem as a recursive problem and to describe the solution to this problem recursively—i.e. as a recurrence.

In describing your problem as a recurrence, you identify the subproblems you need to solve. Once you have done this, it is a simple matter of storing the solutions to these subproblems, which are the earlier values of your recurrence. Once you have done this, proving that your algorithm is correct is a matter of proving that the algorithm correctly stores the solutions to your recurrence.

The process is summarized as follows:

  1. Solve your problem recursively and identify the subproblems that are required to solve your problem.
  2. Define and prove a recurrence for your problem based on the above.
  3. Define an algorithm that iteratively solves your recurrence.
  4. Prove that your algorithm works by a straightforward induction showing that the values that are computed correspond to the solutions of the recurrence.

Knapsack

The knapsack problem was first formulated by Dantzig in 1957, and he gave a basic dynamic programming algorithm for it. The idea is that you have a knapsack that can hold a certain amount of weight. You would like to fill it with items such that the value of the items you choose is maximized. Of course, we don't really care about filling literal knapsacks full of stuff—this just happens to be a useful metaphor.

Dantzig is well known for the Simplex algorithm for solving linear programming problems. He is also the source of the urban legend of the student who was late to class, copied down some equations, did them for homework, and inadvertently solved two unsolved problems: he was the (grad) student.

The 0-1 knapsack problem is: Given a list of $n$ items, where item $i$ has value $v_i \gt 0$ and weight $w_i \gt 0$, and a knapsack with weight limit $W$, choose the subset of items with the maximum total value, subject to the total weight of the chosen items being at most $W$.

For this problem, we will assume that values and weights are positive integers.

Consider the following items.

$i$     1    2    3    4    5
$w_i$   1    3    4    7   11
$v_i$   1    4    9   12   19

Suppose we can carry weight $12$. Then $\{1,5\}$ has weight 12 and value 20, while $\{1,3,4\}$ has weight 12 and value 22.

The question is how we should choose items. We can look to greedy strategies for a bit of inspiration: do we choose based on weight? On value? On both weight and value?

We can think through similar choices. We want to maximize the value of a solution to the problem, but the question is which choices we should be making. Suppose that we only consider the optimal value of a solution over the items $1, \dots, i$. This seems like it should work, but there is a crucial piece missing: how do we ensure that we don't exceed our weight limit? Similarly, we can try to optimize the value of a solution with respect to the available weight alone, but how do we formulate how this changes with each item that's chosen?

It turns out the answer is to do both.

Let $OPT(i,w)$ denote the value of an optimal solution $\mathcal O_{i,w}$ for the knapsack problem on items $1, \dots, i$ subject to weight limit $w$.

In essence, what we do is to consider two different subproblems at the same time: the subproblem of which items are chosen out of $1, \dots, i$ and the subproblem of which items to choose with the given weight limit $w$. The subproblem of selecting a subset of items $1, \dots, i$ is clear, since it's similar to the scheduling problem we saw earlier.

But how do we consider subproblems on weight? Luckily, we made the helpful assumption that all weights are integers, so we have subproblems ranging over $0 \leq w \leq W$. This gives us a definition with two variables, \[OPT(i,w) = \max_{S \subseteq \{1,\dots,i\}} \sum_{j \in S} v_j \quad \text{subject to } \sum_{j \in S} w_j \leq w.\]

Under this formulation, we want to compute $OPT(n,W)$. So, now we need to figure out how to express $OPT(i,w)$ in terms of smaller subproblems. Let us consider what our choices are regarding $OPT(i,w)$ when considering whether $i \in \mathcal O_{i,w}$.

  1. If $i \not \in \mathcal O_{i,w}$, then $\mathcal O_{i,w}$ must contain only items chosen from $1, \dots, i-1$, and we have $OPT(i,w) = OPT(i-1,w)$.
  2. If $i \in \mathcal O_{i,w}$, then this means that $v_i$ is added to our solution and we must add the weight $w_i$, which decreases the allowed weight in any subproblem. So we have $OPT(i,w) = v_i + OPT(i-1,w-w_i)$.

We then take $OPT(i,w) = \max\{OPT(i-1,w), v_i + OPT(i-1,w-w_i)\}$.

This gives us the following recurrence, which we'll prove formally.

$OPT(i,w)$ satisfies the following recurrence. \[OPT(i,w) = \begin{cases} 0 & \text{if $i = 0$ or $w = 0$}, \\ \max\{OPT(i-1,w), v_i + OPT(i-1,w-w_i)\} & \text{for $i,w \gt 0$}. \end{cases} \] Here, if $w_i \gt w$, then item $i$ cannot be taken and the second term is unavailable, so we take $OPT(i,w) = OPT(i-1,w)$ in that case; equivalently, treat $OPT(i-1,w-w_i)$ as $-\infty$ when $w - w_i \lt 0$.

We will prove this by induction on $i$ and $w$.

Our base case will be when $i = 0$ or $w = 0$. If $i = 0$, then there are no items to add, so $OPT(i,w) = 0$ for all $w$. If $w = 0$, then there is no space to add any items, since the weights of all items are strictly positive, so $OPT(i,w) = 0$ for all $i$.

Now, suppose $i, w \gt 0$. Assume for all $i' \lt i$ and $w' \leq w$ that $\mathcal O_{i',w'}$ is an optimal solution (i.e. with maximum value) with value $OPT(i',w')$. Let $\mathcal O_{i,w}$ be an optimal solution on $i$ items and weight limit $w$ with value $OPT(i,w)$. Let $w_i$ be the weight of item $i$ and let $v_i$ be the value of item $i$.

First, we argue that \[OPT(i,w) \leq \max\{OPT(i-1,w), v_i + OPT(i-1,w-w_i)\}.\] There are two cases, depending on whether or not $i$ is in $\mathcal O_{i,w}$: if $i \not\in \mathcal O_{i,w}$, then $\mathcal O_{i,w}$ is a feasible solution on items $1, \dots, i-1$ with limit $w$, so $OPT(i,w) \leq OPT(i-1,w)$; if $i \in \mathcal O_{i,w}$, then removing item $i$ leaves a feasible solution on items $1, \dots, i-1$ with limit $w - w_i$, so $OPT(i,w) - v_i \leq OPT(i-1,w-w_i)$.

Since we take the maximum over the two cases, the inequality is proved.

Next, we argue that \[OPT(i,w) \geq \max\{OPT(i-1,w), v_i + OPT(i-1,w-w_i)\}.\] There are two cases, but this time, we consider the value of the subproblems: the set $\mathcal O_{i-1,w}$ is also a feasible solution on items $1, \dots, i$ with limit $w$, so $OPT(i-1,w) \leq OPT(i,w)$; and if $w_i \leq w$, then adding item $i$ to $\mathcal O_{i-1,w-w_i}$ gives a feasible solution on items $1, \dots, i$ with limit $w$, so $v_i + OPT(i-1,w-w_i) \leq OPT(i,w)$.

Again, since we take the maximum over these two cases, the inequality is proved.

Together, both inequalities show that \[OPT(i,w) = \max\{OPT(i-1,w), v_i + OPT(i-1,w-w_i)\}\]

 

This recurrence leads to the following algorithm.

    \begin{algorithmic}
    \PROCEDURE{knapsack}{$W,I = \{(w_i,v_i) \mid 1 \leq i \leq n\}$}
        \FOR{$w$ from $0$ to $W$}
            \STATE $V[0,w] \gets 0$
        \ENDFOR
        \FOR{$i$ from $1$ to $n$}
            \FOR{$w$ from $0$ to $W$}
                \IF{$w_i \gt w$}
                    \STATE $V[i,w] \gets V[i-1,w]$
                \ELSE
                    \STATE $V[i,w] \gets \max\{V[i-1,w],v_i + V[i-1,w-w_i]\}$
                \ENDIF
            \ENDFOR
        \ENDFOR
        \RETURN{$V[n,W]$}
    \ENDPROCEDURE
    \end{algorithmic}

Recall Example 6.14 from above. Running this algorithm constructs the following table $V$.

$i$/$w$   0   1   2   3   4   5   6   7   8   9  10  11  12
0         0   0   0   0   0   0   0   0   0   0   0   0   0
1         0   1   1   1   1   1   1   1   1   1   1   1   1
2         0   1   1   4   5   5   5   5   5   5   5   5   5
3         0   1   1   4   9  10  10  13  14  14  14  14  14
4         0   1   1   4   9  10  10  13  14  14  16  21  22
5         0   1   1   4   9  10  10  13  14  14  16  21  22

Our solution, $OPT(n,W)$, is the entry in the bottom-right corner of the table: $V[5,12] = 22$.
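As a sanity check, here is a short Python version of the same table-filling; running it on the items of Example 6.14 reproduces this value (the function name is mine):

    def knapsack(W, weights, values):
        """weights[i], values[i] describe item i + 1; returns OPT(n, W)."""
        n = len(weights)
        V = [[0] * (W + 1) for _ in range(n + 1)]   # row 0 and column 0 stay 0
        for i in range(1, n + 1):
            w_i, v_i = weights[i - 1], values[i - 1]
            for w in range(1, W + 1):
                if w_i > w:
                    V[i][w] = V[i - 1][w]
                else:
                    V[i][w] = max(V[i - 1][w], v_i + V[i - 1][w - w_i])
        return V[n][W]

    # Example 6.14: weights 1, 3, 4, 7, 11 and values 1, 4, 9, 12, 19 with W = 12.
    print(knapsack(12, [1, 3, 4, 7, 11], [1, 4, 9, 12, 19]))  # prints: 22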

knapsack computes $OPT(n,W)$.

We will show that $V[i,w] = OPT(i,w)$ for $i \leq n$ and $w \leq W$ by induction on $(i,w)$. First, we consider as our base cases $i = 0$ and $w = 0$. We have that $V[0,w] = 0 = OPT(0,w)$ for all $w \leq W$ and $V[i,0] = 0 = OPT(i,0)$ for all $i \leq n$.

Now consider $i,w \gt 0$ and assume that $V[i',w'] = OPT(i',w')$ for all $(i',w')$ computed before $(i,w)$. First, suppose $w_i \gt w$. Then by line 7, we have $V[i,w] = V[i-1,w]$. By our inductive hypothesis, $V[i-1,w] = OPT(i-1,w)$. Since item $i$ does not fit within the weight limit $w$, we also have $OPT(i,w) = OPT(i-1,w)$, and therefore $V[i,w] = OPT(i,w)$ as desired.

If instead $w_i \leq w$, then by line 9, we have \[V[i,w] = \max\{V[i-1,w], v_i + V[i-1,w-w_i]\}\] and by our inductive hypothesis, we have \[V[i,w] = \max\{OPT(i-1,w), v_i + OPT(i-1,w-w_i)\},\] so $V[i,w] = OPT(i,w)$.

knapsack has running time $O(nW)$.

The work is done in the nested loops. Each of the computations within the loops can be done in $O(1)$ time. The inner loop is run $W$ times, while the outer loop is run $n$ times. This gives $O(nW)$ time in total.

Now, at this point, some of you may remember my note on the solution to this problem that there is no known polynomial time algorithm for knapsack, but here, we have an algorithm that runs in $O(nW)$ time! What's up with that?

The key here is to remember that the complexity of an algorithm is a function of the size of the input, so it's worth stepping back to ask what the size of this problem instance is. One part is obvious: the number of items that we have. The other part is not so obvious: what is the size of $W$?

A common mistake, as in this case, is to take $W$ as the size of $W$. But $W$ is the input, a number. The size of $W$ is the size of its representation, which is the number of bits of $W$ (or digits if you prefer to use some other number system). Either way, for any representation that isn't unary, the size of $W$ is $O(\log W)$.

The issue here then becomes clear: $W$ is not polynomial in $\log W$, so our $O(nW)$ bound is not polynomial in the size of the input. Algorithms that are polynomial in the numeric value of the input but not the size of the input are said to be pseudopolynomial.

Knapsack problems belong to a family of packing problems, where we are given a set of items with various properties and constraints and we are asked to choose a selection of items subject to those constraints while trying to maximize or minimize some property.

It becomes pretty clear that many of these problems are quite important, in that they can be repurposed very easily for various computational and real-life applications. However, they also happen to share the same unfortunate downside: we don't have efficient algorithms for many of these problems. This is where we begin to see glimpses of intractability rearing its ugly head.

Edit distance

A common problem in processing any data in the form of sequences (string data, streams, etc.) is determining the similarity of two sequences. A very common example of this nowadays is the AutoCorrect feature: on some level, your phone helpfully fixes your words based on some notion of similarity between two words. Intuitively, the words potato and tomato seem more similar to each other than the words eggplant and aubergine. But is there a way to quantify this?

An alphabet $\Sigma$ is a finite set of symbols and we define the set of words or strings over $\Sigma$ to be $\Sigma^*$. We denote by $\varepsilon$ the empty string, or the string of length 0. A distance on strings is a function $d: \Sigma^* \times \Sigma^* \to \mathbb R^+$ that satisfies the following:

  1. $d(x,y) = 0$ if and only if $x = y$,
  2. $d$ is symmetric; i.e. for all $x,y \in \Sigma^*, d(x,y) = d(y,x)$,
  3. $d$ satisfies the triangle inequality; i.e. for all $x,y,z \in \Sigma^*, d(x,z) \leq d(x,y) + d(y,z)$.

Such a function in other contexts may also be called a metric, and together with the set they operate on define a metric space.

The most common string distance is the edit distance or Levenshtein distance, due to Levenshtein in 1966, which is based on the idea of operations that transform one word into another.

An edit operation is a pair $(a,b)$ where $a, b \in \Sigma \cup \{\varepsilon\}$, where $a,b$ are not both $\varepsilon$. An edit operation $(a,b)$ with $a \neq b$ is called an error. There are three basic types of errors:

  1. insertions $(\varepsilon,b)$,
  2. deletions $(a,\varepsilon)$, and
  3. substitutions $(a,b)$.

An edit string is a string of edit operations. The cost of an edit string is the number of errors in the edit string. The edit distance $d(u,v)$ of two words $u$ and $v$ is the cost of the minimum cost edit string for $u$ and $v$.

From this definition, the idea of an edit string is that it describes a series of operations that transforms one string into another. Of course, there could be multiple ways to do this, and the edit distance is the cost of the cheapest such edit string.

Consider the word neighbourhood and the misspelling neighborhood. One possible edit string is \[(n,n) (e,e) (i,i) (g,g) (h,h) (b,b) (o,o) (u,r) (r,h) (h,o) (o,o) (o,d) (d,\varepsilon).\] This is hard to read, so we can write this vertically instead of horizontally and we get \[\frac nn \frac ee \frac ii \frac gg \frac hh \frac bb \frac oo \frac ur \frac rh \frac ho \frac oo \frac od \frac{d}{\varepsilon}.\] This edit string has 5 errors in it, but we can definitely find an edit string that gets us the same words with fewer errors, \[\frac nn \frac ee \frac ii \frac gg \frac hh \frac bb \frac oo \frac u{\varepsilon} \frac rr \frac hh \frac oo \frac oo \frac dd.\] So the edit distance of these words is 1, which matches our intuition that the two words differ by a single u.

The notion of an edit string makes for a very flexible way to define different kinds of distances on strings: for example, we can add gap penalties, transpositions, and so on.

The edit distance problem is: Given two strings $u = a_1 \cdots a_m$ and $v = b_1 \cdots b_n$, find the edit distance $d(u,v)$.

Edit strings are more commonly called alignments in computational biology, which is why this problem is often also called the sequence alignment problem. The dynamic programming algorithm that we're going to see was rediscovered a few times within the span of a few years, and who you attribute it to depends on your field.

Since I'm a formal language theorist, I knew this from Wagner and Fischer in 1974. If you're a computational biologist, you would know this as Needleman-Wunsch, from 1970 (this is the citation that Kleinberg and Tardos use). Even earlier was Vintsyuk's algorithm in 1968, which was introduced in the context of speech processing.

It is clear that we want to define our subproblems around the cost of the alignment. But how do we break down our subproblems? Luckily, we've set up our problem so that it is clear that all we need to consider is the final edit operation in our edit string. This maps onto one of four possibilities: it can be a non-error/identity operation (i.e. $(a,a)$) or one of the three errors (an insertion, deletion, or substitution).

Let $OPT(i,j)$ be the minimum cost of an alignment $E_{i,j}$ for strings $a_1 \cdots a_i$ and $b_1 \cdots b_j$. Ultimately, we want to compute $OPT(m,n)$. So consider $OPT(i,j)$ and consider the following possibilities for the final edit operation of $E_{i,j}$:

  1. $(\varepsilon,b_j)$ is an insertion. In this case, we incur the cost $d(\varepsilon,b_j)$ and the cost of an alignment on $a_1 \cdots a_i$ and $b_1 \cdots b_{j-1}$, since no symbols of $a_1 \cdots a_i$ were involved. This gives us $OPT(i,j) = d(\varepsilon,b_j) + OPT(i,j-1)$.
  2. $(a_i,\varepsilon)$ is a deletion. In this case, we incur the cost $d(a_i,\varepsilon)$ and the cost of an alignment on $a_1 \cdots a_{i-1}$ and $b_1 \cdots b_j$, since no symbols of $b_1 \cdots b_j$ were involved. This gives us $OPT(i,j) = d(a_i,\varepsilon) + OPT(i-1,j)$.
  3. $(a_i,b_j)$ is a substitution or an identity, depending on whether or not $a_i = b_j$. In this case, we incur the cost $d(a_i,b_j)$ together with the cost of the alignment on $a_1 \cdots a_{i-1}$ and $b_1 \cdots b_{j-1}$. This gives us $OPT(i,j) = d(a_i,b_j) + OPT(i-1,j-1)$.

This gives us the following recurrence.

$OPT(i,j)$ satisfies the following recurrence. \[OPT(i,j) = \begin{cases} 0 & \text{if $i = j = 0$,} \\ \min \begin{cases} d(a_i,\varepsilon) + OPT(i-1,j) \\ d(\varepsilon,b_j) + OPT(i,j-1) \\ d(a_i,b_j) + OPT(i-1,j-1) \end{cases} & \text{otherwise.} \end{cases}\] Here, any term that refers to a negative index is omitted from the minimum; this only happens in the boundary cases $i = 0$ or $j = 0$.

As mentioned earlier, if we wanted to compute the Levenshtein distance, we set $d(s,t) = 1$ for all $s \neq t$, so we could have just put 1s in the appropriate places in our recurrence. However, this formulation gives us the flexibility to do some fancier things like define different costs depending on whether the operation is an insertion, deletion, or substitution, or even define different costs for different pairs of symbols.

This recurrence gives us the following algorithm.

    \begin{algorithmic}
    \PROCEDURE{distance}{$u = a_1 \cdots a_m, v = b_1 \cdots b_n, d$}
        \STATE $C[0,0] \gets 0$
        \FOR{$i$ from $1$ to $m$}
            \STATE $C[i,0] \gets d(a_i,\varepsilon) + C[i-1,0]$
        \ENDFOR
        \FOR{$j$ from $1$ to $n$}
            \STATE $C[0,j] \gets d(\varepsilon,b_j) + C[0,j-1]$
        \ENDFOR
        \FOR{$i$ from $1$ to $m$}
            \FOR{$j$ from $1$ to $n$}
                \STATE $C[i,j] \gets \min\{d(a_i,b_j) + C[i-1,j-1], d(a_i,\varepsilon) + C[i-1,j], d(\varepsilon, b_j) + C[i,j-1]\}$
            \ENDFOR
        \ENDFOR
        \RETURN{$C[m,n]$}
    \ENDPROCEDURE
    \end{algorithmic}

distance computes $OPT(m,n)$.

We will show that $C[i,j] = OPT(i,j)$ for $i \leq m$ and $j \leq n$ by induction on $(i,j)$. For $(i,j) = (0,0)$, we have $C[0,0] = 0 = OPT(0,0)$. If exactly one of $i$ and $j$ is 0, then only one term of the recurrence is available, and the initialization loops compute exactly that term: by the inductive hypothesis, $C[i,0] = d(a_i,\varepsilon) + OPT(i-1,0) = OPT(i,0)$ and $C[0,j] = d(\varepsilon,b_j) + OPT(0,j-1) = OPT(0,j)$. Now consider $(i,j)$ with $i, j \gt 0$, and assume that $C[i',j'] = OPT(i',j')$ for all $(i',j')$ with $C[i',j']$ computed before $C[i,j]$. We have \[C[i,j] = \min\{d(a_i,b_j) + C[i-1,j-1], d(a_i,\varepsilon) + C[i-1,j], d(\varepsilon, b_j) + C[i,j-1]\}.\] By our inductive hypothesis, we have \[C[i,j] = \min\{d(a_i,b_j) + OPT(i-1,j-1), d(a_i,\varepsilon) + OPT(i-1,j), d(\varepsilon, b_j) + OPT(i,j-1)\}.\] Then by definition, we have $C[i,j] = OPT(i,j)$.

distance has a running time of $O(mn)$.

It is clear that the computation of each entry $C[i,j]$ is done in $O(1)$ time. The initialization loops take $O(m+n)$ time, and the two nested loops together give a running time of $O(mn)$.
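Here is a Python sketch of the same algorithm specialized to unit costs, i.e. the Levenshtein distance; swapping the hard-coded 1s for calls to a cost function $d$ recovers the general version (the function name is mine):

    def edit_distance(u, v):
        """Levenshtein distance between strings u and v."""
        m, n = len(u), len(v)
        C = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            C[i][0] = i                            # delete the first i symbols of u
        for j in range(1, n + 1):
            C[0][j] = j                            # insert the first j symbols of v
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if u[i - 1] == v[j - 1] else 1
                C[i][j] = min(C[i - 1][j - 1] + sub,   # substitution or identity
                              C[i - 1][j] + 1,         # deletion
                              C[i][j - 1] + 1)         # insertion
        return C[m][n]

    print(edit_distance("neighbourhood", "neighborhood"))  # prints: 1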

As with a lot of other fundamental problems in computer science, the edit distance problem is one that people have been working on for a long time. In particular, the question of whether we can do better than $O(n^2)$ time has been open, and there was some recent progress that puts it in the category of probably solved.

In 2015, Backurs and Indyk showed that if we could compute the edit distance in under $O(n^2)$ time (more specifically, in $O(n^{2 - \varepsilon})$ for some constant $\varepsilon \gt 0$), then that would disprove the Strong Exponential Time Hypothesis, which most of us believe is true.

More practically speaking, there is also the question of space usage. One of the things that might bug you about this algorithm is that it takes a lot of space. Imagine trying to compute how similar two words are in English and having to construct a table like we did. Intuitively, that seems like a lot of space to use up, even if it is still polynomial.

This becomes a practical issue as we move from something maybe kind of silly like words to something far more substantial like DNA sequences. In that case, quadratic space goes from being an annoyance to an impracticality. An algorithm due to Hirschberg in 1975 remedies this and is able to compute the edit distance of two strings using $O(n)$ space. It uses a combination of divide and conquer, shortest paths, and dynamic programming to achieve this and is discussed in KT 6.7.