MPCS 55001 — Lecture 10

Knapsack

The knapsack problem was first formulated by Dantzig in 1957, who gave a basic dynamic programming algorithm for it. The idea is that you have a knapsack that can hold a certain amount of weight. You would like to fill it with items such that the total value of the items you choose is maximized. Of course, we don't really care about filling literal knapsacks full of stuff; this just happens to be a useful metaphor.

Dantzig is well known for the Simplex algorithm for solving linear programming problems. He is also the source of the urban legend about the student who was late to class, copied down some equations from the board, solved them as homework, and inadvertently settled two open problems: he was that (grad) student.

The 0-1 knapsack problem is: given a list of $n$ items, where item $i$ has value $v_i \gt 0$ and weight $w_i \gt 0$, and a knapsack weight limit $W$, choose a subset of items of maximum total value whose total weight is at most $W$.

For this problem, we will assume that values and weights are positive integers.

Consider the following items.

| $i$   | 1 | 2 | 3 | 4  | 5  |
|-------|---|---|---|----|----|
| $w_i$ | 1 | 3 | 4 | 7  | 11 |
| $v_i$ | 1 | 4 | 9 | 12 | 19 |

Suppose we can carry weight $12$. Then $\{1,5\}$ has weight 12 and value 20, while $\{1,3,4\}$ has weight 12 and value 22.
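
Since there are only five items, we can sanity-check this by brute force. Here is a small Python sketch (not part of the lecture) that enumerates all $2^n$ subsets and keeps the best one that fits; it also makes it clear why brute force is hopeless for large $n$.

    from itertools import combinations

    # Brute force over all 2^n subsets of the example items (0-indexed here).
    weights = [1, 3, 4, 7, 11]
    values = [1, 4, 9, 12, 19]
    W = 12

    best = max(
        (sum(values[i] for i in S), S)
        for r in range(len(weights) + 1)
        for S in combinations(range(len(weights)), r)
        if sum(weights[i] for i in S) <= W
    )
    print(best)  # (22, (0, 2, 3)), i.e. items 1, 3, 4 in the lecture's numbering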

The question, as with our greedy algorithms, is how we should choose items. We can look to greedy algorithms for a bit of inspiration: do we choose based on weight? On value? On both weight and value?

We can think through choices similar to those in our earlier dynamic programming problems. We want to maximize the value of a solution, but the question is which subproblems to define. Suppose we only track the optimal value of a solution over the items $1, \dots, i$. This seems like it should work, but there is a crucial piece missing: how do we ensure that we don't exceed our weight limit? Similarly, we can try to optimize over the available weight alone, but then how do we account for which items have already been considered?

It turns out the answer is to do both.

Let $OPT(i,w)$ denote the value of an optimal solution $\mathcal O_{i,w}$ for the knapsack problem on items $1, \dots, i$ subject to weight limit $w$.

In essence, we parameterize our subproblems in two ways at once: by which prefix of items $1, \dots, i$ we may choose from and by the weight limit $w$ we must respect. Restricting to items $1, \dots, i$ is familiar, since it's similar to the scheduling problem we saw earlier. But how do we define subproblems on weight? Luckily, we made the helpful assumption that all weights are integers, so the weight limits range over $0 \leq w \leq W$. This gives us the equation \[OPT(i,w) = \max_{S \subseteq \{1,\dots,i\}} \sum_{j \in S} v_j \quad \text{subject to } \sum_{j \in S} w_j \leq w.\]

Under this formulation, we want to compute $OPT(n,W)$. Now we need to figure out how to express $OPT(i,w)$ in terms of smaller subproblems. Let us consider our choices for $OPT(i,w)$ based on whether $i \in \mathcal O_{i,w}$.

  1. Suppose that $w_i \gt w$. Then $i \not \in \mathcal O_{i,w}$, since choosing $i$ would exceed our allowed weight. Then $\mathcal O_{i,w}$ must contain only items chosen from $1, \dots, i-1$. Therefore, $OPT(i,w) \leq OPT(i-1,w)$. On the other hand, we have $OPT(i,w) \geq OPT(i-1,w)$, since we can choose $\mathcal O_{i,w} = \mathcal O_{i-1,w}$. Thus, we have $OPT(i,w) = OPT(i-1,w)$.
  2. Suppose $w_i \leq w$.
    1. If $i \not \in \mathcal O_{i,w}$, then similar to the above, $\mathcal O_{i,w}$ must contain only items chosen from $1, \dots, i-1$, and we have $OPT(i,w) = OPT(i-1,w)$.
    2. If $i \in \mathcal O_{i,w}$, then this means that $v_i$ is added to our solution and we must add the weight $w_i$, which decreases the allowed weight in any subproblem. In other words, we have $OPT(i,w) \leq v_i + OPT(i-1,w-w_i)$. And by taking $\mathcal O_{i,w} = \mathcal O_{i-1,w-w_i} \cup \{i\}$, we have $OPT(i,w) \geq v_i + OPT(i-1,w-w_i)$, so we have $OPT(i,w) = v_i + OPT(i-1,w-w_i)$.
    Since we don't know in advance which of these two cases holds, we take the better of the two: $OPT(i,w) = \max\{OPT(i-1,w), v_i + OPT(i-1,w-w_i)\}$.

This gives us the following recurrence.

$OPT(i,w)$ satisfies the following recurrence. \[OPT(i,w) = \begin{cases} 0 & \text{if $i = 0$ or $w = 0$}, \\ OPT(i-1,w) & \text{if $w_i \gt w$}, \\ \max\{OPT(i-1,w), v_i + OPT(i-1,w-w_i)\} & \text{otherwise}. \end{cases} \]

This recurrence leads to the following algorithm.

    \begin{algorithmic}
    \PROCEDURE{knapsack}{$W,I = \{(w_i,v_i) \mid 1 \leq i \leq n\}$}
        \FOR{$w$ from $0$ to $W$}
            \STATE $V[0,w] \gets 0$
        \ENDFOR
        \FOR{$i$ from $1$ to $n$}
            \FOR{$w$ from $0$ to $W$}
                \IF{$w_i \gt w$}
                    \STATE $V[i,w] \gets V[i-1,w]$
                \ELSE
                    \STATE $V[i,w] \gets \max\{V[i-1,w],v_i + V[i-1,w-w_i]\}$
                \ENDIF
            \ENDFOR
        \ENDFOR
        \RETURN{$V[n,W]$}
    \ENDPROCEDURE
    \end{algorithmic}
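
For concreteness, here is a direct Python transcription of this pseudocode (a sketch, assuming the weights and values are passed as 0-indexed parallel lists).

    # A sketch in Python of the knapsack procedure above; weights[i-1] and
    # values[i-1] play the roles of w_i and v_i.
    def knapsack(W, weights, values):
        n = len(weights)
        # V[i][w] = optimal value using items 1..i with weight limit w
        V = [[0] * (W + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            w_i, v_i = weights[i - 1], values[i - 1]
            for w in range(W + 1):
                if w_i > w:
                    V[i][w] = V[i - 1][w]
                else:
                    V[i][w] = max(V[i - 1][w], v_i + V[i - 1][w - w_i])
        return V[n][W]

    print(knapsack(12, [1, 3, 4, 7, 11], [1, 4, 9, 12, 19]))  # 22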

Recall Example 10.2 from above. Running this algorithm on those items with weight limit $W = 12$ constructs the following table $V$.

| $i$/$w$ | 0 | 1 | 2 | 3 | 4 | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 |
|---------|---|---|---|---|---|----|----|----|----|----|----|----|----|
| 0       | 0 | 0 | 0 | 0 | 0 | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  |
| 1       | 0 | 1 | 1 | 1 | 1 | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  |
| 2       | 0 | 1 | 1 | 4 | 5 | 5  | 5  | 5  | 5  | 5  | 5  | 5  | 5  |
| 3       | 0 | 1 | 1 | 4 | 9 | 10 | 10 | 13 | 14 | 14 | 14 | 14 | 14 |
| 4       | 0 | 1 | 1 | 4 | 9 | 10 | 10 | 13 | 14 | 14 | 16 | 21 | 22 |
| 5       | 0 | 1 | 1 | 4 | 9 | 10 | 10 | 13 | 14 | 14 | 16 | 21 | 22 |

Our solution, $OPT(n,W) = V[5,12] = 22$, is the entry in the bottom-right corner of the table. The table also gives us a way to figure out which items were chosen: if $V[i,w] \gt V[i-1,w]$, then item $i$ was chosen and we continue tracing back from $V[i-1,w-w_i]$; otherwise we continue from $V[i-1,w]$.
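
Here is a sketch of that recovery procedure in Python, assuming we have kept the full table $V$ (for instance, by having the function above return $V$ instead of just $V[n][W]$).

    # Walk backwards from V[n][W]: item i was chosen exactly when
    # V[i][w] > V[i-1][w], and in that case the remaining weight drops by w_i.
    def recover_items(V, weights, W):
        items, w = [], W
        for i in range(len(V) - 1, 0, -1):
            if V[i][w] > V[i - 1][w]:
                items.append(i)
                w -= weights[i - 1]
        items.reverse()
        return items  # for the example above: [1, 3, 4]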

knapsack computes $OPT(n,W)$.

We will show that $V[i,w] = OPT(i,w)$ for $i \leq n$ and $w \leq W$ by induction on $(i,w)$. First, we consider as our base cases $i = 0$ and $w = 0$. We have $V[0,w] = 0 = OPT(0,w)$ for all $w \leq W$ by initialization, and since all weights are positive, $V[i,0] = V[i-1,0] = \cdots = V[0,0] = 0 = OPT(i,0)$ for all $i \leq n$.

Now consider $i,w \gt 0$ and assume that $V[i',w'] = OPT(i',w')$ for all $(i',w')$ computed before $(i,w)$. First, suppose $w_i \gt w$. Then by line 7, we have $V[i,w] = V[i-1,w]$. By our inductive hypothesis, $V[i-1,w] = OPT(i-1,w)$, so $V[i,w] = OPT(i-1,w)$, which equals $OPT(i,w)$ by the recurrence, as desired.

Otherwise, if $w_i \leq w$, then by line 9, we have \[V[i,w] = \max\{V[i-1,w], v_i + V[i-1,w-w_i]\},\] and by our inductive hypothesis, we have \[V[i,w] = \max\{OPT(i-1,w), v_i + OPT(i-1,w-w_i)\},\] which is exactly $OPT(i,w)$ by the recurrence.

knapsack has running time $O(nW)$.

The work is done in the nested loops. Each of the computations within the loops can be done in $O(1)$ time. The inner loop runs $O(W)$ times for each of the $n$ iterations of the outer loop, and the initialization loop takes $O(W)$ time, giving $O(nW)$ time in total.

Now, at this point, some of you may remember my note on the solution to this problem that there is no known polynomial time algorithm for knapsack, but here, we have an algorithm that runs in $O(nW)$ time! What's up with that?

The key here is to remember that the running time of an algorithm is measured as a function of the size of the input, so it's worth stepping back to ask what the size of this problem instance is. One part is obvious: the number of items we have. The other part is not so obvious: what is the size of $W$?

A common mistake, as in this case, is to take the value of $W$ as its size. But $W$ is the input, a number. The size of $W$ is the size of its representation, which is the number of bits of $W$ (or digits, if you prefer some other number system). Either way, for any representation that isn't unary, the size of $W$ is $O(\log W)$.

The issue then becomes clear: $W$ is not polynomial in $\log W$, so our $O(nW)$ bound is not polynomial in the size of the input. Algorithms whose running time is polynomial in the numeric value of the input but not in the size of its representation are said to be pseudopolynomial.
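
A quick back-of-the-envelope sketch makes the gap vivid (the specific numbers below are illustrative, not from the lecture): writing $W$ down takes only about $\log_2 W$ bits, while the table has on the order of $nW$ entries.

    # For a fixed number of items, the input size grows slowly with W while
    # the work done by the O(nW) algorithm grows with the value of W.
    n = 100
    for W in (10, 10**3, 10**6, 10**9):
        bits = W.bit_length()        # size of W in the input: O(log W)
        cells = (n + 1) * (W + 1)    # entries in the DP table: O(nW)
        print(f"W = {W:>10}: {bits:>2} bits to write down, {cells:>13} table entries")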

Knapsack problems belong to a family of packing problems, in which we are given a set of items with various properties and constraints and are asked to choose a selection of items subject to those constraints while trying to maximize or minimize some property.

It becomes pretty clear that many of these problems are quite important, in that they map readily onto a variety of computational and real-life applications. However, they also happen to share the same unfortunate downside: we don't have efficient algorithms for many of them. This is where we begin to see intractability rear its ugly head.

Edit distance

A common problem in processing any data in the form of sequences (string data, streams, etc.) is determining the similarity of two sequences. A very common example of this nowadays is the AutoCorrect feature: on some level, your phone helpfully fixes your words based on some notion of similarity between two words. Intuitively, the words potato and tomato seem more similar to each other than the words eggplant and aubergine. But is there a way to quantify this?

An alphabet $\Sigma$ is a finite set of symbols, and we denote the set of words or strings over $\Sigma$ by $\Sigma^*$. We denote by $\varepsilon$ the empty string, the string of length 0. A distance on strings is a function $d: \Sigma^* \times \Sigma^* \to \mathbb R^+$ that satisfies the following:

  1. $d(x,y) = 0$ if and only if $x = y$,
  2. $d$ is symmetric; i.e. for all $x,y \in \Sigma^*, d(x,y) = d(y,x)$,
  3. $d$ satisfies the triangle inequality; i.e. for all $x,y,z \in \Sigma^*, d(x,z) \leq d(x,y) + d(y,z)$.

Such a function may, in other contexts, also be called a metric, and together with the set it operates on it defines a metric space.

The most common string distance is the edit distance or Levenshtein distance, due to Levenshtein in 1966, which is based on the idea of operations that transform one word into another.

An edit operation is a pair $(a,b)$ with $a, b \in \Sigma \cup \{\varepsilon\}$, where $a$ and $b$ are not both $\varepsilon$. An edit operation $(a,b)$ with $a \neq b$ is called an error. There are three basic types of errors:

  1. insertions $(\varepsilon,b)$,
  2. deletions $(a,\varepsilon)$, and
  3. substitutions $(a,b)$.

An edit string for two words $u$ and $v$ is a string of edit operations whose first components concatenate to $u$ and whose second components concatenate to $v$. The cost of an edit string is the number of errors in the edit string. The edit distance $d(u,v)$ of two words $u$ and $v$ is the minimum cost of an edit string for $u$ and $v$.

In other words, an edit string describes a sequence of operations that transforms one string into another. Of course, there may be many edit strings for the same pair of words; the edit distance is the cost of the cheapest one.

Consider the word neighbourhood and the misspelling neighborhood. One possible edit string is \[(n,n) (e,e) (i,i) (g,g) (h,h) (b,b) (o,o) (u,r) (r,h) (h,o) (o,o) (o,d) (d,\varepsilon).\] This is hard to read, so we can write it vertically instead of horizontally and get \[\frac nn \frac ee \frac ii \frac gg \frac hh \frac bb \frac oo \frac ur \frac rh \frac ho \frac oo \frac od \frac{d}{\varepsilon}.\] This edit string has 5 errors in it, but we can definitely find an edit string relating the same words with fewer errors: \[\frac nn \frac ee \frac ii \frac gg \frac hh \frac bb \frac oo \frac u{\varepsilon} \frac rr \frac hh \frac oo \frac oo \frac dd.\] So the edit distance of these words is 1, which matches our intuition that the two spellings differ only by a single u.
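
To make the definitions concrete, here is a small Python sketch (the representation is my own, not from the lecture): an edit string is a list of pairs, with the empty string standing in for $\varepsilon$, and its cost is just the number of pairs whose components differ.

    from itertools import zip_longest

    # Cost of an edit string = number of errors, i.e. pairs (a, b) with a != b.
    def edit_string_cost(ops):
        return sum(1 for a, b in ops if a != b)

    # The first (naive) edit string from the example: pair the words position
    # by position and pad the shorter word with epsilon.
    naive = list(zip_longest("neighbourhood", "neighborhood", fillvalue=""))
    print(edit_string_cost(naive))   # 5

    # The better edit string deletes the extra 'u' and matches everything else.
    better = list(zip("neighbo", "neighbo")) + [("u", "")] + list(zip("rhood", "rhood"))
    print(edit_string_cost(better))  # 1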

The notion of an edit string makes for a very flexible way to define different kinds of distances on strings: for example, we can add gap penalties, allow transpositions, and so on.

The edit distance problem is: Given two strings $u = a_1 \cdots a_m$ and $v = b_1 \cdots b_n$, find the edit distance $d(u,v)$.

Edit strings are more commonly called alignments in computational biology, which is why this problem is often also called the sequence alignment problem. The dynamic programming algorithm that we're going to see was rediscovered a few times within the span of a few years, and who you attribute it to depends on your field.

Since I'm a formal language theorist, I knew this from Wagner and Fischer in 1974. If you're a computational biologist, you would know this as Needleman-Wunsch, from 1970 (this is the citation that Kleinberg and Tardos use). Even earlier was Vintsyuk's algorithm in 1968, which was introduced in the context of speech processing.

It is clear that we want to define our subproblems around the cost of an alignment. But how do we break the problem down into subproblems? Luckily, we've set up the problem so that all we need to consider is the final edit operation in the edit string. It is one of four possibilities: a non-error/identity operation (i.e., $(a,a)$) or one of the three errors (an insertion, deletion, or substitution).

Let $OPT(i,j)$ be the minimum cost of an alignment $E_{i,j}$ for strings $a_1 \cdots a_i$ and $b_1 \cdots b_j$. Ultimately, we want to compute $OPT(m,n)$. So consider $OPT(i,j)$ and consider the following possibilities for the final edit operation of $E_{i,j}$:

  1. $(\varepsilon,b_j)$ is an insertion. In this case, we incur the cost $d(\varepsilon,b_j)$ and the cost of an alignment on $a_1 \cdots a_i$ and $b_1 \cdots b_{j-1}$, since no symbols of $a_1 \cdots a_i$ were involved. This gives us $OPT(i,j) \geq d(\varepsilon,b_j) + OPT(i,j-1)$ and we have $OPT(i,j) \leq d(\varepsilon,b_j) + OPT(i,j-1)$ by choosing an optimal alignment for $a_1 \cdots a_i$ and $b_1 \cdots b_{j-1}$. Therefore, $OPT(i,j) = d(\varepsilon,b_j) + OPT(i,j-1)$.
  2. $(a_i,\varepsilon)$ is a deletion. In this case, we incur the cost $d(a_i,\varepsilon)$ and the cost of an alignment on $a_1 \cdots a_{i-1}$ and $b_1 \cdots b_j$, since no symbols of $b_1 \cdots b_j$ were involved. This gives us $OPT(i,j) \geq d(a_i,\varepsilon) + OPT(i-1,j)$ and we have $OPT(i,j) \leq d(a_i,\varepsilon) + OPT(i-1,j)$ by choosing an optimal alignment for $a_1 \cdots a_{i-1}$ and $b_1 \cdots b_j$. Therefore, $OPT(i,j) = d(a_i,\varepsilon) + OPT(i-1,j)$.
  3. $(a_i,b_j)$ is a substitution or an identity, depending on whether or not $a_i = b_j$. In this case, we incur the cost $d(a_i,b_j)$ together with the cost of the alignment on $a_1 \cdots a_{i-1}$ and $b_1 \cdots b_{j-1}$. This gives us $OPT(i,j) \geq d(a_i,b_j) + OPT(i-1,j-1)$ and we have $OPT(i,j) \leq d(a_i,b_j) + OPT(i-1,j-1)$ by choosing an optimal alignment for $a_1 \cdots a_{i-1}$ and $b_1 \cdots b_{j-1}$. Therefore, $OPT(i,j) = d(a_i,b_j) + OPT(i-1,j-1)$.

This gives us the following recurrence.

$OPT(i,j)$ satisfies the following recurrence. \[OPT(i,j) = \begin{cases} 0 & \text{if $i = j = 0$,} \\ d(a_i,\varepsilon) + OPT(i-1,j) & \text{if $i \gt 0$ and $j = 0$,} \\ d(\varepsilon,b_j) + OPT(i,j-1) & \text{if $i = 0$ and $j \gt 0$,} \\ \min \begin{cases} d(a_i,\varepsilon) + OPT(i-1,j) \\ d(\varepsilon,b_j) + OPT(i,j-1) \\ d(a_i,b_j) + OPT(i-1,j-1) \end{cases} & \text{otherwise.} \end{cases}\]

As mentioned earlier, if we want to compute the Levenshtein distance, we set $d(s,t) = 1$ for all $s \neq t$ and $d(s,s) = 0$, so we could have just put 1s in the appropriate places in our recurrence. However, this formulation gives us the flexibility to do fancier things, like assigning different costs depending on whether the operation is an insertion, deletion, or substitution, or even different costs for different pairs of symbols.

This recurrence gives us the following algorithm.

    \begin{algorithmic}
    \PROCEDURE{distance}{$u = a_1 \cdots a_m$, $v = b_1 \cdots b_n$, $d$}
        \STATE $C[0,0] \gets 0$
        \FOR{$i$ from $1$ to $m$}
            \STATE $C[i,0] \gets d(a_i,\varepsilon) + C[i-1,0]$
        \ENDFOR
        \FOR{$j$ from $1$ to $n$}
            \STATE $C[0,j] \gets d(\varepsilon,b_j) + C[0,j-1]$
        \ENDFOR
        \FOR{$i$ from $1$ to $m$}
            \FOR{$j$ from $1$ to $n$}
                \STATE $C[i,j] \gets \min\{d(a_i,b_j) + C[i-1,j-1], d(a_i,\varepsilon) + C[i-1,j], d(\varepsilon, b_j) + C[i,j-1]\}$
            \ENDFOR
        \ENDFOR
        \RETURN{$C[m,n]$}
    \ENDPROCEDURE
    \end{algorithmic}
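
Here is a Python sketch of the procedure above, with the cost function $d$ passed in as a parameter; the default cost (every error costs 1) gives the Levenshtein distance.

    # A sketch of the distance procedure above.  The empty string "" plays the
    # role of epsilon, so d(a, "") is the cost of deleting a and d("", b) the
    # cost of inserting b.
    def distance(u, v, d=lambda a, b: 0 if a == b else 1):
        m, n = len(u), len(v)
        C = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            C[i][0] = d(u[i - 1], "") + C[i - 1][0]
        for j in range(1, n + 1):
            C[0][j] = d("", v[j - 1]) + C[0][j - 1]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                C[i][j] = min(
                    d(u[i - 1], v[j - 1]) + C[i - 1][j - 1],  # substitution/identity
                    d(u[i - 1], "") + C[i - 1][j],            # deletion
                    d("", v[j - 1]) + C[i][j - 1],            # insertion
                )
        return C[m][n]

    print(distance("color", "colours"))               # 2
    print(distance("neighbourhood", "neighborhood"))  # 1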

distance computes $OPT(m,n)$.

We will show that $C[i,j] = OPT(i,j)$ for $i \leq m$ and $j \leq n$ by induction on $(i,j)$. For $(i,j) = (0,0)$, we have $C[0,0] = 0 = OPT(0,0)$. For the boundary entries, the initialization loops give $C[i,0] = d(a_i,\varepsilon) + C[i-1,0]$ and $C[0,j] = d(\varepsilon,b_j) + C[0,j-1]$, which match the recurrence, so by induction $C[i,0] = OPT(i,0)$ and $C[0,j] = OPT(0,j)$. Now consider $(i,j)$ with $i,j \gt 0$, and assume that $C[i',j'] = OPT(i',j')$ for all $(i',j')$ with $C[i',j']$ computed before $C[i,j]$. We have \[C[i,j] = \min\{d(a_i,b_j) + C[i-1,j-1], d(a_i,\varepsilon) + C[i-1,j], d(\varepsilon, b_j) + C[i,j-1]\}.\] By our inductive hypothesis, we have \[C[i,j] = \min\{d(a_i,b_j) + OPT(i-1,j-1), d(a_i,\varepsilon) + OPT(i-1,j), d(\varepsilon, b_j) + OPT(i,j-1)\},\] which is exactly $OPT(i,j)$ by the recurrence.

distance has a running time of $O(mn)$.

It is clear that the computation of each entry $C[i,j]$ is done in $O(1)$ time. The initialization loops take $O(m+n)$ time and the nested loops take $O(mn)$ time, giving $O(mn)$ in total.

Finally, we are faced with the question of how to recover our alignment from this computation. Let's consider an example.

Consider a run of the algorithm on the words color and colours.

|               | $\varepsilon$ | c | o | l | o | u | r | s |
|---------------|---------------|---|---|---|---|---|---|---|
| $\varepsilon$ | 0             | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| c             | 1             | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| o             | 2             | 1 | 0 | 1 | 2 | 3 | 4 | 5 |
| l             | 3             | 2 | 1 | 0 | 1 | 2 | 3 | 4 |
| o             | 4             | 3 | 2 | 1 | 0 | 1 | 2 | 3 |
| r             | 5             | 4 | 3 | 2 | 1 | 1 | 1 | 2 |

Just like in our previous problems, we can recover the actual alignment achieving the optimal cost by retracing the decisions made by the recurrence. Each entry in the table is determined by the three entries immediately above it, to its left, and to its upper left, so we can walk back from $C[m,n]$, at each step checking which of the three choices produced the entry.
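
Here is a sketch of that traceback in Python, assuming the distance function is modified to return the full table $C$: at each cell we check which of the three choices attains $C[i][j]$ and emit the corresponding edit operation.

    # Recover an optimal alignment by retracing the recurrence from C[m][n].
    def traceback(C, u, v, d=lambda a, b: 0 if a == b else 1):
        ops, i, j = [], len(u), len(v)
        while i > 0 or j > 0:
            if i > 0 and j > 0 and C[i][j] == d(u[i-1], v[j-1]) + C[i-1][j-1]:
                ops.append((u[i-1], v[j-1]))  # substitution or identity
                i, j = i - 1, j - 1
            elif i > 0 and C[i][j] == d(u[i-1], "") + C[i-1][j]:
                ops.append((u[i-1], ""))      # deletion
                i -= 1
            else:
                ops.append(("", v[j-1]))      # insertion
                j -= 1
        ops.reverse()
        return ops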

As with a lot of other fundamental problems in computer science, the edit distance problem is one that people have been working on for a long time. In particular, the question of whether we can do better than $O(n^2)$ time was open for a long while, and recent progress puts it in the category of probably settled.

In 2015, Backurs and Indyk showed that if we could compute the edit distance in under $O(n^2)$ time (more specifically, in $O(n^{2 - \varepsilon})$ for some constant $\varepsilon \gt 0$), then that would disprove the Strong Exponential Time Hypothesis, which most of us believe is true.

More practically speaking, there is also the question of space usage. One of the things that might bug you about this algorithm is that it takes a lot of space. Imagine trying to compute how similar two English words are and having to construct a table like we did. Intuitively, that seems like a lot of space to use, even if it is still polynomial.

This becomes a practical issue as we move from something maybe kind of silly, like words, to something far more substantial, like DNA sequences. In that case, quadratic space goes from being an annoyance to an impracticality. An algorithm due to Hirschberg in 1975 remedies this and is able to compute not just the edit distance but an optimal alignment of two strings using only linear space. It uses a combination of divide and conquer, shortest paths, and dynamic programming to achieve this and is discussed in KT 6.7.
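
Hirschberg's trick is beyond our scope here, but the simpler observation it builds on is easy to see in code: if we only want the distance (and not the alignment), each row of the table depends only on the previous row, so two rows of length $n+1$ suffice. A sketch:

    # Compute only the distance in O(n) space by keeping the previous and
    # current rows of the table.  (Recovering the alignment in linear space
    # is what requires Hirschberg's divide-and-conquer refinement.)
    def distance_linear_space(u, v, d=lambda a, b: 0 if a == b else 1):
        m, n = len(u), len(v)
        prev = [0] * (n + 1)
        for j in range(1, n + 1):
            prev[j] = d("", v[j - 1]) + prev[j - 1]
        for i in range(1, m + 1):
            curr = [d(u[i - 1], "") + prev[0]] + [0] * n
            for j in range(1, n + 1):
                curr[j] = min(
                    d(u[i - 1], v[j - 1]) + prev[j - 1],
                    d(u[i - 1], "") + prev[j],
                    d("", v[j - 1]) + curr[j - 1],
                )
            prev = curr
        return prev[n]

    print(distance_linear_space("color", "colours"))  # 2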