MPCS 55001 — Lecture 9

Hamiltonian Cycle

Now, we'll prove something about a simple but important graph problem. The following notion is due to the Irish mathematician William Rowan Hamilton in the mid-1800s.

Let $G$ be a graph. A cycle is a Hamiltonian cycle if every vertex of $G$ is in the cycle. A path is a Hamiltonian path if every vertex of $G$ is in the path.

The following graph contains a Hamiltonian cycle that is highlighted.

Hamiltonian cycles and paths have some very interesting applications. One is a connection to Gray codes: an $n$-bit Gray code is an ordering of all $2^n$ binary strings of length $n$ in which consecutive strings differ in exactly one bit. Such an ordering is exactly a Hamiltonian path on the $n$-dimensional hypercube graph, and it is a Hamiltonian cycle when the last string also differs from the first in one bit.
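To make the connection concrete, here is a minimal sketch in Python of the classic "reflected" construction of a Gray code (the function name and string representation are our own, just for illustration). Consecutive codewords, including the wrap-around pair, differ in exactly one bit, so the ordering traces a Hamiltonian cycle on the hypercube.

```python
def gray_code(n):
    """Generate the n-bit reflected binary Gray code as a list of strings."""
    if n == 0:
        return [""]
    prev = gray_code(n - 1)
    # Prefix the (n-1)-bit code with 0, then its reversal with 1; consecutive
    # codewords still differ in exactly one bit, including across the seam.
    return ["0" + s for s in prev] + ["1" + s for s in reversed(prev)]

codes = gray_code(3)
print(codes)  # ['000', '001', '011', '010', '110', '111', '101', '100']

# Each consecutive pair (cyclically) differs in one bit: a Hamiltonian
# cycle on the 3-dimensional hypercube graph.
for a, b in zip(codes, codes[1:] + codes[:1]):
    assert sum(x != y for x, y in zip(a, b)) == 1
```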

The obvious question that we can ask about Hamiltonian cycles is whether our graph contains one.

The Hamiltonian Cycle problem is: Given a graph $G$, does $G$ contain a Hamiltonian cycle?
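Since we don't know a polynomial-time algorithm for this problem, the obvious approach is brute force. Here is a sketch in Python (the representation of the graph as a set of edges is our own): try every ordering of the vertices, which takes $O(n! \cdot n)$ time and shows exactly why we'd like something better.

```python
from itertools import permutations

def has_hamiltonian_cycle(vertices, edges):
    """Brute-force check: try every ordering of the vertices.

    `edges` is a set of frozenset pairs {u, v}. Runs in O(n! * n) time,
    which is exactly why we'd like to know whether something faster exists.
    """
    vertices = list(vertices)
    if len(vertices) < 3:
        return False
    first, rest = vertices[0], vertices[1:]
    # Fix the first vertex so we don't retry rotations of the same cycle.
    for perm in permutations(rest):
        order = [first] + list(perm)
        if all(frozenset((order[i], order[(i + 1) % len(order)])) in edges
               for i in range(len(order))):
            return True
    return False

# A 4-cycle has a Hamiltonian cycle.
square = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
print(has_hamiltonian_cycle(range(4), square))  # True
```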

There are directed and undirected flavours of this problem. We'll consider the undirected version; it's not hard to show that the directed version reduces to the undirected one. But rather than a simple graph-to-graph reduction, we'll do something a bit more surprising: we'll show that 3-SAT reduces to Hamiltonian Cycle.

3-SAT $\leq_P$ Hamiltonian Cycle.

Let $\varphi$ be an instance of 3-SAT with $n$ variables and $k$ clauses. We will construct an instance of Hamiltonian Cycle. What we will do is try to construct a graph so that each possible assignment of variables corresponds to a Hamiltonian cycle. We then modify the graph by adding paths through clauses in such a way that we'll have a Hamiltonian cycle for those assignments that are satisfying.

The idea is that we need a way to tell whether we're setting a variable $x_i$ to True or False. So we create a row of vertices for each variable: a path traverses the vertices of row $i$ from left to right if we set $x_i$ to True, and from right to left if we set it to False.

This gives us a way to define Hamiltonian paths through our graph that correspond to a particular assignment of variables. We then need to use this to indicate whether a clause has been satisfied by a particular assignment of variables it contains. We do this by hooking up vertices representing clauses to each of the paths for the corresponding variables.

If the literal $x_i$ appears in a clause $C_j$, then the edges are set up so that $C_j$ is entered from the left on row $i$. Similarly, if the literal $\neg x_i$ appears in a clause $C_k$, the edges are set up so that $C_k$ is entered from the right on row $i$.

These clause vertices are connected to the variable paths according to the literal. In order to space out these paths, we need $3k+3$ vertices in each variable path.

We need to argue that there is a satisfying assignment for $\varphi$ if and only if this graph $G$ we constructed has a Hamiltonian cycle.

Suppose $\varphi$ has a satisfying assignment. We can define our Hamiltonian cycle $\mathcal C$ as follows. If variable $x_i$ has been set to True, then the path moves through row $i$ from left to right; if $x_i$ is set to False, the path moves through row $i$ from right to left. Then for each clause $C_j$, at least one literal of $C_j$ is made true by the assignment, so there is at least one row from which the cycle can detour to visit $C_j$. This detour only needs to happen once per clause, so the cycle visits $C_j$ exactly once.

Now, suppose that $G$ has a Hamiltonian cycle. We need to recover an assignment for $\varphi$. For each row $i$, if the cycle goes from left to right, then $x_i$ is set to True; if it goes from right to left, we set $x_i$ to False. This assignment is satisfying because each clause vertex is visited by the Hamiltonian cycle. This vertex can only be visited by entering and exiting via vertices on the same row. This row corresponds to the literal that satisfies the clause. Since every clause is satisfied, $\varphi$ is satisfied.

It remains to show that this construction can be performed in polynomial time. We observe that the graph $G$ has $n(3k+3) + k + 2$ vertices ($3k+3$ per variable row, one per clause, plus two extra vertices), so the size of the graph is polynomial in the length of $\varphi$.

Classification of problems

So at this point, we have the tools to show that there are problems that are about as hard as each other, but we still don't know whether they have efficient algorithms or whether we should expect them to.

Ultimately, what we want to do is classify those problems that we know have efficient algorithms. But first, what is a problem? We have a vague notion of what this is, but we would like to formalize this notion. It might be surprising that we would want to formalize what a problem is mathematically or that we can even do this, but this is really what computer science is about: the mathematics of how to solve problems.

Recall that we mentioned that a decision problem is one that, when given an input, answers Yes or No. This sounds suspiciously like a function, so we might say that a problem is some function that maps inputs from $\mathbb N$ to $\{Y,N\}$. But as we've seen above, our domains can get much more complicated than $\mathbb N$.

Viewing such problems as functions is a good start, but we want something more general to work with. In the above examples, there's no unification in terms of the number of inputs or the domains of the functions, which makes it a bit difficult to talk about the two questions in the same way.

One of the first insights you will have internalized already is that we can represent pretty much anything we would like with the proper encoding. Generally speaking, this means that we can treat any input as a string. We can even treat "multiple" inputs, as in the case of our graph problem, as a single string, by encoding the delimiters appropriately as well. In this way, we can now define decision problems as functions from strings to Yes or No.

Let $\Sigma$ be an alphabet. A decision problem is a Boolean function $P:\Sigma^* \to \{0,1\}$ that maps input strings over $\Sigma$ to a single bit.

Another way to look at a Boolean function is as a set $S$, where a string $x$ is in $S$ if $P(x) = 1$ and $x \not \in S$ if $P(x) = 0$.

Let $\Sigma$ be an alphabet. A decision problem is a set $P \subseteq \Sigma^*$. A string $x$ is a Yes instance of $P$ if and only if $x \in P$.

Then we can treat an algorithm as a function that tells us whether an input is a Yes or No instance and we can associate it with a running time.

An algorithm $A$ solves a problem $P$ if \[A(x) = \begin{cases} \mathrm{Yes} & \text{if $x \in P$,} \\ \mathrm{No} & \text{if $x \not \in P$.} \end{cases}\] We say $A$ runs in polynomial time if for every input string $x$, $A(x)$ terminates in $\leq p(|x|)$ steps, where $p$ is a polynomial.

This allows us to define a class of problems based on the time complexity of the algorithms that solve them.

The class $\mathbf P$ is the class of decision problems for which there exists a polynomial time algorithm.

What problems are in $\mathbf P$? Almost every problem we've seen so far in this course is in $\mathbf P$, because we have polynomial-time algorithms that solve them (with a few exceptions).

However, we've also seen a few problems where we don't know whether there are polynomial time algorithms.

A big problem with this is that we don't know whether there is or isn't an efficient algorithm for these problems. As mentioned before, a definitive answer one way or the other would be helpful, but we don't have one. But maybe we can try something less ambitious first. What if, instead of trying to solve the problem, we could check whether a proposed solution is correct?

Imagine again that we have a mysterious cube that can solve a particularly difficult problem for us. A good question to ask about our mysterious cube is how we can tell if it's correct. If it can solve a hard problem, how can we possibly check it?

So the mysterious cube says: okay, I can give you proof. Along with a Yes answer to the problem, it prints out a receipt that we can use to check whether the answer is correct. And after being spooked by the cube that was apparently listening in on us, we have to figure out what to do with this new information.

An algorithm $V(x,c)$ is a verifier for a problem $P$ if for every input string $x$, $x \in P$ if and only if there exists a string $c$ such that $V(x,c) = \mathrm{Yes}$. We say that $c$ is a certificate.

Consider the Independent Set problem. An instance of Independent Set is $\langle G,k \rangle$, where $G$ is a graph and $k$ is an integer. If a graph $G$ contains an independent set of size $k$, a certificate for this is the independent set $S$ itself. Given $S$, we can check in polynomial time that $|S| \geq k$ and that $S$ is an independent set: for each vertex in $S$, check its neighbours and verify that none of them are in $S$. Then there exists a certificate $S$ with $V(\langle G,k \rangle, S) = \mathrm{Yes}$ if and only if $G$ contains an independent set of size $k$.
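As a sketch, such a verifier might look like the following in Python (the representation of the graph as a dictionary of adjacency sets is our own choice):

```python
def verify_independent_set(G, k, S):
    """Polynomial-time verifier for Independent Set.

    G maps each vertex to the set of its neighbours; the certificate S
    is the claimed independent set.
    """
    S = set(S)
    if len(S) < k:
        return False                  # not big enough
    if not S <= G.keys():
        return False                  # certificate mentions unknown vertices
    # For each vertex in S, none of its neighbours may also be in S.
    return all(G[v].isdisjoint(S) for v in S)

# A path on 3 vertices: {0, 2} is an independent set of size 2.
G = {0: {1}, 1: {0, 2}, 2: {1}}
print(verify_independent_set(G, 2, [0, 2]))  # True
print(verify_independent_set(G, 2, [0, 1]))  # False: 0 and 1 share an edge
```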

Of course, we can ask the same sorts of questions about verification algorithms as we do of any algorithm, particularly what its time complexity is. This leads us to the following idea.

The class $\mathbf{NP}$ is the class of decision problems for which there exists a polynomial time verifier $V$. That is, $V(x,c)$ is a polynomial-time algorithm and the certificate $c$ has size polynomial in the size of $x$; $|c| \leq p(|x|)$ for some polynomial $p$.

So which problems are in $\mathbf{NP}$? One easy thing to see is that any problem that was efficiently solvable to begin with is also efficiently verifiable.

$\mathbf{P} \subseteq \mathbf{NP}$.

Suppose we have a problem $P \in \mathbf P$. Then there is a polynomial time algorithm $A$ that solves it. We will define the following verifier $V$ for $P$:

  1. On input $x$ and certificate $c$, return $A(x)$

Essentially, this verifier ignores whatever certificate is provided and just calls $A$ to compute the answer itself. This is fine—we can use the certificate however we wish. The only requirement is that our answer is correct. Since we simply call $A$, this will always give us the correct answer for any certificate $c$. This verifier runs in polynomial time, since $A$ runs in polynomial time. Therefore, $P \in \mathbf{NP}$.
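In code, the construction in this proof is almost embarrassingly short. Here is a sketch in Python, with a made-up example problem:

```python
def make_verifier(A):
    """Wrap a polynomial-time decision algorithm A as a verifier.

    The certificate is ignored entirely; correctness and the running
    time both come straight from A itself.
    """
    def V(x, c):
        return A(x)   # the certificate c plays no role
    return V

# Hypothetical example: deciding whether a string has even length is
# certainly in P, so this problem is in NP via the verifier below.
V = make_verifier(lambda x: len(x) % 2 == 0)
print(V("abab", "any certificate at all"))  # True
```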

What else is in $\mathbf{NP}$? A lot of the problems we've seen in this section of the course are also in $\mathbf{NP}$.

The following problems are in $\mathbf{NP}$:

  1. Independent Set
  2. Vertex Cover
  3. Satisfiability
  4. 3-SAT
  5. Hamiltonian Cycle
  6. Subset Sum

For each of the problems above, we will describe an appropriate certificate and how to use it to verify a solution in polynomial time.

  1. A certificate for Independent Set is a list of the vertices in the independent set. One can verify that the list of vertices is of size at least $k$ and that no two vertices share an edge in $O(m)$ time.
  2. A certificate for Vertex Cover is a list of the vertices in the vertex cover. One can verify that the list of vertices is of size at most $k$ and that no edge is uncovered in $O(m)$ time.
  3. A certificate for Satisfiability is an assignment to each variable. One can verify that the assignment satisfies a formula by evaluating the formula in time linear in its length. To see this, we can form a parse tree for the well-formed formula and evaluate the truth value at each internal node, of which there are linearly many.
  4. A certificate for 3-SAT is the same as for Satisfiability.
  5. A certificate for Hamiltonian Cycle is a list of the vertices visited in the order of the cycle. One can verify that every vertex appears exactly once and that an edge exists between each successive pair of vertices in $O(n)$ time (a verifier along these lines is sketched after this list).
  6. A certificate for Subset Sum is a list of the numbers in the qualifying subset. One can check that the numbers in the list are members of the input set and add up to the target in time linear in the digits of the numbers in the list.
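Here is the sketch promised in item 5, in Python, with the graph again represented as a dictionary of adjacency sets (our choice of representation):

```python
def verify_hamiltonian_cycle(G, order):
    """Verifier for Hamiltonian Cycle: `order` lists the vertices of G
    in the order the claimed cycle visits them."""
    # Every vertex must appear exactly once.
    if sorted(order) != sorted(G.keys()):
        return False
    # Each successive pair (wrapping around) must be an edge.
    n = len(order)
    return all(order[(i + 1) % n] in G[order[i]] for i in range(n))

# The 4-cycle: adjacency sets for vertices 0..3.
G = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(verify_hamiltonian_cycle(G, [0, 1, 2, 3]))  # True
print(verify_hamiltonian_cycle(G, [0, 2, 1, 3]))  # False: 0-2 is not an edge
```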

 

Okay, cool. Is there anything that's not known to be verifiable in polynomial time? Yes, here's one: given a Boolean formula in 3-CNF, is it unsatisfiable? Here's another: given a Boolean formula in 3-CNF, is there a unique satisfying assignment for it? What do these problems have in common? Having one assignment in hand isn't enough: we need to rule out all the other possible assignments. So there's a lot more out there than just $\mathbf P$ and $\mathbf{NP}$!

Now, the next natural question to ask is: what are the problems that can be efficiently verified, but not efficiently solvable? The answer to that is, we don't know, because we don't actually know if any exist.

Does there exist a problem $P \in \mathbf{NP}$ such that $P \not \in \mathbf P$?

You may be more familiar with the following formulation of the problem:

Does $\mathbf P = \mathbf{NP}$?

This is the famous $\mathbf P$ vs. $\mathbf{NP}$ problem—the biggest open problem in theoretical computer science for almost 50 years. Practically speaking, it's the roadblock that keeps us from definitively answering whether large numbers of problems are efficiently solvable.

Philosophically, this question raises deep questions about how we think about computation and efficiency, and it is one of the Millennium Prize Problems. Intuitively, it should be much easier to verify a solution to a problem than it is to find one. And yet, we've been unable to prove it.

A lot of the computational complexity theory that has been developed since the early 70s was in the pursuit of this question. And if you ever go on to study more complexity theory, what you'll find are some answers and way more questions.

$\mathbf{NP}$-completeness

Recall that reduction is not symmetric. This suggests that one thing we could do is try to find the "hardest" problems in $\mathbf{NP}$ and try to separate those from the "easier" ones. But what does "hardest" mean in this context? Naturally, if you can solve a hard problem, you can use it to solve an easier one. In other words, we want some problem $B$ for which $A \leq_P B$ for all $A \in \mathbf{NP}$. Such problems are called $\mathbf{NP}$-complete.

A problem $P$ is said to be $\mathbf{NP}$-complete if

  1. $P \in \mathbf{NP}$, and
  2. for all problems $P' \in \mathbf{NP}$, $P' \leq_P P$.

If a problem $P$ satisfies (2), we say $P$ is $\mathbf{NP}$-hard.

It's important to note that a problem can be $\mathbf{NP}$-hard but not in $\mathbf{NP}$, which is why that notion is separate from the notion of completeness. Problems that are $\mathbf{NP}$-complete represent the "most difficult" problems in $\mathbf{NP}$.

One consequence of the existence of such a problem is that it would give us a very easy way to show that $\mathbf P = \mathbf{NP}$. Suppose that $P$ is $\mathbf{NP}$-complete. If we find a polynomial time algorithm for $P$, then this means we have a polynomial time algorithm for every problem in $\mathbf{NP}$, by transitivity of $\leq_P$, and therefore $\mathbf P = \mathbf{NP}$.

Of course, an obvious question is whether such a problem exists. It sounds like a lot of work to show that a problem $P$ is $\mathbf{NP}$-complete, since one would have to show how to encode an arbitrary problem in $\mathbf{NP}$ into an instance of $P$ (how would this work for something like Vertex Cover?). But for now, let's suppose that there is an $\mathbf{NP}$-complete problem out there. By using transitivity of $\leq_P$, we get the following.

Let $P$ be $\mathbf{NP}$-complete and suppose $P' \in \mathbf{NP}$. If $P \leq_P P'$ then $P'$ is $\mathbf{NP}$-complete.

Let $Q \in \mathbf{NP}$. Since $P$ is $\mathbf{NP}$-complete, we have $Q \leq_P P$. But since $P \leq_P P'$, by transitivity, we have $Q \leq_P P'$. Since $Q$ was arbitrary, we have that $P'$ is $\mathbf{NP}$-complete.

So if we want to show that $P$ is $\mathbf{NP}$-complete, we can avoid having to show how to transform an instance of some arbitrary problem in $\mathbf{NP}$ into an instance of $P$. All we have to do is show that we can transform a problem we've already shown is $\mathbf{NP}$-complete into $P$, which is much easier. The other observation is that this means that all $\mathbf{NP}$-complete problems are polynomial time reducible to each other.

Of course, all of this only works if such a problem actually exists. That such a problem does exist was shown in 1971 by Stephen Cook, who we've seen before briefly when we were discussing fast integer multiplication. This is the result for which Cook is most famous. The theorem was independently proved by Leonid Levin, who at the time was in the Soviet Union. Levin is currently a faculty member at Boston University and Cook is a faculty member at the University of Toronto, which he joined in 1970 due to being denied tenure at Berkeley.

Satisfiability is $\mathbf{NP}$-complete.

One thing you might notice is that $\mathbf{NP}$ is not exactly an acronym for polynomially verifiable. That is because efficient verification is an alternate, but equivalent, definition for $\mathbf{NP}$. $\mathbf{NP}$ actually stands for nondeterministic polynomial time. The original definition has to do with distinguishing between problems that could be solved on deterministic vs. nondeterministic Turing machines.

For our purposes, we can think of it this way: we can solve a problem in $\mathbf{NP}$ in polynomial time using nondeterminism. How do we do this? First, we nondeterministically guess/choose a certificate $c$. Since $c$ is polynomial in size, we can do this in polynomial time. Then, once we have our guessed certificate, we can run the verifier on it in polynomial time to get our answer.
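Deterministically, the best obvious way to simulate this "guess and check" is to try every possible certificate, which is where the exponential blowup appears. Here is a sketch in Python, assuming we are handed a verifier V and a polynomial bound p (both hypothetical parameters here):

```python
from itertools import product

def solve_by_search(V, x, p):
    """Deterministically simulate "guess a certificate, then verify".

    Tries every binary certificate of length up to p(len(x)) and runs
    the verifier V on each one. Each individual check is fast, but
    there are exponentially many candidates; this blowup is exactly
    what nondeterminism sidesteps.
    """
    for length in range(p(len(x)) + 1):
        for bits in product("01", repeat=length):
            if V(x, "".join(bits)):
                return True   # some certificate convinced the verifier
    return False              # no certificate works: a No instance
```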

We won't go through the proof in detail here, because it involves Turing machines, but the proof is really amazing. So recall that a Turing machine is a model of computation and for all intents and purposes, you can consider a specific Turing machine as a computational device that computes the solution to a particular problem, much like an algorithm. Here, we're particularly interested in nondeterministic Turing machines.

Since $\mathbf{NP}$ is the class of problems that can be solved by a nondeterministic Turing machine in polynomial time, an easy way to show that every problem in $\mathbf{NP}$ reduces to a particular problem is to show that the problem can encode the computation of an arbitrary polynomial-time nondeterministic Turing machine. This is essentially what the proof of Cook-Levin does.

The idea is to take a nondeterministic Turing machine and an input string and to show how to construct a propositional formula that simulates the computation of the Turing machine on the input string. Roughly, the proof shows that polynomially many variables can encode the machine's configuration (tape contents, head position, and state) at each step of the computation, and that clauses can enforce that the first configuration is the starting one, that each configuration follows from the previous one by a legal move of the machine, and that the final configuration is accepting. The formula is then satisfiable if and only if the machine has an accepting computation on the input.

The consequence of this is that if we can solve Satisfiability, we can solve any problem in $\mathbf{NP}$. How? Here is an algorithm: given an instance $x$ of a problem in $\mathbf{NP}$, construct the formula that simulates the nondeterministic machine for the problem running on $x$, and then ask our Satisfiability algorithm whether that formula is satisfiable; it is if and only if $x$ is a Yes instance.

So if we can solve Satisfiability in polynomial time, we can solve any problem in $\mathbf{NP}$ in polynomial time.

The existence of an $\mathbf{NP}$-complete problem was obviously a huge breakthrough, and the natural follow-up question was what other $\mathbf{NP}$-complete problems might be out there. Since these are the hardest problems in $\mathbf{NP}$, the thinking was that we'd find a few more and it wouldn't be very long before we'd be able to show a clear separation from $\mathbf P$. Not long after, in 1972, Richard Karp (of Edmonds-Karp) published a paper showing that 21 problems were $\mathbf{NP}$-complete, some of which we've seen already. And since Cook-Levin and Karp, thousands of problems have been shown to be $\mathbf{NP}$-complete.

Practically speaking, this gives us a much easier way to prove that certain problems are $\mathbf{NP}$-complete (which, for now, means that we can give up on finding an efficient algorithm for them). To do this we need to show that our problem is in $\mathbf{NP}$ and that there is an $\mathbf{NP}$-complete problem that reduces to it. This can be summarized in four steps.

To prove that a problem $A$ is $\mathbf{NP}$-complete,

  1. Show that $A \in \mathbf{NP}$. In other words, describe how to verify Yes instances of $A$ in polynomial time, by describing an appropriate certificate and how you would verify it.
  2. Choose a problem $B$ that you already know is $\mathbf{NP}$-complete.
  3. Describe how to map instances of $B$ to instances of $A$ so that Yes instances of $B$ map to Yes instances of $A$ and No instances map to No instances.
  4. Show that your mapping can be computed in polynomial time.

We immediately get the following.

The following problems are $\mathbf{NP}$-complete:

  1. 3-SAT
  2. Independent Set
  3. Vertex Cover
  4. Hamiltonian Cycle
  5. Subset Sum
  6. 0-1 Knapsack

By Proposition 9.12, if $A$ is $\mathbf{NP}$-complete, and $A \leq_P B$, and $B \in \mathbf{NP}$, then $B$ is $\mathbf{NP}$-complete. Therefore, it suffices to show that each of the problems is in $\mathbf{NP}$ and there exists a reduction from an $\mathbf{NP}$-complete problem. By Theorem 9.13, Satisfiability is $\mathbf{NP}$-complete.

  1. 3-SAT is in $\mathbf{NP}$ by Proposition 9.9. We have Satisfiability $\leq_P$ 3-SAT by Theorem 8.16. Therefore, 3-SAT is $\mathbf{NP}$-complete.
  2. Independent Set is in $\mathbf{NP}$ by Proposition 9.9. We have 3-SAT $\leq_P$ Independent Set by Theorem 8.14 and 3-SAT is $\mathbf{NP}$-complete by Theorem 9.14(a). Therefore, Independent Set is $\mathbf{NP}$-complete.
  3. Vertex Cover is in $\mathbf{NP}$ by Proposition 9.9. We have Independent Set $\leq_P$ Vertex Cover by Proposition 8.10 and Independent Set is $\mathbf{NP}$-complete by Theorem 9.14(b). Therefore, Vertex Cover is $\mathbf{NP}$-complete.
  4. Hamiltonian Cycle is in $\mathbf{NP}$ by Proposition 9.9. We have 3-SAT $\leq_P$ Hamiltonian Cycle by Theorem 8.20 and 3-SAT is $\mathbf{NP}$-complete by Theorem 9.14(a). Therefore, Hamiltonian Cycle is $\mathbf{NP}$-complete.
  5. Subset Sum is in $\mathbf{NP}$ by Proposition 9.9. We have 3-SAT $\leq_P$ Subset Sum by Theorem 8.24 and 3-SAT is $\mathbf{NP}$-complete by Theorem 9.14(a). Therefore, Subset Sum is $\mathbf{NP}$-complete.
  6. 0-1 Knapsack is in $\mathbf{NP}$: A certificate is the list of items. We can check the values and weights in time $O(n\ell)$, where $\ell$ bounds the number of digits of each value and weight. We have Subset Sum $\leq_P$ 0-1 Knapsack by Theorem 8.23 and Subset Sum is $\mathbf{NP}$-complete by Theorem 9.14(e). Therefore, 0-1 Knapsack is $\mathbf{NP}$-complete.

 

This leads to another interesting question. We obviously have problems that are efficiently solvable. And we have a surprisingly large number of problems that are difficult, which are $\mathbf{NP}$-complete. In fact, most problems that we can show are in $\mathbf{NP}$ turn out to be either $\mathbf{NP}$-complete or actually in $\mathbf P$. But is there anything in between? More specifically, are there problems that are in $\mathbf{NP}$ but not $\mathbf{NP}$-complete? The answer, surprisingly, is we don't know! Part of the reason for this mystery is the following theorem of Ladner from 1975.

Suppose that $\mathbf P \neq \mathbf{NP}$. Then there exists a problem $P \in \mathbf{NP} \setminus \mathbf P$ that is not $\mathbf{NP}$-complete.

So if we can show that there aren't any such problems that are "in between" $\mathbf P$ and $\mathbf{NP}$-complete, then we can show that $\mathbf P = \mathbf{NP}$. Of course, the issue is that we don't know for sure either way. We have some problems that we think might qualify, but we haven't been able to prove it definitively.

There are some particularly important problems that could live in this space. The first is Integer Factorization. The standard version of this problem is to find the prime factors of an integer $n$. The decision problem version is: Given an integer $n$ and an integer $m$ with $1 \lt m \lt n$, does $n$ have a factor $d$ with $1 \lt d \lt m$? Despite there being no known efficient algorithm for Integer Factorization, no one has been able to show that it is $\mathbf{NP}$-complete. In fact, there's enough evidence that it's not completely crazy to think that Integer Factorization might turn out to be in $\mathbf P$. However, most people think this is a pretty good candidate for an intermediate problem.

The other big one is Graph Isomorphism: Given two graphs $G$ and $H$, are $G$ and $H$ isomorphic? We say two graphs are isomorphic if there's a way to map the vertices of one to the other so that all the edge relationships are preserved. In effect, an isomorphism is really a renaming of the vertices. Again, no one has been able to show that this problem has an efficient algorithm or that it's $\mathbf{NP}$-complete. The last big breakthrough on the problem came in 2015, when Professor Babai, in our department, gave a quasipolynomial-time algorithm; i.e., it runs in $2^{O((\log n)^c)}$ time. Graph Isomorphism is a special case of Subgraph Isomorphism, which is: Given two graphs $G$ and $H$, is there a subgraph of $G$ that's isomorphic to $H$? Subgraph Isomorphism is $\mathbf{NP}$-complete.
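For contrast with Babai's result, here is what the naive algorithm for Graph Isomorphism looks like: try all $n!$ possible renamings of the vertices. This is a Python sketch with graphs again as dictionaries of adjacency sets (representation ours):

```python
from itertools import permutations

def isomorphic(G, H):
    """Brute-force Graph Isomorphism: try every vertex renaming.

    Takes n! time in the worst case; Babai's algorithm brings this
    down to quasipolynomial.
    """
    gv, hv = sorted(G), sorted(H)
    if len(gv) != len(hv):
        return False
    h_edges = {frozenset((u, v)) for u in H for v in H[u]}
    for perm in permutations(hv):
        rename = dict(zip(gv, perm))          # candidate vertex renaming
        g_renamed = {frozenset((rename[u], rename[v]))
                     for u in G for v in G[u]}
        if g_renamed == h_edges:              # edge relationships preserved?
            return True
    return False

# A triangle relabelled is still a triangle.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
H = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(isomorphic(G, H))  # True
```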

Intuitively, it seems like these problems are "obviously" hard, but intuition has been proven wrong before. One example of this is Primality Testing: given an integer $n$, is it prime? This was proven to be in $\mathbf{NP}$ in 1975 by Pratt. But over the decades, the exact class to which Primality Testing belonged shifted, with the development of some randomized algorithms, until it was finally shown in 2004 by Agrawal, Kayal, and Saxena that Primality Testing was actually in $\mathbf P$!

All of this is to say that there is still a lot that we don't know in complexity theory and in algorithms.

Getting over it

In the context of a course like algorithms, we are interested in solving problems efficiently. And in less academic settings, part of what you'll have to do is somehow deal with these difficult problems. While you may not be cooking up reductions, you'll need to recognize when you might have a hard problem on your hands and know what to do when you do. Unfortunately, we don't have time to get into a lot of this, but it is useful to know that we mostly haven't just given up and curled into the fetal position.

So suppose you have an $\mathbf{NP}$-hard problem that you have to solve. How do you deal with it? You have a few options.

Look for special cases

The first step is to figure out if you can solve your problem for a specific case. Oftentimes, a problem might be difficult in general, but exhibit efficient algorithms for special cases. This can involve modifying the problem space (instead of general graphs, consider only bipartite graphs or trees) or modifying the problem itself (instead of arbitrary cost functions, impose some restriction on the cost function). Examples of this include Independent Set, which is solvable in polynomial time on trees, and Satisfiability restricted to clauses of two literals (2-SAT), which also has a polynomial-time algorithm.
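As an illustration of the tree case just mentioned, here is a sketch in Python: maximum independent set is $\mathbf{NP}$-hard in general, but linear-time on trees by dynamic programming (the rooted-tree representation is our own choice).

```python
def max_independent_set_tree(children, root):
    """Maximum independent set on a tree in O(n) time by DP.

    `children[v]` lists the children of v in a rooted tree. For each
    vertex we compute the best set size excluding v vs including v.
    """
    def solve(v):
        exclude, include = 0, 1
        for c in children.get(v, []):
            ex_c, in_c = solve(c)
            exclude += max(ex_c, in_c)  # child is free to be in or out
            include += ex_c             # children of a chosen vertex stay out
        return exclude, include

    return max(solve(root))

# A path 0-1-2-3 rooted at 0: the optimum is {0, 2} or {1, 3}, size 2.
print(max_independent_set_tree({0: [1], 1: [2], 2: [3]}, 0))  # 2
```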

Consider approximation

If that doesn't work, the next thing you can do is figure out whether you must solve your problem optimally, or whether there's room for some slack. Many times, an optimization problem may be difficult to solve exactly, but if you're willing to put up with some degree of suboptimality and accept a "good enough" solution, then many problems can be solved efficiently. This approach leads to the study of approximation algorithms, which are algorithms that guarantee a solution within a certain factor of optimal (e.g., a vertex cover that's at most twice as large as the smallest one).

An easy example of this is with vertex cover. Here's an easy way to approximate a vertex cover on any graph.

  1. Find a maximal matching, i.e., one to which no further edge can be added (this can be done greedily in polynomial time).
  2. For each edge in your matching, choose both endpoints of the edge to be in the vertex cover.

This can be expressed more directly as a greedy algorithm (greedy and randomized algorithms get quite a lot of use in approximation algorithms). But here's the idea: for each edge in the matching, you cover all edges that are incident to its endpoints, so you know that you've got a cover. However, you also know that any cover needs at least one vertex per edge of your matching, because the edges in the matching don't share any endpoints. This means that at worst, you've got a vertex cover that's twice as large as the actual optimal solution. Maybe that's good enough.
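Here is a sketch of that greedy version in Python (edges as a list of pairs; representation ours). Scanning the edges and grabbing both endpoints of any uncovered edge builds a maximal matching implicitly:

```python
def vertex_cover_2approx(edges):
    """Greedy 2-approximation for Vertex Cover.

    Take both endpoints of every edge of an implicit maximal matching.
    Any cover needs at least one endpoint per matched edge, so this
    cover is at most twice the optimum.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:   # edge not yet covered
            cover.update((u, v))                # take both endpoints
    return cover

# A path 0-1-2-3: the optimum cover {1, 2} has size 2; we return size 4.
print(vertex_cover_2approx([(0, 1), (1, 2), (2, 3)]))  # {0, 1, 2, 3}
```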

Finding such approximations is often but not always possible, and there is a wide range of trade-offs, like what degree of approximation you want or whether you're willing to sacrifice a bit of performance for more accuracy. For instance, while Vertex Cover is approximable to within a factor of 2, Set Cover is approximable to within a factor of $O(\log n)$ but, unless $\mathbf P = \mathbf{NP}$, no better than that. In other words, you can't get a constant-factor approximation for it.

Try exponential algorithms anyway

But there are also many problems that are not approximable efficiently unless $\mathbf P = \mathbf{NP}$. So what can you do then? At this point you cry and get ready to design a non-polynomial time algorithm. But even here, making use of the correct algorithm design strategies can make a big difference. We've already seen an example of this: the 0-1 Knapsack problem. By employing dynamic programming, we were able to get a solution that runs in $O(nW)$ time, where $W$ is the knapsack capacity. This is pseudopolynomial (polynomial in the magnitude of $W$, not in the number of bits needed to write it down), but it's still a significant improvement over the brute-force solution of checking all $2^n$ subsets of items.
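For reference, here is a sketch of that dynamic program in Python (the one-dimensional, space-saving form; names are ours):

```python
def knapsack(items, W):
    """0-1 knapsack by dynamic programming in O(nW) time.

    `items` is a list of (value, weight) pairs; W is the capacity.
    Pseudopolynomial: the table has n*W entries, and W can be
    exponential in the number of bits used to write it down.
    """
    best = [0] * (W + 1)                     # best[w] = max value at capacity w
    for value, weight in items:
        for w in range(W, weight - 1, -1):   # descending: each item used once
            best[w] = max(best[w], best[w - weight] + value)
    return best[W]

print(knapsack([(60, 1), (100, 2), (120, 3)], 5))  # 220: take weights 2 and 3
```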

One can also leverage developments in technologies like SAT solvers, which are programs designed to solve very large instances of SAT very quickly. While there are no theoretical guarantees on performance (the worst case is still not polynomial), these often work out well in practice for reasonably large instances. This approach relies on knowing how to transform your problem into SAT, which in turn relies on being able to recognize that you have an $\mathbf{NP}$-complete problem in the first place.
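As an illustration of what using a SAT solver looks like, here is a sketch using the third-party python-sat package (assuming it's installed via pip; the specific formula is made up). Clauses are lists of integer literals, with negation written as a minus sign:

```python
# pip install python-sat
from pysat.solvers import Glucose3

# Encode (x1 or x2) and (not x1 or x2) and (not x2 or x3).
# Variables are positive integers; a negative literal means negation.
with Glucose3() as solver:
    solver.add_clause([1, 2])
    solver.add_clause([-1, 2])
    solver.add_clause([-2, 3])
    if solver.solve():
        print(solver.get_model())   # e.g. [-1, 2, 3]: x2 and x3 true
```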

Beyond $\mathbf{NP}$-completeness

There's a lot left even beyond just dealing with $\mathbf{NP}$-completeness. Here's a bit of a roundup of things we've seen that might be of interest.

Hopefully, this is enough to give you a picture of what else is out there, whether you decide to continue studying theoretical computer science or not.