MPCS 55001 — Lecture 17

Classification of problems

So at this point, we have the tools to show that there are problems that are about as hard as each other, but we still don't know whether they have efficient algorithms or whether we should expect them to.

Ultimately, what we want to do is classify those problems that we know have efficient algorithms. But first, what is a problem? We have a vague notion of what this is, but we would like to formalize this notion. It might be surprising that we would want to formalize what a problem is mathematically or that we can even do this, but this is really what computer science is about: the mathematics of how to solve problems.

Recall that we mentioned that a decision problem is one that, when given an input, answers Yes or No. This sounds suspiciously like a function, so we might say that a problem is some function that maps inputs from $\mathbb N$ to $\{Y,N\}$. But as we've seen above, our domains can get much more complicated than $\mathbb N$.

Viewing such problems as functions is a good start, but we want something more general to work with. In the above examples, there's no unification in terms of the number of inputs or the domains of the functions, which makes it a bit difficult to talk about the two questions in the same way.

One of the first insights you will have internalized by now is that we can represent pretty much anything we like with a proper encoding. Generally speaking, this means that we can treat any input as a string. We can even treat "multiple" inputs, as in the case of our graph problem, as a single string, by encoding the delimiters appropriately as well. In this way, we can define decision problems as functions from strings to Yes or No.

Let $\Sigma$ be an alphabet. A decision problem is a Boolean function $P:\Sigma^* \to \{0,1\}$ that maps input strings over $\Sigma$ to a single bit.

Another way to look at a Boolean function is as a set $S$, where a string $x$ is in $S$ if $P(x) = 1$ and $x \not \in S$ if $P(x) = 0$.

Let $\Sigma$ be an alphabet. A decision problem is a set $P \subseteq \Sigma^*$. A string $x$ is a Yes instance of $P$ if and only if $x \in P$.

Then we can treat an algorithm as a function that tells us whether an input is a Yes or No instance, and we can associate a running time with it.

An algorithm $A$ solves a problem $P$ if \[A(x) = \begin{cases} \mathrm{Yes} & \text{if $x \in P$,} \\ \mathrm{No} & \text{if $x \not \in P$.} \end{cases}\] We say $A$ runs in polynomial time if for every input string $x$, $A(x)$ terminates in $\leq p(|x|)$ steps, where $p$ is a polynomial.

This allows us to define a class of problems based on the time complexity of the algorithms that solve them.

The class $\mathbf P$ is the class of decision problems for which there exists a polynomial time algorithm.

What problems are in $\mathbf P$? Almost every problem we've seen so far is in $\mathbf P$, because we have polynomial time algorithms that solve them (with a few exceptions).

However, we've also seen a few problems where we don't know whether there are polynomial time algorithms.

A definitive answer one way or the other would be helpful, but as mentioned before, we don't have one. But maybe we can try something less ambitious first: what if, instead of solving the problem outright, we could check whether a proposed solution is correct?

Imagine again that we have a mysterious cube that can solve a particularly difficult problem for us. A good question to ask about our mysterious cube is how we can tell if it's correct. If it can solve a hard problem, how can we possibly check it?

So the mysterious cube says: okay, I can give you proof. Along with a Yes answer to the problem, it prints out a receipt that we can use to check whether the answer is correct. And after being spooked by the cube that was apparently listening in on us, we have to figure out what to do with this new information.

An algorithm $V(x,c)$ is a verifier for a problem $P$ if for every input string $x$, $x \in P$ if and only if there exists a string $c$ such that $V(x,c) = \mathrm{Yes}$. We say that $c$ is a certificate.

Consider the Independent Set problem. If a graph $G$ contains an independent set, a certificate for this is the set $S$. Given $S$, we can check if $S$ is an independent set in polynomial time: for each vertex in $S$, check its neighbours and verify that none of them are in $S$.
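This check is simple enough to sketch directly. Here is one possible Python verifier; the function name and the representation of the graph as an edge list are my own choices:

```python
def verify_independent_set(n, edges, k, cert):
    """Check that cert is an independent set of size at least k
    in the graph with vertices 0..n-1 and the given edge list."""
    S = set(cert)
    if len(S) < k or not S <= set(range(n)):
        return False
    # No edge may have both endpoints in S: O(m) edge checks.
    return all(not (u in S and v in S) for u, v in edges)

# A 4-cycle 0-1-2-3-0: {0, 2} is independent, {0, 1} is not.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
```

Note that the verifier never has to *find* an independent set; it only confirms the one it was handed, which is exactly the asymmetry we are after.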

Of course, we can ask the same sorts of questions about verification algorithms as we do of any algorithm, particularly what its time complexity is. This leads us to the following idea.

The class $\mathbf{NP}$ is the class of decision problems for which there exists a polynomial time verifier $V$. That is, $V(x,c)$ is a polynomial-time algorithm and the certificate $c$ has size polynomial in the size of $x$: $|c| \leq p(|x|)$ for some polynomial $p$.

So which problems are in $\mathbf{NP}$? One easy thing to see is that any problem that was efficiently solvable to begin with is also efficiently verifiable.

$\mathbf{P} \subseteq \mathbf{NP}$.

Suppose we have a problem $P \in \mathbf P$. Then there is a polynomial time algorithm $A$ that solves it. We will define the following verifier $V$ for $P$:

  1. On input $x$ and certificate $c$, return $A(x)$.

Essentially, this verifier ignores whatever certificate is provided and just calls $A$ to compute the answer itself. This will always give us the correct answer for any certificate $c$. This verifier runs in polynomial time, since $A$ runs in polynomial time. Therefore, $P \in \mathbf{NP}$.

What else is in $\mathbf{NP}$? A lot of the problems we've seen in this section of the course are also in $\mathbf{NP}$.

The following problems are in $\mathbf{NP}$:

  1. Independent Set
  2. Vertex Cover
  3. Set Cover
  4. Satisfiability
  5. 3-SAT
  6. Hamiltonian Cycle

For each of the problems above, we will describe an appropriate certificate and how to use it to verify a solution in polynomial time.

  1. A certificate for Independent Set is a list of the vertices in the independent set. One can verify that the list of vertices is of size at least $k$ and that no two vertices share an edge in $O(m)$ time.
  2. A certificate for Vertex Cover is a list of the vertices in the vertex cover. One can verify that the list of vertices is of size at most $k$ and that no edge is uncovered in $O(m)$ time.
  3. A certificate for Set Cover is a list of the subsets $X_i$ in the cover. One can verify that the list of subsets is of size at most $k$ and that the union of the subsets is equal to the universe in $O(n)$ time, where $n$ is the number of elements in the universe.
  4. A certificate for Satisfiability is an assignment of each variable. One can verify that the assignment is satisfying for a formula by evaluating the formula in $O(n)$ time, where $n$ is the number of variables. To see this, we can form a parse tree for a well-formed formula and evaluate the truth value at each internal node—there are $O(n)$ of these.
  5. A certificate for 3-SAT is the same as for Satisfiability.
  6. A certificate for Hamiltonian Cycle is a list of the vertices visited in the order of the cycle. One can verify that a cycle is formed by checking for the existence of an edge between each successive pair of vertices in $O(n)$ time.
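As a concrete instance of the pattern above, the Hamiltonian Cycle verifier can be sketched as follows; this is a hypothetical Python function of my own, with the graph given as a list of directed edges:

```python
def verify_hamiltonian_cycle(n, edges, cert):
    """Check that cert visits each of the n vertices exactly once
    and that each successive pair is joined by an edge."""
    if sorted(cert) != list(range(n)):
        return False
    edge_set = set(edges)  # O(m) to build, O(1) expected lookups
    # Check the edge for each consecutive pair, wrapping around.
    return all((cert[i], cert[(i + 1) % n]) in edge_set
               for i in range(n))
```

The pairwise edge checks take $O(n)$ time once the edge set is built, matching the bound claimed above.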


Okay, cool. Is there anything that's not known to be verifiable in polynomial time? Yes, here's one: given a Boolean formula in 3-CNF, is it unsatisfiable? Here's another: given a Boolean formula in 3-CNF, is there a unique satisfying assignment for it? What do these problems have in common? Exhibiting one solution isn't enough: we need to rule out other possible solutions. So there's a lot more out there than just $\mathbf P$ and $\mathbf{NP}$!

Now, the next natural question to ask is: what are the problems that can be efficiently verified, but not efficiently solvable? The answer to that is, we don't know, because we don't actually know if any exist.

Does there exist a problem $P \in \mathbf{NP}$ such that $P \not \in \mathbf P$?

You may be more familiar with the following formulation of the problem:

Does $\mathbf P = \mathbf{NP}$?

This is the famous $\mathbf P$ vs. $\mathbf{NP}$ problem—the biggest open problem in theoretical computer science for almost 50 years. Practically speaking, it's the roadblock that keeps us from definitively answering whether large numbers of problems are efficiently solvable.

Philosophically, it raises deep questions about how we think about computation and efficiency, and it is one of the Millennium Prize Problems. Intuitively, it should be much easier to verify a solution to a problem than to find one. And yet, we've been unable to prove this.

A lot of the computational complexity theory that has been developed since the early 70s was in the pursuit of this question. And if you ever go on to study more complexity theory, what you'll find are some answers and way more questions.

$\mathbf{NP}$-completeness

Recall that polynomial-time reduction is not symmetric. This suggests that one thing we could do is try to find the "hardest" problems in $\mathbf{NP}$ and try to separate those from the "easier" ones. But what does "hardest" mean in this context? It would mean a problem $A$ such that $B \leq_P A$ for every $B \in \mathbf{NP}$. Such problems are called $\mathbf{NP}$-complete.

A problem $P$ is said to be $\mathbf{NP}$-complete if

  1. $P \in \mathbf{NP}$, and
  2. for all problems $P' \in \mathbf{NP}$, $P' \leq_P P$.

If a problem $P$ satisfies (2), we say $P$ is $\mathbf{NP}$-hard.

It's important to note that a problem can be $\mathbf{NP}$-hard but not in $\mathbf{NP}$, which is why that notion is separate from the notion of completeness. Problems that are $\mathbf{NP}$-complete represent the "most difficult" problems in $\mathbf{NP}$.

One consequence of the existence of such a problem is it would give us a very easy way to show that $\mathbf{NP} = \mathbf P$. Suppose that $P$ is $\mathbf{NP}$-complete. If we find a polynomial time algorithm for $P$, then this means we have a polynomial time algorithm for every problem in $\mathbf{NP}$, by transitivity of $\leq_P$, and therefore $\mathbf P = \mathbf{NP}$.

Of course, an obvious question is whether such a problem exists. It sounds like a lot of work to show that a problem $P$ is $\mathbf{NP}$-complete, since one would have to show how to encode an arbitrary problem in $\mathbf{NP}$ into an instance of $P$ (how would this work for something like Vertex Cover?). But for now, let's suppose that there is an $\mathbf{NP}$-complete problem out there. By using transitivity of $\leq_P$, we get the following.

Let $P$ be $\mathbf{NP}$-complete and suppose $P' \in \mathbf{NP}$. If $P \leq_P P'$ then $P'$ is $\mathbf{NP}$-complete.

Let $Q \in \mathbf{NP}$. Since $P$ is $\mathbf{NP}$-complete, we have $Q \leq_P P$. But since $P \leq_P P'$, by transitivity, we have $Q \leq_P P'$. Since $Q$ was arbitrary, we have that $P'$ is $\mathbf{NP}$-complete.

So if we want to show that $P$ is $\mathbf{NP}$-complete, we can avoid having to show how to transform an instance of some arbitrary problem in $\mathbf{NP}$ into an instance of $P$. All we have to do is show that we can transform a problem we've already shown is $\mathbf{NP}$-complete into $P$, which is much easier. The other observation is that this means that all $\mathbf{NP}$-complete problems are polynomial time reducible to each other.

Of course, all of this only works if such a problem actually exists. That such a problem does exist was shown in 1971 by Stephen Cook, who we've seen before briefly when we were discussing fast integer multiplication. This is the result for which Cook is most famous. The theorem was independently proved by Leonid Levin, who at the time was in the Soviet Union. Levin is currently a faculty member at Boston University and Cook is a faculty member at the University of Toronto, which he joined in 1970 due to being denied tenure at Berkeley.

Satisfiability is $\mathbf{NP}$-complete.

One thing you might notice is that $\mathbf{NP}$ is not exactly an acronym for polynomially verifiable. That is because efficient verification is an alternate, but equivalent, definition for $\mathbf{NP}$. $\mathbf{NP}$ actually stands for nondeterministic polynomial time. The original definition has to do with distinguishing between problems that could be solved on deterministic vs. nondeterministic Turing machines.

For our purposes, we can think of it this way: we can solve a problem in $\mathbf{NP}$ in polynomial time using nondeterminism. How do we do this? First, we nondeterministically guess/choose a certificate $c$. Since $c$ is polynomial in size, we can do this in polynomial time. Then once we have our string, we can run the verifier in polynomial time to get our answer.
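Without nondeterminism, the best generic simulation we know is to try every possible certificate, which blows up exponentially. A sketch of this idea (the function names are mine, and the subset-sum verifier is purely an illustration):

```python
from itertools import product

def decide_by_enumeration(x, verifier, alphabet, max_len):
    """Deterministically simulate the nondeterministic guess by
    trying every certificate of length at most max_len.
    This takes exponential time, unlike the nondeterministic machine,
    which "guesses" the right certificate in one step."""
    for length in range(max_len + 1):
        for cert in product(alphabet, repeat=length):
            if verifier(x, "".join(cert)):
                return True
    return False

def subset_sum_verifier(x, cert):
    """Toy verifier: cert is a bitstring selecting a subset of nums,
    and we accept if the selected numbers sum to the target t."""
    nums, t = x
    if len(cert) != len(nums):
        return False
    return sum(v for v, b in zip(nums, cert) if b == "1") == t
```

For example, `decide_by_enumeration(([3, 5, 7], 12), subset_sum_verifier, "01", 3)` eventually tries the certificate `"011"`, which selects $5 + 7 = 12$, and accepts.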

We won't go through the proof in detail here, because it involves Turing machines, but the proof is really amazing. So recall that a Turing machine is a model of computation and for all intents and purposes, you can consider a specific Turing machine as a computational device that computes the solution to a particular problem, much like an algorithm. Here, we're particularly interested in nondeterministic Turing machines.

Since $\mathbf{NP}$ is the class of problems that can be solved by a nondeterministic Turing machine in polynomial time, an easy way to show that every problem reduces to a particular problem is just to show how to simulate a nondeterministic Turing machine. This is basically what the proof of Cook-Levin does.

The idea is to take a nondeterministic Turing machine and an input string and to show how to construct a propositional formula that simulates the computation of the Turing machine on the input string. The formula's variables describe the machine's configuration (state, head position, tape contents) at each step, and its clauses enforce that the initial configuration is correct, that each configuration follows from the previous one according to the machine's transition rules, and that the final configuration is accepting. The formula is then satisfiable if and only if the machine accepts the input.

The consequence of this is that if we can solve Satisfiability, we can solve any problem in $\mathbf{NP}$. How? Given an instance $x$ of a problem in $\mathbf{NP}$, construct the formula that simulates the corresponding nondeterministic Turing machine on $x$, and then decide whether that formula is satisfiable.

So if we can solve Satisfiability in polynomial time, we can solve any problem in $\mathbf{NP}$ in polynomial time.

The existence of an $\mathbf{NP}$-complete problem was obviously a huge breakthrough, and the natural follow-up question was what other $\mathbf{NP}$-complete problems might be out there. Since these are the hardest problems in $\mathbf{NP}$, the thinking was that we'd find a few more and it wouldn't be long before we could show a clear separation from $\mathbf P$. Not long after, in 1972, Richard Karp (of Edmonds-Karp) published a paper showing that 21 problems were $\mathbf{NP}$-complete, some of which we've seen already. And since Cook-Levin and Karp, thousands of problems have been shown to be $\mathbf{NP}$-complete.

Practically speaking, this gives us a much easier way to prove that certain problems are $\mathbf{NP}$-complete (which, for now, means that we can give up on finding an efficient algorithm for them). To do this we need to show that our problem is in $\mathbf{NP}$ and that there is an $\mathbf{NP}$-complete problem that reduces to it. This can be summarized in four steps.

To prove that a problem $A$ is $\mathbf{NP}$-complete,

  1. Show that $A \in \mathbf{NP}$. In other words, describe how to verify Yes instances of $A$ in polynomial time, by describing an appropriate certificate and how you would verify it.
  2. Choose a problem $B$ that you already know is $\mathbf{NP}$-complete.
  3. Describe how to map instances of $B$ to instances of $A$ so that Yes instances of $B$ map to Yes instances of $A$ and No instances map to No instances.
  4. Show that your mapping can be computed in polynomial time.

We immediately get the following.

The following problems are $\mathbf{NP}$-complete:

  1. 3-SAT
  2. Independent Set
  3. Vertex Cover
  4. Set Cover
  5. Hamiltonian Cycle
  6. Undirected Hamiltonian Cycle
  7. Traveling Salesman
  8. Subset Sum
  9. 0-1 Knapsack

By Proposition 17.12, if $A$ is $\mathbf{NP}$-complete, and $A \leq_P B$, and $B \in \mathbf{NP}$, then $B$ is $\mathbf{NP}$-complete. Therefore, it suffices to show that each of the problems is in $\mathbf{NP}$ and there exists a reduction from an $\mathbf{NP}$-complete problem. By Theorem 17.13, Satisfiability is $\mathbf{NP}$-complete.

  1. 3-SAT is in $\mathbf{NP}$ by Proposition 17.9. We have Satisfiability $\leq_P$ 3-SAT by Theorem 16.5. Therefore, 3-SAT is $\mathbf{NP}$-complete.
  2. Independent Set is in $\mathbf{NP}$ by Proposition 17.9. We have 3-SAT $\leq_P$ Independent Set by Theorem 16.4 and 3-SAT is $\mathbf{NP}$-complete by Theorem 17.14(a). Therefore, Independent Set is $\mathbf{NP}$-complete.
  3. Vertex Cover is in $\mathbf{NP}$ by Proposition 17.9. We have Independent Set $\leq_P$ Vertex Cover by Proposition 15.10 and Independent Set is $\mathbf{NP}$-complete by Theorem 17.14(b). Therefore, Vertex Cover is $\mathbf{NP}$-complete.
  4. Set Cover is in $\mathbf{NP}$ by Proposition 17.9. We have Vertex Cover $\leq_P$ Set Cover by Proposition 15.13 and Vertex Cover is $\mathbf{NP}$-complete by Theorem 17.14(c). Therefore, Set Cover is $\mathbf{NP}$-complete.
  5. Hamiltonian Cycle is in $\mathbf{NP}$ by Proposition 17.9. We have 3-SAT $\leq_P$ Hamiltonian Cycle by Theorem 16.10 and 3-SAT is $\mathbf{NP}$-complete by Theorem 17.14(a). Therefore, Hamiltonian Cycle is $\mathbf{NP}$-complete.
  6. We claim Undirected Hamiltonian Cycle is in $\mathbf{NP}$: a certificate is a list of the vertices in the order of the Hamiltonian cycle. This can be verified by checking for the existence of an edge between each successive pair of vertices and can be done in $O(n)$ time, where $n$ is the number of vertices in the graph. We have Hamiltonian Cycle $\leq_P$ Undirected Hamiltonian Cycle by Theorem 16.9 and Hamiltonian Cycle is $\mathbf{NP}$-complete by Theorem 17.14(e). Therefore, Undirected Hamiltonian Cycle is $\mathbf{NP}$-complete.
  7. We claim Traveling Salesman is in $\mathbf{NP}$: A certificate is the list of vertices of the tour in order. We can check that this is a cycle with cost $\leq k$ by examining the cost of each edge, in time $O(n)$, where $n$ is the number of vertices. We have Hamiltonian Cycle $\leq_P$ Traveling Salesman by Theorem 98.2 and Hamiltonian Cycle is $\mathbf{NP}$-complete by Theorem 17.14(e). Therefore, Traveling Salesman is $\mathbf{NP}$-complete.
  8. Subset Sum is in $\mathbf{NP}$: A certificate is the list of numbers that sum to $t$. We can check this in $O(n\ell)$ time, where $n$ is the number of elements in the input and $\ell$ is the maximum size of the numbers. We have 3-SAT $\leq_P$ Subset Sum by Theorem 98.6 and 3-SAT is $\mathbf{NP}$-complete by Theorem 17.14(a). Therefore, Subset Sum is $\mathbf{NP}$-complete.
  9. 0-1 Knapsack is in $\mathbf{NP}$: A certificate is the list of items. Again, we can check the values and weights in time $O(n\ell)$. We have Subset Sum $\leq_P$ 0-1 Knapsack by Theorem 98.5 and Subset Sum is $\mathbf{NP}$-complete by Theorem 17.14(h). Therefore, 0-1 Knapsack is $\mathbf{NP}$-complete.
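To make one of the membership arguments above concrete, a Traveling Salesman verifier might look like the following in Python; the cost-matrix representation and the names are assumptions of mine:

```python
def verify_tsp_tour(n, cost, k, cert):
    """Check that cert is a tour of all n cities with total cost <= k.
    cost[u][v] is the cost of traveling between cities u and v."""
    # The tour must visit each of the n cities exactly once.
    if sorted(cert) != list(range(n)):
        return False
    # Sum the cost of each leg, including the return to the start.
    total = sum(cost[cert[i]][cert[(i + 1) % n]] for i in range(n))
    return total <= k

# Three cities with symmetric costs; the tour 0 -> 1 -> 2 -> 0 costs 4.
cost = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
```

Both checks take $O(n)$ time given the cost matrix, as claimed in the proof above.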


This leads to another interesting question. We obviously have problems that are efficiently solvable. And we have a surprisingly large number of problems that are difficult, which are $\mathbf{NP}$-complete. In fact, most problems that we can show are in $\mathbf{NP}$ turn out to be either $\mathbf{NP}$-complete or actually in $\mathbf P$. But is there anything in between? More specifically, are there problems that are in $\mathbf{NP}$ but not $\mathbf{NP}$-complete? The answer, surprisingly, is we don't know! Part of the reason for this mystery is the following theorem of Ladner from 1975.

Suppose that $\mathbf P \neq \mathbf{NP}$. Then there exists a problem $P \in \mathbf{NP} \setminus \mathbf P$ that is not $\mathbf{NP}$-complete.

So if we can show that there aren't any such problems that are "in between" $\mathbf P$ and $\mathbf{NP}$-complete, then we can show that $\mathbf P = \mathbf{NP}$. Of course, the issue is that we don't know for sure either way. We have some problems that we think might qualify, but we haven't been able to prove it definitively.

There are some particularly important problems that could live in this space. The first is Integer Factorization. The standard version of this problem is to find the prime factors of an integer $n$. As a decision problem: given integers $n$ and $m$ with $1 \lt m \lt n$, does $n$ have a factor $d$ with $1 \lt d \leq m$? (Asking whether $m$ itself divides $n$ would be too easy; we can check that with a single division.) Despite there being no known efficient algorithm for Integer Factorization, no one has been able to show that it is $\mathbf{NP}$-complete. In fact, there's enough evidence that it's not completely crazy to think that Integer Factorization might turn out to be in $\mathbf P$. However, most people think this is a pretty good candidate for an intermediate problem.

The other big one is Graph Isomorphism: given two graphs $G$ and $H$, are $G$ and $H$ isomorphic? Two graphs are isomorphic if there is a way to map the vertices of one to the vertices of the other so that all edge relationships are preserved; in effect, an isomorphism is a renaming of the vertices. Again, no one has been able to show that this problem has an efficient algorithm or that it's $\mathbf{NP}$-complete. The last big breakthrough on the problem was about five years ago, when Professor Babai, in our department, gave a quasipolynomial-time algorithm; i.e., one that runs in $2^{O((\log n)^c)}$ time for some constant $c$. Graph Isomorphism is a special case of Subgraph Isomorphism: given two graphs $G$ and $H$, is there a subgraph of $G$ that's isomorphic to $H$? Subgraph Isomorphism is $\mathbf{NP}$-complete.
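To make the definition concrete, here is a brute-force isomorphism test of my own; it tries all $n!$ relabelings, so it is only feasible for tiny graphs and says nothing about the complexity-theoretic status of the problem:

```python
from itertools import permutations

def are_isomorphic(n, edges_g, edges_h):
    """Try every bijection pi on vertices 0..n-1 and check whether
    it maps G's edge set exactly onto H's edge set.
    Assumes simple undirected graphs (no self-loops)."""
    g = {frozenset(e) for e in edges_g}
    h = {frozenset(e) for e in edges_h}
    if len(g) != len(h):
        return False  # isomorphic graphs must have equally many edges
    for pi in permutations(range(n)):
        if {frozenset((pi[u], pi[v])) for u, v in g} == h:
            return True
    return False
```

For example, the path `[(0, 1), (1, 2)]` (center vertex 1) is isomorphic to the path `[(1, 0), (0, 2)]` (center vertex 0) via the relabeling that swaps 0 and 1.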

Intuitively, it seems like these problems are "obviously" hard, but intuition has been proven wrong before. One example of this is Primality Testing: given an integer $n$, is it prime? This was proven to be in $\mathbf{NP}$ in 1975 by Pratt. But over the decades, the exact class to which Primality Testing belonged shifted, with the development of some randomized algorithms, until it was finally shown in 2004 by Agrawal, Kayal, and Saxena that Primality Testing was actually in $\mathbf P$!

All of this is to say that there is still a lot that we don't know in complexity theory and in algorithms.