CMSC 27200 — Lecture 23

So you might say, okay, it makes sense to reduce Vertex Cover to Independent Set because they're both graph problems. But what if we considered something that wasn't about graphs at all?

A boolean variable is a variable that can be assigned a value from $\{0,1\}$ (or $\{T,F\}$). A literal is a boolean variable or its negation. A clause is a disjunction of literals. We say a boolean formula $\varphi$ is in conjunctive normal form (CNF) if $\varphi$ is a conjunction of clauses.

Consider the formula \[\varphi = (\neg x_1 \vee x_2 \vee x_3) \wedge (x_1 \vee \neg x_2 \vee x_3) \wedge (\neg x_1 \vee x_2 \vee x_4).\] This formula is in conjunctive normal form and has

Recall that propositional formulas are logical statements that assert the truth of some statement. Usually this is mathematical in nature. We are interested in determining whether such a statement is ever true. In other words, is there a way for us to assign the variables in such a way that the formula becomes true? This is called the satisfiability problem.

The satisfiability (SAT) problem is: Given a propositional formula $\varphi$, does it have a satisfying truth assignment? The 3-SAT problem is: Given a propositional formula $\varphi$ in conjunctive normal form, where each clause contains exactly three literals, does it have a satisfying truth assignment?

Satisfiability is a basic problem in mathematical logic and has obvious applications in things like circuit design. However, as mentioned before, there are a number of mathematical statements that one can express in propositional logic, so being able to solve this would mean being able to determine the satisfiability of any statement that can be expressed as a propositional statement.

Given an instance $\varphi$ of 3-SAT, where $\varphi$ has $k$ clauses, we will construct an instance $(G,k)$ of Independent set that has an independent set of size $k$ if and only if $\varphi$ is satisfiable.

Our construction is as follows. For each clause $C = x_i \vee x_j \vee x_k$ in $\varphi$, we construct a triangle. Then, we add an edge between each literal and its negations in the other clauses.

Consider the following graph, constructed by using the formula from Example 23.2. Note that each triangle corresponds to a clause and that the literals $x_1$ and $x_2$ have edges to their negations, $\neg x_1$ and $\neg x_2$.

We will show that $\varphi$ is satisfiable if and only if $G$ contains an independent set of size $k$, where $k$ is the number of clauses in $\varphi$.

First, suppose $\varphi$ is satisfiable and consider a satisfying assignment. This means that in every clause, at least one of the literals is true. Then we choose one of these literals from each clause and these form an independent set $S$.

To see that they do, it is clear that only one vertex can be chosen from each triangle. Then all that remains is that there are no edges joining two vertices in different triangles. But the only way this can happen is between two literals $x_i$ and $\neg x_i$, which can't be the case in a satisfying assignment. therefore, $S$ is an independent set.

Next, suppose that $G$ contains an independent set $S$ of size $k$. Since $S$ is independent, it must contain exactly one vertex from each triangle. We set the literals corresponding to these vertices to True, as long as they are consistent. Then this means every clause evaluates to true, so $\varphi$ is satisfied.

With that we have the following relationships among the problems we've discussed so far: \[\mathrm{3SAT} \leq_P \mathrm{Independent\ Set} \leq_P \mathrm{Vertex\ Cover}.\]

Classification of problems

So at this point, we have the tools to show that there are problems that are about as hard as each other, but we still don't know whether they have efficient algorithms or whether we should expect them to.

Ultimately, what we want to do is classify those problems that we know have efficient algorithms. But first, what is a problem? We have a vague notion of what this is, but we would like to formalize this notion. It might be surprising that we would want to formalize what a problem is mathematically or that we can even do this, but this is really what computer science is about: the mathematics of how to solve problems.

Recall that we mentioned that a decision problem is one that, when given an input, answers a Yes or no. This sounds suspiciously like a function, so we might say that a problem is some function that maps inputs from $\mathbb N$ to $\{Y,N\}$. But as we've seen above, our domains can get much more complicated than $\mathbb N$.

Viewing such problems as functions is a good start, but we want something more general to work with. In the above examples, there's no unification in terms of the number of inputs or the domains of the functions, which makes it a bit difficult to talk about the two questions in the same way.

One of the first insights you will have internalized already is that we can represent pretty much anything we would like with the proper encoding. Generally speaking, this means that we can treat any input as a string. We can even treat "multiple" inputs, as in the case of our graph problem, as a single string, by encoding the delimiters appropriately as well. In this way, we can now define decision problems a functions from strings to Yes or No.

Let $\Sigma$ be an alphabet. A decision problem is a Boolean function $P:\Sigma^* \to \{0,1\}$ that maps input strings over $\Sigma$ to a single bit.

Another way to look at a Boolean function is as a set $S$, where a string $x$ is in $S$ if $P(x) = 1$ and $x \not \in S$ if $P(x) = 0$.

Let $\Sigma$ be an alphabet. A decision problem is a set $P \subseteq \Sigma^*$. A string $x$ is a Yes instance of $P$ if and only if $x \in P$.

Then we can treat an algorithm as a function that tells us whether an input is a Yes or No instance and we can associate it with a running time.

An algorithm $A$ solves a problem $P$ if \[A(x) = \begin{cases} \mathrm{Yes} & \text{if $x \in P$,} \\ \mathrm{No} & \text{if $x \not \in P$.} \end{cases}\] We say $A$ runs in polynomial time if for every input string $x$, $A(x)$ terminates in $\leq p(|x|)$ steps, where $p$ is a polynomial.

This allows us to define a class of problems based on the time complexity of the algorithms that solve them.

The class $\mathbf P$ is the class of decision problems for which there exists a polynomial time algorithm.

What problems are in $\mathbf P$? Almost every problem we've seen so far is a problem in $\mathbf P$, because we have polynomial time algorithms that solve them (with a few exceptions). As a reminder, the following problems that we've seen that are in $\mathbf P$ are

However, we've also seen a few problems where we don't know whether there are polynomial time algorithms.

A big problem with this is that we don't know whether there is or isn't an efficient algorithm for these problems. As mentioned before, a definitive answer one way or the other would be helpful, but we don't have one. But maybe we can try something less ambitious first. What if, instead of trying to solve the problem, we could check whether a proposed solution is correct?

Polynomial Time Verification

Imagine again that we have a mysterious cube that can solve a particularly difficult problem for us. A good question to ask about our mysterious cube is how we can tell if it's correct. If it can solve a hard problem, how can we possibly check it?

So the mysterious cube goes okay, I can give you proof and along with a Yes answer to the problem it prints out a receipt that we can use to check whether it's correct or not. And after being spooked by the cube that was apparently listening in on us, we have to figure out what to do with this new information.

An algorithm $V(x,c)$ is a verifier for a problem $P$ if for every input string $x$, $x \in P$ if and only if there exists a string $c$ such that $V(x,c) = yes$. We say that $c$ is a certificate.

Consider the Independent Set problem. If a graph $G$ contains an independent set, a certificate for this is the set $S$. Given $S$, we can check if $S$ is an independent set in polynomial time: for each vertex in $S$, check its neighbours and verify that none of them are in $S$.

Of course, we can ask the same sorts of questions about verification algorithms as we do of any algorithm, particularly what its time complexity is. This leads us to the following idea.

The class $\mathbf{NP}$ is the class of decision problems for which there exists a polynomial time verifier $V$. That is, $V(x,c)$ is a polynomial-time algorithm and the certificate $c$ has size polynomial in the size of $x$; $|c| \leq p(|x|)$ for some polynomial $p$.

So which problems are in $\mathbf{NP}$? One easy thing to see is have that any problem that was efficiently solvable to begin with is also efficiently verifiable.

Suppose we have a problem $P \in \mathbf P$. Then there is a polynomial time algorithm $A$ that solves it. We will define the following verifier $V$ for $P$:

On input $x$ and certificate $c$, Return $A(x)$

Essentially, this verifier ignores whatever certificate is provided and just calls $A$ to compute the answer itself. This will always give us the correct answer for any certificate $c$. This verifier runs in polynomial time, since $A$ runs in polynomial time. Therefore, $P \in \mathbf{NP}$.

What else is in $\mathbf{NP}$? A lot of the problems we've seen in this section of the course are also in $\mathbf{NP}$.

For each of the problems above, we will describe an appropriate certificate and how to use it to verify a solution in polynomial time.

A certificate for Independent Set is a list of the vertices in the independent set. One can verify that the list of vertices is of size at least $k$ and that no two vertices share an edge in $O(m)$ time.
A certificate for Vertex Cover is a list of the vertices in the vertex cover. One can verify that the list of vertices is of size at most $k$ and that no edge is uncovered in $O(m)$ time.
A certificate for Satisfiability is an assignment of each variable. One can verify that the assignment is satisfying for a formula by evaluating the formula in $O(n)$ time, where $n$ is the number of variables. To see this, we can form a parse tree for a well-formed formula and evaluate the truth value at each internal node—there are $O(n)$ of these.
A certificate for 3-SAT is the same as for Satisfiability.

Okay, cool. Is there anything that's not known to be verifiable in polynomial time? Yes, here's one: Suppose you have a boolean formula in 3-CNF; is it unsatisfiable? Here's another one: given a boolean formula in 3-CNF, is there a unique satisfying assignment for it? What do these problems have in common? Just having one solution isn't enough: we need to rule out other possible solutions. So there's a lot more out there than just $\mathbf P$ and $\mathbf{NP}$!

Now, the next natural question to ask is: what are the problems that can be efficiently verified, but not efficiently solvable? The answer to that is, we don't know, because we don't actually know if any exist.

Does there exist a problem $P \in \mathbf{NP}$ such that $P \not \in \mathbf P$?

This is the famous $\mathbf P$ vs. $\mathbf{NP}$ problem—the biggest open problem in theoretical computer science for almost 50 years. Practically speaking, it's the roadblock that keeps us from definitively answering whether large numbers of problems are efficiently solvable.

Philosophically, this question raises questions about how we think about computation and efficiency, and one of the Millennium Prize Problems. Intuitively, it should be much easier to verify a solution for a problem than it is to find it. And yet, we've been unable to prove it.

A lot of the computational complexity theory that has been developed since the early 70s was in the pursuit of this question. And if you ever go on to study more complexity theory, what you'll find are some answers and way more questions.