MPCS 55001 — Lecture 5

Randomization

We saw with Quicksort that when we were faced with a seemingly intractable choice, we could apply a bit of randomness and show that, on average, our algorithm runs reasonably well. Of course, that's only the theory; in practice, it turns out Quicksort runs really well. We're going to see how to apply randomization to algorithm design in a few more ways.

Minimum cut

We are now well acquainted with cuts, which partition a given graph into two parts, from computing minimum spanning trees. We made the observation that this was a natural thing to do, since MSTs are about connectivity and we are looking for the least-cost edge that connects two "parts" of the graph. We'll continue down this train of thought of a cut as a separation of the graph, as the name implies.

The (global) minimum cut problem is: given an undirected graph $G = (V,E)$, find a cut $(A,B)$ of $G$ whose cutset has minimum cardinality; such a cut is called a minimum cut.

Consider the following graph. The minimum cut is highlighted in green and blue, and its cutset is highlighted in yellow. The size of this cut is 2.

While we defined cuts before in terms of a set $S$, cuts in the context of the minimum cut problem are typically expressed as $(A,B)$. There isn't really a difference between the two definitions, because $B$ is necessarily $V - A$, since a cut has to partition the vertex set. At first glance, the problem seems troublesome: there are as many as $2^n$ cuts to consider.

While there are deterministic algorithms for solving this, applying just a bit of randomness gives a simpler and faster algorithm (with some caveats). The algorithm we'll see is due to Karger in 1994, while he was a PhD student. The idea is quite simple, but first we'll need the following definition.

Let $G = (V,E)$ be an undirected graph and let $e = (u,v) \in E$. The edge contraction of the edge $e$ in $G$ is a multigraph $G/e$ such that $e$ is deleted, $u$ and $v$ are removed from the graph and replaced with a vertex $w$, and all edges incident to $u$ or $v$ are now incident to $w$.

In the following, we contract the edge $(u,v)$. We allow multiple edges between two vertices after this operation, but disallow self-loops. Self-loops can arise from a contraction (for instance, when one of several parallel edges is contracted), and we remove them from the graph when they do.

Edge contraction seems like a weird and arbitrary thing to do, but it turns out to be a well-defined operation that leads to the idea of graph minors in graph theory (this is not what we'll be focusing on).

Edge contraction is the primary operation that we'll be using in this algorithm. From the definition, we necessarily allow our graphs to be multigraphs. That is, we allow graphs with multiple edges between the same pair of vertices.
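
To make this concrete, here is a minimal sketch of edge contraction in Python, assuming we store the multigraph simply as a list of edge tuples over hashable vertex labels; the representation and the function name `contract` are our choices for illustration, not part of the definition.

    def contract(edges, u, v):
        """Contract the edge (u, v): merge v into u, relabel incident edges,
        and drop any self-loops that the merge creates."""
        merged = []
        for a, b in edges:
            a = u if a == v else a
            b = u if b == v else b
            if a != b:               # a self-loop created by the merge: discard it
                merged.append((a, b))
        return merged

    # Contracting (u, v) in a triangle keeps the two resulting parallel edges:
    # contract([("u", "v"), ("u", "w"), ("v", "w")], "u", "v") == [("u", "w"), ("u", "w")]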

    \begin{algorithmic}
    \PROCEDURE{karger}{$G$}
        \IF{$|V| = 2$}
            \RETURN{the cut represented by $(v_1,v_2)$}
        \ELSE
            \STATE Choose an edge $e$ uniformly at random
            \RETURN{\CALL{karger}{$G/e$}}
        \ENDIF
    \ENDPROCEDURE
    \end{algorithmic}

Consider the following graph.

The following is a series of possible choices made by the algorithm.

Then our resulting cut is $(\{a,b,c\},\{d,e,f\})$ and its size is 1.

Implicitly, when we contract an edge, we end up with a "supernode" which contains the labels of the vertices of the edges we've contracted. By the time we end up with only two vertices, we have a set of vertices "in" $v_1$ and the rest of the vertices "in" $v_2$. These two supernodes then naturally define our cut, and the remaining edges between the two supernodes form our cutset.
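
Putting contraction together with the supernode bookkeeping just described, a single run of the algorithm might be sketched as follows in Python. This is only a sketch: it reuses the hypothetical `contract` helper from earlier and assumes the input graph is connected.

    import random

    def karger(vertices, edges):
        """One run of Karger's contraction algorithm.
        Returns the two supernodes as sets of original vertices, plus the cut size."""
        supernode = {v: {v} for v in vertices}    # every vertex starts as its own supernode
        edges = list(edges)
        while len(supernode) > 2:
            u, v = random.choice(edges)           # uniform over the remaining multigraph edges
            supernode[u] |= supernode.pop(v)      # fold v's original vertices into u's supernode
            edges = contract(edges, u, v)         # relabel edges and drop self-loops
        A, B = supernode.values()
        return A, B, len(edges)                   # the surviving edges form the cutset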

Let's think about the intuition behind this. What we want is a cut with as few edges as possible in its cutset. This means that if we select an edge of the graph at random, it probably isn't in the minimum cut's cutset. Of course, this becomes less and less true as contraction leaves us with fewer and fewer edges. This leads to the following observation: nothing actually guarantees that this algorithm is correct!

Suppose after the first edge contraction, our algorithm instead chose the following sequence of edge contractions.

Then our resulting cut is $(\{a,c\},\{b,d,e,f\})$ and its size is 2.

So Karger's algorithm can produce a cut that's not the minimum cut for $G$. If we observe closely what happened, we realize that if the algorithm happens to choose an edge that is in the cutset to contract, then it will not correctly produce that cut. This is very problematic if we care about our algorithms being correct.

Here, we harness randomness in a very different way than in Quicksort. In Quicksort, we were faced with a problem of how to find the pivot quickly and instead chose to leave the choice up to randomness. This choice means that we may end up with runs of Quicksort that run faster or slower, but our algorithm is always correct. Such algorithms are called Las Vegas algorithms.

Karger's algorithm takes a very different approach. Instead, our algorithm is always fast: a single run takes $O(m)$ time in the worst case. However, randomness is applied in a way that makes the result of our algorithm correct only with some probability $p$. These algorithms are called Monte Carlo algorithms.

Why do Monte Carlo algorithms work? We need the following result from probability theory.

Let $X$ be a Bernoulli trial with probability of success $p$. The expected number of trials until the first success is $\frac 1 p$.
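
As a quick check of this fact, the number of trials $T$ until the first success is geometrically distributed, so \[E(T) = \sum_{k \geq 1} k(1-p)^{k-1} p = p \cdot \frac{1}{(1 - (1-p))^2} = \frac 1 p,\] using the identity $\sum_{k \geq 1} k x^{k-1} = \frac{1}{(1-x)^2}$ for $|x| < 1$.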

In other words, we can get a correct answer with high probability if we run the algorithm enough times. Specifically, we expect to get a correct answer after running it about $\frac 1 p$ times. The key, then, is to find our probability of success $p$. The danger with this approach in general is that $p$ can shrink as a function of $n$, say $p = \frac 1 {f(n)}$. This means that the number of times we expect to run our algorithm is $\frac 1 p = f(n)$. As long as $f(n)$ is polynomial, we still have a polynomial-time algorithm.

As we discussed earlier, for our current problem, this amounts to figuring out the probability that an edge in the minimum cut will be chosen by the algorithm at any point in the run. We will show that the probability that our algorithm is correct won't be too bad.

Karger's algorithm produces a minimum cut with probability at least $\frac 2 {n^2}$.

Let $k$ be the size of a minimum cut $(A,B)$ in $G$, and consider the probability that no edge of its cutset is ever chosen. First, we need a handle on the number of edges in our graph. We observe that every vertex must have degree at least $k$, since otherwise the cut that separates that vertex from the rest of the graph would be smaller than $k$. Therefore, by the handshaking lemma, we have \[2|E| = \sum_{v \in V} \deg(v) \geq kn,\] which gives us $|E| \geq \frac 1 2 k n$. Then the probability that an edge of our cutset is chosen in the first iteration is at most \[\frac k {\frac 1 2 kn} = \frac 2 n.\] This gives us a probability of at least $1 - \frac 2 n$ of choosing an edge that isn't in our cutset.

But this is only for the first choice. We must also consider the probability of the choices as we recursively apply the algorithm. This takes the form of a recurrence. Let $E(n)$ be the event that our algorithm produces this minimum cut on a graph with $n$ vertices. Since contracting an edge outside the cutset leaves a graph with $n-1$ vertices in which $(A,B)$ is still a cut with the same $k$ cutset edges, we have \[\Pr(E(n)) \geq \left(1 - \frac 2 n \right) \cdot \Pr(E(n-1)) = \frac{n-2}{n} \cdot \Pr(E(n-1)).\] This recurrence unrolls relatively easily: \begin{align} \Pr(E(n)) &\geq \left(\frac {n-2} n \right) \left( \frac {n-3}{n-1} \right) \left(\frac {n-4} {n-2} \right) \cdots \left( \frac {1}{3} \right) \\ &= \frac{2}{n(n-1)} \\ &\geq \frac{2}{n^2} \end{align}


We make the additional observation that $\frac 2 {n(n-1)} = {\binom n 2}^{-1}$.

Again, it might be troubling that the probability that the algorithm is correct seems so small, but what this shows is that the success probability is only polynomially small, so we only need to run our algorithm polynomially many times. Combined with our result on the expected number of trials, we have the following.

The expected number of runs of Karger's algorithm until it produces a minimum cut is $O(n^2)$.

But maybe that still isn't too convincing. After all, this only says that we expect to run our algorithm about $\binom n 2$ times before we get the correct result, which is fine if we have an easy way to tell that we have found a minimum cut, but it's not clear how we can do that. Rather, we may want to bound the probability that we have found the minimum cut after a fixed number of runs. If we stick with $\binom n 2$ runs, then a very simple calculation bounds the probability that every run is incorrect: \[\left( 1 - \frac{1}{\binom n 2} \right)^{\binom n 2} \leq \frac 1 e.\] This neat bound comes out of the observation from analysis that $\lim\limits_{x \to \infty} \left(1 - \frac 1 x\right)^x = \frac 1 e$. It works out to a bit over $\frac 1 3$, which is not bad for a failure probability. Naturally, running the algorithm more times brings the probability of failure down further; an additional factor of $\ln n$ in the number of runs drives it down to an inverse polynomial in $n$.
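
Where does the $\frac 1 e$ bound itself come from? The standard inequality $1 - x \leq e^{-x}$ gives, for any $N \geq 1$, \[\left(1 - \frac 1 N\right)^N \leq \left(e^{-1/N}\right)^N = \frac 1 e,\] and the same inequality powers the calculation in the proof below.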

The probability of failure when Karger's algorithm is run $n^2 \ln n$ times is at most $\frac 1 {n^2}$.

The probability that a single run fails to find the minimum cut is at most $1 - \frac 2 {n^2}$. Since the runs are independent, the total probability of failure is at most $\left(1 - \frac 2 {n^2}\right)^{n^2 \ln n}$. We get \[\left(1 - \frac 2 {n^2}\right)^{n^2 \ln n} = \left( \left(1 - \frac 2 {n^2}\right)^{\frac{n^2} 2} \right)^{2 \ln n} \leq \left( \frac 1 e \right)^{2 \ln n} = \frac 1 {n^2}.\]
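
In code, the standard way to use a Monte Carlo algorithm like this is simply to repeat it and keep the best answer seen so far. Here is a hedged sketch of that driver, reusing the hypothetical `karger` function from before and the $n^2 \ln n$ repetition count from the claim above.

    import math

    def min_cut(vertices, edges):
        """Repeat Karger's algorithm about n^2 ln n times and keep the smallest cut found.
        Assumes a connected graph on at least 3 vertices."""
        n = len(vertices)
        runs = int(n * n * math.log(n)) + 1
        best = None
        for _ in range(runs):
            A, B, size = karger(vertices, edges)
            if best is None or size < best[2]:
                best = (A, B, size)
        return best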


When put together, this gives us a total running time of $O(mn^2 \log n)$ if we want our algorithm to produce a correct answer with high probability, meaning that the probability of failure is an inverse polynomial in $n$ (we say very high probability when the failure probability is exponentially small). This running time is actually not so different from what we get from deterministic algorithms. However, we have a very simple algorithm, and refinements of this idea over the following years resulted in an algorithm that runs in $O(m \log^3 n)$ time, due to Karger in 2000. This remains faster than the best known deterministic algorithms, and minimum cut is still an active area of research.

Hashing

By now, you've probably used dictionaries in a language like Python. These are convenient data structures that allow you to associate a value with some kind of key. At first glance, this seems to be just a nicer way to label an item than using an array and remembering that the 0th item is this and the 1st item is that and so on. But there's a lot more going on underneath the surface.

For instance, if we need to compare labels every time we look something up, how can this be fast? Even string comparison, say, can require $O(\ell)$ time for strings of length $\ell$. And how do we know where an item gets put based on its key? This also needs to be dealt with in order to get decent performance out of our data structure.

We will consider how to design such a data structure.

Let $U$ be a universe of elements. A dictionary is a structure for maintaining a subset of elements $S \subseteq U$.

As always, when discussing data structures, we want to talk about the operations they should support. For dictionaries, we need to be able to insert an element into $S$, delete an element from $S$, and look up whether a given element of $U$ is in $S$.

This does not look quite like dictionaries in practice, because our theoretical dictionaries make the simplifying assumption that the values are the same as keys. This assumption happens to be fine—the difficult part of implementing dictionaries is how to deal with keys, and the values can just get tacked on at no extra cost. So when we talk about items, we're really talking about keys. It becomes clear that our universe $U$ is a set of keys, which leads to the following complication.

The problem sounds quite simple at first glance, and in fact, we've had to deal with similar settings before. For instance, we needed to manage sets of vertices in graph algorithms. However, in those settings, our sets were typically close in size to our universe $V$. The added complication in our current setting is that $U$ is typically very large and $S$ can be very small in comparison. For instance, if we take arbitrary strings to be our set of keys, then it is often the case that we will use a very, very small portion of all strings as keys. This means that a strategy that involves keeping track of all possible elements in $U$ (as we did with graph algorithms) is not feasible.

Suppose $|S| = n$ and consider if we use an array-based data structure. This would involve $O(|U|)$ space, which would not be feasible. Similarly, using a list-based structure can be quite costly, with $O(n)$ or at best $O(\log n)$ time operations. We would like to achieve the $O(1)$ time cost of array-based structures, but with space requirements closer to $n$ than $|U|$.

The key to doing this is hashing.

A hash function is a function $h : U \to \{0, \dots, n-1\}$ that maps elements of $U$ to a number from $0$ to $n-1$. A hash table is an array $A$ of size $n$ such that an item $x \in U$ is stored in $A[h(x)]$.

Let $U$ be the set of all strings. We define a hash function $h:U \to \{0,\dots,n-1\}$ for $x \in U$ by \[h(x) = \operatorname{len}(x) \bmod n.\]

This is a nice and neat definition with an obvious problem. Since $|U| > n$, by the pigeonhole principle there must be distinct elements $u,v \in U$ such that $h(u) = h(v)$. Such a situation is called a hash collision. What then? The simplest remedy is to keep a linked list (or some other appropriate data structure) of all elements with the same address.

We will use $h$ as defined previously and $n = 5$. A hash table based on $h$ might look like the following.
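
As a code sketch of such a table, here is a minimal chained hash table built on the length-mod-$n$ hash above with $n = 5$; the class and method names (and the example strings in the final comment) are our own choices for illustration.

    class ChainedHashTable:
        """A hash table that resolves collisions by chaining: one Python list per slot."""

        def __init__(self, n=5):
            self.n = n
            self.slots = [[] for _ in range(n)]

        def _hash(self, key):
            return len(key) % self.n          # h(x) = len(x) mod n, as in the running example

        def insert(self, key):
            if not self.lookup(key):          # avoid storing duplicates
                self.slots[self._hash(key)].append(key)

        def delete(self, key):
            slot = self.slots[self._hash(key)]
            if key in slot:
                slot.remove(key)

        def lookup(self, key):
            return key in self.slots[self._hash(key)]

    # "cat" and "dog" both have length 3, so they hash to slot 3 and end up chained together.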

However, there is an obvious issue with this: what happens if we have a lot of collisions? If we're unlucky, we can end up hashing many elements to one location. Ideally, we would like to choose a hash function that minimizes collisions, thereby distributing items evenly through our table. It turns out this is very hard to do: if $|U| \geq n^2$, then for any hash function $h$ there exists a subset of $n$ elements that all have the same hash (by pigeonhole, some location must receive at least $\frac{|U|}{n} \geq n$ elements). On that subset, lookups take $O(n)$ time.

This becomes an issue when we think about how to compute our hashes. A simple way to do this is to use some property of the data—for example, if we have a bunch of identifiers, maybe we can do some math on them, like take the length or some sum of their digits, or maybe even apply a modulus to them. However, no matter what we do, such a hash function will inevitably concentrate elements around certain patterns present in the data.

So we have an overarching confounding issue—while we may like to think of given inputs as random, we can't make any such guarantees. This is an important distinction that makes things like average-case analysis difficult. Just because we can't know anything about our input does not mean that we can assume it's random. Assuming our input is random means assuming that it has a particular distribution, which is almost never the case.

So if we can't assume that our input is random, what can we do? In a sense, we will have to create our own randomness. But how? One obvious way to do this is to assign locations to items uniformly at random. It's clear that in this case we have, for a random hash function $h$ and two distinct items $u,v \in U$, \[\Pr_h(h(u) = h(v)) = \frac 1 n.\] To see this, fix the location of $u$; then $v$ is placed in one of $n$ equally likely locations, exactly one of which is the location of $u$.

So we've solved our problem—except that there's an issue. We need to remember where we put our items! So we need to create another table, which holds the results of $h$ and we're back at square one.
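
Concretely, remembering a uniformly random $h$ means writing down its value on every element of the universe, which takes \[|U| \cdot \lceil \log_2 n \rceil\] bits of space, exactly the $O(|U|)$ cost we were trying to avoid in the first place.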

Let's take a step back and think about what we've done with this hash function $h$ that chooses places uniformly at random for each item. From a programming perspective, it's easy to think of this as $h$ "generating" an address for each item as it comes in. However, from a mathematician's perspective, functions can't "generate" things in this way. Instead, what we're really doing is choosing a function $h : U \to \{0, \dots, n-1\}$ uniformly at random, from the set of all functions from $U$ to $\{0,\dots,n-1\}$.

This idea is enough to get us where we want: the right way to think about this problem is to think about how to choose a hash function at random. However, it's no good to make such a choice from all possible functions; we've already seen that storing such a choice is too expensive. Instead, we need to think about how to design a set of functions that is easily and predictably computed, but still has enough randomness that we can expect few collisions.

The idea that we'll consider is universal hashing, due to Carter and Wegman in the late 70's.

A family of hash functions $H$ is universal if for any pair of elements $u \neq v$, \[\Pr_{h \in H}(h(u) = h(v)) \leq \frac 1 n.\]

That is, over a uniformly random choice of $h$ from $H$, the probability that $h(u) = h(v)$ is at most $\frac 1 n$ for every pair of distinct elements $u \neq v$. Again, this holds for all inputs: we do not assume that inputs are random, only that our choice of $h$ is random.

Let $U = \{a,b,c,d,e,f\}$ and take $n = 2$. Consider the following hash functions:

\[\begin{array}{c|cccccc}
 & a & b & c & d & e & f \\ \hline
h_1(x) & 0 & 0 & 0 & 1 & 1 & 1 \\
h_2(x) & 0 & 1 & 0 & 1 & 0 & 1
\end{array}\]

Since $n = 2$, what we want is a family of hash functions such that the probability that two items have a hash collision, over a uniformly random choice of hash function from the family, is at most $\frac 1 2$. If we consider the family $H = \{h_1,h_2\}$, we can see that it is not universal by considering the probability of $a$ and $c$ colliding. From the table, we see that $h(a) = h(c)$ for both functions, so the probability that they collide is 1; they always collide.

If we add the following functions to $H$, we get a universal family.

\[\begin{array}{c|cccccc}
 & a & b & c & d & e & f \\ \hline
h_3(x) & 0 & 0 & 1 & 0 & 1 & 1 \\
h_4(x) & 1 & 0 & 0 & 1 & 1 & 0
\end{array}\]

You can verify the details yourself. If we consider the case of $a$ and $c$, then we observe that they collide for $h_1$ and $h_2$ but not for $h_3$ and $h_4$. Therefore, they collide on $\frac 1 2$ of the choices of hash function, which gives us our desired probability of $\frac 1 2$.
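
If you would rather check all $\binom 6 2 = 15$ pairs mechanically, a short script along the following lines will do it; the dictionaries simply transcribe the tables for $h_1$ through $h_4$.

    from itertools import combinations

    # The four hash functions from the tables above, written as lookup dictionaries.
    H = [
        {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1},   # h1
        {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1},   # h2
        {"a": 0, "b": 0, "c": 1, "d": 0, "e": 1, "f": 1},   # h3
        {"a": 1, "b": 0, "c": 0, "d": 1, "e": 1, "f": 0},   # h4
    ]

    def is_universal(family, universe, n):
        """Check that every pair of distinct elements collides on at most a 1/n fraction of the family."""
        for u, v in combinations(universe, 2):
            collisions = sum(1 for h in family if h[u] == h[v])
            if collisions / len(family) > 1 / n:
                return False
        return True

    print(is_universal(H, "abcdef", 2))       # True: the full family is universal
    print(is_universal(H[:2], "abcdef", 2))   # False: {h1, h2} alone is not (a and c always collide)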

So we may wonder whether this property is enough to guarantee good performance of our hash tables. If we think back to the idea of handling collisions with linked lists, we would like our hashing scheme to produce linked lists of expected size around $\frac m n$, where $m$ is the number of elements we've inserted and $n$ is the size of our hash table.

Let $H$ be a universal family of hash functions mapping $U$ to $\{0,\dots, n-1\}$. Then the expected running time of a hash table operation is $O(\alpha)$, where $m$ is the number of elements in the hash table, $n$ is the size of the table, and $\alpha = \frac m n$.

Recall that the cost of a hash table operation is based on the size of a linked list in an entry of the hash table, which is equivalent to the number of hash collisions on that entry. So we want to compute the expected number of collisions on some element, say $u$, in our hash table $A$.

Define a random variable $C_u$ for the number of collisions with $u$ in $A$ and define $C_{u,v}$ for whether $u$ and $v$ collide, by \[C_{u,v} = \begin{cases} 1 & \text{if $h(u) = h(v)$,} \\ 0 & \text{otherwise}. \end{cases}\] Then $C_u = \sum\limits_{v \in A, v \neq u} C_{u,v}$. By linearity of expectation, we have $E(C_u) = \sum\limits_{v \in A, v \neq u} E(C_{u,v})$, so we need to compute $E(C_{u,v})$ for each pair of distinct elements $u,v$. Since $C_{u,v}$ is a Bernoulli trial and $H$ is universal, we have \[E(C_{u,v}) = \Pr(C_{u,v} = 1) \leq \frac 1 n.\] Putting this together, we have \[E(C_u) = \sum_{v \in A, v \neq u} E(C_{u,v}) \leq \sum_{v \in A, v \neq u} \frac 1 n \leq \frac m n.\] Since the expected number of collisions with any single element is at most $\alpha = \frac m n$, the expected length of the linked list that an operation has to scan is at most $1 + \alpha$. This means that the expected running time of a hash table operation is $O(1 + \alpha)$, which is just $O(\alpha)$ for the load factors we care about.

If we assume that the number of items in the table is at most a constant multiple of $n$, then $\alpha = O(1)$ and we can argue that we have $O(1)$ expected time for our operations.

This result shows that we don't need full randomness for our purposes. Rather, it's enough to introduce just enough randomness that it defeats an adversary (here, we use adversary in the theoretical sense—as a process that makes a series of worst-case choices).

However, based on our definition, it's not actually clear that universal families of hash functions even exist, much less that they can be effectively computed and that we can efficiently make a random choice for them...