CMSC 27100 — Lecture 21

Probability

Now that we have spent some time learning how to enumerate outcomes, we can start to think about the likelihood of particular outcomes. This is where probability theory comes into play. In particular, we are interested in discrete probability, which deals with countably many outcomes and is attacked using many of the tools that we've been using in this class. The "non-discrete" branch of probability theory deals with uncountably many outcomes and is approached using methods from calculus and analysis. For instance, where we can get away with summation, you will see a lot of integration when working with continuous probability distributions.

At first glance, even though we're primarily concerned with discrete probability, it seems a bit strange that we'd try to stuff probability theory into this course, which is already chock-full of other things to cover. But probability and randomness play an important role in computer science, and particularly in theoretical computer science. Probabilistic methods are an important part of algorithm design and analysis. There are some problems for which we may not know of an efficient deterministic algorithm, but we can come up with a provably efficient randomized algorithm. To tackle these problems, we need a basic understanding of discrete probability. The nice thing about this is that we don't really need too much fancy probability theory for this purpose.

A probability space $(\Omega,\Pr)$ is a pair consisting of a finite set $\Omega \neq \emptyset$ together with a function $\Pr : \Omega \to \mathbb R^+$ such that

  1. $\forall \omega \in \Omega, \Pr(\omega) > 0$,
  2. $\sum\limits_{\omega \in \Omega} \Pr(\omega) = 1$.

We say that $\Omega$ is the sample space and $\Pr$ is the probability distribution.

The above two conditions are known as the probability axioms and are due to Andrey Kolmogorov in the 1930s. Kolmogorov was a Soviet mathematician who was also particularly influential in computer science, developing notions of algorithmic information theory in the 1960s. It might be a bit surprising to learn (as I did) that the formalization of basic probability concepts didn't come about until the 20th century, despite analyses of games of chance dating back to 17th-century France.

We can define more set theoretic notions for probability.

An event is a subset $A \subseteq \Omega$ of the sample space.

Because the sample space and events are just sets, we can often give "schematic" visual representations of them in the same way as sets. For example, the following is the sample space $\Omega$, which we treat in the same way as a universe, and events $A$ and $B$.

Intuitively, if we "throw a dart" at $\Omega$, then the probability that it lands in $A$ should be $\Pr(A)$.

A sample space for flipping a coin would be whether it lands on heads or tails: $\{H,T\}$. A sample space for the outcome of a die roll would be $\{1,2,3,4,5,6\}$. An event can be something simple like flipping a coin and it landing on heads, in which case the event would be the set $\{H\}$. It can also be more complicated, like a die roll landing on a perfect square, which would be the set $\{1,4\}$, or on a positive divisor of 6, which would be the set $\{1,2,3,6\}$.

For each sample space, we define a probability distribution, which tells us the probability of a particular outcome $\omega \in \Omega$. The simplest one is the uniform distribution.

The uniform distribution over a sample space $\Omega$ defines $\Pr(\omega) = \frac 1 {|\Omega|}$ for all $\omega \in \Omega$. Thus, $\Pr(A) = \frac{|A|}{|\Omega|}$.

Typically, this is the distribution we assume when we have an arbitrary finite set and say we pick an element at random from it—every item has an equal probability $\frac 1 n$ of being chosen, where $n = |\Omega|$ is the number of elements.
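As a quick sanity check, here is a minimal Python sketch of this computation, using the divisors-of-6 event from the die example above (the variable names are ours):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}            # sample space for a single die roll
A = {1, 2, 3, 6}                      # event: the roll is a positive divisor of 6

# Under the uniform distribution, Pr(A) = |A| / |Omega|.
print(Fraction(len(A), len(omega)))   # 2/3
```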

In some sense, our picture of $\Omega$ and events $A$ and $B$ above is a bit inaccurate, because it depicts a continuous sample space. It might be better to make an explicitly discrete sample space. We can visualize this as a sample space that is broken up into discrete chunks.

Here, we have the same idea, except we have discretized the space into 12 blocks, each of equal size. So now, we have $\Pr(A) = \frac 6{12} = \frac 1 2$ and $\Pr(B) = \frac{5}{12}$.

It's clear from above that the sample space for rolling one die is $\{1,2,3,4,5,6\}$, but what about two dice? Assuming they're distinguishable, we have the space \[\{(i,j) \mid i,j \in \{1,2,3,4,5,6\}\}.\] Of course, we can go a bit further than this: this is just $\{1,2,3,4,5,6\} \times \{1,2,3,4,5,6\}$.
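Since the sample space is literally a Cartesian product, we can also construct it mechanically. A small, purely illustrative Python sketch:

```python
from itertools import product

die = range(1, 7)
two_dice = list(product(die, die))   # {1,...,6} x {1,...,6}
print(len(two_dice))                 # 36 ordered outcomes, e.g. (3, 5)
```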

What about coins? It turns out that flipping a bunch of coins is probably the most important sample space for theoretical computer science. Why? Let's try to do the same exercise as with dice. What if we want to represent a series of three coin flips? We can play the same game as with the dice: we have $\{H,T\} \times \{H,T\} \times \{H,T\}$ for our sample space. Now, if we were to generalize this to $n$ coin flips, we would have something like $\{H,T\}^n$. This space looks a lot more familiar if we rename $T = 0$ and $H = 1$, giving us the space $\{0,1\}^n$. That's right—the sample space of $n$ coin flips can be represented as the set of length $n$ binary strings!

This is nice because it gives more insight into how we might think of generating a random number. Typically, we think of a random number as something that we just "choose" "at random", but most people do not make choices that are truly uniformly random. This suggests a better way of thinking about random numbers. First, figure out how large a random number you want. Then flip a coin $n$ times, where $n$ is the number of bits in your number. Then you have a binary string that you can interpret as a number!
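Here is a minimal Python sketch of this procedure; the function name is our own invention, and `random.choice` stands in for a fair coin flip:

```python
import random

def random_number_from_coin_flips(n):
    """Flip n fair coins (H = 1, T = 0) and read the flips as an n-bit binary string."""
    value = 0
    for _ in range(n):
        flip = random.choice([0, 1])   # one fair coin flip
        value = (value << 1) | flip    # append the flip as the next binary digit
    return value

print(random_number_from_coin_flips(8))   # a uniform number in {0, ..., 255}
```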

This also suggests that creating a larger random number requires more randomness, and it gives us a way to think about randomness as a resource. If we need more coin flips, we are using more bits of randomness. We can then treat this quantity as a resource just like we do with time or space, ask how much randomness an algorithm might need, and talk about the amount of randomness as a function of the size of the input to the algorithm.

Recall that a standard deck of playing cards consists of four suits, $\spadesuit, \heartsuit, \clubsuit, \diamondsuit$, each with cards numbered 2 through 10 together with a Jack, Queen, King, and Ace. This gives us 52 cards in total ($4 \times 13$). Suppose we draw five cards; what is the probability that we draw a flush—that is, five cards of the same suit?

Our sample space $\Omega$ here will be all possible draws of five cards and our event $F$ is drawing a flush. Since the order in which we arrange these cards doesn't matter, this is a combination and there are $\binom{52}5$ such possible draws. Then, we need to count how many flushes there are. Since each suit has 13 cards, we have $\binom{13}5$ possible ways to get a flush from a single suit. Since there are four suits, we have $\binom 4 1 \binom{13}5$ flushes in total. So the probability that we have a flush is \[\frac{|F|}{|\Omega|} = \frac{\binom 4 1 \binom{13}{5}}{\binom{52}5}.\]
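We can double-check this arithmetic directly; a quick sketch using Python's `math.comb`:

```python
from math import comb

flushes = comb(4, 1) * comb(13, 5)   # pick a suit, then 5 of its 13 cards
hands = comb(52, 5)                  # all five-card draws
print(flushes / hands)               # ≈ 0.00198
```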

Now, suppose we deal a hand of 13 cards from a standard 52 card deck to four players: North, East, South, and West. What is the probability that North holds all the aces?

First, we need to think about what our sample space should be. Since we are concerned with hands, this will be the result of the deal: all of the possible ways to distribute 13 card hands to four people. Combinatorially, this is distributing 52 distinguishable objects into four distinguishable boxes with 13 objects each. Then the size of our sample space is $$|\Omega| = \binom{52}{13,13,13,13}.$$ Now, we want to count the number of such hands where North holds all the aces. In this case, North holds four cards that are fixed already, so we need to count how the other 48 cards are distributed. Let's call this event $N_{4A}$. This gives us $$|N_{4A}| = \binom{48}{9,13,13,13}.$$ Then the probability of our event is just \begin{align*} \Pr(N_{4A}) &= \frac{|N_{4A}|}{|\Omega|} \\ &= \frac{\binom{48}{9,13,13,13}}{\binom{52}{13,13,13,13}} \\ &= \frac{13 \cdot 12 \cdot 11 \cdot 10}{52 \cdot 51 \cdot 50 \cdot 49} \\ &\approx 0.00264 \end{align*}
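To verify this numerically, here is a short Python sketch; the helper `multinomial` is our own, since the standard library's `math` module has `comb` but no multinomial coefficient:

```python
from math import factorial

def multinomial(n, parts):
    """n! / (k_1! k_2! ... k_m!), where the parts sum to n."""
    assert sum(parts) == n
    result = factorial(n)
    for k in parts:
        result //= factorial(k)   # each division is exact
    return result

omega = multinomial(52, [13, 13, 13, 13])
north_all_aces = multinomial(48, [9, 13, 13, 13])
print(north_all_aces / omega)              # ≈ 0.00264
print((13*12*11*10) / (52*51*50*49))       # the simplified form agrees
```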

However, sometimes a uniform distribution won't cut it: there are situations in which the outcomes are not all equally likely. This is why we define probability as any function that satisfies the probability axioms.

Suppose we wanted to discuss the probability of a roll of a single die. For a fair die, this would be quite simple: our sample space would be $\{1,2,3,4,5,6\}$ and each outcome would be equally likely, with probability $\frac 1 6$. However, if we wanted to model a fixed or biased die, one in which some outcomes are more likely than others, then our probability distribution for such a die would no longer be the uniform distribution. We would have to define our own probability distribution, ensuring that $\Pr(\Omega) = 1$. One such possibility would be to define the following distribution: \[\Pr(\omega) = \begin{cases} \frac 3 4 & \text{if $\omega = 1$}, \\ \frac 1{20} & \text{if $\omega = 2,3,4,5,6$}. \end{cases}\]
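Here is a small Python sketch of this biased die—first checking that the weights satisfy the second axiom, then sampling from them (the dictionary `pr` is our own construction):

```python
import random
from fractions import Fraction

# The biased distribution above: Pr(1) = 3/4 and Pr(k) = 1/20 for k = 2,...,6.
pr = {1: Fraction(3, 4), **{k: Fraction(1, 20) for k in range(2, 7)}}
assert sum(pr.values()) == 1   # the axioms require the probabilities to sum to 1

# random.choices accepts per-outcome weights, so we can sample from pr directly.
rolls = random.choices(list(pr), weights=[float(p) for p in pr.values()], k=10_000)
print(rolls.count(1) / len(rolls))   # should be close to 0.75
```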

Since we are dealing with sets, we can combine our sets in the same way as we've been doing to consider the probability of multiple events together. First, the following is a simple, but important bound.

$\Pr(A \cup B) \leq \Pr(A) + \Pr(B)$.

This follows directly from our understanding of sets. Assuming the uniform distribution for convenience (the general case works the same way, summing $\Pr(\omega)$ over the outcomes instead of counting them), we have \begin{align*} \Pr(A \cup B) &= \frac{|A \cup B|}{|\Omega|} \\ &= \frac{|A| + |B| - |A \cap B|}{|\Omega|} \\ &= \frac{|A|}{|\Omega|} + \frac{|B|}{|\Omega|} - \frac{|A \cap B|}{|\Omega|} \\ &= \Pr(A) + \Pr(B) - \Pr(A \cap B) \\ &\leq \Pr(A) + \Pr(B) \end{align*}
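As a concrete check, here are the numbers from the 12-block picture above, where $|A| = 6$, $|B| = 5$, and $|A \cap B| = 2$:

```python
from fractions import Fraction

pr_a, pr_b, pr_ab = Fraction(6, 12), Fraction(5, 12), Fraction(2, 12)
pr_union = pr_a + pr_b - pr_ab    # inclusion-exclusion gives 9/12
assert pr_union <= pr_a + pr_b    # the union bound: 9/12 <= 11/12
```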


Just like with sets, we can define the notion of disjointness for events.

Events $A$ and $B$ are disjoint if $A \cap B = \emptyset$.

$\Pr \left( \bigcup\limits_{i=1}^k A_i \right) \leq \sum_{i=1}^k \Pr(A_i)$. Furthermore, this is an equality if and only if the $A_i$ are pairwise disjoint.

This is just a generalization of the union bound, and the proof comes straight from applying it; alternatively, you can prove it by induction. If the events are disjoint, this is really pretty much the sum rule. This result is due to George Boole, the same Boole who gave us Boolean algebra, and it, too, is often called the union bound.

Conditional probability

Something that doesn't quite have a direct analogue from our discussions on set theory is the notion of conditional events. Arguably, this is what makes probability theory more than just fancy combinatorics.

If $A$ and $B$ are events with $\Pr(B) > 0$, then the conditional probability of $A$ given $B$ is $$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}.$$ Note that by this definition, $\Pr(A \cap B) = \Pr(A \mid B) \cdot \Pr(B)$.

One way of looking at this is if we restrict our attention to the event $B$, treat it like a sample space, and ask, what is the probability of $A$ in this space? In other words, what we're saying here is that if we know that $B$ already happened, then what is the probability that $A$ will happen? If we assume that our outcomes are uniformly distributed, then we can think about this in terms of sets: $$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\frac{|A \cap B|}{|\Omega|}}{\frac{|B|}{|\Omega|}} = \frac{|A \cap B|}{|B|}.$$

If we look back at our diagram, we have the following.

We can view this "globally" with respect to $\Omega$ and compute the probabilities: \[\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\frac{2}{12}}{\frac{5}{12}} = \frac 2 5.\] Or, if we jump straight into our observation above about "recentering" our space on $B$, we get that $\Pr(A \mid B) = \frac{|A \cap B|}{|B|} = \frac 2 5$ all the same.

Consider a roll of three dice.

  1. What is the probability that the sum of the dice is 9?
  2. What is the probability that the first die shows 5?
  3. What is the probability that the first die shows 5, given that the sum of the dice is 9?

Here, if we want to talk about the "first" die, we need our dice to be distinguishable (maybe we colour them each a different colour). This means that an outcome in our sample space is an arrangement of the results of the three dice after a roll—a 3-tuple, or triple. Each of the three dice can take on a value from 1 through 6, so our sample space is the three-fold Cartesian product $\{1, \dots, 6\}^3$, giving us $6^3 = 216$ possible outcomes.

  1. What we want here is the number of solutions to $x + y + z = 9$, with $1 \leq x, y, z \leq 6$. This is very similar to a combinations-with-repetition problem: we want 9 items of the three types $x$, $y$, and $z$. But there's a problem here: going ahead with this problem type includes some solutions that we don't want to consider. In particular, these solutions allow for $x,y,z = 0$, which we don't want if we're thinking about dice.

    To solve this, we rephrase the problem as trying to find solutions to the equation $x+y+z=6$ and count the solutions for which $x,y,z$ take on values from 0 through 5. Informally, what we've done is removed 1 from each of $x,y,z$, so that we count the proper numbers. This gets us most of the way there: there are $\binom 8 2$ such solutions. However, this includes a few solutions that we don't want: $(6,0,0), (0,6,0), (0,0,6)$. So we need to subtract these from our total. This gives us $\binom 8 2 - 3 = 25$ solutions, and the probability is $\frac{25}{216} \approx 0.116$.

  2. Here, we fix one of the dice, leaving the other two free. In other words, we want the set of all triples of the form $(5,y,z)$, for all $1 \leq y,z \leq 6$. Since our choice of the first die has been fixed, there are exactly $1 \cdot 6 \cdot 6$ triples of this form, so the probability of this event is $\frac{6^2}{6^3} = \frac 1 6$.
  3. This is a conditional probability question. For convenience, let $A$ be the event "the sum of the dice is 9" and let $B$ be the event "the first die is 5". We already know that there are 25 outcomes where the dice sum to 9. What's left is to count the outcomes where the dice add up to 9 and the first die is 5. Fixing the first die to 5, the other two dice must add up to 4, giving us the following outcomes: $(5,1,3),(5,2,2),(5,3,1)$. Then the probability is computed by the following (the enumeration below confirms all three answers): $$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} = \frac{|A \cap B|}{|A|} = \frac{3}{25}.$$
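Since the sample space has only 216 outcomes, we can simply enumerate it; a short Python sketch:

```python
from itertools import product

rolls = list(product(range(1, 7), repeat=3))   # all 6^3 = 216 ordered outcomes
A = [r for r in rolls if sum(r) == 9]          # sum of the dice is 9
B = [r for r in rolls if r[0] == 5]            # first die shows 5
both = [r for r in A if r[0] == 5]             # A and B together

print(len(A), len(A) / len(rolls))   # 25, ≈ 0.116
print(len(B) / len(rolls))           # 1/6
print(len(both) / len(A))            # Pr(B | A) = 3/25 = 0.12
```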

Conditional probability allows us to discuss probabilities for events that are dependent on each other. For instance, it's clear in our previous example that conditioning on the sum of the dice roll will affect the probability of certain other rolls.

This also allows us to consider those events that don't affect each other. That is, we can take two events, condition one of them on the other, and the probability doesn't change.

Events $A$ and $B$ are independent if $\Pr(A \cap B) = \Pr(A) \cdot \Pr(B)$.

Why does this definition make sense? Suppose $A$ and $B$ are independent. Then \[\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\Pr(A) \cdot \Pr(B)}{\Pr(B)} = \Pr(A).\] In other words, conditioning on $B$ did not change the probability of $A$ at all. It's not hard to see that this relation is symmetric.

From the definitions we've seen, rearranging the equation for conditional probability gives $\Pr(A \cap B) = \Pr(A \mid B) \cdot \Pr(B)$. In particular, if $A$ and $B$ are independent, then conditioning on $B$ has no effect, and this identity reduces to exactly the product $\Pr(A) \cdot \Pr(B)$ from the definition above.

It's worth pointing out here the difference between independence and disjointness. The key lies in correctly interpreting the definition of disjointness, which gives $\Pr(A \cap B) = 0$: going back to the definition, this says that $|A \cap B| = 0$. In other words, events $A$ and $B$ cannot and do not ever occur at the same time. On the other hand, independence tells us that $A$ and $B$ can occur at the same time, but they do not have any effect on each other. So in this sense, if $A$ and $B$ are disjoint, they are actually very dependent on each other: if both have positive probability, then $\Pr(A \cap B) = 0 \neq \Pr(A) \cdot \Pr(B)$, so they cannot be independent.
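To make the contrast concrete, here is a small Python sketch over a roll of two dice; the particular events are illustrative choices of ours:

```python
from itertools import product
from fractions import Fraction

rolls = list(product(range(1, 7), repeat=2))   # two distinguishable dice

def pr(event):
    """Probability of an event under the uniform distribution on rolls."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

# Independent: the first die's parity tells us nothing about the second die.
A = lambda r: r[0] % 2 == 0    # first die is even
B = lambda r: r[1] == 3        # second die shows 3
print(pr(lambda r: A(r) and B(r)) == pr(A) * pr(B))   # True

# Disjoint, hence very dependent: a die can't be both even and odd.
C = lambda r: r[0] % 2 == 1    # first die is odd
print(pr(lambda r: A(r) and C(r)), pr(A) * pr(C))     # 0 vs 1/4
```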