CMSC 27100 — Lecture 18

Conditional probability and independence

Example 17.9. Consider a roll of three dice.

  1. What is the probability that the sum of the dice is 9?
  2. What is the probability that the first die shows 5?
  3. What is the probability that the first die shows 5, given that the sum of the dice is 9?

Here, if we want to talk about the "first" die, we need our dice to be distinguishable. This means that our sample space consists of the ordered triples recording the outcome of each of the three dice after a roll. Each of the three dice can take on a value from 1 through 6, giving us $6^3 = 216$ possible outcomes.

  1. What we want here is the number of solutions to $x_1 + x_2 + x_3 = 9$, with $1 \leq x_1, x_2, x_3 \leq 6$. This turns out to be very similar to a problem on this week's problem set, so I'll just say that the number of such outcomes is $\binom 8 2 - 3 = 25$. Then the probability is $\frac{25}{216} \approx 0.116$.
  2. Here, we fix the first die, leaving the other two free. This gives exactly $1 \cdot 6 \cdot 6$ outcomes, so the probability of this event is $\frac{6^2}{6^3} = \frac 1 6$.
  3. This is a conditional probability question. For convenience, let $A$ be the event "the sum of the dice is 9" and let $B$ be the event "the first die is 5". We already know that there are 25 outcomes where the dice sum to 9. What's left is to compute the number of outcomes where the dice add up to 9 and the first die is 5. Fixing the first die to 5, the other two dice must add up to 4, giving us the following outcomes: $(5,1,3),(5,2,2),(5,3,1)$. Then the probability is computed by the following (see the sketch below for a brute-force check): $$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} = \frac{|A \cap B|}{|A|} = \frac{3}{25}.$$
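
Since the sample space is tiny, we can double-check all three answers by brute-force enumeration. Here is a minimal Python sketch; the names `A`, `B`, and `AB` are just chosen to mirror the events above.

```python
from itertools import product
from fractions import Fraction

# All 6^3 = 216 equally likely ordered outcomes of three distinguishable dice.
outcomes = list(product(range(1, 7), repeat=3))

A = [o for o in outcomes if sum(o) == 9]   # event: the sum of the dice is 9
B = [o for o in outcomes if o[0] == 5]     # event: the first die shows 5
AB = [o for o in A if o[0] == 5]           # both events occur

print(Fraction(len(A), len(outcomes)))   # 25/216
print(Fraction(len(B), len(outcomes)))   # 1/6
print(Fraction(len(AB), len(A)))         # 3/25, i.e. Pr(B | A)
```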

Definition 18.1. Events $A$ and $B$ are independent if $\Pr(A \cap B) = \Pr(A) \cdot \Pr(B)$.

From the definitions we've seen, by rearranging the equation for conditional probability to get $\Pr(A \cap B) = \Pr(A \mid B) \cdot \Pr(B)$, we can make the following observation: when $\Pr(B) \gt 0$, the events $A$ and $B$ are independent if and only if $\Pr(A \mid B) = \Pr(A)$. In other words, independence means that learning that $B$ occurred tells us nothing about how likely $A$ is.

Example (Monty Hall) 18.2. The Monty Hall problem is a classic example of conditional probability and independence and it is one that makes people extremely mad. The story is loosely based on the game show Let's Make a Deal, whose host was a guy by the name of Monty Hall.

Suppose there are three doors and there is a prize behind each door. Behind one of the doors is a car and behind the other two are goats. So you pick a door. The host will then choose a door to open that reveals a goat. You are then given the option to either stick with your original choice or to change your choice to the remaining closed door. Assuming you want a car and not a goat, which one is the better option?

We may reason based on our intuition that we had a $\frac 1 3$ chance to win the car to begin with and just because Monty opens a door shouldn't change that fact. However, this reasoning implicitly assumes that what Monty does is independent of your initial choice. Note that the door Monty opens depends on your initial choice. If you chose the door with the car behind it, then Monty can open either of the other two doors. However, if you chose a goat door, Monty only has one door he can open.

So how do we reason about this? We correctly identify that we have a $\frac 1 3$ chance to win the car, which doesn't change if we don't switch doors. However, this means that we have a $\frac 2 3$ chance of initially guessing a goat door. In that case, Monty opens the only goat door available to him and the car must be behind the remaining door. This means that switching gives us a $\frac 2 3$ chance of choosing the car door.
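
It can also be reassuring to see the $\frac 1 3$ versus $\frac 2 3$ split come out of a simulation. Below is a rough Python sketch; when Monty has two goat doors to choose from, the sketch just picks one arbitrarily, which doesn't affect the win probabilities.

```python
import random

def monty_hall(switch, trials=100_000):
    """Play the game `trials` times and return the fraction of games won."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)       # door hiding the car
        choice = random.randrange(3)    # contestant's initial pick
        # Monty opens a goat door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != choice and d != car)
        if switch:
            # Switch to the one remaining closed door.
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == car)
    return wins / trials

print(monty_hall(switch=False))  # roughly 1/3
print(monty_hall(switch=True))   # roughly 2/3
```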

Definition 18.3. We say events $A$ and $B$ are positively correlated if $\Pr(A \cap B) \gt \Pr(A) \cdot \Pr(B)$ and they are negatively correlated if $\Pr(A \cap B) \lt \Pr(A) \cdot \Pr(B)$.

Example 18.4. Consider a roll of a six-sided die and the events

  $A$: the die shows an odd number, and
  $B$: the die shows a prime number.

Are these events independent? If not, how are they related? First, our sample space is $\Omega = \{1,2,3,4,5,6\}$, so our events are going to be $A = \{1,3,5\}$ and $B = \{2,3,5\}$. Then $\Pr(A) = \frac 3 6 = \frac 1 2$ and $\Pr(B) = \frac 3 6 = \frac 1 2$. We want to consider the probability of the intersection of our events. This gives us $$\Pr(A \cap B) = \Pr(\{1,3,5\} \cap \{2,3,5\}) = \Pr(\{3,5\}) = \frac 2 6 = \frac 1 3.$$ But $\Pr(A) \cdot \Pr(B) = \frac 1 2 \cdot \frac 1 2 = \frac 1 4$, which is not $\frac 1 3$, so $A$ and $B$ are not independent. Furthermore, we have $$\Pr(A \cap B) = \frac 1 3 \gt \frac 1 4 = \Pr(A) \cdot \Pr(B),$$ so $A$ and $B$ are positively correlated. Of course, this makes sense intuitively: the only even prime number is 2, so learning that the roll is odd makes it more likely to be prime.
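
As with the dice example, this computation is easy to verify by enumeration; here is a short Python sketch with $A$ and $B$ written out as subsets of $\Omega$.

```python
from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}   # the die shows an odd number
B = {2, 3, 5}   # the die shows a prime number

def pr(E):
    # Uniform distribution on Omega.
    return Fraction(len(E), len(Omega))

print(pr(A & B))       # 1/3
print(pr(A) * pr(B))   # 1/4, so A and B are positively correlated
```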

Definition 18.5. Events $A_1, A_2, \dots, A_k$ are pairwise independent if for all $i \neq j$, $A_i$ and $A_j$ are independent. The events are mutually independent if for every $I \subseteq \{1, \dots, k\}$, $$\Pr \left( \bigcap_{i \in I} A_i \right) = \prod_{i \in I} \Pr(A_i).$$

This definition suggests that there can be events $A_1, A_2, \dots, A_k$ which are pairwise independent, but not mutually independent. Coming up with examples of such events ($A,B,C$ which are pairwise but not mutually independent) makes for a nice exercise.
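
To make the subset condition in Definition 18.5 concrete, here is a small Python sketch of a checker. As an illustration it is applied to the three events "flip $i$ came up heads" for three fair coin flips, which are mutually independent; the same function can be pointed at any candidate example for the exercise.

```python
from itertools import product, combinations
from fractions import Fraction

# Uniform sample space of three fair coin flips, used only as an illustration.
Omega = list(product("HT", repeat=3))

def pr(E):
    return Fraction(len(E), len(Omega))

# A[i] is the event "flip i came up heads".
A = [{w for w in Omega if w[i] == "H"} for i in range(3)]

def mutually_independent(events):
    # Check Pr(intersection over I) = product over I for every subset I
    # of size at least 2 (smaller subsets satisfy the condition trivially).
    for r in range(2, len(events) + 1):
        for subset in combinations(events, r):
            intersection = set(Omega)
            prob_product = Fraction(1)
            for E in subset:
                intersection &= E
                prob_product *= pr(E)
            if pr(intersection) != prob_product:
                return False
    return True

print(mutually_independent(A))  # True
```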

Random variables and Bernoulli trials

Recall at the beginning of the course, when we were first discussing logic, that we could define propositional statements, but at some point we hit a wall in terms of being able to express properties of objects. At that point, we had to introduce predicates into our language to be able to express such notions.

We have a very similar problem at this point, where we have been discussing and defining events explicitly whenever we want to discuss some particular set of outcomes. What would be nice is if there were a way to take an outcome from our sample space and assign it a value based on its properties. Of course, this describes what a function would do, and so, we will introduce a particular kind of function for this purpose.

Definition 18.6. A random variable is a function $f : \Omega \to \mathbb R$ where $\Omega$ is the sample space of a probability space.

Note that the choice of co-domain $\mathbb R$ is more permissive than prescriptive. A lot of the time, we'll be mapping outcomes to a particular subset of $\mathbb R$, like $\mathbb N$ or even $\{0,1\}$. And while the standard definition of a random variable uses $\mathbb R$, it can be extended to more exotic co-domains. In our case, since we're primarily concerned with discrete sample spaces, we are likely to consider random variables that map to discrete spaces.

Now, you may have noticed from our discussion that random variables are not random and are not variables (a random variable is a function, as we can see from the definition above). This naming is due to Francesco Cantelli, from the early 1900s.

Having set up this mechanism, we can define the probability of a particular outcome of a random variable.

Definition 18.7. If $X$ is a random variable, we define $\Pr(X = r) = \Pr(\{\omega \in \Omega \mid X(\omega) = r\})$.

Example 18.8. Suppose we flip a coin three times. Let $X(\omega)$ be the random variable that equals the number of heads that appear when $\omega$ is the outcome. So we have \begin{align*} X(HHH) &= 3, \\ X(HHT) = X(HTH) = X(THH) &= 2, \\ X(TTH) = X(THT) = X(HTT) &= 1, \\ X(TTT) &= 0. \end{align*} Now, suppose that the coin is fair. Since we assume that the coin flips are mutually independent, this gives us a probability of $\frac 1 2 \cdot \frac 1 2 \cdot \frac 1 2 = \frac 1 8$ for any one of the above outcomes. Now, let's consider $\Pr(X = k)$:

$$\begin{array}{c|c|c}
k & \{\omega \in \Omega \mid X(\omega) = k\} & \Pr(X = k) \\ \hline
3 & \{HHH\} & \frac 1 8 \\
2 & \{HHT,HTH,THH\} & \frac 3 8 \\
1 & \{TTH,THT,HTT\} & \frac 3 8 \\
0 & \{TTT\} & \frac 1 8
\end{array}$$
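
We can reproduce this table mechanically by treating $X$ as an ordinary function on the eight outcomes and applying Definition 18.7 directly; here is a short Python sketch.

```python
from itertools import product
from fractions import Fraction

# The eight equally likely outcomes of three fair coin flips.
Omega = list(product("HT", repeat=3))

def X(omega):
    # The random variable X: the number of heads in the outcome.
    return omega.count("H")

for k in range(3, -1, -1):
    event = [w for w in Omega if X(w) == k]
    print(k, Fraction(len(event), len(Omega)))  # 1/8, 3/8, 3/8, 1/8
```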

This looks suspiciously like something we've seen before. Here, we'll introduce the notion of Bernoulli trials.

Definition 18.9. A Bernoulli Trial is a random variable $B$ whose range is restricted to $\{0,1\}$, where the event $\{\omega \in \Omega \mid B(\omega) = 1\}$ is the success event, and $\{\omega \in \Omega \mid B(\omega) = 0\}$ is the failure event.

Bernoulli trials are named after Jacob Bernoulli (also sometimes referred to as James Bernoulli), who worked in the late 1600s. There are lots of other mathematicians in the Bernoulli family, but the lion's share of mathematical results named after Bernoulli are really named after this particular Bernoulli.

Typically, we consider a sequence of $n$ mutually independent repetitions of a Bernoulli trial. How should we interpret this in our probability framework? In other words, what do the sample space and the probability distribution look like?

If we have a probability space $(\Omega,\Pr)$ and we want to conduct $n$ independent trials, then what we really want is to take the product of a bunch of these outcomes. So we define the space $(\Omega^n, \Pr_n)$, where $\Omega^n$ is the set of $n$-tuples $(a_1, a_2, \dots, a_n)$ with $a_i \in \Omega$ and $$\Pr_n((a_1, a_2, \dots, a_n)) = \Pr(a_1) \Pr(a_2) \cdots \Pr(a_n).$$ Here, each position $i$ corresponds to one of our trials and we can define the corresponding Bernoulli trial by $X_i((a_1, \dots, a_n)) = X(a_i)$. Since we're taking the product, none of the trials in our tuple depend on each other, so they're independent. We can use this idea to prove the following.
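
Here is a minimal Python sketch of this product construction, assuming (purely for illustration) a biased coin as the base space; it builds $\Pr_3$ as a product of the coordinate probabilities and checks that two of the coordinate trials are independent.

```python
from itertools import product
from fractions import Fraction

# Base probability space: a biased coin, chosen just for illustration.
Omega = {"H": Fraction(2, 3), "T": Fraction(1, 3)}
n = 3

# The product space Omega^n with Pr_n((a_1, ..., a_n)) = Pr(a_1) * ... * Pr(a_n).
Pr_n = {}
for tup in product(Omega, repeat=n):
    p = Fraction(1)
    for a in tup:
        p *= Omega[a]
    Pr_n[tup] = p

def success_event(i):
    # The event that the i-th trial succeeds (coordinate i is heads).
    return {tup for tup in Pr_n if tup[i] == "H"}

def prob(E):
    return sum(Pr_n[tup] for tup in E)

A, B = success_event(0), success_event(1)
print(prob(A), prob(B), prob(A & B))     # 2/3 2/3 4/9
assert prob(A & B) == prob(A) * prob(B)  # the two trials are independent
```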

Theorem 18.10. The probability of $k$ successes in $n$ independent Bernoulli trials with probability of success $p$ and probability of failure $q = 1-p$ is $$\binom n k p^k q^{n-k} = \binom n k p^k (1-p)^{n-k}.$$

Proof. For each $n$-tuple $A = (a_1, \dots, a_n)$ of trials with Bernoulli random variable $X$, we can construct a string $w_A = b_1 b_2 \cdots b_n$ over the alphabet $\{S,F\}$ by $$b_i = \begin{cases} S & \text{if $X(a_i) = 1$,} \\ F & \text{otherwise.} \end{cases}$$ Then the number of $n$-tuples with $k$ successes is the same as the number of strings over $\{S,F\}$ of length $n$ with exactly $k$ $S$'s. Recall that there are exactly $\binom n k$ of these.

Then for each $n$-tuple with $k$ successes, we have $\Pr_n(A) = p^k (1-p)^{n-k}$. Putting this together, the probability of $k$ successes is $$\binom n k p^k (1-p)^{n-k}.$$ $\tag*{$\Box$}$

This is called the binomial distribution, for the reason that this is what you'd get if you plugged $x = p$ and $y = 1-p$ into the Binomial Theorem; in particular, $\sum_{k=0}^n \binom n k p^k (1-p)^{n-k} = (p + (1-p))^n = 1$, so these probabilities sum to 1. It is from this that we get the definition $$b(k;n,p) = \binom n k p^k (1-p)^{n-k}.$$ We can then reinterpret our results from the coin-flipping example as a sequence of Bernoulli trials. Since the probability for a fair coin flip to land on heads is $p = \frac 1 2$, we have

$$\begin{array}{c|c}
k & b(k;n,p) \\ \hline
3 & \binom 3 3 \left( \frac 1 2 \right)^3 \left( \frac 1 2 \right)^0 = \frac 1 8 \\
2 & \binom 3 2 \left( \frac 1 2 \right)^2 \left( \frac 1 2 \right)^1 = \frac 3 8 \\
1 & \binom 3 1 \left( \frac 1 2 \right)^1 \left( \frac 1 2 \right)^2 = \frac 3 8 \\
0 & \binom 3 0 \left( \frac 1 2 \right)^0 \left( \frac 1 2 \right)^3 = \frac 1 8
\end{array}$$
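
The same numbers can be produced directly from the formula for $b(k;n,p)$; here is a small Python sketch, which also confirms that the probabilities sum to $1$ (nothing here is specific to $n = 3$ or $p = \frac 1 2$).

```python
from fractions import Fraction
from math import comb

def b(k, n, p):
    # The binomial distribution b(k; n, p) from Theorem 18.10.
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = Fraction(1, 2)
for k in range(3, -1, -1):
    print(k, b(k, 3, p))                  # 1/8, 3/8, 3/8, 1/8

print(sum(b(k, 3, p) for k in range(4)))  # 1, matching the Binomial Theorem
```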