MPCS 50103 — Lecture 11

Bayes' Theorem

A common scenario is that we have computed the probabilities of some events based on the data we have, and then we receive new information. We would like to incorporate that new information, so that our computed probabilities reflect both what we already knew and what we just learned. This is where the result of Thomas Bayes, Presbyterian minister and statistician, comes into play.

If $A$ and $B$ are events with $\Pr(A), \Pr(B) \gt 0$, then \[\Pr(A \mid B) = \frac{\Pr(B\mid A) \cdot \Pr(A)}{\Pr(B)}.\]

By definition, we have \[\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}.\] Rearranging this equation, we get $\Pr(A \cap B) = \Pr(A \mid B) \Pr(B)$. Doing the same with $\Pr(B \mid A)$ gives $\Pr(A \cap B) = \Pr(B \mid A) \Pr(A)$. This gives us the equality $\Pr(A \mid B) \Pr(B) = \Pr(B \mid A) \Pr(A)$, which we can rearrange to get \[\Pr(A \mid B) = \frac{\Pr(B \mid A) \Pr(A)}{\Pr(B)}.\]


Consider again our biased die with $\Pr(1) = \frac 3 4$ and $\Pr(\omega) = \frac1{20}$ for $\omega = 2,3,4,5,6$. Suppose you go to a dice thrift store to pick up some second-hand dice. Unfortunately, about 5% of the dice available are biased in the way we described above, while the remaining 95% are regular old fair dice. Because of this, the store will let you test-roll some dice. If you pick a die at random and roll two 1s, what is the probability that you picked up a biased die?

What we essentially want to do is compute the probability that, given that we rolled two 1s, we picked up a biased die. If we let $B$ denote the event that we picked up a biased die and we denote by $R_{1,1}$ the event that we rolled two 1s, then we want to compute $\Pr(B \mid R_{1,1})$. Bayes' Theorem tells us that $$\Pr(B \mid R_{1,1}) = \frac{\Pr(R_{1,1} \mid B) \cdot \Pr(B)}{\Pr(R_{1,1})}.$$ We know $\Pr(B) = 5\% = \frac 1 {20}$ and we know that $\Pr(R_{1,1} \mid B) = \left( \frac 3 4 \right)^2 = \frac{9}{16}$. All we need to do is to calculate $\Pr(R_{1,1})$.

Recall that the Law of Total Probability tells us that $$\Pr(R_{1,1}) = \Pr(R_{1,1} \mid B) \cdot \Pr(B) + \Pr(R_{1,1} \mid \overline B) \cdot \Pr(\overline B)$$ where $\overline B$ is the complementary event that we picked up a fair die. For a fair die, $\Pr(R_{1,1} \mid \overline B) = \left( \frac 1 6 \right)^2 = \frac 1{36}$. This gives us \begin{align*} \Pr(R_{1,1}) &= \Pr(R_{1,1} \mid B) \cdot \Pr(B) + \Pr(R_{1,1} \mid \overline B) \cdot \Pr(\overline B) \\ &= \frac{9}{16} \cdot \frac 1{20} + \frac 1{36} \cdot \frac{19}{20} \\ &= \frac{157}{2880}. \end{align*} Plugging these into the formula for Bayes' Theorem gives us $$\Pr(B \mid R_{1,1}) = \frac{\frac{9}{16} \cdot \frac 1{20}}{\frac{157}{2880}} = \frac{\frac{9}{320}}{\frac{157}{2880}} = \frac{81}{157} \approx 0.516.$$
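As a sanity check, here is a minimal Python sketch of this computation using exact fractions; the variable names are ours, not part of the lecture.

```python
from fractions import Fraction

# Prior probability of grabbing a biased die, and the likelihoods of rolling
# two 1s with each kind of die (the two rolls are independent, so multiply).
pr_biased = Fraction(1, 20)
pr_two_ones_given_biased = Fraction(3, 4) ** 2   # 9/16
pr_two_ones_given_fair = Fraction(1, 6) ** 2     # 1/36

# Law of Total Probability for the denominator of Bayes' Theorem.
pr_two_ones = (pr_two_ones_given_biased * pr_biased
               + pr_two_ones_given_fair * (1 - pr_biased))

# Bayes' Theorem.
posterior = pr_two_ones_given_biased * pr_biased / pr_two_ones

print(pr_two_ones)        # 157/2880
print(posterior)          # 81/157
print(float(posterior))   # approximately 0.516
```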

Independence

Events $A$ and $B$ are independent if $\Pr(A \cap B) = \Pr(A) \cdot \Pr(B)$.

From the definitions we've seen, by rearranging the equation for conditional probability to get $\Pr(A \cap B) = \Pr(A \mid B) \cdot \Pr(B)$, we can make the following observation: when $\Pr(B) \gt 0$, the events $A$ and $B$ are independent exactly when $\Pr(A \mid B) = \Pr(A)$. In other words, learning that $B$ occurred does not change the probability of $A$.

We say events $A$ and $B$ are positively correlated if $\Pr(A \cap B) \gt \Pr(A) \cdot \Pr(B)$ and they are negatively correlated if $\Pr(A \cap B) \lt \Pr(A) \cdot \Pr(B)$.

Consider a roll of a six-sided die and the events $A$, that the roll is odd, and $B$, that the roll is prime.

Are these events independent? If not, how are they related? First, our sample space is $\Omega = \{1,2,3,4,5,6\}$, so our events are going to be $A = \{1,3,5\}$ and $B = \{2,3,5\}$. Then $\Pr(A) = \frac 3 6 = \frac 1 2$ and $\Pr(B) = \frac 3 6 = \frac 1 2$. We want to consider the probability of the intersection of our events. This gives us $$\Pr(A \cap B) = \Pr(\{1,3,5\} \cap \{2,3,5\}) = \Pr(\{3,5\}) = \frac 2 6 = \frac 1 3.$$ But $\Pr(A) \cdot \Pr(B) = \frac 1 2 \cdot \frac 1 2 = \frac 1 4$, which is not $\frac 1 3$, so $A$ and $B$ are not independent. Furthermore, we have $$\Pr(A \cap B) = \frac 1 3 \gt \frac 1 4 = \Pr(A) \cdot \Pr(B),$$ so $A$ and $B$ are positively correlated. Of course, this makes sense intuitively: the only even prime is 2, so learning that the roll is prime makes it more likely that the roll is odd.
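If you want to check this by brute force, here is a small Python sketch, assuming the uniform distribution on the six faces; the names pr, A, and B are ours.

```python
from fractions import Fraction

# Uniform probability space for one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def pr(event):
    return Fraction(len(event & omega), len(omega))

A = {1, 3, 5}   # the roll is odd
B = {2, 3, 5}   # the roll is prime

lhs = pr(A & B)       # Pr(A ∩ B) = 1/3
rhs = pr(A) * pr(B)   # Pr(A) · Pr(B) = 1/4

print(lhs, rhs)
print("independent" if lhs == rhs
      else "positively correlated" if lhs > rhs
      else "negatively correlated")
```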

Events $A_1, A_2, \dots, A_k$ are pairwise independent if for all $i \neq j$, $A_i$ and $A_j$ are independent. The events are mutually independent if for every $I \subseteq \{1, \dots, k\}$, $$\Pr \left( \bigcap_{i \in I} A_i \right) = \prod_{i \in I} \Pr(A_i).$$

This definition suggests that there can be events $A_1, A_2, \dots, A_k$ which are pairwise independent, but not mutually independent. Coming up with an example of three events $A, B, C$ which are pairwise but not mutually independent makes for a nice exercise.
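If you want to test a candidate example, a brute-force checker over a finite sample space might look like the following sketch. It assumes a uniform distribution on the sample space, and the function names are ours rather than anything standard.

```python
from fractions import Fraction
from itertools import combinations
from functools import reduce

def _pr(omega, event):
    # Probability of an event under the uniform distribution on a finite sample space.
    return Fraction(len(event & omega), len(omega))

def _product_rule_holds(omega, events):
    # Check Pr(intersection of the events) == product of the individual probabilities.
    intersection = reduce(lambda x, y: x & y, events)
    product = reduce(lambda x, y: x * y, (_pr(omega, e) for e in events))
    return _pr(omega, intersection) == product

def pairwise_independent(omega, events):
    return all(_product_rule_holds(omega, pair) for pair in combinations(events, 2))

def mutually_independent(omega, events):
    # Every subset of size at least 2 must satisfy the product rule.
    return all(_product_rule_holds(omega, sub)
               for r in range(2, len(events) + 1)
               for sub in combinations(events, r))

# Example usage with the die events from above:
# pairwise_independent({1, 2, 3, 4, 5, 6}, [{1, 3, 5}, {2, 3, 5}])  # False
```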

Random variables and Bernoulli trials

Recall at the beginning of the course, when we were first discussing logic, that we could define propositional statements, but at some point we hit a wall in terms of being able to express properties of objects. At that point, we had to introduce predicates into our language to be able to express such notions.

We have a very similar problem at this point, where we have been discussing and defining events explicitly whenever we want to discuss some particular set of outcomes. What would be nice is if there were a way to take each outcome in our sample space and assign it a value based on its properties. Of course, this describes what a function would do, and so we will introduce a particular kind of function for this purpose.

A random variable is a function $f : \Omega \to \mathbb R$ where $\Omega$ is the sample space of a probability space.

Note that the choice of co-domain $\mathbb R$ is more permissive than prescriptive. A lot of the time, we'll be mapping outcomes to a particular subset of $\mathbb R$, like $\mathbb N$ or even $\{0,1\}$, and while the standard definition of a random variable uses $\mathbb R$, it can be extended to more exotic co-domains. In our case, since we're primarily concerned with discrete sample spaces, we are likely to consider random variables that map to discrete sets.

Now, you may have noticed from our discussion that random variables are not random and are not variables (a random variable is a function, as we can see from the definition above). This naming is due to Francesco Cantelli, from the early 1900s.

Having set up this mechanism, we can define the probability of a particular outcome of a random variable.

If $X$ is a random variable, we define $\Pr(X = r) = \Pr(\{\omega \in \Omega \mid X(\omega) = r\})$.

Suppose we flip a coin three times. Let $X(\omega)$ be the random variable that equals the number of heads that appear when $\omega$ is the outcome. So we have \begin{align*} X(HHH) &= 3, \\ X(HHT) = X(HTH) = X(THH) &= 2, \\ X(TTH) = X(THT) = X(HTT) &= 1, \\ X(TTT) &= 0. \end{align*} Now, suppose that the coin is fair. Since we assume that the coin flips are mutually independent, each of the above outcomes has probability $\frac 1 2 \cdot \frac 1 2 \cdot \frac 1 2 = \frac 1 8$. Now, let's consider $\Pr(X = k)$:

$$\begin{array}{c|c|c}
k & \{\omega \in \Omega \mid X(\omega) = k\} & \Pr(X = k) \\ \hline
3 & \{HHH\} & \frac 1 8 \\
2 & \{HHT,HTH,THH\} & \frac 3 8 \\
1 & \{TTH,THT,HTT\} & \frac 3 8 \\
0 & \{TTT\} & \frac 1 8
\end{array}$$
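This distribution can also be produced by enumerating all eight outcomes. Here is a minimal Python sketch, assuming a fair coin and writing X for the heads-counting random variable above.

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# All 2^3 equally likely outcomes of three fair coin flips.
outcomes = list(product("HT", repeat=3))

def X(outcome):
    # The random variable: number of heads in the outcome.
    return outcome.count("H")

counts = Counter(X(w) for w in outcomes)
for k in sorted(counts, reverse=True):
    print(k, Fraction(counts[k], len(outcomes)))   # 3: 1/8, 2: 3/8, 1: 3/8, 0: 1/8
```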

This looks suspiciously like something we've seen before. Here, we'll introduce the notion of Bernoulli trials.

A Bernoulli Trial is a random variable $B$ whose range is restricted to $\{0,1\}$, where the event $\{\omega \in \Omega \mid B(\omega) = 1\}$ is the success event, and $\{\omega \in \Omega \mid B(\omega) = 0\}$ is the failure event.

Bernoulli trials are named after Jacob Bernoulli, who is also sometimes referred to as James Bernoulli and who worked in the late 1600s. There are lots of other mathematicians in the Bernoulli family, but the lion's share of mathematical results named after Bernoulli are really named after this particular Bernoulli.

Bernoulli trials are typically repeated $n$ times, with the repetitions mutually independent. How should we interpret this in our probability framework? In other words, what do the sample space and the probability distribution look like?

If we have a probability space $(\Omega,\Pr)$ and we want to conduct $n$ independent trials, then what we really want is to take the product of a bunch of these outcomes. So we define the space $(\Omega^n, \Pr_n)$, where $\Omega^n$ is the set of $n$-tuples $(a_1, a_2, \dots, a_n)$ with $a_i \in \Omega$ and $$\Pr_n((a_1, a_2, \dots, a_n)) = \Pr(a_1) \Pr(a_2) \cdots \Pr(a_n).$$ Here, each position $i$ corresponds to one of our trials and we can define the corresponding Bernoulli trial by $X_i((a_1, \dots, a_n)) = X(a_i)$. Since we're taking the product, none of the trials in our tuple depend on each other, so they're independent. We can use this idea to prove the following.
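As a small illustration of this product construction, here is a Python sketch that reuses the biased die from the Bayes example; the names pr and pr_n are ours.

```python
from fractions import Fraction
from itertools import product

# Single-trial distribution: the biased die from the Bayes example above.
pr = {1: Fraction(3, 4), **{w: Fraction(1, 20) for w in range(2, 7)}}

def pr_n(tup):
    # Pr_n((a_1, ..., a_n)) is the product of the single-trial probabilities.
    result = Fraction(1)
    for a in tup:
        result *= pr[a]
    return result

n = 2
# Sanity checks: the n-tuple probabilities sum to 1, and Pr_2((1, 1)) = 9/16.
print(sum(pr_n(t) for t in product(pr, repeat=n)))   # 1
print(pr_n((1, 1)))                                  # 9/16
```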

The probability of $k$ successes in $n$ independent Bernoulli trials with probability of success $p$ and probability of failure $q = 1-p$ is $$\binom n k p^k q^{n-k} = \binom n k p^k (1-p)^{n-k}.$$

For each $n$-tuple $A = (a_1, \dots, a_n)$ of trials with Bernoulli random variable $X$, we can construct a string $w_A = b_1 b_2 \cdots b_n$ over the alphabet $\{S,F\}$ by $$b_i = \begin{cases} S & \text{if $X(a_i) = 1$,} \\ F & \text{otherwise.} \end{cases}$$ Then the number of $n$-tuples with $k$ successes is the same as the number of strings over $\{S,F\}$ of length $n$ with $k$ $S$'s. Recall that there are exactly $\binom n k$ of these.

Then for each $n$-tuple with $k$ successes, we have $\Pr_n(A) = p^k (1-p)^{n-k}$. Putting this together, the probability of $k$ successes is \[\binom n k p^k (1-p)^{n-k}.\]


This is called the binomial distribution, because it is what you'd get by plugging $x = p$ and $y = 1-p$ into the Binomial Theorem. It is from this that we get the definition $$b(k;n,p) = \binom n k p^k (1-p)^{n-k}.$$ We can then reinterpret our results from the coin-flipping example as Bernoulli trials. Since the probability for a fair coin flip to land on heads is $p = \frac 1 2$, we have

$$\begin{array}{c|c}
k & b(k;n,p) \\ \hline
3 & \binom 3 3 \left( \frac 1 2 \right)^3 \left( \frac 1 2 \right)^0 = \frac 1 8 \\
2 & \binom 3 2 \left( \frac 1 2 \right)^2 \left( \frac 1 2 \right)^1 = \frac 3 8 \\
1 & \binom 3 1 \left( \frac 1 2 \right)^1 \left( \frac 1 2 \right)^2 = \frac 3 8 \\
0 & \binom 3 0 \left( \frac 1 2 \right)^0 \left( \frac 1 2 \right)^3 = \frac 1 8
\end{array}$$
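As a quick check of these entries, here is a minimal Python sketch of $b(k;n,p)$ using exact fractions (math.comb requires Python 3.8 or later).

```python
from fractions import Fraction
from math import comb

def b(k, n, p):
    # b(k; n, p) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = Fraction(1, 2)
for k in range(3, -1, -1):
    print(k, b(k, 3, p))   # 1/8, 3/8, 3/8, 1/8, matching the table above
```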