MPCS 50103 — Lecture 7

Example 12.4. Recall that a standard deck of playing cards consists of four suits, $\spadesuit, \heartsuit, \clubsuit, \diamondsuit$ with cards numbered from 2 through 10, and a Jack, Queen, King, and Ace. This gives us 52 cards in total ($4 \times 13$). What is the probability that a poker hand contains a full house (i.e. a three of a kind together with a pair)? First, let's consider what our sample space is. Since we're concerned with possible poker hands, this should be all of the 5-card hands from our 52 cards. In this case, the size of our sample space is $$|\Omega| = \binom{52}5 = \frac{52 \cdot 51 \cdot 50 \cdot 49 \cdot 48}{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = 2598960.$$ Now, let $F$ denote the event for full houses. First, we need to choose two denominations. These will not be the same, since otherwise, we'd need five of the same denomination. We also need to distinguish between the denomination for the three of a kind and the denomination for the pair, so this is a 2-permutation and not a combination, which gives us $P(13,2)$. From this selection, we have $\binom 4 3$ ways to choose a three of a kind for one denomination and $\binom 4 2$ ways to choose a pair. Putting this all together, we get $$|F| = P(13,2) \binom 4 3 \binom 4 2 = (13 \cdot 12) \cdot \frac{4 \cdot 3 \cdot 2}{3 \cdot 2 \cdot 1} \cdot \frac{4 \cdot 3}{2 \cdot 1} = 13 \cdot 12 \cdot 4 \cdot 6 = 3744.$$ Then the probability that we get a full house is $$\Pr(F) = \frac{|F|}{|\Omega|} = \frac{3744}{2598960} \approx 0.00144.$$
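The counts in this example are easy to check directly; here is a quick sketch using Python's math.comb:

```python
import math

sample_space = math.comb(52, 5)          # all 5-card hands
# Ordered choice of two denominations: P(13, 2) = 13 * 12,
# then a three of a kind from one and a pair from the other
full_houses = 13 * 12 * math.comb(4, 3) * math.comb(4, 2)

print(sample_space)                # 2598960
print(full_houses)                 # 3744
print(full_houses / sample_space)  # ≈ 0.00144
```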

Example 12.5. Suppose we deal a hand of 13 cards from a standard 52 card deck to four players: North, East, South, and West. What is the probability that North holds all the aces? First, we need to think about what our sample space should be. Since we are concerned with hands, this will be the result of the deal: all of the possible ways to distribute 13 card hands to four people. Combinatorially, this is distributing 52 distinguishable objects into four distinguishable boxes with 13 objects each. Then the size of our sample space is $$|\Omega| = \binom{52}{13,13,13,13}.$$ Now, we want to count the number of such hands where North holds all the aces. In this case, North holds four cards that are fixed already, so we need to count how the other 48 cards are distributed. Let's call this event $N_{4A}$. This gives us $$|N_{4A}| = \binom{48}{9,13,13,13}.$$ Then the probability of our event is just \begin{align*} \Pr(N_{4A}) &= \frac{|N_{4A}|}{|\Omega|} \\ &= \frac{\binom{48}{9,13,13,13}}{\binom{52}{13,13,13,13}} \\ &= \frac{13 \cdot 12 \cdot 11 \cdot 10}{52 \cdot 51 \cdot 50 \cdot 49} \\ &\approx 0.00264 \end{align*}
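We can verify the multinomial computation with a small sketch; the helper function `multinomial` below is our own, not a standard library function:

```python
import math

def multinomial(*ks):
    # Number of ways to split sum(ks) distinct objects into
    # groups of the given sizes: a product of binomial coefficients.
    total, result = sum(ks), 1
    for k in ks:
        result *= math.comb(total, k)
        total -= k
    return result

omega = multinomial(13, 13, 13, 13)           # all deals of four 13-card hands
north_all_aces = multinomial(9, 13, 13, 13)   # North's other 9 cards are free
print(north_all_aces / omega)                 # ≈ 0.00264
```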

So determining the probability of a single event amounts to counting the possible outcomes, or the sizes of the sets for the event and the sample space. And since we are dealing with sets, we can combine our sets in the same way as we've been doing to consider the probability of multiple events together.

Theorem 12.6. $\Pr(A \cup B) + \Pr(A \cap B) = \Pr(A) + \Pr(B)$.

You might notice that this is really just a rearrangement of the Inclusion-Exclusion Principle, with the intersection term moved to the other side of the equality. This leads us to the familiar notion of disjoint events.

Definition 12.7. Events $A$ and $B$ are disjoint if $A \cap B = \emptyset$.

Theorem 12.8 (Boole's Inequality). $\Pr \left( \bigcup_{i=1}^k A_i \right) \leq \sum_{i=1}^k \Pr(A_i)$. Furthermore, equality holds if the $A_i$ are pairwise disjoint (and, more generally, if and only if $\Pr(A_i \cap A_j) = 0$ for all $i \neq j$).

This is just a generalization of Theorem 12.6 and the proof comes straight from applying it. Alternatively, you can prove it by induction. If the events are pairwise disjoint, this is really just the sum rule. As you might have noticed, this result is due to George Boole, the same Boole who gave us Boolean algebra.

Conditional probability

Something that doesn't quite have a direct analogue from our discussions on set theory is the notion of conditional events.

Definition 13.1. If $A$ and $B$ are events with $\Pr(B) \gt 0$, then the conditional probability of $A$ relative to $B$ is $$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}.$$ Note that by this definition, $\Pr(A \cap B) = \Pr(A \mid B) \cdot \Pr(B)$.

One way of looking at this is if we restrict our attention to the event $B$ and ask, what is the probability of $A$ in this space? In this case, we are asking for the intersection of the events $A$ and $B$. Then to "zoom out" back to the sample space $\Omega$, we divide by the probability of $B$.

Example 13.2. Consider a roll of three dice.

  1. What is the probability that the sum of the dice is 9?
  2. What is the probability that the first die shows 5?
  3. What is the probability that the first die shows 5, given that the sum of the dice is 9?

Here, if we want to talk about the "first" die, we need our dice to be distinguishable. This means that our sample space is an arrangement of the outcome of the three dice after a roll. Each of the three dice can take on a value from 1 through 6, giving us $6^3 = 216$ possible outcomes.

  1. What we want here is the number of solutions to $x_1 + x_2 + x_3 = 9$, with $1 \leq x_1, x_2, x_3 \leq 6$. This turns out to be very similar to a problem on this week's problem set, so I'll just say that the number of such outcomes is $\binom 8 2 - 3 = 25$. Then the probability is $\frac{25}{216} \approx 0.116$.
  2. Here, we fix one of the dice, leaving the other two free. This gives exactly $1 \cdot 6 \cdot 6$ outcomes, so the probability of this event is $\frac{6^2}{6^3} = \frac 1 6$.
  3. This is a conditional probability question. For convenience, let $A$ be the event "the sum of the dice is 9" and let $B$ be the event "the first die is 5". We already know that there are 25 outcomes where the dice sum to 9. What's left is to compute the number of outcomes where the dice add up to 9 and the first die is 5. Fixing the first die to 5, we have that the other two dice must add up to 4, giving us the following outcomes: $(5,1,3),(5,2,2),(5,3,1)$. Then the probability is computed by the following: $$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} = \frac{|A \cap B|}{|A|} = \frac{3}{25}.$$
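Since the sample space here has only $216$ outcomes, all three answers can be checked by brute-force enumeration:

```python
from itertools import product

rolls = list(product(range(1, 7), repeat=3))   # 216 distinguishable outcomes
A = [r for r in rolls if sum(r) == 9]          # the sum of the dice is 9
B = [r for r in rolls if r[0] == 5]            # the first die shows 5
both = [r for r in A if r[0] == 5]

print(len(A), len(A) / len(rolls))   # 25 outcomes, probability ≈ 0.116
print(len(B) / len(rolls))           # 1/6
print(len(both) / len(A))            # Pr(B | A) = 3/25
```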

Bayes' Theorem

A common scenario that we may run into is that we have calculated the probabilities of some events based on some data we have. However, we will often receive new information and we would like to incorporate that new information to compute the probability based on the information we already have together with the new information. This is where the result of Thomas Bayes, Presbyterian minister and statistician, comes into play.

Definition 13.3. A partition of $\Omega$ is a family of pairwise disjoint events $H_1, \dots, H_m$ covering $\Omega$. That is, $\bigcup_{i=1}^m H_i = \Omega$ and $H_i \cap H_j = \emptyset$ for all $i \neq j$.

Theorem 13.4 (Law of Total Probability). Given a partition $H_1, \dots, H_m$ of $\Omega$ and an event $A$, $$\Pr(A) = \sum_{i=1}^m \Pr(A \mid H_i) \cdot \Pr(H_i).$$

Proof. Since we have a partition $H_1, \dots, H_m$, we have $A = A \cap \Omega = A \cap \bigcup_{i = 1}^m H_i$. Then, \begin{align*} \Pr(A) &= \Pr\left( A \cap \bigcup_{i=1}^m H_i \right) \\ &= \Pr \left( \bigcup_{i=1}^m (A \cap H_i) \right) \\ &= \sum_{i=1}^m \Pr(A \cap H_i) &\text{since $H_i$ are pairwise disjoint} \\ &= \sum_{i=1}^m \Pr(A \mid H_i) \Pr(H_i) \end{align*} $\tag*{$\Box$}$

The Law of Total Probability generalizes the idea that if we know $\Pr(A \mid B)$ and $\Pr(A \mid \overline B)$, then together with $\Pr(B)$, these determine $\Pr(A)$. The Law of Total Probability will be particularly useful for computing the probabilities of events when applying Bayes' Theorem.

Theorem 13.5 (Bayes' Theorem). If $A$ and $B$ are events with $\Pr(A), \Pr(B) \gt 0$, then $$\Pr(A \mid B) = \frac{\Pr(B\mid A) \cdot \Pr(A)}{\Pr(B)}.$$

Proof. By definition, we have $$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}.$$ Rearranging this equation, we get $\Pr(A \cap B) = \Pr(A \mid B) \Pr(B)$. We can do something similar with $\Pr(B \mid A)$ to get $\Pr(A \cap B) = \Pr(B \mid A) \Pr(A)$. Then this gives us the equality $\Pr(A \mid B) \Pr(B) = \Pr(B \mid A) \Pr(A)$. Then we can rearrange this equation to get $$\Pr(A \mid B) = \frac{\Pr(B \mid A) \Pr(A)}{\Pr(B)}.$$ $\tag*{$\Box$}$

Example 13.6. Consider again our biased die with $\Pr(1) = \frac 3 4$ and $\Pr(\omega) = \frac1{20}$ for $\omega = 2,3,4,5,6$. Suppose you go to a dice thrift store to pick up some second-hand dice. Unfortunately, there is a tendency for about 5% of the dice available to be biased in the way we described above and the remaining 95% of the dice are regular old fair dice. Because of this, the store will let you test-roll some dice. If you pick a die at random and roll two 1s, what is the probability that you picked up a biased die?

What we essentially want to do is compute the probability that, given that we rolled two 1s, we picked up a biased die. If we let $B$ denote the event that we picked up a biased die and we denote by $R_{1,1}$ the event that we rolled two 1s, then we want to compute $\Pr(B \mid R_{1,1})$. Bayes' Theorem tells us that $$\Pr(B \mid R_{1,1}) = \frac{\Pr(R_{1,1} \mid B) \cdot \Pr(B)}{\Pr(R_{1,1})}.$$ We know $\Pr(B) = 5\% = \frac 1 {20}$ and we know that $\Pr(R_{1,1} \mid B) = \left( \frac 3 4 \right)^2 = \frac{9}{16}$. All we need to do is to calculate $\Pr(R_{1,1})$.

Recall that the Law of Total Probability tells us that $$\Pr(R_{1,1}) = \Pr(R_{1,1} \mid B) \cdot \Pr(B) + \Pr(R_{1,1} \mid \overline B) \cdot \Pr(\overline B)$$ where $\overline B$ is the non-biased die event, or in other words, the fair dice. This gives us \begin{align*} \Pr(R_{1,1}) &= \Pr(R_{1,1} \mid B) \cdot \Pr(B) + \Pr(R_{1,1} \mid \overline B) \cdot \Pr(\overline B) \\ &= \frac{9}{16} \cdot \frac 1{20} + \frac 1{36} \cdot \frac{19}{20} \\ &= \frac{157}{2880} \end{align*} Then plugging these into the formula for Bayes' Theorem gives us $$\Pr(B \mid R_{1,1}) = \frac{\frac{9}{320}}{\frac{157}{2880}} = \frac{81}{157} \approx 0.516.$$
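The arithmetic here is a natural fit for exact rational computation; a sketch with Python's fractions module:

```python
from fractions import Fraction

pB = Fraction(1, 20)                  # picked a biased die
pR_given_B = Fraction(3, 4) ** 2      # two 1s on the biased die
pR_given_fair = Fraction(1, 6) ** 2   # two 1s on a fair die

# Law of Total Probability, then Bayes' Theorem
pR = pR_given_B * pB + pR_given_fair * (1 - pB)
pB_given_R = pR_given_B * pB / pR

print(pR)          # 157/2880
print(pB_given_R)  # 81/157, ≈ 0.516
```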

Independence

The Monty Hall problem demonstrates that our intuitive understanding of independence isn't enough.

Definition 14.1. Events $A$ and $B$ are independent if $\Pr(A \cap B) = \Pr(A) \cdot \Pr(B)$.

From the definitions we've seen, by rearranging the equation for conditional probability to get $\Pr(A \cap B) = \Pr(A \mid B) \cdot \Pr(B)$, we can make the following observation: if $A$ and $B$ are independent and $\Pr(B) \gt 0$, then $\Pr(A \mid B) = \Pr(A)$. That is, knowing that $B$ occurred tells us nothing new about the probability of $A$.

Example 14.2 (Monty Hall). The Monty Hall problem is a classic example of conditional probability and independence, and it is one that makes people extremely mad. The story is loosely based on the game show Let's Make a Deal, whose host was a guy by the name of Monty Hall.

Suppose there are three doors and there is a prize behind each door. Behind one of the doors is a car and behind the other two are goats. So you pick a door. The host will then choose a door to open that reveals a goat. You are then given the option to either stick with your original choice or to change your choice to the remaining closed door. Assuming you want a car and not a goat, which one is the better option?

We may reason based on our intuition that we had a $\frac 1 3$ chance to win the car to begin with and just because Monty opens a door shouldn't change that fact. However, this assumes that the choice to switch is independent of your initial choice. Note that the door Monty opens depends on your initial choice. If you chose the door with the car behind it, then Monty can choose any of the two other doors. However, if you chose a goat door, Monty only has one door he can open.

So how do we reason about this? We correctly identify that we have a $\frac 1 3$ chance to win the car, which doesn't change if we don't switch doors. However, this means that we have a $\frac 2 3$ chance of initially guessing a goat door. In this case, Monty opens the only door with a goat available to him and the car must be in the remaining door. This means that switching gives us a $\frac 2 3$ chance to choose the car door.

We will use Bayes' theorem to formally describe the probability of the outcomes for each choice. Let $C$ be the random variable that indicates which door has the car behind it and let $D$ be the random variable that tells us which door Monty Hall opens. We denote each door by 1, 2, or 3. Suppose we choose door $i$.

First, we observe that $\Pr(C = j) = \frac 1 3$ for $j = 1,2,3$. That is, each door is equally likely to have the car behind it.

Next, we need to think about which door Monty is likely to open. Recalling that we chose door $i$, we have $$ \Pr(D = k \mid C = j) = \begin{cases} 1 & \text{if $j \neq i$ and $k \notin \{i,j\}$,} \\ 0 & \text{if $k = i$ or $k = j$,} \\ \frac 1 2 & \text{if $i = j$ and $k \neq i$.} \\ \end{cases} $$ The first case says that if we didn't choose the door with the car, then Monty must open the one remaining door with a goat. The second case says that Monty will not open the door we chose or the door containing the car. The last case occurs when we choose the door with the car, in which case Monty opens one of the other two doors with equal probability.

Putting this together, we can consider two cases: if we chose the correct door ($C = i$) or if we chose the wrong door ($C = j \neq i$). In each case, Monty opens a door $k$. We know from the rules that $k \neq i$, since we chose $i$ and $k$ cannot be the door with the car behind it.

If we chose the correct door, we have $$\Pr(C = i \mid D = k) = \frac{\Pr(D = k \mid C = i) \Pr(C = i)}{\Pr(D = k)}.$$ We must compute $\Pr(D = k)$, but that's not hard because we've exhausted all the possibilities above. We have \begin{align*} \Pr(D = k) &= \sum_{n=1}^3 \Pr(D = k \mid C = n) \Pr(C = n) \\ &= 1 \cdot \frac 1 3 + 0 \cdot \frac 1 3 + \frac 1 2 \cdot \frac 1 3 \\ &= \frac 1 2 \end{align*} Then going back to our equation above, we have $$\Pr(C = i \mid D = k) = \frac{\frac 1 2 \cdot \frac 1 3}{\frac 1 2} = \frac 1 3.$$ That is, if we picked the correct door to begin with, then Monty revealing the door doesn't change our odds of winning and we're still at $\frac 1 3$. Now, if we repeat this analysis but with $C = j \neq i$, we have $$\Pr(C = j \mid D = k) = \frac{1 \cdot \frac 1 3}{\frac 1 2} = \frac 2 3.$$ This says that if we picked door $i$, then the probability that the car is behind the other door after Monty makes the revelation is $\frac 2 3$. Therefore, the best strategy is always to switch.
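We can also confirm the $\frac 2 3$ answer by enumerating the finitely many configurations exactly, rather than simulating; the function name below is our own:

```python
from fractions import Fraction

def switch_win_probability(pick=1):
    # Enumerate car locations (each with probability 1/3) and,
    # for each, the doors Monty is allowed to open.
    win = Fraction(0)
    for car in (1, 2, 3):
        p_car = Fraction(1, 3)
        monty_options = [d for d in (1, 2, 3) if d != pick and d != car]
        for monty in monty_options:
            p_monty = Fraction(1, len(monty_options))
            # Switching means taking the one door that is neither
            # our pick nor the door Monty opened.
            switched = next(d for d in (1, 2, 3) if d not in (pick, monty))
            if switched == car:
                win += p_car * p_monty
    return win

print(switch_win_probability())  # 2/3
```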

Definition 14.3. We say events $A$ and $B$ are positively correlated if $\Pr(A \cap B) \gt \Pr(A) \cdot \Pr(B)$ and they are negatively correlated if $\Pr(A \cap B) \lt \Pr(A) \cdot \Pr(B)$.

Example 14.4. Consider a roll of a six-sided die and the events $A$, that the number rolled is odd, and $B$, that the number rolled is prime.

Are these events independent? If not, how are they related? First, our sample space is $\Omega = \{1,2,3,4,5,6\}$, so our events are going to be $A = \{1,3,5\}$ and $B = \{2,3,5\}$. Then $\Pr(A) = \frac 3 6 = \frac 1 2$ and $\Pr(B) = \frac 3 6 = \frac 1 2$. We want to consider the probability of the intersection of our events. This gives us $$\Pr(A \cap B) = \Pr(\{1,3,5\} \cap \{2,3,5\}) = \Pr(\{3,5\}) = \frac 2 6 = \frac 1 3.$$ But $\Pr(A) \cdot \Pr(B) = \frac 1 2 \cdot \frac 1 2 = \frac 1 4$, which is not $\frac 1 3$, so $A$ and $B$ are not independent. Furthermore, we have $$\Pr(A \cap B) = \frac 1 3 \gt \frac 1 4 = \Pr(A) \cdot \Pr(B),$$ so $A$ and $B$ are positively correlated. Of course, this makes sense intuitively, because we know the only even prime number is 2.

Definition 14.5. Events $A_1, A_2, \dots, A_k$ are pairwise independent if for all $i \neq j$, $A_i$ and $A_j$ are independent. The events are mutually independent if for every $I \subseteq \{1, \dots, k\}$, $$\Pr \left( \bigcap_{i \in I} A_i \right) = \prod_{i \in I} \Pr(A_i).$$

This definition suggests that there can be events $A_1, A_2, \dots, A_k$ which are pairwise independent, but not mutually independent. Coming up with examples of such events ($A,B,C$ which are pairwise but not mutually independent) makes for a nice exercise.

Random variables and Bernoulli trials

Recall at the beginning of the course, when we were first discussing logic, that we could define propositional statements, but at some point we hit a wall in terms of being able to express properties of objects. At that point, we had to introduce predicates into our language to be able to express such notions.

We have a very similar problem at this point, where we have been discussing and defining events explicitly whenever we want to discuss some particular set of outcomes. What would be nice is a way to take outcomes from our sample space and assign them a value based on their properties. Of course, this describes what a function would do, and so we will introduce a particular kind of function for this purpose.

Definition 14.6. A random variable is a function $f : \Omega \to \mathbb R$ where $\Omega$ is the sample space of a probability space.

Note that the choice of co-domain $\mathbb R$ is more permissive than prescriptive. A lot of the time, we'll be mapping objects to a particular subset of $\mathbb R$, like $\mathbb N$ or even $\{0,1\}$. And while the standard definition of a random variable is for $\mathbb R$, this can be extended to more exotic co-domains than just $\mathbb R$. In our case, since we're primarily concerned with discrete sample spaces, we are likely to consider random variables that map to discrete spaces.

Now, you may have noticed from our discussion that random variables are not random and are not variables (a random variable is a function, as we can see from the definition above). This naming is due to Francesco Cantelli, from the early 1900s.

Having set up this mechanism, we can define the probability of a particular outcome of a random variable.

Definition 14.7. If $X$ is a random variable, we define $\Pr(X = r) = \Pr(\{\omega \in \Omega \mid X(\omega) = r\})$.

Example 14.8. Suppose we flip a coin three times. Let $X(\omega)$ be the random variable that equals the number of heads that appear when $\omega$ is the outcome. So we have \begin{align*} X(HHH) &= 3, \\ X(HHT) = X(HTH) = X(THH) &= 2, \\ X(TTH) = X(THT) = X(HTT) &= 1, \\ X(TTT) &= 0. \end{align*} Now, suppose that the coin is fair. Since we assume that each of the coin flips is mutually independent, this gives us a probability of $\frac 1 2 \cdot \frac 1 2 \cdot \frac 1 2 = \frac 1 8$ for any one of the above outcomes. Now, let's consider $\Pr(X = k)$:

$k$ | $\{\omega \in \Omega \mid X(\omega) = k\}$ | $\Pr(X = k)$
3 | $\{HHH\}$ | $\frac 1 8$
2 | $\{HHT,HTH,THH\}$ | $\frac 3 8$
1 | $\{TTH,THT,HTT\}$ | $\frac 3 8$
0 | $\{TTT\}$ | $\frac 1 8$
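The table above can be reproduced by tallying $X$ over all eight equally likely outcomes:

```python
from itertools import product
from collections import Counter

# Count heads in each of the 8 outcomes of three fair coin flips
counts = Counter(flips.count("H") for flips in product("HT", repeat=3))

for k in (3, 2, 1, 0):
    print(k, counts[k], counts[k] / 8)  # k, |{ω : X(ω) = k}|, Pr(X = k)
```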

This looks suspiciously like something we've seen before. Here, we'll introduce the notion of Bernoulli trials.

Definition 14.9. A Bernoulli Trial is a random variable $B$ whose range is restricted to $\{0,1\}$, where the event $\{\omega \in \Omega \mid B(\omega) = 1\}$ is the success event, and $\{\omega \in \Omega \mid B(\omega) = 0\}$ is the failure event.

Bernoulli trials are named after Jacob Bernoulli (sometimes also referred to as James Bernoulli), who worked in the late 1600s. There are lots of other mathematicians in the Bernoulli family, but the lion's share of mathematical results named after Bernoulli are really named after this particular Bernoulli.

Bernoulli trials are a repetition of $n$ mutually independent events. How should we interpret this in our probability framework? In other words, what do the sample space and the probability distribution look like?

If we have a probability space $(\Omega,\Pr)$ and we want to conduct $n$ independent trials, then what we really want is to take the product of a bunch of these outcomes. So we define the space $(\Omega^n, \Pr_n)$, where $\Omega^n$ is the set of $n$-tuples $(a_1, a_2, \dots, a_n)$ with $a_i \in \Omega$ and $$\Pr_n((a_1, a_2, \dots, a_n)) = \Pr(a_1) \Pr(a_2) \cdots \Pr(a_n).$$ Here, each position $i$ corresponds to one of our trials and we can define the corresponding Bernoulli trial by $X_i((a_1, \dots, a_n)) = X(a_i)$. Since we're taking the product, none of the trials in our tuple depend on each other, so they're independent. We can use this idea to prove the following.

Theorem 14.10. The probability of $k$ successes in $n$ independent Bernoulli trials with probability of success $p$ and probability of failure $q = 1-p$ is $$\binom n k p^k q^{n-k} = \binom n k p^k (1-p)^{n-k}.$$

Proof. For each $n$-tuple $A = (a_1, \dots, a_n)$ of trials with Bernoulli random variable $X$, we can construct a string $w_A = b_1 b_2 \cdots b_n$ over the alphabet $\{S,F\}$ by $$b_i = \begin{cases} S & \text{if $X(a_i) = 1$,} \\ F & \text{otherwise.} \end{cases}$$ Then the number of $n$-tuples with $k$ successes is the same as the number of strings of length $n$ over $\{S,F\}$ with exactly $k$ $S$'s. Recall that there are exactly $\binom n k$ of these.

Then for each $n$-tuple with $k$ successes, we have $\Pr_n(A) = p^k (1-p)^{n-k}$. Putting this together, the probability of $k$ successes is $$\binom n k p^k (1-p)^{n-k}.$$ $\tag*{$\Box$}$

This is called the binomial distribution, because it is what you'd get if you plugged $x = p$ and $y = 1-p$ into the Binomial Theorem. It is from this that we get the definition $$b(k;n,p) = \binom n k p^k (1-p)^{n-k}.$$ We can then reinterpret our results from the coin-flipping example as Bernoulli trials. Since the probability for a fair coin flip to land on heads is $p = \frac 1 2$, we have

$k$ | $b(k;n,p)$
3 | $\binom 3 3 \left( \frac 1 2 \right)^3 \left( \frac 1 2 \right)^0 = \frac 1 8$
2 | $\binom 3 2 \left( \frac 1 2 \right)^2 \left( \frac 1 2 \right)^1 = \frac 3 8$
1 | $\binom 3 1 \left( \frac 1 2 \right)^1 \left( \frac 1 2 \right)^2 = \frac 3 8$
0 | $\binom 3 0 \left( \frac 1 2 \right)^0 \left( \frac 1 2 \right)^3 = \frac 1 8$
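The formula $b(k;n,p)$ translates directly into code; a sketch that reproduces the table exactly using rational arithmetic:

```python
from fractions import Fraction
import math

def b(k, n, p):
    # Binomial distribution: probability of k successes in n
    # independent Bernoulli trials with success probability p.
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

half = Fraction(1, 2)
for k in (3, 2, 1, 0):
    print(k, b(k, 3, half))  # 1/8, 3/8, 3/8, 1/8
```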

Expectation

We have moved from talking about the probability of events to thinking about the probability of a property of events, in the form of random variables. Now, if we know these probabilities, we can ask the question of what kinds of values we might expect more often.

Definition 15.1. Let $(\Omega,\Pr)$ be a probability space and let $X$ be a random variable. The expected value of $X$ is $$E(X) = \sum_{\omega \in \Omega} X(\omega) \Pr(\omega).$$

Roughly speaking, the expectation of a random variable is the (weighted) average value for the random variable.

Example 15.2. Consider a fair 6-sided die and let $X$ be the random variable for the number that was rolled. We have $$E(X) = 1 \cdot \frac 1 6 + 2 \cdot \frac 1 6 + 3 \cdot \frac 1 6 + 4 \cdot \frac 1 6 + 5 \cdot \frac 1 6 + 6 \cdot \frac 1 6 = \frac{21}{6} = 3.5.$$ Now, consider a biased 6-sided die with $$\Pr(\omega) = \begin{cases} \frac 3 4 & \text{if $\omega = 1$,} \\ \frac{1}{20} & \text{otherwise}. \end{cases}$$ Let $Y$ be the random variable for the number that was rolled with our biased die. Then we get $$E(Y) = 1 \cdot \frac 3 4 + 2 \cdot \frac{1}{20} + 3 \cdot \frac{1}{20} + 4 \cdot \frac{1}{20} + 5 \cdot \frac{1}{20} + 6 \cdot \frac{1}{20} = \frac 3 4 + \frac{20}{20} = 1.75.$$ This indicates to us that the biased die will give us a 1 more often and therefore the average value that we should expect from a roll is much closer to 1.
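The definition of expectation is a direct sum over the distribution, so both computations above are one-liners; a sketch:

```python
from fractions import Fraction

fair = {w: Fraction(1, 6) for w in range(1, 7)}
biased = {1: Fraction(3, 4), **{w: Fraction(1, 20) for w in range(2, 7)}}

def expectation(dist):
    # E(X) = sum over outcomes of value * probability
    return sum(w * p for w, p in dist.items())

print(expectation(fair))    # 7/2 = 3.5
print(expectation(biased))  # 7/4 = 1.75
```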

Recall that the motivation for defining random variables was so we could get away from considering the probabilities of elementary events. The definition of expectation is not very helpful in this regard. Luckily, it is not too difficult to reformulate expectation in terms of the values that the random variable takes on.

Proposition 15.3. Let $\Pr(X = r) = \Pr(\{\omega \mid X(\omega) = r\})$ and let $\{r_1, \dots, r_k\} = \operatorname{range}(X)$. Then $$E(X) = \sum_{i=1}^k r_i \cdot \Pr(X = r_i).$$

Proof. Recall that the range of a function $X$ is all of the possible values that $X(\omega)$ can take over every $\omega \in \Omega$. Then we have \begin{align*} \sum_{i=1}^k r_i \cdot \Pr(X = r_i) &= \sum_{i=1}^k r_i \cdot \Pr(\{\omega \in \Omega \mid X(\omega) = r_i\}) \\ &= \sum_{i=1}^k r_i \cdot \sum_{\omega \in \Omega, X(\omega) = r_i} \Pr(\omega) \\ &= \sum_{i=1}^k \sum_{\omega \in \Omega, X(\omega) = r_i} X(\omega) \Pr(\omega) \\ &= \sum_{\omega \in \Omega} X(\omega) \Pr(\omega) \\ &= E(X) \end{align*} $\tag*{$\Box$}$

Sometimes, we will need to deal with more than one random variable. For instance, suppose we want to consider the expected value of a roll of three dice. We would like to be able to use the expected value for a single die roll, which we already worked out above. The following result gives us a way to do this.

Theorem 15.4 (Linearity of Expectation). Let $(\Omega, \Pr)$ be a probability space, $c_1, c_2, \dots, c_n$ be real numbers, and $X_1, X_2, \dots, X_n$ be random variables over $\Omega$. Then $$E\left( \sum_{i=1}^n c_i X_i \right) = \sum_{i=1}^n c_i E(X_i).$$

Proof. \begin{align*} E\left( \sum_{i=1}^n c_i X_i \right) &= \sum_{\omega \in \Omega} \left( \sum_{i=1}^n c_i X_i(\omega) \right) \Pr(\omega) \\ &= \sum_{\omega \in \Omega} \sum_{i=1}^n c_i X_i(\omega) \Pr(\omega) \\ &= \sum_{i=1}^n \sum_{\omega \in \Omega} c_i X_i(\omega) \Pr(\omega) \\ &= \sum_{i=1}^n c_i \left(\sum_{\omega \in \Omega} X_i(\omega) \Pr(\omega)\right) \\ &= \sum_{i=1}^n c_i E(X_i) \end{align*} $$\tag*{$\Box$}$$

Example 15.5. Suppose we have three of our biased dice from the previous example, each with $\Pr(1) = \frac 3 4$ and probability $\frac{1}{20}$ for the other outcomes. What is the expected value of a roll of these three dice? Without the linearity property, we would need to consider the expected value for every 3-tuple of rolls $(\omega_1, \omega_2, \omega_3)$. However, we can apply the previous theorem to get a more direct answer. Let $X_1, X_2, X_3$ be random variables for the roll of each die. Then we have $$E(X_1 + X_2 + X_3) = E(X_1) + E(X_2) + E(X_3) = 1.75 + 1.75 + 1.75 = 5.25.$$

Something you may have noticed is that from our previous discussions, in a roll of three dice, the roll of each die is independent from the others. In fact, we can define independence for random variables similarly.

Definition 15.6. We say the random variables $X$ and $Y$ are independent if for all $r,s \in \mathbb R$, $\Pr(X = r \wedge Y = s) = \Pr(X = r) \cdot \Pr(Y = s)$.

It is important to note that although we have a very similar notion of independence for random variables, none of our results so far have involved independence. That is, the linearity of expectation does not depend on the independence of the random variables involved.

Example 15.7. If you go to Japan, something you may experience is it starts raining and you duck inside the nearest convenience store to pick up an umbrella and then come back outside to find that the rain has stopped. You will probably have gotten one of those very cheap and ubiquitous 500 yen clear plastic umbrellas. Now, if you go inside to a restaurant or a shop, you will need to leave your umbrella at the front in a bin full of umbrellas. Since cheap umbrellas are so common, people just end up taking a random umbrella when they leave. Suppose $n$ people go inside a shop and leave their umbrellas at the front, and when they exit the shop, they randomly take an umbrella. What is the expected number of people who take the umbrella they had before entering the shop?

Let $X$ be the random variable that denotes the number of people who take their own umbrella when they leave. We can then define random variables $X_i$ corresponding to each person by $$X_i = \begin{cases} 1 & \text{if person $i$ takes their own umbrella when they leave,} \\ 0 & \text{otherwise.} \end{cases}$$ Then we have $X = X_1 + \cdots + X_n$. Since each person randomly chooses an umbrella, we have $E(X_i) = \frac 1 n$. Then we have $$E(X) = E(X_1 + \cdots + X_n) = E(X_1) + \cdots + E(X_n) = \overbrace{\frac 1 n + \cdots + \frac 1 n}^{\text{$n$ times}} = 1.$$
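For small $n$ we can confirm the answer exactly by averaging over all $n!$ equally likely umbrella assignments (i.e., permutations); the function name below is our own:

```python
from itertools import permutations
from fractions import Fraction

def expected_fixed_points(n):
    # Average number of people who get their own umbrella,
    # over all n! equally likely assignments.
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i in range(n) if pi[i] == i) for pi in perms)
    return Fraction(total, len(perms))

for n in (2, 3, 4, 5):
    print(n, expected_fixed_points(n))  # always 1
```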

Example 15.8. Consider a random permutation $\pi$ of $\{1, 2, \dots, n\}$. A pair $(i,j)$ is an inversion in $\pi$ if $i \lt j$ and $j$ precedes, or appears before, $i$ in $\pi$. What is the expected number of inversions in $\pi$?

Let $I_{i,j}$ be the random variable with $I_{i,j} = 1$ if $i \lt j$ and $(i,j)$ is an inversion and $I_{i,j} = 0$ otherwise. Let $X$ be the random variable for the number of inversions in a permutation. Then $$X = \sum_{1 \leq i \lt j \leq n} I_{i,j}.$$ Then we have $$E(X) = \sum_{1 \leq i \lt j \leq n} E(I_{i,j}).$$ So, what is $E(I_{i,j})$? If we consider all permutations, then in half of them $i$ appears before $j$ and in the other half $j$ appears before $i$. This gives us $E(I_{i,j}) = \frac 1 2$. Then, we count the number of pairs $(i,j)$ with $i \lt j$ to get the number of terms in our summation. There are $\binom n 2$ such pairs, since we pick any pair of numbers and there's only one possible order we care about. This gives us $$E(X) = \sum_{1 \leq i \lt j \leq n} E(I_{i,j}) = \binom n 2 \cdot \frac 1 2 = \frac{n(n-1)}2 \cdot \frac 1 2 = \frac{n(n-1)}4.$$
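As a sanity check, for small $n$ we can average the inversion count over all permutations and compare against $\frac{n(n-1)}4$:

```python
from itertools import permutations
from fractions import Fraction

def average_inversions(n):
    # Average number of out-of-order pairs over all n! permutations
    perms = list(permutations(range(1, n + 1)))
    total = sum(
        sum(1 for a in range(n) for b in range(a + 1, n) if pi[a] > pi[b])
        for pi in perms
    )
    return Fraction(total, len(perms))

for n in (2, 3, 4, 5):
    print(n, average_inversions(n), Fraction(n * (n - 1), 4))
```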

This leads us to a nice average-case analysis of bubble sort. As you may recall, bubble sort goes through an array, checking whether each pair of adjacent elements is in the right order and swapping them if they're not. The worst case is an array in reverse order, which forces us to make $n$ passes through the array with $O(n)$ swaps each, giving us a running time of $O(n^2)$.

Of course, a common criticism of worst-case analysis is that it's possible that typical cases aren't that bad. So for this case, we can ask ourselves, if we have a random array, how many swaps would we expect to make? This happens to be asking how many pairs of numbers we should expect to be out of order in a random permutation, which we've just figured out is $\frac{n(n-1)}4$. And so it turns out, in the average case, we're still looking at a running time of $O(n^2)$.

Another question we can ask about expectations and random variables is: what is the probability that a random variable takes on a value far from its expected value? Intuitively, this shouldn't happen very often. The following result gives us some idea of this probability by providing a bound on the probability that a random variable takes on a value of at least some constant $a$.

Theorem 15.9 (Markov's Inequality). If $X$ is a nonnegative random variable, then for all $a \gt 0$, $$\Pr(X \geq a) \leq \frac{E(X)}{a}.$$

Proof. \begin{align*} E(X) &= \sum_{\omega \in \Omega} X(\omega) \Pr(\omega) \\ &\geq \sum_{\omega \in \Omega, X(\omega) \geq a} X(\omega) \Pr(\omega) \\ &\geq \sum_{\omega \in \Omega, X(\omega) \geq a} a \cdot \Pr(\omega) \\ &= a \cdot \sum_{\omega \in \Omega, X(\omega) \geq a} \Pr(\omega) \\ &= a \cdot \Pr(X \geq a). \end{align*}

This gives us $\Pr(X \geq a) \leq \frac{E(X)}{a}$. $$\tag*{$\Box$}$$
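To see the bound in action, here's a sketch comparing the exact tail probabilities of a binomial random variable (10 fair coin flips, so $E(X) = 5$) against Markov's bound $\frac{E(X)}{a}$:

```python
from fractions import Fraction
import math

n, p = 10, Fraction(1, 2)
pmf = {k: math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
EX = sum(k * pr for k, pr in pmf.items())   # expectation = n * p = 5

for a in (6, 8, 10):
    tail = sum(pr for k, pr in pmf.items() if k >= a)
    # tail probability Pr(X >= a) never exceeds E(X)/a
    print(a, float(tail), float(EX / a))
```

Note that the bound is quite loose here; Markov's inequality uses only nonnegativity and the expectation, nothing about the shape of the distribution.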

Markov's inequality is named for Andrey Markov, a Russian mathematician from the turn of the 20th century who is the same Markov responsible for things like Markov chains. Interestingly enough, this inequality appeared earlier in work by Pafnuty Chebyshev, who was Markov's doctoral advisor. However, Chebyshev has his own other inequality which we'll get to soon.