MPCS 50103 — Lecture 12

Expectation

We have moved from talking about the probabilities of events to thinking about properties of outcomes, in the form of random variables. If we know the relevant probabilities, we can ask which values of a random variable we should expect to see more often than others.

Let $(\Omega,\Pr)$ be a probability space and let $X$ be a random variable. The expected value of $X$ is \[E(X) = \sum_{\omega \in \Omega} X(\omega) \Pr(\omega).\]

Roughly speaking, the expectation of a random variable is the (weighted) average value for the random variable.

Consider a fair 6-sided die and let $X$ be the random variable for the number that was rolled. We have \[E(X) = 1 \cdot \frac 1 6 + 2 \cdot \frac 1 6 + 3 \cdot \frac 1 6 + 4 \cdot \frac 1 6 + 5 \cdot \frac 1 6 + 6 \cdot \frac 1 6 = \frac{21}{6} = 3.5.\] Now, consider a biased 6-sided die with \[\Pr(\omega) = \begin{cases} \frac 3 4 & \text{if $\omega = 1$,} \\ \frac{1}{20} & \text{otherwise}. \end{cases}\] Let $Y$ be the random variable for the number that was rolled with our biased die. Then we get \[E(Y) = 1 \cdot \frac 3 4 + 2 \cdot \frac{1}{20} + 3 \cdot \frac{1}{20} + 4 \cdot \frac{1}{20} + 5 \cdot \frac{1}{20} + 6 \cdot \frac{1}{20} = \frac 3 4 + \frac{20}{20} = 1.75.\] This indicates to us that the biased die will give us a 1 more often and therefore the average value that we should expect from a roll is much closer to 1.
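
To make the arithmetic concrete, here is a short Python sketch (our own illustration, not part of the lecture itself) that computes both expectations directly from the definition; the names fair and biased are just one way to encode the two distributions above.

```python
# Compute E(X) and E(Y) directly from the definition
# E(X) = sum over omega of X(omega) * Pr(omega),
# using exact rational arithmetic to avoid floating-point noise.
from fractions import Fraction

fair = {w: Fraction(1, 6) for w in range(1, 7)}
biased = {w: (Fraction(3, 4) if w == 1 else Fraction(1, 20)) for w in range(1, 7)}

def expectation(dist):
    """E(X) where X(omega) = omega and dist maps outcomes to probabilities."""
    return sum(w * p for w, p in dist.items())

print(expectation(fair))    # 7/2, i.e., 3.5
print(expectation(biased))  # 7/4, i.e., 1.75
```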

Recall that the motivation for defining random variables was so we could get away from considering the probabilities of elementary events. The definition of expectation is not very helpful in this regard. Luckily, it is not too difficult to reformulate expectation in terms of the values that the random variable takes on.

Let $\Pr(X = r) = \Pr(\{\omega \mid X(\omega) = r\})$ and let $\{r_1, \dots, r_k\} = \operatorname{range}(X)$. Then \[E(X) = \sum_{i=1}^k r_i \cdot \Pr(X = r_i).\]

Recall that the range of a function $X$ is all of the possible values that $X(\omega)$ can take over every $\omega \in \Omega$. Then we have \begin{align*} \sum_{i=1}^k r_i \cdot \Pr(X = r_i) &= \sum_{i=1}^k r_i \cdot \Pr(\{\omega \in \Omega \mid X(\omega) = r_i\}) \\ &= \sum_{i=1}^k r_i \cdot \sum_{\omega \in \Omega, X(\omega) = r_i} \Pr(\omega) \\ &= \sum_{i=1}^k \sum_{\omega \in \Omega, X(\omega) = r_i} X(\omega) \Pr(\omega) \\ &= \sum_{\omega \in \Omega} X(\omega) \Pr(\omega) \\ &= E(X) \end{align*}
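
If you want to see the two formulations side by side, here is a small Python sketch (again our own illustration) that computes the expectation of the number of heads in two fair coin tosses both ways: summing over the sample space and summing over the range of $X$.

```python
# Compare the two formulas for E(X): summing over elementary outcomes
# versus grouping outcomes by the value r that X takes on.
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))      # sample space of two coin tosses
pr = {w: Fraction(1, 4) for w in omega}    # uniform probabilities
X = {w: w.count("H") for w in omega}       # X(omega) = number of heads

# Sum over elementary outcomes.
by_outcome = sum(X[w] * pr[w] for w in omega)

# Sum over the values r in range(X), weighting by Pr(X = r).
by_value = sum(r * sum(pr[w] for w in omega if X[w] == r)
               for r in set(X.values()))

print(by_outcome, by_value)  # both are 1
```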

Sometimes, we will need to deal with more than one random variable. For instance, suppose we want to consider the expected value of a roll of three dice. We would like to be able to use the expected value for a single die roll, which we already worked out above. The following result gives us a way to do this.

Let $(\Omega, \Pr)$ be a probability space, $c_1, c_2, \dots, c_n$ be real numbers, and $X_1, X_2, \dots, X_n$ be random variables over $\Omega$. Then \[E\left( \sum_{i=1}^n c_i X_i \right) = \sum_{i=1}^n c_i E(X_i).\]

Suppose we have three of our biased dice from the previous example, each with $\Pr(1) = \frac 3 4$ and probability $\frac{1}{20}$ for the other outcomes. What is the expected value of a roll of these three dice? Without the linearity property, we would need to consider the expected value for every 3-tuple of rolls $(\omega_1, \omega_2, \omega_3)$. However, we can apply the previous theorem to get a more direct answer. Let $X_1, X_2, X_3$ be random variables for the roll of each die. Then we have \[E(X_1 + X_2 + X_3) = E(X_1) + E(X_2) + E(X_3) = 1.75 + 1.75 + 1.75 = 5.25.\]
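
To convince ourselves, here is a brute-force Python check (our own sketch) that enumerates all $6^3$ ordered rolls of the three biased dice. Note that independence is only used here to write down the joint probabilities; linearity itself never needs it.

```python
# Brute-force the expectation of the sum of three biased dice and compare
# it with 3 * E(Y) = 21/4 from linearity of expectation.
from fractions import Fraction
from itertools import product

pr = {w: (Fraction(3, 4) if w == 1 else Fraction(1, 20)) for w in range(1, 7)}

total = sum((a + b + c) * pr[a] * pr[b] * pr[c]
            for a, b, c in product(range(1, 7), repeat=3))
print(total)  # 21/4, i.e., 5.25
```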

Something you may have noticed from our previous discussions is that in a roll of three dice, the roll of each die is independent of the others. In fact, we can define independence for random variables in a similar way.

We say the random variables $X$ and $Y$ are independent if for all $r,s \in \mathbb R$, $\Pr(X = r \wedge Y = s) = \Pr(X = r) \cdot \Pr(Y = s)$.

It is important to note that although we have a very similar notion of independence for random variables, none of our results so far have involved independence. That is, the linearity of expectation does not depend on the independence of the random variables involved.

Consider a random permutation $\pi$ of $\{1, 2, \dots, n\}$. A pair $(i,j)$ is an inversion in $\pi$ if $i \lt j$ and $j$ precedes, or appears before, $i$ in $\pi$. What is the expected number of inversions in $\pi$?

Let $I_{i,j}$ be the random variable with $I_{i,j} = 1$ if $i \lt j$ and $(i,j)$ is an inversion and $I_{i,j} = 0$ otherwise. Let $X$ be the random variable for the number of inversions in a permutation. Then \[X = \sum_{1 \leq i \lt j \leq n} I_{i,j},\] and by linearity of expectation, \[E(X) = \sum_{1 \leq i \lt j \leq n} E(I_{i,j}).\] So, what is $E(I_{i,j})$? For a fixed pair $(i,j)$ with $i \lt j$, half of all permutations have $i$ appearing before $j$ and the other half have $j$ appearing before $i$. This gives us $E(I_{i,j}) = \Pr(I_{i,j} = 1) = \frac 1 2$. Then, we count the number of pairs $(i,j)$ with $i \lt j$ to get the number of terms in our summation. There are $\binom n 2$ such pairs, since we choose any two distinct numbers and there is only one order we care about. This gives us \[E(X) = \sum_{1 \leq i \lt j \leq n} E(I_{i,j}) = \binom n 2 \frac 1 2 = \frac{n(n-1)}2 \cdot \frac 1 2 = \frac{n(n-1)}4.\]
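
For small $n$ we can check this exhaustively; the following Python sketch (just an illustration) averages the number of inversions over all $n!$ permutations and compares it with $\frac{n(n-1)}4$.

```python
# Average the number of inversions over all permutations of {1, ..., n}
# and compare with the formula n(n-1)/4.
from fractions import Fraction
from itertools import permutations

def inversions(perm):
    """Count pairs of positions (a, b) with a < b but perm[a] > perm[b]."""
    return sum(1 for a in range(len(perm)) for b in range(a + 1, len(perm))
               if perm[a] > perm[b])

for n in range(2, 7):
    perms = list(permutations(range(1, n + 1)))
    avg = Fraction(sum(inversions(p) for p in perms), len(perms))
    print(n, avg, Fraction(n * (n - 1), 4))  # the last two columns agree
```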

This leads us to a nice average-case analysis of bubble sort. As you may recall, bubble sort repeatedly passes through an array, checking whether each pair of adjacent elements is in the right order and swapping them if they're not. The worst case is an array that is completely in reverse order, which forces us to make $n$ passes through the array with $O(n)$ swaps per pass, giving us a running time of $O(n^2)$.

Of course, a common criticism of worst-case analysis is that it's possible that typical cases aren't that bad. So for this case, we can ask ourselves, if we have a random array, how many swaps would we expect to make? This happens to be asking how many pairs of numbers we should expect to be out of order in a random permutation, which we've just figured out is $\frac{n(n-1)}4$. And so it turns out, in the average case, we're still looking at a running time of $O(n^2)$.
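
Here is a small experiment (our own sketch, with an arbitrarily chosen array size and trial count) tying this back to bubble sort: the number of swaps bubble sort performs on an array equals the number of inversions in it, so the average swap count over random arrays should be close to $\frac{n(n-1)}4$.

```python
# Estimate the average number of swaps bubble sort makes on a random
# permutation and compare it with n(n-1)/4.
import random

def bubble_sort_swaps(a):
    """Sort a copy of a with bubble sort and return how many swaps were made."""
    a = list(a)
    swaps = 0
    for i in range(len(a)):
        for j in range(len(a) - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                swaps += 1
    return swaps

n, trials = 100, 500
avg_swaps = sum(bubble_sort_swaps(random.sample(range(n), n))
                for _ in range(trials)) / trials
print(avg_swaps, n * (n - 1) / 4)  # the two numbers should be close
```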

Variance

A question we can ask about expectations and random variables is: what is the probability that a random variable takes on a value that is far from the expected value? Intuitively, this shouldn't happen very often. The following result, known as Markov's inequality, gives us a rough bound on the probability that a nonnegative random variable takes on a value of at least some constant $a$.

If $X$ is nonnegative, then $\forall a \gt 0$, \[\Pr(X \geq a) \leq \frac{E(X)}{a}.\]

\begin{align*} E(X) &= \sum_{\omega \in \Omega} X(\omega) \Pr(\omega) \\ &\geq \sum_{\omega \in \Omega, X(\omega) \geq a} X(\omega) \Pr(\omega) \\ &\geq \sum_{\omega \in \Omega, X(\omega) \geq a} a \cdot \Pr(\omega) \\ &= a \cdot \sum_{\omega \in \Omega, X(\omega) \geq a} \Pr(\omega) \\ &= a \cdot \Pr(X \geq a). \end{align*}

This gives us $\Pr(X \geq a) \leq \frac{E(X)}{a}$.

Markov's inequality is named for Andrey Markov, a Russian mathematician from the turn of the 20th century who is the same Markov responsible for things like Markov chains. Interestingly enough, this inequality appeared earlier in work by Pafnuty Chebyshev, who was Markov's doctoral advisor.

Let $X$ be the random variable for the number of heads when a fair coin is tossed $n$ times. Recall that this is just the number of successful outcomes for $n$ independent Bernoulli trials with $p = \frac 1 2$. One can calculate that $E(X) = \frac n 2$.

Let's consider an estimate by Markov's inequality. If we set $a = 60$ and $n = 100$, we get an upper bound on the probability that we get at least 60 heads in 100 coin flips. This gives us \[\Pr(X \geq 60) \leq \frac{\frac{100}2}{60} = \frac 5 6.\] So Markov's inequality tells us the probability that we get at least 60 heads is at most about 83%, which seems rather high. So you can guess that this inequality, though simple, does not give a particularly tight estimate.
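
To see just how loose the bound is, here is a short Python comparison (our own sketch) of Markov's bound with the exact tail probability, computed by summing binomial probabilities; math.comb requires Python 3.8 or later.

```python
# Compare Markov's bound for Pr(X >= 60) with the exact binomial tail
# probability for 100 tosses of a fair coin.
from math import comb

n, a = 100, 60
markov_bound = (n / 2) / a                                  # E(X) / a = 5/6
exact = sum(comb(n, k) for k in range(a, n + 1)) / 2 ** n   # Pr(X >= 60)

print(markov_bound)  # about 0.833
print(exact)         # much smaller, roughly 0.03
```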

The question of how near or far values of $X$ can be from the expectation leads us to the following notion.

Let $X$ be a random variable on the probability space $(\Omega,\Pr)$. The variance of $X$ is \[V(X) = \sum_{\omega \in \Omega} (X(\omega) - E(X))^2 \cdot \Pr(\omega).\] The standard deviation of $X$ is $\sigma(X) = \sqrt{V(X)}$. Because of this, variance is sometimes denoted by $\sigma^2(X)$.

Again, the definition of variance is in terms of the probabilities of events from $\Omega$, which is not ideal. Luckily, we can reformulate variance purely in terms of expectations.

If $X$ is a random variable, then $V(X) = E(X^2) - E(X)^2$.

\begin{align*} V(X) &= \sum_{\omega \in \Omega} (X(\omega) - E(X))^2 \cdot \Pr(\omega) \\ &= \sum_{\omega \in \Omega} (X(\omega)^2 - 2X(\omega)E(X) + E(X)^2) \cdot \Pr(\omega) \\ &= \sum_{\omega \in \Omega} X(\omega)^2 \cdot \Pr(\omega) - 2 E(X) \sum_{\omega \in \Omega} X(\omega) \cdot \Pr(\omega) + E(X)^2 \sum_{\omega \in \Omega} \Pr(\omega) \\ &= E(X^2) - 2E(X)^2 + E(X)^2 \\ &= E(X^2) - E(X)^2 \end{align*}


In other words, we can write the variance as $V(X) = E((X - E(X))^2)$.

Recall the roll of a fair 6-sided die and that $E(X) = 3.5$. To compute the variance, we need $E(X^2)$. For $i = 1, 2, \dots, 6$, $X^2 = i^2$ with probability $\frac 1 6$. Then \[E(X^2) = \sum_{i=1}^6 \frac{i^2}{6} = \frac{91}6.\] Then we can calculate $V(X)$ by \[V(X) = E(X^2) - E(X)^2 = \frac{91}{6} - \left( \frac 7 2 \right)^2 = \frac{35}{12} \approx 2.92.\] From this, we can also get the standard deviation, $\sqrt{V(X)} \approx 1.71$.
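
Here is a quick sketch reproducing the numbers above with exact arithmetic.

```python
# Reproduce V(X) = E(X^2) - E(X)^2 = 91/6 - 49/4 = 35/12 for a fair die.
from fractions import Fraction
from math import sqrt

p = Fraction(1, 6)
E_X = sum(i * p for i in range(1, 7))        # 7/2
E_X2 = sum(i * i * p for i in range(1, 7))   # 91/6
V_X = E_X2 - E_X ** 2                        # 35/12
print(V_X, float(V_X), sqrt(V_X))            # 35/12, about 2.92, about 1.71
```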

Now, what if we wanted to consider the roll of more than one die? Recall that the results of multiple die rolls are independent, and so their corresponding random variables are independent. We can make use of this property via Bienaymé's formula. Irénée-Jules Bienaymé was a French mathematician and a contemporary, friend, and collaborator of Chebyshev.

If $X$ and $Y$ are independent random variables, then $V(X+Y) = V(X) + V(Y)$.

Recall that the variance of a single die roll is $V(X) = \frac{35}{12}$. Then the variance of the sum of two die rolls is $V(X) + V(X) = \frac{35}{6} \approx 5.833$ and the standard deviation is about $2.42$.
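
We can double-check this by brute force, since the pair of rolls has only 36 equally likely outcomes; the following sketch computes the variance of the sum directly and should agree with Bienaymé's formula.

```python
# Variance of the sum of two fair dice, computed over all 36 outcomes,
# which should match V(X) + V(X) = 35/6.
from fractions import Fraction
from itertools import product

p = Fraction(1, 36)
pairs = list(product(range(1, 7), repeat=2))
E_S = sum((a + b) * p for a, b in pairs)           # 7
E_S2 = sum((a + b) ** 2 * p for a, b in pairs)     # 329/6
print(E_S2 - E_S ** 2)                             # 35/6, about 5.833
```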

Consider a series of $n$ independent Bernoulli trials with probability of success $p$. What is the variance of the number of successes?

If we have a series of trials $(a_1, a_2, \dots, a_n)$, then we define the random variable $X_i$ to be 1 if $a_i$ is a success and 0 otherwise. Then we define $X = X_1 + \cdots + X_n$, so that $X$ counts the number of successes in the $n$ trials. Since the trials are independent, Bienaymé's formula gives us $V(X) = V(X_1) + \cdots + V(X_n)$.

To compute this, we need the variance of a single Bernoulli trial. If $X_i$ is our random variable for the $i$th Bernoulli trial, then $X_i$ takes on a value of either 0 or 1. Therefore, we have $X_i^2 = X_i$ and $E(X_i) = 1 \cdot p + 0 \cdot (1-p) = p$. Then \[V(X_i) = E(X_i^2) - E(X_i)^2 = p - p^2 = p(1-p).\] This gives us $V(X) = np(1-p)$.
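
As a check, here is a Python sketch (with an arbitrarily chosen $n$ and $p$) that computes the variance of the number of successes directly from the binomial probabilities $\Pr(X = k) = \binom n k p^k (1-p)^{n-k}$ and compares it with $np(1-p)$.

```python
# Compute V(X) = E(X^2) - E(X)^2 from the binomial distribution directly
# and compare it with the formula np(1-p).
from fractions import Fraction
from math import comb

n, p = 10, Fraction(1, 3)
pmf = {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}
E_X = sum(k * pmf[k] for k in pmf)
E_X2 = sum(k * k * pmf[k] for k in pmf)
print(E_X2 - E_X ** 2, n * p * (1 - p))  # both are 20/9
```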

The following result, due to Chebyshev, improves on Markov's inequality: it gives us a better bound on how much a random variable differs from its expected value by making use of the variance of the random variable.

Let $X$ be a random variable. For all $a \gt 0$, \[\Pr(|X - E(X)| \geq a) \leq \frac{V(X)}{a^2}.\]

Let $Y = (X - E(X))^2$ be a random variable. Then $E(Y) = V(X)$ by definition. Since $Y$ is nonnegative, we can apply Markov's inequality to get \[\Pr(Y \geq a^2) \leq \frac{E(Y)}{a^2}.\] Since $Y = (X-E(X))^2$, the event $Y \geq a^2$ is exactly the event $(X-E(X))^2 \geq a^2$, which holds if and only if $|X - E(X)| \geq a$. Then, together with $E(Y) = V(X)$, we substitute into the above to get \[\Pr(|X-E(X)| \geq a) \leq \frac{V(X)}{a^2}.\]


Let's revisit the example at the start of this section: what is the probability that we get at least 60 heads in 100 flips of a fair coin? Recall that if $X$ is the random variable for the number of heads when a fair coin is tossed $n$ times, then $E(X) = \frac n 2$. From the preceding example, we also have that $V(X) = \frac n 4$ (since $p = 1-p = \frac 1 2$).

Now, let's apply Chebyshev's inequality. First, we can use $a = \sqrt n$ to get \[\Pr\left( \left|X - \frac n 2 \right| \geq \sqrt n \right) \leq \frac{V(X)}{(\sqrt n)^2} = \frac n 4 \cdot \frac 1 n = \frac 1 4.\]

What this says is that there is at most a $\frac 1 4$ probability that the number of heads we get differs from the expected number ($\frac n 2$) by at least $\sqrt n$. So, if we make 100 tosses, then we have \[\Pr(|X - 50| \geq 10) \leq \frac 1 4.\] In other words, the probability that we get at least 60 heads or at most 40 heads is at most $\frac 1 4$, which is already significantly better than our estimate from Markov's inequality. (Based on this information, we can actually take one more step to improve the estimate even further. How?)
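
Finally, here is a short comparison (our own sketch) of Chebyshev's bound for $n = 100$ with the exact probability of the two-sided event $|X - 50| \geq 10$, again computed by summing binomial probabilities.

```python
# Chebyshev's bound versus the exact probability of |X - 50| >= 10
# for 100 tosses of a fair coin.
from math import comb

n = 100
exact = sum(comb(n, k) for k in range(n + 1) if abs(k - 50) >= 10) / 2 ** n

print(1 / 4)   # Chebyshev's bound
print(exact)   # noticeably smaller, roughly 0.06
```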