CMSC 27100 — Lecture 20

Variance

The question of how near or far values of $X$ can be from the expectation leads us to the following notion.

Definition 20.1. Let $X$ be a random variable on the probability space $(\Omega,\Pr)$. The variance of $X$ is $$V(X) = \sum_{\omega \in \Omega} (X(\omega) - E(X))^2 \cdot \Pr(\omega).$$ The standard deviation of $X$ is $\sigma(X) = \sqrt{V(X)}$. Because of this, variance is sometimes denoted by $\sigma^2(X)$.
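To make the definition concrete, here is a small Python sketch that computes $V(X)$ as a sum over outcomes; the probability space and the values of $X$ below are made up purely for illustration.

```python
# A sketch of Definition 20.1 on a small made-up probability space:
# three outcomes with probabilities 0.5, 0.3, 0.2 and X taking values 1, 2, 4.
pr = {"a": 0.5, "b": 0.3, "c": 0.2}     # Pr(omega), sums to 1
X  = {"a": 1.0, "b": 2.0, "c": 4.0}     # X(omega)

EX = sum(X[w] * pr[w] for w in pr)                  # E(X) = 1.9
VX = sum((X[w] - EX) ** 2 * pr[w] for w in pr)      # V(X) per the definition
print(EX, VX)                                       # 1.9 1.29
```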

Again, the definition of variance is in terms of the probability of events from $\Omega$, which is not ideal. Luckily, we can reformulate variance in terms of the expectation of $X$ only.

Theorem 20.2. If $X$ is a random variable, then $V(X) = E(X^2) - E(X)^2$.

Proof. \begin{align*} V(X) &= \sum_{\omega \in \Omega} (X(\omega) - E(X))^2 \cdot \Pr(\omega) \\ &= \sum_{\omega \in \Omega} (X(\omega)^2 - 2X(\omega)E(X) + E(X)^2) \cdot \Pr(\omega) \\ &= \sum_{\omega \in \Omega} X(\omega)^2 \cdot \Pr(\omega) - 2 E(X) \sum_{\omega \in \Omega} X(\omega) \cdot \Pr(\omega) + E(X)^2 \sum_{\omega \in \Omega} \Pr(\omega) \\ &= E(X^2) - 2E(X)^2 + E(X)^2 \\ &= E(X^2) - E(X)^2 \end{align*} $\tag*{$\Box$}$

Corollary 20.3. $V(X) = E((X - E(X))^2)$.

Proof. \begin{align*} E((X - E(X))^2) &= E(X^2 - 2 X E(X) + E(X)^2) \\ &= E(X^2) - E(2XE(X)) + E(E(X)^2) \\ &= E(X^2) - 2E(X)^2 + E(X)^2 \\ &= E(X^2) - E(X)^2 \\ &= V(X) \end{align*} $\tag*{$\Box$}$

Example 20.4. Recall the roll of a fair 6-sided die and that $E(X) = 3.5$. To compute the variance, we need $E(X^2)$. For $i = 1, 2, \dots, 6$, $X^2 = i^2$ with probability $\frac 1 6$. Then $$E(X^2) = \sum_{i=1}^6 \frac{i^2}{6} = \frac{91}6.$$ Then we can calculate $V(X)$ by $$V(X) = E(X^2) - E(X)^2 = \frac{91}{6} - \left( \frac 7 2 \right)^2 = \frac{35}{12} \approx 2.92.$$ From this, we can also get the standard deviation, $\sqrt{V(X)} \approx 1.71$.
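The same computation can be checked with exact rational arithmetic; the following short Python sketch (using the standard fractions module) reproduces $E(X^2) = \frac{91}{6}$ and $V(X) = \frac{35}{12}$.

```python
# Verifying Example 20.4 with exact fractions: E(X^2) = 91/6 and V(X) = 35/12.
from fractions import Fraction

faces = range(1, 7)
EX  = sum(Fraction(i, 6) for i in faces)          # E(X)   = 7/2
EX2 = sum(Fraction(i * i, 6) for i in faces)      # E(X^2) = 91/6
VX  = EX2 - EX ** 2                               # V(X) via Theorem 20.2
print(EX, EX2, VX, float(VX) ** 0.5)              # 7/2 91/6 35/12 1.707...
```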

Now, what if we wanted to consider the roll of more than one die? Recall that the results of multiple die rolls are independent, and so their corresponding random variables are independent. We can make use of this property via Bienaymé's formula. Irénée-Jules Bienaymé was a French mathematician who worked around the same time as Markov and Chebyshev, and was a friend and collaborator of Chebyshev.

Theorem 20.5 (Bienaymé's Formula). If $X$ and $Y$ are independent random variables, then $V(X+Y) = V(X) + V(Y)$.

Proof. We will make use of the claim that if $X$ and $Y$ are independent, then $E(XY) = E(X)E(Y)$ (this is an exercise on this week's problem set). \begin{align*} V(X+Y) &= E((X+Y)^2) - E(X+Y)^2 \\ &= E(X^2 + 2XY + Y^2) - (E(X) + E(Y))^2 \\ &= E(X^2) + 2E(XY) + E(Y^2) - E(X)^2 - 2E(X)E(Y) - E(Y)^2 \\ &= E(X^2) - E(X)^2 + E(Y^2) - E(Y)^2 + 2E(XY) - 2E(X)E(Y) \\ &= E(X^2) - E(X)^2 + E(Y^2) - E(Y)^2 \\ &= V(X) + V(Y) \end{align*} $\tag*{$\Box$}$

Example 20.6. Recall that the variance of a single die roll is $V(X) = \frac{35}{12}$. Then the variance of the sum of two independent die rolls is $\frac{35}{12} + \frac{35}{12} = \frac{35}{6} \approx 5.833$, and the standard deviation is about $2.42$.
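As a check on Bienaymé's formula, we can also compute the variance of the sum of two dice directly over all 36 equally likely outcomes; the sketch below is just an illustration.

```python
# Variance of the sum of two independent fair dice, computed over all
# 36 equally likely outcomes; it should match 35/6 from Bienayme's formula.
from itertools import product

sums = [a + b for a, b in product(range(1, 7), repeat=2)]
E = sum(sums) / 36
V = sum((s - E) ** 2 for s in sums) / 36
print(E, V)   # 7.0 and 5.8333... = 35/6
```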

Example 20.7. Consider a series of $n$ independent Bernoulli trials with probability of success $p$. What is the variance of the number of successes?

If we have a series of trials $(a_1, a_2, \dots, a_n)$, then we define the random variable $X_i$ to be 1 if $a_i$ is a success and 0 otherwise. Then we define $X = X_1 + \cdots + X_n$, so that $X$ counts the number of successes in the $n$ trials. Since the trials are independent, Bienaymé's formula gives $V(X) = V(X_1) + \cdots + V(X_n)$.

To compute this, we need the variance of a single Bernoulli trial. If $X_i$ is our random variable for the $i$th trial, then $X_i$ takes on a value of either 0 or 1. Therefore, we have $X_i^2 = X_i$ and $E(X_i) = 1 \cdot p + 0 \cdot (1-p) = p$. Then $$V(X_i) = E(X_i^2) - E(X_i)^2 = p - p^2 = p \cdot (1-p).$$ This gives us $V(X) = np(1-p)$.
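As a sanity check, the following sketch computes the variance directly from the distribution of the number of successes, $\Pr(X = k) = \binom n k p^k (1-p)^{n-k}$, and compares it to $np(1-p)$; the parameters $n = 10$ and $p = 0.3$ are arbitrary choices for illustration.

```python
# Comparing the variance computed from the binomial distribution with the
# closed form n*p*(1-p); n and p below are arbitrary illustrative values.
from math import comb

n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
EX  = sum(k * pmf[k] for k in range(n + 1))
EX2 = sum(k * k * pmf[k] for k in range(n + 1))
print(EX2 - EX**2, n * p * (1 - p))   # both 2.1 (up to floating-point error)
```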

Now, we will revisit the question of how likely it is that a random variable takes a value that strays far from its expectation. Much like Markov's inequality, the following result gives us a bound on how much a random variable differs from its expected value. Chebyshev's inequality is named for Pafnuty Chebyshev, whom we have already mentioned.

Theorem 20.8 (Chebyshev's Inequality). Let $X$ be a random variable. For all $a \gt 0$, $$\Pr(|X - E(X)| \geq a) \leq \frac{V(X)}{a^2}.$$

Proof. Let $Y = (X - E(X))^2$ be a random variable. Then $E(Y) = V(X)$ by definition. Since $Y$ is non-negative, we can apply Markov's inequality to get $$\Pr(Y \geq a^2) \leq \frac{E(Y)}{a^2}.$$ Since $Y = (X-E(X))^2$, the event $Y \geq a^2$ occurs if and only if $(X-E(X))^2 \geq a^2$, which holds if and only if $|X - E(X)| \geq a$. Then, together with $E(Y) = V(X)$, we substitute into the above to get $$\Pr(|X-E(X)| \geq a) \leq \frac{V(X)}{a^2}.$$ $\tag*{$\Box$}$

Example 20.9. Let $X$ be the random variable for the number of heads when a fair coin is tossed $n$ times. Recall that this is just the number of successful outcomes for $n$ independent Bernoulli trials with $p = \frac 1 2$. One can calculate that $E(X) = \frac n 2$ and from the previous example, we have that $V(X) = \frac n 4$ (since $p = (1-p) = \frac 1 2$). Now, we apply Chebyshev's inequality, using $a = \sqrt n$, to get $$\Pr\left( \left|X - \frac n 2 \right| \geq \sqrt n \right) \leq \frac{V(X)}{(\sqrt n)^2} = \frac n 4 \cdot \frac 1 n = \frac 1 4.$$

What this says is that there is at most a $\frac 1 4$ probability that the number of heads we get differs from the expected number $\frac n 2$ by $\sqrt n$ or more. So, if we make 100 tosses, then we have $$\Pr(|X - 50| \geq 10) \leq \frac 1 4.$$ In other words, the probability that we get at least 60 heads or at most 40 heads is at most $\frac 1 4$. Of course, the choice of $a$ (or $a^2$) is ours, depending on what kind of threshold we want to know the probability for.
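For comparison, we can compute the exact probability in this example from the distribution of $X$; the short sketch below shows that the true probability (about $0.057$) is well under the Chebyshev bound of $\frac 1 4$, which is expected, since Chebyshev's inequality only uses the variance.

```python
# Exact value of Pr(|X - 50| >= 10) for 100 fair coin tosses, to compare
# against the Chebyshev bound of 1/4.
from math import comb

n = 100
pmf = [comb(n, k) / 2**n for k in range(n + 1)]           # Pr(X = k)
exact = sum(pmf[k] for k in range(n + 1) if abs(k - 50) >= 10)
print(exact)   # about 0.057, comfortably below 1/4
```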

Bayes' Theorem

A common scenario is that we have calculated the probabilities of some events based on the data we have, and then we receive new information. We would like to update those probabilities so that they reflect both the information we already had and the new information. This is where the result of Thomas Bayes, a Presbyterian minister and statistician, comes into play.

Definition 20.10. A partition of $\Omega$ is a family of pairwise disjoint events $H_1, \dots, H_m$ covering $\Omega$. That is, $\bigcup_{i=1}^m H_i = \Omega$ and $H_i \cap H_j = \emptyset$ for all $i \neq j$.

Theorem 20.11 (Law of Total Probability). Given a partition $H_1, \dots, H_m$ of $\Omega$ and an event $A$, $$\Pr(A) = \sum_{i=1}^m \Pr(A \mid H_i) \cdot \Pr(H_i).$$

Proof. Since we have a partition $H_1, \dots, H_m$, we have $A = A \cap \Omega = A \cap \bigcup_{i = 1}^m H_i$. Then, \begin{align*} \Pr(A) &= \Pr\left( A \cap \bigcup_{i=1}^m H_i \right) \\ &= \Pr \left( \bigcup_{i=1}^m (A \cap H_i) \right) \\ &= \sum_{i=1}^m \Pr(A \cap H_i) &\text{since $H_i$ are pairwise disjoint} \\ &= \sum_{i=1}^m \Pr(A \mid H_i) \Pr(H_i) \end{align*} $\tag*{$\Box$}$
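As a small numerical illustration (the numbers here are made up), the following sketch computes $\Pr(A)$ from a three-part partition using the Law of Total Probability.

```python
# Law of Total Probability on a made-up three-part partition H1, H2, H3.
pr_H         = [0.5, 0.3, 0.2]    # Pr(H1), Pr(H2), Pr(H3); must sum to 1
pr_A_given_H = [0.1, 0.4, 0.8]    # Pr(A | H1), Pr(A | H2), Pr(A | H3)

pr_A = sum(pa * ph for pa, ph in zip(pr_A_given_H, pr_H))
print(pr_A)   # 0.05 + 0.12 + 0.16 = 0.33
```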

The Law of Total Probability generalizes the idea that if we know $\Pr(A \mid B)$, $\Pr(A \mid \overline B)$, and $\Pr(B)$, then putting them together gives us $\Pr(A)$. It will be particularly useful for computing the probabilities of events when applying Bayes' Theorem.

Theorem 20.12 (Bayes' Theorem). If $A$ and $B$ are events with $\Pr(A), \Pr(B) \gt 0$, then $$\Pr(A \mid B) = \frac{\Pr(B\mid A) \cdot \Pr(A)}{\Pr(B)}.$$

Proof. By definition, we have $$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}.$$ Rearranging this equation, we get $\Pr(A \cap B) = \Pr(A \mid B) \Pr(B)$. We can do the same with $\Pr(B \mid A)$ to get $\Pr(A \cap B) = \Pr(B \mid A) \Pr(A)$. Combining the two gives the equality $\Pr(A \mid B) \Pr(B) = \Pr(B \mid A) \Pr(A)$, and dividing both sides by $\Pr(B)$ yields $$\Pr(A \mid B) = \frac{\Pr(B \mid A) \Pr(A)}{\Pr(B)}.$$ $\tag*{$\Box$}$
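To see the theorem and the Law of Total Probability working together, here is a small sketch with made-up numbers: we are given a prior $\Pr(A)$ along with $\Pr(B \mid A)$ and $\Pr(B \mid \overline A)$, and we compute $\Pr(A \mid B)$.

```python
# Bayes' Theorem with made-up numbers; Pr(B) comes from the Law of Total
# Probability applied to the partition {A, not A}.
pr_A            = 0.01    # prior Pr(A)
pr_B_given_A    = 0.95    # Pr(B | A)
pr_B_given_notA = 0.10    # Pr(B | not A)

pr_B = pr_B_given_A * pr_A + pr_B_given_notA * (1 - pr_A)   # about 0.1085
pr_A_given_B = pr_B_given_A * pr_A / pr_B                   # Bayes' Theorem
print(pr_A_given_B)   # about 0.088
```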