CMSC 27100 — Lecture 23

Expectation

We have moved from talking about the probabilities of events to thinking about the probabilities of properties of outcomes, in the form of random variables. Now, if we know these probabilities, we can ask which values we should expect to see more often than others.

Let $(\Omega,\Pr)$ be a probability space and let $X$ be a random variable. The expected value of $X$ is \[E(X) = \sum_{\omega \in \Omega} X(\omega) \Pr(\omega).\]

Roughly speaking, the expectation of a random variable is the (weighted) average value for the random variable.
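As a quick illustration of the definition, here is a minimal Python sketch that computes $E(X)$ by summing over a small hypothetical sample space: a fair coin flipped twice, with $X$ counting the number of heads.

```python
# Expectation straight from the definition: E(X) = sum of X(w) * Pr(w) over outcomes w.
# Hypothetical sample space: a fair coin flipped twice; X counts the number of heads.
Pr = {("H", "H"): 0.25, ("H", "T"): 0.25, ("T", "H"): 0.25, ("T", "T"): 0.25}

def X(omega):
    """Number of heads in the outcome omega."""
    return omega.count("H")

E_X = sum(X(omega) * p for omega, p in Pr.items())
print(E_X)  # 1.0
```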

Consider a fair 6-sided die and let $X$ be the random variable for the number that was rolled. We have \[E(X) = 1 \cdot \frac 1 6 + 2 \cdot \frac 1 6 + 3 \cdot \frac 1 6 + 4 \cdot \frac 1 6 + 5 \cdot \frac 1 6 + 6 \cdot \frac 1 6 = \frac{21}{6} = 3.5.\] Now, consider a biased 6-sided die with \[\Pr(\omega) = \begin{cases} \frac 3 4 & \text{if $\omega = 1$,} \\ \frac{1}{20} & \text{otherwise}. \end{cases}\] Let $Y$ be the random variable for the number that was rolled with our biased die. Then we get \[E(Y) = 1 \cdot \frac 3 4 + 2 \cdot \frac{1}{20} + 3 \cdot \frac{1}{20} + 4 \cdot \frac{1}{20} + 5 \cdot \frac{1}{20} + 6 \cdot \frac{1}{20} = \frac 3 4 + \frac{20}{20} = 1.75.\] This indicates to us that the biased die will give us a 1 more often and therefore the average value that we should expect from a roll is much closer to 1.

It's important to note that what this tells us is not that we should expect a single roll of the fair die to be 3.5, or a single roll of the biased die to be 1.75, because obviously those values will never come up. Rather, what expectation tells us is that the more we roll each die, the closer the average of all the rolls will get to 3.5 or 1.75, respectively.
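To see this long-run behavior concretely, here is a small simulation sketch of both dice; the exact output varies from run to run, but the sample averages should land close to 3.5 and 1.75.

```python
import random

def roll_fair():
    return random.randint(1, 6)

def roll_biased():
    # 1 with probability 3/4 (weight 15/20); each of 2..6 with probability 1/20.
    return random.choices([1, 2, 3, 4, 5, 6], weights=[15, 1, 1, 1, 1, 1])[0]

n = 100_000
print(sum(roll_fair() for _ in range(n)) / n)    # close to 3.5
print(sum(roll_biased() for _ in range(n)) / n)  # close to 1.75
```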

Recall that the motivation for defining random variables was so we could get away from considering the probabilities of elementary events. The definition of expectation is not very helpful in this regard. Luckily, it is not too difficult to reformulate expectation in terms of the values that the random variable takes on.

Let $\Pr(X = r) = \Pr(\{\omega \mid X(\omega) = r\})$ and let $\{r_1, \dots, r_k\} = \operatorname{range}(X)$. Then \[E(X) = \sum_{i=1}^k r_i \cdot \Pr(X = r_i).\]

Recall that the range of a function $X$ is all of the possible values that $X(\omega)$ can take over every $\omega \in \Omega$. Then we have \begin{align*} \sum_{i=1}^k r_i \cdot \Pr(X = r_i) &= \sum_{i=1}^k r_i \cdot \Pr(\{\omega \in \Omega \mid X(\omega) = r_i\}) \\ &= \sum_{i=1}^k r_i \cdot \sum_{\omega \in \Omega, X(\omega) = r_i} \Pr(\omega) \\ &= \sum_{i=1}^k \sum_{\omega \in \Omega, X(\omega) = r_i} X(\omega) \Pr(\omega) \\ &= \sum_{\omega \in \Omega} X(\omega) \Pr(\omega) \\ &= E(X) \end{align*}
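The proof's regrouping is easy to mirror in code. This sketch uses a hypothetical example, the sum of two fair dice, and checks that summing over outcomes and summing over values of $X$ give the same answer.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical sample space: two fair dice; X is the sum of the two rolls.
Pr = {(i, j): Fraction(1, 36) for i in range(1, 7) for j in range(1, 7)}
X = lambda omega: omega[0] + omega[1]

# Original definition: sum over elementary outcomes.
by_outcome = sum(X(w) * p for w, p in Pr.items())

# Reformulation: group outcomes by the value X takes, i.e., compute Pr(X = r) first.
Pr_X = defaultdict(Fraction)
for w, p in Pr.items():
    Pr_X[X(w)] += p
by_value = sum(r * p for r, p in Pr_X.items())

print(by_outcome, by_value)  # both are 7
```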


This is a much more useful way of thinking about expectation. Consider the following example.

Let $X$ be a Bernoulli random variable with probability of success $p$. What is $E(X)$? Based on the original definition, we need to know about our sample space and probability distribution. But if we focus only on the values of the random variable, the problem becomes quite simple: \[E(X) = 1 \cdot \Pr(X = 1) + 0 \cdot \Pr(X = 0) = 1 \cdot p + 0 \cdot (1-p) = p.\] What this says is that we have two possible values, success and failure. If $p$, the probability of success, is higher, then the expectation naturally rises. Over a large number of trials, we would expect the average of all of the trial outcomes to approach $p$.
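A quick simulation check, with `p` chosen arbitrarily for illustration:

```python
import random

p = 0.3  # arbitrary success probability, just for this check
n = 100_000
trials = [1 if random.random() < p else 0 for _ in range(n)]
print(sum(trials) / n)  # close to 0.3
```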

Sometimes, we will need to deal with more than one random variable. For instance, suppose we want to consider the expected value of a roll of three dice. We would like to be able to use the expected value for a single die roll, which we already worked out above. The following result gives us a way to do this.

Let $(\Omega, \Pr)$ be a probability space, $c_1, c_2, \dots, c_n$ be real numbers, and $X_1, X_2, \dots, X_n$ be random variables over $\Omega$. Then \[E\left( \sum_{i=1}^n c_i X_i \right) = \sum_{i=1}^n c_i E(X_i).\]

Notice that this result is true for all random variables. That is, we can always add the expectations of random variables and multiply an expectation by a constant factor. However, it says nothing about products: in general, $E(XY)$ is not $E(X) \cdot E(Y)$.

Suppose we have three of our biased dice from the previous example, each with $\Pr(1) = \frac 3 4$ and probability $\frac{1}{20}$ for the other outcomes. What is the expected value of a roll of these three dice? Without the linearity property, we would need to consider the expected value for every 3-tuple of rolls $(\omega_1, \omega_2, \omega_3)$. However, we can apply the previous theorem to get a more direct answer. Let $X_1, X_2, X_3$ be random variables for the roll of each die. Then we have \[E(X_1 + X_2 + X_3) = E(X_1) + E(X_2) + E(X_3) = 1.75 + 1.75 + 1.75 = 5.25.\]
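We can sanity-check this with the same kind of simulation as before, reusing the biased-die distribution; the average of three rolls should come out near 5.25.

```python
import random

def roll_biased():
    # 1 with probability 3/4; each of 2..6 with probability 1/20.
    return random.choices(range(1, 7), weights=[15, 1, 1, 1, 1, 1])[0]

n = 100_000
avg = sum(roll_biased() + roll_biased() + roll_biased() for _ in range(n)) / n
print(avg)  # close to 5.25
```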

Something you may have noticed is that in a roll of three dice, the roll of each die is independent of the others, but we never used this fact. We can define a notion of independence for random variables in a similar way to independence of events.

We say the random variables $X$ and $Y$ are independent if for all $r,s \in \mathbb R$, $\Pr(X = r \wedge Y = s) = \Pr(X = r) \cdot \Pr(Y = s)$.

It is important to note again that although we have a very similar notion of independence for random variables, none of our results so far have involved independence.

Let $X$ be a binomial random variable with parameters $n$ and $p$. We have \[E(X) = \sum_{k=0}^n k \cdot \Pr(X = k) = \sum_{k=0}^n k \binom n k p^k (1-p)^{n-k}.\] At this point, we have a fairly complicated formula. While it's possible to continue to work this out and simplify it using algebra, linearity of expectation gives us another, arguably more insightful, approach.

Recall that $X$ is a binomial random variable, so $X = k$ means we have $k$ successes on $n$ independent Bernoulli trials with probability of success $p$. Let $I_1, \dots, I_n$ be our Bernoulli trials with probability $p$. Then we can write \[X = I_1 + \cdots + I_n.\] So $X = k$ exactly when $k$ of the $I_j$'s are 1. Since each $I_j$ is a Bernoulli random variable, $E(I_j) = p$, and we can apply linearity of expectation: \[E(X) = E(I_1 + \cdots + I_n) = E(I_1) + \cdots + E(I_n) = np.\]
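As a concrete check, this sketch compares the direct sum against $np$ for an arbitrary choice of $n$ and $p$:

```python
from math import comb

n, p = 10, 0.3  # arbitrary parameters, just for this check
direct = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
print(direct, n * p)  # both 3.0, up to floating-point error
```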

A cut in a graph is a partition of the vertices into two sets, say $S$ and $T$. The maximum cut problem is: given a graph, what is the cut with the largest number of edges crossing the cut? That is, once we divide our vertices into $S$ and $T$, we are interested in the number of edges with one end in $S$ and one end in $T$.

For example, in the following graph, we have defined a cut based on the green vertices $S$; the remaining purple vertices form $T = V - S$. The edges in blue are the edges that cross the cut: they have one end that's green (in $S$) and one end that's purple (not in $S$).

This problem is known to be computationally difficult, but what if we just want a large cut, not necessarily the largest one? We can use a surprisingly simple randomized algorithm: go through each vertex and independently assign it to $S$ or $T$, each with probability $\frac 1 2$.
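The algorithm is only a few lines. A minimal sketch, assuming the graph is given as a list of vertices and a list of edges (both hypothetical inputs):

```python
import random

def random_cut(vertices, edges):
    """Assign each vertex to S independently with probability 1/2;
    return S and the number of edges crossing the cut (S, T)."""
    S = {v for v in vertices if random.random() < 0.5}
    crossing = sum(1 for (u, v) in edges if (u in S) != (v in S))
    return S, crossing

# Example usage on a small hypothetical graph.
print(random_cut([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]))
```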

How many edges cross such a cut? Or more accurately, what is the expected number of edges that cross the cut? Let $C$ be the random variable such that $C = k$ if the cut has $k$ edges. For each edge $(u,v)$, we define a random variable $I_{u,v}$ to be 1 if $u$ and $v$ are on different sides of the cut and 0 otherwise.

So what is the probability that $u$ and $v$ are on different sides of the cut? There are four possibilities: both $u$ and $v$ are in $S$; $u \in S$ and $v \in T$; $u \in T$ and $v \in S$; or both are in $T$.

Since a vertex is chosen to be in $S$ independently with probability $\frac 1 2$, the probability for each of the four cases above is $\frac 1 2 \cdot \frac 1 2 = \frac 1 4$. Since two of the four cases are the ones in which $u$ and $v$ are on different sides of the cut, we have that $\Pr(I_{u,v} = 1) = \frac 1 2$.

Then we can write $C = \sum\limits_{(u,v) \in E} I_{u,v}$. This gives us \[E(C) = \sum_{(u,v) \in E} E(I_{u,v}) = \sum_{(u,v) \in E} \frac 1 2 = \frac m 2,\] where $m$ is the number of edges in our graph.
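A quick simulation illustrates the $\frac m 2$ bound; a self-contained sketch on a small hypothetical graph with $m = 8$ edges:

```python
import random

vertices = range(6)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3), (1, 4)]

def cut_size():
    # Assign each vertex to S independently with probability 1/2.
    S = {v for v in vertices if random.random() < 0.5}
    return sum(1 for (u, v) in edges if (u in S) != (v in S))

n = 50_000
print(sum(cut_size() for _ in range(n)) / n)  # close to m/2 = 4
```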

What does this tell us? One thing it tells us is that if we try out a random assignment of vertices to $S$ or $T$, we have, in expectation or on average, a cut where half the edges cross the cut. With a bit of work, one can come up with an analysis that tells us how likely this outcome is.

Something else that we can say is that every graph contains at least one cut which involves at least half the edges of the graph. Why? Recall that the expected value is a weighted (by probability) average. This means that there is at least one cut with at least $E(C)$ edges; otherwise, every cut would have fewer than $\frac m 2$ edges, and the average $E(C)$ would be less than $\frac m 2$.
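On small graphs, the averaging argument can be verified exhaustively: enumerate all $2^n$ assignments of vertices to sides and confirm that the largest cut is at least as big as the average cut. A brute-force sketch on the same hypothetical example graph:

```python
from itertools import product

vertices = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3), (1, 4)]

sizes = []
for assignment in product([0, 1], repeat=len(vertices)):
    side = dict(zip(vertices, assignment))
    sizes.append(sum(1 for (u, v) in edges if side[u] != side[v]))

# The best cut is at least the average cut size, which is m/2.
print(max(sizes), sum(sizes) / len(sizes))  # max >= average = 4.0
```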

This is an example of what's called the probabilistic method. Recall that there are a few ways to show existence: we can exhibit an example, we can come up with a constructive proof, or we can show indirectly that it must exist by assuming otherwise and arriving at a contradiction. Probability theory gives us another indirect way to show that an object exists: show that a randomly chosen object has the desired property with nonzero probability. In our case, we've shown that a random cut has expected size $\frac m 2$, so there must be nonzero probability of getting a cut of size at least $\frac m 2$; that is, every graph contains a cut crossed by at least half of its edges.