CMSC 27100 — Lecture 22

We can think of some of the sample spaces we've seen already as describing independent events. For example, consider the sample space $\{H,T\}^n$. We can define the event $H_1$, that the first coin flip was heads, and the event $H_n$, that the $n$th (or final) coin flip was heads, and see that these two events are independent. Why? Intuitively, the first and last coin flips should have nothing to do with each other. If we're flipping a fair coin, the probability of each of these events should be $\frac 1 2$. Independence tells us the probability of both events occurring at the same time is exactly $\frac 1 4$.

Let's lay this out carefully. Since outcomes are strings of length $n$, we can see that $H_1 = \{Hx \mid x \in \{H,T\}^{n-1}\}$ and $H_n = \{xH \mid x \in \{H,T\}^{n-1}\}$. We see that the probability of each event is $\frac 1 2$: $H_1$ consists of the $2^{n-1}$ strings of length $n-1$ prepended with $H$, which is half of all $2^n$ strings of length $n$ (the other half are the same strings prefixed with $T$). The same argument works for $H_n$.

Then what is $H_1 \cap H_n$? It is the set of strings $H \{H,T\}^{n-2} H$, of which there are $2^{n-2}$. This is exactly $\frac 1 4$ of the $2^n$ strings of length $n$, so $\Pr(H_1 \cap H_n) = \frac 1 4 = \Pr(H_1) \cdot \Pr(H_n)$, and the two events are indeed independent.
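We can also verify this by brute force. The following Python sketch (illustrative only; the choice $n = 5$ is arbitrary) enumerates $\{H,T\}^n$ under the uniform distribution and checks the product condition.

```python
from itertools import product
from fractions import Fraction

n = 5                                           # small n chosen for illustration
omega = list(product("HT", repeat=n))           # the sample space {H,T}^n
H1 = {w for w in omega if w[0] == "H"}          # event: first flip is heads
Hn = {w for w in omega if w[-1] == "H"}         # event: last flip is heads

pr = lambda A: Fraction(len(A), len(omega))     # uniform distribution on omega
assert pr(H1) == pr(Hn) == Fraction(1, 2)
assert pr(H1 & Hn) == pr(H1) * pr(Hn) == Fraction(1, 4)
```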

When events are not independent, we can quantify their dependence as correlation.

We say events $A$ and $B$ are positively correlated if $\Pr(A \cap B) \gt \Pr(A) \cdot \Pr(B)$ and negatively correlated if $\Pr(A \cap B) \lt \Pr(A) \cdot \Pr(B)$.

So in the case when $A$ and $B$ are disjoint events of positive probability, they are very negatively correlated, since $\Pr(A \cap B) = \Pr(\emptyset) = 0 \lt \Pr(A) \cdot \Pr(B)$.

Consider a roll of a six-sided die and the events $A$, that the roll is odd, and $B$, that the roll is prime.

Are these events independent? If not, how are they related? First, our sample space is $\Omega = \{1,2,3,4,5,6\}$, so our events are going to be $A = \{1,3,5\}$ and $B = \{2,3,5\}$. Then $\Pr(A) = \frac 3 6 = \frac 1 2$ and $\Pr(B) = \frac 3 6 = \frac 1 2$. We want to consider the probability of the intersection of our events. This gives us $$\Pr(A \cap B) = \Pr(\{1,3,5\} \cap \{2,3,5\}) = \Pr(\{3,5\}) = \frac 2 6 = \frac 1 3.$$ But $\Pr(A) \cdot \Pr(B) = \frac 1 2 \cdot \frac 1 2 = \frac 1 4$, which is not $\frac 1 3$, so $A$ and $B$ are not independent. Furthermore, we have $$\Pr(A \cap B) = \frac 1 3 \gt \frac 1 4 = \Pr(A) \cdot \Pr(B),$$ so $A$ and $B$ are positively correlated. Of course, this makes sense intuitively, because we know the only even prime number is 2.
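As a quick sanity check, the same brute-force enumeration works here; this Python sketch is illustrative and just recomputes the numbers above.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}               # one roll of a fair six-sided die
A = {1, 3, 5}                            # event: the roll is odd
B = {2, 3, 5}                            # event: the roll is prime

pr = lambda E: Fraction(len(E), len(omega))
print(pr(A & B), pr(A) * pr(B))          # 1/3 versus 1/4
print(pr(A & B) > pr(A) * pr(B))         # True: A and B are positively correlated
```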

Events $A_1, A_2, \dots, A_k$ are pairwise independent if for all $i \neq j$, $A_i$ and $A_j$ are independent. The events are mutually independent if for every $I \subseteq \{1, \dots, k\}$, $$\Pr \left( \bigcap_{i \in I} A_i \right) = \prod_{i \in I} \Pr(A_i).$$

This definition suggests that there can be events $A_1, A_2, \dots, A_k$ which are pairwise independent but not mutually independent. Coming up with examples of such events (say, $A$, $B$, $C$ which are pairwise but not mutually independent) makes for a nice exercise; a checker like the one sketched below may help you experiment.
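If you want to test candidate events for the exercise, the following Python sketch (not part of the notes; it assumes a finite sample space with the uniform distribution) checks the pairwise and mutual independence conditions directly from the definition.

```python
from fractions import Fraction
from itertools import combinations

def pr(event, omega):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event & omega), len(omega))

def independent(events, omega):
    """Check Pr(intersection of events) == product of the individual probabilities."""
    inter = set(omega)
    prod = Fraction(1)
    for E in events:
        inter &= E
        prod *= pr(E, omega)
    return pr(inter, omega) == prod

def pairwise_independent(events, omega):
    return all(independent(pair, omega) for pair in combinations(events, 2))

def mutually_independent(events, omega):
    return all(independent(sub, omega)
               for r in range(2, len(events) + 1)
               for sub in combinations(events, r))
```

Feeding your candidate $A$, $B$, $C$ (as subsets of a finite $\Omega$) to both functions tells you whether they witness the difference between the two notions.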

Random variables

So far, we have been discussing and defining events whenever we want to discuss some particular set of outcomes, but this can be cumbersome for a number of reasons. First, we can only ever talk about events occurring or not. Secondly, it's not always intuitive to consider what elements go into an event, particularly for those events that can be described as a series of independent events, as we did above with the coin flips.

What would be nice is a way to take an outcome from our sample space and assign it some value, and to be able to naturally put together or compose these assignments. Of course, this describes what a function does, and so we will introduce a particular kind of function for this purpose.

A random variable is a function $X : \Omega \to \mathbb R$ where $\Omega$ is the sample space of a probability space.

Now, you may have noticed from our discussion that random variables are neither random nor variables (a random variable is, as we can see from the definition above, a deterministic function). This naming is due to Francesco Cantelli, from the early 1900s.

We notice that while random variables are functions, the standard usage is to really write them as though they're variables. So you will see things written like $X = r$ when what is really meant is $X(\omega) = r$ for some $\omega \in \Omega$.

Note that the choice of co-domain $\mathbb R$ is more permissive than prescriptive. A lot of the time, we'll be mapping objects to a particular subset of $\mathbb R$, like $\mathbb N$ or even $\{0,1\}$. And while the standard definition of a random variable is for $\mathbb R$, this can be extended to more exotic co-domains than just $\mathbb R$. In our case, since we're primarily concerned with discrete sample spaces, we are likely to consider random variables that map to discrete spaces.

The simplest example of a random variable is one that simply indicates whether an event occurred. These are called indicator random variables or just indicator variables.

For an event $A$, the indicator variable $I_A$ is a function $I_A : \Omega \to \{0,1\}$ which is defined by \[I_A(\omega) = \begin{cases} 1 & \text{if $\omega \in A$}, \\ 0 & \text{if $\omega \not \in A$}. \end{cases} \]
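As a toy illustration (a Python sketch; the event is the "odd roll" event from the die example above), an indicator variable is just a 0/1-valued function on outcomes.

```python
A = {1, 3, 5}                                   # event: the die roll is odd

def indicator_A(outcome):
    """I_A(omega): 1 if omega is in A, 0 otherwise."""
    return 1 if outcome in A else 0

print([indicator_A(w) for w in range(1, 7)])    # [1, 0, 1, 0, 1, 0]
```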

Having set up this mechanism, we can define the probability of a particular outcome of a random variable.

If $X$ is a random variable, the probability that $X = r$, for some $r \in \mathbb R$ is defined by $\Pr(X = r) = \Pr(\{\omega \in \Omega \mid X(\omega) = r\})$.

Suppose we flip a coin three times. Let $X(\omega)$ be the random variable that equals the number of heads that appear when $\omega$ is the outcome. So we have \begin{align*} X(HHH) &= 3, \\ X(HHT) = X(HTH) = X(THH) &= 2, \\ X(TTH) = X(THT) = X(HTT) &= 1, \\ X(TTT) &= 0. \end{align*} Now, suppose that the coin is fair. Since we assume that the coin flips are mutually independent, each of the eight outcomes above has probability $\frac 1 2 \cdot \frac 1 2 \cdot \frac 1 2 = \frac 1 8$. Now, let's consider $\Pr(X = k)$:

\[\begin{array}{c|c|c}
k & \{\omega \in \Omega \mid X(\omega) = k\} & \Pr(X = k) \\
\hline
3 & \{HHH\} & \frac 1 8 \\
2 & \{HHT, HTH, THH\} & \frac 3 8 \\
1 & \{TTH, THT, HTT\} & \frac 3 8 \\
0 & \{TTT\} & \frac 1 8
\end{array}\]
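This table can be reproduced by brute-force enumeration; the Python sketch below (illustrative only) counts the outcomes of three fair flips by their number of heads.

```python
from itertools import product
from fractions import Fraction
from collections import Counter

omega = list(product("HT", repeat=3))            # all 8 outcomes of three flips
X = lambda w: w.count("H")                       # X = number of heads

counts = Counter(X(w) for w in omega)
for k in sorted(counts, reverse=True):
    print(k, Fraction(counts[k], len(omega)))    # 3: 1/8, 2: 3/8, 1: 3/8, 0: 1/8
```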

This looks suspiciously like something we've seen before. Here, we'll introduce the notion of Bernoulli trials.

A Bernoulli Trial is a random variable $B$ whose range is restricted to $\{0,1\}$, where the event $\{\omega \in \Omega \mid B(\omega) = 1\}$ is the success event, and $\{\omega \in \Omega \mid B(\omega) = 0\}$ is the failure event.

Bernoulli trials are named after Jacob Bernoulli (sometimes referred to as James Bernoulli), who worked in the late 1600s. There are lots of other mathematicians in the Bernoulli family, but the lion's share of mathematical results named after Bernoulli are really named after this particular Bernoulli.

Typically, we consider a repetition of $n$ mutually independent Bernoulli trials. How should we interpret this in our probability framework? In other words, what do the sample space and the probability distribution look like?

If we have a probability space $(\Omega,\Pr)$ and we want to conduct $n$ independent trials, then what we really want is to take the product of $n$ copies of this space. So we define the space $(\Omega^n, \Pr_n)$, where $\Omega^n$ is the set of $n$-tuples $(a_1, a_2, \dots, a_n)$ with $a_i \in \Omega$ and $$\Pr_n((a_1, a_2, \dots, a_n)) = \Pr(a_1) \Pr(a_2) \cdots \Pr(a_n).$$ Here, each position $i$ corresponds to one of our trials, and if $X$ is a Bernoulli trial on $(\Omega, \Pr)$, we can define the corresponding trial on the product space by $X_i((a_1, \dots, a_n)) = X(a_i)$. Since we're taking the product, none of the trials in our tuple depend on each other, so they're independent. We can use this idea to prove the following.
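As a concrete illustration of the product construction (a Python sketch; the biased coin with $\Pr(H) = \frac 2 3$ is an arbitrary choice, not from the lecture), we can build $(\Omega^n, \Pr_n)$ directly and check that it is a probability distribution.

```python
from itertools import product
from fractions import Fraction

# One biased coin flip: Pr(H) = 2/3, Pr(T) = 1/3 (values chosen for illustration).
Pr = {"H": Fraction(2, 3), "T": Fraction(1, 3)}

def Pr_n(tup):
    """Product distribution on Omega^n: multiply the per-coordinate probabilities."""
    p = Fraction(1)
    for a in tup:
        p *= Pr[a]
    return p

n = 3
omega_n = list(product(Pr.keys(), repeat=n))     # Omega^n as n-tuples
assert sum(Pr_n(t) for t in omega_n) == 1        # the probabilities sum to 1
```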

The probability of $k$ successes in $n$ independent Bernoulli trials with probability of success $p$ is $$\binom n k p^k (1-p)^{n-k}.$$

For each $n$-tuple $A = (a_1, \dots, a_n)$ of trials with Bernoulli random variable $X$, we can construct a string $w_A = b_1 b_2 \cdots b_n$ over the alphabet $\{S,F\}$ by $$b_i = \begin{cases} S & \text{if $X(a_i) = 1$,} \\ F & \text{otherwise.} \end{cases}$$ Then the number of $n$-tuples with $k$ successes is the same as the number of strings of length $n$ over $\{S,F\}$ with exactly $k$ $S$'s. Recall that there are exactly $\binom n k$ of these.

Then each $n$-tuple $A$ with $k$ successes has $\Pr_n(A) = p^k (1-p)^{n-k}$. Summing over all $\binom n k$ such tuples, the probability of $k$ successes is \[\binom n k p^k (1-p)^{n-k}.\]

 

This is called the binomial distribution, denoted $\operatorname{Bin}(n,p)$ for $n$ trials with probability of success $p$. It is appropriately named, since this is what you'd get if you plugged $x = p$ and $y = 1-p$ into the Binomial Theorem. We can restate our theorem from above as

If $X \sim \operatorname{Bin}(n,p)$ (i.e., $X$ is distributed according to the binomial distribution on $n$ trials with probability of success $p$), then \[\Pr(X = k) = \binom n k p^k (1-p)^{n-k}\] for $k \in \{0, \dots, n\}$, and $\Pr(X = k) = 0$ otherwise.
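In code, this probability mass function is a direct transcription of the formula; the following Python sketch (the function name binom_pmf is ours) uses math.comb for the binomial coefficient.

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(X = k) for X ~ Bin(n, p)."""
    if not 0 <= k <= n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

print([binom_pmf(k, 3, 0.5) for k in range(4)])    # [0.125, 0.375, 0.375, 0.125]
```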

Notice also that we can reframe Bernoulli trials as a distribution—it just happens to be a very simple one.

We can then reinterpret our results from the coin-flipping example as a sequence of Bernoulli trials. Writing $b(k; n, p)$ for $\Pr(X = k)$ when $X \sim \operatorname{Bin}(n,p)$, and using $p = \frac 1 2$ for a fair coin, we have

\[\begin{array}{c|c|c}
k & b(k; 3, \frac 1 2) & \Pr(X = k) \\
\hline
3 & \binom 3 3 \left( \frac 1 2 \right)^3 \left( \frac 1 2 \right)^0 & \frac 1 8 \\
2 & \binom 3 2 \left( \frac 1 2 \right)^2 \left( \frac 1 2 \right)^1 & \frac 3 8 \\
1 & \binom 3 1 \left( \frac 1 2 \right)^1 \left( \frac 1 2 \right)^2 & \frac 3 8 \\
0 & \binom 3 0 \left( \frac 1 2 \right)^0 \left( \frac 1 2 \right)^3 & \frac 1 8
\end{array}\]

Consider the sample space of graphs on $n$ vertices. It's not immediately obvious how we should assign probabilities to these graphs. One way to do this, due to Erdős and Rényi, is to fix a number of edges $m$, take the graphs on $n$ vertices with exactly $m$ edges as the sample space, and distribute them uniformly. (In this case, what is the probability of getting a single graph?)

Another model is to consider all graphs on $n$ vertices, but instead of trying to assign a distribution to the graphs directly, we assign probabilities to the presence of edges. We consider each edge independently and include it in our graph with probability $p$. Unlike before, this doesn't immediately give us a clear picture of what the probability of a particular graph is. Instead, we need to consider such questions using random variables.

We can consider the presence of an edge a Bernoulli trial. That is, we define a random variable $X_{u,v}$ to be 1 if the edge $(u,v)$ is present in the graph and 0 if not. Formally, \[X_{u,v} = \begin{cases} 1 & \text{if $(u,v) \in E$}, \\ 0 & \text{if $(u,v) \not \in E$}. \end{cases}\] With this, we can ask what the probability is that a graph has $m$ edges. Since the edges are independent, we can apply our discussion from above: this is really a series of trials over all possible pairs of vertices in our graph. Since there are $\binom n 2$ such pairs, any particular graph with $m$ edges occurs with probability $p^m (1-p)^{\binom n 2 - m}$. Then, noting that there are $\binom{\binom n 2} m$ ways to choose which $m$ of the $\binom n 2$ possible edges are present, the probability that we get some graph with $m$ edges is \[\binom{\binom n 2}m p^m (1-p)^{\binom n 2 - m}.\]
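To see the edge-by-edge Bernoulli trials in action, here is a Python sketch (the function names and the parameters $n = 5$, $p = 0.4$ are illustrative, not from the notes) that samples a graph in this model and evaluates the probability of ending up with exactly $m$ edges.

```python
import random
from math import comb
from itertools import combinations

def sample_gnp(n, p):
    """Sample a graph on vertices 0..n-1 by including each of the C(n,2) possible edges independently with probability p."""
    return {(u, v) for u, v in combinations(range(n), 2) if random.random() < p}

def pr_m_edges(n, p, m):
    """Probability of exactly m edges: C(C(n,2), m) * p^m * (1-p)^(C(n,2) - m)."""
    N = comb(n, 2)
    return comb(N, m) * p**m * (1 - p)**(N - m)

G = sample_gnp(5, 0.4)
print(len(G), pr_m_edges(5, 0.4, len(G)))
```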