MPCS 50103 — Lecture 6

The Binomial Theorem

Let's think about binomials. This will seem like it's out of left field, but there is a connection, I promise. Recall that a binomial is an expression consisting of two terms.

Hopefully, we are all familiar with how to expand a binomial. For instance, we have $$(x+y)^2 = x^2 + xy + xy + y^2 = x^2 + 2xy + y^2.$$ Of course, we can do the same thing with higher degrees, like \begin{align*} (x+y)^3 &= (x+y)^2 (x+y) \\ &= (x^2 + 2xy + y^2)(x+y) \\ &= x^3 + 2x^2y + xy^2 + x^2y + 2xy^2 + y^3 \\ &= x^3 + 3x^2y + 3xy^2 + y^3 \end{align*}

Now, this process is straightforward but a bit tedious. Perhaps there is an alternative way to think about the resulting coefficients. This becomes a bit clearer if we step away from our algebraic impulses. We can write \begin{align*} (x+y)^4 &= (x+y)(x+y)(x+y)(x+y) \\ &= xxxx + xxxy + xxyx + xxyy + xyxx + xyxy + xyyx + xyyy + \\ &\quad\: yxxx + yxxy + yxyx + yxyy + yyxx + yyxy + yyyx + yyyy \end{align*} When written in this way, each term looks much more like the kinds of objects that we've been working with over the past week. That is, we can think of each term as a choice of either $x$ or $y$ from each of the four binomial factors. In other words, how many terms of the form $x^3y$ are there? Exactly as many as there are ways to choose three $x$'s and one $y$ from the four factors. This leads us to the following result.
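If you want to see this enumeration concretely, here's a quick Python sketch (just an illustration) that generates the $2^4$ terms above and tallies them by the number of $y$'s each contains:

```python
from collections import Counter
from itertools import product

# Expand (x+y)^4 by choosing either x or y from each of the four factors;
# itertools.product enumerates all 2^4 = 16 such choices.
terms = ["".join(choice) for choice in product("xy", repeat=4)]

# The coefficient of x^(4-j) y^j is the number of terms with exactly j y's.
coefficients = Counter(term.count("y") for term in terms)
print(sorted(coefficients.items()))  # [(0, 1), (1, 4), (2, 6), (3, 4), (4, 1)]
```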

Theorem 10.1 (Binomial Theorem). For all $n \in \mathbb N$, $$(x+y)^n = \sum^n_{j=0} \binom n j x^{n-j} y^j.$$

Proof. One can prove this via induction, and it makes a nice exercise. We will instead give a combinatorial argument similar to the above. Each term of our polynomial is of the form $x^{n-j}y^j$, and its coefficient is exactly the number of ways to choose $n-j$ $x$'s from the $n$ binomial factors. There are $\binom n {n-j}$ ways to do this, and $\binom n {n-j} = \binom n j$. $\tag*{$\Box$}$

Example 10.2. Suppose you have to prove the inductive case in some proof and you're faced with expanding $(k+1)^5$. Luckily, you don't need to futz around with expanding this polynomial and can directly apply the binomial theorem: \begin{align*} (k+1)^5 &= \binom 5 0 k^5 1^0 + \binom 5 1 k^4 1^1 + \binom 5 2 k^3 1^2 + \binom 5 3 k^2 1^3 + \binom 5 4 k 1^4 + \binom 5 5 k^0 1^5 \\ &= k^5 + 5k^4 + 10k^3 + 10k^2 + 5k + 1 \end{align*}
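In the same spirit, a couple of lines of Python using `math.comb` will produce these coefficients and sanity-check the expansion for a sample value of $k$:

```python
import math

# Coefficients of (k+1)^5 straight from the Binomial Theorem.
coefficients = [math.comb(5, j) for j in range(6)]
print(coefficients)  # [1, 5, 10, 10, 5, 1]

# Sanity check against direct exponentiation for a sample value of k.
k = 7
assert sum(c * k ** (5 - j) for j, c in enumerate(coefficients)) == (k + 1) ** 5
```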

This example suggests the following result, which follows from the Binomial Theorem by setting $y = 1$.

Theorem 10.3. For all $n \in \mathbb N$, $$(x+1)^n = \sum_{i=0}^n \binom n i x^i.$$

The following is a neat consequence of the binomial theorem that should match our intuition about what the binomial coefficients mean.

Theorem 10.4. For all $n \in \mathbb N$, $$\sum_{i=0}^n \binom n i = 2^n.$$

Proof. We have $$2^n = (1+1)^n = \sum_{i=0}^n \binom n i 1^{n-i} 1^i = \sum_{i=0}^n \binom n i.$$ $\tag*{$\Box$}$

Of course, this makes total sense: if $\binom n i$ is the number of subsets of size $i$ of a set of $n$ elements, then adding the numbers of these subsets together gives the total number of subsets of a set of size $n$, which is $2^n$.
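As a quick sanity check, we can compare the row sum against a direct count of subsets (a small Python sketch using `itertools.combinations`):

```python
import math
from itertools import combinations

n = 6
row_sum = sum(math.comb(n, i) for i in range(n + 1))

# Count the subsets of an n-element set directly, grouped by size.
subset_count = sum(len(list(combinations(range(n), i))) for i in range(n + 1))
assert row_sum == subset_count == 2 ** n
```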

There are a lot of other neat results that fall out of the Binomial Theorem. Perhaps the most famous one is due to Blaise Pascal in the 1600s.

Theorem 10.5 (Pascal's Identity). For all natural numbers $n \geq k \gt 0$, $$\binom{n+1}{k} = \binom{n}{k-1} + \binom n k.$$

Proof. As has been the case throughout this part of the course, there is a straightforward algebraic proof of this fact as well as a combinatorial proof. We will go through the combinatorial argument.

Let $A$ be a set of $n+1$ elements. We want to count the number of subsets of $A$ of size $k$, which is $\binom{n+1}{k}$. Fix an element $a \in A$ and consider a subset $B \subseteq A$ of size $k$. Then either $a \in B$ or $a \not \in B$; these two cases are disjoint. If $a \in B$, then $B$ contains $k-1$ elements of $A \setminus \{a\}$, and there are $\binom{n}{k-1}$ such sets. On the other hand, if $a \not \in B$, then $B$ contains $k$ elements of $A \setminus \{a\}$, and there are $\binom{n}{k}$ such sets. Together, this gives us $\binom{n}{k-1} + \binom n k$ subsets of $A$ of size $k$. $\tag*{$\Box$}$

This identity leads us to the famous Pascal's Triangle: $$ \begin{matrix} &&&&&& \binom 0 0 &&&&&& \\ &&&&& \binom 1 0 && \binom 1 1 &&&&& \\ &&&& \binom 2 0 && \binom 2 1 && \binom 2 2 &&&& \\ &&& \binom 3 0 && \binom 3 1 && \binom 3 2 && \binom 3 3 &&& \\ && \binom 4 0 && \binom 4 1 && \binom 4 2 && \binom 4 3 && \binom 4 4 && \\ & \binom 5 0 && \binom 5 1 && \binom 5 2 && \binom 5 3 && \binom 5 4 && \binom 5 5 & \\ \binom 6 0 && \binom 6 1 && \binom 6 2 && \binom 6 3 && \binom 6 4 && \binom 6 5 && \binom 6 6 \end{matrix} $$ Filling in the values for the coefficients, the triangle looks like this: $$ \begin{matrix} &&&&&& 1 &&&&&& \\ &&&&& 1 && 1 &&&&& \\ &&&& 1 && 2 && 1 &&&& \\ &&& 1 && 3 && 3 && 1 &&& \\ && 1 && 4 && 6 && 4 && 1 && \\ & 1 && 5 && 10 && 10 && 5 && 1 & \\ 1 && 6 && 15 && 20 && 15 && 6 && 1 \end{matrix} $$ When viewed like this, it's easy to see how Pascal's identity is used to construct each row from the one above it.
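This row-by-row construction is easy to express in code. Here is a short Python sketch that builds the triangle using nothing but Pascal's identity (the helper name `pascal_rows` is ours, just for illustration):

```python
def pascal_rows(count):
    """Yield the first `count` rows of Pascal's triangle, built using only
    Pascal's identity: each entry is the sum of the two entries above it."""
    row = [1]
    for _ in range(count):
        yield row
        # Pad with zeros so the edge entries (with only one "parent") work out.
        row = [a + b for a, b in zip([0] + row, row + [0])]

for r in pascal_rows(7):
    print(r)  # [1], [1, 1], [1, 2, 1], ..., [1, 6, 15, 20, 15, 6, 1]
```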

This arrangement leads us to even more interesting identities, or interpretations of identities with respect to the triangle. For example, Theorem 10.4 says exactly that the $n$th row of the triangle sums to $2^n$. We can observe another interesting row identity.

Theorem 10.6. For all integers $n \geq 1$, $$\sum_{i=0}^n (-1)^i \binom n i = 0.$$

Proof. We take $x = 1$ and $y = -1$ in the Binomial Theorem to get $(1-1)^n$. Since $n \geq 1$, this is 0, and applying the Binomial Theorem gives the equality $$(1-1)^n = \sum_{i=0}^n (-1)^i \binom n i = 0.$$ $\tag*{$\Box$}$

Looking at the triangle, this seems obvious for odd $n$: even without the triangle, each term $\binom n k$ cancels with the corresponding term $\binom n {n-k}$, which appears with the opposite sign. For even $n$, this is not as obvious, although the triangle gives us some hints as to why it might work out.

We can say something about the sum of a particular column too.

Theorem 10.7. For $n \geq r \gt 0$, we have $$\binom{n+1}{r+1} = \sum_{i=r}^n \binom i r.$$

Proof. Let $A$ be a set of $n+1$ elements. We will order the elements such that $a_1 \lt a_2 \lt \cdots \lt a_{n+1}$. Let's consider the subsets of $A$ of size $r+1$. Because we defined an order on the elements of $A$, each subset $B$ of $r+1$ elements has a largest element, say $a_k$. Furthermore, $a_k$ has to be in the set $\{a_{r+1}, a_{r+2}, \dots, a_{n+1}\}$, since otherwise $B$ would contain fewer than $r+1$ elements.

This means that the set $B \setminus \{a_k\}$ of size $r$ is a subset of $\{a_1, a_2, \dots, a_{k-1}\}$. The total number of such sets is $\binom{k-1}{r}$. Then to get the total number of subsets of $A$ of size $r+1$, we sum over the possible $k$s (which, from above, range from $r+1$ to $n+1$) to get $$\sum_{k = r+1}^{n+1} \binom{k-1}{r} = \sum_{i = r}^n \binom i r$$ by setting $i = k-1$. $\tag*{$\Box$}$
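If you'd like to convince yourself numerically, here's a quick Python check of the identity over a small range of $n$ and $r$:

```python
import math

# Check Theorem 10.7 (the column-sum identity) for a range of n and r.
for n in range(1, 12):
    for r in range(1, n + 1):
        assert math.comb(n + 1, r + 1) == sum(math.comb(i, r) for i in range(r, n + 1))
```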

The following identity is due to Alexandre-Théophile Vandermonde from the late 1700s.

Theorem 10.8 (Vandermonde's Identity). For all $n,m \geq r \gt 0$, $$ \binom{m+n}{r} = \sum_{k=0}^r \binom{m}{r-k} \binom n k.$$

Proof. If we have disjoint sets $A$ and $B$ with $|A| = m$ and $|B| = n$, then $\binom{m+n}{r}$ is the number of subsets of $A \cup B$ of size $r$.

We can count the $r$-element subsets of $A \cup B$ in another way: first choose $r-k$ elements from $A$, then choose the remaining $k$ elements from $B$, which gives $\binom{m}{r-k} \binom n k$ sets for each fixed $k$. Every $r$-element subset of $A \cup B$ arises in this way for exactly one value of $k$ (namely, the number of its elements that lie in $B$), so summing over $k$ gives $$\sum_{k=0}^r \binom{m}{r-k} \binom n k.$$ $\tag*{$\Box$}$
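Again, a quick numerical check in Python (purely illustrative) is easy:

```python
import math

# Check Theorem 10.8 (Vandermonde's identity) for a range of m, n, and r.
for m in range(1, 8):
    for n in range(1, 8):
        for r in range(1, min(m, n) + 1):
            lhs = math.comb(m + n, r)
            rhs = sum(math.comb(m, r - k) * math.comb(n, k) for k in range(r + 1))
            assert lhs == rhs
```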

The following interesting fact follows from Vandermonde's identity.

Corollary 10.9. For all $n \gt 0$, $$\binom{2n}{n} = \sum_{k=0}^n \binom n k ^2.$$

Proof. Using Vandermonde's identity, we set $m = r = n$ to get $$\binom{2n}{n} = \sum_{k=0}^n \binom{n}{n-k} \binom n k.$$ Then we observe that $\binom{n}{n-k} = \binom n k$. $\tag*{$\Box$}$

The Pigeonhole Principle

On the second floor of Crerar, just outside my office and the TA office hour rooms, you'll find the grad student mailboxes. These mail slots are also called pigeonholes. The name comes from earlier usage, when people really did keep birds in large arrangements of small compartments. The pigeonhole principle was first stated by Dirichlet in the 1800s, and the story goes that it was named as such because Dirichlet's father was a postmaster. However, Dirichlet named and stated it in French and German; "pigeonhole" is the English rendering. We don't really call these compartments pigeonholes anymore, but it is probably just as well to think of birds and holes, since that illustrates the basic idea quite nicely: if you have more birds than holes, then one of those holes is going to contain more than one bird.

Theorem 11.1 (Pigeonhole Principle). Let $A_1, A_2, \dots, A_k$ be disjoint sets such that $\left| \bigcup_{i=1}^k A_i \right| \gt k$. Then there exists a set $A_i$ such that $|A_i| \geq 2$.

Proof. Suppose we have $\left| \bigcup_{i=1}^k A_i \right| \gt k$. Let us assume that $|A_i| \leq 1$ for all $1 \leq i \leq k$. Then by the sum rule, we have $$\left| \bigcup_{i=1}^k A_i \right| = \sum_{i=1}^k |A_i| \leq \sum_{i=1}^k 1 = k,$$ which contradicts our initial hypothesis that $\left| \bigcup_{i=1}^k A_i \right| \gt k$. $\tag*{$\Box$}$

This idea seems very obvious, but it leads to some useful results.

Corollary 11.2. Let $A$ and $B$ be finite sets with $|A| \gt |B|$ and let $f:A \to B$ be a function. Then $f$ is not one-to-one.

Proof. Let $B = \{b_1, b_2, \dots, b_n\}$. Consider the sets $A_i = f^{-1}(b_i)$ for $1 \leq i \leq n$. These $n$ sets are disjoint and their union is all of $A$, which has more than $n$ elements. By the pigeonhole principle, there must be a set $A_i$ such that $|A_i| \geq 2$; that is, two elements of $A$ are mapped to $b_i$. Therefore, $f$ is not one-to-one. $\tag*{$\Box$}$

Example 11.3. Suppose there are $n$ candidates running for office at a debate and after the debate, the candidates all try to awkwardly shake each other's hands. Then there will be at least two people who shake the same number of hands. To see this, we first note that each person can shake hands with anywhere between 0 and $n-1$ other people, which gives us $n$ possibilities, and so we consider the set $A_i$ of people who shook $i$ hands, with $0 \leq i \leq n-1$. However, observe that it can't be the case that someone shook no one else's hand while someone else shook everyone's hand, so at least one of $A_0$ or $A_{n-1}$ must be empty. This leaves $n-1$ possibilities for $n$ people, so our result follows by the pigeonhole principle.

Example 11.4. Suppose I'm a charlatan and I claim that I have a perfect lossless compression algorithm that I'd like to scam some sweet VC funding for. Recall that all data is really just a binary string. So, what I'm essentially claiming is that for every string $x \in \{0,1\}^*$, I can compress it into some shorter string $x'$ and then decompress $x'$ to recover $x$. Let's consider all the binary strings of length $n$. From our previous discussions, we know there are exactly $2^n$ such binary strings. How many strings of length $\lt n$ are there? There are $2^0 + 2^1 + \cdots + 2^{n-1} = 2^n - 1$ such strings. But $2^n - 1 \lt 2^n$, so by the pigeonhole principle, at least one string of length $\lt n$ is the compressed form of two different strings of length $n$, and decompression cannot recover both of them.

We can make the pigeonhole principle even more general.

Theorem 11.5. Let $A_1, A_2, \dots, A_k$ be disjoint sets such that $\left| \bigcup_{i=1}^k A_i \right| = n$. Then there exists a set $A_i$ such that $|A_i| \geq \left\lceil \frac n k \right\rceil$.

Proof. Suppose that $\left| \bigcup_{i=1}^k A_i \right| = n$ and assume, for the sake of contradiction, that $|A_i| \lt \left\lceil \frac n k \right\rceil$ for all $1 \leq i \leq k$. Then we have $$\left| \bigcup_{i=1}^k A_i \right| = \sum_{i=1}^k |A_i| \lt \sum_{i=1}^k \frac n k = k \cdot \frac n k = n.$$ Here, we make use of the fact that if $|A_i| \lt \left\lceil \frac n k \right\rceil$, then $|A_i| \lt \frac n k$: otherwise, $|A_i|$ would be an integer satisfying $\frac n k \leq |A_i| \lt \left\lceil \frac n k \right\rceil$, contradicting the definition of the ceiling function. The above argument contradicts our initial hypothesis that $\left| \bigcup_{i=1}^k A_i \right| = n$. $\tag*{$\Box$}$

Example 11.6. Suppose you have large numbers of three colours of socks in a drawer and you're pulling socks out without looking. How many socks do you need to pull out before you're guaranteed to have a pair of the same colour? The answer is four: if we view each colour as a set, all we need is at least one more sock than there are sets.

Now, suppose you are a crab. Crabs have two claws and eight legs, so they obviously need eight socks (this is different from octopuses, who have eight limbs, but it is not immediately clear whether these should be considered arms or legs). How many socks do you need to pull out before you're guaranteed eight socks of the same colour? The pigeonhole principle says we need $n$ with $\left\lceil \frac n 3 \right\rceil = 8$, which means $n \gt 21$. Approaching it from another perspective, in the worst case, our crab ends up with 7 socks of each colour, for 21 socks in total; the 22nd sock must then give it eight of one colour.
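In general, the smallest $n$ with $\left\lceil \frac n k \right\rceil \geq m$ is $k(m-1)+1$: one more than the worst case of $m-1$ socks in each of the $k$ colours. A tiny Python sketch (the function name `socks_needed` is ours, just for illustration):

```python
def socks_needed(colours, matching):
    """Fewest pulls guaranteeing `matching` socks of a single colour: survive
    the worst case of (matching - 1) socks per colour, then pull one more."""
    return colours * (matching - 1) + 1

print(socks_needed(3, 2))  # 4  -- the two-legged case
print(socks_needed(3, 8))  # 22 -- the crab case
```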

We can, of course, ask the same question of spiders, caterpillars, centipedes, and so on.

Theorem 11.7. Let $a_1, a_2, \dots, a_{n+1}$ be positive integers with $a_i \leq 2n$ for $1 \leq i \leq n+1$. Then there exist $j \neq k$ such that $a_j \mid a_k$.

Proof. By factoring out the largest power of 2, we can write each $a_i = 2^{k_i} \cdot m_i$, where $m_i$ is odd; that is, $m_i$ is the greatest odd divisor of $a_i$. Since each $a_i \leq 2n$ and there are only $n$ odd numbers less than $2n$, by the pigeonhole principle there are two numbers $a_k$ and $a_\ell$ such that $m_k = m_\ell$. Without loss of generality, we take $a_\ell \leq a_k$, so $k_\ell \leq k_k$ and therefore $a_\ell \mid a_k$. $\tag*{$\Box$}$
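The proof is constructive, and we can watch it work. The following Python sketch (the helper name `divisible_pair` is ours) groups a random selection of $n+1$ numbers from $\{1,\dots,2n\}$ by greatest odd divisor and reports a pair where one divides the other:

```python
import random

def divisible_pair(nums):
    """Group numbers by greatest odd divisor; a repeated odd part yields a
    pair where the smaller number divides the larger, as in Theorem 11.7."""
    seen = {}  # greatest odd divisor -> a number with that odd part
    for a in nums:
        m = a
        while m % 2 == 0:  # strip factors of 2 to get the odd part
            m //= 2
        if m in seen:
            return tuple(sorted((seen[m], a)))
        seen[m] = a

n = 10
sample = random.sample(range(1, 2 * n + 1), n + 1)  # n+1 numbers from {1,...,2n}
print(sample, divisible_pair(sample))  # the theorem guarantees a pair exists
```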

Theorem 11.8. Every sequence of $n^2+1$ distinct numbers must have a strictly monotonic subsequence of length at least $n+1$.

By monotonic, we mean that the subsequence is either strictly increasing or strictly decreasing.

Proof. Let $a_1, a_2, \dots, a_k$ be our sequence, with $k = n^2+1$. Then for each term $a_j$, we associate a pair $(i_j, d_j)$, where $i_j$ is the length of the longest strictly increasing subsequence that begins with $a_j$ and $d_j$ is the length of the longest strictly decreasing subsequence that begins with $a_j$.

Suppose that there are no strictly monotonic subsequences of length greater than $n$. Then every $i_j$ and $d_j$ is at most $n$. In other words, $(i_j,d_j) \in \{1,\dots,n\} \times \{1,\dots,n\}$ and so there are $n^2$ possible different pairs. Since $k = n^2+1$, by the pigeonhole principle, there are at least two pairs that are the same.

Let $s \lt t$ be indices such that $(i_s,d_s) = (i_t,d_t)$. Since the terms of the sequence are distinct, we have either $a_s \lt a_t$ or $a_s \gt a_t$. Suppose $a_s \lt a_t$. There is a strictly increasing subsequence of length $i_t$ beginning with $a_t$, and since $s \lt t$ and $a_s \lt a_t$, we can prepend $a_s$ to it to construct a strictly increasing subsequence of length $i_t + 1$ beginning with $a_s$. This gives us $i_s \geq i_t + 1 \gt i_t$, contradicting $i_s = i_t$. A similar argument using $d_s = d_t$ handles the case $a_s \gt a_t$. $\tag*{$\Box$}$
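We can also test the theorem empirically. The sketch below (a simple $O(k^2)$ dynamic program, not the fastest way to compute these lengths) draws a random sequence of $n^2+1$ distinct numbers and checks that some strictly monotonic subsequence has length at least $n+1$:

```python
import random

def longest_increasing(seq):
    """Length of the longest strictly increasing subsequence (O(k^2) DP)."""
    best = [1] * len(seq)  # best[j] = longest increasing subsequence ending at j
    for j in range(len(seq)):
        for i in range(j):
            if seq[i] < seq[j]:
                best[j] = max(best[j], best[i] + 1)
    return max(best)

n = 5
seq = random.sample(range(10000), n * n + 1)  # n^2 + 1 distinct numbers
inc = longest_increasing(seq)
dec = longest_increasing([-x for x in seq])  # decreasing = increasing of negation
assert max(inc, dec) >= n + 1  # guaranteed by Theorem 11.8
print(inc, dec)
```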

The following is a result from Ramsey Theory. Ramsey Theory is concerned with combinatorial questions that concern how large certain structures have to be before they are guaranteed to have some property. Ramsey was a British mathematician from the early 20th century who passed away very young, at the age of 26 in 1930.

Rosen states the following as a group where people are either friends or enemies. This is not quite accurate: just because you're not friends with someone doesn't mean you're necessarily enemies (i.e. the Sid Meier's Civilization I model of diplomacy). The more standard analogy is about people at a party and whether they know each other or not.

Example 11.9. Given any group of six people at a party, either three are mutual friends or three are mutual strangers. Consider Alice. Either Alice does or does not know each of the five others, so by the pigeonhole principle, she either knows at least three of the others or doesn't know at least three of the others. We'll consider the case where she knows three other people, since the case where she doesn't is similar. So she knows Bob, Charlie, and Eve. There are two possibilities: if any two of Bob, Charlie, and Eve know each other, then those two together with Alice are three mutual friends; otherwise, no two of them know each other, and Bob, Charlie, and Eve are three mutual strangers.

All of this can be stated more formally using graphs and cliques and colourings, but the party setting allows us to defer defining all of these things until we actually get to graph theory. Note that this argument hinges on the fact that there are at least five other people connected to Alice in order for us to apply the pigeonhole principle. If we had fewer (i.e. if we took groups of five or fewer), we wouldn't be able to guarantee either a mutual-friend or mutual-stranger triplet.

We denote by $R(m,n)$ the Ramsey number: the minimum number of guests that must be at our party to guarantee that at least $m$ people all know each other or at least $n$ people all don't know each other.

Theorem 11.10. $R(3,3) \leq 6$.
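For a party of six, this is small enough to verify exhaustively: there are only $2^{15}$ ways to label the 15 pairs as friends/strangers, and for five people we can exhibit a labelling with no mutual-friend or mutual-stranger triple (the pentagon/pentagram colouring from the remark above), which together pin down $R(3,3) = 6$. A brute-force Python sketch:

```python
from itertools import combinations, product

def has_mono_triangle(n, colouring):
    """Does a 2-colouring of K_n's edges contain a monochromatic triangle?"""
    return any(
        colouring[(a, b)] == colouring[(a, c)] == colouring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

# Every one of the 2^15 colourings of K_6's edges has a monochromatic triangle.
edges6 = list(combinations(range(6), 2))
assert all(
    has_mono_triangle(6, dict(zip(edges6, colours)))
    for colours in product([0, 1], repeat=len(edges6))
)

# But K_5 has a colouring with none: make the edges of a 5-cycle "friends"
# and the diagonals "strangers". Hence six guests are needed: R(3,3) = 6.
cycle = {(i, (i + 1) % 5) for i in range(5)}
colouring5 = {
    (a, b): int((a, b) in cycle or (b, a) in cycle)
    for a, b in combinations(range(5), 2)
}
assert not has_mono_triangle(5, colouring5)
```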

Probability

Now that we have spent some time learning how to enumerate outcomes, we can start to think about the likelihood of particular outcomes. This is where probability theory comes into play. In particular, we are interested in discrete probability, which deals with countably many outcomes and is approached using many of the tools that we've been developing. The other branch of probability theory deals with uncountably many outcomes and uses methods from calculus and analysis; where we can get away with summation, continuous probability theory involves a lot of integration.

At first glance, even though we're primarily concerned with discrete probability, it seems a bit strange that we'd try to stuff probability theory into this course which is already chock full of other things to cover. But probability and randomness play an important role in computer science and particularly in theoretical computer science. Probabilistic methods are an important part of analysis of algorithms and algorithm design. There are some problems for which we may not know of an efficient deterministic algorithm, but we can come up with a provably efficient randomized algorithm. To tackle these problems, we need a basic understanding of discrete probability.

Definition 12.1. A probability space $(\Omega,\Pr)$ is a finite set $\Omega \neq \emptyset$ together with a function $\Pr : \Omega \to \mathbb R^+$ such that

  1. $\forall \omega \in \Omega, \Pr(\omega) \gt 0$,
  2. $\sum_{\omega \in \Omega} \Pr(\omega) = 1$.

We say that $\Omega$ is the sample space and $\Pr$ is the probability distribution. The above two conditions are known as the probability axioms and are due to Andrey Kolmogorov in the 1930s. Kolmogorov was a Soviet mathematician who was particularly influential in computer science through his development of algorithmic information theory in the 1960s. It might be a bit surprising to learn (as I did) that the formalization of basic probability concepts didn't come about until the 20th century, despite there being analyses of games of chance dating back to 17th-century France.

An event is a subset $A \subseteq \Omega$ and $\Pr : \mathcal P(\Omega) \to \mathbb R$ defined by $\Pr(A) = \sum_{\omega \in A} \Pr(\omega)$ is the probability of $A$. The atomic events are singleton sets $\{\omega\}$ with $\Pr(\{\omega\}) = \Pr(\omega)$ and the trivial events are events with probability 0 ($\emptyset$) or 1 ($\Omega$).

Example 12.2. A sample space for flipping a coin, recording whether it lands on heads or tails, would be $\{H,T\}$. A sample space for the outcome of a die roll would be $\{1,2,3,4,5,6\}$. An event can be something simple, like a coin flip landing on heads, in which case the event is the set $\{H\}$. It can also be more complicated, like a die roll landing on a perfect square, which would be the set $\{1,4\}$, or on a positive divisor of 6, which would be the set $\{1,2,3,6\}$.

Definition 12.3. The uniform distribution over a sample space $\Omega$ defines $\Pr(\omega) = \frac 1 {|\Omega|}$ for all $\omega \in \Omega$. Thus, $\Pr(A) = \frac{|A|}{|\Omega|}$.
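To make Definition 12.3 concrete, here is a tiny Python sketch computing the probabilities of the die events from Example 12.2 under the uniform distribution (using `fractions.Fraction` to keep the values exact):

```python
from fractions import Fraction

# A fair die: the uniform distribution over six outcomes, Pr(A) = |A| / |Omega|.
omega = {1, 2, 3, 4, 5, 6}

def pr(event):
    return Fraction(len(event), len(omega))

print(pr({1, 4}))        # 1/3 -- the roll is a perfect square
print(pr({1, 2, 3, 6}))  # 2/3 -- the roll is a divisor of 6
```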