20: Probability - Convergence Modes, LLN & CLT
Practice session
🎲 Convergence, Law of Large Numbers, and Central Limit Theorem
Convergence Modes
01 Convergence modes: The vanishing spike
Define \(X_n\) as follows: with probability \(\frac{1}{n}\), \(X_n = n\); otherwise \(X_n = 0\).
- Compute \(E[X_n]\) and \(\text{Var}[X_n]\). What happens as \(n \to \infty\)?
- Show that \(X_n \xrightarrow{P} 0\) (converges in probability to 0) by computing \(P[|X_n| > \epsilon]\) for any \(\epsilon > 0\).
- Does \(X_n \to 0\) almost surely (take the \(X_n\) to be independent)? Hint: consider \(\sum_{n=1}^{\infty} P[X_n \neq 0]\) and use the Borel-Cantelli intuition: if the events \(\{X_n \neq 0\}\) happen “infinitely often,” then a.s. convergence to 0 fails.
- Explain in 2-3 sentences: why can a sequence have vanishing probability of being far from 0, yet still “spike” infinitely often with probability 1?
a) Mean and variance. \(X_n\) takes value \(n\) with probability \(1/n\) and \(0\) with probability \(1 - 1/n\), so
\[\mathbb{E}[X_n] = n \cdot \tfrac{1}{n} + 0 \cdot (1 - \tfrac{1}{n}) = 1, \qquad \mathbb{E}[X_n^2] = n^2 \cdot \tfrac{1}{n} = n,\]
hence \(\operatorname{Var}(X_n) = n - 1 \to \infty\). So the mean stays pinned at 1 forever while the variance blows up — the entire “mass that matters” sits in a rarer and rarer but taller and taller spike.
b) Convergence in probability. Pick any \(\epsilon > 0\). For \(n > \epsilon\) the event \(\{|X_n| > \epsilon\}\) is exactly \(\{X_n = n\}\), so
\[P[|X_n| > \epsilon] = \tfrac{1}{n} \xrightarrow{n \to \infty} 0,\]
which is the definition of \(X_n \xrightarrow{P} 0\). Note this also gives \(\mathbb{E}|X_n - 0| = 1 \not\to 0\), so \(X_n\) does not converge in \(L^1\) even though it converges in probability — the spike carries unit mean to infinity.
c) Almost sure convergence fails. Let \(A_n = \{X_n \neq 0\}\), so \(P[A_n] = 1/n\). If the \(X_n\) are independent, then \(\sum P[A_n] = \sum 1/n = \infty\), and the second Borel-Cantelli lemma gives
\[P[A_n \text{ occurs i.o.}] = 1.\]
So with probability 1 there are infinitely many \(n\) with \(X_n = n\), and along that subsequence \(X_n\) shoots off to \(\infty\), not 0. Therefore \(X_n \not\xrightarrow{a.s.} 0\).
d) Intuition. “In probability” is a statement about each fixed time slice — at any single large \(n\), the chance of seeing the spike is tiny. “Almost surely” is a statement about whole trajectories — and even if each individual spike is rare, summing rare-but-not-rare-enough probabilities over infinitely many trials guarantees infinitely many spikes occur. Vanishing per-step probability is not the same as a finite total probability budget.
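A quick simulation makes the split concrete. Below is a minimal sketch (seed and horizon are arbitrary choices) that simulates one trajectory of independent \(X_n\), as assumed in part (c): at any fixed large \(n\) the chance of a spike is tiny, yet spikes keep arriving because \(\sum 1/n\) diverges.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
N = 100_000                     # horizon: simulate X_1, ..., X_N on one trajectory

n = np.arange(1, N + 1)
spike = rng.random(N) < 1.0 / n        # X_n = n exactly when this is True

print("spikes observed:", spike.sum())       # about log N ~ 12 in expectation
print("latest spike at n =", n[spike].max())
```

Typical runs show a dozen or so spikes, with the latest one often deep into the tens of thousands: convergence in probability at every fixed \(n\), but no trajectory-wise convergence.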
📖 The Borel-Cantelli Lemmas: Infinite Monkeys and Rare Events
The Borel-Cantelli lemmas tell us when rare events happen “infinitely often” (abbreviated i.o.).
Setup: You have infinitely many events \(A_1, A_2, A_3, \ldots\) (like flipping coins forever, or waiting for earthquakes).
Question: What’s the probability that infinitely many of these events occur?
First Borel-Cantelli Lemma (The “Convergent Sum” Lemma)
Statement: If \(\sum_{n=1}^{\infty} P[A_n] < \infty\) (the probabilities sum to a finite number), then \[P[\text{infinitely many } A_n \text{ occur}] = 0.\]
Intuition: If the total probability budget across all events is finite, you can’t afford infinitely many events to happen. Almost surely, only finitely many occur.
Example: Let \(A_n\) = “you win the lottery on day \(n\)” with \(P[A_n] = \frac{1}{n^2}\).
- \(\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty\) (converges!)
- So with probability 1, you win the lottery only finitely many times (sad news for infinite wealth dreams).
Non-example: Let \(A_n\) = “you get heads on flip \(n\)” with \(P[A_n] = \frac{1}{2}\).
- \(\sum_{n=1}^{\infty} \frac{1}{2} = \infty\) (diverges!)
- The first lemma says nothing here. (Hint: the second lemma will help!)
Second Borel-Cantelli Lemma (The “Independent Events” Lemma)
Statement: If the events \(A_n\) are independent and \(\sum_{n=1}^{\infty} P[A_n] = \infty\) (diverges), then \[P[\text{infinitely many } A_n \text{ occur}] = 1.\]
Intuition: If events are independent and the total probability is infinite, you have an unlimited budget—infinitely many events must happen almost surely.
Example continued: Coin flips with \(P[\text{Heads}] = \frac{1}{2}\) are independent and \(\sum \frac{1}{2} = \infty\).
- So you get heads infinitely often with probability 1. (This is obvious intuitively, but BC makes it rigorous!)
Another example: Toss a fair coin. Let \(A_n\) = “you get 10 heads in a row starting at flip \(n\).”
- \(P[A_n] = \frac{1}{2^{10}} = \frac{1}{1024}\)
- \(\sum_{n=1}^{\infty} \frac{1}{1024} = \infty\)
- The events aren’t independent (runs starting at nearby positions overlap), but restricting to disjoint blocks (flips \(1\)–\(10\), \(11\)–\(20\), \(21\)–\(30\), …) gives genuinely independent events, each with probability \(\frac{1}{1024}\), and BC2 applied to those shows: you will see 10 heads in a row infinitely often, almost surely.
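The two regimes are easy to see numerically. Here is a minimal sketch (seed and horizon are illustrative) simulating independent events with \(P[A_n] = 1/n^2\) (summable, BC1) versus \(P[A_n] = 1/n\) (divergent, BC2):

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
N = 200_000
n = np.arange(1, N + 1)

bc1 = rng.random(N) < 1.0 / n**2   # summable: expect ~ pi^2/6 ≈ 1.6 events total
bc2 = rng.random(N) < 1.0 / n      # divergent: expect ~ log N ≈ 12 events, forever

print("1/n^2 events:", bc1.sum(), " latest at n =", n[bc1].max())
print("1/n   events:", bc2.sum(), " latest at n =", n[bc2].max())
```

Rerunning with a larger \(N\) leaves the \(1/n^2\) count essentially frozen while the \(1/n\) count keeps creeping up: the finite versus infinite probability budget in action.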
Combining the Two: The Traffic Light of Convergence
| \(\sum P[A_n]\) | Events independent? | What happens i.o.? |
|---|---|---|
| \(< \infty\) | Doesn’t matter | Never (a.s.) |
| \(= \infty\) | ❌ No | Unknown (need more info) |
| \(= \infty\) | ✅ Yes | Always (a.s.) |
Key insight for convergence:
- To show \(X_n \to 0\) almost surely, you often define \(A_n = \{|X_n| > \epsilon\}\) and show \(\sum P[A_n] < \infty\) (so “being far away” happens only finitely often).
- If \(\sum P[A_n] = \infty\) and the events are independent, a.s. convergence fails (this is exactly Problem 01); without independence, the divergent sum alone decides nothing.
The Infinite Monkey Theorem
Claim: A monkey typing randomly will eventually type the complete works of Shakespeare, almost surely.
Proof sketch:
- Let \(A_n\) = “the monkey types ‘To be or not to be’ starting at position \(n\).”
- \(P[A_n] = (1/27)^{18} \approx 10^{-26}\) (assuming a 27-key typewriter: 26 letters plus a space bar, since the 18-character phrase contains spaces)
- \(\sum P[A_n] = \infty\) (sum of a positive constant diverges)
- The events \(A_n\) overlap just like the coin runs above, but the same disjoint-block trick applies, and BC2 says the phrase appears infinitely often!
- Same argument works for the full text of Hamlet (just a longer string with smaller \(P[A_n]\), but still divergent sum).
Reality check: The universe might end before this happens. Borel-Cantelli guarantees it in the limit, not in your lifetime!
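A scaled-down monkey is feasible. The sketch below (the alphabet, target string, and keystroke count are toy choices, not part of the original claim) uses a 4-letter alphabet and a 3-letter target so the expected wait is short enough to simulate:

```python
import numpy as np

rng = np.random.default_rng(2)          # arbitrary seed
alphabet = list("abcd")                 # toy 4-key typewriter
target = "abc"                          # toy stand-in for the Shakespeare phrase
N = 1_000_000                           # keystrokes

text = "".join(rng.choice(alphabet, size=N))
# "abc" cannot overlap itself, so the non-overlapping count is exact here.
print("occurrences:", text.count(target))   # expect about N / 4^3 ≈ 15,625
```

Scaling the same estimate up, a 27-key typist hits an 18-character phrase about once every \(27^{18} \approx 10^{26}\) keystrokes, which is why Borel-Cantelli’s “infinitely often” is a statement about the limit, not about your lifetime.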
Law of Large Numbers
02 Sample mean convergence: Visualizing LLN
You roll a fair die repeatedly and compute the running average \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\).
- What is \(E[X_i]\) and \(\text{Var}[X_i]\) for a single roll?
- Compute \(E[\bar{X}_n]\) and \(\text{Var}[\bar{X}_n]\). What happens to the variance as \(n\) increases?
- Use Chebyshev’s inequality to bound \(P[|\bar{X}_n - 3.5| > 0.5]\). For what \(n\) is this probability less than 0.05?
- Sketch (or describe) how you’d expect a plot of \(\bar{X}_n\) vs \(n\) to look for \(n = 1\) to \(n = 100\).
a) Single roll. \(X_i\) is uniform on \(\{1, 2, 3, 4, 5, 6\}\), so
\[\mathbb{E}[X_i] = \tfrac{1+2+3+4+5+6}{6} = 3.5, \qquad \operatorname{Var}(X_i) = \tfrac{1}{6}\sum_{k=1}^6 (k - 3.5)^2 = \tfrac{35}{12} \approx 2.92.\]
b) Sample mean. By linearity and independence, \(\mathbb{E}[\bar{X}_n] = 3.5\) and \(\operatorname{Var}(\bar{X}_n) = \operatorname{Var}(X_i)/n = \tfrac{35}{12n}\). Variance shrinks like \(1/n\) — the typical deviation of \(\bar{X}_n\) from \(3.5\) is on the order of \(\sqrt{35/(12n)}\).
c) Chebyshev. With \(\epsilon = 0.5\),
\[P[|\bar{X}_n - 3.5| > 0.5] \le \frac{\operatorname{Var}(\bar{X}_n)}{0.5^2} = \frac{35/(12n)}{1/4} = \frac{35}{3n}.\]
Setting \(\tfrac{35}{3n} < 0.05\) gives \(n > 35/(3 \cdot 0.05) \approx 233.3\), so \(n \ge 234\) suffices (Chebyshev is loose; in reality far fewer rolls would do).
d) Plot description. Early on (\(n = 1, 2, \ldots, 10\)) the running average bounces wildly between 1 and 6. As \(n\) grows, the curve tightens around the horizontal line \(y = 3.5\), with fluctuations shrinking like \(1/\sqrt{n}\) — by \(n = 100\) you’d typically be within roughly \(\pm 0.17\) of \(3.5\). This is exactly LLN in action.
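A minimal plotting sketch (seed and layout are arbitrary) for the picture described in part (d), plus an empirical check of the Chebyshev bound from part (c):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)  # arbitrary seed

# Running average of 100 die rolls, as in part (d).
rolls = rng.integers(1, 7, size=100)                   # fair die: 1..6
n = np.arange(1, 101)
plt.plot(n, np.cumsum(rolls) / n, label=r"$\bar{X}_n$")
plt.axhline(3.5, color="red", linestyle="--", label=r"$E[X_i] = 3.5$")
plt.xlabel("n"); plt.legend(); plt.show()

# Chebyshev check at n = 234: the bound promises < 0.05; the truth is far smaller.
means = rng.integers(1, 7, size=(10_000, 234)).mean(axis=1)
print("empirical P[|mean - 3.5| > 0.5]:", (np.abs(means - 3.5) > 0.5).mean())
```

The empirical probability comes out at essentially 0 (a normal approximation puts the true value around \(10^{-5}\)), confirming how conservative Chebyshev’s \(0.05\) guarantee is.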
03 When LLN fails: Cauchy counterexample
The Cauchy distribution has PDF \(f(x) = \frac{1}{\pi(1 + x^2)}\) for \(x \in \mathbb{R}\).
- Without computing, explain why \(E[X]\) doesn’t exist for \(X \sim \text{Cauchy}\) by considering the integral \(\int_{-\infty}^{\infty} x \cdot \frac{1}{\pi(1+x^2)} dx\).
- Let \(X_1, \ldots, X_n\) be i.i.d. Cauchy. The sample mean is \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\). Does \(\bar{X}_n\) converge to anything as \(n \to \infty\)?
- Fact: \(\bar{X}_n\) itself has the same Cauchy distribution for all \(n\). Explain why this violates the LLN, and what condition of the LLN is not satisfied.
- Simulate (or imagine) 1000 samples from Cauchy and compute their mean. Would you expect a “tight” estimate or wild variability? Why is averaging useless here?
a) Why \(\mathbb{E}[X]\) does not exist. For large \(|x|\), \(f(x) = \tfrac{1}{\pi(1+x^2)} \sim \tfrac{1}{\pi x^2}\), so
\[\int_{-\infty}^{\infty} |x| \cdot \frac{1}{\pi(1+x^2)}\, dx \sim \int \frac{|x|}{\pi x^2}\, dx = \int \frac{1}{\pi |x|}\, dx,\]
which diverges (logarithmically) at both \(\pm \infty\). Since \(\int |x| f(x)\, dx = \infty\), the expectation is undefined — the standard fix of writing it as a “principal value” \(\lim_{R \to \infty} \int_{-R}^R x f(x)\, dx = 0\) is not the Lebesgue expectation; the positive and negative halves are each individually infinite, so canceling them by symmetry is illegal.
b) Does \(\bar{X}_n\) converge? No. There is no candidate limit because the population mean does not exist, and as part (c) shows the distribution of \(\bar{X}_n\) does not concentrate at all.
c) Why this violates LLN. The (weak/strong) law of large numbers requires \(\mathbb{E}|X_i| < \infty\). Cauchy violates exactly that hypothesis. The remarkable fact that \(\bar{X}_n \sim \text{Cauchy}(0, 1)\) for every \(n\) shows the failure quantitatively: averaging \(n\) Cauchys gives the same distribution back, no concentration whatsoever. So \(\operatorname{Var}(\bar{X}_n)\) does not shrink (it is undefined), and Chebyshev’s inequality — the workhorse of the LLN proof — has nothing to say.
Deeper insight (characteristic functions). The characteristic function of standard Cauchy is \(\varphi(t) = e^{-|t|}\). By independence,
\[\varphi_{\bar{X}_n}(t) = \varphi_{X_1}(t/n)^n = \left(e^{-|t|/n}\right)^n = e^{-|t|},\]
which is again the Cauchy characteristic function. So \(\bar{X}_n \stackrel{d}{=} X_1\) for every \(n\). Compare this to the Normal case: if \(X_i \sim N(0, 1)\), then \(\bar{X}_n \sim N(0, 1/n)\) — variance shrinks like \(1/n\) and the distribution collapses to the point mass at 0. For Cauchy, the heavy tails \(f(x) \sim 1/x^2\) are so fat that the occasional huge sample dominates the sum, no matter how many samples you average.
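This distributional identity is easy to verify numerically. A minimal check (seed and sample sizes are arbitrary) runs a Kolmogorov-Smirnov test of averages of 100 Cauchys against the standard Cauchy CDF:

```python
import numpy as np
from scipy.stats import kstest, cauchy

rng = np.random.default_rng(6)  # arbitrary seed
means = rng.standard_cauchy((10_000, 100)).mean(axis=1)  # 10,000 averages of n = 100
# p-value is not small: the averages are consistent with being exactly Cauchy(0, 1).
print(kstest(means, cauchy.cdf))
```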
d) Practical experiment. With 1000 Cauchy samples, you would get wild variability — extreme outliers (say, \(|x_i| > 1000\)) appear regularly and singlehandedly drag the sample mean far from 0. Repeat the experiment several times and you’ll see sample means like \(-3.2\), \(+47\), \(-0.4\), \(+12\), with no convergence pattern. Averaging is useless here; the sample median is a far more robust estimator (and it does converge, since the median is well-defined even for heavy-tailed distributions).
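A minimal sketch (seed and trial count arbitrary) of the experiment in part (d), comparing sample mean and sample median across repeated draws:

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
for trial in range(5):
    x = rng.standard_cauchy(1000)
    print(f"trial {trial}: mean = {np.mean(x):10.2f}   median = {np.median(x):7.3f}")
```

Typical output shows medians tightly clustered near 0 while the means jump by orders of magnitude from run to run, since a single \(|x_i|\) in the thousands can dominate the entire sum.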
Central Limit Theorem
04 CLT in Python: Simulating the magic
In this problem, you’ll write Python code to demonstrate the Central Limit Theorem empirically.
Setup: Let \(X_i \sim \text{Exp}(1)\) (exponential distribution with rate 1, so \(E[X_i] = 1\), \(\text{Var}[X_i] = 1\)).
Tasks:
- Generate data: For each sample size \(n \in \{1, 5, 10, 30, 50\}\):
- Simulate \(N = 10000\) sample means \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\) (each mean computed from \(n\) i.i.d. \(\text{Exp}(1)\) random variables).
- Store these 10000 sample means.
- Standardize: Compute the standardized version: \[Z_n = \frac{\bar{X}_n - 1}{\sqrt{1/n}} = \sqrt{n} \cdot (\bar{X}_n - 1).\] For each \(n\), compute this for all 10000 samples.
- Visualize: Create a figure with 5 subplots (one for each \(n\)):
- Plot a histogram of the 10000 standardized values \(Z_n\).
- Overlay the standard normal PDF \(N(0, 1)\) as a smooth curve.
- Add a title indicating the sample size \(n\).
- Interpret:
- For which values of \(n\) does the histogram closely match the normal curve?
- The original \(\text{Exp}(1)\) distribution is highly skewed (right tail). Explain in 2-3 sentences why the CLT “fixes” this skewness as \(n\) grows.
Bonus: Repeat for \(X_i \sim \text{Bernoulli}(0.1)\) (highly skewed discrete distribution). Does CLT work faster or slower than for exponential? Why?
Python hints:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# For generating exponentials: np.random.exponential(scale=1, size=n)
# For the standard normal PDF: norm.pdf(x, 0, 1)
```
What you should observe. For \(X_i \sim \text{Exp}(1)\):
- \(n = 1\): histogram is just (a shifted/scaled copy of) the original \(\text{Exp}(1)\) — sharply right-skewed, with a long tail to the right and a hard left edge near \(-1\) (since \(\bar{X}_1 \ge 0\) and \(Z_1 = \bar{X}_1 - 1\)).
- \(n = 5\): still visibly skewed, but the peak has moved toward 0 and the left tail is filling in.
- \(n = 10\): starting to look bell-shaped, with mild residual right skew.
- \(n = 30\): histogram tracks \(N(0, 1)\) closely — the standard rule of thumb “\(n \ge 30\)” comes from exactly this kind of picture.
- \(n = 50\): nearly indistinguishable from the standard normal curve.
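A complete sketch along the lines of the hints above (seed, bin count, and figure layout are stylistic choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(5)          # arbitrary seed
N = 10_000                              # sample means per panel
grid = np.linspace(-4, 4, 400)

fig, axes = plt.subplots(1, 5, figsize=(18, 3), sharey=True)
for ax, n in zip(axes, [1, 5, 10, 30, 50]):
    means = rng.exponential(scale=1.0, size=(N, n)).mean(axis=1)
    z = np.sqrt(n) * (means - 1.0)      # standardize: E[X_i] = 1, Var = 1/n
    ax.hist(z, bins=60, density=True, alpha=0.6)
    ax.plot(grid, norm.pdf(grid), "r")  # overlay N(0, 1)
    ax.set_title(f"n = {n}")
plt.show()

# Bonus: swap in rng.binomial(1, 0.1, size=(N, n)) and standardize with
# z = sqrt(n) * (means - 0.1) / 0.3, since sqrt(p(1-p)) = 0.3 for p = 0.1.
```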
Why CLT “fixes” skewness. For i.i.d. \(X_i\) with mean \(\mu = 1\) and variance \(\sigma^2 = 1\), the CLT says
\[Z_n = \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{d} N(0, 1).\]
The standardization by \(\sqrt{n}\) is exactly the right zoom to keep \(\operatorname{Var}(Z_n) = 1\). As \(n\) grows, every i.i.d. sum becomes “smoothed out” — the skewness of \(Z_n\) scales like \(\text{skew}(X_i)/\sqrt{n}\) (and the excess kurtosis like \(1/n\)), so all higher-order asymmetries wash out at the predictable rate \(1/\sqrt{n}\). Only the first two moments survive in the limit, which is why every distribution with finite mean and variance ends up at the same Gaussian shape.
Conditions to remember. CLT requires (i) i.i.d. samples, (ii) finite mean, (iii) finite variance. Drop (iii) and you get stable laws (Cauchy, problem 03 above) instead of a Gaussian. Drop the i.i.d. assumption and you need stronger conditions (Lindeberg, martingale CLTs).
Bonus — Bernoulli(0.1). This distribution is much more skewed (skewness \(= (1-2p)/\sqrt{p(1-p)} \approx 2.67\) vs. \(2\) for Exp(1)), so CLT kicks in slower. A common rule is \(np \ge 10\) and \(n(1-p) \ge 10\), which here means \(n \ge 100\) before the normal approximation is trustworthy. With \(p = 0.1\) and small \(n\), the histogram of \(Z_n\) will look discrete and lumpy (only a few possible sum values) before it eventually smooths into a bell curve.