19: Probability - Distributions


📚 Materials


Lecture

Recitation

Note to self

Poke around https://mathlets.org/mathlets/probability-distributions/ during the recitation.

Distribution Identification

01 Distribution detective: Which one fits?

Match each scenario to the most appropriate distribution. Justify each choice in one sentence.

    1. The number of typos on a randomly selected page of a 500-page book, if typos occur randomly at an average rate of 0.5 per page.
    1. Whether a randomly selected email is spam (yes/no), given 40% of emails are spam.
    1. The number of heads in 20 coin flips.
    1. The exact time (in minutes) you wait for the next bus, if buses arrive completely randomly at an average rate of 4 per hour.
    1. A randomly chosen real number between 0 and 10.
  • a) Poisson. Counts of rare independent events with a known average rate per fixed unit. Here \(\lambda = 0.5\) typos per page.
  • b) Bernoulli. A single yes/no trial. \(X = 1\) if spam, \(0\) otherwise, with \(p = 0.4\).
  • c) Binomial. A fixed number \(n = 20\) of independent Bernoulli trials with the same success probability \(p = 0.5\). \(X =\) number of heads.
  • d) Exponential. Continuous waiting time for the next event in a Poisson process. With \(\lambda = 4\) per hour, \(X \sim \text{Exp}(4)\) (in hours) or equivalently \(\text{Exp}(4/60)\) in minutes.
  • e) Uniform. A continuous variable equally likely to land anywhere in a fixed interval: \(X \sim U(0, 10)\).

Pattern to remember. Discrete count of rare events \(\to\) Poisson; fixed \(n\) trials \(\to\) Binomial; one trial \(\to\) Bernoulli; continuous waiting time \(\to\) Exponential; “no value more likely than another” on an interval \(\to\) Uniform.


02 Name that distribution

For each scenario, identify the distribution, state its parameter(s), and write the PMF or PDF.

    1. A call center receives calls at an average rate of 8 per hour. Let \(X\) be the number of calls received between 2:00 PM and 3:00 PM.
    1. A software update crashes with probability 0.03. An IT department pushes the update to 200 computers independently. Let \(Y\) be the number of computers that crash.
    1. A sensor measures temperature continuously, but due to manufacturing imprecision, the true reading is somewhere between 98.5°C and 101.5°C with no value more likely than another. Let \(T\) be the measured temperature.
    1. A quality inspector tests light bulbs one by one. Each bulb independently fails inspection with probability 0.15. Let \(N\) be the number of bulbs tested until the first failure.
    1. The time between earthquakes in a seismically active region averages 4 months. Let \(W\) be the waiting time (in months) until the next earthquake.

a) \(X \sim \text{Poisson}(\lambda = 8)\). PMF:

\[P(X = k) = \frac{e^{-8} 8^k}{k!}, \quad k = 0, 1, 2, \dots\]

b) \(Y \sim \text{Binomial}(n = 200, p = 0.03)\). PMF:

\[P(Y = k) = \binom{200}{k} (0.03)^k (0.97)^{200 - k}, \quad k = 0, 1, \dots, 200\]

c) \(T \sim \text{Uniform}(98.5, 101.5)\). PDF:

\[f_T(t) = \frac{1}{101.5 - 98.5} = \frac{1}{3}, \quad t \in [98.5, 101.5]\]

d) \(N \sim \text{Geometric}(p = 0.15)\) — number of bulbs tested until the first failure. PMF (using the “trials until first success” convention, where a “success” here is a failed inspection):

\[P(N = k) = (0.85)^{k - 1} \cdot 0.15, \quad k = 1, 2, 3, \dots\]

e) \(W \sim \text{Exponential}(\lambda = 1/4)\) months\(^{-1}\) (since \(\mathbb{E}[W] = 1/\lambda = 4\)). PDF:

\[f_W(w) = \tfrac{1}{4} e^{-w/4}, \quad w \geq 0\]

Common pitfall. For the Exponential, \(\lambda\) is a rate (events per unit time), not a mean. Mean is \(1/\lambda\). If a problem gives you the average waiting time, take its reciprocal.
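Quick numeric check. A minimal Python sketch of the rate-vs-mean pitfall, assuming `scipy` is available; note that `scipy.stats.expon` is parameterized by `scale` \(= 1/\lambda\), not by the rate itself.

```python
# Rate vs mean for the Exponential: scipy's expon takes scale = 1/lambda.
from scipy import stats

lam = 1 / 4                      # rate: one earthquake per 4 months on average
W = stats.expon(scale=1 / lam)   # scale = mean waiting time = 4 months

print(W.mean())   # 4.0 -> E[W] = 1/lambda
print(W.sf(4))    # P(W > 4) = e^{-1} ~ 0.368
```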


03 Mystery distributions: Identify from data

A researcher collects data from three different experiments and computes summary statistics:

Dataset A: \(n = 500\) observations, all values are either 0 or 1. Sample mean \(\approx 0.23\), sample variance \(\approx 0.177\).

Dataset B: \(n = 1000\) observations, values range from 0 to 47. Sample mean \(\approx 12.1\), sample variance \(\approx 11.8\).

Dataset C: \(n = 800\) observations, values are positive reals ranging from 0.001 to 14.2. Sample mean \(\approx 2.5\), sample variance \(\approx 6.3\).

For each dataset:

    1. Identify the most likely distribution family.
    1. Estimate the parameter(s) of that distribution from the summary statistics.
    1. For Dataset B, the researcher notices that these are counts of customer complaints per day at a call center. Does this context support your answer? What if instead they were counts of “successes” in 50 independent trials per observation?

The trick: distributions leave fingerprints in their mean–variance relationship.

| Distribution | Mean | Variance | Variance / Mean |
| --- | --- | --- | --- |
| Bernoulli\((p)\) | \(p\) | \(p(1-p)\) | \(1 - p\) |
| Binomial\((n, p)\) | \(np\) | \(np(1-p)\) | \(1 - p\) |
| Poisson\((\lambda)\) | \(\lambda\) | \(\lambda\) | \(1\) |
| Exponential\((\lambda)\) | \(1/\lambda\) | \(1/\lambda^2\) | \(1/\lambda\) (but \(\text{Var}/\text{Mean}^2 = 1\)) |

Dataset A — Bernoulli.

Every observation is \(0\) or \(1\), which rules everything else out. Estimate \(\hat p = \bar X = 0.23\).

Sanity check: \(\hat p (1 - \hat p) = 0.23 \cdot 0.77 = 0.1771\), matches the reported variance \(0.177\). Good.

Dataset B — Poisson.

Counts \(\{0, 1, 2, \dots, 47\}\), mean \(\approx\) variance (\(12.1\) vs \(11.8\)). The Poisson signature is precisely \(\text{mean} = \text{variance}\). Estimate \(\hat\lambda = 12.1\).

A Binomial would also have non-negative integer values, but its variance is \(np(1-p) < np\) — strictly smaller than the mean. The fact that variance \(\approx\) mean here, not \(\ll\) mean, points to Poisson.

Dataset C — Exponential.

Continuous, positive, right-skewed (range starts near 0, mean 2.5, max 14.2 — much further above the mean than below). The Exponential signature: \(\text{Var} = (\mathbb{E}[X])^2\). Check: \(2.5^2 = 6.25 \approx 6.3\). Estimate \(\hat\lambda = 1/\bar X = 1/2.5 = 0.4\).

c) Customer complaints per day: independent rare events accumulating at a roughly constant rate — exactly the Poisson regime. Context confirms.

If instead it were \(50\) independent trials per day, we’d want \(\text{Bin}(50, p)\) with \(\hat p = 12.1 / 50 = 0.242\). Then variance would be \(50 \cdot 0.242 \cdot 0.758 \approx 9.17\), noticeably less than the observed \(11.8\). So the Binomial interpretation is a worse fit — the data really does look Poisson, not Binomial-with-50-trials.

Why this matters in ML. When you fit a count model and the empirical variance is much larger than the mean, that’s “overdispersion” and a sign Poisson is the wrong model — typical fix is Negative Binomial. Mean-variance ratios are the first thing to check.
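To see the fingerprints directly, here is a small simulation sketch (Python with `numpy`; the datasets below are synthetic stand-ins generated with the estimated parameters, not the researcher’s data, and the seed is arbitrary):

```python
# Synthetic datasets with the same parameters as A, B, C: check var/mean fingerprints.
import numpy as np

rng = np.random.default_rng(0)
A = rng.binomial(1, 0.23, size=500)         # Bernoulli(0.23)
B = rng.poisson(12.1, size=1000)            # Poisson(12.1)
C = rng.exponential(scale=2.5, size=800)    # Exponential with mean 2.5

for name, x in [("A", A), ("B", B), ("C", C)]:
    m, v = x.mean(), x.var(ddof=1)
    print(f"{name}: mean={m:.2f} var={v:.2f} var/mean={v/m:.2f} var/mean^2={v/m**2:.2f}")
```

Expect var/mean \(\approx 1 - p\) for A, \(\approx 1\) for B, and var/mean\(^2 \approx 1\) for C.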


Discrete Distributions

04 The “obvious” Bernoulli that isn’t

A weighted die shows 6 with probability \(\frac{1}{3}\) and each of 1–5 with probability \(\frac{2}{15}\).

    1. Define a Bernoulli random variable \(X\) for “rolling a 6.” State \(p\) and compute \(E[X]\) and \(\text{Var}[X]\).
    1. Define a different Bernoulli random variable \(Y\) for “rolling an even number.” Compute \(E[Y]\) and \(\text{Var}[Y]\).
    1. For which event is the variance larger? Explain intuitively why maximum Bernoulli variance occurs at \(p = 0.5\).

a) \(X = \mathbb{1}\{\text{rolled a } 6\}\), so \(p = 1/3\).

\[\mathbb{E}[X] = p = \tfrac{1}{3}, \quad \text{Var}(X) = p(1 - p) = \tfrac{1}{3} \cdot \tfrac{2}{3} = \tfrac{2}{9}\]

b) \(Y = \mathbb{1}\{\text{rolled even}\}\). Even faces are \(\{2, 4, 6\}\):

\[P(Y = 1) = P(2) + P(4) + P(6) = \tfrac{2}{15} + \tfrac{2}{15} + \tfrac{1}{3} = \tfrac{2}{15} + \tfrac{2}{15} + \tfrac{5}{15} = \tfrac{9}{15} = \tfrac{3}{5}\]

So \(Y \sim \text{Bernoulli}(3/5)\):

\[\mathbb{E}[Y] = \tfrac{3}{5}, \quad \text{Var}(Y) = \tfrac{3}{5} \cdot \tfrac{2}{5} = \tfrac{6}{25} = 0.24\]

c) Compare: \(\text{Var}(X) = 2/9 \approx 0.222\) vs \(\text{Var}(Y) = 0.24\). \(Y\) has larger variance.

The function \(p(1-p)\) on \([0, 1]\) is a downward parabola, maximized at \(p = 0.5\) with value \(0.25\). Since \(p_Y = 0.6\) is closer to \(0.5\) than \(p_X = 1/3\), \(Y\)’s variance is closer to the maximum.

Intuition. A Bernoulli’s variance measures uncertainty about which outcome will happen. If \(p = 0.99\), you’re almost sure it’ll be \(1\) — low uncertainty, low variance. Same for \(p = 0.01\). Maximum confusion is at \(p = 0.5\), where the two outcomes are equally likely.
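The parabola in a few lines of Python (a throwaway sketch):

```python
# Bernoulli variance p(1-p): maximized at p = 0.5.
for p in [0.01, 1 / 3, 0.5, 0.6, 0.99]:
    print(f"p={p:.2f}  var={p * (1 - p):.4f}")
# p=0.33 -> 0.2222, p=0.50 -> 0.2500, p=0.60 -> 0.2400; near 0 or 1 -> ~0
```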


05 Memoryless waiting: Geometric intuition

A slot machine pays out with probability \(p = 0.05\) on each play.

    1. What is the expected number of plays until the first payout?
    1. You’ve already played 50 times with no payout. What is the expected additional number of plays until you win?
    1. A gambler says: “I’m due for a win soon because I’ve lost so many times.” In 2–3 sentences, explain why this reasoning is flawed.

Let \(X\) = number of plays until the first payout. \(X \sim \text{Geometric}(p = 0.05)\).

a) \(\mathbb{E}[X] = 1/p = 1/0.05 = 20\) plays.

b) Still 20 plays. The Geometric distribution is memoryless:

\[P(X > 50 + k \mid X > 50) = P(X > k)\]

Conditional on having lost the first \(50\) plays, the number of additional plays needed is again Geometric\((0.05)\), with the same expected value \(20\).

c) Gambler’s fallacy. The slot machine has no memory — each play is an independent Bernoulli trial with probability \(p = 0.05\), regardless of what came before. The probability of winning on play \(51\) is still \(0.05\), exactly as it was on play \(1\). “Being due” would require the past to push future outcomes, which independence directly forbids. (We’ll see the same reasoning error appear as the prosecutor’s fallacy in Problem 14.)
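A Monte Carlo sketch of (b), in Python with `numpy` (seed arbitrary):

```python
# Memorylessness check: after 50 losses, expected *additional* plays is still ~20.
import numpy as np

rng = np.random.default_rng(1)
plays = rng.geometric(0.05, size=1_000_000)   # plays until first payout, Geometric(0.05)

print(plays.mean())              # ~20 = 1/p
losers = plays[plays > 50]       # runs that lost all of the first 50 plays
print((losers - 50).mean())      # ~20 again: the past 50 losses don't help
```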


06 Binomial: Quality control decision

A factory produces chips with defect probability \(p = 0.02\). A batch of \(n = 100\) chips is inspected.

    1. Let \(X\) be the number of defective chips. State the distribution of \(X\) and compute \(E[X]\) and \(\text{Var}[X]\).
    1. The batch is rejected if more than 5 chips are defective. Without computing \(P[X > 5]\) exactly, explain why \(P[X > 5]\) is small.
    1. If \(p\) increases to \(0.10\), recompute \(E[X]\). How does this change the rejection decision intuitively?

a) \(X \sim \text{Binomial}(n = 100, p = 0.02)\).

\[\mathbb{E}[X] = np = 2, \quad \text{Var}(X) = np(1 - p) = 100 \cdot 0.02 \cdot 0.98 = 1.96\]

So \(\text{SD}(X) \approx 1.4\).

b) Rejection threshold \(X > 5\) is more than \(\frac{5 - 2}{1.4} \approx 2.14\) standard deviations above the mean. By Chebyshev (HW 17), \(P(|X - 2| \geq 3) \leq 1.96/9 \approx 0.22\). Even a loose bound says rejection is uncommon, and the true probability is much smaller (around \(0.016\)).

c) With \(p = 0.10\): \(\mathbb{E}[X] = 100 \cdot 0.10 = 10\). Now the expected number of defectives is already double the rejection threshold. So rejection becomes the rule, not the exception. A \(5\times\) jump in defect rate moves the batch from “almost always pass” to “almost always fail” — quality control is sensitive precisely because the threshold sits in the right tail of the distribution.
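The exact tail probabilities behind (b) and (c), via `scipy` (a quick check, not required for the argument):

```python
# P(X > 5) under the two defect rates.
from scipy import stats

print(stats.binom.sf(5, 100, 0.02))   # ~0.0155: rejection is rare at p = 0.02
print(stats.binom.sf(5, 100, 0.10))   # ~0.942:  rejection is the rule at p = 0.10
```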


07 Poisson: Rare events approximation

A website has 10,000 visitors per day. Each visitor independently has a \(0.0003\) probability of reporting a bug.

    1. Let \(X\) be the number of bug reports per day. Which distribution is a good approximation here, and what is the parameter?
    1. What is the probability of receiving at least one bug report?

a) The exact distribution is \(X \sim \text{Binomial}(n = 10000, p = 0.0003)\). But \(n\) is large and \(p\) is small with \(np = 3\) moderate — that’s the Poisson regime:

\[X \approx \text{Poisson}(\lambda = np = 3)\]

The Poisson approximation works because \(\binom{n}{k} p^k (1-p)^{n-k} \to \frac{e^{-\lambda} \lambda^k}{k!}\) as \(n \to \infty\) with \(\lambda = np\) held fixed. In practice, this is excellent whenever \(n \geq 20\) and \(p \leq 0.05\) or so.

b)

\[P(X \geq 1) = 1 - P(X = 0) = 1 - e^{-3} \approx 1 - 0.0498 \approx 0.9502\]

So about a \(95\%\) chance of at least one bug report on any given day.

Why use Poisson at all? The Binomial PMF involves \(\binom{10000}{k}\), which is unwieldy for hand calculation. Poisson collapses everything into one parameter \(\lambda\) and a clean exponential — much easier to reason about, with negligible error in this regime.
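And numerically the error really is negligible, as a `scipy` sketch shows:

```python
# Exact Binomial vs Poisson approximation for P(at least one bug report).
from scipy import stats

n, p = 10_000, 0.0003
print(stats.binom.sf(0, n, p))       # exact: 1 - (1-p)^n  ~ 0.95024
print(stats.poisson.sf(0, n * p))    # approx: 1 - e^{-3}  ~ 0.95021
```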


Continuous Distributions

08 Exponential: Memoryless lifetimes

A light bulb’s lifetime (in years) follows \(\text{Exp}(\lambda = 0.5)\).

    1. Compute \(E[X]\) and the probability that the bulb lasts more than 3 years.
    1. Given that the bulb has already lasted 2 years, what is the probability it lasts at least 1 more year?
    1. Compare with the discrete case: if bulb failure each year is Bernoulli with \(p = 0.4\), and \(Y \sim \text{Geo}(0.4)\) counts years until failure, compute \(P[Y > 3 \mid Y > 2]\) and \(P[Y > 1]\). What do you notice?

a) \(X \sim \text{Exp}(0.5)\):

\[\mathbb{E}[X] = \tfrac{1}{\lambda} = 2 \text{ years}\]

For \(X \sim \text{Exp}(\lambda)\), \(P(X > t) = e^{-\lambda t}\):

\[P(X > 3) = e^{-0.5 \cdot 3} = e^{-1.5} \approx 0.2231\]

b) Memorylessness:

\[P(X > 2 + 1 \mid X > 2) = P(X > 1) = e^{-0.5} \approx 0.6065\]

The bulb’s “remaining” lifetime is a fresh Exp\((0.5)\) regardless of how long it’s already lasted.

c) With \(Y \sim \text{Geo}(0.4)\), \(P(Y > k) = (1 - p)^k = 0.6^k\) (probability of \(k\) consecutive non-failures).

\[P(Y > 3 \mid Y > 2) = \frac{P(Y > 3)}{P(Y > 2)} = \frac{0.6^3}{0.6^2} = 0.6\]

\[P(Y > 1) = 0.6\]

They’re equal. The Geometric distribution is also memoryless — and in fact it’s the only discrete distribution on \(\{1, 2, \dots\}\) with this property, just as Exponential is the only continuous distribution on \([0, \infty)\) with it. Memorylessness is the discrete-continuous bridge between Geometric and Exponential, and it’s the same property that fueled the gambler’s fallacy in Problem 05.
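Both memorylessness identities, checked numerically (Python with `scipy`; recall `expon` uses `scale` \(= 1/\lambda\)):

```python
# P(X > 3 | X > 2) = P(X > 1) for Exp(0.5); same pattern for Geometric(0.4).
from scipy import stats

X = stats.expon(scale=1 / 0.5)
print(X.sf(3) / X.sf(2), X.sf(1))    # both e^{-0.5} ~ 0.6065

Y = stats.geom(0.4)                  # support {1, 2, ...}, P(Y > k) = 0.6^k
print(Y.sf(3) / Y.sf(2), Y.sf(1))    # both 0.6
```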


09 Uniform: The broken stick problem

A stick of length 1 is broken at a uniformly random point \(X \sim U(0, 1)\).

    1. What is the expected length of the left piece?
    1. Let \(Y = X(1 - X)\) be the product of the two piece lengths. Compute \(E[Y]\).
    1. What break point \(x\) maximizes \(Y = x(1-x)\)? Compare this to \(E[X]\).

a) Left piece has length \(X \sim U(0, 1)\):

\[\mathbb{E}[X] = \tfrac{0 + 1}{2} = \tfrac{1}{2}\]

b) Using LOTUS (HW 17) with \(f_X(x) = 1\) on \([0, 1]\):

\[\mathbb{E}[Y] = \mathbb{E}[X(1 - X)] = \mathbb{E}[X] - \mathbb{E}[X^2]\]

For \(X \sim U(0, 1)\): \(\mathbb{E}[X] = 1/2\) and \(\mathbb{E}[X^2] = \int_0^1 x^2\,dx = 1/3\).

\[\mathbb{E}[Y] = \tfrac{1}{2} - \tfrac{1}{3} = \tfrac{1}{6}\]

c) \(\frac{d}{dx} x(1 - x) = 1 - 2x = 0 \implies x^* = 1/2\), with \(Y^* = 1/4\).

So the break point that maximizes the product is exactly \(\mathbb{E}[X] = 1/2\), yet \(\mathbb{E}[Y] = 1/6 < 1/4 = g(\mathbb{E}[X])\), where \(g(x) = x(1-x)\).

Jensen’s inequality. For the concave function \(g(x) = x(1-x)\):

\[\mathbb{E}[g(X)] \leq g(\mathbb{E}[X])\]

That’s exactly \(1/6 \leq 1/4\). The “expected outcome of a function” is generally not the “function of the expected outcome” — a recurring trap when people compute \(f(\bar X)\) thinking it equals \(\overline{f(X)}\).
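A Monte Carlo illustration of the gap (Python with `numpy`, seed arbitrary):

```python
# Jensen for g(x) = x(1-x): E[g(X)] vs g(E[X]) under X ~ U(0,1).
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=1_000_000)

print((x * (1 - x)).mean())         # ~1/6 ~ 0.1667 = E[g(X)]
print(x.mean() * (1 - x.mean()))    # ~1/4 = 0.25   = g(E[X])
```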


10 Normal: The 68-95-99.7 rule in action

Human heights in a population follow \(N(170, 100)\) (mean 170 cm, variance 100 cm²).

    1. What is \(\sigma\)? What proportion of people are between 160 cm and 180 cm tall?
    1. A person is 2.5 standard deviations above the mean. How tall are they?
    1. Standardize the height \(X = 155\) cm. Interpret the z-score: is this person unusually short?

\(X \sim N(170, 100)\) means \(\mu = 170\), \(\sigma^2 = 100\), so \(\sigma = 10\) cm.

a) \(\sigma = 10\) cm. The interval \([160, 180]\) is exactly \([\mu - \sigma, \mu + \sigma]\), so by the 68-95-99.7 rule about \(68\%\) of people are between 160 and 180 cm tall.

b) Height \(= \mu + 2.5\sigma = 170 + 25 = 195\) cm.

c)

\[z = \frac{X - \mu}{\sigma} = \frac{155 - 170}{10} = -1.5\]

This person is \(1.5\) standard deviations below the mean. About \(93\%\) of the population is taller, and about \(7\%\) is shorter (from \(P(Z < -1.5) \approx 0.067\)). Short, but not extreme — you’d see plenty of people this height in a crowd.

Why standardize? The z-score strips off units and the choice of mean/scale, so heights, test scores, and salaries all live on the same ruler. “\(2\) standard deviations below” means the same thing everywhere.
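The same numbers from the Normal CDF directly (a `scipy` sketch):

```python
# (a) and (c) without the 68-95-99.7 shortcut.
from scipy import stats

mu, sigma = 170, 10
print(stats.norm.cdf(180, mu, sigma) - stats.norm.cdf(160, mu, sigma))  # ~0.6827
print(stats.norm.cdf(-1.5))   # ~0.0668: fraction shorter than 155 cm
```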


11 Normal: Standardization and comparison

Test A has scores \(\sim N(500, 10000)\) (so \(\sigma = 100\)). Test B has scores \(\sim N(50, 100)\) (so \(\sigma = 10\)).

    1. Alice scores 680 on Test A. Bob scores 72 on Test B. Compute both z-scores.
    1. Who performed better relative to their test population?
    1. Explain why comparing raw scores (680 vs 72) is meaningless without standardization.
    1. You’re given a coin that shows heads with unknown probability \(p\). You flip it 100 times and observe 65 heads. If the coin were fair (\(p = 0.5\)), what are \(E[X]\) and \(\text{SD}[X]\) for the number of heads? How many standard deviations away from the mean is 65? What can you conclude about whether \(p = 0.5\)?

a) Test A: \(z_A = \frac{680 - 500}{100} = 1.8\). Test B: \(z_B = \frac{72 - 50}{10} = 2.2\).

b) Bob. A higher z-score means he’s further into the right tail of his test’s distribution. About \(\Phi(1.8) \approx 96.4\%\) of test-A takers scored below Alice; about \(\Phi(2.2) \approx 98.6\%\) of test-B takers scored below Bob.

c) Tests A and B use entirely different scales (\(\mu_A = 500\) vs \(\mu_B = 50\)). \(680 > 72\) tells you nothing about relative performance — it’s like comparing temperatures in Celsius vs Fahrenheit. Standardization removes the units and makes the comparison meaningful.

d) If \(p = 0.5\) and \(X \sim \text{Binomial}(100, 0.5)\):

\[\mathbb{E}[X] = np = 50, \quad \text{Var}(X) = np(1-p) = 25, \quad \text{SD}(X) = 5\]

Distance of \(65\) from the mean:

\[z = \frac{65 - 50}{5} = 3\]

\(65\) is \(3\) standard deviations above the mean. By the 68-95-99.7 rule, fewer than \(0.3\%\) of outcomes fall this far from the mean in either direction; the exact upper-tail probability \(P(X \geq 65)\) is about \(0.18\%\). Either we just witnessed a roughly \(1\)-in-\(500\) event, or the coin isn’t fair. Strong evidence to reject \(p = 0.5\).
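The exact tail, if you want it (a `scipy` check of the back-of-envelope number):

```python
# How surprising is 65+ heads in 100 flips of a fair coin?
from scipy import stats

print(stats.binom.sf(64, 100, 0.5))   # P(X >= 65) ~ 0.0018, exact Binomial tail
print(stats.norm.sf(3))               # ~0.0013, the 3-sigma Normal tail, for comparison
```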

That’s hypothesis testing in miniature. Set up a null model (\(p = 0.5\)), compute how surprising the data would be under it, and reject the null when the data sits deep in the tail. We’ll formalize this later under significance levels and p-values.


Connections Between Distributions

12 The Poisson-Exponential connection

Customers arrive at a shop according to a Poisson process with rate \(\lambda = 4\) per hour.

    1. What distribution does the number of arrivals in 1 hour follow? State its mean and variance.
    1. What distribution does the time between consecutive arrivals follow? State its mean.
    1. If no customer has arrived in the last 15 minutes, what is the probability that the next customer arrives within 10 minutes?

a) Number of arrivals in 1 hour: \(N \sim \text{Poisson}(\lambda = 4)\). \(\mathbb{E}[N] = \text{Var}(N) = 4\).

b) Inter-arrival times: \(T \sim \text{Exponential}(\lambda = 4)\), with the rate in events per hour, so \(\mathbb{E}[T] = 1/4\) hour \(= 15\) minutes.

c) By memorylessness, the previous 15 minutes of waiting are irrelevant. Convert \(\lambda\) to “per minute”: \(\lambda = 4/60 = 1/15\).

\[P(T < 10) = 1 - e^{-(1/15) \cdot 10} = 1 - e^{-2/3} \approx 1 - 0.5134 \approx 0.4866\]

About a \(48.7\%\) chance.

Two views, one process. A Poisson process is described equivalently by counting events (“how many in a fixed window?”, Poisson) or by spacing them out (“how long until the next one?”, Exponential). The parameter \(\lambda\) is shared: events per unit time. This duality is what makes Poisson processes such a clean modeling tool — for arrivals, decay, server requests, neuron spikes.
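The duality is easy to see in simulation: generate Exponential gaps, then count arrivals per hour (a `numpy` sketch, seed arbitrary):

```python
# One process, two views: Exp(4) gaps produce Poisson(4) hourly counts.
import numpy as np

rng = np.random.default_rng(3)
gaps = rng.exponential(scale=1 / 4, size=500_000)       # hours between arrivals
times = np.cumsum(gaps)                                  # arrival times

counts = np.bincount(np.floor(times).astype(int))[:-1]  # arrivals per full hour
print(counts.mean(), counts.var())                       # both ~4: the Poisson signature
```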


13 The “inspection paradox”

Buses arrive according to a Poisson process with rate \(\lambda = 6\) per hour (i.e., one every 10 minutes on average). You arrive at the bus stop at a uniformly random time.

    1. What is the distribution of time between consecutive buses? Compute its expected value.
    1. Intuitively, would you expect your average wait time to be 5 minutes (half the inter-arrival time)?
    1. The “inspection paradox” says you’re more likely to arrive during a long gap than a short one. Without computing, explain in 2–3 sentences why your expected wait might actually be longer than 5 minutes.

a) Inter-arrival times for a Poisson process with rate \(\lambda = 6\) per hour are Exponential\((6)\):

\[\mathbb{E}[T] = \tfrac{1}{6} \text{ hour} = 10 \text{ minutes}\]

b) The naive answer is yes, 5 minutes — if buses come every 10 minutes on average and you show up randomly, surely you’d wait half a gap on average. But this intuition is wrong.

c) The inspection paradox.

If you arrive at a uniformly random moment, you’re not sampling a uniformly random gap — you’re sampling a time point, and longer gaps cover more of the timeline. A 20-minute gap is twice as likely to contain your arrival as a 10-minute gap. So the gap you land in is size-biased toward long gaps, not a typical gap.

In fact, for a Poisson process, you can show the expected length of the gap you land in is \(2/\lambda = 20\) minutes, and (by memorylessness of the Exponential!) your expected wait from arrival to the next bus is the full mean inter-arrival time — \(\mathbb{E}[\text{wait}] = 1/\lambda = 10\) minutes, not \(5\).
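A simulation sketch of both claims (Python with `numpy`, working in minutes; seed arbitrary):

```python
# Land at a random time: the surrounding gap averages ~20 min, the wait ~10 min.
import numpy as np

rng = np.random.default_rng(4)
gaps = rng.exponential(scale=10, size=1_000_000)   # minutes between buses
buses = np.cumsum(gaps)                            # bus arrival times

t = rng.uniform(0, buses[-2], size=100_000)        # uniformly random arrival moments
nxt = np.searchsorted(buses, t)                    # index of the next bus after t
print(gaps[nxt].mean())                            # ~20: size-biased gap length
print((buses[nxt] - t).mean())                     # ~10: expected wait, not 5
```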

Where this matters.

  • Survey design. If you ask shoppers “how long is your visit?”, you oversample long visits — a person browsing for an hour is six times more likely to be in your sample than someone making a \(10\)-minute trip.
  • Hospital length-of-stay. A “snapshot” sample of currently admitted patients overrepresents long-stayers, so the average stay you observe is longer than the average over all admissions.
  • Friendship paradox. “Your friends have more friends than you do, on average” is the same length-bias applied to social graphs.

The lesson: sampling matters. Sampling at random times (or via random people, in the friendship case) is not the same as sampling random gaps (or random people uniformly over the population). Mismatch this and you get a systematically biased estimate.


Applications and Critical Thinking

14 The prosecutor’s fallacy: Conditional thinking

In a city of 1 million people, a crime is committed. DNA evidence matches the suspect with a 1-in-10,000 error rate (i.e., a random person matches with probability 0.0001).

    1. Model the number of matching individuals in the city as a random variable. What distribution is appropriate? What is its expected value?
    1. The prosecutor argues: “The probability of a false match is 0.0001, so the defendant is 99.99% certain to be guilty.” Is this reasoning correct?
    1. If we assume the guilty person is definitely in the city, use Bayes-like reasoning to argue that the suspect’s probability of guilt depends on the expected number of matches.

a) Each of \(N = 10^6\) people independently matches with probability \(p = 10^{-4}\). Number of matches \(M \sim \text{Binomial}(10^6, 10^{-4})\). Since \(n\) is huge and \(p\) is tiny with \(np = 100\), this is well-approximated by

\[M \approx \text{Poisson}(\lambda = 100), \quad \mathbb{E}[M] = 100\]

So in expectation, about 100 people in the city match the DNA profile, of whom only one is the actual criminal.

b) No — that is the prosecutor’s fallacy.

The prosecutor confuses two very different conditional probabilities:

  • \(P(\text{match} \mid \text{innocent}) = 0.0001\) — the false-match rate, which is given.
  • \(P(\text{innocent} \mid \text{match}) = ?\) — what the jury actually wants to know.

These are not the same. Reversing them is exactly the same logical error as confusing \(P(A \mid B)\) with \(P(B \mid A)\).

c) Bayesian breakdown.

Assume the true criminal is in the city. Before the DNA test, the suspect is just one of \(N = 10^6\) people, so

\[P(\text{guilty}) = \frac{1}{10^6}\]

After observing the match, condition on the event “this particular person matched”:

\[P(\text{guilty} \mid \text{match}) = \frac{P(\text{match} \mid \text{guilty}) \cdot P(\text{guilty})}{P(\text{match})}\]

Assume \(P(\text{match} \mid \text{guilty}) = 1\). The denominator \(P(\text{match})\) for a uniformly random person is essentially \(1/N + p \approx p = 10^{-4}\) (the prior chance they’re guilty plus the false-match rate). So

\[P(\text{guilty} \mid \text{match}) \approx \frac{1 \cdot 10^{-6}}{10^{-4}} = 10^{-2} = 1\%\]

A different way to see the same number: among the \(\approx 100\) matching people, exactly one is guilty (the criminal), so the chance any given match is the criminal is \(1/100 = 1\%\).
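The same 1% computed straight from Bayes’ rule (plain Python):

```python
# Posterior P(guilty | match) with N = 10^6 people and false-match rate 1e-4.
N, p = 1_000_000, 1e-4

prior = 1 / N                              # one person out of the whole city
evidence = prior * 1 + (1 - prior) * p     # P(match) = guilty-and-match + innocent-and-match
print(prior / evidence)                    # ~0.0099: about 1%, not 99.99%
```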

The prosecutor’s “\(99.99\%\) certain” is off by a factor of about \(100\). The DNA evidence didn’t identify the criminal; it narrowed the suspect pool from \(10^6\) down to about \(100\). To single out one person, you need additional independent evidence (location, motive, witnesses).

The general lesson. When the base rate is tiny (most people are innocent), even a very accurate test produces mostly false positives in absolute numbers. Posterior probabilities can be wildly different from \(1 -\) (test error rate). This is the same trap doctors fall into when interpreting medical screening results, and the same idea behind the gambler’s fallacy in Problem 05 — confusing what the data evidences with what the data implies. Bayes’ rule is the antidote.

Flag Counter