23: Statistics — MLE & MAP Estimation

📚 Material

YouTube links in this section were auto-extracted. If you spot a mistake, please let me know!

Lecture

Practical

🏡 Homework


1) Method of Moments (Lecture 5)

01 Gamma MoM

A machine produces parts whose lifetimes \(X_1, \ldots, X_n\) are modeled as \(\mathrm{Gamma}(\alpha, \beta)\) with density \[f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0.\] The population mean is \(\mathbb{E}[X] = \alpha/\beta\) and the population variance is \(\operatorname{Var}(X) = \alpha/\beta^2\).

    1. Set the first two population moments equal to their sample counterparts and solve for \(\hat{\alpha}_{\mathrm{MoM}}\) and \(\hat{\beta}_{\mathrm{MoM}}\) in terms of \(\bar{X}\) and \(S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\).
    1. You observe \(n = 100\) lifetimes with \(\bar{X} = 4.2\) and \(S^2 = 8.82\). Compute \(\hat{\alpha}\) and \(\hat{\beta}\).
    1. Can MoM ever give \(\hat{\alpha} < 0\) or \(\hat{\beta} < 0\)? Under what conditions? What does this tell us about a limitation of MoM?

02 MoM vs MLE for the Uniform

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}(0, \theta)\) where \(\theta > 0\) is unknown.

    1. Compute the MoM estimator \(\hat{\theta}_{\mathrm{MoM}}\) using the first moment \(\mathbb{E}[X] = \theta/2\).
    1. Compute the MLE \(\hat{\theta}_{\mathrm{MLE}} = X_{(n)} = \max(X_1, \ldots, X_n)\). (Show this by writing the likelihood and arguing about where it is maximized.)
    1. Compute the MSE of both estimators. You may use the facts that \(\mathbb{E}[X_{(n)}] = \frac{n}{n+1}\theta\) and \(\operatorname{Var}(X_{(n)}) = \frac{n\theta^2}{(n+1)^2(n+2)}\).
    1. Which estimator has smaller MSE? Does the answer surprise you, given that MLE is usually “optimal”?

Hint for (d): the support of the Uniform depends on \(\theta\), so the distribution is not in the exponential family and the regularity conditions behind the standard MLE optimality results (Cramér–Rao, asymptotic efficiency) fail here.


2) Maximum Likelihood Estimation (Lecture 5)

03 Geometric MLE & Efficiency

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Geometric}(p)\) with PMF \(f(x \mid p) = (1-p)^{x-1}p\) for \(x = 1, 2, \ldots\)

    1. Write the log-likelihood \(\ell(p)\) and derive the MLE \(\hat{p}_{\mathrm{MLE}}\).
    1. Compute the Fisher information \(I(p)\) for one observation.
    1. Using the Cramér–Rao bound, what is the smallest possible variance for any unbiased estimator of \(p\)?
    1. Is \(\hat{p}_{\mathrm{MLE}}\) unbiased? Is it asymptotically efficient?

04 Normal MLE Meets MoM

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)\) where both \(\mu\) and \(\sigma^2\) are unknown.

    1. Derive the MLE for \(\mu\). (You already know the answer — just verify it.)
    1. Derive the MLE for \(\sigma^2\) by differentiating \(\ell(\mu, \sigma^2)\) with respect to \(\sigma^2\) and setting it to zero. Show all intermediate steps.
    1. Show that the MLE \(\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\) coincides with the MoM estimator. Why is this not a coincidence? (Hint: for exponential families, MLE and MoM agree when both use the natural sufficient statistics.)
    1. Is \(\hat{\sigma}^2_{\mathrm{MLE}}\) unbiased? If not, what is its bias, and how does it compare to the Bessel-corrected \(S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2\)?

05 MLE and Cross-Entropy

In binary classification, we model \(Y_i \in \{0, 1\}\) as \(Y_i \mid \mathbf{x}_i \sim \mathrm{Bernoulli}\!\big(\sigma(\mathbf{x}_i^\top \mathbf{w})\big)\) where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function.

    1. Write the log-likelihood \(\ell(\mathbf{w})\) for observations \((y_1, \mathbf{x}_1), \ldots, (y_n, \mathbf{x}_n)\).
    1. Show that maximizing \(\ell(\mathbf{w})\) is equivalent to minimizing the binary cross-entropy loss: \[\mathcal{L}_{\mathrm{CE}} = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big]\] where \(\hat{p}_i = \sigma(\mathbf{x}_i^\top \mathbf{w})\).
    1. Similarly, recall from the lecture that for \(Y_i \sim N(f(\mathbf{x}_i; \mathbf{w}),\, \sigma^2)\), maximizing the log-likelihood is equivalent to minimizing MSE. Fill in the following table:
| Noise model | MLE objective | Equivalent ML loss |
|---|---|---|
| Gaussian | max \(\ell(\mathbf{w})\) | ? |
| Bernoulli | max \(\ell(\mathbf{w})\) | ? |

06 Invariance in Action

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(p)\).

    1. The MLE for \(p\) is \(\hat{p} = \bar{X}\). Using the invariance property of MLE, immediately write down the MLE for:
      1. The odds: \(\psi = p / (1 - p)\)
      1. The log-odds: \(\eta = \log\!\big(p/(1-p)\big)\)
      1. The variance of \(X\): \(v = p(1-p)\)
    1. Compute the MoM estimator for the odds \(\psi = p/(1-p)\). Is it the same as the MLE from (a-i)?
    1. What property does MLE have that MoM lacks in this context?

3) MAP Estimation & Bayesian Inference (Lecture 6)

Setup

A thermometer measures temperature with known measurement noise. Each reading is:

\[X_i \mid \mu \sim N(\mu,\; \sigma^2), \qquad \sigma = 2\;{}^\circ\text{C} \;\text{(known)}\]

You take \(n = 5\) independent readings:

\[x_1 = 21.3, \quad x_2 = 19.8, \quad x_3 = 22.1, \quad x_4 = 20.5, \quad x_5 = 23.0\]

The manufacturer says these sensors are calibrated around \(20\;{}^\circ\text{C}\), so you use a Gaussian prior:

\[\mu \sim N(m,\; \tau^2) = N(20,\; 3^2)\]

Goal: Estimate the true temperature \(\mu\) using (a) MLE and (b) MAP.


Part (a): Maximum Likelihood Estimate

Step 1: Write the likelihood.

Since the observations are independent and \(\sigma^2\) is known, the likelihood depends only on \(\mu\):

\[L(\mu) = \prod_{i=1}^n f(x_i \mid \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\]

Step 2: Log-likelihood.

\[\ell(\mu) = \ln L(\mu) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\]

Only the second term depends on \(\mu\). Maximizing \(\ell(\mu)\) is equivalent to minimizing \(\sum(x_i - \mu)^2\).

Step 3: Differentiate and solve.

\[\frac{d\ell}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = \frac{1}{\sigma^2}\left(\sum x_i - n\mu\right) \stackrel{!}{=} 0\]

\[\boxed{\hat\mu_{\text{MLE}} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i}\]

Step 4: Compute.

\[\hat\mu_{\text{MLE}} = \frac{21.3 + 19.8 + 22.1 + 20.5 + 23.0}{5} = \frac{106.7}{5} = 21.34\;{}^\circ\text{C}\]

Note: The MLE uses only the data. It completely ignores the manufacturer’s calibration information that the sensor should read around \(20\;{}^\circ\text{C}\).
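
As a quick sanity check on Part (a), the sketch below evaluates the log-likelihood from Step 2 on a grid of candidate means and confirms that the maximizer coincides with the sample mean. The function name `log_likelihood` and the grid range (19 to 24 in steps of 0.01) are illustrative choices, not part of the original exercise.

```python
import math

# Readings and known noise level from the setup above.
readings = [21.3, 19.8, 22.1, 20.5, 23.0]
sigma = 2.0


def log_likelihood(mu, xs, sigma):
    """Gaussian log-likelihood of the readings xs for a candidate mean mu."""
    n = len(xs)
    const = -0.5 * n * math.log(2 * math.pi * sigma**2)
    return const - sum((x - mu) ** 2 for x in xs) / (2 * sigma**2)


# Closed form: the MLE is the sample mean.
mu_mle = sum(readings) / len(readings)

# Brute-force check: scan candidate means from 19.00 to 24.00 in steps of 0.01
# (the grid is an arbitrary illustrative choice).
grid = [19.0 + 0.01 * i for i in range(501)]
mu_grid = max(grid, key=lambda mu: log_likelihood(mu, readings, sigma))

print(f"sample mean = {mu_mle:.2f}")   # sample mean = 21.34
print(f"grid argmax = {mu_grid:.2f}")  # grid argmax = 21.34
```

Because the log-likelihood is a downward-opening parabola in \(\mu\), the grid search can only confirm what the derivative already shows; it is useful mainly as a template for models without a closed form.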


Part (b): MAP Estimate

Step 1: Bayes’ theorem - what it says and why.

Bayes’ theorem tells us how to update our belief about \(\mu\) after seeing data. It follows directly from the definition of conditional probability:

\[p(\mu \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mu)\; p(\mu)}{p(\mathbf{x})}\]

Let’s name each piece:

| Symbol | Name | Meaning |
|---|---|---|
| \(p(\mu)\) | Prior | what we believe about \(\mu\) before seeing data |
| \(p(\mathbf{x} \mid \mu)\) | Likelihood | probability of the data given \(\mu\) |
| \(p(\mathbf{x})\) | Evidence (marginal likelihood) | \(\int_{-\infty}^{\infty} p(\mathbf{x} \mid \mu)\,p(\mu)\,d\mu\) |
| \(p(\mu \mid \mathbf{x})\) | Posterior | what we believe about \(\mu\) after seeing data |

The evidence \(p(\mathbf{x})\) is a constant (it doesn’t depend on \(\mu\) - it’s already integrated out). So for the purpose of finding the MAP, we can ignore it:

\[p(\mu \mid \mathbf{x}) \propto \underbrace{p(\mathbf{x} \mid \mu)}_{\text{likelihood}} \cdot \underbrace{p(\mu)}_{\text{prior}}\]

In words: posterior \(\propto\) likelihood \(\times\) prior. The MAP estimate is the value of \(\mu\) that maximizes this product.

Step 2: Write out each piece explicitly.

Likelihood (from Part a):

\[p(\mathbf{x} \mid \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\]

Prior:

\[p(\mu) = \frac{1}{\sqrt{2\pi}\,\tau}\exp\!\left(-\frac{(\mu - m)^2}{2\tau^2}\right)\]

Log-posterior (taking \(\ln\) of the product, dropping all terms that don’t involve \(\mu\)):

\[\ln p(\mu \mid \mathbf{x}) = \ln p(\mathbf{x} \mid \mu) + \ln p(\mu) + \text{const}\]

\[= \underbrace{-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2}_{\text{log-likelihood (data term)}} \;\;\underbrace{- \frac{(\mu - m)^2}{2\tau^2}}_{\text{log-prior (regularizer)}} + \text{const}\]

Notice the structure: the log-posterior is a sum of two quadratic penalties on \(\mu\):

  • The likelihood pulls \(\mu\) toward the data (toward \(\bar{x}\)).
  • The prior pulls \(\mu\) toward the prior mean \(m\).
  • The MAP estimate will balance these two forces.
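
The tug-of-war between the two quadratic terms can be seen numerically before any calculus. The sketch below codes the unnormalized log-posterior (data term plus prior term, constants dropped) and locates its maximizer on a grid; the function name `log_posterior` and the grid spacing are assumptions made for illustration.

```python
# Values from the setup above.
readings = [21.3, 19.8, 22.1, 20.5, 23.0]
sigma, m, tau = 2.0, 20.0, 3.0


def log_posterior(mu):
    """Unnormalized log-posterior: data term plus prior term (constants dropped)."""
    data_term = -sum((x - mu) ** 2 for x in readings) / (2 * sigma**2)
    prior_term = -((mu - m) ** 2) / (2 * tau**2)
    return data_term + prior_term


x_bar = sum(readings) / len(readings)

# Scan candidate values from 19.000 to 24.000 in steps of 0.001
# (an arbitrary illustrative grid).
grid = [19.0 + 0.001 * i for i in range(5001)]
mu_best = max(grid, key=log_posterior)

print(f"prior mean = {m:.2f}, sample mean = {x_bar:.2f}, grid maximizer = {mu_best:.3f}")
```

The maximizer lands strictly between the prior mean and the sample mean, and much closer to the sample mean, which is the balance the bullets describe.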

Step 3: Differentiate and solve.

Take the derivative with respect to \(\mu\) and set it to zero:

\[\frac{d}{d\mu}\ln p(\mu \mid \mathbf{x}) = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i - \mu) - \frac{1}{\tau^2}(\mu - m) \stackrel{!}{=} 0\]

Expand the first term using \(\sum(x_i - \mu) = n\bar{x} - n\mu\):

\[\frac{n\bar{x} - n\mu}{\sigma^2} - \frac{\mu - m}{\tau^2} = 0\]

Distribute:

\[\frac{n\bar{x}}{\sigma^2} - \frac{n\mu}{\sigma^2} - \frac{\mu}{\tau^2} + \frac{m}{\tau^2} = 0\]

Move all \(\mu\) terms to the left, everything else to the right:

\[\mu\cdot\frac{n}{\sigma^2} + \mu\cdot\frac{1}{\tau^2} = \frac{n\bar{x}}{\sigma^2} + \frac{m}{\tau^2}\]

Factor out \(\mu\) and divide:

\[\mu\!\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right) = \frac{n\bar{x}}{\sigma^2} + \frac{m}{\tau^2}\]

\[\boxed{\hat\mu_{\text{MAP}} = \frac{\dfrac{n}{\sigma^2}\,\bar{x} \;+\; \dfrac{1}{\tau^2}\,m}{\dfrac{n}{\sigma^2} + \dfrac{1}{\tau^2}}}\]

Step 4: Interpret - it’s a weighted average.

Rewrite the formula as:

\[\hat\mu_{\text{MAP}} = w_{\text{data}}\,\bar{x} + w_{\text{prior}}\,m, \qquad w_{\text{data}} + w_{\text{prior}} = 1\]

where:

\[w_{\text{data}} = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}, \qquad w_{\text{prior}} = \frac{1/\tau^2}{n/\sigma^2 + 1/\tau^2}\]

What determines the weights? The quantities \(n/\sigma^2\) and \(1/\tau^2\) are precisions (inverse variances):

  • Data precision = \(n/\sigma^2\): how precisely the data pins down \(\mu\). It grows with \(n\) (more readings) and shrinks as \(\sigma^2\) grows (noisier sensor).
  • Prior precision = \(1/\tau^2\): how confident we are in the prior belief. A tight prior (\(\tau\) small) means high precision; a vague prior (\(\tau\) large) means low precision.

The rule is simple: whichever source of information is more precise gets more weight in the final estimate.

Step 5: Compute.

Plug in \(n = 5\), \(\sigma^2 = 4\), \(\bar{x} = 21.34\), \(m = 20\), \(\tau^2 = 9\):

  • Data precision: \(\dfrac{n}{\sigma^2} = \dfrac{5}{4} = 1.25\)
  • Prior precision: \(\dfrac{1}{\tau^2} = \dfrac{1}{9} \approx 0.111\)
  • Total precision: \(1.25 + 0.111 = 1.361\)
  • Weights: \(w_{\text{data}} = \dfrac{1.25}{1.361} = 0.918\), \(w_{\text{prior}} = \dfrac{0.111}{1.361} = 0.082\)

\[\boxed{\hat\mu_{\text{MAP}} = 0.918 \times 21.34 + 0.082 \times 20 = 19.59 + 1.64 = 21.23\;{}^\circ\text{C}}\]
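
The arithmetic in Step 5 can be reproduced with a small helper. This is a minimal sketch of the precision-weighted average derived above; the function name `map_normal_mean` and its argument names are hypothetical, chosen only for this illustration.

```python
def map_normal_mean(x_bar, n, sigma, m, tau):
    """MAP estimate of a Gaussian mean with known sigma and a N(m, tau^2) prior.

    Returns the tuple (mu_map, w_data, w_prior).
    """
    data_precision = n / sigma**2
    prior_precision = 1 / tau**2
    total_precision = data_precision + prior_precision
    w_data = data_precision / total_precision
    w_prior = prior_precision / total_precision
    return w_data * x_bar + w_prior * m, w_data, w_prior


mu_map, w_data, w_prior = map_normal_mean(x_bar=21.34, n=5, sigma=2.0, m=20.0, tau=3.0)
print(f"w_data = {w_data:.3f}, w_prior = {w_prior:.3f}, MAP = {mu_map:.2f}")
# w_data = 0.918, w_prior = 0.082, MAP = 21.23
```

Working with unrounded precisions inside the function avoids the small rounding drift that creeps in when 0.111 and 0.918 are carried by hand.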


Part (c): How much does the prior pull?

The MLE gives \(21.34\) and the MAP gives \(21.23\), so the prior pulls the estimate by:

\[21.34 - 21.23 = 0.11\;{}^\circ\text{C} \quad\text{toward the prior mean of } 20\]

This is a small shift. Why? Look at the weights: the data gets 91.8% of the weight, the prior only 8.2%.

Intuition: The data precision (\(n/\sigma^2 = 1.25\)) is much larger than the prior precision (\(1/\tau^2 = 0.111\)). In other words, the prior is quite vague (\(\tau = 3\) is wide) while 5 measurements with \(\sigma = 2\) already pin down \(\mu\) reasonably well. The data “speaks louder” than the prior.

Another way to see it: the prior says “\(\mu\) is somewhere in \((20 \pm 6)\) with about 95% probability” (\(\tau = 3\), so the approximate 95% interval is \(m \pm 2\tau\)). The data, with \(\text{SE} = \sigma/\sqrt{n} = 2/\sqrt{5} \approx 0.89\), says “\(\mu\) is in \((21.34 \pm 1.75)\)” (that is, \(\bar{x} \pm 1.96\,\text{SE}\)). The data’s interval is much narrower, so it dominates.

| | Mean | Precision (= \(1/\text{Var}\)) |
|---|---|---|
| Prior | 20.00 | \(1/9 \approx 0.111\) |
| Data (likelihood) | 21.34 | \(5/4 = 1.250\) |
| MAP (combined) | 21.23 | 1.361 |

Key insight: In this Gaussian model the posterior precision is the sum of the data precision and the prior precision, so it is always greater than either one alone. Combining the two sources of information makes you more certain, never less.
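
The precision bookkeeping in Part (c) can be checked in a few lines. The sketch below relies on the standard result for this Gaussian prior with Gaussian likelihood that precisions add and that each standard deviation is \(1/\sqrt{\text{precision}}\); the variable names are illustrative.

```python
import math

# Values from the setup above.
n, sigma, tau = 5, 2.0, 3.0

data_precision = n / sigma**2            # 5/4
prior_precision = 1 / tau**2             # 1/9
posterior_precision = data_precision + prior_precision

# Convert each precision back to a standard deviation for comparison.
data_sd = 1 / math.sqrt(data_precision)            # equals sigma / sqrt(n)
prior_sd = 1 / math.sqrt(prior_precision)          # equals tau
posterior_sd = 1 / math.sqrt(posterior_precision)

print(f"prior sd = {prior_sd:.3f}, data sd = {data_sd:.3f}, posterior sd = {posterior_sd:.3f}")
# prior sd = 3.000, data sd = 0.894, posterior sd = 0.857
```

The posterior standard deviation is smaller than both inputs, though only slightly smaller than the data's, because the prior contributes little precision here.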


Part (d): What happens with \(n = 100\) readings?

Now suppose we take \(n = 100\) readings with the same sample mean \(\bar{x} = 21.34\).

  • Data precision: \(\dfrac{n}{\sigma^2} = \dfrac{100}{4} = 25\)
  • Prior precision: \(\dfrac{1}{\tau^2} = \dfrac{1}{9} \approx 0.111\)
  • New weights: \(w_{\text{data}} = \dfrac{25}{25.111} = 0.9956\), \(w_{\text{prior}} = \dfrac{0.111}{25.111} = 0.0044\)

Or directly:

\[\hat\mu_{\text{MAP}} = \frac{25 \times 21.34 + 0.111 \times 20}{25.111} = \frac{533.5 + 2.22}{25.111} = \frac{535.72}{25.111} = 21.33\;{}^\circ\text{C}\]

Compare:

| | MLE | MAP (\(n=5\)) | MAP (\(n=100\)) |
|---|---|---|---|
| Estimate | 21.34 | 21.23 | 21.33 |
| \(w_{\text{data}}\) | 100% | 91.8% | 99.6% |
| \(w_{\text{prior}}\) | 0% | 8.2% | 0.4% |

With 100 readings, the MAP is virtually identical to the MLE. The prior’s influence has shrunk from 8.2% to 0.4%.

The lesson:

  • With little data, the prior matters - it stabilizes the estimate when the data alone is noisy.
  • With lots of data, the prior washes out - MAP \(\to\) MLE. The data overwhelms any prior belief.
  • Mathematically: as \(n \to \infty\), \(w_{\text{data}} = \dfrac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2} \to 1\), so \(\hat\mu_{\text{MAP}} \to \bar{x} = \hat\mu_{\text{MLE}}\).
  • This is a fundamental property of Bayesian estimation: the posterior is asymptotically dominated by the likelihood. No matter what prior you start with (as long as it’s nonzero at the true value), enough data will override it.
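
The washing-out of the prior can be tabulated directly. The sketch below holds the sample mean fixed at 21.34, as Part (d) does, and varies only \(n\); the particular sample sizes in the loop are arbitrary choices for illustration.

```python
# Values from the setup; the sample mean is held fixed as in Part (d).
x_bar, sigma, m, tau = 21.34, 2.0, 20.0, 3.0

rows = []
for n in [1, 5, 20, 100, 1000]:
    data_precision = n / sigma**2
    prior_precision = 1 / tau**2
    w_data = data_precision / (data_precision + prior_precision)
    mu_map = w_data * x_bar + (1 - w_data) * m
    rows.append((n, w_data, mu_map))
    print(f"n = {n:5d}   w_data = {w_data:.4f}   MAP = {mu_map:.3f}")
```

The data weight climbs toward 1 and the MAP estimate approaches the sample mean as \(n\) grows, matching the limit stated in the bullets.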

07 Normal MAP: Shrinkage in Action

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma_0^2)\) with known \(\sigma_0^2\), and put a prior \(\mu \sim N(m, \tau^2)\).

    1. Write the log-posterior \(\log P(\mu \mid \text{data}) = \ell(\mu) + \log P(\mu) + \text{const}\).
    1. Take the derivative with respect to \(\mu\), set it to zero, and derive the MAP estimator. Show it equals: \[\hat{\mu}_{\mathrm{MAP}} = w \cdot \bar{X} + (1-w) \cdot m, \qquad w = \frac{n/\sigma_0^2}{n/\sigma_0^2 + 1/\tau^2}.\]
    1. Interpret \(w\) as a weight. What happens to \(\hat{\mu}_{\mathrm{MAP}}\) as:
      1. \(n \to \infty\) (lots of data)?
      1. \(\tau^2 \to \infty\) (very vague prior)?
      1. \(\tau^2 \to 0\) (very strong prior)?
    1. A medical device is calibrated to \(m = 100\)°C (prior mean). You take \(n = 5\) readings with \(\bar{X} = 103\)°C, knowing \(\sigma_0 = 2\)°C. Compute \(\hat{\mu}_{\mathrm{MAP}}\) for \(\tau = 1\)°C and \(\tau = 10\)°C. Interpret the difference.

08 Beta-Binomial MAP: How Much Does the Prior Pull?

A coin is flipped \(n = 20\) times and lands heads \(k = 14\) times.

    1. Compute the MLE \(\hat{p}_{\mathrm{MLE}}\).
    1. Compute the MAP estimate under each of the following priors, using the Beta-Binomial conjugacy formula \(\hat{p}_{\mathrm{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}\):
      1. \(\mathrm{Beta}(1, 1)\) — uniform (flat) prior
      1. \(\mathrm{Beta}(5, 5)\) — weakly informative
      1. \(\mathrm{Beta}(50, 50)\) — strongly informative (centered at 0.5)
    1. Which prior produces the MAP estimate closest to the MLE? Which one “pulls” the most toward 0.5? Explain why in terms of pseudo-observations.
    1. How many real coin flips would you need before the \(\mathrm{Beta}(50, 50)\) prior becomes negligible? (Think about when the pseudo-observations are \(< 10\%\) of the total.)

09 Deriving Ridge Regression from MAP

Consider the linear model \(y_i = \mathbf{x}_i^\top\boldsymbol{\theta} + \varepsilon_i\) with \(\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)\) and a Gaussian prior \(\theta_j \overset{\text{i.i.d.}}{\sim} N(0, \tau^2)\).

    1. Write the MAP objective: \(\max_{\boldsymbol{\theta}} \big[\ell(\boldsymbol{\theta}) + \log P(\boldsymbol{\theta})\big]\). Convert to a minimization problem.
    1. Show that the MAP objective simplifies to: \[\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \arg\min_{\boldsymbol{\theta}} \Big[\|\mathbf{y} - X\boldsymbol{\theta}\|^2 + \lambda \|\boldsymbol{\theta}\|^2\Big], \qquad \lambda = \frac{\sigma^2}{\tau^2}.\]
    1. Set the gradient to zero and derive the closed-form solution \(\hat{\boldsymbol{\theta}} = (X^\top X + \lambda I)^{-1}X^\top \mathbf{y}\).
    1. Why does Ridge regression always have a unique solution, even when \(X^\top X\) is singular? (Hint: what are the eigenvalues of \(X^\top X + \lambda I\)?)
    1. What happens to \(\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}\) when \(\lambda \to 0\)? When \(\lambda \to \infty\)? Interpret in terms of the prior.

10 Laplace Prior & Sparsity

Now replace the Gaussian prior with a Laplace prior: \(P(\theta_j) \propto \exp\!\big(-|\theta_j|/b\big)\).

    1. Show that the MAP objective becomes: \[\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \arg\min_{\boldsymbol{\theta}} \Big[\|\mathbf{y} - X\boldsymbol{\theta}\|^2 + \lambda \|\boldsymbol{\theta}\|_1\Big]\] and express \(\lambda\) in terms of \(\sigma^2\) and \(b\).
    1. Unlike Ridge, the Lasso does not have a closed-form solution in general. However, for the special case of one parameter (\(p = 1\)) with \(X^\top X = I\) (orthonormal design), the solution is the soft-thresholding operator: \[\hat{\theta}_j = \mathrm{sign}(\hat{\theta}_j^{\mathrm{OLS}})\max\!\big(|\hat{\theta}_j^{\mathrm{OLS}}| - \lambda/2,\; 0\big).\] Verify this by sketching the Lasso objective for \(p = 1\) and finding where the derivative is zero (careful: \(|\theta|\) is not differentiable at 0!).
    1. Explain geometrically (using the diamond vs circle picture from the lecture) why the Lasso tends to produce exact zeros while Ridge does not.
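
The soft-thresholding formula in item 2 can be checked numerically before doing the sketch by hand. The code below assumes the one-parameter objective reduces to \((\theta - \hat{\theta}^{\mathrm{OLS}})^2 + \lambda|\theta|\) up to an additive constant, which holds when \(X^\top X = I\); the function names, the value \(\lambda = 1\), and the test values of \(\hat{\theta}^{\mathrm{OLS}}\) are illustrative assumptions.

```python
def soft_threshold(theta_ols, lam):
    """Soft-thresholding: shrink theta_ols toward zero by lam/2, clipping at zero."""
    shrunk = abs(theta_ols) - lam / 2
    if shrunk <= 0:
        return 0.0
    return shrunk if theta_ols > 0 else -shrunk


def lasso_objective_1d(theta, theta_ols, lam):
    """One-parameter Lasso objective with the theta-independent constant dropped."""
    return (theta - theta_ols) ** 2 + lam * abs(theta)


lam = 1.0
# Candidate values from -3.000 to 3.000 in steps of 0.001 (illustrative grid).
grid = [-3.0 + 0.001 * i for i in range(6001)]

results = []
for theta_ols in [-2.0, -0.3, 0.0, 0.4, 1.5]:
    closed_form = soft_threshold(theta_ols, lam)
    brute_force = min(grid, key=lambda t: lasso_objective_1d(t, theta_ols, lam))
    results.append((theta_ols, closed_form, brute_force))
    print(f"OLS = {theta_ols:5.2f}   soft-threshold = {closed_form:6.3f}   grid argmin = {brute_force:6.3f}")
```

Inputs with \(|\hat{\theta}^{\mathrm{OLS}}| \le \lambda/2\) are mapped to exactly zero, while larger ones are shifted toward zero by \(\lambda/2\). A grid search only agrees with the formula up to the grid spacing, so this is a check, not a proof.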

4) Simulation & Comparison

11 MLE vs MAP Shootout

Write a Python simulation to compare MLE and MAP for estimating a Bernoulli parameter.

    1. Setup: The true probability is \(p_{\mathrm{true}} = 0.3\). Use a \(\mathrm{Beta}(3, 7)\) prior (centered at the truth). For each sample size \(n \in \{5, 10, 20, 50, 100, 500\}\):
      • Generate 10,000 datasets of size \(n\).
      • Compute \(\hat{p}_{\mathrm{MLE}} = k/n\) and \(\hat{p}_{\mathrm{MAP}} = \frac{k + 3 - 1}{n + 3 + 7 - 2}\) for each.
      • Record the MSE of both estimators.
    1. Plot MSE vs \(n\) for both estimators on the same graph. At what sample size does MLE start to match MAP?
    1. Repeat with a misspecified prior \(\mathrm{Beta}(50, 50)\) (centered at 0.5, far from \(p_{\mathrm{true}} = 0.3\)). What happens to the MAP estimator’s MSE for small \(n\)? For large \(n\)? Is MAP always better than MLE?
    1. What lesson does this teach about the choice of prior?
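
One possible starting point for the simulation, using only the standard library, is sketched below. It handles a single sample size and a single prior, and it uses 2,000 datasets instead of 10,000 to keep the run short; the function name `simulate_mse`, its parameters, and the fixed seed are assumptions for illustration. The loop over \(n\), the plot, and the interpretation are left to you.

```python
import random


def simulate_mse(n, p_true, alpha, beta, n_datasets=2000, seed=0):
    """Empirical MSE of the MLE and the Beta(alpha, beta) MAP for a Bernoulli parameter.

    Returns the tuple (mse_mle, mse_map).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    sq_err_mle = 0.0
    sq_err_map = 0.0
    for _ in range(n_datasets):
        # Number of successes in n Bernoulli(p_true) trials.
        k = sum(rng.random() < p_true for _ in range(n))
        p_mle = k / n
        p_map = (k + alpha - 1) / (n + alpha + beta - 2)
        sq_err_mle += (p_mle - p_true) ** 2
        sq_err_map += (p_map - p_true) ** 2
    return sq_err_mle / n_datasets, sq_err_map / n_datasets


mse_mle, mse_map = simulate_mse(n=5, p_true=0.3, alpha=3, beta=7)
print(f"n = 5: MSE(MLE) = {mse_mle:.4f}, MSE(MAP) = {mse_map:.4f}")
```

Calling the same function with `alpha=50, beta=50` gives the misspecified-prior case from item 3.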

