23: Statistics — MLE & MAP Estimation

📚 Materials

YouTube links in this section were auto-extracted. If you spot a mistake, please let me know!

Lecture

Practical

🏡 Homework


1) Method of Moments (Lecture 5)

01 Gamma MoM

A machine produces parts whose lifetimes \(X_1, \ldots, X_n\) are modeled as \(\mathrm{Gamma}(\alpha, \beta)\) with density \[f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0.\] The population mean is \(\mathbb{E}[X] = \alpha/\beta\) and the population variance is \(\operatorname{Var}(X) = \alpha/\beta^2\).

    1. Set the first two population moments equal to their sample counterparts and solve for \(\hat{\alpha}_{\mathrm{MoM}}\) and \(\hat{\beta}_{\mathrm{MoM}}\) in terms of \(\bar{X}\) and \(S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\).
    1. You observe \(n = 100\) lifetimes with \(\bar{X} = 4.2\) and \(S^2 = 8.82\). Compute \(\hat{\alpha}\) and \(\hat{\beta}\).
    1. Can MoM ever give \(\hat{\alpha} < 0\) or \(\hat{\beta} < 0\)? Under what conditions? What does this tell us about a limitation of MoM?
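Once you have solved part 1, part 2 is a one-line computation. A minimal numerical check (plain Python, with the moment-equation solutions hard-coded, so only look after deriving them yourself):

```python
# Method of moments for Gamma(alpha, beta) in the rate parameterization:
# solving  mean = alpha/beta  and  variance = alpha/beta^2  gives
#   alpha_hat = xbar^2 / s2,   beta_hat = xbar / s2
def gamma_mom(xbar, s2):
    return xbar**2 / s2, xbar / s2

alpha_hat, beta_hat = gamma_mom(4.2, 8.82)
print(alpha_hat, beta_hat)  # alpha_hat ≈ 2.0, beta_hat ≈ 0.476
```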

02 MoM vs MLE for the Uniform

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}(0, \theta)\) where \(\theta > 0\) is unknown.

    1. Compute the MoM estimator \(\hat{\theta}_{\mathrm{MoM}}\) using the first moment \(\mathbb{E}[X] = \theta/2\).
    1. Compute the MLE \(\hat{\theta}_{\mathrm{MLE}} = X_{(n)} = \max(X_1, \ldots, X_n)\). (Show this by writing the likelihood and arguing about where it is maximized.)
    1. Compute the MSE of both estimators. You may use the facts that \(\mathbb{E}[X_{(n)}] = \frac{n}{n+1}\theta\) and \(\operatorname{Var}(X_{(n)}) = \frac{n\theta^2}{(n+1)^2(n+2)}\).
    1. Which estimator has smaller MSE? Does the answer surprise you, given that MLE is usually “optimal”?

Hint for part 4: the support of \(\mathrm{Uniform}(0, \theta)\) depends on \(\theta\), so the regularity conditions behind the standard MLE optimality results (the Cramér–Rao bound, asymptotic efficiency) fail; in particular, the Uniform is not in the exponential family.
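A quick simulation makes the MSE comparison in parts 3 and 4 concrete. A sketch, with hypothetical values \(\theta = 1\) and \(n = 20\):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 1.0, 20, 20_000
x = rng.uniform(0, theta, size=(reps, n))

mom = 2 * x.mean(axis=1)   # MoM: theta_hat = 2 * Xbar
mle = x.max(axis=1)        # MLE: theta_hat = X_(n)

mse_mom = np.mean((mom - theta) ** 2)   # theory: theta^2 / (3n)
mse_mle = np.mean((mle - theta) ** 2)   # theory: 2 theta^2 / ((n+1)(n+2))
print(mse_mom, mse_mle)
```

For \(n = 20\) the theoretical values are \(1/60 \approx 0.0167\) versus \(2/462 \approx 0.0043\), so the (biased) MLE wins by a wide margin.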


2) Maximum Likelihood Estimation (Lecture 5)

03 Geometric MLE & Efficiency

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Geometric}(p)\) with PMF \(f(x \mid p) = (1-p)^{x-1}p\) for \(x = 1, 2, \ldots\)

    1. Write the log-likelihood \(\ell(p)\) and derive the MLE \(\hat{p}_{\mathrm{MLE}}\).
    1. Compute the Fisher information \(I(p)\) for one observation.
    1. Using the Cramér–Rao bound, what is the smallest possible variance for any unbiased estimator of \(p\)?
    1. Is \(\hat{p}_{\mathrm{MLE}}\) unbiased? Is it asymptotically efficient?
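Parts 1 and 3 can be sanity-checked numerically. A sketch using NumPy's `geometric` sampler, whose support \(1, 2, \ldots\) matches the PMF above (the closed forms in the comments are the answers you should derive, so treat this as a check, not a shortcut):

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, n = 0.3, 200_000
x = rng.geometric(p_true, size=n)  # support 1, 2, ..., matching the PMF above

p_mle = 1.0 / x.mean()             # part 1 should give p_hat = 1 / Xbar
print(p_mle)                       # close to 0.3

# If part 2 gives I(p) = 1 / (p^2 (1 - p)), the CRLB in part 3 for a
# single observation is its reciprocal:
crlb_one_obs = p_true**2 * (1 - p_true)
print(crlb_one_obs)
```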

04 Normal MLE Meets MoM

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)\) where both \(\mu\) and \(\sigma^2\) are unknown.

    1. Derive the MLE for \(\mu\). (You already know the answer — just verify it.)
    1. Derive the MLE for \(\sigma^2\) by differentiating \(\ell(\mu, \sigma^2)\) with respect to \(\sigma^2\) and setting it to zero. Show all intermediate steps.
    1. Show that the MLE \(\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\) coincides with the MoM estimator. Why is this not a coincidence? (Hint: for exponential families, MLE and MoM agree when both use the natural sufficient statistics.)
    1. Is \(\hat{\sigma}^2_{\mathrm{MLE}}\) unbiased? If not, what is its bias, and how does it compare to the Bessel-corrected \(S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2\)?
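The bias in part 4 is easy to see in simulation. A sketch with hypothetical parameters (\(\sigma^2 = 4\), small \(n\) so the bias factor \((n-1)/n\) is far from 1):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 0.0, 4.0, 5, 100_000
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

xbar = x.mean(axis=1, keepdims=True)
ss = ((x - xbar) ** 2).sum(axis=1)

mle = ss / n            # MLE: expectation is (n-1)/n * sigma^2
bessel = ss / (n - 1)   # Bessel-corrected: unbiased
print(mle.mean(), bessel.mean())  # ~3.2 versus ~4.0
```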

05 MLE and Cross-Entropy

In binary classification, we model \(Y_i \in \{0, 1\}\) as \(Y_i \mid \mathbf{x}_i \sim \mathrm{Bernoulli}\!\big(\sigma(\mathbf{x}_i^\top \mathbf{w})\big)\) where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function.

    1. Write the log-likelihood \(\ell(\mathbf{w})\) for observations \((y_1, \mathbf{x}_1), \ldots, (y_n, \mathbf{x}_n)\).
    1. Show that maximizing \(\ell(\mathbf{w})\) is equivalent to minimizing the binary cross-entropy loss: \[\mathcal{L}_{\mathrm{CE}} = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big]\] where \(\hat{p}_i = \sigma(\mathbf{x}_i^\top \mathbf{w})\).
    1. Similarly, recall from the lecture that for \(Y_i \sim N(f(\mathbf{x}_i; \mathbf{w}),\, \sigma^2)\), maximizing the log-likelihood is equivalent to minimizing MSE. Fill in the following table:
| Noise model | MLE objective | Equivalent ML loss |
| --- | --- | --- |
| Gaussian | \(\max_{\mathbf{w}} \ell(\mathbf{w})\) | ? |
| Bernoulli | \(\max_{\mathbf{w}} \ell(\mathbf{w})\) | ? |
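The identity in part 2 can be verified numerically: on any data, the Bernoulli log-likelihood equals \(-n\) times the BCE loss. A sketch with random (hypothetical) features, labels, and weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n, d = 50, 3
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
y = rng.integers(0, 2, size=n)

p = sigmoid(X @ w)
loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # Bernoulli log-likelihood
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # binary cross-entropy
print(loglik, -n * bce)  # identical up to floating point
```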

06 Invariance in Action

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(p)\).

    1. The MLE for \(p\) is \(\hat{p} = \bar{X}\). Using the invariance property of MLE, immediately write down the MLE for:
      1. The odds: \(\psi = p / (1 - p)\)
      1. The log-odds: \(\eta = \log\!\big(p/(1-p)\big)\)
      1. The variance of \(X\): \(v = p(1-p)\)
    1. Compute the MoM estimator for the odds \(\psi = p/(1-p)\). Is it the same as the MLE for the odds from part 1?
    1. What property does MLE have that MoM lacks in this context?
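Invariance in part 1 is purely plug-in: the MLE of \(g(p)\) is \(g(\hat{p})\). A sketch with hypothetical counts (say 14 heads in 20 flips):

```python
import math

k, n = 14, 20          # hypothetical data
p_hat = k / n          # Bernoulli MLE

odds_hat = p_hat / (1 - p_hat)     # MLE of psi, by invariance
logodds_hat = math.log(odds_hat)   # MLE of eta
var_hat = p_hat * (1 - p_hat)      # MLE of v
print(odds_hat, logodds_hat, var_hat)
```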

3) MAP Estimation & Bayesian Inference (Lecture 6)

07 Normal MAP: Shrinkage in Action

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma_0^2)\) with known \(\sigma_0^2\), and put a prior \(\mu \sim N(m, \tau^2)\).

    1. Write the log-posterior \(\log P(\mu \mid \text{data}) = \ell(\mu) + \log P(\mu) + \text{const}\).
    1. Take the derivative with respect to \(\mu\), set it to zero, and derive the MAP estimator. Show it equals: \[\hat{\mu}_{\mathrm{MAP}} = w \cdot \bar{X} + (1-w) \cdot m, \qquad w = \frac{n/\sigma_0^2}{n/\sigma_0^2 + 1/\tau^2}.\]
    1. Interpret \(w\) as a weight. What happens to \(\hat{\mu}_{\mathrm{MAP}}\) as:
      1. \(n \to \infty\) (lots of data)?
      1. \(\tau^2 \to \infty\) (very vague prior)?
      1. \(\tau^2 \to 0\) (very strong prior)?
    1. A medical device is calibrated to \(m = 100\)°C (prior mean). You take \(n = 5\) readings with \(\bar{X} = 103\)°C; the noise level \(\sigma_0 = 2\)°C is known. Compute \(\hat{\mu}_{\mathrm{MAP}}\) for \(\tau = 1\)°C and for \(\tau = 10\)°C, and interpret the difference.
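Once part 2 is done, part 4 reduces to plugging numbers into the shrinkage formula. A sketch (the helper name is just a suggestion):

```python
def normal_map(xbar, n, sigma0, m, tau):
    """MAP estimate for a Normal mean with known variance and Normal prior."""
    w = (n / sigma0**2) / (n / sigma0**2 + 1 / tau**2)
    return w * xbar + (1 - w) * m

print(normal_map(103, 5, 2, 100, 1))    # ~101.67: strong prior pulls toward 100
print(normal_map(103, 5, 2, 100, 10))   # ~102.98: vague prior, nearly Xbar
```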

08 Beta-Binomial MAP: How Much Does the Prior Pull?

A coin is flipped \(n = 20\) times and lands heads \(k = 14\) times.

    1. Compute the MLE \(\hat{p}_{\mathrm{MLE}}\).
    1. Compute the MAP estimate under each of the following priors, using the Beta-Binomial conjugacy formula \(\hat{p}_{\mathrm{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}\):
      1. \(\mathrm{Beta}(1, 1)\) — uniform (flat) prior
      1. \(\mathrm{Beta}(5, 5)\) — weakly informative
      1. \(\mathrm{Beta}(50, 50)\) — strongly informative (centered at 0.5)
    1. Which prior produces the MAP estimate closest to the MLE? Which one “pulls” the most toward 0.5? Explain why in terms of pseudo-observations.
    1. How many real coin flips would you need before the \(\mathrm{Beta}(50, 50)\) prior becomes negligible? (Think about when the pseudo-observations are \(< 10\%\) of the total.)
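The three MAP estimates in part 2 come from the same one-line formula. A sketch (note the posterior-mode formula is valid only when \(\alpha, \beta \geq 1\)):

```python
def beta_map(k, n, a, b):
    """Posterior mode with a Beta(a, b) prior and k heads in n flips (a, b >= 1)."""
    return (a + k - 1) / (a + b + n - 2)

k, n = 14, 20
print(k / n)  # MLE: 0.70
for a, b in [(1, 1), (5, 5), (50, 50)]:
    print(a, b, beta_map(k, n, a, b))  # flat prior reproduces the MLE exactly
```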

09 Deriving Ridge Regression from MAP

Consider the linear model \(y_i = \mathbf{x}_i^\top\boldsymbol{\theta} + \varepsilon_i\) with \(\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)\) and a Gaussian prior \(\theta_j \overset{\text{i.i.d.}}{\sim} N(0, \tau^2)\).

    1. Write the MAP objective: \(\max_{\boldsymbol{\theta}} \big[\ell(\boldsymbol{\theta}) + \log P(\boldsymbol{\theta})\big]\). Convert to a minimization problem.
    1. Show that the MAP objective simplifies to: \[\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \arg\min_{\boldsymbol{\theta}} \Big[\|\mathbf{y} - X\boldsymbol{\theta}\|^2 + \lambda \|\boldsymbol{\theta}\|^2\Big], \qquad \lambda = \frac{\sigma^2}{\tau^2}.\]
    1. Set the gradient to zero and derive the closed-form solution \(\hat{\boldsymbol{\theta}} = (X^\top X + \lambda I)^{-1}X^\top \mathbf{y}\).
    1. Why does Ridge regression always have a unique solution, even when \(X^\top X\) is singular? (Hint: what are the eigenvalues of \(X^\top X + \lambda I\)?)
    1. What happens to \(\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}\) when \(\lambda \to 0\)? When \(\lambda \to \infty\)? Interpret in terms of the prior.
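The closed form from part 3 and the limiting behavior in part 5 are both easy to probe numerically. A sketch with synthetic (hypothetical) data:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form Ridge/MAP solution: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ theta_true + 0.1 * rng.normal(size=30)

print(ridge(X, y, lam=0.0))  # lam -> 0: ordinary least squares
print(ridge(X, y, lam=1e6))  # lam -> inf: shrunk toward the prior mean 0
```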

10 Laplace Prior & Sparsity

Now replace the Gaussian prior with a Laplace prior: \(P(\theta_j) \propto \exp\!\big(-|\theta_j|/b\big)\).

    1. Show that the MAP objective becomes: \[\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \arg\min_{\boldsymbol{\theta}} \Big[\|\mathbf{y} - X\boldsymbol{\theta}\|^2 + \lambda \|\boldsymbol{\theta}\|_1\Big]\] and express \(\lambda\) in terms of \(\sigma^2\) and \(b\).
    1. Unlike Ridge, the Lasso does not have a closed-form solution in general. However, for the special case of one parameter (\(p = 1\)) with \(X^\top X = I\) (orthonormal design), the solution is the soft-thresholding operator: \[\hat{\theta}_j = \mathrm{sign}(\hat{\theta}_j^{\mathrm{OLS}})\max\!\big(|\hat{\theta}_j^{\mathrm{OLS}}| - \lambda/2,\; 0\big).\] Verify this by sketching the Lasso objective for \(p = 1\) and finding where the derivative is zero (careful: \(|\theta|\) is not differentiable at 0!).
    1. Explain geometrically (using the diamond vs circle picture from the lecture) why the Lasso tends to produce exact zeros while Ridge does not.
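The soft-thresholding operator from part 2 makes the "exact zeros" behavior of part 3 visible immediately: coefficients below the threshold snap to 0, larger ones are shrunk by \(\lambda/2\). A sketch:

```python
import numpy as np

def soft_threshold(theta_ols, lam):
    """Lasso solution under orthonormal design: shrink by lam/2, snap small values to 0."""
    return np.sign(theta_ols) * np.maximum(np.abs(theta_ols) - lam / 2, 0.0)

out = soft_threshold(np.array([3.0, -0.4, 0.1]), lam=1.0)
print(out)  # 3.0 shrinks to 2.5; -0.4 and 0.1 become exactly 0
```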

4) Simulation & Comparison

11 MLE vs MAP Shootout

Write a Python simulation to compare MLE and MAP for estimating a Bernoulli parameter.

    1. Setup: The true probability is \(p_{\mathrm{true}} = 0.3\). Use a \(\mathrm{Beta}(3, 7)\) prior (centered at the truth). For each sample size \(n \in \{5, 10, 20, 50, 100, 500\}\):
    • Generate 10,000 datasets of size \(n\).
    • Compute \(\hat{p}_{\mathrm{MLE}} = k/n\) and \(\hat{p}_{\mathrm{MAP}} = \frac{k + 3 - 1}{n + 3 + 7 - 2}\) for each.
    • Record the MSE of both estimators.
    1. Plot MSE vs \(n\) for both estimators on the same graph. At what sample size does MLE start to match MAP?
    1. Repeat with a misspecified prior \(\mathrm{Beta}(50, 50)\) (centered at 0.5, far from \(p_{\mathrm{true}} = 0.3\)). What happens to the MAP estimator’s MSE for small \(n\)? For large \(n\)? Is MAP always better than MLE?
    1. What lesson does this teach about the choice of prior?
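A possible skeleton for part 1 (vectorized over the 10,000 replications, so each dataset is summarized by its head count \(k\); variable names are just suggestions):

```python
import numpy as np

rng = np.random.default_rng(42)
p_true, a, b, reps = 0.3, 3, 7, 10_000

mse = {}
for n in [5, 10, 20, 50, 100, 500]:
    k = rng.binomial(n, p_true, size=reps)    # 10,000 datasets of size n
    mle = k / n
    map_est = (k + a - 1) / (n + a + b - 2)   # Beta(3, 7) posterior mode
    mse[n] = (np.mean((mle - p_true) ** 2), np.mean((map_est - p_true) ** 2))
    print(n, *mse[n])
```

For part 2, plot `mse[n]` against `n` (e.g. with matplotlib); for part 3, swap in `a, b = 50, 50` and rerun.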
