23: Statistics — MLE & MAP Estimation
📚 Materials
YouTube links in this section were auto-extracted. If you spot a mistake, please let me know!
Lecture
Practical
🏡 Homework
1) Method of Moments (Lecture 5)
01 Gamma MoM
A machine produces parts whose lifetimes \(X_1, \ldots, X_n\) are modeled as \(\mathrm{Gamma}(\alpha, \beta)\) with density \[f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0.\] The population mean is \(\mathbb{E}[X] = \alpha/\beta\) and the population variance is \(\operatorname{Var}(X) = \alpha/\beta^2\).
- Set the first two population moments equal to their sample counterparts and solve for \(\hat{\alpha}_{\mathrm{MoM}}\) and \(\hat{\beta}_{\mathrm{MoM}}\) in terms of \(\bar{X}\) and \(S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\).
- You observe \(n = 100\) lifetimes with \(\bar{X} = 4.2\) and \(S^2 = 8.82\). Compute \(\hat{\alpha}\) and \(\hat{\beta}\).
- Can MoM ever give \(\hat{\alpha} < 0\) or \(\hat{\beta} < 0\)? Under what conditions? What does this tell us about a limitation of MoM?
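A minimal numeric sketch for part (b), assuming the part (a) solution comes out to \(\hat{\alpha} = \bar{X}^2/S^2\) and \(\hat{\beta} = \bar{X}/S^2\):

```python
# Gamma MoM: matching E[X] = alpha/beta and Var(X) = alpha/beta^2 to the
# sample moments gives alpha_hat = xbar^2 / s2 and beta_hat = xbar / s2.
xbar, s2 = 4.2, 8.82   # values from part (b)

alpha_hat = xbar**2 / s2
beta_hat = xbar / s2
print(alpha_hat, beta_hat)  # alpha_hat ≈ 2.0, beta_hat ≈ 0.476
```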
02 MoM vs MLE for the Uniform
Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}(0, \theta)\) where \(\theta > 0\) is unknown.
- Compute the MoM estimator \(\hat{\theta}_{\mathrm{MoM}}\) using the first moment \(\mathbb{E}[X] = \theta/2\).
- Compute the MLE \(\hat{\theta}_{\mathrm{MLE}} = X_{(n)} = \max(X_1, \ldots, X_n)\). (Show this by writing the likelihood and arguing about where it is maximized.)
- Compute the MSE of both estimators. You may use the facts that \(\mathbb{E}[X_{(n)}] = \frac{n}{n+1}\theta\) and \(\operatorname{Var}(X_{(n)}) = \frac{n\theta^2}{(n+1)^2(n+2)}\).
- Which estimator has smaller MSE? Does the answer surprise you, given that MLE is usually “optimal”?
Hint for (d): the support of the Uniform depends on \(\theta\), so the regularity conditions behind the standard MLE optimality results (the Cramér–Rao bound, asymptotic efficiency) do not hold here.
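A quick simulation sketch for part (c)'s comparison (\(\theta = 1\) and \(n = 20\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 1.0, 20, 20_000

x = rng.uniform(0, theta, size=(reps, n))
mse_mom = np.mean((2 * x.mean(axis=1) - theta) ** 2)  # theory: theta^2 / (3n)
mse_mle = np.mean((x.max(axis=1) - theta) ** 2)       # theory: 2 theta^2 / ((n+1)(n+2))
print(mse_mom, mse_mle)  # the (biased!) MLE wins by a wide margin
```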
2) Maximum Likelihood Estimation (Lecture 5)
03 Geometric MLE & Efficiency
Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Geometric}(p)\) with PMF \(f(x \mid p) = (1-p)^{x-1}p\) for \(x = 1, 2, \ldots\)
- Write the log-likelihood \(\ell(p)\) and derive the MLE \(\hat{p}_{\mathrm{MLE}}\).
- Compute the Fisher information \(I(p)\) for one observation.
- Using the Cramér–Rao bound, what is the smallest possible variance for any unbiased estimator of \(p\)?
- Is \(\hat{p}_{\mathrm{MLE}}\) unbiased? Is it asymptotically efficient?
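A simulation sketch for checking parts (b)–(d) numerically (\(p = 0.3\) and \(n = 200\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 0.3, 200, 10_000

# rng.geometric uses the same "number of trials" convention, support {1, 2, ...}
x = rng.geometric(p, size=(reps, n))
p_mle = 1 / x.mean(axis=1)          # MLE from part (a): 1 / sample mean

crlb = p**2 * (1 - p) / n           # Cramér–Rao bound for n observations
print(p_mle.var(), crlb)            # nearly equal: asymptotic efficiency
```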
04 Normal MLE Meets MoM
Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)\) where both \(\mu\) and \(\sigma^2\) are unknown.
- Derive the MLE for \(\mu\). (You already know the answer — just verify it.)
- Derive the MLE for \(\sigma^2\) by differentiating \(\ell(\mu, \sigma^2)\) with respect to \(\sigma^2\) and setting it to zero. Show all intermediate steps.
- Show that the MLE \(\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\) coincides with the MoM estimator. Why is this not a coincidence? (Hint: for exponential families, MLE and MoM agree when both use the natural sufficient statistics.)
- Is \(\hat{\sigma}^2_{\mathrm{MLE}}\) unbiased? If not, what is its bias, and how does it compare to the Bessel-corrected \(S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2\)?
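A simulation sketch for part (d), showing the \((n-1)/n\) bias directly (\(\mu = 0\), \(\sigma^2 = 4\), \(n = 10\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 0.0, 4.0, 10, 50_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
var_mle = x.var(axis=1, ddof=0)       # MLE: divide by n
var_bessel = x.var(axis=1, ddof=1)    # Bessel-corrected: divide by n - 1

print(var_mle.mean())     # ≈ (n-1)/n * sigma2 = 3.6, biased low
print(var_bessel.mean())  # ≈ sigma2 = 4.0, unbiased
```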
05 MLE and Cross-Entropy
In binary classification, we model \(Y_i \in \{0, 1\}\) as \(Y_i \mid \mathbf{x}_i \sim \mathrm{Bernoulli}\!\big(\sigma(\mathbf{x}_i^\top \mathbf{w})\big)\) where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function.
- Write the log-likelihood \(\ell(\mathbf{w})\) for observations \((y_1, \mathbf{x}_1), \ldots, (y_n, \mathbf{x}_n)\).
- Show that maximizing \(\ell(\mathbf{w})\) is equivalent to minimizing the binary cross-entropy loss: \[\mathcal{L}_{\mathrm{CE}} = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big]\] where \(\hat{p}_i = \sigma(\mathbf{x}_i^\top \mathbf{w})\).
- Similarly, recall from the lecture that for \(Y_i \sim N(f(\mathbf{x}_i; \mathbf{w}),\, \sigma^2)\), maximizing the log-likelihood is equivalent to minimizing MSE. Fill in the following table:
| Noise model | MLE objective | Equivalent ML loss |
|---|---|---|
| Gaussian | max \(\ell(\mathbf{w})\) | ? |
| Bernoulli | max \(\ell(\mathbf{w})\) | ? |
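To check the equivalence in part (b) numerically, a sketch with hypothetical random data (the `X`, `y`, `w` below are made up just to test the identity):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical data and weights, only to verify the identity numerically.
X = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=8)
w = rng.normal(size=3)

p = sigmoid(X @ w)

# Bernoulli log-likelihood: sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]
loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Binary cross-entropy as defined above
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Maximizing loglik is minimizing BCE: -loglik / n == BCE exactly.
print(np.isclose(-loglik / len(y), bce))
```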
06 Invariance in Action
Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(p)\).
- The MLE for \(p\) is \(\hat{p} = \bar{X}\). Using the invariance property of MLE, immediately write down the MLE for:
- The odds: \(\psi = p / (1 - p)\)
- The log-odds: \(\eta = \log\!\big(p/(1-p)\big)\)
- The variance of \(X\): \(v = p(1-p)\)
- Compute the MoM estimator for the odds \(\psi = p/(1-p)\). Is it the same as the MLE from (a-i)?
- What property does MLE have that MoM lacks in this context?
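A sketch of part (a) in code: invariance means we simply plug \(\hat{p}\) into each function (\(p = 0.6\), \(n = 500\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
p_true, n = 0.6, 500

x = rng.binomial(1, p_true, size=n)
p_hat = x.mean()                      # MLE of p

# Invariance: the MLE of g(p) is g(p_hat), no new optimization needed.
odds_hat = p_hat / (1 - p_hat)        # MLE of the odds
logodds_hat = np.log(odds_hat)        # MLE of the log-odds
var_hat = p_hat * (1 - p_hat)         # MLE of Var(X) = p(1 - p)
print(odds_hat, logodds_hat, var_hat)
```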
3) MAP Estimation & Bayesian Inference (Lecture 6)
07 Normal MAP: Shrinkage in Action
Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma_0^2)\) with known \(\sigma_0^2\), and put a prior \(\mu \sim N(m, \tau^2)\).
- Write the log-posterior \(\log P(\mu \mid \text{data}) = \ell(\mu) + \log P(\mu) + \text{const}\).
- Take the derivative with respect to \(\mu\), set it to zero, and derive the MAP estimator. Show it equals: \[\hat{\mu}_{\mathrm{MAP}} = w \cdot \bar{X} + (1-w) \cdot m, \qquad w = \frac{n/\sigma_0^2}{n/\sigma_0^2 + 1/\tau^2}.\]
- Interpret \(w\) as a weight. What happens to \(\hat{\mu}_{\mathrm{MAP}}\) as:
- \(n \to \infty\) (lots of data)?
- \(\tau^2 \to \infty\) (very vague prior)?
- \(\tau^2 \to 0\) (very strong prior)?
- A medical device is calibrated to \(m = 100\)°C (prior mean). You take \(n = 5\) readings with \(\bar{X} = 103\)°C, knowing \(\sigma_0 = 2\)°C. Compute \(\hat{\mu}_{\mathrm{MAP}}\) for \(\tau = 1\)°C and \(\tau = 10\)°C. Interpret the difference.
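Part (d) can be checked with a few lines (`normal_map` is a hypothetical helper implementing the formula from part (b)):

```python
# MAP for the temperature example: m = 100, n = 5, xbar = 103, sigma0 = 2.
def normal_map(xbar, n, sigma0, m, tau):
    w = (n / sigma0**2) / (n / sigma0**2 + 1 / tau**2)  # weight on the data
    return w * xbar + (1 - w) * m

map_strong = normal_map(103, 5, 2, 100, tau=1)    # strong prior pulls toward 100
map_vague = normal_map(103, 5, 2, 100, tau=10)    # vague prior stays near xbar
print(map_strong, map_vague)  # ≈ 101.67 vs ≈ 102.98
```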
08 Beta-Binomial MAP: How Much Does the Prior Pull?
A coin is flipped \(n = 20\) times and lands heads \(k = 14\) times.
- Compute the MLE \(\hat{p}_{\mathrm{MLE}}\).
- Compute the MAP estimate under each of the following priors, using the Beta-Binomial conjugacy formula \(\hat{p}_{\mathrm{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}\):
- \(\mathrm{Beta}(1, 1)\) — uniform (flat) prior
- \(\mathrm{Beta}(5, 5)\) — weakly informative
- \(\mathrm{Beta}(50, 50)\) — strongly informative (centered at 0.5)
- Which prior produces the MAP estimate closest to the MLE? Which one “pulls” the most toward 0.5? Explain why in terms of pseudo-observations.
- How many real coin flips would you need before the \(\mathrm{Beta}(50, 50)\) prior becomes negligible? (Think about when the pseudo-observations are \(< 10\%\) of the total.)
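A quick numeric sketch for part (b), using the conjugacy formula given above (`beta_map` is a hypothetical helper name):

```python
# MAP under the three priors: k = 14 heads in n = 20 flips.
k, n = 14, 20
p_mle = k / n   # 0.7

def beta_map(alpha, beta, k, n):
    # Posterior is Beta(alpha + k, beta + n - k); the MAP is its mode.
    return (alpha + k - 1) / (alpha + beta + n - 2)

for a, b in [(1, 1), (5, 5), (50, 50)]:
    print(a, b, beta_map(a, b, k, n))  # 0.7, then ≈ 0.643, then ≈ 0.534
```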
09 Deriving Ridge Regression from MAP
Consider the linear model \(y_i = \mathbf{x}_i^\top\boldsymbol{\theta} + \varepsilon_i\) with \(\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)\) and a Gaussian prior \(\theta_j \overset{\text{i.i.d.}}{\sim} N(0, \tau^2)\).
- Write the MAP objective: \(\max_{\boldsymbol{\theta}} \big[\ell(\boldsymbol{\theta}) + \log P(\boldsymbol{\theta})\big]\). Convert to a minimization problem.
- Show that the MAP objective simplifies to: \[\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \arg\min_{\boldsymbol{\theta}} \Big[\|\mathbf{y} - X\boldsymbol{\theta}\|^2 + \lambda \|\boldsymbol{\theta}\|^2\Big], \qquad \lambda = \frac{\sigma^2}{\tau^2}.\]
- Set the gradient to zero and derive the closed-form solution \(\hat{\boldsymbol{\theta}} = (X^\top X + \lambda I)^{-1}X^\top \mathbf{y}\).
- Why does Ridge regression always have a unique solution, even when \(X^\top X\) is singular? (Hint: what are the eigenvalues of \(X^\top X + \lambda I\)?)
- What happens to \(\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}\) when \(\lambda \to 0\)? When \(\lambda \to \infty\)? Interpret in terms of the prior.
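A numeric illustration of part (d): with a duplicated column, \(X^\top X\) is singular, yet the Ridge system still solves (the data below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam = 30, 1.0

# Duplicated column makes X^T X rank-deficient, so OLS is ill-posed.
x0 = rng.normal(size=n)
X = np.column_stack([x0, x0, rng.normal(size=n)])   # rank 2, not 3
y = X @ np.array([1.0, 1.0, -2.0]) + 0.1 * rng.normal(size=n)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # 2 < 3: singular

# Ridge closed form from part (c): eigenvalues of XtX + lam*I are >= lam > 0,
# so the matrix is always invertible and the solution is unique.
theta_ridge = np.linalg.solve(XtX + lam * np.eye(3), X.T @ y)
print(theta_ridge)
```

Note that Ridge splits the weight evenly between the two identical columns, which is exactly the symmetry the \(\ell_2\) penalty prefers.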
10 Laplace Prior & Sparsity
Now replace the Gaussian prior with a Laplace prior: \(P(\theta_j) \propto \exp\!\big(-|\theta_j|/b\big)\).
- Show that the MAP objective becomes: \[\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \arg\min_{\boldsymbol{\theta}} \Big[\|\mathbf{y} - X\boldsymbol{\theta}\|^2 + \lambda \|\boldsymbol{\theta}\|_1\Big]\] and express \(\lambda\) in terms of \(\sigma^2\) and \(b\).
- Unlike Ridge, the Lasso does not have a closed-form solution in general. However, for the special case of one parameter (\(p = 1\)) with \(X^\top X = I\) (orthonormal design), the solution is the soft-thresholding operator: \[\hat{\theta}_j^{\mathrm{Lasso}} = \mathrm{sign}(\hat{\theta}_j^{\mathrm{OLS}})\max\!\big(|\hat{\theta}_j^{\mathrm{OLS}}| - \lambda/2,\; 0\big).\] Verify this by sketching the Lasso objective for \(p = 1\) and finding where the derivative is zero (careful: \(|\theta|\) is not differentiable at 0!).
- Explain geometrically (using the diamond vs circle picture from the lecture) why the Lasso tends to produce exact zeros while Ridge does not.
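The soft-thresholding operator from part (b) in code, to see the exact zeros appear:

```python
import numpy as np

def soft_threshold(theta_ols, lam):
    """Lasso solution for a single parameter under an orthonormal design."""
    return np.sign(theta_ols) * max(abs(theta_ols) - lam / 2, 0)

# Small OLS coefficients are snapped exactly to zero: the source of sparsity.
print(soft_threshold(3.0, 2.0))    # shrinks 3.0 down to 2.0
print(soft_threshold(-0.4, 2.0))   # inside the threshold: exact zero
```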
4) Simulation & Comparison
11 MLE vs MAP Shootout
Write a Python simulation to compare MLE and MAP for estimating a Bernoulli parameter.
- Setup: The true probability is \(p_{\mathrm{true}} = 0.3\). Use a \(\mathrm{Beta}(3, 7)\) prior (centered at the truth). For each sample size \(n \in \{5, 10, 20, 50, 100, 500\}\):
- Generate 10,000 datasets of size \(n\).
- Compute \(\hat{p}_{\mathrm{MLE}} = k/n\) and \(\hat{p}_{\mathrm{MAP}} = \frac{k + 3 - 1}{n + 3 + 7 - 2}\) for each.
- Record the MSE of both estimators.
- Plot MSE vs \(n\) for both estimators on the same graph. At what sample size does MLE start to match MAP?
- Repeat with a misspecified prior \(\mathrm{Beta}(50, 50)\) (centered at 0.5, far from \(p_{\mathrm{true}} = 0.3\)). What happens to the MAP estimator’s MSE for small \(n\)? For large \(n\)? Is MAP always better than MLE?
- What lesson does this teach about the choice of prior?
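One possible skeleton for the simulation (a sketch only: the plotting and the misspecified-prior rerun in parts (b)–(c) are left to you):

```python
import numpy as np

rng = np.random.default_rng(6)
p_true, a, b, reps = 0.3, 3, 7, 10_000

results = {}
for n in [5, 10, 20, 50, 100, 500]:
    k = rng.binomial(n, p_true, size=reps)        # one heads count per dataset
    p_mle = k / n
    p_map = (k + a - 1) / (n + a + b - 2)         # Beta(a, b) posterior mode
    results[n] = (np.mean((p_mle - p_true) ** 2), # MSE of MLE
                  np.mean((p_map - p_true) ** 2)) # MSE of MAP
    print(n, results[n])
```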