22: Statistics — Estimator Properties, Fisher Info, Cramér-Rao
📚 Material
Lecture
Practice
🏡 Homework
1) Exponential Family & Sufficiency
01 Poisson Meets the Exponential Family
Let \(X\) be a random variable following a Poisson distribution with parameter \(\lambda > 0\), i.e., \(P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\) for \(k = 0, 1, 2, \ldots\)
- Show that the Poisson distribution belongs to the exponential family by writing its PMF in the form \[f(x \mid \lambda) = h(x) \exp\!\big(\eta(\lambda)\, T(x) - A(\lambda)\big).\] Identify \(h(x)\), \(\eta(\lambda)\), \(T(x)\), and \(A(\lambda)\).
- Using the exponential family form, what is the sufficient statistic for \(\lambda\) based on an i.i.d. sample \(X_1, \ldots, X_n\)?
a) Rewrite the PMF by exponentiating the log:
\[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{1}{k!} \exp\!\big(k \log \lambda - \lambda\big).\]
Matching \(h(x) \exp(\eta(\lambda) T(x) - A(\lambda))\):
\[h(x) = \frac{1}{x!}, \qquad \eta(\lambda) = \log \lambda, \qquad T(x) = x, \qquad A(\lambda) = \lambda.\]
So Poisson is a one-parameter exponential family with natural parameter \(\log \lambda\).
b) For an i.i.d. sample, the joint PMF factors as
\[\left(\prod_{i=1}^n \frac{1}{x_i!}\right) \exp\!\Big(\log \lambda \cdot \sum_{i=1}^n x_i - n\lambda\Big).\]
By the Fisher-Neyman factorization theorem, the sufficient statistic is \(\boxed{T(\mathbf{X}) = \sum_{i=1}^n X_i}\).
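As a quick numerical sanity check of the decomposition in (a), the sketch below compares the Poisson PMF with \(h(x)\exp(\eta(\lambda) T(x) - A(\lambda))\) on a few values. The function names and the test values of \(\lambda\) are illustrative choices, not part of the problem.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson PMF written directly: lam^k * e^{-lam} / k!."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def exp_family_pmf(k: int, lam: float) -> float:
    """Same PMF in exponential-family form h(x) * exp(eta * T(x) - A)."""
    h = 1.0 / math.factorial(k)   # h(x) = 1/x!
    eta = math.log(lam)           # natural parameter eta(lambda) = log(lambda)
    T = k                         # sufficient statistic T(x) = x
    A = lam                       # A(lambda) = lambda
    return h * math.exp(eta * T - A)

# Largest absolute disagreement over a small grid of (lambda, k) pairs.
max_gap = max(
    abs(poisson_pmf(k, lam) - exp_family_pmf(k, lam))
    for lam in (0.3, 1.0, 4.5)
    for k in range(20)
)
print(max_gap)
```

The two forms should agree up to floating-point rounding, which is all this check can confirm; the identification of \(h\), \(\eta\), \(T\), \(A\) still rests on the algebra above.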
02 Slit Width Estimation
In an experiment, \(n\) drops of solution are released uniformly through a slit onto a surface. We model the one-dimensional impact points \(X_1, \ldots, X_n\) as i.i.d. \(\mathrm{Uniform}(0, d)\), where the unknown slit width \(d > 0\) is to be estimated.
- Write down the joint density \(f(\mathbf{x} \mid d)\) for the sample.
- Using the Fisher–Neyman factorization theorem, show that \(X_{(n)} = \max\{X_1, \ldots, X_n\}\) is sufficient for \(d\).
- Is \(X_{(n)}\) unbiased for \(d\)? If not, find an unbiased estimator based on \(X_{(n)}\).
Hint for (c): the CDF of \(X_{(n)}\) is \(F_{X_{(n)}}(x) = (x/d)^n\) for \(0 \le x \le d\).
a) Each \(X_i\) has density \(\frac{1}{d} \mathbf{1}\{0 \le x_i \le d\}\), so
\[f(\mathbf{x} \mid d) = d^{-n} \prod_{i=1}^n \mathbf{1}\{0 \le x_i \le d\} = d^{-n} \cdot \mathbf{1}\{x_{(1)} \ge 0\} \cdot \mathbf{1}\{x_{(n)} \le d\}.\]
b) Split the joint density into a \(d\)-dependent piece and a data-only piece:
\[f(\mathbf{x} \mid d) = \underbrace{d^{-n} \mathbf{1}\{x_{(n)} \le d\}}_{g(T(\mathbf{x}), d)} \cdot \underbrace{\mathbf{1}\{x_{(1)} \ge 0\}}_{h(\mathbf{x})}.\]
The \(d\)-dependent factor depends on the data only through \(T(\mathbf{x}) = x_{(n)}\). By the Fisher-Neyman factorization theorem, \(X_{(n)}\) is sufficient for \(d\).
c) Differentiating the CDF from the hint gives the density \(f_{X_{(n)}}(x) = n x^{n-1}/d^n\) for \(0 \le x \le d\). Then
\[\mathbb{E}[X_{(n)}] = \int_0^d x \cdot \frac{n x^{n-1}}{d^n} \, dx = \frac{n}{d^n} \cdot \frac{d^{n+1}}{n+1} = \frac{n}{n+1} d.\]
So \(X_{(n)}\) is biased (it underestimates \(d\)). The unbiased estimator is
\[\boxed{\hat{d} = \frac{n+1}{n} X_{(n)}}.\]
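The bias of \(X_{(n)}\) and its correction can also be seen in a small Monte Carlo experiment. This is only a sketch: the values \(d = 2\), \(n = 5\), the number of replications and the seed are arbitrary assumptions chosen for illustration.

```python
import random

def simulate_max_estimators(d: float, n: int, reps: int, seed: int = 0):
    """Monte Carlo means of X_(n) and of the corrected estimator (n+1)/n * X_(n)."""
    rng = random.Random(seed)
    total_max = 0.0
    for _ in range(reps):
        # One sample of n impact points, keep only the largest.
        total_max += max(rng.uniform(0.0, d) for _ in range(n))
    mean_max = total_max / reps
    return mean_max, (n + 1) / n * mean_max

d, n = 2.0, 5   # illustrative values, not from the problem
mean_max, mean_corrected = simulate_max_estimators(d, n, reps=100_000)

print(mean_max, n / (n + 1) * d)   # simulated mean of X_(n) vs. n/(n+1) * d
print(mean_corrected, d)           # corrected estimator vs. true d
```

With enough replications the first pair of numbers should be close to each other and noticeably below \(d\), while the second pair should be close to \(d\).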
03 Normal Variance: Minimal Sufficiency
Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)\) where \(\sigma^2 > 0\) is unknown but \(\mu\) is known.
- Show that \(T(\mathbf{X}) = \sum_{i=1}^{n}(X_i - \mu)^2\) is sufficient for \(\sigma^2\) using the factorization theorem.
- Using the likelihood ratio criterion, show that \(T(\mathbf{X})\) is minimal sufficient for \(\sigma^2\).
Recall: \(T\) is minimal sufficient if, for all sample points \(\mathbf{x}\) and \(\mathbf{y}\), \(T(\mathbf{x}) = T(\mathbf{y})\) \(\Longleftrightarrow\) \(\frac{f(\mathbf{x} \mid \sigma^2)}{f(\mathbf{y} \mid \sigma^2)}\) is free of \(\sigma^2\).
a) The joint density is
\[f(\mathbf{x} \mid \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right).\]
Write this as
\[f(\mathbf{x} \mid \sigma^2) = \underbrace{(2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{T(\mathbf{x})}{2\sigma^2}\right)}_{g(T(\mathbf{x}), \sigma^2)} \cdot \underbrace{1}_{h(\mathbf{x})}\]
with \(T(\mathbf{x}) = \sum_i (x_i - \mu)^2\) (recall \(\mu\) is known, so \(T\) is a function of the data alone). By the Fisher-Neyman factorization theorem, \(T(\mathbf{X})\) is sufficient for \(\sigma^2\).
b) Form the likelihood ratio:
\[\frac{f(\mathbf{x} \mid \sigma^2)}{f(\mathbf{y} \mid \sigma^2)} = \exp\!\left(-\frac{T(\mathbf{x}) - T(\mathbf{y})}{2\sigma^2}\right).\]
This ratio is free of \(\sigma^2\) if and only if \(T(\mathbf{x}) - T(\mathbf{y}) = 0\), i.e., \(T(\mathbf{x}) = T(\mathbf{y})\). By the ratio characterization of minimal sufficiency, \(\boxed{T(\mathbf{X}) = \sum_i (X_i - \mu)^2}\) is minimal sufficient for \(\sigma^2\).
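The ratio criterion can be illustrated numerically: two samples with the same \(T\) give a log-likelihood difference of zero for every \(\sigma^2\), while samples with different \(T\) give a difference that changes with \(\sigma^2\). The samples `x`, `y`, `z` and the value \(\mu = 1\) below are made up for the illustration.

```python
import math

def normal_loglik(xs, mu, sigma2):
    """Log of the joint N(mu, sigma2) density of the sample xs (mu known)."""
    n = len(xs)
    t = sum((x - mu) ** 2 for x in xs)   # T(x) = sum (x_i - mu)^2
    return -0.5 * n * math.log(2 * math.pi * sigma2) - t / (2 * sigma2)

mu = 1.0
x = [4.0, 5.0]   # T(x) = 9 + 16 = 25
y = [6.0, 1.0]   # T(y) = 25 + 0 = 25, same as T(x)
z = [2.0, 2.0]   # T(z) = 1 + 1 = 2, different from T(x)

same_t_gaps = []   # log f(x) - log f(y) for each sigma^2
diff_t_gaps = []   # log f(x) - log f(z) for each sigma^2
for sigma2 in (0.5, 1.0, 4.0):
    same_t_gaps.append(normal_loglik(x, mu, sigma2) - normal_loglik(y, mu, sigma2))
    diff_t_gaps.append(normal_loglik(x, mu, sigma2) - normal_loglik(z, mu, sigma2))

print(same_t_gaps)   # constant in sigma^2
print(diff_t_gaps)   # varies with sigma^2
```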
04 Binomial Sufficiency and Estimating \(\pi^2\)
Let \(\mathbf{X} = (X_1, \ldots, X_n)^\top\) with \(X_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\pi)\), where \(\pi \in (0, 1)\). Define \(U(\mathbf{X}) = \sum_{i=1}^{n} X_i\).
- Show that \(U(\mathbf{X})/n\) is unbiased for \(\pi\).
- Show that \(U(\mathbf{X})\) is minimal sufficient for \(\pi\).
- Now consider the estimator for \(\pi^2\): \[V(\mathbf{X}) = \frac{U(\mathbf{X})\left[U(\mathbf{X}) - 1\right]}{n(n-1)}.\] Verify that \(V(\mathbf{X})\) is unbiased for \(\pi^2\).
Hint for (c): expand \(\mathbb{E}[U(U-1)]\) using \(\mathbb{E}[U] = n\pi\) and \(\operatorname{Var}(U) = n\pi(1-\pi)\).
a) Each \(X_i\) has \(\mathbb{E}[X_i] = \pi\), so
\[\mathbb{E}\!\left[\frac{U(\mathbf{X})}{n}\right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] = \pi.\]
b) The joint PMF is
\[f(\mathbf{x} \mid \pi) = \prod_{i=1}^n \pi^{x_i}(1-\pi)^{1-x_i} = \pi^{u}(1-\pi)^{n-u}, \qquad u = \sum_i x_i.\]
This depends on the data only through \(u = U(\mathbf{x})\), so by Fisher-Neyman, \(U\) is sufficient. For minimality, the ratio
\[\frac{f(\mathbf{x} \mid \pi)}{f(\mathbf{y} \mid \pi)} = \left(\frac{\pi}{1-\pi}\right)^{U(\mathbf{x}) - U(\mathbf{y})}\]
is free of \(\pi\) iff \(U(\mathbf{x}) = U(\mathbf{y})\). So \(\boxed{U(\mathbf{X})}\) is minimal sufficient for \(\pi\).
c) Using \(\operatorname{Var}(U) = \mathbb{E}[U^2] - \mathbb{E}[U]^2\), we get \(\mathbb{E}[U^2] = n\pi(1-\pi) + (n\pi)^2\). Then
\[\mathbb{E}[U(U-1)] = \mathbb{E}[U^2] - \mathbb{E}[U] = n\pi(1-\pi) + n^2\pi^2 - n\pi = n^2\pi^2 - n\pi^2 = n(n-1)\pi^2.\]
Therefore
\[\mathbb{E}[V(\mathbf{X})] = \frac{n(n-1)\pi^2}{n(n-1)} = \pi^2,\]
so \(V\) is unbiased for \(\pi^2\).
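Because \(U \sim \mathrm{Binomial}(n, \pi)\) takes finitely many values, the unbiasedness in (c) can be checked exactly by summing over the binomial PMF with rational arithmetic. The helper name `expected_V` and the value \(\pi = 3/10\) are assumptions made for this sketch.

```python
from fractions import Fraction
from math import comb

def expected_V(n: int, pi: Fraction) -> Fraction:
    """Exact E[V] where U ~ Binomial(n, pi) and V = U(U-1) / (n(n-1))."""
    total = Fraction(0)
    for u in range(n + 1):
        prob = comb(n, u) * pi**u * (1 - pi) ** (n - u)   # P(U = u)
        total += prob * Fraction(u * (u - 1), n * (n - 1))
    return total

pi = Fraction(3, 10)   # illustrative value
for n in (2, 5, 12):
    print(n, expected_V(n, pi), pi**2)
```

Using `Fraction` avoids rounding error, so the two printed values in each row can be compared for exact equality rather than approximate agreement.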
2) Fisher Information & Cramér–Rao
05 Fisher Information for the Exponential
Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(\lambda)\) with density \(f(x \mid \lambda) = \lambda e^{-\lambda x}\) for \(x \geq 0\).
- Compute the score function \(s(\lambda) = \frac{\partial}{\partial \lambda} \log f(X \mid \lambda)\).
- Verify that \(\mathbb{E}[s(\lambda)] = 0\).
- Compute the Fisher information \(I(\lambda) = \operatorname{Var}[s(\lambda)]\).
- Verify your answer in (c) by computing \(I(\lambda)\) via the second-derivative formula: \(I(\lambda) = -\mathbb{E}\!\left[\frac{\partial^2}{\partial\lambda^2}\log f(X \mid \lambda)\right]\).
- The MLE for \(\lambda\) is \(\hat{\lambda} = 1/\bar{X}\), which has \(\operatorname{Var}(\hat{\lambda}) \approx \lambda^2/n\) for large \(n\). Compare this with the Cramér–Rao lower bound. Is \(\hat{\lambda}\) asymptotically efficient?
a) \(\log f(x \mid \lambda) = \log \lambda - \lambda x\). Differentiating with respect to \(\lambda\):
\[s(\lambda) = \frac{\partial}{\partial \lambda} \log f(X \mid \lambda) = \frac{1}{\lambda} - X.\]
b) Since \(X \sim \mathrm{Exp}(\lambda)\) has \(\mathbb{E}[X] = 1/\lambda\),
\[\mathbb{E}[s(\lambda)] = \frac{1}{\lambda} - \mathbb{E}[X] = \frac{1}{\lambda} - \frac{1}{\lambda} = 0.\]
c) Because \(\mathbb{E}[s] = 0\), \(I(\lambda) = \operatorname{Var}[s(\lambda)] = \operatorname{Var}(1/\lambda - X) = \operatorname{Var}(X) = 1/\lambda^2\). So
\[\boxed{I(\lambda) = \frac{1}{\lambda^2}}.\]
d) From \(s(\lambda) = 1/\lambda - X\),
\[\frac{\partial^2}{\partial \lambda^2} \log f(X \mid \lambda) = -\frac{1}{\lambda^2}.\]
Taking \(-\mathbb{E}\) of a constant gives \(I(\lambda) = 1/\lambda^2\), matching part (c).
e) The Cramér-Rao lower bound for unbiased estimators of \(\lambda\) based on \(n\) i.i.d. observations is
\[\operatorname{Var}(\hat{\lambda}) \ge \frac{1}{n \cdot I(\lambda)} = \frac{\lambda^2}{n}.\]
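The MLE has \(\operatorname{Var}(\hat{\lambda}) \approx \lambda^2/n\), which matches the bound asymptotically. So \(\hat{\lambda} = 1/\bar{X}\) is asymptotically efficient. (For finite \(n\) the MLE is biased, \(\mathbb{E}[\hat{\lambda}] = \frac{n}{n-1}\lambda\) for \(n \ge 2\), so the comparison with the bound for unbiased estimators is only meaningful as \(n \to \infty\).)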
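Parts (b), (c) and (e) can be checked by simulation. The sketch below estimates the mean and variance of the score from single draws, then compares the sampling variance of \(1/\bar{X}\) with \(\lambda^2/n\). The values \(\lambda = 2\), \(n = 200\), the replication counts and the seed are illustrative assumptions.

```python
import random

def score(x: float, lam: float) -> float:
    """Score of one Exp(lam) observation: d/d(lam) log f(x | lam) = 1/lam - x."""
    return 1.0 / lam - x

def mean_and_var(values):
    """Sample mean and variance (dividing by len(values)) of a list of floats."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

lam, n, reps = 2.0, 200, 5_000   # illustrative values
rng = random.Random(0)

# (b), (c): the score should have mean near 0 and variance near 1/lam^2.
# random.expovariate takes the rate lam, matching the density lam * exp(-lam * x).
scores = [score(rng.expovariate(lam), lam) for _ in range(100_000)]
score_mean, score_var = mean_and_var(scores)

# (e): sampling variance of the MLE 1/X-bar versus the bound lam^2 / n.
mles = []
for _ in range(reps):
    sample = [rng.expovariate(lam) for _ in range(n)]
    mles.append(n / sum(sample))   # 1 / X-bar
_, mle_var = mean_and_var(mles)
crlb = lam**2 / n

print(score_mean, score_var, 1 / lam**2)
print(mle_var, crlb, mle_var / crlb)
```

For large \(n\) the last ratio should be close to 1; it is a Monte Carlo estimate, so small deviations in either direction are expected.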
06 Cramér–Rao: When Can We Beat \(1/n\)?
Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(p)\).
- We know \(I(p) = \frac{1}{p(1-p)}\). Write down the Cramér–Rao lower bound for any unbiased estimator of \(p\).
- The sample proportion \(\hat{p} = \bar{X}\) has \(\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{n}\). Is \(\hat{p}\) efficient?
- Now consider estimating \(g(p) = p^2\) instead of \(p\). The Cramér–Rao bound for unbiased estimators of \(g(\theta)\) is \[\operatorname{Var}(\hat{g}) \geq \frac{[g'(\theta)]^2}{n \cdot I(\theta)}.\] Compute the CR bound for unbiased estimators of \(p^2\).
- We showed in Problem 04(c) that \(V(\mathbf{X}) = \frac{U(U-1)}{n(n-1)}\) is unbiased for \(p^2\). Compute \(\operatorname{Var}(V)\) (at least for large \(n\)). Does it achieve the CR bound?
a) With \(n\) i.i.d. observations,
\[\operatorname{Var}(\hat{p}) \ge \frac{1}{n \cdot I(p)} = \boxed{\frac{p(1-p)}{n}}.\]
b) \(\operatorname{Var}(\hat{p}) = p(1-p)/n\) exactly equals the CR bound, so \(\hat{p}\) is efficient (uniformly in \(p\), not just asymptotically).
c) Here \(g(p) = p^2\), so \(g'(p) = 2p\). The CR bound for unbiased estimators of \(p^2\) is
\[\operatorname{Var}(\hat{g}) \ge \frac{[g'(p)]^2}{n \cdot I(p)} = \frac{4p^2 \cdot p(1-p)}{n} = \boxed{\frac{4p^3(1-p)}{n}}.\]
d) Note \(V = \frac{U(U-1)}{n(n-1)} = \frac{n}{n-1} \hat{p}^2 - \frac{1}{n-1}\hat{p}\). For large \(n\), \(V \approx \hat{p}^2\).
To get \(\operatorname{Var}(\hat{p}^2)\) for large \(n\), use a first-order Taylor expansion of \(\phi(p) = p^2\) around the true \(p\). Since \(\hat{p}\) concentrates near \(p\) (its variance is \(O(1/n)\)), we can write
\[\hat{p}^2 \approx p^2 + 2p \cdot (\hat{p} - p),\]
so the random part of \(\hat{p}^2\) is approximately \(2p \cdot (\hat{p} - p)\). Therefore
\[\operatorname{Var}(\hat{p}^2) \approx (2p)^2 \cdot \operatorname{Var}(\hat{p}) = 4p^2 \cdot \frac{p(1-p)}{n} = \frac{4p^3(1-p)}{n}.\]
This matches the Cramér-Rao bound from (c) asymptotically, so \(V\) is asymptotically efficient for \(p^2\). (For finite \(n\), \(V\) does not exactly achieve the bound.)
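The finite-\(n\) gap can be made concrete by computing \(\operatorname{Var}(V)\) exactly from the binomial PMF and dividing by the bound from (c). This is a sketch with an arbitrary choice \(p = 3/10\); the helper names `var_V` and `cr_bound` are hypothetical.

```python
from fractions import Fraction
from math import comb

def var_V(n: int, p: Fraction) -> Fraction:
    """Exact Var(V) for V = U(U-1)/(n(n-1)) with U ~ Binomial(n, p)."""
    mean = Fraction(0)
    second = Fraction(0)
    for u in range(n + 1):
        prob = comb(n, u) * p**u * (1 - p) ** (n - u)   # P(U = u)
        v = Fraction(u * (u - 1), n * (n - 1))
        mean += prob * v
        second += prob * v * v
    return second - mean**2

def cr_bound(n: int, p: Fraction) -> Fraction:
    """Cramér-Rao bound 4 p^3 (1 - p) / n for unbiased estimators of p^2."""
    return 4 * p**3 * (1 - p) / n

p = Fraction(3, 10)   # illustrative value
ratios = {n: var_V(n, p) / cr_bound(n, p) for n in (5, 50, 500)}
for n, ratio in ratios.items():
    print(n, float(ratio))
```

The ratio should stay above 1 for every finite \(n\) and move toward 1 as \(n\) grows, which is the sense in which \(V\) is asymptotically but not exactly efficient.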
3) Admissibility & Minimax
07 Admissibility: A Sketch Exercise
Suppose there exist exactly three estimators \(T_1\), \(T_2\), and \(T_3\) for a parameter \(\theta \in [0, 1]\).
- Sketch an example of the MSE curves \(\mathrm{MSE}(T_i, \theta)\) as functions of \(\theta\) such that \(T_1\) and \(T_2\) are admissible, but \(T_3\) is not admissible. Explain why your sketch works.
- Now sketch (possibly different) risk functions for \(T_1\), \(T_2\), and \(T_3\) such that \(T_1\) is the minimax estimator. Must \(T_1\) have the lowest MSE everywhere?
a) Plot \(\theta\) on the horizontal axis (over \([0, 1]\)) and MSE on the vertical axis. A working sketch:
- \(\mathrm{MSE}(T_1, \theta)\): a curve that is low near \(\theta = 0\) and rises monotonically toward \(\theta = 1\).
- \(\mathrm{MSE}(T_2, \theta)\): a curve that is low near \(\theta = 1\) and rises monotonically toward \(\theta = 0\) (mirror of \(T_1\)). The two curves cross somewhere in \((0, 1)\).
- \(\mathrm{MSE}(T_3, \theta)\): a curve that lies strictly above both \(\mathrm{MSE}(T_1, \theta)\) and \(\mathrm{MSE}(T_2, \theta)\) for every \(\theta \in [0, 1]\).
Why it works. Recall: \(T'\) dominates \(T\) iff \(\mathrm{MSE}(T', \theta) \le \mathrm{MSE}(T, \theta)\) for all \(\theta\) with strict inequality somewhere; \(T\) is admissible iff no \(T'\) dominates it.
- \(T_1\) admissible: \(T_2\) doesn’t dominate it (since \(T_1\) beats \(T_2\) at small \(\theta\)); \(T_3\) is uniformly worse, so certainly doesn’t dominate. No estimator dominates \(T_1\).
- \(T_2\) admissible: symmetric. \(T_2\) beats \(T_1\) on the right (near \(\theta = 1\)), so \(T_1\) doesn’t dominate it; \(T_3\) is uniformly worse, so it doesn’t dominate either.
- \(T_3\) inadmissible: \(T_1\) dominates \(T_3\) by construction (and so does \(T_2\)).
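The dominance argument can be mimicked on a grid of \(\theta\) values. The three MSE curves below are hypothetical shapes chosen to match the sketch, not risks of real estimators, and a grid check only approximates the "for all \(\theta\)" condition.

```python
def dominates(risk_a, risk_b, grid, tol=1e-12):
    """True if a is never worse than b on the grid and strictly better somewhere."""
    never_worse = all(risk_a(t) <= risk_b(t) + tol for t in grid)
    strictly_better = any(risk_a(t) < risk_b(t) - tol for t in grid)
    return never_worse and strictly_better

# Hypothetical MSE curves on [0, 1] (assumed shapes for illustration only).
risks = {
    "T1": lambda t: 0.1 + t**2,          # low near 0, rising toward 1
    "T2": lambda t: 0.1 + (1 - t) ** 2,  # mirror image of T1
    "T3": lambda t: 1.5,                 # strictly above both curves
}
grid = [i / 100 for i in range(101)]

# An estimator is admissible (within this set of three) if no other one dominates it.
admissible = {
    name: not any(
        dominates(risks[other], risks[name], grid)
        for other in risks
        if other != name
    )
    for name in risks
}
print(admissible)
```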
b) Minimax means smallest worst-case MSE: \(T_1\) minimizes \(\sup_{\theta} \mathrm{MSE}(T, \theta)\). A sketch:
- \(\mathrm{MSE}(T_1, \theta)\): roughly flat at a moderate level \(M\) across all \(\theta\).
- \(\mathrm{MSE}(T_2, \theta)\): very low for \(\theta\) near \(0\) but spikes well above \(M\) as \(\theta \to 1\).
- \(\mathrm{MSE}(T_3, \theta)\): very low for \(\theta\) near \(1\) but spikes well above \(M\) as \(\theta \to 0\).
Then \(\sup_\theta \mathrm{MSE}(T_1, \theta) = M\), while both \(\sup_\theta \mathrm{MSE}(T_2, \theta)\) and \(\sup_\theta \mathrm{MSE}(T_3, \theta)\) exceed \(M\). So \(T_1\) is minimax.
No: \(T_1\) need not have the lowest MSE everywhere. In this sketch, \(T_2\) beats \(T_1\) for small \(\theta\) and \(T_3\) beats it for large \(\theta\). Minimaxity concerns only worst-case behavior, not pointwise dominance.
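The same grid idea illustrates part (b): compute each estimator's worst-case MSE and pick the smallest. The curves and the level \(M = 0.5\) are again hypothetical, chosen only to reproduce the shapes described above.

```python
# Hypothetical MSE curves on [0, 1] (assumed shapes for illustration only).
risks = {
    "T1": lambda t: 0.5,                        # flat at M = 0.5
    "T2": lambda t: 0.05 + 2.0 * t**2,          # very low near 0, spikes near 1
    "T3": lambda t: 0.05 + 2.0 * (1 - t) ** 2,  # very low near 1, spikes near 0
}
grid = [i / 100 for i in range(101)]

# Worst-case (sup over the grid) MSE of each estimator.
worst_case = {name: max(risk(t) for t in grid) for name, risk in risks.items()}
minimax = min(worst_case, key=worst_case.get)

# Grid points where some other estimator has strictly lower MSE than T1.
beaten_at = [t for t in grid if min(risks["T2"](t), risks["T3"](t)) < risks["T1"](t)]

print(worst_case)
print(minimax, len(beaten_at) > 0)
```

Here \(T_1\) has the smallest worst-case MSE even though it is beaten pointwise near both ends of the interval.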