22: Statistics — Estimator Properties, Fisher Info, Cramér-Rao

📚 Materials


Lecture

Practice

🏡 Homework


1) Exponential Family & Sufficiency

01 Poisson Meets the Exponential Family

Let \(X\) be a random variable following a Poisson distribution with parameter \(\lambda > 0\), i.e., \(P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\) for \(k = 0, 1, 2, \ldots\)

    a) Show that the Poisson distribution belongs to the exponential family by writing its PMF in the form \[f(x \mid \lambda) = h(x) \exp\!\big(\eta(\lambda)\, T(x) - A(\lambda)\big).\] Identify \(h(x)\), \(\eta(\lambda)\), \(T(x)\), and \(A(\lambda)\).
    b) Using the exponential family form, what is the sufficient statistic for \(\lambda\) based on an i.i.d. sample \(X_1, \ldots, X_n\)?

a) Rewrite the PMF by exponentiating the log:

\[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{1}{k!} \exp\!\big(k \log \lambda - \lambda\big).\]

Matching \(h(x) \exp(\eta(\lambda) T(x) - A(\lambda))\):

\[h(x) = \frac{1}{x!}, \qquad \eta(\lambda) = \log \lambda, \qquad T(x) = x, \qquad A(\lambda) = \lambda.\]

So Poisson is a one-parameter exponential family with natural parameter \(\log \lambda\).

b) For an i.i.d. sample, the joint PMF factors as

\[\prod_{i=1}^n \frac{1}{x_i!} \exp\!\Big(\log \lambda \cdot \sum_i x_i - n\lambda\Big).\]

By the Fisher-Neyman factorization theorem, the sufficient statistic is \(\boxed{T(\mathbf{X}) = \sum_{i=1}^n X_i}\).
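As a numerical sanity check (a sketch, not part of the original solution), the snippet below compares the usual Poisson PMF with the exponential-family form from part (a), and illustrates part (b): two samples with the same sum have a joint-PMF ratio that does not change with \(\lambda\). The function names and the example samples are made up for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Standard Poisson PMF: lam^k * exp(-lam) / k!."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def poisson_expfam(k, lam):
    """Same PMF written as h(x) * exp(eta(lam) * T(x) - A(lam))."""
    h = 1.0 / math.factorial(k)  # h(x) = 1/x!
    eta = math.log(lam)          # natural parameter log(lam)
    t = k                        # T(x) = x
    a = lam                      # A(lam) = lam
    return h * math.exp(eta * t - a)

def joint_ratio(xs, ys, lam):
    """Ratio of the joint PMFs of two samples at the same lam."""
    num = math.prod(poisson_pmf(x, lam) for x in xs)
    den = math.prod(poisson_pmf(y, lam) for y in ys)
    return num / den

# Illustrative samples with the same sum (6) but different values.
xs, ys = [1, 2, 3], [0, 0, 6]
for lam in (0.5, 1.0, 3.0):
    print(lam, poisson_pmf(2, lam), poisson_expfam(2, lam), joint_ratio(xs, ys, lam))
```

The ratio depends only on the factorial terms \(\prod_i y_i! / \prod_i x_i!\), which is what sufficiency of \(\sum_i X_i\) predicts.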

02 Slit Width Estimation

In an experiment, \(n\) drops of solution are released uniformly through a slit onto a surface. We model the one-dimensional impact points \(X_1, \ldots, X_n\) as i.i.d. \(\mathrm{Uniform}(0, d)\), where the unknown slit width \(d > 0\) is to be estimated.

    a) Write down the joint density \(f(\mathbf{x} \mid d)\) for the sample.
    b) Using the Fisher–Neyman factorization theorem, show that \(X_{(n)} = \max\{X_1, \ldots, X_n\}\) is sufficient for \(d\).
    c) Is \(X_{(n)}\) unbiased for \(d\)? If not, find an unbiased estimator based on \(X_{(n)}\).

Hint for (c): the CDF of \(X_{(n)}\) is \(F_{X_{(n)}}(x) = (x/d)^n\) for \(0 \le x \le d\).

a) Each \(X_i\) has density \(\frac{1}{d} \mathbf{1}\{0 \le x_i \le d\}\), so

\[f(\mathbf{x} \mid d) = d^{-n} \prod_{i=1}^n \mathbf{1}\{0 \le x_i \le d\} = d^{-n} \cdot \mathbf{1}\{x_{(1)} \ge 0\} \cdot \mathbf{1}\{x_{(n)} \le d\}.\]

b) Split the joint density into a \(d\)-dependent piece and a data-only piece:

\[f(\mathbf{x} \mid d) = \underbrace{d^{-n} \mathbf{1}\{x_{(n)} \le d\}}_{g(T(\mathbf{x}), d)} \cdot \underbrace{\mathbf{1}\{x_{(1)} \ge 0\}}_{h(\mathbf{x})}.\]

The \(d\)-dependent factor depends on the data only through \(T(\mathbf{x}) = x_{(n)}\). By the Fisher-Neyman factorization theorem, \(X_{(n)}\) is sufficient for \(d\).

c) Differentiating the CDF from the hint gives the density \(f_{X_{(n)}}(x) = n x^{n-1}/d^n\) for \(0 \le x \le d\). Then

\[\mathbb{E}[X_{(n)}] = \int_0^d x \cdot \frac{n x^{n-1}}{d^n} \, dx = \frac{n}{d^n} \cdot \frac{d^{n+1}}{n+1} = \frac{n}{n+1} d.\]

So \(X_{(n)}\) is biased (it underestimates \(d\)). The unbiased estimator is

\[\boxed{\hat{d} = \frac{n+1}{n} X_{(n)}}.\]
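A small Monte Carlo sketch can make the bias visible. It is only an illustration under assumed values (\(d = 2\), \(n = 5\), a fixed seed), not part of the original solution: the average of \(X_{(n)}\) should land near \(\frac{n}{n+1} d\), and the rescaled estimator near \(d\).

```python
import random

def simulate_max_estimators(d, n, trials=20_000, seed=0):
    """Average the sample maximum and its bias-corrected version over many samples."""
    rng = random.Random(seed)
    total_max = 0.0
    for _ in range(trials):
        # One sample of n impact points from Uniform(0, d); keep only its maximum.
        total_max += max(rng.uniform(0.0, d) for _ in range(n))
    mean_max = total_max / trials
    mean_corrected = (n + 1) / n * mean_max  # average of (n+1)/n * X_(n)
    return mean_max, mean_corrected

d, n = 2.0, 5  # assumed values for the illustration
mean_max, mean_corrected = simulate_max_estimators(d, n)
print("average of X_(n):          ", mean_max, "theory:", n / (n + 1) * d)
print("average of (n+1)/n * X_(n):", mean_corrected, "theory:", d)
```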

03 Normal Variance: Minimal Sufficiency

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)\) where \(\sigma^2 > 0\) is unknown but \(\mu\) is known.

    a) Show that \(T(\mathbf{X}) = \sum_{i=1}^{n}(X_i - \mu)^2\) is sufficient for \(\sigma^2\) using the factorization theorem.
    b) Using the likelihood ratio criterion, show that \(T(\mathbf{X})\) is minimal sufficient for \(\sigma^2\).

Recall: \(T\) is minimal sufficient if, for all sample points \(\mathbf{x}\) and \(\mathbf{y}\), the ratio \(\frac{f(\mathbf{x} \mid \sigma^2)}{f(\mathbf{y} \mid \sigma^2)}\) is free of \(\sigma^2\) \(\Longleftrightarrow\) \(T(\mathbf{x}) = T(\mathbf{y})\).

a) The joint density is

\[f(\mathbf{x} \mid \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right).\]

Write this as

\[f(\mathbf{x} \mid \sigma^2) = \underbrace{(2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{T(\mathbf{x})}{2\sigma^2}\right)}_{g(T(\mathbf{x}), \sigma^2)} \cdot \underbrace{1}_{h(\mathbf{x})}\]

with \(T(\mathbf{x}) = \sum_i (x_i - \mu)^2\) (recall \(\mu\) is known, so \(T\) is a function of the data alone). By the Fisher-Neyman factorization theorem, \(T(\mathbf{X})\) is sufficient for \(\sigma^2\).

b) Form the likelihood ratio:

\[\frac{f(\mathbf{x} \mid \sigma^2)}{f(\mathbf{y} \mid \sigma^2)} = \exp\!\left(-\frac{T(\mathbf{x}) - T(\mathbf{y})}{2\sigma^2}\right).\]

This ratio is free of \(\sigma^2\) if and only if \(T(\mathbf{x}) - T(\mathbf{y}) = 0\), i.e., \(T(\mathbf{x}) = T(\mathbf{y})\). By the ratio characterization of minimal sufficiency, \(\boxed{T(\mathbf{X}) = \sum_i (X_i - \mu)^2}\) is minimal sufficient for \(\sigma^2\).
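The ratio criterion can be checked numerically. The sketch below uses made-up samples with known \(\mu = 1\): two samples with equal \(T\) give a ratio of 1 for every \(\sigma^2\), while a sample with a different \(T\) gives a ratio that moves with \(\sigma^2\). It is an illustration only, not a proof.

```python
import math

def normal_joint_density(xs, mu, sigma2):
    """Joint N(mu, sigma2) density of an i.i.d. sample, written through T(x)."""
    n = len(xs)
    t = sum((x - mu) ** 2 for x in xs)  # T(x) = sum of squared deviations from mu
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(-t / (2 * sigma2))

def likelihood_ratio(xs, ys, mu, sigma2):
    """f(x | sigma2) / f(y | sigma2) for two samples of the same size."""
    return normal_joint_density(xs, mu, sigma2) / normal_joint_density(ys, mu, sigma2)

mu = 1.0               # known mean (assumed value)
x = [0.0, 3.0, 1.0]    # T = 1 + 4 + 0 = 5
y = [3.0, 2.0, 1.0]    # T = 4 + 1 + 0 = 5, same T as x
z = [1.0, 1.0, 2.0]    # T = 0 + 0 + 1 = 1, different T

for sigma2 in (0.5, 1.0, 4.0):
    print(sigma2, likelihood_ratio(x, y, mu, sigma2), likelihood_ratio(x, z, mu, sigma2))
```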

04 Binomial Sufficiency and Estimating \(\pi^2\)

Let \(\mathbf{X} = (X_1, \ldots, X_n)^\top\) with \(X_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\pi)\), where \(\pi \in (0, 1)\). Define \(U(\mathbf{X}) = \sum_{i=1}^{n} X_i\).

    a) Show that \(U(\mathbf{X})/n\) is unbiased for \(\pi\).
    b) Show that \(U(\mathbf{X})\) is minimal sufficient for \(\pi\).
    c) Now consider the estimator for \(\pi^2\): \[V(\mathbf{X}) = \frac{U(\mathbf{X})\left[U(\mathbf{X}) - 1\right]}{n(n-1)}.\] Verify that \(V(\mathbf{X})\) is unbiased for \(\pi^2\).

Hint for (c): expand \(\mathbb{E}[U(U-1)]\) using \(\mathbb{E}[U] = n\pi\) and \(\operatorname{Var}(U) = n\pi(1-\pi)\).

a) Each \(X_i\) has \(\mathbb{E}[X_i] = \pi\), so

\[\mathbb{E}\!\left[\frac{U(\mathbf{X})}{n}\right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] = \pi.\]

b) The joint PMF is

\[f(\mathbf{x} \mid \pi) = \prod_{i=1}^n \pi^{x_i}(1-\pi)^{1-x_i} = \pi^{u}(1-\pi)^{n-u}, \qquad u = \sum_i x_i.\]

This depends on the data only through \(u = U(\mathbf{x})\), so by Fisher-Neyman, \(U\) is sufficient. For minimality, the ratio

\[\frac{f(\mathbf{x} \mid \pi)}{f(\mathbf{y} \mid \pi)} = \left(\frac{\pi}{1-\pi}\right)^{U(\mathbf{x}) - U(\mathbf{y})}\]

is free of \(\pi\) iff \(U(\mathbf{x}) = U(\mathbf{y})\). So \(\boxed{U(\mathbf{X})}\) is minimal sufficient for \(\pi\).

c) Using \(\operatorname{Var}(U) = \mathbb{E}[U^2] - \mathbb{E}[U]^2\), we get \(\mathbb{E}[U^2] = n\pi(1-\pi) + (n\pi)^2\). Then

\[\mathbb{E}[U(U-1)] = \mathbb{E}[U^2] - \mathbb{E}[U] = n\pi(1-\pi) + n^2\pi^2 - n\pi = n^2\pi^2 - n\pi^2 = n(n-1)\pi^2.\]

Therefore

\[\mathbb{E}[V(\mathbf{X})] = \frac{n(n-1)\pi^2}{n(n-1)} = \pi^2,\]

so \(V\) is unbiased for \(\pi^2\).
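Because \(U \sim \mathrm{Binomial}(n, \pi)\), the expectation of \(V\) can also be computed exactly by summing over the binomial PMF. The sketch below does this for assumed values (\(n = 10\), \(\pi = 0.3\), written `p` in the code) and, for contrast, evaluates the plug-in estimator \((U/n)^2\), which is biased upward by \(\pi(1-\pi)/n\).

```python
from math import comb

def binom_pmf(u, n, p):
    """P(U = u) for U ~ Binomial(n, p)."""
    return comb(n, u) * p**u * (1 - p) ** (n - u)

def expected_v(n, p):
    """Exact E[U(U-1) / (n(n-1))] by summing over all values of U."""
    return sum(u * (u - 1) / (n * (n - 1)) * binom_pmf(u, n, p) for u in range(n + 1))

def expected_plugin(n, p):
    """Exact E[(U/n)^2], the naive plug-in estimator of p^2."""
    return sum((u / n) ** 2 * binom_pmf(u, n, p) for u in range(n + 1))

n, p = 10, 0.3  # assumed values for the illustration
print("target p^2:     ", p**2)
print("E[V]:           ", expected_v(n, p))
print("E[(U/n)^2]:     ", expected_plugin(n, p))
print("p^2 + p(1-p)/n: ", p**2 + p * (1 - p) / n)
```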


2) Fisher Information & Cramér–Rao

05 Fisher Information for the Exponential

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(\lambda)\) with density \(f(x \mid \lambda) = \lambda e^{-\lambda x}\) for \(x \geq 0\).

    a) Compute the score function \(s(\lambda) = \frac{\partial}{\partial \lambda} \log f(X \mid \lambda)\).
    b) Verify that \(\mathbb{E}[s(\lambda)] = 0\).
    c) Compute the Fisher information \(I(\lambda) = \operatorname{Var}[s(\lambda)]\).
    d) Verify your answer in (c) by computing \(I(\lambda)\) via the second-derivative formula: \(I(\lambda) = -\mathbb{E}\!\left[\frac{\partial^2}{\partial\lambda^2}\log f(X \mid \lambda)\right]\).
    e) The MLE for \(\lambda\) is \(\hat{\lambda} = 1/\bar{X}\), which has \(\operatorname{Var}(\hat{\lambda}) \approx \lambda^2/n\) for large \(n\). Compare this with the Cramér–Rao lower bound. Is \(\hat{\lambda}\) asymptotically efficient?

a) \(\log f(x \mid \lambda) = \log \lambda - \lambda x\). Differentiating with respect to \(\lambda\):

\[s(\lambda) = \frac{\partial}{\partial \lambda} \log f(X \mid \lambda) = \frac{1}{\lambda} - X.\]

b) Since \(X \sim \mathrm{Exp}(\lambda)\) has \(\mathbb{E}[X] = 1/\lambda\),

\[\mathbb{E}[s(\lambda)] = \frac{1}{\lambda} - \mathbb{E}[X] = \frac{1}{\lambda} - \frac{1}{\lambda} = 0.\]

c) Because \(\mathbb{E}[s] = 0\), \(I(\lambda) = \operatorname{Var}[s(\lambda)] = \operatorname{Var}(1/\lambda - X) = \operatorname{Var}(X) = 1/\lambda^2\). So

\[\boxed{I(\lambda) = \frac{1}{\lambda^2}}.\]

d) From \(s(\lambda) = 1/\lambda - X\),

\[\frac{\partial^2}{\partial \lambda^2} \log f(X \mid \lambda) = -\frac{1}{\lambda^2}.\]

Taking \(-\mathbb{E}\) of a constant gives \(I(\lambda) = 1/\lambda^2\), matching part (c).

e) The Cramér-Rao lower bound for unbiased estimators of \(\lambda\) based on \(n\) i.i.d. observations is

\[\operatorname{Var}(\hat{\lambda}) \ge \frac{1}{n \cdot I(\lambda)} = \frac{\lambda^2}{n}.\]

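The MLE has \(\operatorname{Var}(\hat{\lambda}) \approx \lambda^2/n\), which matches the bound asymptotically. So \(\hat{\lambda} = 1/\bar{X}\) is asymptotically efficient. (For finite \(n\) the MLE is biased, since \(\mathbb{E}[1/\bar{X}] > 1/\mathbb{E}[\bar{X}] = \lambda\) by Jensen's inequality, so the bound for unbiased estimators is only a large-sample benchmark here.)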
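Parts (b), (c), and (e) can be checked by simulation. The sketch below is an illustration under assumed values (\(\lambda = 2\), \(n = 200\), fixed seeds): it estimates the mean and variance of the score, and the variance of the MLE across repeated samples, for comparison with \(1/\lambda^2\) and \(\lambda^2/n\). It relies on `random.expovariate`, whose argument is the rate \(\lambda\).

```python
import random

def score(x, lam):
    """Score of one Exp(lam) observation: d/d(lam) of (log(lam) - lam * x)."""
    return 1.0 / lam - x

def score_moments(lam, draws=100_000, seed=1):
    """Monte Carlo estimates of E[s(lam)] and Var[s(lam)]."""
    rng = random.Random(seed)
    scores = [score(rng.expovariate(lam), lam) for _ in range(draws)]
    mean = sum(scores) / draws
    var = sum((s - mean) ** 2 for s in scores) / draws
    return mean, var

def mle_variance(lam, n, trials=4_000, seed=2):
    """Monte Carlo variance of the MLE 1 / mean(X) over repeated samples of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        total = sum(rng.expovariate(lam) for _ in range(n))
        estimates.append(n / total)  # 1 / sample mean
    mean = sum(estimates) / trials
    return sum((e - mean) ** 2 for e in estimates) / trials

lam, n = 2.0, 200  # assumed values for the illustration
mean_s, var_s = score_moments(lam)
var_mle = mle_variance(lam, n)
print("E[s] estimate:", mean_s, "(theory 0)")
print("Var[s] estimate:", var_s, "(theory 1/lam^2 =", 1 / lam**2, ")")
print("Var(MLE) estimate:", var_mle, "(CR bound lam^2/n =", lam**2 / n, ")")
```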

06 Cramér–Rao: When Can We Beat \(1/n\)?

Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(p)\).

    a) We know \(I(p) = \frac{1}{p(1-p)}\). Write down the Cramér–Rao lower bound for any unbiased estimator of \(p\).
    b) The sample proportion \(\hat{p} = \bar{X}\) has \(\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{n}\). Is \(\hat{p}\) efficient?
    c) Now consider estimating \(g(p) = p^2\) instead of \(p\). The Cramér–Rao bound for unbiased estimators of \(g(\theta)\) is \[\operatorname{Var}(\hat{g}) \geq \frac{[g'(\theta)]^2}{n \cdot I(\theta)}.\] Compute the CR bound for unbiased estimators of \(p^2\).
    d) We showed in Problem 04(c) that \(V(\mathbf{X}) = \frac{U(U-1)}{n(n-1)}\) is unbiased for \(p^2\). Compute \(\operatorname{Var}(V)\) (at least for large \(n\)). Does it achieve the CR bound?

a) With \(n\) i.i.d. observations,

\[\operatorname{Var}(\hat{p}) \ge \frac{1}{n \cdot I(p)} = \boxed{\frac{p(1-p)}{n}}.\]

b) \(\operatorname{Var}(\hat{p}) = p(1-p)/n\) exactly equals the CR bound, so \(\hat{p}\) is efficient (uniformly in \(p\), not just asymptotically).

c) Here \(g(p) = p^2\), so \(g'(p) = 2p\). The CR bound for unbiased estimators of \(p^2\) is

\[\operatorname{Var}(\hat{g}) \ge \frac{[g'(p)]^2}{n \cdot I(p)} = \frac{4p^2 \cdot p(1-p)}{n} = \boxed{\frac{4p^3(1-p)}{n}}.\]

d) Note \(V = \frac{U(U-1)}{n(n-1)} = \frac{n}{n-1} \hat{p}^2 - \frac{1}{n-1}\hat{p}\). For large \(n\), \(V \approx \hat{p}^2\).

To get \(\operatorname{Var}(\hat{p}^2)\) for large \(n\), use a first-order Taylor expansion of \(\phi(p) = p^2\) around the true \(p\). Since \(\hat{p}\) concentrates near \(p\) (its variance is \(O(1/n)\)), we can write

\[\hat{p}^2 \approx p^2 + 2p \cdot (\hat{p} - p),\]

so the random part of \(\hat{p}^2\) is approximately \(2p \cdot (\hat{p} - p)\). Therefore

\[\operatorname{Var}(\hat{p}^2) \approx (2p)^2 \cdot \operatorname{Var}(\hat{p}) = 4p^2 \cdot \frac{p(1-p)}{n} = \frac{4p^3(1-p)}{n}.\]

This matches the Cramér-Rao bound from (c) asymptotically, so \(V\) is asymptotically efficient for \(p^2\). (For finite \(n\), \(V\) does not exactly achieve the bound.)
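The finite-sample gap can be seen without any approximation by computing \(\operatorname{Var}(V)\) exactly from the binomial PMF. The sketch below (assumed \(p = 0.3\) and a few sample sizes, chosen small enough to avoid floating-point overflow in the binomial coefficients) prints the ratio of \(\operatorname{Var}(V)\) to the CR bound from (c); it should stay above 1 and approach 1 as \(n\) grows.

```python
from math import comb

def binom_pmf(u, n, p):
    """P(U = u) for U ~ Binomial(n, p)."""
    return comb(n, u) * p**u * (1 - p) ** (n - u)

def var_v(n, p):
    """Exact variance of V = U(U-1) / (n(n-1)) by summing over all values of U."""
    values = [u * (u - 1) / (n * (n - 1)) for u in range(n + 1)]
    weights = [binom_pmf(u, n, p) for u in range(n + 1)]
    mean = sum(v * w for v, w in zip(values, weights))
    return sum((v - mean) ** 2 * w for v, w in zip(values, weights))

def cr_bound_p2(n, p):
    """Cramér-Rao bound for unbiased estimators of p^2: 4 p^3 (1 - p) / n."""
    return 4 * p**3 * (1 - p) / n

p = 0.3  # assumed value for the illustration
for n in (5, 50, 500):
    print(n, var_v(n, p), cr_bound_p2(n, p), var_v(n, p) / cr_bound_p2(n, p))
```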


3) Admissibility & Minimax

07 Admissibility: A Sketch Exercise

Suppose there exist exactly three estimators \(T_1\), \(T_2\), and \(T_3\) for a parameter \(\theta \in [0, 1]\).

    a) Sketch an example of the MSE curves \(\mathrm{MSE}(T_i, \theta)\) as functions of \(\theta\) such that \(T_1\) and \(T_2\) are admissible, but \(T_3\) is not admissible. Explain why your sketch works.
    b) Now sketch (possibly different) risk functions for \(T_1\), \(T_2\), and \(T_3\) such that \(T_1\) is the minimax estimator. Must \(T_1\) have the lowest MSE everywhere?

a) Plot \(\theta\) on the horizontal axis (over \([0, 1]\)) and MSE on the vertical axis. A working sketch:

  • \(\mathrm{MSE}(T_1, \theta)\): a curve that is low near \(\theta = 0\) and rises monotonically toward \(\theta = 1\).
  • \(\mathrm{MSE}(T_2, \theta)\): a curve that is low near \(\theta = 1\) and rises monotonically toward \(\theta = 0\) (mirror of \(T_1\)). The two curves cross somewhere in \((0, 1)\).
  • \(\mathrm{MSE}(T_3, \theta)\): a curve that lies strictly above both \(\mathrm{MSE}(T_1, \theta)\) and \(\mathrm{MSE}(T_2, \theta)\) for every \(\theta \in [0, 1]\).

Why it works. Recall: \(T'\) dominates \(T\) iff \(\mathrm{MSE}(T', \theta) \le \mathrm{MSE}(T, \theta)\) for all \(\theta\) with strict inequality somewhere; \(T\) is admissible iff no \(T'\) dominates it.

  • \(T_1\) admissible: \(T_2\) doesn’t dominate it (since \(T_1\) beats \(T_2\) at small \(\theta\)); \(T_3\) is uniformly worse, so certainly doesn’t dominate. No estimator dominates \(T_1\).
  • \(T_2\) admissible: symmetric. \(T_2\) beats \(T_1\) near \(\theta = 1\), so \(T_1\) doesn’t dominate it; \(T_3\) is uniformly worse, so it doesn’t dominate either.
  • \(T_3\) inadmissible: \(T_1\) dominates \(T_3\) by construction (and so does \(T_2\)).
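The domination argument can be made concrete with hypothetical risk curves evaluated on a grid of \(\theta\) values. The linear curves below are invented for illustration and only mimic the shapes described in the sketch; the check is on the grid, not for every \(\theta\).

```python
def dominates(risk_a, risk_b):
    """True if a is never worse than b on the grid and strictly better somewhere."""
    never_worse = all(a <= b for a, b in zip(risk_a, risk_b))
    somewhere_better = any(a < b for a, b in zip(risk_a, risk_b))
    return never_worse and somewhere_better

def is_admissible(name, risks):
    """True if no other estimator in the dictionary dominates the named one."""
    return not any(
        dominates(other_risk, risks[name])
        for other, other_risk in risks.items()
        if other != name
    )

grid = [i / 100 for i in range(101)]  # theta in [0, 1]
risks = {
    "T1": [0.1 + 0.5 * t for t in grid],        # low near 0, rising toward 1
    "T2": [0.1 + 0.5 * (1 - t) for t in grid],  # mirror image of T1
    "T3": [0.7 + 0.1 * t for t in grid],        # strictly above both curves
}

for name in risks:
    print(name, "admissible" if is_admissible(name, risks) else "inadmissible")
```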

b) Minimax means smallest worst-case MSE: \(T_1\) minimizes \(\sup_{\theta} \mathrm{MSE}(T, \theta)\). A sketch:

  • \(\mathrm{MSE}(T_1, \theta)\): roughly flat at a moderate level \(M\) across all \(\theta\).
  • \(\mathrm{MSE}(T_2, \theta)\): very low for \(\theta\) near \(0\) but spikes well above \(M\) as \(\theta \to 1\).
  • \(\mathrm{MSE}(T_3, \theta)\): very low for \(\theta\) near \(1\) but spikes well above \(M\) as \(\theta \to 0\).

Then \(\sup_\theta \mathrm{MSE}(T_1, \theta) = M\), while both \(\sup_\theta \mathrm{MSE}(T_2, \theta)\) and \(\sup_\theta \mathrm{MSE}(T_3, \theta)\) exceed \(M\). So \(T_1\) is minimax.

No: \(T_1\) need not have the lowest MSE everywhere. In fact, in this sketch \(T_2\) beats \(T_1\) for small \(\theta\) and \(T_3\) beats it for large \(\theta\). Minimaxity is purely about worst-case behavior, not pointwise dominance.
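The same idea in code, again with invented risk curves on a grid (a flat curve at an assumed level \(M = 0.3\) and two quadratic curves that spike at opposite ends): the minimax estimator is the one with the smallest maximum risk, even though it is beaten pointwise near both endpoints.

```python
grid = [i / 100 for i in range(101)]  # theta in [0, 1]
M = 0.3                               # assumed flat risk level for T1

risks = {
    "T1": [M for _ in grid],                        # flat at M
    "T2": [0.05 + 0.9 * t**2 for t in grid],        # low near 0, spikes near 1
    "T3": [0.05 + 0.9 * (1 - t) ** 2 for t in grid],  # low near 1, spikes near 0
}

# Worst-case (sup over the grid) risk of each estimator.
worst_case = {name: max(curve) for name, curve in risks.items()}
minimax_name = min(worst_case, key=worst_case.get)

print("worst-case risks:", worst_case)
print("minimax estimator:", minimax_name)
print("T2 beats T1 at theta = 0:", risks["T2"][0] < risks["T1"][0])
print("T3 beats T1 at theta = 1:", risks["T3"][-1] < risks["T1"][-1])
```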
