29: Information Theory - MLE, MaxEnt & Mutual Information

📚 Նյութը

Դասախոսություն

🏡 Տնային


01 ✏️ XOR: When Single Features Are Useless

Let \(X, Y\) be independent fair coins (\(\text{Bern}(0.5)\)), and define \(Z = X \oplus Y\) (XOR: \(Z = 0\) if \(X = Y\), else \(Z = 1\)).

a) Compute the mutual information \(I(X; Z)\) and \(I(Y; Z)\). (Hint: \(I(X;Z) = H(Z) - H(Z \mid X)\).)

b) Compute the joint mutual information \(I(X, Y; Z)\).

c) The shock: \(I(X; Z) = I(Y; Z) = 0\) but \(I(X, Y; Z) = 1\) bit. Explain in plain language. What does it imply for a feature-selection method that scores features one at a time?

d) Now add noise: \(Z = X \oplus Y \oplus N\) with \(N \sim \text{Bern}(0.1)\) independent. Compute \(I(X; Z)\) and the conditional mutual information \(I(X; Z \mid Y)\). Why does conditioning on \(Y\) make \(X\) informative about \(Z\)?

e) (ML tie-in) Greedy, one-feature-at-a-time selection completely misses the XOR signal. Name one model family that captures it anyway, and say in a few words why.

(a) \(Z\) is marginally fair: \(P(Z{=}0) = P(X{=}Y) = \tfrac12\), so \(H(Z) = 1\) bit. Given \(X = x\), \(Z = x \oplus Y\) is still a fair coin (it’s just \(Y\) or \(1-Y\)), so \(H(Z \mid X) = 1\) bit. Therefore \[I(X; Z) = H(Z) - H(Z \mid X) = 1 - 1 = \mathbf{0}, \qquad I(Y; Z) = 0 \text{ by symmetry.}\]

(b) Once both \(X\) and \(Y\) are known, \(Z = X \oplus Y\) is determined, so \(H(Z \mid X, Y) = 0\) and \[I(X, Y; Z) = H(Z) - H(Z \mid X, Y) = 1 - 0 = \mathbf{1 \text{ bit}}.\]

(c) Each feature alone tells you nothing about \(Z\), yet together they determine it completely. The information lives entirely in the interaction of \(X\) and \(Y\), not in either marginally. A method that ranks features one at a time (correlation filters, single-feature information gain, even Lasso) scores both \(X\) and \(Y\) as useless and discards them — missing the signal entirely.

(d) Even with noise, \(X\) alone is uninformative: given \(X = x\), \(Z = x \oplus Y \oplus N\) where \(Y\) is fair, so \(Z\) is still fair ⇒ \(I(X; Z) = 0\). But conditioning on \(Y\): \[I(X; Z \mid Y) = H(Z \mid Y) - H(Z \mid X, Y).\] Given \(Y\), \(Z = X \oplus y \oplus N\) is still fair (\(X\) fair) ⇒ \(H(Z \mid Y) = 1\). Given both \(X\) and \(Y\), \(Z = (X{\oplus}Y) \oplus N\) depends only on the noise ⇒ \(H(Z \mid X, Y) = H(N) = h(0.1) \approx 0.469\). So \[I(X; Z \mid Y) = 1 - 0.469 = \mathbf{0.531 \text{ bits}}.\] Knowing \(Y\) “unlocks” \(X\): once \(Y\) is fixed, \(X\) flips \(Z\) deterministically (up to the noise), so \(X\) becomes highly informative.

(e) Decision trees (or neural networks). A tree can split on \(X\), then split each branch on \(Y\), recovering \(Z\) exactly; a neural net’s hidden layer can compute the compositional feature \(X \oplus Y\) before the final classifier sees it. Both exploit feature combinations, unlike a one-feature-at-a-time ranker.


More exercises coming soon.

🎲 38 (01) TODO

Flag Counter