29: Information Theory - MLE, MaxEnt & Mutual Information

📚 Նյութը

Դասախոսություն

🏡 Տնային

01 ✏️ XOR: When Single Features Are Useless

Let $X, Y$ be independent fair coins ($\text{Bern}(0.5)$), and define $Z = X \oplus Y$ (XOR: $Z = 0$ if $X = Y$, else $Z = 1$).

a) Compute the mutual information $I(X; Z)$ and $I(Y; Z)$. (Hint: $I(X;Z) = H(Z) - H(Z \mid X)$.)

b) Compute the joint mutual information $I(X, Y; Z)$.

c) The shock: $I(X; Z) = I(Y; Z) = 0$ but $I(X, Y; Z) = 1$ bit. Explain in plain language. What does it imply for a feature-selection method that scores features one at a time?

d) Now add noise: $Z = X \oplus Y \oplus N$ with $N \sim \text{Bern}(0.1)$ independent. Compute $I(X; Z)$ and the conditional mutual information $I(X; Z \mid Y)$. Why does conditioning on $Y$ make $X$ informative about $Z$?

e) (ML tie-in) Greedy, one-feature-at-a-time selection completely misses the XOR signal. Name one model family that captures it anyway, and say in a few words why.

Solution

(a) $Z$ is marginally fair: $P(Z{=}0) = P(X{=}Y) = \tfrac12$, so $H(Z) = 1$ bit. Given $X = x$, $Z = x \oplus Y$ is still a fair coin (it’s just $Y$ or $1-Y$), so $H(Z \mid X) = 1$ bit. Therefore \[I(X; Z) = H(Z) - H(Z \mid X) = 1 - 1 = \mathbf{0}, \qquad I(Y; Z) = 0 \text{ by symmetry.}\]

(b) Once both $X$ and $Y$ are known, $Z = X \oplus Y$ is determined, so $H(Z \mid X, Y) = 0$ and \[I(X, Y; Z) = H(Z) - H(Z \mid X, Y) = 1 - 0 = \mathbf{1 \text{ bit}}.\]

(c) Each feature alone tells you nothing about $Z$, yet together they determine it completely. The information lives entirely in the interaction of $X$ and $Y$, not in either marginally. A method that ranks features one at a time (correlation filters, single-feature information gain, even Lasso) scores both $X$ and $Y$ as useless and discards them — missing the signal entirely.

(d) Even with noise, $X$ alone is uninformative: given $X = x$, $Z = x \oplus Y \oplus N$ where $Y$ is fair, so $Z$ is still fair ⇒ $I(X; Z) = 0$. But conditioning on $Y$: \[I(X; Z \mid Y) = H(Z \mid Y) - H(Z \mid X, Y).\] Given $Y$, $Z = X \oplus y \oplus N$ is still fair ($X$ fair) ⇒ $H(Z \mid Y) = 1$. Given both $X$ and $Y$, $Z = (X{\oplus}Y) \oplus N$ depends only on the noise ⇒ $H(Z \mid X, Y) = H(N) = h(0.1) \approx 0.469$. So \[I(X; Z \mid Y) = 1 - 0.469 = \mathbf{0.531 \text{ bits}}.\] Knowing $Y$ “unlocks” $X$: once $Y$ is fixed, $X$ flips $Z$ deterministically (up to the noise), so $X$ becomes highly informative.

(e) Decision trees (or neural networks). A tree can split on $X$, then split each branch on $Y$, recovering $Z$ exactly; a neural net’s hidden layer can compute the compositional feature $X \oplus Y$ before the final classifier sees it. Both exploit feature combinations, unlike a one-feature-at-a-time ranker.

More exercises coming soon.

🎲 38 (01) TODO

▶️ToDo
🔗Random link
🇦🇲🎶ToDo
🌐🎶ToDo
🤌Կարգին

--- title: "29: Information Theory - MLE, MaxEnt & Mutual Information" format: html: css: homework-styles.css --- <script src="homework-scripts.js"></script> # 📚 Նյութը ## Դասախոսություն - [📺 Դասախոսություն — MLE, MaxEnt, Mutual Information](https://youtu.be/zXwYdYFj6io) - [🎞️ Սլայդեր — info 02](Lectures/info/info_02.pdf), [📝 Notes](Lectures/info/info_02_notes.pdf) # 🏡 Տնային --- ## 01 ✏️ XOR: When Single Features Are Useless {data-difficulty="3"} Let $X, Y$ be independent fair coins ($\text{Bern}(0.5)$), and define $Z = X \oplus Y$ (XOR: $Z = 0$ if $X = Y$, else $Z = 1$). **a)** Compute the mutual information $I(X; Z)$ and $I(Y; Z)$. *(Hint: $I(X;Z) = H(Z) - H(Z \mid X)$.)* **b)** Compute the joint mutual information $I(X, Y; Z)$. **c)** The shock: $I(X; Z) = I(Y; Z) = 0$ but $I(X, Y; Z) = 1$ bit. Explain in plain language. What does it imply for a feature-selection method that scores features **one at a time**? **d)** Now add noise: $Z = X \oplus Y \oplus N$ with $N \sim \text{Bern}(0.1)$ independent. Compute $I(X; Z)$ and the **conditional** mutual information $I(X; Z \mid Y)$. Why does conditioning on $Y$ make $X$ informative about $Z$? **e)** (ML tie-in) Greedy, one-feature-at-a-time selection completely misses the XOR signal. Name one model family that captures it anyway, and say in a few words why. ::: {.callout-tip collapse="true" title="Solution"} **(a)** $Z$ is marginally fair: $P(Z{=}0) = P(X{=}Y) = \tfrac12$, so $H(Z) = 1$ bit. Given $X = x$, $Z = x \oplus Y$ is still a fair coin (it's just $Y$ or $1-Y$), so $H(Z \mid X) = 1$ bit. Therefore $$I(X; Z) = H(Z) - H(Z \mid X) = 1 - 1 = \mathbf{0}, \qquad I(Y; Z) = 0 \text{ by symmetry.}$$ **(b)** Once both $X$ and $Y$ are known, $Z = X \oplus Y$ is determined, so $H(Z \mid X, Y) = 0$ and $$I(X, Y; Z) = H(Z) - H(Z \mid X, Y) = 1 - 0 = \mathbf{1 \text{ bit}}.$$ **(c)** Each feature **alone** tells you nothing about $Z$, yet **together** they determine it completely. The information lives entirely in the *interaction* of $X$ and $Y$, not in either marginally. A method that ranks features one at a time (correlation filters, single-feature information gain, even Lasso) scores both $X$ and $Y$ as useless and discards them — missing the signal entirely. **(d)** Even with noise, $X$ alone is uninformative: given $X = x$, $Z = x \oplus Y \oplus N$ where $Y$ is fair, so $Z$ is still fair ⇒ $I(X; Z) = 0$. But conditioning on $Y$: $$I(X; Z \mid Y) = H(Z \mid Y) - H(Z \mid X, Y).$$ Given $Y$, $Z = X \oplus y \oplus N$ is still fair ($X$ fair) ⇒ $H(Z \mid Y) = 1$. Given **both** $X$ and $Y$, $Z = (X{\oplus}Y) \oplus N$ depends only on the noise ⇒ $H(Z \mid X, Y) = H(N) = h(0.1) \approx 0.469$. So $$I(X; Z \mid Y) = 1 - 0.469 = \mathbf{0.531 \text{ bits}}.$$ Knowing $Y$ "unlocks" $X$: once $Y$ is fixed, $X$ flips $Z$ deterministically (up to the noise), so $X$ becomes highly informative. **(e)** **Decision trees** (or neural networks). A tree can split on $X$, then split each branch on $Y$, recovering $Z$ exactly; a neural net's hidden layer can compute the compositional feature $X \oplus Y$ before the final classifier sees it. Both exploit feature *combinations*, unlike a one-feature-at-a-time ranker. ::: --- *More exercises coming soon.* # 🎲 38 (01) TODO - ▶️[ToDo]() - 🔗[Random link]() - 🇦🇲🎶[ToDo]() - 🌐🎶[ToDo]() - 🤌[Կարգին]() <a href="http://s01.flagcounter.com/more/1oO"><img src="https://s01.flagcounter.com/count2/1oO/bg_FFFFFF/txt_000000/border_CCCCCC/columns_2/maxflags_10/viewers_0/labels_0/pageviews_1/flags_0/percent_0/" alt="Flag Counter"></a>