18: Probability - Covariance & Correlation
📚 Materials
YouTube links in this section were auto-extracted. If you spot a mistake, please let me know!
Lecture
Practice
01 The Units Trap: covariance changes, correlation doesn’t
A researcher measures temperature \(T\) (°C) and ice-cream sales \(S\). They convert \(T\) to Fahrenheit:
\[F = 1.8T + 32.\]
- Using covariance properties, express \(\operatorname{Cov}(F,S)\) in terms of \(\operatorname{Cov}(T,S)\).
- Prove that the correlation is unchanged: \(\rho_{F,S}=\rho_{T,S}\), using the definition of correlation.
- Explain (2–4 sentences) why correlation is “unit-free,” while covariance is not.
a) Covariance under affine transformation.
Use bilinearity plus the fact that adding a constant doesn’t change covariance (\(\operatorname{Cov}(X+c,Y)=\operatorname{Cov}(X,Y)\)):
\[\operatorname{Cov}(F,S)=\operatorname{Cov}(1.8T+32,\,S)=1.8\,\operatorname{Cov}(T,S)+\operatorname{Cov}(32,S)=1.8\,\operatorname{Cov}(T,S)\]
So changing units from °C to °F multiplies the covariance by \(1.8\). The number itself is meaningless without remembering the units.
b) Correlation is unchanged.
By definition \(\rho_{F,S}=\dfrac{\operatorname{Cov}(F,S)}{\sigma_F\,\sigma_S}\). The standard deviation of \(F=1.8T+32\) is \(\sigma_F=|1.8|\,\sigma_T=1.8\,\sigma_T\) (the \(+32\) shift drops out of variance). So:
\[\rho_{F,S}=\frac{1.8\,\operatorname{Cov}(T,S)}{1.8\,\sigma_T\cdot\sigma_S}=\frac{\operatorname{Cov}(T,S)}{\sigma_T\,\sigma_S}=\rho_{T,S}\]
The factor \(1.8\) cancels — same in numerator and denominator.
c) Why correlation is unit-free.
Covariance carries units of “X-units × Y-units” (here, °C × dollars vs °F × dollars), so its numerical value depends on the measurement scale. Correlation divides by \(\sigma_X\sigma_Y\), which carries the same combined units, and the units cancel. The result is a pure number in \([-1,1]\), comparable across studies measured in different units, currencies, or scales. This is the reason correlation became the universal similarity measure — you can compare “height vs weight” and “temperature vs sales” on the same axis.
Sign check. If \(a<0\) in \(Y=aX+b\), the same algebra gives a sign flip: \(\operatorname{Cov}\to a\cdot\operatorname{Cov}\) but \(\sigma\to|a|\sigma\), so \(\rho\to\operatorname{sgn}(a)\,\rho\). Reversing the temperature scale would negate correlation. Multiplying by a positive constant preserves it.
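The scaling and sign behavior above can be checked numerically. A minimal sketch in pure Python, with made-up temperature and sales data (the specific numbers are hypothetical, not from the exercise):

```python
# Check that Cov scales by 1.8 under F = 1.8*T + 32,
# while Pearson correlation is unchanged. Data is made up.
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def corr(x, y):
    return cov(x, y) / (cov(x, x) ** 0.5 * cov(y, y) ** 0.5)

T = [10, 15, 20, 25, 30]       # hypothetical temperatures in Celsius
S = [3, 6, 8, 13, 15]          # hypothetical ice-cream sales
F = [1.8 * t + 32 for t in T]  # same temperatures in Fahrenheit

print(cov(F, S) / cov(T, S))   # ~1.8: covariance picked up the unit factor
print(corr(F, S) - corr(T, S)) # ~0: correlation is unchanged
```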
02 Correlation \(=\pm 1\) as a detective test (constructive, not computational)
You are given \(x=[1,2,3,4,5,6]\).
- Construct integer-valued \(y\) such that the correlation is exactly \(+1\).
- Construct integer-valued \(y\) such that the correlation is exactly \(-1\).
- Justify both by referencing the condition for extremal correlation (the \(Y=aX+b\) characterization).
Key fact. \(\rho_{X,Y}=\pm 1\) if and only if \(Y=aX+b\) exactly (a perfect linear relationship), with the sign of \(\rho\) matching the sign of \(a\).
a) \(\rho=+1\). Pick any positive slope \(a>0\) and integer intercept \(b\). The simplest choice: \(a=1\), \(b=0\), giving \(y=x\):
\[y=[1,2,3,4,5,6]\]
Or \(a=2,\,b=-1\): \(y=[1,3,5,7,9,11]\). Any integer-affine increasing function works.
b) \(\rho=-1\). Same idea with negative slope. Take \(a=-1\), \(b=7\):
\[y=[6,5,4,3,2,1]\]
Or \(a=-2,\,b=10\): \(y=[8,6,4,2,0,-2]\).
c) Justification.
Correlation reaches \(\pm 1\) exactly when all data points lie on a single straight line — this is the equality case in the Cauchy–Schwarz inequality \(|\operatorname{Cov}(X,Y)|\le\sigma_X\sigma_Y\). Since the constructions above place every \((x_i,y_i)\) on the line \(y=ax+b\) with \(a\ne 0\), the correlation is \(+1\) when \(a>0\) and \(-1\) when \(a<0\). The intercept \(b\) doesn’t matter (it shifts both \(\bar y\) and each \(y_i\) by the same amount).
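A quick numerical confirmation of both constructions, using the \(a=2,\,b=-1\) and \(a=-2,\,b=10\) choices from above:

```python
# Check: any y = a*x + b with a != 0 gives correlation exactly +1
# (a > 0) or -1 (a < 0). Pure-Python Pearson, no libraries.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [1, 2, 3, 4, 5, 6]
y_pos = [2 * xi - 1 for xi in x]    # a = 2 > 0  -> [1, 3, 5, 7, 9, 11]
y_neg = [-2 * xi + 10 for xi in x]  # a = -2 < 0 -> [8, 6, 4, 2, 0, -2]

print(pearson(x, y_pos))  # +1 (up to float rounding)
print(pearson(x, y_neg))  # -1 (up to float rounding)
```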
04 Pearson vs Spearman: monotonic but not linear
Consider \(x=[1,2,3,4,5,6]\) and \(y=[1,2,4,8,16,32]\).
- Argue (without calculating Pearson exactly) why Pearson correlation is not \(1\) (use the “linear relationship” criterion).
- Compute Spearman rank correlation exactly (no ties here).
- One sentence: why Spearman is the right tool here.
a) Why Pearson \(<1\).
Pearson correlation equals \(1\) only when the points lie on a straight line. Here \(y=2^{x-1}\) — exponential, not linear. A quick sanity check: from \(x=1\) to \(x=2\), \(y\) jumps by \(1\); from \(x=5\) to \(x=6\), \(y\) jumps by \(16\). The slope is wildly non-constant, so no straight line fits. Pearson will be high (the trend is monotonic and convex) but strictly less than \(1\). Numerically it’s about \(0.91\).
b) Spearman rank correlation.
Spearman is just Pearson applied to the ranks instead of the raw values. Ranks of \(x=[1,2,3,4,5,6]\) are \([1,2,3,4,5,6]\). Ranks of \(y=[1,2,4,8,16,32]\) are also \([1,2,3,4,5,6]\) (the function is strictly increasing, so it preserves order).
Both rank sequences are identical, so:
\[\rho_{\text{Spearman}}=\rho_{\text{Pearson}}\big(\text{rank}(x),\text{rank}(y)\big)=1\]
Equivalently, with no ties one can use the shortcut \(\rho_S=1-\dfrac{6\sum d_i^2}{n(n^2-1)}\) where \(d_i=\text{rank}(x_i)-\text{rank}(y_i)\). Here every \(d_i=0\), so \(\rho_S=1\).
c) Why Spearman is the right tool.
The relationship is monotonic but nonlinear, so Pearson under-reports the strength of association while Spearman correctly reports a perfect monotonic match — Spearman measures “is the order preserved?”, which is exactly what’s true here.
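Both claims are easy to verify in code. Since there are no ties, Spearman is just Pearson on the rank vectors:

```python
# Pearson vs Spearman on the exponential data: Pearson < 1, Spearman = 1.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(v):
    # 1-based ranks; assumes no ties, as in this exercise
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 4, 8, 16, 32]

print(round(pearson(x, y), 2))      # 0.91: high but strictly below 1
print(pearson(ranks(x), ranks(y)))  # 1.0: Spearman, order fully preserved
```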
05 One outlier can flip the story
Common points:
\[(1,1),(2,2),(3,3),(4,4),(5,5).\]
Dataset A adds \((6,6)\). Dataset B adds \((6,-20)\).
- For each dataset, decide whether the sample correlation is positive or negative (justify using the sign of “products of deviations,” not full computation).
- Explain how one point can dominate this “one-number summary.”
The sample correlation has the same sign as \(\sum_i (x_i-\bar x)(y_i-\bar y)\), since dividing by the (always non-negative) standard deviations doesn’t change sign. So we just need to track the sign of the products of deviations.
a) Dataset A: \((1,1),(2,2),(3,3),(4,4),(5,5),(6,6)\).
Means: \(\bar x=\bar y=3.5\). Every point is on the line \(y=x\), so \((x_i-3.5)(y_i-3.5)=(x_i-3.5)^2\ge 0\) for every \(i\). Each term is non-negative, so the sum is positive (in fact, strictly so). Correlation is \(+1\) exactly — perfect line.
b) Dataset B: \((1,1),(2,2),(3,3),(4,4),(5,5),(6,-20)\).
Now \(\bar x=3.5\) as before, but \(\bar y=\tfrac{1+2+3+4+5-20}{6}=\tfrac{-5}{6}\approx -0.83\).
Look at the outlier \((6,-20)\):
\[(x-\bar x)(y-\bar y)=(6-3.5)(-20-(-0.83))=2.5\cdot(-19.17)\approx -47.9\]
That’s one term contributing about \(-48\). The five “common” points \((1,1)\ldots(5,5)\) each contribute small positive amounts (their \(y\)-deviations are now small positives, around \(1.83,2.83,3.83,4.83,5.83\), paired with \(x\)-deviations \(-2.5,-1.5,-0.5,0.5,1.5\)). Compute the sum of these five:
\[(-2.5)(1.83)+(-1.5)(2.83)+(-0.5)(3.83)+(0.5)(4.83)+(1.5)(5.83)\approx -4.58-4.25-1.92+2.42+8.75\approx 0.42\]
The five well-behaved points sum to about \(+0.4\); the single outlier alone contributes about \(-48\). Sign of the total is negative, so correlation in B is negative (numerically about \(-0.53\)).
c) Why one point can dominate.
Pearson correlation is built from products of deviations, and deviations grow with distance from the mean. An outlier far from the mean — both in \(x\) and in \(y\) — gets multiplied by both a large \((x-\bar x)\) and a large \((y-\bar y)\), so its contribution is roughly quadratic in how extreme it is. The other five points have small deviations and contribute roughly nothing. The result: a single point flipped the headline number from “perfect positive correlation” to “moderate negative correlation” — same five common points, just one new observation.
Practical takeaway. Correlation is a one-number summary of a 2-D cloud, and like all one-number summaries it can be misleading. Always plot the data first; consider robust alternatives (Spearman, or trimmed/winsorized Pearson) when outliers are plausible.
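The flip is reproducible in a few lines; this sketch recomputes the Pearson correlation of both datasets from the text:

```python
# One outlier flips the sign: Pearson for datasets A and B.
def pearson(pts):
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    n = len(pts)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

common = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
A = common + [(6, 6)]    # stays on the line y = x
B = common + [(6, -20)]  # one outlier

print(pearson(A))            # 1.0: perfect positive line
print(round(pearson(B), 2))  # -0.53: one point flipped the sign
```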
06 Diversification as a decision: when does combining reduce risk?
Two daily returns \(R_1, R_2\) satisfy:
\[\operatorname{Var}(R_1)=4,\quad \operatorname{Var}(R_2)=9,\quad \operatorname{Cov}(R_1,R_2)=c.\]
Let \(P=R_1+R_2\).
- Express \(\operatorname{Var}(P)\) as a function of \(c\).
- Compare \(c=+5,\,0,\,-5\). Which case yields the smallest portfolio variance, and why?
- Translate the result into plain English (what does negative covariance “do” for you?).
a) Variance of the sum.
Use the bilinearity of covariance, with \(\operatorname{Var}(X)=\operatorname{Cov}(X,X)\):
\[\operatorname{Var}(R_1+R_2)=\operatorname{Var}(R_1)+\operatorname{Var}(R_2)+2\operatorname{Cov}(R_1,R_2)=4+9+2c=13+2c\]
b) Comparing the three cases.
| \(c\) | \(\operatorname{Var}(P)=13+2c\) |
|---|---|
| \(+5\) | \(23\) |
| \(0\) | \(13\) |
| \(-5\) | \(3\) |
Smallest variance is at \(c=-5\): \(\operatorname{Var}(P)=3\). When the two assets co-move negatively, their fluctuations partially cancel inside the sum.
Sanity check on the bound. The covariance \(c\) is constrained by \(|c|\le\sigma_1\sigma_2=\sqrt{4}\cdot\sqrt{9}=6\) (Cauchy–Schwarz / \(|\rho|\le 1\)). At the extremes:
- \(c=+6\) (perfect positive correlation, \(\rho=+1\)): \(\operatorname{Var}(P)=25=(2+3)^2=(\sigma_1+\sigma_2)^2\). Standard deviations add.
- \(c=-6\) (perfect negative correlation, \(\rho=-1\)): \(\operatorname{Var}(P)=1=(3-2)^2=(\sigma_2-\sigma_1)^2\). Standard deviations subtract.
So in this problem \(c=-5\) is close to but not at the theoretical floor.
c) Plain English.
Negative covariance is the mathematics of diversification. When asset 1 has a bad day, asset 2 tends to have a good day, and vice-versa. Add them together and the day-to-day swings of the portfolio are smaller than the swings of either piece alone. This is why investors hold a mix of assets that don’t all move together: not because the expected return is higher (linearity of expectation gives \(\mathbb{E}[P]=\mathbb{E}[R_1]+\mathbb{E}[R_2]\) regardless of \(c\)), but because the risk — the variance — is lower. Modern portfolio theory (Markowitz, 1952) is built directly on this identity.
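Both the variance-of-a-sum identity and the \(13+2c\) table can be sanity-checked in code. The return series below are hypothetical, chosen only to exercise the identity:

```python
# (i) Verify Var(R1 + R2) = Var(R1) + Var(R2) + 2*Cov(R1, R2) on sample
# data; (ii) tabulate Var(P) = 13 + 2c for the three cases.
def mean(v):
    return sum(v) / len(v)

def cov(x, y):  # population covariance; Var(x) = cov(x, x)
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

r1 = [0.5, -1.0, 2.0, 0.0, -1.5]  # hypothetical daily returns
r2 = [1.0, 0.5, -2.0, 1.5, -1.0]
p = [a + b for a, b in zip(r1, r2)]

lhs = cov(p, p)
rhs = cov(r1, r1) + cov(r2, r2) + 2 * cov(r1, r2)
print(abs(lhs - rhs) < 1e-12)  # True: the identity holds

for c in (+5, 0, -5):
    print(c, 13 + 2 * c)       # 23, 13, 3 -- lowest risk at c = -5
```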
07 Correlation as geometry: angle between centered vectors
You are given mean-centered vectors:
\[u=(1,0,-1),\qquad v=(2,-1,-1).\]
- Compute \(\cos(\theta)=\dfrac{u\cdot v}{\lVert u\rVert\,\lVert v\rVert}\).
- Interpret the sign and magnitude of \(\cos(\theta)\) as a correlation analogue.
- Decide whether the variables are “roughly aligned,” “roughly orthogonal,” or “roughly opposite.”
a) Cosine of the angle.
Dot product:
\[u\cdot v=(1)(2)+(0)(-1)+(-1)(-1)=2+0+1=3\]
Norms:
\[\lVert u\rVert=\sqrt{1^2+0^2+(-1)^2}=\sqrt{2},\qquad \lVert v\rVert=\sqrt{2^2+(-1)^2+(-1)^2}=\sqrt{6}\]
\[\cos\theta=\frac{3}{\sqrt{2}\cdot\sqrt{6}}=\frac{3}{\sqrt{12}}=\frac{3}{2\sqrt{3}}=\frac{\sqrt{3}}{2}\approx 0.866\]
So \(\theta=\arccos(\sqrt{3}/2)=30°\).
b) Why this is correlation.
For data vectors \(X=(x_1,\ldots,x_n)\) and \(Y=(y_1,\ldots,y_n)\), the mean-centered vectors are
\[u=X-\bar x\mathbf{1},\qquad v=Y-\bar y\mathbf{1}\]
Then:
- \(u\cdot v=\sum_i(x_i-\bar x)(y_i-\bar y)=n\cdot\widehat{\operatorname{Cov}}(X,Y)\)
- \(\lVert u\rVert=\sqrt{\sum_i(x_i-\bar x)^2}=\sqrt{n}\,\hat\sigma_X\), similarly for \(\lVert v\rVert\)
So:
\[\frac{u\cdot v}{\lVert u\rVert\,\lVert v\rVert}=\frac{n\,\widehat{\operatorname{Cov}}(X,Y)}{n\,\hat\sigma_X\hat\sigma_Y}=\hat\rho_{X,Y}\]
Pearson correlation IS the cosine of the angle between mean-centered data vectors. This is one of those identities that quietly explains many things at once:
- Why \(\rho\in[-1,1]\) — because cosines are.
- Why \(\rho=\pm 1\) exactly when the vectors are parallel/antiparallel (\(Y=aX+b\)).
- Why \(\rho=0\) exactly when the centered vectors are orthogonal.
- Why correlation behaves nicely under rotations of the data axes.
The correlation coefficient lives at the intersection of statistics and Euclidean geometry — and many ML methods (PCA, linear regression, cosine similarity in embeddings) are easier to read once you see it that way.
c) Interpretation.
\(\cos\theta\approx 0.866\) is positive and large, so the vectors are roughly aligned. The angle is \(30°\) — small enough that they point in essentially the same direction, far from orthogonal (\(90°\), \(\rho=0\)) and far from opposite (\(180°\), \(\rho=-1\)). As a correlation analogue, this corresponds to a strong positive correlation of about \(0.87\).
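The exercise's numbers check out directly; a minimal cosine computation for the given \(u\) and \(v\):

```python
# Correlation as cosine: for mean-centered vectors, Pearson equals
# the cosine of the angle between them. Checks u, v from the exercise.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

u = [1, 0, -1]
v = [2, -1, -1]
c = cosine(u, v)

print(round(c, 3))                        # 0.866 = sqrt(3)/2
print(round(math.degrees(math.acos(c))))  # 30 degrees
```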