21: Statistics — Foundations & Descriptive Stats

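📚 Materials

Stat Foundations: Population, Sample, Loss, ERM - video: https://youtu.be/ulNS3QVenYo, slides: Lectures/stat/01_stat.pdf, notes: Lectures/stat/01_stat_notes.pdf
Descriptive Statistics: Center, Spread, Shape, ... - video: https://youtu.be/ice5rtgBOcA, slides: Lectures/stat/02_stat.pdf, notes: Lectures/stat/02_stat_notes.pdf
Practical: ToDo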

🏡 Homework

Note

❗❗❗ DON’T CHECK THE SOLUTIONS BEFORE TRYING TO DO THE HOMEWORK BY YOURSELF❗❗❗
Please don’t hesitate to ask questions, and never forget about the 🍊karalyok🍊 principle!
The harder the problem is, the more 🧀cheeses🧀 it has.
Problems with 🎁 are just extra bonuses.
If a problem involves many tedious calculations, feel free to skip them.
Submit your solutions here: https://forms.gle/CFEvNqFiTSsDLiFc6 (even if they are unfinished)

01 Data Visualization

Go over the Data Visualization topic: https://hayktarkhanyan.github.io/python_math_ml_course/python_libs/06_data_viz.html

02 Exploratory Data Analysis

Pick a dataset (e.g. from Kaggle or armstat.am) and explore it:

Compute summary statistics (mean, median, SD, IQR, skewness, etc.)
Build histograms, boxplots, scatter plots, and the ECDF
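As a starting point, here is a minimal sketch of the numerical part using only Python's standard library. The `sample` list and the helper names `summarize` and `ecdf` are made up for illustration; on a real dataset you would more likely load the data with pandas and plot with matplotlib, as in the Data Visualization topic.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)                  # sample SD (n - 1 denominator)
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    # Moment-based skewness: mean of (x - mean)^3 divided by population SD cubed.
    pop_sd = statistics.pstdev(values)
    skew = sum((x - mean) ** 3 for x in values) / (n * pop_sd ** 3)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "iqr": q3 - q1,
        "skewness": skew,
    }


def ecdf(values):
    """Return the sorted values and ECDF heights: the i-th smallest gets i / n."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


# Hypothetical toy sample with one large value, so it is right-skewed.
sample = [2, 3, 3, 4, 5, 5, 6, 7, 9, 30]
stats = summarize(sample)
xs, ps = ecdf(sample)

for name, value in stats.items():
    print(f"{name:>8}: {value:.3f}")
```

Note that quartile (and therefore IQR) values depend on the interpolation convention, so different tools can give slightly different numbers on small samples.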

03 Survivorship Bias & Simpson’s Paradox

Come up with your own examples of:

Survivorship bias
Simpson’s paradox

Reference examples

These are canonical illustrations - your own examples should differ. The point is to recognize the pattern in the wild.

Survivorship bias - drawing conclusions from a sample from which the failures have already been filtered out.

WWII bomber armor (Abraham Wald, 1943). The military inspected planes that returned from missions and proposed reinforcing the spots where bullet holes clustered (wings, tail). Wald pointed out the inverse: armor the spots with no holes - the engine and cockpit. Planes hit there never came back to be inspected. The sample was already filtered for “survived a hit,” so its damage map showed only the non-fatal hit locations.
“College dropouts get rich.” Pop culture cites Gates, Jobs, Zuckerberg as evidence dropping out helps. But the millions of dropouts who did not found unicorns are absent from the sample. The base rate question - $P(\text{rich} \mid \text{dropout})$ - cannot be answered from the celebrity sample alone.
Mutual fund performance. Industry “average 10-year return” stats often only include funds that still exist. Funds that performed poorly got liquidated or merged, so they drop out of the dataset (called fund attrition). The historical average of surviving funds overstates the return an investor would have actually earned.
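The mutual fund effect is easy to reproduce in a toy simulation. The sketch below is purely illustrative: the return distribution (mean 5%, SD 4%) and the 2% closure threshold are invented assumptions, not real market figures.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# Assumption: each hypothetical fund's annualized 10-year return is drawn
# from a normal distribution with mean 5% and SD 4%.
all_returns = [random.gauss(0.05, 0.04) for _ in range(10_000)]

# Assumption: funds returning less than 2% are closed and vanish from the data.
CLOSE_BELOW = 0.02
survivors = [r for r in all_returns if r >= CLOSE_BELOW]

mean_all = statistics.fmean(all_returns)
mean_survivors = statistics.fmean(survivors)

print(f"funds launched:   {len(all_returns)}")
print(f"funds surviving:  {len(survivors)}")
print(f"mean, all funds:  {mean_all:.2%}")
print(f"mean, survivors:  {mean_survivors:.2%}")
```

Whatever the seed, the survivors' mean cannot fall below the mean of all funds, because the filter removes only the lowest returns. That gap is the survivorship bias.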

Simpson’s paradox - a trend that holds within each subgroup disappears or reverses when the subgroups are pooled. The lurking variable is usually a confounder that correlates with both group membership and the outcome.
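The reversal is a consequence of how pooled rates are formed. As a sketch, with notation introduced here only for illustration, let $n_{g,k}$ be the number of cases in group $g$ and subgroup $k$, and $r_{g,k}$ the success rate there:

$$
r_g^{\text{pooled}} = \frac{\sum_k n_{g,k}\, r_{g,k}}{\sum_k n_{g,k}} = \sum_k w_{g,k}\, r_{g,k}, \qquad w_{g,k} = \frac{n_{g,k}}{\sum_j n_{g,j}}.
$$

Each pooled rate is a weighted average of the subgroup rates, and the weights $w_{g,k}$ differ between groups. So $r_{A,k} > r_{B,k}$ for every $k$ does not imply $r_A^{\text{pooled}} > r_B^{\text{pooled}}$.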

UC Berkeley graduate admissions (1973). Aggregate: men admitted at 44%, women at 35%, suggesting bias against women. But broken down by department, most departments admitted women at equal or higher rates than men. The reversal happened because women disproportionately applied to highly selective departments (humanities), while men applied to less competitive ones (engineering). Department was a confounder: it correlated with both gender and acceptance rate.
Kidney stone treatments (Charig et al., 1986). Treatment B looked better overall (83% success vs 78% for A). But split by stone size:

|             | Small stones  | Large stones  | Overall |
|-------------|---------------|---------------|---------|
| Treatment A | 93% (81/87)   | 73% (192/263) | 78%     |
| Treatment B | 87% (234/270) | 69% (55/80)   | 83%     |

A wins in both subgroups but loses overall. Reason: doctors gave A to harder (large-stone) cases and B to easier (small-stone) cases. Stone size confounds the comparison.
Batting averages across two seasons. A player can have a higher BA than another player in both 2023 and 2024, yet a lower combined BA, if the two players' at-bats are distributed differently across the seasons (e.g. one has most of their at-bats in their weaker year, the other in their stronger year). The arithmetic is the same as in the kidney stone example: each combined BA is an at-bat-weighted average of the seasonal BAs, and the two players' weights differ.
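The kidney stone reversal can be checked directly from the counts in the table. This is a small sketch; the dictionary layout and the function name `pooled_rate` are just one convenient choice.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone size, from the kidney stone table.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def pooled_rate(by_size):
    """Success rate after adding up successes and patients across stone sizes."""
    successes = sum(s for s, _ in by_size.values())
    patients = sum(n for _, n in by_size.values())
    return Fraction(successes, patients)


# Within each stone size, compare the two treatments.
for size in ("small", "large"):
    a = Fraction(*counts["A"][size])
    b = Fraction(*counts["B"][size])
    print(f"{size:>7} stones: A = {float(a):.1%}, B = {float(b):.1%}")

# Pool the stone sizes and compare again.
overall_a = pooled_rate(counts["A"])
overall_b = pooled_rate(counts["B"])
print(f"{'overall':>7}       : A = {float(overall_a):.1%}, B = {float(overall_b):.1%}")
```

Treatment A's pooled rate is dominated by its 263 large-stone cases, while B's is dominated by its 270 small-stone cases, which is why the overall comparison flips.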

The common thread. Both biases arise from forgetting how the data was collected before computing summary statistics. Survivorship bias = filtered sample. Simpson’s = unaccounted confounder during aggregation. The fix in both cases is the same: think carefully about how the data was generated and selected, not just about the spreadsheet you ended up with.

🎲 xx+37 (xx)

▶️ToDo
🔗Random link ToDo
🇦🇲🎶ToDo
🌐🎶ToDo
🤌Կարգին ToDo
