!pip install uv
Requirement already satisfied: uv in c:\users\hayk_\.conda\envs\lectures\lib\site-packages (0.7.19)
Ashtarak, photo link, Author: Anna Grigoryan
We explore a dataset of 1.2 million notable people and, in the process, get comfortable working with pandas.
We recommend poking around in the data on your own first, and only then watching the video.
Take any dataset and dig into it.
You can get data from Kaggle or, if you want Armenian data, from Armstat.
!uv pip install kagglehub[pandas-datasets]
Using Python 3.10.18 environment at: C:\Users\hayk_\.conda\envs\lectures
Resolved 16 packages in 922ms
Prepared 1 package in 194ms
Installed 1 package in 40ms
+ kagglehub==0.3.12
import kagglehub
# Download latest version
path = kagglehub.dataset_download("imoore/age-dataset")
print("Path to dataset files:", path)
c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
Path to dataset files: C:\Users\hayk_\.cache\kagglehub\datasets\imoore\age-dataset\versions\1
import os
print(os.listdir(path))
path_csv = os.path.join(path, "AgeDataset-V1.csv")  # Pathlib is better
['AgeDataset-V1.csv', 'assets']
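The comment in the cell above says pathlib is the better choice. A minimal sketch of the same list-and-join step with `pathlib.Path` follows; the temporary folder and the tiny CSV written into it are stand-ins for the real download directory (an assumption made only so the snippet runs on its own).

```python
import tempfile
from pathlib import Path

# Stand-in for the folder returned by kagglehub (hypothetical temp directory).
dataset_dir = Path(tempfile.mkdtemp())
(dataset_dir / "AgeDataset-V1.csv").write_text("Id,Name\nQ23,George Washington\n")

# Path objects are joined with "/" instead of os.path.join.
path_csv = dataset_dir / "AgeDataset-V1.csv"

# iterdir() plays the role of os.listdir(); .name drops the parent folders.
file_names = sorted(p.name for p in dataset_dir.iterdir())
print(file_names)
print(path_csv.suffix, path_csv.exists())
```

`pd.read_csv` accepts a `Path` object directly, so nothing else in the notebook would need to change.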
import pandas as pd
df = pd.read_csv(path_csv)
df.head()
| | Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death |
---|---|---|---|---|---|---|---|---|---|---|
0 | Q23 | George Washington | 1st president of the United States (1732–1799) | Male | United States of America; Kingdom of Great Bri... | Politician | 1732 | 1799.0 | natural causes | 67.0 |
1 | Q42 | Douglas Adams | English writer and humorist | Male | United Kingdom | Artist | 1952 | 2001.0 | natural causes | 49.0 |
2 | Q91 | Abraham Lincoln | 16th president of the United States (1809-1865) | Male | United States of America | Politician | 1809 | 1865.0 | homicide | 56.0 |
3 | Q254 | Wolfgang Amadeus Mozart | Austrian composer of the Classical period | Male | Archduchy of Austria; Archbishopric of Salzburg | Artist | 1756 | 1791.0 | NaN | 35.0 |
4 | Q255 | Ludwig van Beethoven | German classical and romantic composer | Male | Holy Roman Empire; Austrian Empire | Artist | 1770 | 1827.0 | NaN | 57.0 |
from pathlib import Path
new_path = Path("assets/people.csv")
df.to_csv(new_path, index=False)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 5
      1 from pathlib import Path
      3 new_path = Path("assets/people.csv")
----> 5 df.to_csv(new_path, index=False)

NameError: name 'df' is not defined
df
| | Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death |
---|---|---|---|---|---|---|---|---|---|---|
0 | Q23 | George Washington | 1st president of the United States (1732–1799) | Male | United States of America; Kingdom of Great Bri... | Politician | 1732 | 1799.0 | natural causes | 67.0 |
1 | Q42 | Douglas Adams | English writer and humorist | Male | United Kingdom | Artist | 1952 | 2001.0 | natural causes | 49.0 |
2 | Q91 | Abraham Lincoln | 16th president of the United States (1809-1865) | Male | United States of America | Politician | 1809 | 1865.0 | homicide | 56.0 |
3 | Q254 | Wolfgang Amadeus Mozart | Austrian composer of the Classical period | Male | Archduchy of Austria; Archbishopric of Salzburg | Artist | 1756 | 1791.0 | NaN | 35.0 |
4 | Q255 | Ludwig van Beethoven | German classical and romantic composer | Male | Holy Roman Empire; Austrian Empire | Artist | 1770 | 1827.0 | NaN | 57.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1223004 | Q77247326 | Marie-Fortunée Besson | Frans model (1907-1996) | NaN | France | Tailor; model | 1907 | 1996.0 | NaN | 89.0 |
1223005 | Q77249504 | Ron Thorsen | xugador de baloncestu canadianu (1948–2004) | NaN | Canada; United States of America | Athlete | 1948 | 2004.0 | NaN | 56.0 |
1223006 | Q77249818 | Diether Todenhagen | German navy officer and world war II U-boat co... | NaN | Germany | Military personnel | 1920 | 1944.0 | NaN | 24.0 |
1223007 | Q77253909 | Reginald Oswald Pearson | English artist, working in stained glass, prin... | Male | United Kingdom | Artist | 1887 | 1915.0 | NaN | 28.0 |
1223008 | Q77254864 | Horst Lerche | German painter | Male | Germany | Artist | 1938 | 2017.0 | NaN | 79.0 |
1223009 rows × 10 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1223009 entries, 0 to 1223008
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1223009 non-null object
1 Name 1223009 non-null object
2 Short description 1155109 non-null object
3 Gender 1089363 non-null object
4 Country 887500 non-null object
5 Occupation 1016095 non-null object
6 Birth year 1223009 non-null int64
7 Death year 1223008 non-null float64
8 Manner of death 53603 non-null object
9 Age of death 1223008 non-null float64
dtypes: float64(2), int64(1), object(7)
memory usage: 93.3+ MB
df.isna().sum() / len(df) * 100
Id 0.000000
Name 0.000000
Short description 5.551881
Gender 10.927638
Country 27.433077
Occupation 16.918436
Birth year 0.000000
Death year 0.000082
Manner of death 95.617121
Age of death 0.000082
dtype: float64
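The percentage of missing values can also be written as the mean of the boolean mask, since the mean of True/False values is the fraction of True. A small sketch on a made-up frame (the `toy` data is an assumption, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Country": ["France", None, None, "Spain"],
    "Age of death": [67.0, 49.0, np.nan, 35.0],
})

# Mean of a boolean mask = share of missing cells per column.
missing_pct = toy.isna().mean() * 100

# The form used in the notebook, for comparison.
same_as_before = toy.isna().sum() / len(toy) * 100
print(missing_pct)
```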
df.shape
(1223009, 10)
df.describe()
| | Birth year | Death year | Age of death |
---|---|---|---|
count | 1.223009e+06 | 1.223008e+06 | 1.223008e+06 |
mean | 1.844972e+03 | 1.914246e+03 | 6.927406e+01 |
std | 1.479390e+02 | 1.516898e+02 | 1.662938e+01 |
min | -2.700000e+03 | -2.659000e+03 | 0.000000e+00 |
25% | 1.828000e+03 | 1.895000e+03 | 6.000000e+01 |
50% | 1.887000e+03 | 1.955000e+03 | 7.200000e+01 |
75% | 1.918000e+03 | 1.994000e+03 | 8.100000e+01 |
max | 2.016000e+03 | 2.021000e+03 | 1.690000e+02 |
df["Country"].nunique()
5961
df.value_counts("Country")
Country
United States of America 152761
Germany 95081
France 78666
United Kingdom; United Kingdom of Great Britain and Ireland 29684
Sweden 26915
...
ducat de Bremen; Duchy of Holstein 1
emirate of Córdoba; Umayyad Caliphate 1
Zhao 1
Zimbabwe; Rhodesia; Federation of Rhodesia and Nyasaland; Southern Rhodesia; Zimbabwe Rhodesia 1
Zimbabwe; Rhodesia; Zimbabwe Rhodesia 1
Name: count, Length: 5961, dtype: int64
df.value_counts("Occupation")
Occupation
Artist 281512
Politician 195390
Athlete 110943
Researcher 90709
Military personnel 52911
...
Zoology 1
Zoology; marine biology; biologist 1
École polytechnique 1
Academic; literary scholar 1
Wholesale; land owner; philanthropist 1
Name: count, Length: 9313, dtype: int64
df.groupby("Occupation")["Age of death"].mean().sort_values(ascending=False)
Occupation
Farmer; lecturer 121.0
Deacon; preacher 99.0
Warrior; noble 99.0
Suffragette; philanthropist; social reformer; suffragist 99.0
Studienrat; lecturer 99.0
...
Basij 13.0
Lehnsmann 13.0
Servant of god 12.0
Pioneers-heroes 11.0
Miner; master builder 11.0
Name: Age of death, Length: 9313, dtype: float64
df.groupby("Occupation")["Age of death"].describe()
Occupation | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
1859 | 1.0 | 47.000000 | NaN | 47.0 | 47.0 | 47.0 | 47.0 | 47.0 |
Abbess | 36.0 | 60.694444 | 16.924740 | 24.0 | 49.0 | 63.0 | 73.0 | 90.0 |
Abbess; business executive | 1.0 | 86.000000 | NaN | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 |
Abbess; christians jehovah’s witnesses | 1.0 | 81.000000 | NaN | 81.0 | 81.0 | 81.0 | 81.0 | 81.0 |
Abbé | 6.0 | 69.666667 | 16.070677 | 41.0 | 66.0 | 72.5 | 79.0 | 87.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
Zoology | 1.0 | 44.000000 | NaN | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 |
Zoology; marine biology; biologist | 1.0 | 73.000000 | NaN | 73.0 | 73.0 | 73.0 | 73.0 | 73.0 |
École polytechnique | 1.0 | 72.000000 | NaN | 72.0 | 72.0 | 72.0 | 72.0 | 72.0 |
Župan | 5.0 | 38.600000 | 27.061042 | 12.0 | 21.0 | 32.0 | 47.0 | 81.0 |
مجموعة الأنظمة منصة شليلة; serology; bacteriologist | 1.0 | 40.000000 | NaN | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 |
9313 rows × 8 columns
df["Occupation"].nunique()
9313
df["Occupation"].value_counts()
Occupation
Artist True
Politician True
Athlete True
Researcher True
Military personnel True
...
Director; scout leader False
Salonnière; patron of the arts False
Servant of god False
Cleric; coal miner False
Goldsmith; metalsmith False
Name: count, Length: 9313, dtype: bool
occup_counts = df["Occupation"].value_counts()
occup_counts[occup_counts > 1_000].index
Index(['Artist', 'Politician', 'Athlete', 'Researcher', 'Military personnel',
'Religious figure', 'Businessperson', 'Architect', 'Journalist',
'Teacher', 'Physician', 'Engineer', 'Judge', 'Lawyer', 'Jurist',
'Aristocrat', 'Entrepreneur', 'Philosopher', 'Translator', 'Publisher',
'Librarian', 'Author', 'Surgeon', 'Merchant', 'Novelist', 'Rower',
'Astronomer', 'Pianist', 'Psychologist', 'Pastor', 'Minister', 'Farmer',
'Inventor', 'Psychiatrist', 'Rabbi', 'Explorer', 'Fencer',
'Police officer', 'Trade unionist'],
dtype='object', name='Occupation')
# Note: despite the "_100" in its name, this keeps occupations with more than 1_000 people.
occupations_more_than_100 = occup_counts[occup_counts > 1_000].index
# df[df["Occupation"].isin(occupations_more_than_100)]
df = df[df["Occupation"].isin(occupations_more_than_100)]
age_by_occup = df.groupby("Occupation")["Age of death"].mean()
age_by_occup
Occupation
Architect 72.085306
Aristocrat 53.006540
Artist 69.725145
Astronomer 71.152301
Athlete 68.772460
Author 70.094754
Businessperson 74.153054
Engineer 72.156611
Entrepreneur 73.222146
Explorer 61.799302
Farmer 71.240991
Fencer 72.496454
Inventor 73.129545
Journalist 69.591239
Judge 74.004850
Jurist 69.488372
Lawyer 71.231208
Librarian 73.437335
Merchant 68.316125
Military personnel 63.820056
Minister 69.176471
Novelist 71.452949
Pastor 68.870213
Philosopher 71.037957
Physician 70.683996
Pianist 71.754000
Police officer 64.403013
Politician 70.541558
Psychiatrist 73.231385
Psychologist 76.396378
Publisher 71.178990
Rabbi 71.741322
Religious figure 69.801273
Researcher 73.131376
Rower 71.019317
Surgeon 71.815642
Teacher 73.331995
Trade unionist 71.768421
Translator 72.046317
Name: Age of death, dtype: float64
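The steps above first drop rare occupations and then average, because groups with one or two people produce extreme means. One possible way to sketch the same idea in a single pass is to aggregate count and mean together and filter on the count. The `toy` data and the `min_group_size` threshold are assumptions for illustration; the notebook itself uses `> 1_000`.

```python
import pandas as pd

# Toy data (made up): one occupation has a single, extreme entry.
toy = pd.DataFrame({
    "Occupation": ["Artist", "Artist", "Artist", "Farmer; lecturer", "Athlete", "Athlete"],
    "Age of death": [60.0, 70.0, 80.0, 121.0, 50.0, 70.0],
})

min_group_size = 2  # hypothetical threshold for this toy example

# One groupby gives both the group size and the mean age.
stats = toy.groupby("Occupation")["Age of death"].agg(["count", "mean"])

# Keep only groups large enough for the mean to be meaningful.
reliable = stats[stats["count"] >= min_group_size].sort_values("mean", ascending=False)
print(reliable)
```

Unlike reassigning `df`, this leaves the original frame untouched, which may be convenient when later cells still need the rare occupations.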
df.columns
Index(['Id', 'Name', 'Short description', 'Gender', 'Country', 'Occupation',
'Birth year', 'Death year', 'Manner of death', 'Age of death'],
dtype='object')
df["Gender"].value_counts(normalize=True) * 100
Gender
Male 90.966985
Female 9.019897
Transgender Female 0.005605
Transgender Male 0.002862
Eunuch; Male 0.001908
Female; Male 0.000716
Intersex 0.000596
Transgender Male; Female 0.000358
Non-Binary 0.000239
Transgender Person; Intersex; Transgender Male 0.000119
Intersex; Male 0.000119
Transgender Female; Female 0.000119
Transgender Female; Male 0.000119
Intersex; Transgender Male 0.000119
Transgender Male; Male 0.000119
Female; Female 0.000119
Name: proportion, dtype: float64
df[df["Manner of death"] == "Suicide"].empty
True
df["Manner of death"].value_counts()
Manner of death
natural causes 29717
suicide 4647
accident 4217
homicide 3273
capital punishment 1813
...
rebellion 1
Holocaust victim 1
unknown 1
war; suicide 1
White Terror 1
Name: count, Length: 166, dtype: int64
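The filter on "Suicide" came back empty while the counts list "suicide" in lowercase: string equality in pandas is case-sensitive. A minimal sketch of a case-insensitive comparison, using a made-up Series (the values are assumptions, not rows from the dataset):

```python
import pandas as pd

# Made-up labels, including a missing value and a capitalised variant.
manner = pd.Series(
    ["suicide", "natural causes", None, "Suicide"], name="Manner of death"
)

# Exact comparison only matches the identical spelling.
exact = manner == "Suicide"

# Lower-casing first matches both spellings; missing values compare as False.
folded = manner.str.lower() == "suicide"

print(exact.sum(), folded.sum())
```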
df_suicide = df[df["Manner of death"] == "suicide"]
df_suicide
| | Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death |
---|---|---|---|---|---|---|---|---|---|---|
23 | Q440 | Salvador Allende | 28th president of Chile (1908–1973) | Male | Chile | Politician | 1908 | 1973.0 | suicide | 65.0 |
131 | Q1322 | José Manuel Balmaceda | Chilean politician and President (1840-1891) | Male | Chile | Politician | 1840 | 1891.0 | suicide | 51.0 |
189 | Q2022 | Cesare Pavese | Italian poet, novelist, literary critic, and t... | Male | Italy; Kingdom of Italy | Researcher | 1908 | 1950.0 | suicide | 42.0 |
323 | Q4616 | Marilyn Monroe | American actress, model, and singer (1926-1962) | Female | United States of America | Artist | 1926 | 1962.0 | suicide | 36.0 |
327 | Q4673 | Paul Otto | German film actor and director | Male | Nazi Germany; Weimar Republic; German Empire | Artist | 1878 | 1943.0 | suicide | 65.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1212054 | Q70834687 | Karl Neumann | politician and director of the Deutsche Zeiche... | NaN | German Reich | Politician | 1900 | 1945.0 | suicide | 45.0 |
1213739 | Q73375287 | Peter Kuranda | Austrian journalist | Male | Austria; Austria-Hungary | Journalist | 1896 | 1938.0 | suicide | 42.0 |
1214539 | Q75135015 | Michael Benveniste | American pornographic film director | Male | United States of America | Artist | 1946 | 1982.0 | suicide | 36.0 |
1215398 | Q75336010 | George Dewey Sanford Jr. | United States Marine | Male | United States of America | Military personnel | 1925 | 1994.0 | suicide | 69.0 |
1217823 | Q75694915 | Gotthard Zimmer | fotograaf uit Oostenrijk-Hongarije (1847-1886) | NaN | Austria-Hungary | Artist | 1847 | 1886.0 | suicide | 39.0 |
4647 rows × 10 columns
suicide_counts_country = df_suicide["Country"].value_counts()
suicide_counts_country
Country
United States of America 991
France 362
Germany 321
United Kingdom 152
Japan 141
...
Qing dynasty; Ming dynasty; Kingdom of Tungning 1
Spain; Peru 1
West Germany 1
Qing dynasty; China 1
United States of America; Russian Empire 1
Name: count, Length: 354, dtype: int64
country_counts = df["Country"].value_counts()
country_counts
Country
United States of America 135127
Germany 78718
France 65572
United Kingdom; United Kingdom of Great Britain and Ireland 26642
Spain 21930
...
Afghanistan; Austria-Hungary 1
Syria; Ottoman Empire; State of Damascus; Arab Kingdom of Syria; State of Syria; Syrian Republic; United Arab Republic 1
Republic of Florence; Grand Duchy of Tuscany 1
Grand Duchy of Tuscany; Duchy of Lucca; Kingdom of Italy 1
Norway; Austria-Hungary; Union between Sweden and Norway 1
Name: count, Length: 5400, dtype: int64
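Many `Country` values list several states separated by semicolons, so each combination is counted as its own category. If per-country counts are wanted instead, one possible approach is to split and explode the column. This is a sketch on made-up rows; note that a person with two countries is then counted once for each.

```python
import pandas as pd

# Made-up rows: one multi-country entry, one single, one missing.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Country": ["France; Armenia", "France", None],
})

per_country = (
    toy["Country"]
    .dropna()          # skip people without a country
    .str.split(";")    # "France; Armenia" -> ["France", " Armenia"]
    .explode()         # one row per listed country
    .str.strip()       # remove the space left after ";"
    .value_counts()
)
print(per_country)
```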
suicide = pd.merge(suicide_counts_country, country_counts,
                   how="left",
                   on="Country", suffixes=("_suicide", "_overall"))
suicide
Country | count_suicide | count_overall |
---|---|---|
United States of America | 991 | 135127 |
France | 362 | 65572 |
Germany | 321 | 78718 |
United Kingdom | 152 | 19127 |
Japan | 141 | 13209 |
... | ... | ... |
Qing dynasty; Ming dynasty; Kingdom of Tungning | 1 | 1 |
Spain; Peru | 1 | 23 |
West Germany | 1 | 21 |
Qing dynasty; China | 1 | 10 |
United States of America; Russian Empire | 1 | 151 |
354 rows × 2 columns
suicide = pd.merge(suicide_counts_country, country_counts,
                   how="left", on="Country",
                   suffixes=("_suicide", "_overall"))
suicide
Country | count_suicide | count_overall |
---|---|---|
United States of America | 991 | 135127 |
France | 362 | 65572 |
Germany | 321 | 78718 |
United Kingdom | 152 | 19127 |
Japan | 141 | 13209 |
... | ... | ... |
Qing dynasty; Ming dynasty; Kingdom of Tungning | 1 | 1 |
Spain; Peru | 1 | 23 |
West Germany | 1 | 21 |
Qing dynasty; China | 1 | 10 |
United States of America; Russian Empire | 1 | 151 |
354 rows × 2 columns
suicide["suicide_over_total"] = suicide["count_suicide"] / suicide["count_overall"]
suicide
Country | count_suicide | count_overall | suicide_over_total |
---|---|---|---|
United States of America | 991 | 135127 | 0.007334 |
France | 362 | 65572 | 0.005521 |
Germany | 321 | 78718 | 0.004078 |
United Kingdom | 152 | 19127 | 0.007947 |
Japan | 141 | 13209 | 0.010675 |
... | ... | ... | ... |
Qing dynasty; Ming dynasty; Kingdom of Tungning | 1 | 1 | 1.000000 |
Spain; Peru | 1 | 23 | 0.043478 |
West Germany | 1 | 21 | 0.047619 |
Qing dynasty; China | 1 | 10 | 0.100000 |
United States of America; Russian Empire | 1 | 151 | 0.006623 |
354 rows × 3 columns
suicide["suicide_per_1k"] = suicide["suicide_over_total"] * 1000
suicide_sorted = suicide.sort_values(by="suicide_per_1k", ascending=True)
suicide_sorted
Country | count_suicide | count_overall | suicide_over_total | suicide_per_1k |
---|---|---|---|---|
Spain | 31 | 21930 | 0.001414 | 1.413589 |
Denmark | 16 | 9187 | 0.001742 | 1.741591 |
Kingdom of England | 7 | 3920 | 0.001786 | 1.785714 |
Grand Duchy of Finland | 1 | 549 | 0.001821 | 1.821494 |
India; British Raj | 5 | 2642 | 0.001893 | 1.892506 |
... | ... | ... | ... | ... |
Northern Ireland; Ireland | 1 | 1 | 1.000000 | 1000.000000 |
People's Republic of Bulgaria | 1 | 1 | 1.000000 | 1000.000000 |
United States of America; French Third Republic; Second French Empire | 1 | 1 | 1.000000 | 1000.000000 |
Japan; China | 1 | 1 | 1.000000 | 1000.000000 |
Nazi Germany; Kingdom of Romania; West Germany | 1 | 1 | 1.000000 | 1000.000000 |
354 rows × 4 columns
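The bottom of the sorted table is filled with rows where `count_overall` is 1, which forces the rate to 1000 per thousand. A minimal sketch of ranking only after a minimum sample size is required; the `toy` table and the `min_overall` cut-off are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the merged table (numbers are made up).
toy = pd.DataFrame(
    {"count_suicide": [3, 2, 1], "count_overall": [300, 1000, 1]},
    index=pd.Index(["Japan", "France", "Tiny state"], name="Country"),
)
toy["suicide_per_1k"] = toy["count_suicide"] / toy["count_overall"] * 1000

min_overall = 100  # hypothetical cut-off, chosen only for this toy data

# Drop small groups first, then rank the remaining ones.
ranked = toy[toy["count_overall"] >= min_overall].sort_values(
    "suicide_per_1k", ascending=False
)
print(ranked)
```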
suicide_sorted.tail(10)
Country | count_suicide | count_overall | suicide_over_total | suicide_per_1k |
---|---|---|---|---|
North Korea; Soviet Union; Russian Empire | 1 | 1 | 1.0 | 1000.0 |
Classical Athens; Ancient Carthage | 1 | 1 | 1.0 | 1000.0 |
Qin | 1 | 1 | 1.0 | 1000.0 |
Germany; Nazi Germany; Austria-Hungary; Czechoslovakia | 1 | 1 | 1.0 | 1000.0 |
Ottoman Empire; Soviet Union; Russian Empire | 1 | 1 | 1.0 | 1000.0 |
Northern Ireland; Ireland | 1 | 1 | 1.0 | 1000.0 |
People's Republic of Bulgaria | 1 | 1 | 1.0 | 1000.0 |
United States of America; French Third Republic; Second French Empire | 1 | 1 | 1.0 | 1000.0 |
Japan; China | 1 | 1 | 1.0 | 1000.0 |
Nazi Germany; Kingdom of Romania; West Germany | 1 | 1 | 1.0 | 1000.0 |
suicide_sorted[suicide_sorted["count_overall"] > 5_000]["suicide_per_1k"].tail(10).plot(kind="bar")
df.isna().sum()
Id 0
Name 0
Short description 8421
Gender 87633
Country 182030
Occupation 0
Birth year 0
Death year 0
Manner of death 881641
Age of death 0
dtype: int64
df["Gender"].value_counts()
Gender
Male 762780
Female 75634
Transgender Female 47
Transgender Male 24
Eunuch; Male 16
Female; Male 6
Intersex 5
Transgender Male; Female 3
Non-Binary 2
Transgender Person; Intersex; Transgender Male 1
Intersex; Male 1
Transgender Female; Female 1
Transgender Female; Male 1
Intersex; Transgender Male 1
Transgender Male; Male 1
Female; Female 1
Name: count, dtype: int64
df.query("Gender == 'Non-Binary'")  # df[df["Gender"] == "Non-Binary"]
| | Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death |
---|---|---|---|---|---|---|---|---|---|---|
39998 | Q219634 | Claude Cahun | French artist (1894-1954) | Non-Binary | France | Artist | 1894 | 1954.0 | NaN | 60.0 |
754386 | Q13562059 | Maxine Feldman | lesbian and non-binary musician | Non-Binary | United States of America | Artist | 1945 | 2007.0 | NaN | 62.0 |
df.columns
Index(['Id', 'Name', 'Short description', 'Gender', 'Country', 'Occupation',
'Birth year', 'Death year', 'Manner of death', 'Age of death'],
dtype='object')
df.groupby("Gender")["Birth year"].max().sort_values()
Gender
Eunuch; Male 1451
Intersex; Male 1763
Transgender Male; Male 1869
Female; Female 1884
Transgender Person; Intersex; Transgender Male 1885
Intersex; Transgender Male 1912
Transgender Male; Female 1913
Intersex 1926
Non-Binary 1945
Transgender Female; Male 1947
Female; Male 1949
Transgender Female; Female 1949
Transgender Male 1986
Transgender Female 1991
Male 2002
Female 2002
Name: Birth year, dtype: int64
df = df[df["Gender"].isin(["Male", "Female"])]
df["Occupation"].unique()
array(['Politician', 'Artist', 'Astronomer', 'Athlete', 'Researcher',
'Military personnel', 'Philosopher', 'Businessperson', 'Explorer',
'Architect', 'Teacher', 'Aristocrat', 'Entrepreneur', 'Journalist',
'Engineer', 'Author', 'Religious figure', 'Judge', 'Librarian',
'Translator', 'Physician', 'Inventor', 'Trade unionist',
'Merchant', 'Publisher', 'Pastor', 'Fencer', 'Rabbi',
'Psychologist', 'Lawyer', 'Rower', 'Jurist', 'Police officer',
'Surgeon', 'Psychiatrist', 'Pianist', 'Farmer', 'Minister',
'Novelist'], dtype=object)
df_reserach = df[df["Occupation"] == "Researcher"]
df_reserach.shape[0]
81735
len(df_reserach)
81735
df_reserach.value_counts("Gender").loc["Male"] / len(df_reserach)
np.float64(0.9204624701780143)
df_reserach["Gender"].value_counts(normalize=True).loc["Male"]
np.float64(0.9204624701780143)
def get_male_percentage(series):
    return series.value_counts(normalize=True).loc["Male"] * 100
get_male_percentage(df_reserach["Gender"])
np.float64(0.9204624701780143)
for m in df["Occupation"].unique():
    df_filter = df[df["Occupation"] == m]
    print(m, get_male_percentage(df_filter["Gender"]))
Politician 0.9561554391245799
Artist 0.821756963672281
Astronomer 0.9173256649892164
Athlete 0.9672833532213965
Researcher 0.9204624701780143
Military personnel 0.9830178291619024
Philosopher 0.9450272765421738
Businessperson 0.9515949663447468
Explorer 0.9703315881326352
Architect 0.9670399592771698
Teacher 0.8632561613144137
Aristocrat 0.6248584371460929
Entrepreneur 0.9663496708119971
Journalist 0.8801171679645639
Engineer 0.9881951949455483
Author 0.8742255266418835
Religious figure 0.9743905658716888
Judge 0.9711538461538461
Librarian 0.7817745803357314
Translator 0.7956669498725574
Physician 0.9199198326943185
Inventor 0.9727497935590421
Trade unionist 0.8755980861244019
Merchant 0.9845261121856866
Publisher 0.9534782608695652
Pastor 0.9901071723000825
Fencer 0.875886524822695
Rabbi 0.9920704845814978
Psychologist 0.7916018662519441
Lawyer 0.939869484151647
Rower 0.9845460399227302
Jurist 0.988530990727184
Police officer 0.9582909460834181
Surgeon 0.9824890556597874
Psychiatrist 0.9107303877366997
Pianist 0.659037095501184
Farmer 0.9534109816971714
Minister 0.9712918660287081
Novelist 0.5951293759512938
gender_occup = df.groupby("Occupation")["Gender"].apply(get_male_percentage).sort_values()
gender_occup
Occupation
Novelist 59.512938
Aristocrat 62.485844
Pianist 65.903710
Librarian 78.177458
Psychologist 79.160187
Translator 79.566695
Artist 82.175696
Teacher 86.325616
Author 87.422553
Trade unionist 87.559809
Fencer 87.588652
Journalist 88.011717
Psychiatrist 91.073039
Astronomer 91.732566
Physician 91.991983
Researcher 92.046247
Lawyer 93.986948
Philosopher 94.502728
Businessperson 95.159497
Farmer 95.341098
Publisher 95.347826
Politician 95.615544
Police officer 95.829095
Entrepreneur 96.634967
Architect 96.703996
Athlete 96.728335
Explorer 97.033159
Judge 97.115385
Minister 97.129187
Inventor 97.274979
Religious figure 97.439057
Surgeon 98.248906
Military personnel 98.301783
Merchant 98.452611
Rower 98.454604
Engineer 98.819519
Jurist 98.853099
Pastor 99.010717
Rabbi 99.207048
Name: Gender, dtype: float64
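The same table can probably be produced without a custom function: the mean of a boolean "is male" mask, grouped by occupation, is the male share. The sketch below uses made-up rows. One difference to keep in mind: a missing `Gender` would count as "not male" here, whereas `value_counts(normalize=True)` ignores missing values. In the notebook the frame was already restricted to Male and Female, so the two should agree.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Occupation": ["Artist", "Artist", "Artist", "Artist", "Rabbi", "Rabbi"],
    "Gender": ["Male", "Female", "Male", "Male", "Male", "Male"],
})

# True where the person is male.
is_male = toy["Gender"].eq("Male")

# Group the mask by occupation; its mean is the male fraction.
male_pct = (is_male.groupby(toy["Occupation"]).mean() * 100).sort_values()
print(male_pct)
```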
gender_occup_df = gender_occup.to_frame()
gender_occup_df.rename(columns={"Gender": "Percentage Male"}, inplace=True)
df.value_counts("Gender")
Gender
Male 762780
Female 75634
Name: count, dtype: int64
0.8656544743501265 / 762780 * 1000
0.0011348678181784086
0.132965263400046 / 75634 * 1000 / (0.8656544743501265 / 762780 * 1000)
1.5490871388122094
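The two expressions above divide each gender's share among suicides by that gender's total count and then compare the results. A small sketch that names the pieces of that arithmetic; the function `per_capita_ratio` is a hypothetical helper, and the numbers are the ones used in the expressions above.

```python
def per_capita_ratio(share_a, total_a, share_b, total_b):
    """Ratio of group A's per-person rate to group B's.

    share_* is each group's share among the cases (here: suicides),
    total_* is each group's size in the whole dataset.
    """
    return (share_a / total_a) / (share_b / total_b)


# Female share and count first, then male share and count.
ratio = per_capita_ratio(0.132965263400046, 75634, 0.8656544743501265, 762780)
print(round(ratio, 3))
```

The factor of 1000 from the original expressions cancels out in the ratio, so it is omitted here. The result should agree with the value of about 1.55 shown above.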
df_suicide.Gender.value_counts(normalize=True)
Gender
Male 0.865654
Female 0.132965
Transgender Female 0.000690
Eunuch; Male 0.000230
Transgender Male 0.000230
Intersex 0.000230
Name: proportion, dtype: float64
Yes, nice, isn't it?
t1 = "i am from Armenia"
# t2 = "I am Armenian"
options = ["rmenia", "armenian"]
"Armenia" in t1
# contained = []
# for i in options:
#     contained.append(i in t1)
contained = [i.lower() in t1.lower() for i in options]
print(any(contained))
True
def is_armenian(text):
    keywords = ["armenian", "armenia"]
    return any([k in text.lower() for k in keywords])
df["Armenian"] = df["Short description"].apply(is_armenian)
df
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[125], line 1
----> 1 df["Armenian"] = df["Short description"].apply(is_armenian)
      2 df

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\series.py:4935, in Series.apply(self, func, convert_dtype, args, by_row, **kwargs)
   4928 return SeriesApply(
   4929     self,
   4930     func,
   4931     convert_dtype=convert_dtype,
   4932     by_row=by_row,
   4933     args=args,
   4934     kwargs=kwargs,
-> 4935 ).apply()

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\apply.py:1422, in SeriesApply.apply(self)
   1419     return self.apply_compat()
   1421 # self.func is Callable
-> 1422 return self.apply_standard()

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\apply.py:1502, in SeriesApply.apply_standard(self)
   1501 action = "ignore" if isinstance(obj.dtype, CategoricalDtype) else None
-> 1502 mapped = obj._map_values(
   1503     mapper=curried, na_action=action, convert=self.convert_dtype
   1504 )

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\base.py:925, in IndexOpsMixin._map_values(self, mapper, na_action, convert)
    922 if isinstance(arr, ExtensionArray):
    923     return arr.map(mapper, na_action=na_action)
--> 925 return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\algorithms.py:1743, in map_array(arr, mapper, na_action, convert)
   1741 values = arr.astype(object, copy=False)
   1742 if na_action is None:
-> 1743     return lib.map_infer(values, mapper, convert=convert)

File pandas/_libs/lib.pyx:2999, in pandas._libs.lib.map_infer()

Cell In[124], line 3, in is_armenian(text)
      1 def is_armenian(text):
      2     keywords = ["armenian", "armenia"]
----> 3     return any([k in text.lower() for k in keywords])

Cell In[124], line 3, in <listcomp>(.0)
      1 def is_armenian(text):
      2     keywords = ["armenian", "armenia"]
----> 3     return any([k in text.lower() for k in keywords])

AttributeError: 'float' object has no attribute 'lower'
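The error comes from missing descriptions: pandas stores them as `NaN`, which is a float and has no `.lower()`. Besides filling the gaps before calling `apply`, one possible alternative is the vectorised `Series.str.contains`, which can treat missing values as `False` directly. This is a sketch on a made-up Series; since "armenia" is a substring of "armenian", a single pattern is assumed to be enough.

```python
import numpy as np
import pandas as pd

# Made-up descriptions, including one missing value.
desc = pd.Series(
    ["Armenian chess player", np.nan, "French-Armenian playwright", "German painter"],
    name="Short description",
)

# case=False ignores letter case; na=False maps missing descriptions to False.
is_arm = desc.str.contains("armenia", case=False, na=False)
print(is_arm.tolist())
```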
"Short description"].isna()].fillna("") df[df[
| | Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death |
---|---|---|---|---|---|---|---|---|---|---|
46515 | Q287430 | Pietro Guido II Torelli | | Male | | Aristocrat | 1450 | 1494.0 | | 44.0 |
71941 | Q482302 | József Adamovich | | Male | | Religious figure | 1845 | 1887.0 | | 42.0 |
75497 | Q516682 | István Agh | | Male | | Religious figure | 1709 | 1786.0 | | 77.0 |
88055 | Q621272 | Dénes Alesius | | Male | | Religious figure | 1525 | 1577.0 | | 52.0 |
92789 | Q689315 | Mátyás Ambrózy | | Male | | Pastor | 1797 | 1869.0 | | 72.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1219020 | Q75881383 | Virginia Downing | | Female | | Artist | 1904 | 1996.0 | | 92.0 |
1219990 | Q76009843 | Edward Hunter Ludlow | | Male | | Physician | 1810 | 1884.0 | | 74.0 |
1222371 | Q76328370 | James Gordon Dennis | | Male | | Military personnel | 1921 | 1944.0 | | 23.0 |
1222650 | Q76375951 | John Calvin MacKay | | Male | | Religious figure | 1891 | 1986.0 | | 95.0 |
1222675 | Q76401454 | Joan Marsden | | Female | | Researcher | 1922 | 2001.0 | | 79.0 |
5612 rows × 10 columns
df["Armenian"] = df["Short description"].fillna("na").apply(is_armenian)
df[df.Armenian]
| | Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death | Armenian |
---|---|---|---|---|---|---|---|---|---|---|---|
180 | Q1785 | Charles Aznavour | Armenian-French singer and diplomat | Male | France; Armenia | Artist | 1924 | 2018.0 | NaN | 94.0 | True |
311 | Q4452 | Thomas of Metsoph | Armenian cleric and chronicler | Male | NaN | Researcher | 1378 | 1446.0 | NaN | 68.0 | True |
354 | Q4924 | Isabella I, Queen of Armenia | queen regnant of Cilician Armenia | Female | Armenian Kingdom of Cilicia | Politician | 1216 | 1252.0 | NaN | 36.0 | True |
3462 | Q51472 | Rouben Mamoulian | Armenian American film and theatre director | Male | United States of America; Russian Empire | Artist | 1897 | 1987.0 | NaN | 90.0 | True |
3807 | Q55394 | Henri Verneuil | French-Armenian playwright and filmmaker | Male | France | Artist | 1920 | 2002.0 | NaN | 82.0 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1158947 | Q58030786 | Marie Balian | Armenian ceramic artist | Female | Israel | Artist | 1925 | 2017.0 | NaN | 92.0 | True |
1161788 | Q59394760 | Robert Kamoyan | Armenian director, artist | Male | Armenia; Soviet Union | Artist | 1937 | 2014.0 | NaN | 77.0 | True |
1166304 | Q59657412 | Giuseppe Arachial | Armenian Catholic bishop of Angora | Male | Ottoman Empire | Religious figure | 1811 | 1876.0 | NaN | 65.0 | True |
1191627 | Q63226473 | Boris Meliksetyan | Armenian geologist | Male | Armenia; Soviet Union | Researcher | 1928 | 1992.0 | NaN | 64.0 | True |
1198505 | Q64734343 | Pierre Tilkian | Armenian Catholic bishop | Male | NaN | Religious figure | 1809 | 1885.0 | NaN | 76.0 | True |
538 rows × 11 columns
df[df["Country"] == "Armenia"]
| | Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death | Armenian |
---|---|---|---|---|---|---|---|---|---|---|---|
43970 | Q266968 | Gurgen Margaryan | Armenian soldier | Male | Armenia | Military personnel | 1978 | 2004.0 | homicide | 26.0 | True |
45653 | Q278864 | Andranik Ozanian | Armenian politician and military personnel (18... | Male | Armenia | Politician | 1865 | 1927.0 | NaN | 62.0 | True |
54084 | Q336104 | Jerry Tarkanian | American basketball coach | Male | Armenia | Athlete | 1930 | 2015.0 | NaN | 85.0 | False |
71000 | Q471374 | Karen Asrian | Armenian chess player | Male | Armenia | Athlete | 1980 | 2008.0 | natural causes | 28.0 | True |
79459 | Q544093 | Genrikh Kasparyan | Armenian chess player | Male | Armenia | Athlete | 1910 | 1995.0 | NaN | 85.0 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1003702 | Q24048886 | Robert Abajyan | Armenian military person, Hero of Artsakh | Male | Armenia | Military personnel | 1996 | 2016.0 | suicide | 20.0 | True |
1025037 | Q27349753 | Artur Sargsyan | Armenian sculptor | Male | Armenia | Artist | 1968 | 2017.0 | NaN | 49.0 | True |
1034887 | Q28114502 | Emma Khanzadyan | Armenian historian, archaeologist | Female | Armenia | Researcher | 1922 | 2007.0 | NaN | 85.0 | True |
1046490 | Q29033966 | Eduard Edigaryan | Armenian painter | Male | Armenia | Artist | 1943 | 2019.0 | NaN | 76.0 | True |
1084025 | Q47009214 | Pavel Chobanyan | Armenian orientalist | Male | Armenia | Researcher | 1948 | 2017.0 | NaN | 69.0 | True |
121 rows × 11 columns
arm = df[df["Country"].fillna("na").str.contains("Armenia")]
arm
| | Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death | Armenian |
---|---|---|---|---|---|---|---|---|---|---|---|
180 | Q1785 | Charles Aznavour | Armenian-French singer and diplomat | Male | France; Armenia | Artist | 1924 | 2018.0 | NaN | 94.0 | True |
354 | Q4924 | Isabella I, Queen of Armenia | queen regnant of Cilician Armenia | Female | Armenian Kingdom of Cilicia | Politician | 1216 | 1252.0 | NaN | 36.0 | True |
3201 | Q48112 | Ivan Bagramyan | Marshal of the Soviet Union (1897-1982) | Male | Soviet Union; Russian Empire; First Republic o... | Politician | 1897 | 1982.0 | NaN | 85.0 | False |
4983 | Q61130 | Luigi Colani | German industrial designer and design professor | Male | Germany; Armenia | Teacher | 1928 | 2019.0 | NaN | 91.0 | False |
5560 | Q62316 | Robert Sahakyants | animator | Male | Armenia; Soviet Union | Artist | 1950 | 2009.0 | NaN | 59.0 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1086289 | Q47457007 | Garnik Karapetyan | Armenian scientist and mathematician (1958–2018) | Male | Armenia; Soviet Union | Researcher | 1958 | 2018.0 | NaN | 60.0 | True |
1161788 | Q59394760 | Robert Kamoyan | Armenian director, artist | Male | Armenia; Soviet Union | Artist | 1937 | 2014.0 | NaN | 77.0 | True |
1182207 | Q62024298 | Diana Oucleba | Georgian poetess, artist | Female | Armenia; Soviet Union; Russian Empire | Artist | 1910 | 2001.0 | NaN | 91.0 | False |
1191627 | Q63226473 | Boris Meliksetyan | Armenian geologist | Male | Armenia; Soviet Union | Researcher | 1928 | 1992.0 | NaN | 64.0 | True |
1206411 | Q66132386 | Albert Ghazaryan | athlete, coach, referee | Male | Armenia; Soviet Union | Athlete | 1935 | 2020.0 | NaN | 85.0 | False |
301 rows × 11 columns
arm["num_countries"] = arm["Country"].str.split(";").apply(len)
arm
C:\Users\hayk_\AppData\Local\Temp\ipykernel_6640\2434009080.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
arm["num_countries"] = arm["Country"].str.split(";").apply(len)
Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death | Armenian | num_countries | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
180 | Q1785 | Charles Aznavour | Armenian-French singer and diplomat | Male | France; Armenia | Artist | 1924 | 2018.0 | NaN | 94.0 | True | 2 |
354 | Q4924 | Isabella I, Queen of Armenia | queen regnant of Cilician Armenia | Female | Armenian Kingdom of Cilicia | Politician | 1216 | 1252.0 | NaN | 36.0 | True | 1 |
3201 | Q48112 | Ivan Bagramyan | Marshal of the Soviet Union (1897-1982) | Male | Soviet Union; Russian Empire; First Republic o... | Politician | 1897 | 1982.0 | NaN | 85.0 | False | 3 |
4983 | Q61130 | Luigi Colani | German industrial designer and design professor | Male | Germany; Armenia | Teacher | 1928 | 2019.0 | NaN | 91.0 | False | 2 |
5560 | Q62316 | Robert Sahakyants | animator | Male | Armenia; Soviet Union | Artist | 1950 | 2009.0 | NaN | 59.0 | False | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1086289 | Q47457007 | Garnik Karapetyan | Armenian scientist and mathematician (1958–2018) | Male | Armenia; Soviet Union | Researcher | 1958 | 2018.0 | NaN | 60.0 | True | 2 |
1161788 | Q59394760 | Robert Kamoyan | Armenian director, artist | Male | Armenia; Soviet Union | Artist | 1937 | 2014.0 | NaN | 77.0 | True | 2 |
1182207 | Q62024298 | Diana Oucleba | Georgian poetess, artist | Female | Armenia; Soviet Union; Russian Empire | Artist | 1910 | 2001.0 | NaN | 91.0 | False | 3 |
1191627 | Q63226473 | Boris Meliksetyan | Armenian geologist | Male | Armenia; Soviet Union | Researcher | 1928 | 1992.0 | NaN | 64.0 | True | 2 |
1206411 | Q66132386 | Albert Ghazaryan | athlete, coach, referee | Male | Armenia; Soviet Union | Athlete | 1935 | 2020.0 | NaN | 85.0 | False | 2 |
301 rows × 12 columns
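The SettingWithCopyWarning above appears because `arm` was produced by filtering `df`, so pandas cannot tell whether the new column is meant for a copy or for the original frame. Below is a minimal sketch of one common way to avoid it, assuming an explicit `.copy()` is acceptable here. The tiny `df` is a hypothetical stand-in for the real dataset, and `.str.len()` together with `na=False` are swapped in for `.apply(len)` and `.fillna("na")`; they are alternatives, not the notebook's own code.

```python
import pandas as pd

# Hypothetical miniature of the notebook's df; only "Country" matters here.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Country": [
        "France; Armenia",
        "Armenia",
        None,
        "Armenia; Soviet Union; Russian Empire",
    ],
})

# na=False treats missing countries as "no match", so fillna is not needed.
# .copy() makes arm an independent DataFrame, so adding a column is unambiguous.
arm = df[df["Country"].str.contains("Armenia", na=False)].copy()

# .str.len() on the split lists is the vectorised equivalent of .apply(len).
arm["num_countries"] = arm["Country"].str.split(";").str.len()
print(arm)
```

With the copy in place, the assignment touches only `arm`, and `df` keeps its original columns.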
arm.sort_values(by="num_countries", ascending=False)
Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death | Armenian | num_countries | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
231807 | Q2047004 | Suren Yeremyan | Armenian historian | Male | Armenia; Soviet Union; Russian Empire; Russian... | Researcher | 1908 | 1992.0 | NaN | 84.0 | True | 9 |
71047 | Q471740 | Armen Dzhigarkhanyan | Armenian, Soviet, Russian actor | Male | United States of America; Russia; Armenia; Sov... | Artist | 1935 | 2020.0 | NaN | 85.0 | True | 4 |
100991 | Q738092 | Pavel Lisitsian | Russian singer | Male | Russia; Armenia; Soviet Union; Russian Empire | Artist | 1911 | 2004.0 | NaN | 93.0 | False | 4 |
370403 | Q4071165 | Tinatin Asatiani | Georgian physicist | Female | Armenia; Soviet Union; Democratic Republic of ... | Researcher | 1918 | 2011.0 | NaN | 93.0 | False | 4 |
370366 | Q4070512 | Varazdat Harutyunyan | Armenian architect | Male | Armenia; Ottoman Empire; Soviet Union; Russian... | Researcher | 1909 | 2008.0 | NaN | 99.0 | True | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
932345 | Q20509556 | Maria Petrosyan | Armenian philosopher | Female | Armenia | Philosopher | 1911 | 1971.0 | NaN | 60.0 | True | 1 |
932348 | Q20509639 | Aida Boyajyan | Armenian artist | Female | Armenia | Artist | 1932 | 2019.0 | NaN | 87.0 | True | 1 |
932353 | Q20509808 | Henrik Sevan | Armenian children's writer, translator, poet | Male | Armenia | Artist | 1925 | 2008.0 | NaN | 83.0 | True | 1 |
43970 | Q266968 | Gurgen Margaryan | Armenian soldier | Male | Armenia | Military personnel | 1978 | 2004.0 | homicide | 26.0 | True | 1 |
354 | Q4924 | Isabella I, Queen of Armenia | queen regnant of Cilician Armenia | Female | Armenian Kingdom of Cilicia | Politician | 1216 | 1252.0 | NaN | 36.0 | True | 1 |
301 rows × 12 columns
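The last row above has the country "Armenian Kingdom of Cilicia", which is included only because `str.contains("Armenia")` is a substring test. If the goal were to keep just the rows where one of the semicolon-separated entries is exactly "Armenia", a sketch like the following could be used. The four-row `df` and the names `substring_mask` and `exact_mask` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample built from a few rows seen in the notebook output.
df = pd.DataFrame({
    "Name": [
        "Charles Aznavour",
        "Isabella I, Queen of Armenia",
        "Douglas Adams",
        "Thomas of Metsoph",
    ],
    "Country": [
        "France; Armenia",
        "Armenian Kingdom of Cilicia",
        "United Kingdom",
        None,
    ],
})

# Substring test: also matches "Armenian Kingdom of Cilicia".
substring_mask = df["Country"].str.contains("Armenia", na=False)

# Exact test: split on ";", strip whitespace, compare each entry as a whole.
countries = df["Country"].fillna("").str.split(";")
exact_mask = countries.apply(lambda items: "Armenia" in [c.strip() for c in items])

# Rows caught by the substring test but not by the exact one.
print(df[substring_mask & ~exact_mask])
```

Which of the two is right depends on the question: the substring version also keeps historical Armenian states, while the exact version keeps only the modern country name.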
text_based_filter = df["Short description"].fillna("na").apply(is_armenian)
country_based_filter = df["Country"].fillna("na").str.contains("Armenia")

df[(~country_based_filter) & (text_based_filter)]
Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death | Armenian | |
---|---|---|---|---|---|---|---|---|---|---|---|
311 | Q4452 | Thomas of Metsoph | Armenian cleric and chronicler | Male | NaN | Researcher | 1378 | 1446.0 | NaN | 68.0 | True |
3462 | Q51472 | Rouben Mamoulian | Armenian American film and theatre director | Male | United States of America; Russian Empire | Artist | 1897 | 1987.0 | NaN | 90.0 | True |
3807 | Q55394 | Henri Verneuil | French-Armenian playwright and filmmaker | Male | France | Artist | 1920 | 2002.0 | NaN | 82.0 | True |
28166 | Q115683 | Michael Arlen | Armenian writer | Male | NaN | Artist | 1895 | 1956.0 | natural causes | 61.0 | True |
32775 | Q139636 | Zaven Biberyan | Armenian writer | Male | NaN | Artist | 1921 | 1984.0 | NaN | 63.0 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1118595 | Q55627228 | Arthur Beylerian | Armenian historian | Male | NaN | Artist | 1925 | 2005.0 | NaN | 80.0 | True |
1154710 | Q56650119 | Gregory Casparian | Turkish-Armenia-born painter, photo-engraver a... | Male | Turkey | Artist | 1856 | 1942.0 | NaN | 86.0 | True |
1158947 | Q58030786 | Marie Balian | Armenian ceramic artist | Female | Israel | Artist | 1925 | 2017.0 | NaN | 92.0 | True |
1166304 | Q59657412 | Giuseppe Arachial | Armenian Catholic bishop of Angora | Male | Ottoman Empire | Religious figure | 1811 | 1876.0 | NaN | 65.0 | True |
1198505 | Q64734343 | Pierre Tilkian | Armenian Catholic bishop | Male | NaN | Religious figure | 1809 | 1885.0 | NaN | 76.0 | True |
326 rows × 11 columns
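The table above shows only one of the four possible combinations of the two filters: text says Armenian, country does not. To see all four counts at once, a cross-tabulation of the two boolean masks is one option. The sketch below is hedged: the six-row `df` is made up for illustration, and a plain substring test stands in for the notebook's `is_armenian` helper, whose exact logic is assumed rather than reproduced.

```python
import pandas as pd

# Hypothetical sample; the real notebook uses the full dataset.
df = pd.DataFrame({
    "Short description": [
        "Armenian writer",
        "French-Armenian playwright and filmmaker",
        "animator",
        "English writer and humorist",
        None,
        "Armenian chess player",
    ],
    "Country": [
        None,
        "France",
        "Armenia; Soviet Union",
        "United Kingdom",
        "Armenia",
        "Armenia",
    ],
})

# Stand-in for is_armenian: a simple substring test (an assumption).
text_based_filter = df["Short description"].str.contains("Armenian", na=False)
country_based_filter = df["Country"].str.contains("Armenia", na=False)

# Rows: text filter (False/True); columns: country filter (False/True).
overlap = pd.crosstab(
    text_based_filter,
    country_based_filter,
    rownames=["text"],
    colnames=["country"],
)
print(overlap)
```

The off-diagonal cells are the interesting ones: they count the people that one filter finds and the other misses.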
print(pd.pivot_table(arm, index="Occupation", columns="Gender", values="Age of death",
                     aggfunc=["mean", "count"]))
                         mean              count
Gender                 Female       Male  Female    Male
Occupation
Architect           82.500000  85.000000     2.0     4.0
Artist              77.969697  74.555556    33.0   108.0
Astronomer          88.000000        NaN     1.0     NaN
Athlete                   NaN  65.470588     NaN    17.0
Businessperson            NaN  88.000000     NaN     2.0
Engineer                  NaN  79.000000     NaN     3.0
Entrepreneur              NaN  80.000000     NaN     2.0
Inventor                  NaN  87.000000     NaN     2.0
Journalist                NaN  73.000000     NaN     4.0
Jurist                    NaN  65.500000     NaN     2.0
Lawyer                    NaN  88.000000     NaN     1.0
Military personnel        NaN  39.800000     NaN    10.0
Philosopher         60.000000        NaN     1.0     NaN
Physician           85.000000  86.000000     1.0     1.0
Politician          53.750000  60.444444     4.0    36.0
Religious figure          NaN  86.000000     NaN     1.0
Researcher          84.200000  71.716981     5.0    53.0
Surgeon                   NaN  82.000000     NaN     1.0
Teacher             76.000000  78.750000     1.0     4.0
Translator          75.000000  60.000000     1.0     1.0
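Many cells in this table rest on one or two people (for example, Astronomer and Surgeon each have a count of 1.0), so their means are hard to compare with groups like Artist. One possible follow-up is to compute the same statistics in long form with `groupby` and keep only the groups above a size threshold. This is a sketch on made-up rows: the sample `arm` frame and the `min_count` threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample with the same three columns the pivot table uses.
arm = pd.DataFrame({
    "Occupation": ["Artist", "Artist", "Artist", "Athlete", "Researcher", "Researcher"],
    "Gender": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "Age of death": [80.0, 70.0, 90.0, 65.0, 84.0, 72.0],
})

# Long-form equivalent of the pivot table: one row per (Occupation, Gender).
stats = (
    arm.groupby(["Occupation", "Gender"])["Age of death"]
    .agg(["mean", "count"])
    .reset_index()
)

# min_count is an arbitrary illustrative threshold, not a statistical rule.
min_count = 2
reliable = stats[stats["count"] >= min_count]
print(reliable)
```

The long form also avoids the NaN cells of the wide table, because combinations with no people simply do not appear as rows.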