05 Noble People Analysis

Ashtarak, photo link, Author: Anna Grigoryan

📌 Description

📚 Full material

We explore data on 1.2 million notable people and, along the way, get practice working with pandas.

  1. Average age of death by occupation
  2. Which countries have the most suicides per 1,000 people
  3. Gender distribution by occupation
  4. Analysis of notable Armenians
  5. A few other odds and ends

We recommend poking around the data on your own first and only then watching the video.

📺 Videos

  1. Practical - Analysis of notable people
  2. If you have not watched them yet, start with the theory lessons: NumPy, Pandas 1, Pandas 2.

🏡 Homework

Take any dataset and dig through it.

You can get data from Kaggle. Or, if you want Armenian data, from Armstat.

🛠️ Practical

!pip install uv
Requirement already satisfied: uv in c:\users\hayk_\.conda\envs\lectures\lib\site-packages (0.7.19)
!uv pip install kagglehub[pandas-datasets]
Using Python 3.10.18 environment at: C:\Users\hayk_\.conda\envs\lectures
Resolved 16 packages in 922ms
Prepared 1 package in 194ms
Installed 1 package in 40ms
 + kagglehub==0.3.12
import kagglehub

# Download latest version
path = kagglehub.dataset_download("imoore/age-dataset")

print("Path to dataset files:", path)
Path to dataset files: C:\Users\hayk_\.cache\kagglehub\datasets\imoore\age-dataset\versions\1
import os 

print(os.listdir(path))

path_csv = os.path.join(path, "AgeDataset-V1.csv") # Pathlib is better
['AgeDataset-V1.csv', 'assets']
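The comment in the cell above says that pathlib is the better choice. As a minimal sketch of what that could look like (the temporary folder and its tiny CSV are stand-ins invented for illustration, so the example runs without downloading anything):

```python
from pathlib import Path
import tempfile

# Stand-in for the directory returned by kagglehub.dataset_download(...);
# a temporary folder is used here so the sketch runs without a download.
dataset_dir = Path(tempfile.mkdtemp())
(dataset_dir / "AgeDataset-V1.csv").write_text("Id,Name\nQ23,George Washington\n")

# The / operator joins path components, replacing os.path.join.
path_csv = dataset_dir / "AgeDataset-V1.csv"

# iterdir() plays the role of os.listdir(), yielding Path objects.
print(sorted(p.name for p in dataset_dir.iterdir()))
print(path_csv.suffix, path_csv.exists())
```

With the real dataset, `Path(path) / "AgeDataset-V1.csv"` would take the place of `os.path.join(path, "AgeDataset-V1.csv")`, and `pd.read_csv` accepts the resulting `Path` directly.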
import pandas as pd

df = pd.read_csv(path_csv)
df.head()
Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death
0 Q23 George Washington 1st president of the United States (1732–1799) Male United States of America; Kingdom of Great Bri... Politician 1732 1799.0 natural causes 67.0
1 Q42 Douglas Adams English writer and humorist Male United Kingdom Artist 1952 2001.0 natural causes 49.0
2 Q91 Abraham Lincoln 16th president of the United States (1809-1865) Male United States of America Politician 1809 1865.0 homicide 56.0
3 Q254 Wolfgang Amadeus Mozart Austrian composer of the Classical period Male Archduchy of Austria; Archbishopric of Salzburg Artist 1756 1791.0 NaN 35.0
4 Q255 Ludwig van Beethoven German classical and romantic composer Male Holy Roman Empire; Austrian Empire Artist 1770 1827.0 NaN 57.0
from pathlib import Path

new_path = Path("assets/people.csv")
new_path.parent.mkdir(parents=True, exist_ok=True)  # create assets/ if it is missing

df.to_csv(new_path, index=False)  # requires the cell that defines df to have run

Basic EDA (Exploratory Data Analysis)

df
Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death
0 Q23 George Washington 1st president of the United States (1732–1799) Male United States of America; Kingdom of Great Bri... Politician 1732 1799.0 natural causes 67.0
1 Q42 Douglas Adams English writer and humorist Male United Kingdom Artist 1952 2001.0 natural causes 49.0
2 Q91 Abraham Lincoln 16th president of the United States (1809-1865) Male United States of America Politician 1809 1865.0 homicide 56.0
3 Q254 Wolfgang Amadeus Mozart Austrian composer of the Classical period Male Archduchy of Austria; Archbishopric of Salzburg Artist 1756 1791.0 NaN 35.0
4 Q255 Ludwig van Beethoven German classical and romantic composer Male Holy Roman Empire; Austrian Empire Artist 1770 1827.0 NaN 57.0
... ... ... ... ... ... ... ... ... ... ...
1223004 Q77247326 Marie-Fortunée Besson Frans model (1907-1996) NaN France Tailor; model 1907 1996.0 NaN 89.0
1223005 Q77249504 Ron Thorsen xugador de baloncestu canadianu (1948–2004) NaN Canada; United States of America Athlete 1948 2004.0 NaN 56.0
1223006 Q77249818 Diether Todenhagen German navy officer and world war II U-boat co... NaN Germany Military personnel 1920 1944.0 NaN 24.0
1223007 Q77253909 Reginald Oswald Pearson English artist, working in stained glass, prin... Male United Kingdom Artist 1887 1915.0 NaN 28.0
1223008 Q77254864 Horst Lerche German painter Male Germany Artist 1938 2017.0 NaN 79.0

1223009 rows × 10 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1223009 entries, 0 to 1223008
Data columns (total 10 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Id                 1223009 non-null  object 
 1   Name               1223009 non-null  object 
 2   Short description  1155109 non-null  object 
 3   Gender             1089363 non-null  object 
 4   Country            887500 non-null   object 
 5   Occupation         1016095 non-null  object 
 6   Birth year         1223009 non-null  int64  
 7   Death year         1223008 non-null  float64
 8   Manner of death    53603 non-null    object 
 9   Age of death       1223008 non-null  float64
dtypes: float64(2), int64(1), object(7)
memory usage: 93.3+ MB
df.isna().sum() / len(df) * 100
Id                    0.000000
Name                  0.000000
Short description     5.551881
Gender               10.927638
Country              27.433077
Occupation           16.918436
Birth year            0.000000
Death year            0.000082
Manner of death      95.617121
Age of death          0.000082
dtype: float64
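The percentage of missing values can also be written with `mean()`, since the mean of a boolean column is the fraction of `True` values. A small sketch on a toy frame (the values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values, not taken from the dataset).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Country": ["France", np.nan, np.nan, "Japan"],
    "Manner of death": [np.nan, np.nan, np.nan, "accident"],
})

# isna() yields booleans; their mean is the fraction of missing values.
missing_pct = toy.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

This gives the same numbers as `toy.isna().sum() / len(toy) * 100`, just with one step fewer.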
df.shape
(1223009, 10)
df.describe()
Birth year Death year Age of death
count 1.223009e+06 1.223008e+06 1.223008e+06
mean 1.844972e+03 1.914246e+03 6.927406e+01
std 1.479390e+02 1.516898e+02 1.662938e+01
min -2.700000e+03 -2.659000e+03 0.000000e+00
25% 1.828000e+03 1.895000e+03 6.000000e+01
50% 1.887000e+03 1.955000e+03 7.200000e+01
75% 1.918000e+03 1.994000e+03 8.100000e+01
max 2.016000e+03 2.021000e+03 1.690000e+02
df["Country"].nunique()
5961
df.value_counts("Country")
Country
United States of America                                                                          152761
Germany                                                                                            95081
France                                                                                             78666
United Kingdom; United Kingdom of Great Britain and Ireland                                        29684
Sweden                                                                                             26915
                                                                                                   ...  
ducat de Bremen; Duchy of Holstein                                                                     1
emirate of Córdoba; Umayyad Caliphate                                                                  1
Zhao                                                                                                   1
Zimbabwe; Rhodesia; Federation of Rhodesia and Nyasaland; Southern Rhodesia; Zimbabwe Rhodesia         1
Zimbabwe; Rhodesia; Zimbabwe Rhodesia                                                                  1
Name: count, Length: 5961, dtype: int64
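Many rows list several countries in one cell, separated by "; ", which is why there are 5961 distinct values. One possible way to count each country on its own, sketched on a toy Series (the rows below are illustrative, and this step is not part of the original notebook), is to split and explode:

```python
import pandas as pd

toy_country = pd.Series(
    ["France",
     "United Kingdom; United Kingdom of Great Britain and Ireland",
     "France; Armenia",
     None],
    name="Country",
)

# Split on the "; " separator, then explode so each country gets its own row.
# Missing values stay missing and are ignored by value_counts().
per_country = toy_country.str.split("; ").explode().value_counts()
print(per_country)
```

Note that a person with two countries is then counted once under each, so the counts no longer sum to the number of people.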
df.value_counts("Occupation")
Occupation
Artist                                   281512
Politician                               195390
Athlete                                  110943
Researcher                                90709
Military personnel                        52911
                                          ...  
Zoology                                       1
Zoology; marine biology; biologist            1
École polytechnique                           1
Academic; literary scholar                    1
Wholesale; land owner; philanthropist         1
Name: count, Length: 9313, dtype: int64
df.groupby("Occupation")["Age of death"].mean().sort_values(ascending=False)
Occupation
Farmer; lecturer                                            121.0
Deacon; preacher                                             99.0
Warrior; noble                                               99.0
Suffragette; philanthropist; social reformer; suffragist     99.0
Studienrat; lecturer                                         99.0
                                                            ...  
Basij                                                        13.0
Lehnsmann                                                    13.0
Servant of god                                               12.0
Pioneers-heroes                                              11.0
Miner; master builder                                        11.0
Name: Age of death, Length: 9313, dtype: float64

Age of death by occupation

df.groupby("Occupation")["Age of death"].describe()
count mean std min 25% 50% 75% max
Occupation
1859 1.0 47.000000 NaN 47.0 47.0 47.0 47.0 47.0
Abbess 36.0 60.694444 16.924740 24.0 49.0 63.0 73.0 90.0
Abbess; business executive 1.0 86.000000 NaN 86.0 86.0 86.0 86.0 86.0
Abbess; christians jehovah’s witnesses 1.0 81.000000 NaN 81.0 81.0 81.0 81.0 81.0
Abbé 6.0 69.666667 16.070677 41.0 66.0 72.5 79.0 87.0
... ... ... ... ... ... ... ... ...
Zoology 1.0 44.000000 NaN 44.0 44.0 44.0 44.0 44.0
Zoology; marine biology; biologist 1.0 73.000000 NaN 73.0 73.0 73.0 73.0 73.0
École polytechnique 1.0 72.000000 NaN 72.0 72.0 72.0 72.0 72.0
Župan 5.0 38.600000 27.061042 12.0 21.0 32.0 47.0 81.0
مجموعة الأنظمة منصة شليلة; serology; bacteriologist 1.0 40.000000 NaN 40.0 40.0 40.0 40.0 40.0

9313 rows × 8 columns

df["Occupation"].nunique()
9313
df["Occupation"].value_counts() > 1_000
Occupation
Artist                             True
Politician                         True
Athlete                            True
Researcher                         True
Military personnel                 True
                                  ...  
Director; scout leader            False
Salonnière; patron of the arts    False
Servant of god                    False
Cleric; coal miner                False
Goldsmith; metalsmith             False
Name: count, Length: 9313, dtype: bool
occup_counts = df["Occupation"].value_counts()
occup_counts[occup_counts > 1_000].index
Index(['Artist', 'Politician', 'Athlete', 'Researcher', 'Military personnel',
       'Religious figure', 'Businessperson', 'Architect', 'Journalist',
       'Teacher', 'Physician', 'Engineer', 'Judge', 'Lawyer', 'Jurist',
       'Aristocrat', 'Entrepreneur', 'Philosopher', 'Translator', 'Publisher',
       'Librarian', 'Author', 'Surgeon', 'Merchant', 'Novelist', 'Rower',
       'Astronomer', 'Pianist', 'Psychologist', 'Pastor', 'Minister', 'Farmer',
       'Inventor', 'Psychiatrist', 'Rabbi', 'Explorer', 'Fencer',
       'Police officer', 'Trade unionist'],
      dtype='object', name='Occupation')
occupations_more_than_1000 = occup_counts[occup_counts > 1_000].index
# df[df["Occupation"].isin(occupations_more_than_1000)]
df = df[df["Occupation"].isin(occupations_more_than_1000)]
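The same kind of filter can be expressed without building the list of frequent occupations first. The following is a sketch of an alternative, not the notebook's method, run on made-up toy rows with a hypothetical threshold `min_count`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Occupation": ["Artist", "Artist", "Artist", "Župan", "Politician", "Politician"],
    "Age of death": [70, 65, 80, 38, 71, 69],
})

min_count = 1  # hypothetical threshold; the notebook uses 1_000 on the full data

# transform("size") broadcasts each group's row count back onto its rows,
# so the comparison gives a row-level boolean mask.
group_size = toy.groupby("Occupation")["Occupation"].transform("size")
frequent = toy[group_size > min_count]
print(frequent["Occupation"].unique())
```

The `value_counts()` plus `isin()` version used in the notebook is arguably easier to read; the `transform` version mainly saves the intermediate index.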
age_by_occup = df.groupby("Occupation")["Age of death"].mean()
age_by_occup
Occupation
Architect             72.085306
Aristocrat            53.006540
Artist                69.725145
Astronomer            71.152301
Athlete               68.772460
Author                70.094754
Businessperson        74.153054
Engineer              72.156611
Entrepreneur          73.222146
Explorer              61.799302
Farmer                71.240991
Fencer                72.496454
Inventor              73.129545
Journalist            69.591239
Judge                 74.004850
Jurist                69.488372
Lawyer                71.231208
Librarian             73.437335
Merchant              68.316125
Military personnel    63.820056
Minister              69.176471
Novelist              71.452949
Pastor                68.870213
Philosopher           71.037957
Physician             70.683996
Pianist               71.754000
Police officer        64.403013
Politician            70.541558
Psychiatrist          73.231385
Psychologist          76.396378
Publisher             71.178990
Rabbi                 71.741322
Religious figure      69.801273
Researcher            73.131376
Rower                 71.019317
Surgeon               71.815642
Teacher               73.331995
Trade unionist        71.768421
Translator            72.046317
Name: Age of death, dtype: float64
age_by_occup.sort_values(ascending=True).plot()

df.columns
Index(['Id', 'Name', 'Short description', 'Gender', 'Country', 'Occupation',
       'Birth year', 'Death year', 'Manner of death', 'Age of death'],
      dtype='object')

Suicide

df["Gender"].value_counts(normalize=True) * 100
Gender
Male                                              90.966985
Female                                             9.019897
Transgender Female                                 0.005605
Transgender Male                                   0.002862
Eunuch; Male                                       0.001908
Female; Male                                       0.000716
Intersex                                           0.000596
Transgender Male; Female                           0.000358
Non-Binary                                         0.000239
Transgender Person; Intersex; Transgender Male     0.000119
Intersex; Male                                     0.000119
Transgender Female; Female                         0.000119
Transgender Female; Male                           0.000119
Intersex; Transgender Male                         0.000119
Transgender Male; Male                             0.000119
Female; Female                                     0.000119
Name: proportion, dtype: float64
df[df["Manner of death"] == "Suicide"].empty
True
df["Manner of death"].value_counts()
Manner of death
natural causes        29717
suicide                4647
accident               4217
homicide               3273
capital punishment     1813
                      ...  
rebellion                 1
Holocaust victim          1
unknown                   1
war; suicide              1
White Terror              1
Name: count, Length: 166, dtype: int64
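The comparison with "Suicide" came back empty because string equality is case-sensitive and the dataset stores the label in lower case. A sketch of a more forgiving match on a toy Series (the capitalised "Suicide" row is invented for illustration):

```python
import pandas as pd

manner = pd.Series(
    ["suicide", "natural causes", None, "war; suicide", "Suicide"],
    name="Manner of death",
)

# Exact comparison is case-sensitive, so "Suicide" matches only the last row.
exact = manner == "Suicide"

# case=False ignores capitalisation; na=False treats missing values as non-matches.
# Being a substring test, it also picks up combined labels such as "war; suicide".
loose = manner.str.contains("suicide", case=False, na=False)
print(exact.sum(), loose.sum())
```

Whether combined labels such as "war; suicide" should count is a modelling decision; the notebook's exact match on "suicide" leaves them out.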
df_suicide = df[df["Manner of death"] == "suicide"]
df_suicide
Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death
23 Q440 Salvador Allende 28th president of Chile (1908–1973) Male Chile Politician 1908 1973.0 suicide 65.0
131 Q1322 José Manuel Balmaceda Chilean politician and President (1840-1891) Male Chile Politician 1840 1891.0 suicide 51.0
189 Q2022 Cesare Pavese Italian poet, novelist, literary critic, and t... Male Italy; Kingdom of Italy Researcher 1908 1950.0 suicide 42.0
323 Q4616 Marilyn Monroe American actress, model, and singer (1926-1962) Female United States of America Artist 1926 1962.0 suicide 36.0
327 Q4673 Paul Otto German film actor and director Male Nazi Germany; Weimar Republic; German Empire Artist 1878 1943.0 suicide 65.0
... ... ... ... ... ... ... ... ... ... ...
1212054 Q70834687 Karl Neumann politician and director of the Deutsche Zeiche... NaN German Reich Politician 1900 1945.0 suicide 45.0
1213739 Q73375287 Peter Kuranda Austrian journalist Male Austria; Austria-Hungary Journalist 1896 1938.0 suicide 42.0
1214539 Q75135015 Michael Benveniste American pornographic film director Male United States of America Artist 1946 1982.0 suicide 36.0
1215398 Q75336010 George Dewey Sanford Jr. United States Marine Male United States of America Military personnel 1925 1994.0 suicide 69.0
1217823 Q75694915 Gotthard Zimmer fotograaf uit Oostenrijk-Hongarije (1847-1886) NaN Austria-Hungary Artist 1847 1886.0 suicide 39.0

4647 rows × 10 columns

suicide_counts_country = df_suicide["Country"].value_counts()
suicide_counts_country
Country
United States of America                           991
France                                             362
Germany                                            321
United Kingdom                                     152
Japan                                              141
                                                  ... 
Qing dynasty; Ming dynasty; Kingdom of Tungning      1
Spain; Peru                                          1
West Germany                                         1
Qing dynasty; China                                  1
United States of America; Russian Empire             1
Name: count, Length: 354, dtype: int64
country_counts = df["Country"].value_counts()
country_counts
Country
United States of America                                                                                                  135127
Germany                                                                                                                    78718
France                                                                                                                     65572
United Kingdom; United Kingdom of Great Britain and Ireland                                                                26642
Spain                                                                                                                      21930
                                                                                                                           ...  
Afghanistan; Austria-Hungary                                                                                                   1
Syria; Ottoman Empire; State of Damascus; Arab Kingdom of Syria; State of Syria; Syrian Republic; United Arab Republic         1
Republic of Florence; Grand Duchy of Tuscany                                                                                   1
Grand Duchy of Tuscany; Duchy of Lucca; Kingdom of Italy                                                                       1
Norway; Austria-Hungary; Union between Sweden and Norway                                                                       1
Name: count, Length: 5400, dtype: int64
suicide = pd.merge(suicide_counts_country, country_counts,
                   how="left",
                   on="Country", suffixes=("_suicide", "_overall"))
suicide
count_suicide count_overall
Country
United States of America 991 135127
France 362 65572
Germany 321 78718
United Kingdom 152 19127
Japan 141 13209
... ... ...
Qing dynasty; Ming dynasty; Kingdom of Tungning 1 1
Spain; Peru 1 23
West Germany 1 21
Qing dynasty; China 1 10
United States of America; Russian Empire 1 151

354 rows × 2 columns
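Because both count Series are indexed by country, the rate can also be computed without an explicit merge: arithmetic between two Series aligns on the index labels. A sketch with toy counts (illustrative numbers, not the notebook's):

```python
import pandas as pd

# Toy counts (illustrative, not the notebook's numbers).
suicide_counts = pd.Series({"Japan": 3, "Spain": 1}, name="count")
overall_counts = pd.Series({"Japan": 300, "Spain": 1000, "Chile": 50}, name="count")

# Arithmetic between Series aligns on the index labels; countries absent
# from suicide_counts come out as NaN and are dropped here.
per_1k = (suicide_counts / overall_counts * 1000).dropna()
print(per_1k.sort_values())
```

The merge keeps both count columns side by side, which is handy for later filtering on the total; the aligned division is shorter when only the rate is needed.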

suicide["suicide_over_total"] = suicide["count_suicide"] / suicide["count_overall"]
suicide
count_suicide count_overall suicide_over_total
Country
United States of America 991 135127 0.007334
France 362 65572 0.005521
Germany 321 78718 0.004078
United Kingdom 152 19127 0.007947
Japan 141 13209 0.010675
... ... ... ...
Qing dynasty; Ming dynasty; Kingdom of Tungning 1 1 1.000000
Spain; Peru 1 23 0.043478
West Germany 1 21 0.047619
Qing dynasty; China 1 10 0.100000
United States of America; Russian Empire 1 151 0.006623

354 rows × 3 columns

suicide["suicide_per_1k"] = suicide["suicide_over_total"] * 1000
suicide_sorted = suicide.sort_values(by="suicide_per_1k", ascending=True)
suicide_sorted
count_suicide count_overall suicide_over_total suicide_per_1k
Country
Spain 31 21930 0.001414 1.413589
Denmark 16 9187 0.001742 1.741591
Kingdom of England 7 3920 0.001786 1.785714
Grand Duchy of Finland 1 549 0.001821 1.821494
India; British Raj 5 2642 0.001893 1.892506
... ... ... ... ...
Northern Ireland; Ireland 1 1 1.000000 1000.000000
People's Republic of Bulgaria 1 1 1.000000 1000.000000
United States of America; French Third Republic; Second French Empire 1 1 1.000000 1000.000000
Japan; China 1 1 1.000000 1000.000000
Nazi Germany; Kingdom of Romania; West Germany 1 1 1.000000 1000.000000

354 rows × 4 columns

suicide_sorted.head(10)["suicide_per_1k"].plot(kind="bar")

suicide_sorted.tail(10)["suicide_per_1k"].plot(kind="bar")

suicide_sorted.tail(10)
count_suicide count_overall suicide_over_total suicide_per_1k
Country
North Korea; Soviet Union; Russian Empire 1 1 1.0 1000.0
Classical Athens; Ancient Carthage 1 1 1.0 1000.0
Qin 1 1 1.0 1000.0
Germany; Nazi Germany; Austria-Hungary; Czechoslovakia 1 1 1.0 1000.0
Ottoman Empire; Soviet Union; Russian Empire 1 1 1.0 1000.0
Northern Ireland; Ireland 1 1 1.0 1000.0
People's Republic of Bulgaria 1 1 1.0 1000.0
United States of America; French Third Republic; Second French Empire 1 1 1.0 1000.0
Japan; China 1 1 1.0 1000.0
Nazi Germany; Kingdom of Romania; West Germany 1 1 1.0 1000.0
suicide_sorted[suicide_sorted["count_overall"] > 5_000]["suicide_per_1k"].tail(10).plot(kind="bar")

df.isna().sum()
Id                        0
Name                      0
Short description      8421
Gender                87633
Country              182030
Occupation                0
Birth year                0
Death year                0
Manner of death      881641
Age of death              0
dtype: int64

Gender

df["Gender"].value_counts()
Gender
Male                                              762780
Female                                             75634
Transgender Female                                    47
Transgender Male                                      24
Eunuch; Male                                          16
Female; Male                                           6
Intersex                                               5
Transgender Male; Female                               3
Non-Binary                                             2
Transgender Person; Intersex; Transgender Male         1
Intersex; Male                                         1
Transgender Female; Female                             1
Transgender Female; Male                               1
Intersex; Transgender Male                             1
Transgender Male; Male                                 1
Female; Female                                         1
Name: count, dtype: int64
df.query("Gender == 'Non-Binary'") # df[df["Gender"] == "Non-Binary"]
Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death
39998 Q219634 Claude Cahun French artist (1894-1954) Non-Binary France Artist 1894 1954.0 NaN 60.0
754386 Q13562059 Maxine Feldman lesbian and non-binary musician Non-Binary United States of America Artist 1945 2007.0 NaN 62.0
df.columns
Index(['Id', 'Name', 'Short description', 'Gender', 'Country', 'Occupation',
       'Birth year', 'Death year', 'Manner of death', 'Age of death'],
      dtype='object')
df.groupby("Gender")["Birth year"].max().sort_values()
Gender
Eunuch; Male                                      1451
Intersex; Male                                    1763
Transgender Male; Male                            1869
Female; Female                                    1884
Transgender Person; Intersex; Transgender Male    1885
Intersex; Transgender Male                        1912
Transgender Male; Female                          1913
Intersex                                          1926
Non-Binary                                        1945
Transgender Female; Male                          1947
Female; Male                                      1949
Transgender Female; Female                        1949
Transgender Male                                  1986
Transgender Female                                1991
Male                                              2002
Female                                            2002
Name: Birth year, dtype: int64
df = df[df["Gender"].isin(["Male", "Female"])]
df["Occupation"].unique()
array(['Politician', 'Artist', 'Astronomer', 'Athlete', 'Researcher',
       'Military personnel', 'Philosopher', 'Businessperson', 'Explorer',
       'Architect', 'Teacher', 'Aristocrat', 'Entrepreneur', 'Journalist',
       'Engineer', 'Author', 'Religious figure', 'Judge', 'Librarian',
       'Translator', 'Physician', 'Inventor', 'Trade unionist',
       'Merchant', 'Publisher', 'Pastor', 'Fencer', 'Rabbi',
       'Psychologist', 'Lawyer', 'Rower', 'Jurist', 'Police officer',
       'Surgeon', 'Psychiatrist', 'Pianist', 'Farmer', 'Minister',
       'Novelist'], dtype=object)
df_research = df[df["Occupation"] == "Researcher"]
df_research.shape[0]
81735
len(df_research)
81735
df_research.value_counts("Gender").loc["Male"] / len(df_research)
np.float64(0.9204624701780143)
df_research["Gender"].value_counts(normalize=True).loc["Male"]
np.float64(0.9204624701780143)
def get_male_percentage(series):
    # Share of "Male" among the non-missing values, as a percentage
    return series.value_counts(normalize=True).loc["Male"] * 100


round(get_male_percentage(df_research["Gender"]), 2)
np.float64(92.05)
for m in df["Occupation"].unique():
    df_filter = df[df["Occupation"] == m]
    print(m, round(get_male_percentage(df_filter["Gender"]), 2))
Politician 95.62
Artist 82.18
Astronomer 91.73
Athlete 96.73
Researcher 92.05
Military personnel 98.3
Philosopher 94.5
Businessperson 95.16
Explorer 97.03
Architect 96.7
Teacher 86.33
Aristocrat 62.49
Entrepreneur 96.63
Journalist 88.01
Engineer 98.82
Author 87.42
Religious figure 97.44
Judge 97.12
Librarian 78.18
Translator 79.57
Physician 91.99
Inventor 97.27
Trade unionist 87.56
Merchant 98.45
Publisher 95.35
Pastor 99.01
Fencer 87.59
Rabbi 99.21
Psychologist 79.16
Lawyer 93.99
Rower 98.45
Jurist 98.85
Police officer 95.83
Surgeon 98.25
Psychiatrist 91.07
Pianist 65.9
Farmer 95.34
Minister 97.13
Novelist 59.51
gender_occup = df.groupby("Occupation")["Gender"].apply(get_male_percentage).sort_values()
gender_occup
Occupation
Novelist              59.512938
Aristocrat            62.485844
Pianist               65.903710
Librarian             78.177458
Psychologist          79.160187
Translator            79.566695
Artist                82.175696
Teacher               86.325616
Author                87.422553
Trade unionist        87.559809
Fencer                87.588652
Journalist            88.011717
Psychiatrist          91.073039
Astronomer            91.732566
Physician             91.991983
Researcher            92.046247
Lawyer                93.986948
Philosopher           94.502728
Businessperson        95.159497
Farmer                95.341098
Publisher             95.347826
Politician            95.615544
Police officer        95.829095
Entrepreneur          96.634967
Architect             96.703996
Athlete               96.728335
Explorer              97.033159
Judge                 97.115385
Minister              97.129187
Inventor              97.274979
Religious figure      97.439057
Surgeon               98.248906
Military personnel    98.301783
Merchant              98.452611
Rower                 98.454604
Engineer              98.819519
Jurist                98.853099
Pastor                99.010717
Rabbi                 99.207048
Name: Gender, dtype: float64
gender_occup_df = gender_occup.to_frame()
gender_occup_df.rename(columns={"Gender": "Percentage Male"}, inplace=True)
gender_occup_df.plot(kind="bar")
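An alternative to the custom function is a cross-tabulation, which returns the share of every gender per occupation in one call. A sketch on toy rows (made up for illustration; this is not the notebook's method):

```python
import pandas as pd

toy = pd.DataFrame({
    "Occupation": ["Artist", "Artist", "Artist", "Artist", "Rabbi", "Rabbi"],
    "Gender": ["Male", "Female", "Male", "Male", "Male", "Male"],
})

# normalize="index" makes every row (occupation) sum to 1.
shares = pd.crosstab(toy["Occupation"], toy["Gender"], normalize="index") * 100
print(shares.round(1))
```

One practical difference: `value_counts(normalize=True).loc["Male"]` raises a `KeyError` for a group that contains no "Male" entries, while the cross-tabulation simply shows 0 for that cell.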

df.value_counts("Gender")
Gender
Male      762780
Female     75634
Name: count, dtype: int64
df_suicide.Gender.value_counts(normalize=True)
Gender
Male                  0.865654
Female                0.132965
Transgender Female    0.000690
Eunuch; Male          0.000230
Transgender Male      0.000230
Intersex              0.000230
Name: proportion, dtype: float64
# Male share of all suicides divided by the number of men in df (scaled by 1000)
0.8656544743501265 / 762780 * 1000
0.0011348678181784086
# The same quantity for women, relative to men: about 1.55 times higher
0.132965263400046 / 75634 * 1000 / (0.8656544743501265 / 762780 * 1000)
1.5490871388122094
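The female-to-male comparison is easier to follow with named values. The sketch below reuses the shares and head counts printed in this section; the variable names are introduced here for illustration only. The total number of suicides cancels out in the ratio, which is why shares can stand in for raw counts.

```python
# Shares of all suicides by gender (from df_suicide.Gender.value_counts(normalize=True))
male_share = 0.8656544743501265
female_share = 0.132965263400046

# Head counts by gender (from df.value_counts("Gender"))
male_total = 762_780
female_total = 75_634

# (female suicides / women) / (male suicides / men); the common factor
# "total number of suicides" cancels, so shares can be used directly.
relative_rate = (female_share / female_total) / (male_share / male_total)
print(round(relative_rate, 2))  # 1.55
```

Keep in mind that `df_suicide` was built before the gender filter was applied, so this ratio is a rough comparison rather than a precise rate.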

Are you Armenian?

Yes,
nice, isn't it?

t1 = "i am from Armenia"
# t2 = "I am Armenian"

options = ["rmenia", "armenian"]
"Armenia" in t1

# contained = []
# for i in options:
#     contained.append(i in t1)
    
contained = [i.lower() in t1.lower() for i in options]
print(any(contained)) 
True
def is_armenian(text):
    keywords = ["armenian", "armenia"]
    return any([k in text.lower() for k in keywords])
df["Armenian"] = df["Short description"].apply(is_armenian)
df
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[125], line 1
----> 1 df["Armenian"] = df["Short description"].apply(is_armenian)
      2 df

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\series.py:4935, in Series.apply(self, func, convert_dtype, args, by_row, **kwargs)
   4800 def apply(
   4801     self,
   4802     func: AggFuncType,
   (...)
   4807     **kwargs,
   4808 ) -> DataFrame | Series:
   4809     """
   4810     Invoke function on values of Series.
   4811 
   (...)
   4926     dtype: float64
   4927     """
   4928     return SeriesApply(
   4929         self,
   4930         func,
   4931         convert_dtype=convert_dtype,
   4932         by_row=by_row,
   4933         args=args,
   4934         kwargs=kwargs,
-> 4935     ).apply()

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\apply.py:1422, in SeriesApply.apply(self)
   1419     return self.apply_compat()
   1421 # self.func is Callable
-> 1422 return self.apply_standard()

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\apply.py:1502, in SeriesApply.apply_standard(self)
   1496 # row-wise access
   1497 # apply doesn't have a `na_action` keyword and for backward compat reasons
   1498 # we need to give `na_action="ignore"` for categorical data.
   1499 # TODO: remove the `na_action="ignore"` when that default has been changed in
   1500 #  Categorical (GH51645).
   1501 action = "ignore" if isinstance(obj.dtype, CategoricalDtype) else None
-> 1502 mapped = obj._map_values(
   1503     mapper=curried, na_action=action, convert=self.convert_dtype
   1504 )
   1506 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1507     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1508     #  See also GH#25959 regarding EA support
   1509     return obj._constructor_expanddim(list(mapped), index=obj.index)

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\base.py:925, in IndexOpsMixin._map_values(self, mapper, na_action, convert)
    922 if isinstance(arr, ExtensionArray):
    923     return arr.map(mapper, na_action=na_action)
--> 925 return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\algorithms.py:1743, in map_array(arr, mapper, na_action, convert)
   1741 values = arr.astype(object, copy=False)
   1742 if na_action is None:
-> 1743     return lib.map_infer(values, mapper, convert=convert)
   1744 else:
   1745     return lib.map_infer_mask(
   1746         values, mapper, mask=isna(values).view(np.uint8), convert=convert
   1747     )

File pandas/_libs/lib.pyx:2999, in pandas._libs.lib.map_infer()

Cell In[124], line 3, in is_armenian(text)
      1 def is_armenian(text):
      2     keywords = ["armenian", "armenia"]
----> 3     return any([k in text.lower() for k in keywords])

Cell In[124], line 3, in <listcomp>(.0)
      1 def is_armenian(text):
      2     keywords = ["armenian", "armenia"]
----> 3     return any([k in text.lower() for k in keywords])

AttributeError: 'float' object has no attribute 'lower'
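
The traceback above ends in `is_armenian` because missing values in `Short description` are stored as `NaN`, which is a `float` and has no `.lower()` method. A minimal sketch of two ways around this follows. The toy Series `toy` is made up for illustration, and the guarded function is one possible rewrite, not the version used in the lesson. Since "armenia" is a substring of "armenian", a single keyword is enough for the vectorized variant.

```python
import numpy as np
import pandas as pd


def is_armenian(text):
    # Missing descriptions arrive as float NaN, so skip anything that is not a string
    if not isinstance(text, str):
        return False
    keywords = ["armenian", "armenia"]
    return any(k in text.lower() for k in keywords)


# Hypothetical toy data standing in for df["Short description"]
toy = pd.Series(["Armenian chess player", np.nan, "French singer"])

flags = toy.apply(is_armenian)
print(flags.tolist())  # → [True, False, False]

# Vectorized alternative: na=False treats missing values as "no match"
vectorized = toy.str.contains("armenia", case=False, na=False)
print(vectorized.tolist())  # → [True, False, False]
```

Filling the gaps first with `fillna`, as the following cells do, works as well. The guard simply keeps the function safe when it is called on raw data.
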
df[df["Short description"].isna()].fillna("")
Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death
46515 Q287430 Pietro Guido II Torelli Male Aristocrat 1450 1494.0 44.0
71941 Q482302 József Adamovich Male Religious figure 1845 1887.0 42.0
75497 Q516682 István Agh Male Religious figure 1709 1786.0 77.0
88055 Q621272 Dénes Alesius Male Religious figure 1525 1577.0 52.0
92789 Q689315 Mátyás Ambrózy Male Pastor 1797 1869.0 72.0
... ... ... ... ... ... ... ... ... ... ...
1219020 Q75881383 Virginia Downing Female Artist 1904 1996.0 92.0
1219990 Q76009843 Edward Hunter Ludlow Male Physician 1810 1884.0 74.0
1222371 Q76328370 James Gordon Dennis Male Military personnel 1921 1944.0 23.0
1222650 Q76375951 John Calvin MacKay Male Religious figure 1891 1986.0 95.0
1222675 Q76401454 Joan Marsden Female Researcher 1922 2001.0 79.0

5612 rows × 10 columns

df["Armenian"] = df["Short description"].fillna("na").apply(is_armenian)
df[df.Armenian]
Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death Armenian
180 Q1785 Charles Aznavour Armenian-French singer and diplomat Male France; Armenia Artist 1924 2018.0 NaN 94.0 True
311 Q4452 Thomas of Metsoph Armenian cleric and chronicler Male NaN Researcher 1378 1446.0 NaN 68.0 True
354 Q4924 Isabella I, Queen of Armenia queen regnant of Cilician Armenia Female Armenian Kingdom of Cilicia Politician 1216 1252.0 NaN 36.0 True
3462 Q51472 Rouben Mamoulian Armenian American film and theatre director Male United States of America; Russian Empire Artist 1897 1987.0 NaN 90.0 True
3807 Q55394 Henri Verneuil French-Armenian playwright and filmmaker Male France Artist 1920 2002.0 NaN 82.0 True
... ... ... ... ... ... ... ... ... ... ... ...
1158947 Q58030786 Marie Balian Armenian ceramic artist Female Israel Artist 1925 2017.0 NaN 92.0 True
1161788 Q59394760 Robert Kamoyan Armenian director, artist Male Armenia; Soviet Union Artist 1937 2014.0 NaN 77.0 True
1166304 Q59657412 Giuseppe Arachial Armenian Catholic bishop of Angora Male Ottoman Empire Religious figure 1811 1876.0 NaN 65.0 True
1191627 Q63226473 Boris Meliksetyan Armenian geologist Male Armenia; Soviet Union Researcher 1928 1992.0 NaN 64.0 True
1198505 Q64734343 Pierre Tilkian Armenian Catholic bishop Male NaN Religious figure 1809 1885.0 NaN 76.0 True

538 rows × 11 columns

df[df["Country"] == "Armenia"]
Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death Armenian
43970 Q266968 Gurgen Margaryan Armenian soldier Male Armenia Military personnel 1978 2004.0 homicide 26.0 True
45653 Q278864 Andranik Ozanian Armenian politician and military personnel (18... Male Armenia Politician 1865 1927.0 NaN 62.0 True
54084 Q336104 Jerry Tarkanian American basketball coach Male Armenia Athlete 1930 2015.0 NaN 85.0 False
71000 Q471374 Karen Asrian Armenian chess player Male Armenia Athlete 1980 2008.0 natural causes 28.0 True
79459 Q544093 Genrikh Kasparyan Armenian chess player Male Armenia Athlete 1910 1995.0 NaN 85.0 True
... ... ... ... ... ... ... ... ... ... ... ...
1003702 Q24048886 Robert Abajyan Armenian military person, Hero of Artsakh Male Armenia Military personnel 1996 2016.0 suicide 20.0 True
1025037 Q27349753 Artur Sargsyan Armenian sculptor Male Armenia Artist 1968 2017.0 NaN 49.0 True
1034887 Q28114502 Emma Khanzadyan Armenian historian, archaeologist Female Armenia Researcher 1922 2007.0 NaN 85.0 True
1046490 Q29033966 Eduard Edigaryan Armenian painter Male Armenia Artist 1943 2019.0 NaN 76.0 True
1084025 Q47009214 Pavel Chobanyan Armenian orientalist Male Armenia Researcher 1948 2017.0 NaN 69.0 True

121 rows × 11 columns

arm = df[df["Country"].fillna("na").str.contains("Armenia")]
arm
Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death Armenian
180 Q1785 Charles Aznavour Armenian-French singer and diplomat Male France; Armenia Artist 1924 2018.0 NaN 94.0 True
354 Q4924 Isabella I, Queen of Armenia queen regnant of Cilician Armenia Female Armenian Kingdom of Cilicia Politician 1216 1252.0 NaN 36.0 True
3201 Q48112 Ivan Bagramyan Marshal of the Soviet Union (1897-1982) Male Soviet Union; Russian Empire; First Republic o... Politician 1897 1982.0 NaN 85.0 False
4983 Q61130 Luigi Colani German industrial designer and design professor Male Germany; Armenia Teacher 1928 2019.0 NaN 91.0 False
5560 Q62316 Robert Sahakyants animator Male Armenia; Soviet Union Artist 1950 2009.0 NaN 59.0 False
... ... ... ... ... ... ... ... ... ... ... ...
1086289 Q47457007 Garnik Karapetyan Armenian scientist and mathematician (1958–2018) Male Armenia; Soviet Union Researcher 1958 2018.0 NaN 60.0 True
1161788 Q59394760 Robert Kamoyan Armenian director, artist Male Armenia; Soviet Union Artist 1937 2014.0 NaN 77.0 True
1182207 Q62024298 Diana Oucleba Georgian poetess, artist Female Armenia; Soviet Union; Russian Empire Artist 1910 2001.0 NaN 91.0 False
1191627 Q63226473 Boris Meliksetyan Armenian geologist Male Armenia; Soviet Union Researcher 1928 1992.0 NaN 64.0 True
1206411 Q66132386 Albert Ghazaryan athlete, coach, referee Male Armenia; Soviet Union Athlete 1935 2020.0 NaN 85.0 False

301 rows × 11 columns
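
The exact comparison `df["Country"] == "Armenia"` returns 121 rows, while the substring search returns 301, because `Country` can hold several values separated by `"; "`. The substring search also matches longer names such as "Armenian Kingdom of Cilicia". The sketch below contrasts the two approaches with a third, stricter one on made-up data. The frame `toy` and the list-membership check are illustrative assumptions, not part of the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the shapes seen in the Country column
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Country": ["Armenia", "France; Armenia", np.nan, "Armenian Kingdom of Cilicia"],
})

# 1) Exact match: only rows whose whole value is "Armenia"
exact = toy["Country"] == "Armenia"
print(exact.tolist())     # → [True, False, False, False]

# 2) Substring match: na=False replaces the fillna("na") step
contains = toy["Country"].str.contains("Armenia", na=False)
print(contains.tolist())  # → [True, True, False, True]

# 3) Split on "; " and test list membership: multi-country rows match,
#    but "Armenian Kingdom of Cilicia" does not
listed = (
    toy["Country"]
    .fillna("")
    .str.split("; ")
    .apply(lambda parts: "Armenia" in parts)
)
print(listed.tolist())    # → [True, True, False, False]
```

Which variant is right depends on the question: the substring search is the most inclusive, the split-based check the most literal.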

arm["num_countries"] = arm["Country"].str.split(";").apply(len)
arm
C:\Users\hayk_\AppData\Local\Temp\ipykernel_6640\2434009080.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  arm["num_countries"] = arm["Country"].str.split(";").apply(len)
Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death Armenian num_countries
180 Q1785 Charles Aznavour Armenian-French singer and diplomat Male France; Armenia Artist 1924 2018.0 NaN 94.0 True 2
354 Q4924 Isabella I, Queen of Armenia queen regnant of Cilician Armenia Female Armenian Kingdom of Cilicia Politician 1216 1252.0 NaN 36.0 True 1
3201 Q48112 Ivan Bagramyan Marshal of the Soviet Union (1897-1982) Male Soviet Union; Russian Empire; First Republic o... Politician 1897 1982.0 NaN 85.0 False 3
4983 Q61130 Luigi Colani German industrial designer and design professor Male Germany; Armenia Teacher 1928 2019.0 NaN 91.0 False 2
5560 Q62316 Robert Sahakyants animator Male Armenia; Soviet Union Artist 1950 2009.0 NaN 59.0 False 2
... ... ... ... ... ... ... ... ... ... ... ... ...
1086289 Q47457007 Garnik Karapetyan Armenian scientist and mathematician (1958–2018) Male Armenia; Soviet Union Researcher 1958 2018.0 NaN 60.0 True 2
1161788 Q59394760 Robert Kamoyan Armenian director, artist Male Armenia; Soviet Union Artist 1937 2014.0 NaN 77.0 True 2
1182207 Q62024298 Diana Oucleba Georgian poetess, artist Female Armenia; Soviet Union; Russian Empire Artist 1910 2001.0 NaN 91.0 False 3
1191627 Q63226473 Boris Meliksetyan Armenian geologist Male Armenia; Soviet Union Researcher 1928 1992.0 NaN 64.0 True 2
1206411 Q66132386 Albert Ghazaryan athlete, coach, referee Male Armenia; Soviet Union Athlete 1935 2020.0 NaN 85.0 False 2

301 rows × 12 columns
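
The `SettingWithCopyWarning` above appears because `arm` was created by filtering `df`, so pandas cannot tell whether the new column is meant for `arm` alone or for `df` as well. A common way to make the intent explicit is to take a `.copy()` when creating the subset. The sketch below shows this on made-up data (the names `df_toy` and `arm_toy` are hypothetical). It also uses `.str.len()` as an alternative to `.apply(len)`.

```python
import pandas as pd

# Hypothetical stand-in for the full DataFrame
df_toy = pd.DataFrame({"Country": ["France; Armenia", "Armenia", "Germany"]})

# .copy() makes the subset an independent DataFrame, so adding a column is unambiguous
arm_toy = df_toy[df_toy["Country"].str.contains("Armenia")].copy()

# .str.len() on a column of lists returns the length of each list
arm_toy["num_countries"] = arm_toy["Country"].str.split(";").str.len()

print(arm_toy["num_countries"].tolist())  # → [2, 1]
print(list(df_toy.columns))               # → ['Country']
```

The original frame is untouched, and the assignment no longer triggers the warning.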

arm.sort_values(by="num_countries", ascending=False)
Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death Armenian num_countries
231807 Q2047004 Suren Yeremyan Armenian historian Male Armenia; Soviet Union; Russian Empire; Russian... Researcher 1908 1992.0 NaN 84.0 True 9
71047 Q471740 Armen Dzhigarkhanyan Armenian, Soviet, Russian actor Male United States of America; Russia; Armenia; Sov... Artist 1935 2020.0 NaN 85.0 True 4
100991 Q738092 Pavel Lisitsian Russian singer Male Russia; Armenia; Soviet Union; Russian Empire Artist 1911 2004.0 NaN 93.0 False 4
370403 Q4071165 Tinatin Asatiani Georgian physicist Female Armenia; Soviet Union; Democratic Republic of ... Researcher 1918 2011.0 NaN 93.0 False 4
370366 Q4070512 Varazdat Harutyunyan Armenian architect Male Armenia; Ottoman Empire; Soviet Union; Russian... Researcher 1909 2008.0 NaN 99.0 True 4
... ... ... ... ... ... ... ... ... ... ... ... ...
932345 Q20509556 Maria Petrosyan Armenian philosopher Female Armenia Philosopher 1911 1971.0 NaN 60.0 True 1
932348 Q20509639 Aida Boyajyan Armenian artist Female Armenia Artist 1932 2019.0 NaN 87.0 True 1
932353 Q20509808 Henrik Sevan Armenian children's writer, translator, poet Male Armenia Artist 1925 2008.0 NaN 83.0 True 1
43970 Q266968 Gurgen Margaryan Armenian soldier Male Armenia Military personnel 1978 2004.0 homicide 26.0 True 1
354 Q4924 Isabella I, Queen of Armenia queen regnant of Cilician Armenia Female Armenian Kingdom of Cilicia Politician 1216 1252.0 NaN 36.0 True 1

301 rows × 12 columns

arm["Age of death"].plot(kind="hist", bins=30, edgecolor="black")

arm["Occupation"].value_counts().head(10).plot(kind="bar")

text_based_filter = df["Short description"].fillna("na").apply(is_armenian)
country_based_filter = df["Country"].fillna("na").str.contains("Armenia")

df[(~country_based_filter) & (text_based_filter)]
Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death Armenian
311 Q4452 Thomas of Metsoph Armenian cleric and chronicler Male NaN Researcher 1378 1446.0 NaN 68.0 True
3462 Q51472 Rouben Mamoulian Armenian American film and theatre director Male United States of America; Russian Empire Artist 1897 1987.0 NaN 90.0 True
3807 Q55394 Henri Verneuil French-Armenian playwright and filmmaker Male France Artist 1920 2002.0 NaN 82.0 True
28166 Q115683 Michael Arlen Armenian writer Male NaN Artist 1895 1956.0 natural causes 61.0 True
32775 Q139636 Zaven Biberyan Armenian writer Male NaN Artist 1921 1984.0 NaN 63.0 True
... ... ... ... ... ... ... ... ... ... ... ...
1118595 Q55627228 Arthur Beylerian Armenian historian Male NaN Artist 1925 2005.0 NaN 80.0 True
1154710 Q56650119 Gregory Casparian Turkish-Armenia-born painter, photo-engraver a... Male Turkey Artist 1856 1942.0 NaN 86.0 True
1158947 Q58030786 Marie Balian Armenian ceramic artist Female Israel Artist 1925 2017.0 NaN 92.0 True
1166304 Q59657412 Giuseppe Arachial Armenian Catholic bishop of Angora Male Ottoman Empire Religious figure 1811 1876.0 NaN 65.0 True
1198505 Q64734343 Pierre Tilkian Armenian Catholic bishop Male NaN Religious figure 1809 1885.0 NaN 76.0 True

326 rows × 11 columns
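
The counts above already say how the two filters overlap. The text filter matches 538 rows, and 326 of those are not caught by the country filter, so 538 − 326 = 212 rows satisfy both. The country filter matches 301 rows, which leaves 301 − 212 = 89 rows found by country only. A cross-tabulation gives all four cells at once. Here is a minimal sketch on made-up boolean Series; the names `text_flag` and `country_flag` are hypothetical stand-ins for `text_based_filter` and `country_based_filter`.

```python
import pandas as pd

# Hypothetical flags for five people
text_flag = pd.Series([True, True, False, False, True], name="text_match")
country_flag = pd.Series([True, False, True, False, False], name="country_match")

# Rows: text filter (False, True); columns: country filter (False, True)
overlap = pd.crosstab(text_flag, country_flag)
print(overlap)

# Same numbers with plain boolean arithmetic
both = int((text_flag & country_flag).sum())
text_only = int((text_flag & ~country_flag).sum())
country_only = int((~text_flag & country_flag).sum())
print(both, text_only, country_only)  # → 1 2 1
```

On the real data the same call would be `pd.crosstab(text_based_filter, country_based_filter)`, assuming both Series share the index of `df`.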

Pivot table

print(pd.pivot_table(arm, index="Occupation", columns="Gender",
                     values="Age of death", aggfunc=["mean", "count"]))
                         mean             count       
Gender                 Female       Male Female   Male
Occupation                                            
Architect           82.500000  85.000000    2.0    4.0
Artist              77.969697  74.555556   33.0  108.0
Astronomer          88.000000        NaN    1.0    NaN
Athlete                   NaN  65.470588    NaN   17.0
Businessperson            NaN  88.000000    NaN    2.0
Engineer                  NaN  79.000000    NaN    3.0
Entrepreneur              NaN  80.000000    NaN    2.0
Inventor                  NaN  87.000000    NaN    2.0
Journalist                NaN  73.000000    NaN    4.0
Jurist                    NaN  65.500000    NaN    2.0
Lawyer                    NaN  88.000000    NaN    1.0
Military personnel        NaN  39.800000    NaN   10.0
Philosopher         60.000000        NaN    1.0    NaN
Physician           85.000000  86.000000    1.0    1.0
Politician          53.750000  60.444444    4.0   36.0
Religious figure          NaN  86.000000    NaN    1.0
Researcher          84.200000  71.716981    5.0   53.0
Surgeon                   NaN  82.000000    NaN    1.0
Teacher             76.000000  78.750000    1.0    4.0
Translator          75.000000  60.000000    1.0    1.0
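
Many cells in the table above rest on one or two people (for example, Astronomer and Philosopher each have a single entry), so their means say little. The sketch below, on made-up data, shows that the same table can be built with `groupby` plus `unstack`, and how the `count` block can be used to hide means that rest on too few rows. The frame `toy` and the threshold `min_count = 2` are arbitrary assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of the `arm` DataFrame
toy = pd.DataFrame({
    "Occupation": ["Artist", "Artist", "Artist", "Athlete", "Athlete"],
    "Gender": ["Female", "Male", "Male", "Male", "Male"],
    "Age of death": [80.0, 70.0, 74.0, 60.0, 66.0],
})

# Same call shape as in the lesson
pivot = pd.pivot_table(toy, index="Occupation", columns="Gender",
                       values="Age of death", aggfunc=["mean", "count"])

# Equivalent construction: aggregate per group, then move Gender into the columns
grouped = (
    toy.groupby(["Occupation", "Gender"])["Age of death"]
    .agg(["mean", "count"])
    .unstack("Gender")
)

# Keep a mean only where it is based on at least `min_count` people
min_count = 2  # arbitrary threshold, chosen for illustration
reliable = pivot["mean"].where(pivot["count"] >= min_count)
print(reliable)
```

Combinations that never occur (here, female athletes) stay `NaN` in both constructions, just as in the output above.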

🎲 23 (05)
