05 Notable People Analysis

Ashtarak, photo link, author: Anna Grigoryan

📌 Description

📚 Full material

We explore data on 1.2 million notable people and, along the way, get comfortable working with pandas.

Average age at death by occupation
Which countries have the most suicides per 1,000 people
Gender distribution by occupation
Analysis of notable Armenians
A few other odds and ends

We recommend poking around the data on your own first, and only then watching the video.

📺 Videos

Practical - Analysis of notable people
If you have not watched them yet, start with the theory lessons: NumPy, Pandas 1, Pandas 2.

🏡 Homework

Pick any dataset and dig into it.

You can get data from Kaggle or, if you want Armenian data, from Armstat.

🛠️ Practical

!pip install uv

Requirement already satisfied: uv in c:\users\hayk_\.conda\envs\lectures\lib\site-packages (0.7.19)

!uv pip install kagglehub[pandas-datasets]

Using Python 3.10.18 environment at: C:\Users\hayk_\.conda\envs\lectures
Resolved 16 packages in 922ms
Prepared 1 package in 194ms
Installed 1 package in 40ms
 + kagglehub==0.3.12

import kagglehub

# Download latest version
path = kagglehub.dataset_download("imoore/age-dataset")

print("Path to dataset files:", path)

c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Path to dataset files: C:\Users\hayk_\.cache\kagglehub\datasets\imoore\age-dataset\versions\1

import os 

print(os.listdir(path))

path_csv = os.path.join(path, "AgeDataset-V1.csv") # Pathlib is better

['AgeDataset-V1.csv', 'assets']
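The comment in the cell above notes that pathlib is the nicer option. Below is a minimal sketch of the same two steps (listing the folder, building the CSV path) with `pathlib.Path`. The temporary directory and the empty file are stand-ins for the real kagglehub download folder, so the snippet runs without downloading anything.

```python
from pathlib import Path
import tempfile

# Assumption: a temporary folder with an empty CSV replaces the real download folder.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp)
    (dataset_dir / "AgeDataset-V1.csv").touch()

    # iterdir() plays the role of os.listdir(), "/" the role of os.path.join()
    file_names = sorted(p.name for p in dataset_dir.iterdir())
    path_csv = dataset_dir / "AgeDataset-V1.csv"
    csv_exists = path_csv.exists()

print(file_names)
print(path_csv.name, path_csv.suffix, csv_exists)
```

`pd.read_csv` accepts a `Path` object directly, so `path_csv` can be passed to it as is.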

import pandas as pd

df = pd.read_csv(path_csv)
df.head()

	Id	Name	Short description	Gender	Country	Occupation	Birth year	Death year	Manner of death	Age of death
0	Q23	George Washington	1st president of the United States (1732–1799)	Male	United States of America; Kingdom of Great Bri...	Politician	1732	1799.0	natural causes	67.0
1	Q42	Douglas Adams	English writer and humorist	Male	United Kingdom	Artist	1952	2001.0	natural causes	49.0
2	Q91	Abraham Lincoln	16th president of the United States (1809-1865)	Male	United States of America	Politician	1809	1865.0	homicide	56.0
3	Q254	Wolfgang Amadeus Mozart	Austrian composer of the Classical period	Male	Archduchy of Austria; Archbishopric of Salzburg	Artist	1756	1791.0	NaN	35.0
4	Q255	Ludwig van Beethoven	German classical and romantic composer	Male	Holy Roman Empire; Austrian Empire	Artist	1770	1827.0	NaN	57.0

from pathlib import Path

new_path = Path("assets/people.csv")
new_path.parent.mkdir(parents=True, exist_ok=True)  # make sure the assets/ folder exists

df.to_csv(new_path, index=False)  # run after the cell that defines df, otherwise this raises NameError

Basic EDA (Exploratory Data Analysis)

df

	Id	Name	Short description	Gender	Country	Occupation	Birth year	Death year	Manner of death	Age of death
0	Q23	George Washington	1st president of the United States (1732–1799)	Male	United States of America; Kingdom of Great Bri...	Politician	1732	1799.0	natural causes	67.0
1	Q42	Douglas Adams	English writer and humorist	Male	United Kingdom	Artist	1952	2001.0	natural causes	49.0
2	Q91	Abraham Lincoln	16th president of the United States (1809-1865)	Male	United States of America	Politician	1809	1865.0	homicide	56.0
3	Q254	Wolfgang Amadeus Mozart	Austrian composer of the Classical period	Male	Archduchy of Austria; Archbishopric of Salzburg	Artist	1756	1791.0	NaN	35.0
4	Q255	Ludwig van Beethoven	German classical and romantic composer	Male	Holy Roman Empire; Austrian Empire	Artist	1770	1827.0	NaN	57.0
...	...	...	...	...	...	...	...	...	...	...
1223004	Q77247326	Marie-Fortunée Besson	Frans model (1907-1996)	NaN	France	Tailor; model	1907	1996.0	NaN	89.0
1223005	Q77249504	Ron Thorsen	xugador de baloncestu canadianu (1948–2004)	NaN	Canada; United States of America	Athlete	1948	2004.0	NaN	56.0
1223006	Q77249818	Diether Todenhagen	German navy officer and world war II U-boat co...	NaN	Germany	Military personnel	1920	1944.0	NaN	24.0
1223007	Q77253909	Reginald Oswald Pearson	English artist, working in stained glass, prin...	Male	United Kingdom	Artist	1887	1915.0	NaN	28.0
1223008	Q77254864	Horst Lerche	German painter	Male	Germany	Artist	1938	2017.0	NaN	79.0

1223009 rows × 10 columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1223009 entries, 0 to 1223008
Data columns (total 10 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Id                 1223009 non-null  object 
 1   Name               1223009 non-null  object 
 2   Short description  1155109 non-null  object 
 3   Gender             1089363 non-null  object 
 4   Country            887500 non-null   object 
 5   Occupation         1016095 non-null  object 
 6   Birth year         1223009 non-null  int64  
 7   Death year         1223008 non-null  float64
 8   Manner of death    53603 non-null    object 
 9   Age of death       1223008 non-null  float64
dtypes: float64(2), int64(1), object(7)
memory usage: 93.3+ MB

df.isna().sum() / len(df) * 100

Id                    0.000000
Name                  0.000000
Short description     5.551881
Gender               10.927638
Country              27.433077
Occupation           16.918436
Birth year            0.000000
Death year            0.000082
Manner of death      95.617121
Age of death          0.000082
dtype: float64
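As a side note, the same percentages can be obtained with `isna().mean()`, because the mean of a boolean column is the share of `True` values. Here is a small sketch on a made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative data only: 4 rows with some missing values
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Country": ["France", np.nan, np.nan, "Japan"],
    "Manner of death": [np.nan, np.nan, np.nan, "suicide"],
})

pct_sum = toy.isna().sum() / len(toy) * 100   # the form used in the notebook
pct_mean = toy.isna().mean() * 100            # shorter equivalent

print(pct_mean)
```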

df.shape

(1223009, 10)

df.describe()

	Birth year	Death year	Age of death
count	1.223009e+06	1.223008e+06	1.223008e+06
mean	1.844972e+03	1.914246e+03	6.927406e+01
std	1.479390e+02	1.516898e+02	1.662938e+01
min	-2.700000e+03	-2.659000e+03	0.000000e+00
25%	1.828000e+03	1.895000e+03	6.000000e+01
50%	1.887000e+03	1.955000e+03	7.200000e+01
75%	1.918000e+03	1.994000e+03	8.100000e+01
max	2.016000e+03	2.021000e+03	1.690000e+02

df["Country"].nunique()

df.value_counts("Country")

Country
United States of America                                                                          152761
Germany                                                                                            95081
France                                                                                             78666
United Kingdom; United Kingdom of Great Britain and Ireland                                        29684
Sweden                                                                                             26915
                                                                                                   ...  
ducat de Bremen; Duchy of Holstein                                                                     1
emirate of Córdoba; Umayyad Caliphate                                                                  1
Zhao                                                                                                   1
Zimbabwe; Rhodesia; Federation of Rhodesia and Nyasaland; Southern Rhodesia; Zimbabwe Rhodesia         1
Zimbabwe; Rhodesia; Zimbabwe Rhodesia                                                                  1
Name: count, Length: 5961, dtype: int64
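Many `Country` values in the output above are several countries joined with "; ", which is why there are thousands of distinct values. One possible way to count individual countries, sketched on made-up rows, is to split on the separator and `explode` the resulting lists. Note that this counts a person once per listed country, which is an assumption about what "per country" should mean here.

```python
import pandas as pd

# Illustrative rows; the separator "; " matches the one visible in the dataset
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Country": [
        "United Kingdom; United Kingdom of Great Britain and Ireland",
        "United Kingdom",
        None,
    ],
})

# One row per (person, country) pair; missing countries stay missing
countries = toy["Country"].str.split("; ").explode()
per_country = countries.value_counts()  # missing values are dropped by default

print(per_country)
```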

df.value_counts("Occupation")

Occupation
Artist                                   281512
Politician                               195390
Athlete                                  110943
Researcher                                90709
Military personnel                        52911
                                          ...  
Zoology                                       1
Zoology; marine biology; biologist            1
École polytechnique                           1
Academic; literary scholar                    1
Wholesale; land owner; philanthropist         1
Name: count, Length: 9313, dtype: int64

df.groupby("Occupation")["Age of death"].mean().sort_values(ascending=False)

Occupation
Farmer; lecturer                                            121.0
Deacon; preacher                                             99.0
Warrior; noble                                               99.0
Suffragette; philanthropist; social reformer; suffragist     99.0
Studienrat; lecturer                                         99.0
                                                            ...  
Basij                                                        13.0
Lehnsmann                                                    13.0
Servant of god                                               12.0
Pioneers-heroes                                              11.0
Miner; master builder                                        11.0
Name: Age of death, Length: 9313, dtype: float64
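The extremes in this ranking are mostly occupations with very few people, so a single person can set the mean. A quick way to see that is to put the group size next to the mean. Here is a sketch with made-up numbers (`toy` is illustrative):

```python
import pandas as pd

# Illustrative data: one large-ish group and one single-person group
toy = pd.DataFrame({
    "Occupation": ["Artist", "Artist", "Artist", "Farmer; lecturer"],
    "Age of death": [60.0, 70.0, 80.0, 121.0],
})

stats = (
    toy.groupby("Occupation")["Age of death"]
    .agg(["mean", "count"])              # mean age and number of people per group
    .sort_values("mean", ascending=False)
)

print(stats)
```

The single-person group ends up on top, which is the motivation for keeping only frequent occupations before comparing means.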

Age of death by occupation

df.groupby("Occupation")["Age of death"].describe()

	count	mean	std	min	25%	50%	75%	max
Occupation
1859	1.0	47.000000	NaN	47.0	47.0	47.0	47.0	47.0
Abbess	36.0	60.694444	16.924740	24.0	49.0	63.0	73.0	90.0
Abbess; business executive	1.0	86.000000	NaN	86.0	86.0	86.0	86.0	86.0
Abbess; christians jehovah’s witnesses	1.0	81.000000	NaN	81.0	81.0	81.0	81.0	81.0
Abbé	6.0	69.666667	16.070677	41.0	66.0	72.5	79.0	87.0
...	...	...	...	...	...	...	...	...
Zoology	1.0	44.000000	NaN	44.0	44.0	44.0	44.0	44.0
Zoology; marine biology; biologist	1.0	73.000000	NaN	73.0	73.0	73.0	73.0	73.0
École polytechnique	1.0	72.000000	NaN	72.0	72.0	72.0	72.0	72.0
Župan	5.0	38.600000	27.061042	12.0	21.0	32.0	47.0	81.0
مجموعة الأنظمة منصة شليلة; serology; bacteriologist	1.0	40.000000	NaN	40.0	40.0	40.0	40.0	40.0

9313 rows × 8 columns

df["Occupation"].nunique()

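# Comparing the counts with a threshold gives the boolean mask shown below
# (1_000 is the cutoff used in the following cells)
df["Occupation"].value_counts() > 1_000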

Occupation
Artist                             True
Politician                         True
Athlete                            True
Researcher                         True
Military personnel                 True
                                  ...  
Director; scout leader            False
Salonnière; patron of the arts    False
Servant of god                    False
Cleric; coal miner                False
Goldsmith; metalsmith             False
Name: count, Length: 9313, dtype: bool

occup_counts = df["Occupation"].value_counts()

occup_counts[occup_counts > 1_000].index

Index(['Artist', 'Politician', 'Athlete', 'Researcher', 'Military personnel',
       'Religious figure', 'Businessperson', 'Architect', 'Journalist',
       'Teacher', 'Physician', 'Engineer', 'Judge', 'Lawyer', 'Jurist',
       'Aristocrat', 'Entrepreneur', 'Philosopher', 'Translator', 'Publisher',
       'Librarian', 'Author', 'Surgeon', 'Merchant', 'Novelist', 'Rower',
       'Astronomer', 'Pianist', 'Psychologist', 'Pastor', 'Minister', 'Farmer',
       'Inventor', 'Psychiatrist', 'Rabbi', 'Explorer', 'Fencer',
       'Police officer', 'Trade unionist'],
      dtype='object', name='Occupation')

occupations_more_than_1000 = occup_counts[occup_counts > 1_000].index

# df[df["Occupation"].isin(occupations_more_than_1000)]

df = df[df["Occupation"].isin(occupations_more_than_1000)]
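The filter above builds an index of frequent occupations and then uses `isin`. An equivalent one-step variant, sketched here on made-up data, maps every row to the size of its occupation group and filters on that. The name `min_rows` and its value are hypothetical and chosen for the toy data; the notebook uses 1_000.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "Occupation": ["Artist", "Artist", "Artist", "Zoology", "Politician", "Politician"],
    "Age of death": [60.0, 70.0, 80.0, 44.0, 65.0, 75.0],
})

min_rows = 1  # hypothetical cutoff: keep occupations with more than 1 row

# For every row, look up how many rows share its occupation
group_sizes = toy["Occupation"].map(toy["Occupation"].value_counts())
frequent = toy[group_sizes > min_rows]

print(sorted(frequent["Occupation"].unique()))
```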

age_by_occup = df.groupby("Occupation")["Age of death"].mean()
age_by_occup

Occupation
Architect             72.085306
Aristocrat            53.006540
Artist                69.725145
Astronomer            71.152301
Athlete               68.772460
Author                70.094754
Businessperson        74.153054
Engineer              72.156611
Entrepreneur          73.222146
Explorer              61.799302
Farmer                71.240991
Fencer                72.496454
Inventor              73.129545
Journalist            69.591239
Judge                 74.004850
Jurist                69.488372
Lawyer                71.231208
Librarian             73.437335
Merchant              68.316125
Military personnel    63.820056
Minister              69.176471
Novelist              71.452949
Pastor                68.870213
Philosopher           71.037957
Physician             70.683996
Pianist               71.754000
Police officer        64.403013
Politician            70.541558
Psychiatrist          73.231385
Psychologist          76.396378
Publisher             71.178990
Rabbi                 71.741322
Religious figure      69.801273
Researcher            73.131376
Rower                 71.019317
Surgeon               71.815642
Teacher               73.331995
Trade unionist        71.768421
Translator            72.046317
Name: Age of death, dtype: float64

age_by_occup.sort_values(ascending=True).plot()

df.columns

Index(['Id', 'Name', 'Short description', 'Gender', 'Country', 'Occupation',
       'Birth year', 'Death year', 'Manner of death', 'Age of death'],
      dtype='object')

Suicide

df["Gender"].value_counts(normalize=True) * 100

Gender
Male                                              90.966985
Female                                             9.019897
Transgender Female                                 0.005605
Transgender Male                                   0.002862
Eunuch; Male                                       0.001908
Female; Male                                       0.000716
Intersex                                           0.000596
Transgender Male; Female                           0.000358
Non-Binary                                         0.000239
Transgender Person; Intersex; Transgender Male     0.000119
Intersex; Male                                     0.000119
Transgender Female; Female                         0.000119
Transgender Female; Male                           0.000119
Intersex; Transgender Male                         0.000119
Transgender Male; Male                             0.000119
Female; Female                                     0.000119
Name: proportion, dtype: float64

df[df["Manner of death"] == "Suicide"].empty

True
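The empty result comes from string comparison being case-sensitive: the dataset stores the value in lowercase. Below is a small sketch of the difference, plus a case-insensitive substring match that also catches combined values such as "war; suicide". The series is made up for illustration.

```python
import pandas as pd

# Illustrative values in the style of the "Manner of death" column
manner = pd.Series(
    ["suicide", "natural causes", "war; suicide", None, "homicide"],
    name="Manner of death",
)

exact_capitalized = manner == "Suicide"   # matches nothing: wrong case
exact_lower = manner == "suicide"         # matches the exact value only
# case=False ignores case, na=False treats missing values as "no match"
contains_any_case = manner.str.contains("suicide", case=False, na=False)

print(exact_capitalized.sum(), exact_lower.sum(), contains_any_case.sum())
```

Whether combined values like "war; suicide" should count is an analysis decision; the notebook keeps the exact match.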

df["Manner of death"].value_counts()

Manner of death
natural causes        29717
suicide                4647
accident               4217
homicide               3273
capital punishment     1813
                      ...  
rebellion                 1
Holocaust victim          1
unknown                   1
war; suicide              1
White Terror              1
Name: count, Length: 166, dtype: int64

df_suicide = df[df["Manner of death"] == "suicide"]
df_suicide

	Id	Name	Short description	Gender	Country	Occupation	Birth year	Death year	Manner of death	Age of death
23	Q440	Salvador Allende	28th president of Chile (1908–1973)	Male	Chile	Politician	1908	1973.0	suicide	65.0
131	Q1322	José Manuel Balmaceda	Chilean politician and President (1840-1891)	Male	Chile	Politician	1840	1891.0	suicide	51.0
189	Q2022	Cesare Pavese	Italian poet, novelist, literary critic, and t...	Male	Italy; Kingdom of Italy	Researcher	1908	1950.0	suicide	42.0
323	Q4616	Marilyn Monroe	American actress, model, and singer (1926-1962)	Female	United States of America	Artist	1926	1962.0	suicide	36.0
327	Q4673	Paul Otto	German film actor and director	Male	Nazi Germany; Weimar Republic; German Empire	Artist	1878	1943.0	suicide	65.0
...	...	...	...	...	...	...	...	...	...	...
1212054	Q70834687	Karl Neumann	politician and director of the Deutsche Zeiche...	NaN	German Reich	Politician	1900	1945.0	suicide	45.0
1213739	Q73375287	Peter Kuranda	Austrian journalist	Male	Austria; Austria-Hungary	Journalist	1896	1938.0	suicide	42.0
1214539	Q75135015	Michael Benveniste	American pornographic film director	Male	United States of America	Artist	1946	1982.0	suicide	36.0
1215398	Q75336010	George Dewey Sanford Jr.	United States Marine	Male	United States of America	Military personnel	1925	1994.0	suicide	69.0
1217823	Q75694915	Gotthard Zimmer	fotograaf uit Oostenrijk-Hongarije (1847-1886)	NaN	Austria-Hungary	Artist	1847	1886.0	suicide	39.0

4647 rows × 10 columns

suicide_counts_country = df_suicide["Country"].value_counts()
suicide_counts_country

Country
United States of America                           991
France                                             362
Germany                                            321
United Kingdom                                     152
Japan                                              141
                                                  ... 
Qing dynasty; Ming dynasty; Kingdom of Tungning      1
Spain; Peru                                          1
West Germany                                         1
Qing dynasty; China                                  1
United States of America; Russian Empire             1
Name: count, Length: 354, dtype: int64

country_counts = df["Country"].value_counts()
country_counts

Country
United States of America                                                                                                  135127
Germany                                                                                                                    78718
France                                                                                                                     65572
United Kingdom; United Kingdom of Great Britain and Ireland                                                                26642
Spain                                                                                                                      21930
                                                                                                                           ...  
Afghanistan; Austria-Hungary                                                                                                   1
Syria; Ottoman Empire; State of Damascus; Arab Kingdom of Syria; State of Syria; Syrian Republic; United Arab Republic         1
Republic of Florence; Grand Duchy of Tuscany                                                                                   1
Grand Duchy of Tuscany; Duchy of Lucca; Kingdom of Italy                                                                       1
Norway; Austria-Hungary; Union between Sweden and Norway                                                                       1
Name: count, Length: 5400, dtype: int64

suicide = pd.merge(suicide_counts_country, country_counts,
                   how="left",
                   on="Country", suffixes=("_suicide", "_overall"))
suicide

	count_suicide	count_overall
Country
United States of America	991	135127
France	362	65572
Germany	321	78718
United Kingdom	152	19127
Japan	141	13209
...	...	...
Qing dynasty; Ming dynasty; Kingdom of Tungning	1	1
Spain; Peru	1	23
West Germany	1	21
Qing dynasty; China	1	10
United States of America; Russian Empire	1	151

354 rows × 2 columns

suicide["suicide_over_total"] = suicide["count_suicide"] / suicide["count_overall"]
suicide

	count_suicide	count_overall	suicide_over_total
Country
United States of America	991	135127	0.007334
France	362	65572	0.005521
Germany	321	78718	0.004078
United Kingdom	152	19127	0.007947
Japan	141	13209	0.010675
...	...	...	...
Qing dynasty; Ming dynasty; Kingdom of Tungning	1	1	1.000000
Spain; Peru	1	23	0.043478
West Germany	1	21	0.047619
Qing dynasty; China	1	10	0.100000
United States of America; Russian Empire	1	151	0.006623

354 rows × 3 columns
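The same ratio can probably be computed without `merge`, because arithmetic between two Series aligns them on their index labels. Here is a sketch on made-up data; unlike the left merge above, this variant keeps countries with no recorded suicides and gives them a rate of 0.

```python
import pandas as pd

# Illustrative data only
df_toy = pd.DataFrame({
    "Country": ["Japan", "Japan", "Japan", "Japan", "Spain", "Spain", "Chile"],
    "Manner of death": ["suicide", None, "accident", "suicide", None, "suicide", None],
})

country_counts = df_toy["Country"].value_counts()
suicide_counts = df_toy.loc[
    df_toy["Manner of death"] == "suicide", "Country"
].value_counts()

# Division aligns on country names; countries missing from suicide_counts give NaN, then 0
suicide_per_1k = (suicide_counts / country_counts).fillna(0) * 1000

print(suicide_per_1k.sort_values())
```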

suicide["suicide_per_1k"] = suicide["suicide_over_total"] * 1000

suicide_sorted = suicide.sort_values(by="suicide_per_1k", ascending=True)
suicide_sorted

	count_suicide	count_overall	suicide_over_total	suicide_per_1k
Country
Spain	31	21930	0.001414	1.413589
Denmark	16	9187	0.001742	1.741591
Kingdom of England	7	3920	0.001786	1.785714
Grand Duchy of Finland	1	549	0.001821	1.821494
India; British Raj	5	2642	0.001893	1.892506
...	...	...	...	...
Northern Ireland; Ireland	1	1	1.000000	1000.000000
People's Republic of Bulgaria	1	1	1.000000	1000.000000
United States of America; French Third Republic; Second French Empire	1	1	1.000000	1000.000000
Japan; China	1	1	1.000000	1000.000000
Nazi Germany; Kingdom of Romania; West Germany	1	1	1.000000	1000.000000

354 rows × 4 columns

suicide_sorted.head(10)["suicide_per_1k"].plot(kind="bar")

suicide_sorted.tail(10)["suicide_per_1k"].plot(kind="bar")

suicide_sorted.tail(10)

	count_suicide	count_overall	suicide_over_total	suicide_per_1k
Country
North Korea; Soviet Union; Russian Empire	1	1	1.0	1000.0
Classical Athens; Ancient Carthage	1	1	1.0	1000.0
Qin	1	1	1.0	1000.0
Germany; Nazi Germany; Austria-Hungary; Czechoslovakia	1	1	1.0	1000.0
Ottoman Empire; Soviet Union; Russian Empire	1	1	1.0	1000.0
Northern Ireland; Ireland	1	1	1.0	1000.0
People's Republic of Bulgaria	1	1	1.0	1000.0
United States of America; French Third Republic; Second French Empire	1	1	1.0	1000.0
Japan; China	1	1	1.0	1000.0
Nazi Germany; Kingdom of Romania; West Germany	1	1	1.0	1000.0

suicide_sorted[suicide_sorted["count_overall"] > 5_000]["suicide_per_1k"].tail(10).plot(kind="bar")

df.isna().sum()

Id                        0
Name                      0
Short description      8421
Gender                87633
Country              182030
Occupation                0
Birth year                0
Death year                0
Manner of death      881641
Age of death              0
dtype: int64

Gender

df["Gender"].value_counts()

Gender
Male                                              762780
Female                                             75634
Transgender Female                                    47
Transgender Male                                      24
Eunuch; Male                                          16
Female; Male                                           6
Intersex                                               5
Transgender Male; Female                               3
Non-Binary                                             2
Transgender Person; Intersex; Transgender Male         1
Intersex; Male                                         1
Transgender Female; Female                             1
Transgender Female; Male                               1
Intersex; Transgender Male                             1
Transgender Male; Male                                 1
Female; Female                                         1
Name: count, dtype: int64

df.query("Gender == 'Non-Binary'") # df[df["Gender"] == "Non-Binary"]

	Id	Name	Short description	Gender	Country	Occupation	Birth year	Death year	Manner of death	Age of death
39998	Q219634	Claude Cahun	French artist (1894-1954)	Non-Binary	France	Artist	1894	1954.0	NaN	60.0
754386	Q13562059	Maxine Feldman	lesbian and non-binary musician	Non-Binary	United States of America	Artist	1945	2007.0	NaN	62.0

df.columns

Index(['Id', 'Name', 'Short description', 'Gender', 'Country', 'Occupation',
       'Birth year', 'Death year', 'Manner of death', 'Age of death'],
      dtype='object')

df.groupby("Gender")["Birth year"].max().sort_values()

Gender
Eunuch; Male                                      1451
Intersex; Male                                    1763
Transgender Male; Male                            1869
Female; Female                                    1884
Transgender Person; Intersex; Transgender Male    1885
Intersex; Transgender Male                        1912
Transgender Male; Female                          1913
Intersex                                          1926
Non-Binary                                        1945
Transgender Female; Male                          1947
Female; Male                                      1949
Transgender Female; Female                        1949
Transgender Male                                  1986
Transgender Female                                1991
Male                                              2002
Female                                            2002
Name: Birth year, dtype: int64

df = df[df["Gender"].isin(["Male", "Female"])]

df["Occupation"].unique()

array(['Politician', 'Artist', 'Astronomer', 'Athlete', 'Researcher',
       'Military personnel', 'Philosopher', 'Businessperson', 'Explorer',
       'Architect', 'Teacher', 'Aristocrat', 'Entrepreneur', 'Journalist',
       'Engineer', 'Author', 'Religious figure', 'Judge', 'Librarian',
       'Translator', 'Physician', 'Inventor', 'Trade unionist',
       'Merchant', 'Publisher', 'Pastor', 'Fencer', 'Rabbi',
       'Psychologist', 'Lawyer', 'Rower', 'Jurist', 'Police officer',
       'Surgeon', 'Psychiatrist', 'Pianist', 'Farmer', 'Minister',
       'Novelist'], dtype=object)

df_research = df[df["Occupation"] == "Researcher"]

df_research.shape[0]

len(df_research)

df_research.value_counts("Gender").loc["Male"] / len(df_research)

np.float64(0.9204624701780143)

df_research["Gender"].value_counts(normalize=True).loc["Male"]

np.float64(0.9204624701780143)

def get_male_percentage(series):
    # Share of "Male" values in the series, as a percentage (0-100)
    return series.value_counts(normalize=True).loc["Male"] * 100

# The recorded output below is the fraction (0-1), i.e. the value without the "* 100" factor
get_male_percentage(df_research["Gender"])

np.float64(0.9204624701780143)

for m in df["Occupation"].unique():
    df_filter = df[df["Occupation"] == m]
    # The recorded output below shows fractions (0-1), i.e. values without the "* 100" factor
    print(m, get_male_percentage(df_filter["Gender"]))

Politician 0.9561554391245799
Artist 0.821756963672281
Astronomer 0.9173256649892164
Athlete 0.9672833532213965
Researcher 0.9204624701780143
Military personnel 0.9830178291619024
Philosopher 0.9450272765421738
Businessperson 0.9515949663447468
Explorer 0.9703315881326352
Architect 0.9670399592771698
Teacher 0.8632561613144137
Aristocrat 0.6248584371460929
Entrepreneur 0.9663496708119971
Journalist 0.8801171679645639
Engineer 0.9881951949455483
Author 0.8742255266418835
Religious figure 0.9743905658716888
Judge 0.9711538461538461
Librarian 0.7817745803357314
Translator 0.7956669498725574
Physician 0.9199198326943185
Inventor 0.9727497935590421
Trade unionist 0.8755980861244019
Merchant 0.9845261121856866
Publisher 0.9534782608695652
Pastor 0.9901071723000825
Fencer 0.875886524822695
Rabbi 0.9920704845814978
Psychologist 0.7916018662519441
Lawyer 0.939869484151647
Rower 0.9845460399227302
Jurist 0.988530990727184
Police officer 0.9582909460834181
Surgeon 0.9824890556597874
Psychiatrist 0.9107303877366997
Pianist 0.659037095501184
Farmer 0.9534109816971714
Minister 0.9712918660287081
Novelist 0.5951293759512938

gender_occup = df.groupby("Occupation")["Gender"].apply(get_male_percentage).sort_values()
gender_occup

Occupation
Novelist              59.512938
Aristocrat            62.485844
Pianist               65.903710
Librarian             78.177458
Psychologist          79.160187
Translator            79.566695
Artist                82.175696
Teacher               86.325616
Author                87.422553
Trade unionist        87.559809
Fencer                87.588652
Journalist            88.011717
Psychiatrist          91.073039
Astronomer            91.732566
Physician             91.991983
Researcher            92.046247
Lawyer                93.986948
Philosopher           94.502728
Businessperson        95.159497
Farmer                95.341098
Publisher             95.347826
Politician            95.615544
Police officer        95.829095
Entrepreneur          96.634967
Architect             96.703996
Athlete               96.728335
Explorer              97.033159
Judge                 97.115385
Minister              97.129187
Inventor              97.274979
Religious figure      97.439057
Surgeon               98.248906
Military personnel    98.301783
Merchant              98.452611
Rower                 98.454604
Engineer              98.819519
Jurist                98.853099
Pastor                99.010717
Rabbi                 99.207048
Name: Gender, dtype: float64
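`get_male_percentage` relies on `.loc["Male"]`, which raises a `KeyError` for a group that contains no men at all. A vectorized alternative, sketched on made-up data, compares the column with "Male" once and takes the mean of the resulting booleans per occupation. The occupation "Midwife" is a hypothetical all-female group added to show that case.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "Occupation": ["Novelist", "Novelist", "Rabbi", "Rabbi", "Midwife"],
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
})

# Mean of a boolean Series is the share of True values, here the share of men
male_pct = toy["Gender"].eq("Male").groupby(toy["Occupation"]).mean() * 100

print(male_pct.sort_values())
```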

gender_occup_df = gender_occup.to_frame()
gender_occup_df.rename(columns={"Gender": "Percentage Male"}, inplace=True)

gender_occup_df.plot(kind="bar")

df.value_counts("Gender")

Gender
Male      762780
Female     75634
Name: count, dtype: int64

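# Male share of all suicides (from df_suicide.Gender.value_counts(normalize=True) below)
# divided by the number of men in df, times 1000
0.8656544743501265 / 762780 * 1000

0.0011348678181784086

# The same quantity for women divided by the male value:
# a ratio above 1 means suicide is recorded relatively more often among women than among men
0.132965263400046 / 75634 * 1000 / (0.8656544743501265 / 762780 * 1000)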

1.5490871388122094

df_suicide.Gender.value_counts(normalize=True)

Gender
Male                  0.865654
Female                0.132965
Transgender Female    0.000690
Eunuch; Male          0.000230
Transgender Male      0.000230
Intersex              0.000230
Name: proportion, dtype: float64
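The ratio above was typed in by hand from printed numbers. Here is a sketch of how a female-to-male ratio could be computed directly from the data instead, using made-up rows. It assumes a missing "Manner of death" counts as "not suicide", which is a strong assumption given how many values are missing in the real dataset.

```python
import pandas as pd

# Illustrative data: 8 men (2 suicides) and 2 women (1 suicide)
toy = pd.DataFrame({
    "Gender": ["Male"] * 8 + ["Female"] * 2,
    "Manner of death": ["suicide", "suicide", None, None, None, None, None, None,
                        "suicide", None],
})

is_suicide = toy["Manner of death"].eq("suicide")       # missing values become False
rate_by_gender = is_suicide.groupby(toy["Gender"]).mean()
female_to_male = rate_by_gender["Female"] / rate_by_gender["Male"]

print(rate_by_gender * 1000)   # suicides per 1,000 people of each gender
print(female_to_male)
```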

Are you Armenian?

Yes,
feels good, doesn't it?

t1 = "i am from Armenia"
# t2 = "I am Armenian"

options = ["armenia", "armenian"]
"Armenia" in t1  # case-sensitive substring check

# contained = []
# for i in options:
#     contained.append(i in t1)

# Lowercase both sides so that the check ignores case
contained = [i.lower() in t1.lower() for i in options]
print(any(contained))

True

def is_armenian(text):
    keywords = ["armenian", "armenia"]
    return any([k in text.lower() for k in keywords])

df["Armenian"] = df["Short description"].apply(is_armenian)
df

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[125], line 1
----> 1 df["Armenian"] = df["Short description"].apply(is_armenian)
      2 df

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\series.py:4935, in Series.apply(self, func, convert_dtype, args, by_row, **kwargs)
   4800 def apply(
   4801     self,
   4802     func: AggFuncType,
   (...)
   4807     **kwargs,
   4808 ) -> DataFrame | Series:
   4809     """
   4810     Invoke function on values of Series.
   4811 
   (...)
   4926     dtype: float64
   4927     """
   4928     return SeriesApply(
   4929         self,
   4930         func,
   4931         convert_dtype=convert_dtype,
   4932         by_row=by_row,
   4933         args=args,
   4934         kwargs=kwargs,
-> 4935     ).apply()

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\apply.py:1422, in SeriesApply.apply(self)
   1419     return self.apply_compat()
   1421 # self.func is Callable
-> 1422 return self.apply_standard()

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\apply.py:1502, in SeriesApply.apply_standard(self)
   1496 # row-wise access
   1497 # apply doesn't have a `na_action` keyword and for backward compat reasons
   1498 # we need to give `na_action="ignore"` for categorical data.
   1499 # TODO: remove the `na_action="ignore"` when that default has been changed in
   1500 #  Categorical (GH51645).
   1501 action = "ignore" if isinstance(obj.dtype, CategoricalDtype) else None
-> 1502 mapped = obj._map_values(
   1503     mapper=curried, na_action=action, convert=self.convert_dtype
   1504 )
   1506 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1507     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1508     #  See also GH#25959 regarding EA support
   1509     return obj._constructor_expanddim(list(mapped), index=obj.index)

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\base.py:925, in IndexOpsMixin._map_values(self, mapper, na_action, convert)
    922 if isinstance(arr, ExtensionArray):
    923     return arr.map(mapper, na_action=na_action)
--> 925 return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)

File c:\Users\hayk_\.conda\envs\lectures\lib\site-packages\pandas\core\algorithms.py:1743, in map_array(arr, mapper, na_action, convert)
   1741 values = arr.astype(object, copy=False)
   1742 if na_action is None:
-> 1743     return lib.map_infer(values, mapper, convert=convert)
   1744 else:
   1745     return lib.map_infer_mask(
   1746         values, mapper, mask=isna(values).view(np.uint8), convert=convert
   1747     )

File pandas/_libs/lib.pyx:2999, in pandas._libs.lib.map_infer()

Cell In[124], line 3, in is_armenian(text)
      1 def is_armenian(text):
      2     keywords = ["armenian", "armenia"]
----> 3     return any([k in text.lower() for k in keywords])

Cell In[124], line 3, in <listcomp>(.0)
      1 def is_armenian(text):
      2     keywords = ["armenian", "armenia"]
----> 3     return any([k in text.lower() for k in keywords])

AttributeError: 'float' object has no attribute 'lower'
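
The traceback above ends in `is_armenian` because pandas stores a missing text value as a float `NaN`, and a float has no `.lower()` method. A minimal sketch of a NaN-tolerant variant follows; the name `is_armenian_safe` and the toy series are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def is_armenian_safe(text):
    """Hypothetical NaN-tolerant variant of the notebook's is_armenian."""
    # Missing descriptions arrive as float NaN, so anything that is not a string counts as "no match"
    if not isinstance(text, str):
        return False
    keywords = ["armenian", "armenia"]
    lowered = text.lower()
    return any(k in lowered for k in keywords)


# Toy data made up for illustration (not rows from the dataset)
toy = pd.Series(["Armenian chess player", float("nan"), "French painter"])
flags = toy.apply(is_armenian_safe)
print(flags.tolist())
```

Since "armenia" is a substring of "armenian", a single keyword would give the same result; the list is kept here only to mirror the original function.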

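# Rows whose description is missing; fillna("") blanks every NaN in the displayed result, not only the description
df[df["Short description"].isna()].fillna("")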

	Id	Name	Short description	Gender	Country	Occupation	Birth year	Death year	Manner of death	Age of death
46515	Q287430	Pietro Guido II Torelli		Male		Aristocrat	1450	1494.0		44.0
71941	Q482302	József Adamovich		Male		Religious figure	1845	1887.0		42.0
75497	Q516682	István Agh		Male		Religious figure	1709	1786.0		77.0
88055	Q621272	Dénes Alesius		Male		Religious figure	1525	1577.0		52.0
92789	Q689315	Mátyás Ambrózy		Male		Pastor	1797	1869.0		72.0
...	...	...	...	...	...	...	...	...	...	...
1219020	Q75881383	Virginia Downing		Female		Artist	1904	1996.0		92.0
1219990	Q76009843	Edward Hunter Ludlow		Male		Physician	1810	1884.0		74.0
1222371	Q76328370	James Gordon Dennis		Male		Military personnel	1921	1944.0		23.0
1222650	Q76375951	John Calvin MacKay		Male		Religious figure	1891	1986.0		95.0
1222675	Q76401454	Joan Marsden		Female		Researcher	1922	2001.0		79.0

5612 rows × 10 columns

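# NaN descriptions are floats, so replace them with a placeholder string before calling is_armenian
df["Armenian"] = df["Short description"].fillna("na").apply(is_armenian)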

df[df.Armenian]

	Id	Name	Short description	Gender	Country	Occupation	Birth year	Death year	Manner of death	Age of death	Armenian
180	Q1785	Charles Aznavour	Armenian-French singer and diplomat	Male	France; Armenia	Artist	1924	2018.0	NaN	94.0	True
311	Q4452	Thomas of Metsoph	Armenian cleric and chronicler	Male	NaN	Researcher	1378	1446.0	NaN	68.0	True
354	Q4924	Isabella I, Queen of Armenia	queen regnant of Cilician Armenia	Female	Armenian Kingdom of Cilicia	Politician	1216	1252.0	NaN	36.0	True
3462	Q51472	Rouben Mamoulian	Armenian American film and theatre director	Male	United States of America; Russian Empire	Artist	1897	1987.0	NaN	90.0	True
3807	Q55394	Henri Verneuil	French-Armenian playwright and filmmaker	Male	France	Artist	1920	2002.0	NaN	82.0	True
...	...	...	...	...	...	...	...	...	...	...	...
1158947	Q58030786	Marie Balian	Armenian ceramic artist	Female	Israel	Artist	1925	2017.0	NaN	92.0	True
1161788	Q59394760	Robert Kamoyan	Armenian director, artist	Male	Armenia; Soviet Union	Artist	1937	2014.0	NaN	77.0	True
1166304	Q59657412	Giuseppe Arachial	Armenian Catholic bishop of Angora	Male	Ottoman Empire	Religious figure	1811	1876.0	NaN	65.0	True
1191627	Q63226473	Boris Meliksetyan	Armenian geologist	Male	Armenia; Soviet Union	Researcher	1928	1992.0	NaN	64.0	True
1198505	Q64734343	Pierre Tilkian	Armenian Catholic bishop	Male	NaN	Religious figure	1809	1885.0	NaN	76.0	True

538 rows × 11 columns
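
The same flag can probably be computed without a Python-level function, using the vectorized string accessor. The sketch below assumes a case-insensitive substring match is what is wanted; the toy frame is made up for illustration.

```python
import pandas as pd

# Toy data made up for illustration (not rows from the dataset)
toy = pd.DataFrame(
    {
        "Short description": [
            "Armenian-French singer",
            float("nan"),
            "queen regnant of Cilician Armenia",
            "German designer",
        ]
    }
)

# case=False ignores letter case; na=False treats missing descriptions as "no match"
mask = toy["Short description"].str.contains("armenia", case=False, na=False)
print(mask.tolist())
```

With `na=False` there is no need for a separate `fillna` step, and the result is a plain boolean column that can be used directly for filtering.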

# Exact match: keeps only rows whose Country value is exactly "Armenia"
df[df["Country"] == "Armenia"]

	Id	Name	Short description	Gender	Country	Occupation	Birth year	Death year	Manner of death	Age of death	Armenian
43970	Q266968	Gurgen Margaryan	Armenian soldier	Male	Armenia	Military personnel	1978	2004.0	homicide	26.0	True
45653	Q278864	Andranik Ozanian	Armenian politician and military personnel (18...	Male	Armenia	Politician	1865	1927.0	NaN	62.0	True
54084	Q336104	Jerry Tarkanian	American basketball coach	Male	Armenia	Athlete	1930	2015.0	NaN	85.0	False
71000	Q471374	Karen Asrian	Armenian chess player	Male	Armenia	Athlete	1980	2008.0	natural causes	28.0	True
79459	Q544093	Genrikh Kasparyan	Armenian chess player	Male	Armenia	Athlete	1910	1995.0	NaN	85.0	True
...	...	...	...	...	...	...	...	...	...	...	...
1003702	Q24048886	Robert Abajyan	Armenian military person, Hero of Artsakh	Male	Armenia	Military personnel	1996	2016.0	suicide	20.0	True
1025037	Q27349753	Artur Sargsyan	Armenian sculptor	Male	Armenia	Artist	1968	2017.0	NaN	49.0	True
1034887	Q28114502	Emma Khanzadyan	Armenian historian, archaeologist	Female	Armenia	Researcher	1922	2007.0	NaN	85.0	True
1046490	Q29033966	Eduard Edigaryan	Armenian painter	Male	Armenia	Artist	1943	2019.0	NaN	76.0	True
1084025	Q47009214	Pavel Chobanyan	Armenian orientalist	Male	Armenia	Researcher	1948	2017.0	NaN	69.0	True

121 rows × 11 columns

# Substring match: also keeps multi-country values such as "France; Armenia"
arm = df[df["Country"].fillna("na").str.contains("Armenia")]
arm

	Id	Name	Short description	Gender	Country	Occupation	Birth year	Death year	Manner of death	Age of death	Armenian
180	Q1785	Charles Aznavour	Armenian-French singer and diplomat	Male	France; Armenia	Artist	1924	2018.0	NaN	94.0	True
354	Q4924	Isabella I, Queen of Armenia	queen regnant of Cilician Armenia	Female	Armenian Kingdom of Cilicia	Politician	1216	1252.0	NaN	36.0	True
3201	Q48112	Ivan Bagramyan	Marshal of the Soviet Union (1897-1982)	Male	Soviet Union; Russian Empire; First Republic o...	Politician	1897	1982.0	NaN	85.0	False
4983	Q61130	Luigi Colani	German industrial designer and design professor	Male	Germany; Armenia	Teacher	1928	2019.0	NaN	91.0	False
5560	Q62316	Robert Sahakyants	animator	Male	Armenia; Soviet Union	Artist	1950	2009.0	NaN	59.0	False
...	...	...	...	...	...	...	...	...	...	...	...
1086289	Q47457007	Garnik Karapetyan	Armenian scientist and mathematician (1958–2018)	Male	Armenia; Soviet Union	Researcher	1958	2018.0	NaN	60.0	True
1161788	Q59394760	Robert Kamoyan	Armenian director, artist	Male	Armenia; Soviet Union	Artist	1937	2014.0	NaN	77.0	True
1182207	Q62024298	Diana Oucleba	Georgian poetess, artist	Female	Armenia; Soviet Union; Russian Empire	Artist	1910	2001.0	NaN	91.0	False
1191627	Q63226473	Boris Meliksetyan	Armenian geologist	Male	Armenia; Soviet Union	Researcher	1928	1992.0	NaN	64.0	True
1206411	Q66132386	Albert Ghazaryan	athlete, coach, referee	Male	Armenia; Soviet Union	Athlete	1935	2020.0	NaN	85.0	False

301 rows × 11 columns
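
A substring match also keeps values that merely contain the word, such as "Armenian Kingdom of Cilicia" in the output above. If only the exact country name should count, one option is to split the `;`-separated list and compare whole entries. This is a sketch on made-up rows, assuming every value uses `;` as the separator as the displayed ones do.

```python
import pandas as pd

# Toy data made up for illustration (not rows from the dataset)
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Country": [
            "France; Armenia",
            "Armenian Kingdom of Cilicia",
            float("nan"),
            "Armenia",
        ],
    }
)

# One row per listed country; explode() keeps the original row index
countries = toy["Country"].str.split(";").explode().str.strip()

# A person qualifies if any of their listed countries is exactly "Armenia"
has_armenia = (countries == "Armenia").groupby(level=0).any()

exact = toy[has_armenia]
print(exact["Name"].tolist())
```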

arm["num_countries"] = arm["Country"].str.split(";").apply(len)
arm

C:\Users\hayk_\AppData\Local\Temp\ipykernel_6640\2434009080.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  arm["num_countries"] = arm["Country"].str.split(";").apply(len)

	Id	Name	Short description	Gender	Country	Occupation	Birth year	Death year	Manner of death	Age of death	Armenian	num_countries
180	Q1785	Charles Aznavour	Armenian-French singer and diplomat	Male	France; Armenia	Artist	1924	2018.0	NaN	94.0	True	2
354	Q4924	Isabella I, Queen of Armenia	queen regnant of Cilician Armenia	Female	Armenian Kingdom of Cilicia	Politician	1216	1252.0	NaN	36.0	True	1
3201	Q48112	Ivan Bagramyan	Marshal of the Soviet Union (1897-1982)	Male	Soviet Union; Russian Empire; First Republic o...	Politician	1897	1982.0	NaN	85.0	False	3
4983	Q61130	Luigi Colani	German industrial designer and design professor	Male	Germany; Armenia	Teacher	1928	2019.0	NaN	91.0	False	2
5560	Q62316	Robert Sahakyants	animator	Male	Armenia; Soviet Union	Artist	1950	2009.0	NaN	59.0	False	2
...	...	...	...	...	...	...	...	...	...	...	...	...
1086289	Q47457007	Garnik Karapetyan	Armenian scientist and mathematician (1958–2018)	Male	Armenia; Soviet Union	Researcher	1958	2018.0	NaN	60.0	True	2
1161788	Q59394760	Robert Kamoyan	Armenian director, artist	Male	Armenia; Soviet Union	Artist	1937	2014.0	NaN	77.0	True	2
1182207	Q62024298	Diana Oucleba	Georgian poetess, artist	Female	Armenia; Soviet Union; Russian Empire	Artist	1910	2001.0	NaN	91.0	False	3
1191627	Q63226473	Boris Meliksetyan	Armenian geologist	Male	Armenia; Soviet Union	Researcher	1928	1992.0	NaN	64.0	True	2
1206411	Q66132386	Albert Ghazaryan	athlete, coach, referee	Male	Armenia; Soviet Union	Athlete	1935	2020.0	NaN	85.0	False	2

301 rows × 12 columns
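
The `SettingWithCopyWarning` above appears because `arm` was produced by filtering `df`, so pandas cannot tell whether the new column is meant for `arm` or for the original frame. A common way to avoid it is to take an explicit copy of the filtered rows before adding columns. The sketch below uses made-up rows, and `subset` is a hypothetical name.

```python
import pandas as pd

# Toy data made up for illustration (not rows from the dataset)
toy = pd.DataFrame(
    {
        "Country": [
            "France; Armenia",
            "Armenia",
            float("nan"),
            "Armenia; Soviet Union; Russian Empire",
        ]
    }
)

# .copy() makes the filtered frame independent, so assigning a new column is unambiguous
subset = toy[toy["Country"].str.contains("Armenia", na=False)].copy()

# .str.len() on the split lists counts the entries without a Python-level apply
subset["num_countries"] = subset["Country"].str.split(";").str.len()
print(subset["num_countries"].tolist())
```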

arm.sort_values(by="num_countries", ascending=False)

	Id	Name	Short description	Gender	Country	Occupation	Birth year	Death year	Manner of death	Age of death	Armenian	num_countries
231807	Q2047004	Suren Yeremyan	Armenian historian	Male	Armenia; Soviet Union; Russian Empire; Russian...	Researcher	1908	1992.0	NaN	84.0	True	9
71047	Q471740	Armen Dzhigarkhanyan	Armenian, Soviet, Russian actor	Male	United States of America; Russia; Armenia; Sov...	Artist	1935	2020.0	NaN	85.0	True	4
100991	Q738092	Pavel Lisitsian	Russian singer	Male	Russia; Armenia; Soviet Union; Russian Empire	Artist	1911	2004.0	NaN	93.0	False	4
370403	Q4071165	Tinatin Asatiani	Georgian physicist	Female	Armenia; Soviet Union; Democratic Republic of ...	Researcher	1918	2011.0	NaN	93.0	False	4
370366	Q4070512	Varazdat Harutyunyan	Armenian architect	Male	Armenia; Ottoman Empire; Soviet Union; Russian...	Researcher	1909	2008.0	NaN	99.0	True	4
...	...	...	...	...	...	...	...	...	...	...	...	...
932345	Q20509556	Maria Petrosyan	Armenian philosopher	Female	Armenia	Philosopher	1911	1971.0	NaN	60.0	True	1
932348	Q20509639	Aida Boyajyan	Armenian artist	Female	Armenia	Artist	1932	2019.0	NaN	87.0	True	1
932353	Q20509808	Henrik Sevan	Armenian children's writer, translator, poet	Male	Armenia	Artist	1925	2008.0	NaN	83.0	True	1
43970	Q266968	Gurgen Margaryan	Armenian soldier	Male	Armenia	Military personnel	1978	2004.0	homicide	26.0	True	1
354	Q4924	Isabella I, Queen of Armenia	queen regnant of Cilician Armenia	Female	Armenian Kingdom of Cilicia	Politician	1216	1252.0	NaN	36.0	True	1