The following program creates a Series of the baby boy names of 2018, in order of decreasing popularity.
Names of equal popularity are in alphabetical order.
Names given to fewer than five babies are not listed.
"""
Baby names example from Wes McKinney, "Python for Data Analysis", 2nd. ed., pp. 424-440.
Social Security Administration, Office of the Chief Actuary
http://www.ssa.gov/oact/babynames/limits.html
Download and unzip the .zip file for National Data:
https://www.ssa.gov/oact/babynames/names.zip
I placed the resulting folder in my /Users/myname/Downloads folder.
"""
import sys
import os
import pandas as pd
year = 2018 #most recent available
filename = os.path.expanduser(f"~/Downloads/names/yob{year}.txt") #year of birth
names = ["name", "sex", "births"]
df = pd.read_csv(filename, names = names)
df = df[df.sex == "M"] #Keep the male rows.
df.sort_values(ascending = [False, True], by = ["births", "name"], inplace = True)
index = pd.Index(data = df["name"], name = "name")
series = pd.Series(data = df.births.array, index = index, name = "Number of Births")
with pd.option_context("display.min_rows", 40): #context manager
    print(series)
sys.exit(0)
name
Liam 19837
Noah 18267
William 14516
James 13525
Oliver 13389
Benjamin 13381
Elijah 12886
Lucas 12585
Mason 12435
Logan 12352
Alexander 11989
Ethan 11854
Jacob 11770
Michael 11620
Daniel 11173
Henry 10649
Jackson 10323
Sebastian 10054
Aiden 9979
Matthew 9924
...
Zien 5
Zier 5
Zierre 5
Zihir 5
Zim 5
Zin 5
Zishe 5
Zmari 5
Zoel 5
Zola 5
Zuber 5
Zubeyr 5
Zyell 5
Zyheem 5
Zykeem 5
Zylas 5
Zyran 5
Zyrie 5
Zyron 5
Zzyzx 5
Name: Number of Births, Length: 14004, dtype: int64
Do something interesting with this Series, starting by verifying that the names in the index are unique.
(There should be only one row with each name.)
What is the total number of names?
What is the total number of baby boys?
How many names had 10,000 or more babies?
How many babies had names that were shared by 10,000 or more other babies?
What percent of the babies had names in the top 10?
What initial had the most names?
(It was A.)
What initial had the most babies?
How many names started with an A?
How many babies had names that started with an A?
What were the ten most popular names that started with A?
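A few of the questions above can be sketched on a small stand-in Series. This is only a sketch: the five rows are copied from the output above and are an assumption in place of the full 14,004-row Series, so the numbers it prints are not the answers to the exercises. The variable names (totalNames, bigNames, and so on) are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real Series: five rows copied from the output above.
series = pd.Series(
    data=[19837, 18267, 11989, 9979, 5],
    index=pd.Index(["Liam", "Noah", "Alexander", "Aiden", "Zzyzx"], name="name"),
    name="Number of Births",
)

# There should be only one row with each name.
assert series.index.is_unique

totalNames = len(series)              # total number of names
totalBabies = series.sum()            # total number of babies
bigNames = (series >= 10_000).sum()   # names with 10,000 or more babies

# Percent of babies whose names are in the top 2 (top 10 for the real data).
percentTopTwo = 100 * series.nlargest(2).sum() / totalBabies

# Names that start with A, and the babies who have them.
startsWithA = series[series.index.str.startswith("A")]

print(f"{totalNames = }")
print(f"{totalBabies = }")
print(f"{bigNames = }")
print(f"{percentTopTwo = :.1f}")
print(f"{len(startsWithA) = }")
print(f"{startsWithA.sum() = }")
```

With the real Series, the same expressions should apply unchanged apart from replacing nlargest(2) with nlargest(10).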
import sys
import os
import pandas as pd
year = 2018
filename = os.path.expanduser(f"~/Downloads/names/yob{year}.txt") #year of birth
names = ["name", "sex", "births"]
df = pd.read_csv(filename, names = names)
df = df[df.sex == "M"]
df.sort_values(ascending = [False, True], by = ["births", "name"], inplace = True)
series = df["births"]
series.index = df["name"]
series.name = "Number of births for each name"
series.index.name = "name"
groups = series.groupby(series.index.str[0]) #Initials in alphabetical order
numberOfNames = groups.count()
numberOfNames.name = "Number of names for each initial"
with pd.option_context("display.max_rows", 6):
    print(numberOfNames)
print()
#The same Series, in descending numerical order.
#print(numberOfNames.sort_values(ascending = False))
#print()
numberOfBabies = groups.sum()
numberOfBabies.name = "Number of babies for each initial"
with pd.option_context("display.max_rows", 6):
    print(numberOfBabies)
print()
#The same Series, in descending numerical order.
#print(numberOfBabies.sort_values(ascending = False))
#print()
#For each initial, keep the three names with the most births.
topThree = groups.apply(lambda group: group.sort_values(ascending = False)[:3])
topThree.index.names = ["initial", "name"]
topThree.name = "Three most popular names for each initial"
with pd.option_context("display.max_rows", 3 * 26):
    print(topThree)
sys.exit(0)
name
A 1569
B 668
C 712
...
X 60
Y 259
Z 402
Name: Number of names for each initial, Length: 26, dtype: int64
name
A 180564
B 91509
C 135023
...
X 7864
Y 8163
Z 26063
Name: Number of babies for each initial, Length: 26, dtype: int64
initial name
A Alexander 11989
Aiden 9979
Anthony 8003
B Benjamin 13381
Brayden 4383
Bryson 4194
C Carter 9312
Christopher 7261
Caleb 6929
D Daniel 11173
David 9697
Dylan 8549
E Elijah 12886
Ethan 11854
Eli 6027
F Finn 2316
Felix 1638
Finley 1280
G Grayson 8538
Gabriel 8335
Greyson 4728
H Henry 10649
Hudson 6540
Hunter 6066
I Isaac 8417
Isaiah 6614
Ian 4675
J James 13525
Jacob 11770
Jackson 10323
K Kayden 3972
Kai 3421
Kingston 3330
L Liam 19837
Lucas 12585
Logan 12352
M Mason 12435
Michael 11620
Matthew 9924
N Noah 18267
Nathan 6790
Nolan 5607
O Oliver 13389
Owen 9288
Oscar 1945
P Parker 3978
Patrick 2111
Preston 1954
Q Quinn 828
Quentin 511
Quinton 456
R Ryan 6905
Robert 5140
Roman 4364
S Sebastian 10054
Samuel 9734
Santiago 4647
T Theodore 7020
Thomas 6779
Tyler 3298
U Uriel 580
Uriah 461
Ulises 236
V Vincent 3552
Victor 2213
Valentino 396
W William 14516
Wyatt 9127
Weston 3760
X Xavier 4298
Xander 2257
Xzavier 253
Y Yusuf 485
Yosef 328
Yousef 285
Z Zachary 3528
Zion 2153
Zayden 2126
Name: Three most popular names for each initial, dtype: int64
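The last step of the program above (sorting each group and slicing off its first three rows) could possibly be written with the groupby's nlargest method instead. The sketch below is an alternative, not the program's own method. It uses nine rows copied from the output above as a toy stand-in for the full Series, and keeps the top two names per initial rather than the top three. The name topTwo is made up for illustration.

```python
import pandas as pd

# Toy stand-in: nine rows copied from the output above, three per initial.
series = pd.Series(
    data=[19837, 18267, 12585, 12352, 11989, 9979, 8003, 6790, 5607],
    index=pd.Index(
        ["Liam", "Noah", "Lucas", "Logan", "Alexander",
         "Aiden", "Anthony", "Nathan", "Nolan"],
        name="name",
    ),
    name="Number of births for each name",
)

# Group by the first letter of each name, as in the program above.
groups = series.groupby(series.index.str[0])

# nlargest keeps the n biggest values in each group and puts the
# group key in front of the original index, giving a two-level index.
topTwo = groups.nlargest(2)
topTwo.index.names = ["initial", "name"]
topTwo.name = "Two most popular names for each initial"
print(topTwo)
```

The groups come out in alphabetical order of initial (A, L, N here), and within each initial the names are in descending order of births.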
"Divide the frequencies into bins."
import sys
import os
import numpy as np
import pandas as pd
year = 2018 #most recent available
filename = os.path.expanduser(f"~/python/names/yob{year}.txt") #year of birth
names = ["name", "sex", "births"]
df = pd.read_csv(filename, names = names)
df = df[df.sex == "M"] #Keep the male rows.
df.sort_values(ascending = [False, True], by = ["births", "name"], inplace = True)
index = pd.Index(data = df["name"], name = "name")
series = pd.Series(data = df.births.array, index = index, name = "Number of Births")
with pd.option_context("display.min_rows", 10): #context manager
    print(series)
print()
bins = np.arange(0, 21_000, 1_000) #21 edges bounding 20 bins; each edge except the last is the smallest int in its bin
seriesOfBins = pd.cut(series, bins = bins, right = False) #first bin includes 0 but not 1_000
seriesOfBins.name = "Which bin does each name belong to?"
#Examine the dtype of the seriesOfBins.
dtype = seriesOfBins.dtype
print(f"{dtype.name = }")
print(f"{dtype.ordered = }")
print(f"{type(dtype.categories) = }")
print(f"{dtype.categories.closed = }")
print(f"{len(dtype.categories) = }")
print()
for category in dtype.categories:
    print(category)
print()
print(seriesOfBins)
print()
seriesOfCounts = seriesOfBins.value_counts(sort = False)
seriesOfCounts.name = "Number of names in each bin"
print(seriesOfCounts)
sys.exit(0)
name
Liam 19837
Noah 18267
William 14516
James 13525
Oliver 13389
...
Zylas 5
Zyran 5
Zyrie 5
Zyron 5
Zzyzx 5
Name: Number of Births, Length: 14004, dtype: int64
dtype.name = 'category'
dtype.ordered = True
type(dtype.categories) = <class 'pandas.core.indexes.interval.IntervalIndex'>
dtype.categories.closed = 'left'
len(dtype.categories) = 20
[0, 1000)
[1000, 2000)
[2000, 3000)
[3000, 4000)
[4000, 5000)
[5000, 6000)
[6000, 7000)
[7000, 8000)
[8000, 9000)
[9000, 10000)
[10000, 11000)
[11000, 12000)
[12000, 13000)
[13000, 14000)
[14000, 15000)
[15000, 16000)
[16000, 17000)
[17000, 18000)
[18000, 19000)
[19000, 20000)
name
Liam [19000, 20000)
Noah [18000, 19000)
William [14000, 15000)
James [13000, 14000)
Oliver [13000, 14000)
...
Zylas [0, 1000)
Zyran [0, 1000)
Zyrie [0, 1000)
Zyron [0, 1000)
Zzyzx [0, 1000)
Name: Which bin does each name belong to?, Length: 14004, dtype: category
Categories (20, interval[int64]): [[0, 1000) < [1000, 2000) < [2000, 3000) < [3000, 4000) < ... <
[16000, 17000) < [17000, 18000) < [18000, 19000) < [19000, 20000)]
[0, 1000) 13672
[1000, 2000) 131
[2000, 3000) 72
[3000, 4000) 34
[4000, 5000) 22
[5000, 6000) 14
[6000, 7000) 15
[7000, 8000) 6
[8000, 9000) 11
[9000, 10000) 9
[10000, 11000) 3
[11000, 12000) 5
[12000, 13000) 4
[13000, 14000) 3
[14000, 15000) 1
[15000, 16000) 0
[16000, 17000) 0
[17000, 18000) 0
[18000, 19000) 1
[19000, 20000) 1
Name: Number of names in each bin, dtype: int64
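To see what right = False changes, it may help to bin a handful of values that sit exactly on the bin edges. The following is a minimal sketch with made-up values (not taken from the baby names data) and with made-up variable names. It also shows that value_counts lists a bin even when no value falls in it, as with the three empty bins in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to land on or next to the bin edges.
counts = pd.Series([5, 999, 1000, 1999, 2000], name="births")
bins = np.arange(0, 5_000, 1_000)   # 5 edges (0, 1000, ..., 4000) bounding 4 bins

# right = False: each bin includes its left edge, e.g. [1000, 2000).
leftClosed = pd.cut(counts, bins=bins, right=False)

# Default (right = True): each bin includes its right edge, e.g. (0, 1000].
rightClosed = pd.cut(counts, bins=bins)

print(pd.DataFrame({
    "births": counts,
    "right=False": leftClosed,
    "right=True": rightClosed,
}))
print()

# sort = False keeps the bins in their natural order; empty bins show a count of 0.
countsPerBin = leftClosed.value_counts(sort=False)
print(countsPerBin)
```

The value 1000 falls in [1000, 2000) when right = False but in (0, 1000] by default, which is why the program above passes right = False to make each bin start at a round number.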