A pd.Series of baby names

The following program creates a Series of the baby boy names of 2018, in order of decreasing popularity. Names of equal popularity are in alphabetical order. Names given to fewer than five babies are not listed.

"""
Baby names example from Wes McKinney, "Python for Data Analysis", 2nd ed., pp. 424-440.
Social Security Administration, Office of the Chief Actuary
http://www.ssa.gov/oact/babynames/limits.html
Download and unzip the .zip file for National Data:
https://www.ssa.gov/oact/babynames/names.zip
I placed the resulting folder in my /Users/myname/Downloads folder.
"""

import sys
import os
import pandas as pd

year = 2018   #most recent year available when this was written
filename = os.path.expanduser(f"~/Downloads/names/yob{year}.txt") #year of birth
names = ["name", "sex", "births"]
df = pd.read_csv(filename, names = names)
df = df[df.sex == "M"]   #Keep the male rows.
df = df.sort_values(by = ["births", "name"], ascending = [False, True])   #Most births first; ties in alphabetical order.

index = pd.Index(data = df["name"], name = "name")
series = pd.Series(data = df.births.array, index = index, name = "Number of Births")

with pd.option_context("display.min_rows", 40):   #context manager
    print(series)

sys.exit(0)

name
Liam         19837
Noah         18267
William      14516
James        13525
Oliver       13389
Benjamin     13381
Elijah       12886
Lucas        12585
Mason        12435
Logan        12352
Alexander    11989
Ethan        11854
Jacob        11770
Michael      11620
Daniel       11173
Henry        10649
Jackson      10323
Sebastian    10054
Aiden         9979
Matthew       9924
             ...  
Zien             5
Zier             5
Zierre           5
Zihir            5
Zim              5
Zin              5
Zishe            5
Zmari            5
Zoel             5
Zola             5
Zuber            5
Zubeyr           5
Zyell            5
Zyheem           5
Zykeem           5
Zylas            5
Zyran            5
Zyrie            5
Zyron            5
Zzyzx            5
Name: Number of Births, Length: 14004, dtype: int64

Do something interesting with this Series, starting by verifying that the names in the index are unique. (There should be only one row for each name.) What is the total number of names? What is the total number of baby boys? How many names had 10,000 or more babies? How many babies had one of those names? What percentage of the babies had names in the top 10? Which initial had the most names? (It was A.) Which initial had the most babies? How many names started with A? How many babies had names that started with A? What were the ten most popular names that started with A?
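
One possible way to approach these questions is sketched below. To stay self-contained it does not read the file; instead it builds a tiny Series from seven rows copied out of the output above, so the numbers it prints are not the answers for the full 2018 data. The variable names (popular, percentTop3, namesPerInitial, and so on) are made up for this illustration, and the top 3 stands in for the top 10 because the sample is so small.

```python
import pandas as pd

#Seven rows copied from the output above; NOT the full 2018 data.
index = pd.Index(
    data = ["Liam", "Noah", "William", "Alexander", "Aiden", "Anthony", "Zzyzx"],
    name = "name"
)
series = pd.Series(
    data = [19837, 18267, 14516, 11989, 9979, 8003, 5],
    index = index,
    name = "Number of Births"
)

assert series.index.is_unique   #Only one row for each name.

numberOfNames = len(series)
numberOfBabies = series.sum()

popular = series[series >= 10_000]   #names with 10,000 or more babies
percentTop3 = 100 * series.nlargest(3).sum() / numberOfBabies   #top 3 here, top 10 in the exercise

initials = series.index.str[0]   #first letter of each name
namesPerInitial = series.groupby(initials).count()
babiesPerInitial = series.groupby(initials).sum()
topANames = series[initials == "A"].nlargest(10)   #at most 10 names

print(f"{numberOfNames = }")
print(f"{numberOfBabies = }")
print(f"{len(popular) = }")
print(f"{popular.sum() = }")
print(f"{percentTop3 = :.1f}")
print(f"{namesPerInitial.idxmax() = }")
print(f"{babiesPerInitial.idxmax() = }")
print(topANames)
```

With the real Series from the program above in place of the seven-row sample, the same expressions should answer the questions for all 14004 names.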

groupby the initial letter

import sys
import os
import pandas as pd

year = 2018
filename = os.path.expanduser(f"~/Downloads/names/yob{year}.txt") #year of birth

names = ["name", "sex", "births"]
df = pd.read_csv(filename, names = names)
df = df[df.sex == "M"]
df = df.sort_values(by = ["births", "name"], ascending = [False, True])   #Most births first; ties in alphabetical order.

series = df["births"]
series.index = df["name"]
series.name = "Number of births for each name"
series.index.name = "name"

groups = series.groupby(series.index.str[0])   #Initials in alphabetical order

numberOfNames = groups.count()
numberOfNames.name = "Number of names for each initial"
with pd.option_context("display.max_rows", 6):
    print(numberOfNames)
print()

#The same Series, in descending numerical order.
#print(numberOfNames.sort_values(ascending = False))
#print()

numberOfBabies = groups.sum()
numberOfBabies.name = "Number of babies for each initial"
with pd.option_context("display.max_rows", 6):
    print(numberOfBabies)
print()

#The same Series, in descending numerical order.
#print(numberOfBabies.sort_values(ascending = False))
#print()

#A new variable name, because these are the top names, not the total babies per initial.
topThree = groups.apply(lambda group: group.sort_values(ascending = False)[:3])
topThree.index.names = ["initial", "name"]
topThree.name = "Three most popular names for each initial"
with pd.option_context("display.max_rows", 3 * 26):
    print(topThree)

sys.exit(0)

name
A    1569
B     668
C     712
     ... 
X      60
Y     259
Z     402
Name: Number of names for each initial, Length: 26, dtype: int64

name
A    180564
B     91509
C    135023
      ...  
X      7864
Y      8163
Z     26063
Name: Number of babies for each initial, Length: 26, dtype: int64

initial  name       
A        Alexander      11989
         Aiden           9979
         Anthony         8003
B        Benjamin       13381
         Brayden         4383
         Bryson          4194
C        Carter          9312
         Christopher     7261
         Caleb           6929
D        Daniel         11173
         David           9697
         Dylan           8549
E        Elijah         12886
         Ethan          11854
         Eli             6027
F        Finn            2316
         Felix           1638
         Finley          1280
G        Grayson         8538
         Gabriel         8335
         Greyson         4728
H        Henry          10649
         Hudson          6540
         Hunter          6066
I        Isaac           8417
         Isaiah          6614
         Ian             4675
J        James          13525
         Jacob          11770
         Jackson        10323
K        Kayden          3972
         Kai             3421
         Kingston        3330
L        Liam           19837
         Lucas          12585
         Logan          12352
M        Mason          12435
         Michael        11620
         Matthew         9924
N        Noah           18267
         Nathan          6790
         Nolan           5607
O        Oliver         13389
         Owen            9288
         Oscar           1945
P        Parker          3978
         Patrick         2111
         Preston         1954
Q        Quinn            828
         Quentin          511
         Quinton          456
R        Ryan            6905
         Robert          5140
         Roman           4364
S        Sebastian      10054
         Samuel          9734
         Santiago        4647
T        Theodore        7020
         Thomas          6779
         Tyler           3298
U        Uriel            580
         Uriah            461
         Ulises           236
V        Vincent         3552
         Victor          2213
         Valentino        396
W        William        14516
         Wyatt           9127
         Weston          3760
X        Xavier          4298
         Xander          2257
         Xzavier          253
Y        Yusuf            485
         Yosef            328
         Yousef           285
Z        Zachary         3528
         Zion            2153
         Zayden          2126
Name: Three most popular names for each initial, dtype: int64
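
The apply call above sorts each group and keeps its first three rows. Because the Series is already sorted by number of births, groupby(...).head(n) might be a simpler way to pick the same names without a lambda. The sketch below compares the two on six rows copied from the output above, keeping two names per initial instead of three. The names viaApply and viaHead are made up for this illustration.

```python
import pandas as pd

#Six rows copied from the output above, already in descending order of births.
index = pd.Index(
    data = ["Liam", "Noah", "Lucas", "Logan", "Nathan", "Nolan"],
    name = "name"
)
series = pd.Series(
    data = [19837, 18267, 12585, 12352, 6790, 5607],
    index = index,
    name = "Number of births for each name"
)

groups = series.groupby(series.index.str[0])

#As in the program above: sort each group, keep the first rows.
#The result has a two-level index (initial, name).
viaApply = groups.apply(lambda group: group.sort_values(ascending = False).head(2))
viaApply.index.names = ["initial", "name"]
print(viaApply)
print()

#Relies on the Series being sorted already: keep the first rows of each group.
#The result keeps the original one-level index and the original row order.
viaHead = groups.head(2)
print(viaHead)
```

Both results contain the same names. The difference is in the shape: apply returns the names grouped under their initial, while head returns them interleaved in overall order of popularity.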

"Divide the frequencies into bins."

import sys
import os
import numpy as np
import pandas as pd

year = 2018   #most recent year available when this was written
filename = os.path.expanduser(f"~/Downloads/names/yob{year}.txt") #year of birth
names = ["name", "sex", "births"]
df = pd.read_csv(filename, names = names)
df = df[df.sex == "M"]   #Keep the male rows.
df = df.sort_values(by = ["births", "name"], ascending = [False, True])   #Most births first; ties in alphabetical order.

index = pd.Index(data = df["name"], name = "name")
series = pd.Series(data = df.births.array, index = index, name = "Number of Births")

with pd.option_context("display.min_rows", 10):   #context manager
    print(series)
print()

bins = np.arange(0, 21_000, 1_000)   #21 bin edges (0, 1000, ..., 20000), which define the 20 bins
seriesOfBins = pd.cut(series, bins = bins, right = False) #first bin includes 0 but not 1_000
seriesOfBins.name = "Which bin does each name belong to?"

#Examine the dtype of the seriesOfBins.

dtype = seriesOfBins.dtype
print(f"{dtype.name = }")
print(f"{dtype.ordered = }")
print(f"{type(dtype.categories) = }")
print(f"{dtype.categories.closed = }")
print(f"{len(dtype.categories) = }")
print()

for category in dtype.categories:
    print(category)
print()

print(seriesOfBins)
print()

seriesOfCounts = seriesOfBins.value_counts(sort = False)
seriesOfCounts.name = "Number of names in each bin"
print(seriesOfCounts)

sys.exit(0)

name
Liam       19837
Noah       18267
William    14516
James      13525
Oliver     13389
           ...  
Zylas          5
Zyran          5
Zyrie          5
Zyron          5
Zzyzx          5
Name: Number of Births, Length: 14004, dtype: int64

dtype.name = 'category'
dtype.ordered = True
type(dtype.categories) = <class 'pandas.core.indexes.interval.IntervalIndex'>
dtype.categories.closed = 'left'
len(dtype.categories) = 20

[0, 1000)
[1000, 2000)
[2000, 3000)
[3000, 4000)
[4000, 5000)
[5000, 6000)
[6000, 7000)
[7000, 8000)
[8000, 9000)
[9000, 10000)
[10000, 11000)
[11000, 12000)
[12000, 13000)
[13000, 14000)
[14000, 15000)
[15000, 16000)
[16000, 17000)
[17000, 18000)
[18000, 19000)
[19000, 20000)

name
Liam       [19000, 20000)
Noah       [18000, 19000)
William    [14000, 15000)
James      [13000, 14000)
Oliver     [13000, 14000)
                ...      
Zylas           [0, 1000)
Zyran           [0, 1000)
Zyrie           [0, 1000)
Zyron           [0, 1000)
Zzyzx           [0, 1000)
Name: Which bin does each name belong to?, Length: 14004, dtype: category
Categories (20, interval[int64]): [[0, 1000) < [1000, 2000) < [2000, 3000) < [3000, 4000) < ... <
                                   [16000, 17000) < [17000, 18000) < [18000, 19000) < [19000, 20000)]

[0, 1000)         13672
[1000, 2000)        131
[2000, 3000)         72
[3000, 4000)         34
[4000, 5000)         22
[5000, 6000)         14
[6000, 7000)         15
[7000, 8000)          6
[8000, 9000)         11
[9000, 10000)         9
[10000, 11000)        3
[11000, 12000)        5
[12000, 13000)        4
[13000, 14000)        3
[14000, 15000)        1
[15000, 16000)        0
[16000, 17000)        0
[17000, 18000)        0
[18000, 19000)        1
[19000, 20000)        1
Name: Number of names in each bin, dtype: int64
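
To see what right = False changes, here is a small sketch with four made-up birth counts, two of which sit exactly on a bin edge. The values and the names leftClosed and rightClosed are invented for this illustration; only pd.cut and value_counts come from the program above.

```python
import numpy as np
import pandas as pd

births = pd.Series(data = [5, 1000, 1999, 2000])   #made-up values; 1000 and 2000 are bin edges
bins = np.arange(0, 4_000, 1_000)   #4 bin edges (0, 1000, 2000, 3000), which define 3 bins

#right = False: each bin includes its left edge but not its right edge.
leftClosed = pd.cut(births, bins = bins, right = False)
print(leftClosed.value_counts(sort = False))
print()

#The default, right = True: each bin includes its right edge but not its left edge.
rightClosed = pd.cut(births, bins = bins)
print(rightClosed.value_counts(sort = False))
```

With right = False, 1000 falls in [1000, 2000) and 2000 falls in [2000, 3000). With the default, they fall in (0, 1000] and (1000, 2000] respectively, so the last bin is empty. In both cases value_counts(sort = False) lists the bins in their natural order and includes the bins whose count is 0.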