A Python
dict
cannot have more than one instance of the same key:
dictionary = { "AL": "Alabama", "AK": "Alaska", "AR": "Arizona", #illegal "AR": "Arkansas", #illegal "CA": "California" }
But a pandas
Series
can have more than one instance of the same index.
In the following
Series
,
for example,
10,859 of the 27,200 rows have the index
"Manhattan"
.
"Create a Series giving the borough and cuisine of each restaurant in New York City." import sys import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") print(f"{len(series) = :,}") print(f'{len(series["Manhattan"]) = :,}') print() #The series has too many rows to print all of them. print(series.head()) #Just the first 5 rows. Means print(series.iloc[:5]) print() print(series.tail(3)) #Just the last 3 rows. Means print(series.iloc[-3:]) sys.exit(0)
len(series) = 27,200 len(series["Manhattan"]) = 10,859 Borough Bronx Bakery Brooklyn Hamburgers Manhattan Irish Brooklyn American Queens American Name: Cuisine, dtype: object Borough Queens Other Brooklyn Other Brooklyn Other Name: Cuisine, dtype: object
To see only the first three and the last three rows,
pd.set_option("max_rows", 6) print(series)
"Create a Series containing the rows that have an invalid borough." import sys import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] seriesOfBools = ~series.index.isin(valid) #tilde means "not" seriesOfBadRows = series[seriesOfBools] seriesOfBadRows.name = "bad rows" #Print the length of the seriesOfBadRows too. s = seriesOfBadRows.to_string(dtype = True, length = True, name = True) print(s) sys.exit(0)
Borough 0 American 0 American 0 American 0 Scandinavian 0 Hawaiian 0 Other 0 Other 0 Other 0 Other 0 Other 0 Other 0 Other 0 Other 0 Other 0 Other Name: bad rows, Length: 15, dtype: object
Unfortunately,
15 of the 27,200 rows have a bad borough.
We can delete these rows by inserting the following statement
immediately after creating
valid
.
This is called
data cleaning.
series = series[valid]
The value of the expression
series["Staten Island"]
is a
Series
containing 983 rows.
(Warning: had there been only one restaurant in Staten Island,
the value of the expression
series["Staten Island"]
would have been a
str
,
not a
Series
containing one row.)
"Create a Series giving the borough and cuisine of each restaurant in New York City." import sys import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] series = series[valid] #Create a Series of all the restaurants in Staten Island. statens = series["Staten Island"] #Create a Series of all the Chinese Restaurants in New York City. seriesOfBools = series == "Chinese" chineses = series[seriesOfBools] seriesOfBools = statens == "Chinese" statenChineses = statens[seriesOfBools] print(f"{len(series):6,} restaurants in New York City with known locations.") print(f"{len(chineses):6,} Chinese restaurants in New York City.") print(f"{len(statens):6,} restaurants in Staten Island.") print(f"{len(statenChineses):6,} Chinese restaurants in Staten Island.") sys.exit(0)
27,185 restaurants in New York City with known locations. 2,396 Chinese restaurants in New York City. 983 restaurants in Staten Island. 78 Chinese restaurants in Staten Island.
If you know what you’re doing, you can simplify
seriesOfBools = series == "Chinese" chineses = series[seriesOfBools]to
chineses = series[series == "Chinese"]
How many Italian restaurants does Manhattan have? The last time I counted (excluding pizza), there were about 9,000.
Index.unique
returns a
pd.Index
;
Series.unique
returns an
np.ndarray
.
Series.unique
can’t create and return a new
Series
,
because it has no index to give to the new
Series
.
""" How many boroughs have restaurants? How many cuisines does New York City have? """ import sys import numpy as np import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] series = series[valid] index = series.index if not index.is_unique: index = index.unique() for i, borough in enumerate(index.sort_values(), start = 1): print(i, borough) print() seriesOfCuisines = series if not seriesOfCuisines.is_unique: seriesOfCuisines = pd.Series(data = seriesOfCuisines.unique()) for i, cuisine in enumerate(seriesOfCuisines.sort_values(), start = 1): print(f"{i:2} {cuisine}") sys.exit(0)
1 Bronx 2 Brooklyn 3 Manhattan 4 Queens 5 Staten Island 1 Afghan 2 African 3 American 4 Armenian 5 Asian 6 Australian 7 Bagels/Pretzels 8 Bakery 9 Bangladeshi 10 Barbecue 11 Basque 12 Bottled beverages, including water, sodas, juices, etc. 13 Brazilian 14 Café/Coffee/Tea 15 Cajun 16 Californian 17 Caribbean 18 Chicken 19 Chilean 20 Chinese 21 Chinese/Cuban 22 Chinese/Japanese 23 Continental 24 Creole 25 Creole/Cajun 26 Czech 27 Delicatessen 28 Donuts 29 Eastern European 30 Egyptian 31 English 32 Ethiopian 33 Filipino 34 French 35 Fruits/Vegetables 36 German 37 Greek 38 Hamburgers 39 Hawaiian 40 Hotdogs 41 Hotdogs/Pretzels 42 Ice Cream, Gelato, Yogurt, Ices 43 Indian 44 Indonesian 45 Iranian 46 Irish 47 Italian 48 Japanese 49 Jewish/Kosher 50 Juice, Smoothies, Fruit Salads 51 Korean 52 Latin (Cuban, Dominican, Puerto Rican, South & Central American) 53 Mediterranean 54 Mexican 55 Middle Eastern 56 Moroccan 57 Not Listed/Not Applicable 58 Nuts/Confectionary 59 Other 60 Pakistani 61 Pancakes/Waffles 62 Peruvian 63 Pizza 64 Pizza/Italian 65 Polish 66 Portuguese 67 Russian 68 Salads 69 Sandwiches 70 Sandwiches/Salads/Mixed Buffet 71 Scandinavian 72 Seafood 73 Soul Food 74 Soups 75 Soups & Sandwiches 76 Southwestern 77 Spanish 78 Steak 79 Tapas 80 Tex-Mex 81 Thai 82 Turkish 83 Vegetarian 84 Vietnamese/Cambodian/Malaysia
The file
christmas.txt
contains 1,000 random lowercase letters,
one per line.
I read them into a
Series
named
seriesOfLetters
and printed the first five.
import sys import pandas as pd url = "http://oit2.scps.nyu.edu/~meretzkm/pandas/series/christmas.txt" df = pd.read_csv(url, header = None) #Create a pd.DataFrame containing one column numbered 0 seriesOfLetters = df[0] #Get that column. seriesOfLetters.name = "Random Letters" print(seriesOfLetters.head()) print() seriesOfNumbers = seriesOfLetters.value_counts() seriesOfNumbers.index.name = "letter" seriesOfNumbers.name = "Frequencies" print(seriesOfNumbers) print() print(seriesOfNumbers.sort_index()) #alphabetical order print() print(f"{len(seriesOfLetters) = :,}") print(f"{seriesOfNumbers.sum() = :,}") sys.exit(0)
0 x 1 v 2 o 3 g 4 g Name: Random Letters, dtype: object letter w 51 x 51 e 50 b 50 n 48 c 46 m 45 v 44 k 42 d 42 j 41 p 40 q 40 t 38 u 37 s 37 y 36 r 35 o 35 g 34 z 34 f 32 a 32 h 31 i 29 Name: Frequencies, dtype: int64 letter a 32 b 50 c 46 d 42 e 50 f 32 g 34 h 31 i 29 j 41 k 42 m 45 n 48 o 35 p 40 q 40 r 35 s 37 t 38 u 37 v 44 w 51 x 51 y 36 z 34 Name: Frequencies, dtype: int64 len(seriesOfLetters) = 1,000 seriesOfNumbers.sum() = 1,000
The frequencies of the letters contain a Christmas message of peace and goodwill for all mankind, or humankind. Can you figure out the message?
#Hint 1. print(f"{len(seriesOfNumbers) = :,}")
#Hint 2. data = list(string.ascii_lowercase) #remember to import string index = pd.Index(data = data, name = "every letter") print(seriesOfNumbers.reindex(index = index, fill_value = 0)) #or fill_value = 1_000_000
""" What's the most popular cuisine in Staten Island? What borough has the most Chinese restaurants? """ import sys import numpy as np import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] series = series[valid] statens = series["Staten Island"] if len(statens) == 0: print("Staten Island has no restaurants.") else: seriesOfCounts = statens.value_counts() cuisine = seriesOfCounts.idxmax() print(f"The most popular cuisine in Staten Island is {cuisine.upper()}.") print(f"{cuisine} is number {np.argmax(seriesOfCounts.array) + 1} on the list!") print(f"There are {seriesOfCounts.max()} of them in Staten Island.") print() chineses = series[series == "Chinese"] if len(chineses) == 0: print("New York City has no Chinese restaurants.") else: seriesOfCounts = chineses.index.value_counts() borough = seriesOfCounts.idxmax() print(f"The borough with the most Chinese restaurants is {borough.upper()}.") print(f"{borough} is number {np.argmax(seriesOfCounts.array) + 1} on the list!") print(f"There are {seriesOfCounts.max()} of them in {borough}.") sys.exit(0)
The most popular cuisine in Staten Island is AMERICAN. American is number 1 on the list! There are 219 of them in Staten Island. The borough with the most Chinese restaurants is QUEENS. Queens is number 1 on the list! There are 781 of them in Queens.
import sys import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] series = series[valid] print("Boroughs in decreasing order:") seriesOfInts = series.index.value_counts(ascending = False) #ascending = False is the default print(seriesOfInts) #seriesOfInts is a pd.Series print() print("Boroughs in alphabetical order:") print(seriesOfInts.sort_index(ascending = True)) #ascending = True is the default print() print("Cuisines in decreasing order:") seriesOfInts = series.value_counts() #seriesOfInts is a pd.Series s = seriesOfInts.to_string(max_rows = None) print(s) print() print("Cuisines in alphabetical order:") seriesOfInts.sort_index(inplace = True) s = seriesOfInts.to_string(max_rows = None) print(s) sys.exit(0)
Boroughs in decreasing order: Manhattan 10859 Brooklyn 6788 Queens 6158 Bronx 2397 Staten Island 983 Name: Borough, dtype: int64 Boroughs in alphabetical order: Bronx 2397 Brooklyn 6788 Manhattan 10859 Queens 6158 Staten Island 983 Name: Borough, dtype: int64 Cuisines in decreasing order: American 5759 Chinese 2396 Other 1826 Café/Coffee/Tea 1808 Pizza 1198 Mexican 967 Italian 960 Japanese 872 Latin (Cuban, Dominican, Puerto Rican, South & Central American) 823 Bakery 751 Caribbean 703 Spanish 636 Donuts 519 Chicken 511 Pizza/Italian 473 Juice, Smoothies, Fruit Salads 422 Asian 413 Hamburgers 365 Sandwiches 352 Ice Cream, Gelato, Yogurt, Ices 345 Jewish/Kosher 331 Indian 316 French 312 Korean 307 Thai 305 Delicatessen 297 Mediterranean 281 Sandwiches/Salads/Mixed Buffet 233 Seafood 193 Middle Eastern 192 Irish 191 Bagels/Pretzels 157 Greek 138 Vegetarian 130 Tex-Mex 126 Bottled beverages, including water, sodas, juices, etc. 109 Vietnamese/Cambodian/Malaysia 103 Peruvian 83 Steak 79 Salads 78 African 73 Turkish 72 Eastern European 70 Russian 69 Soul Food 66 Barbecue 54 Hawaiian 47 Continental 47 Soups & Sandwiches 43 Chinese/Japanese 43 Bangladeshi 43 Tapas 41 Hotdogs/Pretzels 32 Filipino 31 Brazilian 29 Pakistani 29 Hotdogs 28 German 28 Armenian 26 Polish 25 Australian 22 Creole 22 Chinese/Cuban 21 Ethiopian 17 Pancakes/Waffles 15 Afghan 14 English 13 Egyptian 12 Portuguese 11 Indonesian 10 Moroccan 8 Soups 8 Not Listed/Not Applicable 8 Scandinavian 7 Creole/Cajun 6 Fruits/Vegetables 6 Cajun 6 Nuts/Confectionary 6 Czech 4 Californian 4 Iranian 4 Southwestern 3 Chilean 1 Basque 1 Cuisines in alphabetical order: Afghan 14 African 73 American 5759 Armenian 26 Asian 413 Australian 22 Bagels/Pretzels 157 Bakery 751 Bangladeshi 43 Barbecue 54 Basque 1 Bottled beverages, including water, sodas, juices, etc. 109 Brazilian 29 Café/Coffee/Tea 1808 Cajun 6 Californian 4 Caribbean 703 Chicken 511 Chilean 1 Chinese 2396 Chinese/Cuban 21 Chinese/Japanese 43 Continental 47 Creole 22 Creole/Cajun 6 Czech 4 Delicatessen 297 Donuts 519 Eastern European 70 Egyptian 12 English 13 Ethiopian 17 Filipino 31 French 312 Fruits/Vegetables 6 German 28 Greek 138 Hamburgers 365 Hawaiian 47 Hotdogs 28 Hotdogs/Pretzels 32 Ice Cream, Gelato, Yogurt, Ices 345 Indian 316 Indonesian 10 Iranian 4 Irish 191 Italian 960 Japanese 872 Jewish/Kosher 331 Juice, Smoothies, Fruit Salads 422 Korean 307 Latin (Cuban, Dominican, Puerto Rican, South & Central American) 823 Mediterranean 281 Mexican 967 Middle Eastern 192 Moroccan 8 Not Listed/Not Applicable 8 Nuts/Confectionary 6 Other 1826 Pakistani 29 Pancakes/Waffles 15 Peruvian 83 Pizza 1198 Pizza/Italian 473 Polish 25 Portuguese 11 Russian 69 Salads 78 Sandwiches 352 Sandwiches/Salads/Mixed Buffet 233 Scandinavian 7 Seafood 193 Soul Food 66 Soups 8 Soups & Sandwiches 43 Southwestern 3 Spanish 636 Steak 79 Tapas 41 Tex-Mex 126 Thai 305 Turkish 72 Vegetarian 130 Vietnamese/Cambodian/Malaysia 103
"Which boroughs have Cajun restaurants?" import sys import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] series = series[valid] seriesOfBools = series == "Cajun" #Create a Series containing about 27,000 bools. 6 of them are True. cajuns = series[seriesOfBools] #Create a Series containing 6 strings. print(cajuns) print() seriesOfInts = cajuns.index.value_counts() #Create a Series containing only 3 ints. print(seriesOfInts) print() newIndex = series.index.unique() seriesOfInts = seriesOfInts.reindex(newIndex, fill_value = 0) #Create a Series containing 5 ints. print(seriesOfInts) print() seriesOfBools = seriesOfInts > 0 #Create a Series containing 5 bools. print(seriesOfBools) sys.exit(0)
Borough Staten Island Cajun Queens Cajun Brooklyn Cajun Brooklyn Cajun Brooklyn Cajun Brooklyn Cajun Name: Cuisine, dtype: object Brooklyn 4 Queens 1 Staten Island 1 Name: Borough, dtype: int64 Borough Bronx 0 Brooklyn 4 Manhattan 0 Queens 1 Staten Island 1 Name: Borough, dtype: int64 Borough Bronx False Brooklyn True Manhattan False Queens True Staten Island True Name: Borough, dtype: bool
In the following program,
the first call to the
groupby
method splits the
series
into five smaller
Series
es called
groups.
That’s because
series.index
contains five distinct values
("Bronx"
,
"Brooklyn"
,
"Manhattan"
,
etc).
Of course,
series.index
contains many copies of each of these five values.
The second call to
groupby
splits the
series
into 84 groups.
import sys import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] series = series[valid] #Group by borough in alphabetical order. groups = series.groupby(series.index) for name, group in groups: #Each name is a str, each group is a Series. print(f"{len(group):6,} {name}") print() #Group by cuisine in alphabetical order. groups = series.groupby(series) for name, group in groups: print(f"{len(group):5,} {name}") sys.exit(0)
To group by borough in increasing numeric order, change
for name, group in groups: #Each name is a str, each group is a Series.to
#Each name is a str, each group is a Series. for name, group in sorted(groups, key = lambda t: len(t[1])): #Each t is a tuple containing 2 items.in the above program.
2,392 Bronx 6,771 Brooklyn 10,837 Manhattan 6,150 Queens 982 Staten Island 14 Afghan 73 African 5,739 American 26 Armenian 408 Asian 22 Australian 157 Bagels/Pretzels 750 Bakery 43 Bangladeshi 54 Barbecue 1 Basque 109 Bottled beverages, including water, sodas, juices, etc. 29 Brazilian 1,807 Café/Coffee/Tea 6 Cajun 4 Californian 703 Caribbean 508 Chicken 1 Chilean 2,390 Chinese 21 Chinese/Cuban 43 Chinese/Japanese 47 Continental 22 Creole 6 Creole/Cajun 4 Czech 297 Delicatessen 518 Donuts 70 Eastern European 12 Egyptian 13 English 17 Ethiopian 31 Filipino 311 French 6 Fruits/Vegetables 28 German 138 Greek 365 Hamburgers 47 Hawaiian 28 Hotdogs 32 Hotdogs/Pretzels 345 Ice Cream, Gelato, Yogurt, Ices 316 Indian 10 Indonesian 4 Iranian 191 Irish 959 Italian 873 Japanese 331 Jewish/Kosher 422 Juice, Smoothies, Fruit Salads 306 Korean 822 Latin (Cuban, Dominican, Puerto Rican, South & Central American) 281 Mediterranean 967 Mexican 191 Middle Eastern 8 Moroccan 8 Not Listed/Not Applicable 6 Nuts/Confectionary 1,818 Other 29 Pakistani 15 Pancakes/Waffles 83 Peruvian 1,197 Pizza 473 Pizza/Italian 25 Polish 11 Portuguese 69 Russian 78 Salads 352 Sandwiches 233 Sandwiches/Salads/Mixed Buffet 7 Scandinavian 193 Seafood 66 Soul Food 8 Soups 43 Soups & Sandwiches 3 Southwestern 636 Spanish 79 Steak 41 Tapas 126 Tex-Mex 304 Thai 71 Turkish 130 Vegetarian 102 Vietnamese/Cambodian/Malaysia
The above program merely prints the
len
and name of each group.
The following program performs a more elaborate computation
for each group.
The program does not bother to create a
Series
of all the Chinese restaurants in New York City.
It merely computes the number of rows
(n2
)
that such a
Series
would have.
"Chinese restaurants by borough." import sys import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] series = series[valid] seriesOfBools = series == "Chinese" n2 = seriesOfBools.sum() #number of Chinese restaurants in New York City #Group by borough in alphabetical order. groups = series.groupby(series.index) for name, group in groups: #Each name is a str, each group is a Series. m = (group == "Chinese").sum() #number of Chinese restaurants in this borough n1 = len(group) percent1 = 100 * m / n1 percent2 = 100 * m / n2 the = "the " if name == "Bronx" else "" print(f"{m} of the restaurants in {the}{name} are Chinese.") print(f"That's {percent1:.1f}% of the {len(group):,} restaurants in {the}{name},",) print(f"and {percent2:.1f}% of the {n2:,} Chinese restaurants in New York City.") print() sys.exit(0)
295 of the restaurants in Bronx are Chinese. That's 12.3% of the 2,397 restaurants in the Bronx, and 12.3% of the 2,396 Chinese restaurants in New York City. 731 of the restaurants in Brooklyn are Chinese. That's 10.8% of the 6,788 restaurants in Brooklyn, and 30.5% of the 2,396 Chinese restaurants in New York City. 511 of the restaurants in Manhattan are Chinese. That's 4.7% of the 10,859 restaurants in Manhattan, and 21.3% of the 2,396 Chinese restaurants in New York City. 781 of the restaurants in Queens are Chinese. That's 12.7% of the 6,158 restaurants in Queens, and 32.6% of the 2,396 Chinese restaurants in New York City. 78 of the restaurants in Staten Island are Chinese. That's 7.9% of the 983 restaurants in Staten Island, and 3.3% of the 2,396 Chinese restaurants in New York City.
A function passed to
agg
is called the
aggregating function.
In this example,
the two aggregating functions simple enough to be
lambda
functions.
As in the above programs,
the
Series.groupby
method
splits
the
Series
into smaller
Series
es called
groups.
The aggregating function is then
applied
to each group.
In other words,
agg
passes each group to the aggregating function.
In this program,
the first aggregating function is called five times
and the second aggregating function is called 84 times.
In this program,
the argument of the aggregating function is a group.
(Later, the argument might be a
DataFrame
.)
The return value of the aggregating function must be a single value,
not a
Series
.
In this program,
each aggregating function returns an integer.
The
agg
function
combines
the return value of each call to the aggregating function
into a brand new
Series
and returns this
Series
.
""" How many different cuisines does each borough offer? How many boroughs offer each cuisine? """ import sys import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] series = series[valid] print("How many cuisines does each borough offer?") groups = series.groupby(series.index) #Group by borough in alphabetical order. seriesOfNumbers = groups.agg(lambda group: len(group.unique())) #Each group is a Series. seriesOfNumbers.name = "How Many Cusisines" print(seriesOfNumbers) print() print("How many boroughs offer each cuisine?") groups = series.groupby(series) #Group by cuisine in alphabetical order. seriesOfNumbers = groups.agg(lambda group: len(group.index.unique())) #Each group is a Series. seriesOfNumbers.name = "How Many Boroughs" print(seriesOfNumbers) sys.exit(0)
How many cuisines does each borough offer? Borough Bronx 53 Brooklyn 79 Manhattan 80 Queens 78 Staten Island 54 Name: How Many Cusisines, dtype: int64 How many boroughs offer each cuisine? Cuisine Afghan 3 African 5 American 5 Armenian 4 Asian 5 .. Tex-Mex 5 Thai 5 Turkish 4 Vegetarian 4 Vietnamese/Cambodian/Malaysia 5 Name: How Many Boroughs, Length: 84, dtype: int64
Like
GroupBy.agg
,
GroupBy.apply
receives an aggregating function.
But the aggregating function passed to
GroupBy.apply
does not have to return a single value.
It can return an entire
Series
.
Until now, the index of a
Series
has been an a single column.
In other words, the index of each row has been a single value
such as
0
or
1
.
But in the following program,
the index of the first
seriesOfNumbers
is two columns named
"Borough"
and
"Cuisine"
.
In other words, the index of each row is now a
tuple
of two items.
For example, the index of the first row is the
tuple
("Bronx", "American")
and the index of the second row is the
tuple
("Bronx", "Chinese")
.
(The first item in a
tuple
is not printed when it is the same as the first item of the
tuple
in the previous row.)
This type of index is called a
hierarchical index
or
MultiIndex
.
See
Hierarchical
indexing.
In the following program,
the first
seriesOfNumbers
has a
MultiIndex
with two
levels.
The level on the left
("Borough"
)
is called the
outer level,
and the level on the right
("Cuisine"
)
is called the
inner level.
The levels are numbered like the items in a Python
list
:
the outer level is number
0
,
the next level is number
1
,
and a more complicated example might have a level
2
.
The innermost level (number 1
in this example)
is also number
-1
.
import sys import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] series = series[valid] print("What are the three most popular cuisines in each borough?") groups = series.groupby(series.index) #Group by borough in alphabetical order. seriesOfNumbers = groups.apply(lambda group: group.value_counts()[:3]) #Each group is a Series. seriesOfNumbers.index.names = ["Borough", "Cuisine"] seriesOfNumbers.name = "Most popular cuisines in each borough" print(seriesOfNumbers) print() print(seriesOfNumbers.index) print() print("What three boroughs have the most restaurants of each cuisine?") groups = series.groupby(series) #Group by cuisine in alphabetical order. seriesOfNumbers = groups.apply(lambda group: group.index.value_counts()[:3]) #Each group is a Series. seriesOfNumbers.index.names = ["Cuisine", "Borough"] seriesOfNumbers.name = "Boroughs with the most restaurants of each cuisine" s = seriesOfNumbers.to_string(dtype = True, length = True, max_rows = 20, name = True) print(s) print() print(seriesOfNumbers.index) sys.exit(0)
What are the 3 most popular cuisines in each borough? Borough Cuisine Bronx American 402 Chinese 295 Pizza 203 Brooklyn American 1247 Chinese 731 Café/Coffee/Tea 459 Manhattan American 2911 Café/Coffee/Tea 984 Other 818 Queens American 980 Chinese 781 Other 388 Staten Island American 219 Chinese 78 Italian 70 Name: Most popular cuisines in each borough, dtype: int64 MultiIndex([( 'Bronx', 'American'), ( 'Bronx', 'Chinese'), ( 'Bronx', 'Pizza'), ( 'Brooklyn', 'American'), ( 'Brooklyn', 'Chinese'), ( 'Brooklyn', 'Café/Coffee/Tea'), ( 'Manhattan', 'American'), ( 'Manhattan', 'Café/Coffee/Tea'), ( 'Manhattan', 'Other'), ( 'Queens', 'American'), ( 'Queens', 'Chinese'), ( 'Queens', 'Other'), ('Staten Island', 'American'), ('Staten Island', 'Chinese'), ('Staten Island', 'Italian')], names=['Borough', 'Cuisine']) What 3 boroughs have the most restaurants of each cuisine? Cuisine Borough Afghan Queens 8 Manhattan 4 Brooklyn 2 African Bronx 25 Brooklyn 23 Manhattan 20 American Manhattan 2911 Brooklyn 1247 Queens 980 Armenian Manhattan 14 ... Thai Queens 78 Turkish Brooklyn 30 Manhattan 29 Queens 10 Vegetarian Manhattan 70 Brooklyn 47 Queens 12 Vietnamese/Cambodian/Malaysia Manhattan 42 Brooklyn 36 Queens 21 Name: Boroughs with the most restaurants of each cuisine, Length: 239, dtype: int64 MultiIndex([( 'Afghan', 'Queens'), ( 'Afghan', 'Manhattan'), ( 'Afghan', 'Brooklyn'), ( 'African', 'Bronx'), ( 'African', 'Brooklyn'), ( 'African', 'Manhattan'), ( 'American', 'Manhattan'), ( 'American', 'Brooklyn'), ( 'American', 'Queens'), ( 'Armenian', 'Manhattan'), ... ( 'Thai', 'Queens'), ( 'Turkish', 'Brooklyn'), ( 'Turkish', 'Manhattan'), ( 'Turkish', 'Queens'), ( 'Vegetarian', 'Manhattan'), ( 'Vegetarian', 'Brooklyn'), ( 'Vegetarian', 'Queens'), ('Vietnamese/Cambodian/Malaysia', 'Manhattan'), ('Vietnamese/Cambodian/Malaysia', 'Brooklyn'), ('Vietnamese/Cambodian/Malaysia', 'Queens')], names=['Cuisine', 'Borough'], length=239)
The following call to
Series.unstack
created a
DataFrame
containg five columns.
The index of the
DataFrame
was created by removing the innermost level
(number
-1
)
from the index of the
Series
.
(The index of the
Series
was a hierarchical index with two levels;
the index of the
DataFrame
is just a plain old index with only one level.)
The
DataFrame
contains five columns because the level we removed from the index
contains five possible values
("Bronx"
,
"Brooklyn"
,
"Manhattan"
,
etc.)
import sys import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] series = series[valid] groups = series.groupby(series) #Group by cuisine in alphabetical order. seriesOfNumbers = groups.apply(lambda group: group.index.value_counts()) #Each group is a Series. seriesOfNumbers.index.names = ["Cuisine", "Borough"] seriesOfNumbers.name = "Number of Restaurants" s = seriesOfNumbers.to_string(dtype = True, length = True, max_rows = 20, name = True) print(s) print() df = seriesOfNumbers.unstack(fill_value = 0, level = -1) #level = -1 is the default df.name = "Cross Tabulation" s = df.to_string(max_cols = 5, max_rows = 10, show_dimensions = True) print(s) sys.exit(0)
Cuisine Borough Afghan Queens 8 Manhattan 4 Brooklyn 2 African Bronx 25 Brooklyn 23 Manhattan 20 Queens 4 Staten Island 1 American Manhattan 2911 Brooklyn 1247 ... Turkish Staten Island 3 Vegetarian Manhattan 70 Brooklyn 47 Queens 12 Bronx 1 Vietnamese/Cambodian/Malaysia Manhattan 42 Brooklyn 36 Queens 21 Staten Island 3 Bronx 1 Name: Number of Restaurants, Length: 344, dtype: int64 Borough Bronx Brooklyn Manhattan Queens Staten Island Cuisine Afghan 0 2 4 8 0 African 25 23 20 4 1 American 402 1247 2911 980 219 Armenian 1 7 14 4 0 Asian 6 98 175 118 15 ... ... ... ... ... ... Tex-Mex 10 32 50 28 6 Thai 5 81 138 78 3 Turkish 0 30 29 10 3 Vegetarian 1 47 70 12 0 Vietnamese/Cambodian/Malaysia 1 36 42 21 3 [84 rows x 5 columns]
pd.cross_tab
is a shortcut for the above.
The following call to this function creates a
pd.DataFrame
containing five columns because the index of the
Series
contains copies of five strings
("Broooklyn"
,
"Bronx"
,
"Manhattan"
,
etc).
The
pd.DataFrame
contains 84 rows because the values of the
Series
contains copies of 84 strings
("Afghan"
,
"African"
,
"American"
,
etc).
There’s one extra row and one extra column
(for totals of 6 and 85)
because we asked for
margins
.
import sys import pandas as pd #Website is #https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv" df = pd.read_csv(url).groupby("CAMIS").first() #Keep only one row (the first) for each restaurant. index = pd.Index(data = df["BORO"], name = "Borough") series = pd.Series(data = df["CUISINE DESCRIPTION"].array, index = index, name = "Cuisine") valid = [ "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island" ] series = series[valid] #Create a pd.DataFrame containing 5+1 columns, including the All column. df = pd.crosstab(index = series.array, columns = series.index, margins = True) df.index.name = "Cuisine" df.columns.name = "Boroughs" s = df.to_string(max_cols = 5 + 1, show_dimensions = True) print(s) sys.exit(0)
Boroughs Bronx Brooklyn Manhattan Queens Staten Island All Cuisine Afghan 0 2 4 8 0 14 African 25 23 20 4 1 73 American 402 1247 2911 980 219 5759 Armenian 1 7 14 4 0 26 Asian 6 98 175 118 15 412 Australian 0 4 17 1 0 22 Bagels/Pretzels 7 39 62 39 10 157 Bakery 75 190 241 223 22 751 Bangladeshi 8 9 6 20 0 43 Barbecue 3 16 24 6 5 54 Basque 0 0 1 0 0 1 Bottled beverages, including water, sodas, juic... 9 23 39 34 4 109 Brazilian 0 5 13 11 0 29 Café/Coffee/Tea 63 459 984 275 28 1809 Cajun 0 4 0 1 1 6 Californian 0 1 3 0 0 4 Caribbean 118 337 38 205 5 703 Chicken 115 157 113 109 17 511 Chilean 0 0 0 1 0 1 Chinese 295 731 511 781 78 2396 Chinese/Cuban 3 3 11 4 0 21 Chinese/Japanese 0 13 19 9 2 43 Continental 0 10 25 11 1 47 Creole 0 15 2 5 0 22 Creole/Cajun 0 4 0 2 0 6 Czech 1 1 0 2 0 4 Delicatessen 28 51 112 93 13 297 Donuts 66 122 141 155 35 519 Eastern European 3 34 17 13 3 70 Egyptian 0 4 5 3 0 12 English 0 1 11 1 0 13 Ethiopian 0 4 13 0 0 17 Filipino 0 1 8 20 2 31 French 2 51 244 15 0 312 Fruits/Vegetables 0 1 3 1 1 6 German 0 6 12 7 3 28 Greek 6 18 55 53 6 138 Hamburgers 63 86 92 101 23 365 Hawaiian 1 8 32 6 0 47 Hotdogs 1 7 18 2 0 28 Hotdogs/Pretzels 3 4 17 5 3 32 Ice Cream, Gelato, Yogurt, Ices 23 102 114 74 32 345 Indian 7 59 138 103 9 316 Indonesian 0 3 3 4 0 10 Iranian 0 0 2 2 0 4 Irish 10 12 115 52 2 191 Italian 42 168 568 112 70 960 Japanese 16 187 475 161 34 873 Jewish/Kosher 8 184 76 61 2 331 Juice, Smoothies, Fruit Salads 42 107 174 88 11 422 Korean 0 26 108 173 0 307 Latin (Cuban, Dominican, Puerto Rican, South & ... 174 138 186 314 11 823 Mediterranean 5 68 147 52 9 281 Mexican 124 288 326 189 40 967 Middle Eastern 4 87 57 38 6 192 Moroccan 0 0 7 1 0 8 Not Listed/Not Applicable 1 1 3 2 1 8 Nuts/Confectionary 0 4 2 0 0 6 Other 110 449 818 388 62 1827 Pakistani 1 13 6 9 0 29 Pancakes/Waffles 2 3 3 5 2 15 Peruvian 3 13 15 49 3 83 Pizza 203 308 354 276 57 1198 Pizza/Italian 57 130 126 105 55 473 Polish 0 17 2 3 3 25 Portuguese 0 2 4 5 0 11 Russian 0 52 8 6 3 69 Salads 0 8 65 5 0 78 Sandwiches 44 62 154 69 23 352 Sandwiches/Salads/Mixed Buffet 25 34 111 56 7 233 Scandinavian 0 2 5 0 0 7 Seafood 26 42 92 29 4 193 Soul Food 10 25 14 15 2 66 Soups 0 1 7 0 0 8 Soups & Sandwiches 2 7 30 3 1 43 Southwestern 0 0 1 2 0 3 Spanish 132 146 126 212 20 636 Steak 5 8 54 11 1 79 Tapas 0 10 22 8 1 41 Tex-Mex 10 32 50 28 6 126 Thai 5 81 138 78 3 305 Turkish 0 30 29 10 3 72 Vegetarian 1 47 70 12 0 130 Vietnamese/Cambodian/Malaysia 1 36 42 21 3 103 All 2397 6788 10860 6159 983 27187 [85 rows x 6 columns]
Our
DataFrame
contains 84 rows and 5 columns.
What would happen if you reversed the order of the two arguments of
pd.cross_tab
?
df = pd.crosstab(index = series.index, columns = series.array, margins = True)