[1]:
import os
import gzip
import numpy as np
import pandas as pd
import transportation_tutorials as tt
Pandas is the de facto standard for statistical analysis of tabular data using Python.
The basic data structure in pandas is a DataFrame. A DataFrame is a two-dimensional table of data, with optional row and column labels.
You can construct a DataFrame from raw data in a few different ways. One simple way is to pass a dictionary, as here:
[2]:
raw = {'A': [7,8,9], 'B': [1,2,3], 'C': [4.1,5.2,6.3]}
df = pd.DataFrame(raw)
df
[2]:
A | B | C | |
---|---|---|---|
0 | 7 | 1 | 4.1 |
1 | 8 | 2 | 5.2 |
2 | 9 | 3 | 6.3 |
In addition to the data in the table, DataFrames are defined with row and column labels, identified as index and columns, respectively. You might notice that in the first DataFrame we created, the keys of the raw dictionary were adopted automatically as the column labels. The row labels are not drawn from the raw dictionary, as there’s nothing in it to use for that purpose. But every DataFrame must have some kind of row labels, so some are created automatically. If we provide more information to the DataFrame constructor, so that it can see what the row labels should be, then it will use those:
[3]:
raw_los = {
'Speed': {'Car': 60, 'Bus': 20, 'Walk':3},
'Cost': {'Car': 3.25, 'Bus': 1.75, 'Walk':0},
}
los = pd.DataFrame(raw_los)
los
[3]:
Speed | Cost | |
---|---|---|
Car | 60 | 3.25 |
Bus | 20 | 1.75 |
Walk | 3 | 0.00 |
[4]:
los.index
[4]:
Index(['Car', 'Bus', 'Walk'], dtype='object')
[5]:
los.columns
[5]:
Index(['Speed', 'Cost'], dtype='object')
The values for either index or columns can be changed, simply by overwriting the appropriate attribute of the DataFrame.
[6]:
los.index = [2,1,3]
los
[6]:
Speed | Cost | |
---|---|---|
2 | 60 | 3.25 |
1 | 20 | 1.75 |
3 | 3 | 0.00 |
You can also choose an existing column of the DataFrame to be the index, using the set_index method.
[7]:
los.set_index('Speed')
[7]:
Cost | |
---|---|
Speed | |
60 | 3.25 |
20 | 1.75 |
3 | 0.00 |
This probably isn’t a good thing to do with measured attributes like ‘Speed’, but it can come in handy to set the index as an identifier code (e.g. a TAZ code), especially after reading data from a file.
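For example, a sketch of that pattern (the file name and ‘TAZ’ column here are hypothetical):

zones = pd.read_csv('zones.csv').set_index('TAZ')   # hypothetical zone table with a 'TAZ' identifier column
zones = pd.read_csv('zones.csv', index_col='TAZ')   # equivalent: set the index while reading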
As you might notice, the index or columns can be made up of text, or numbers – or, in fact, any Python object.
[8]:
los.index = [{'key':'value'},set([None]),bytes(2)]
los
[8]:
Speed | Cost | |
---|---|---|
{'key': 'value'} | 60 | 3.25 |
{None} | 20 | 1.75 |
b'\x00\x00' | 3 | 0.00 |
But you’ll find things much simpler if you stick to strings or integers for these labels, unless there is some very compelling reason to deviate from that plan.
Every DataFrame has both index and columns, even if one or both of these sets of labels are not defined when creating the DataFrame. For example, the example df created from raw above only has column names, but the index with row names was created by default.
[9]:
df.index
[9]:
RangeIndex(start=0, stop=3, step=1)
This default index is a RangeIndex, which is a memory-saving feature (at least when the DataFrame is large), as the actual index values are not stored, just the start, stop, and step values (i.e., start at 0, stop before you get to 3, counting by 1’s).
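Even though the labels are not stored individually, they behave like ordinary labels; for example (continuing with df from above):

list(df.index)   # [0, 1, 2], generated on the fly from the start, stop, and step values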
Within the DataFrame, each column of data has a uniform data type, but the data types are permitted to vary across the columns. In the example DataFrame we created here, the data type was inferred as int64 for columns ‘A’ and ‘B’ (because all the values in those columns are integers), and float64 for ‘C’. Note that all the values displayed in column ‘C’ have decimal values. We can also check explicitly on the data types for each column:
[10]:
df
[10]:
A | B | C | |
---|---|---|---|
0 | 7 | 1 | 4.1 |
1 | 8 | 2 | 5.2 |
2 | 9 | 3 | 6.3 |
[11]:
df.dtypes
[11]:
A int64
B int64
C float64
dtype: object
If we try to force data into a DataFrame that doesn’t follow this pattern of maintaining one common data type in each column, the underlying data will be up-cast to preserve this data-type-by-column rule. For example, if we take the transpose of the example DataFrame (swapping the rows and columns), the integer data gets up-cast to floats:
[12]:
df.T
[12]:
0 | 1 | 2 | |
---|---|---|---|
A | 7.0 | 8.0 | 9.0 |
B | 1.0 | 2.0 | 3.0 |
C | 4.1 | 5.2 | 6.3 |
The numerical values of each data cell are preserved, but now all three columns are represented as floating point numbers.
[13]:
df.T.dtypes
[13]:
0 float64
1 float64
2 float64
dtype: object
For virtually all transportation data analysis, the data used will be loaded from a file, instead of being entered directly in Python commands. Fortunately, Pandas comes with a host of data reading functions (including read_csv, read_table, read_excel, read_json, read_html, and read_sql, among others) that can be used to read data in a variety of formats. You can click through to the documentation for each of these. In this tutorial, we’ll cover the most basic reading function in the pandas library, which is read_csv. Despite the name, this function isn’t just for reading comma separated values, but also pretty much any delimited text file, including tab and space delimited files.
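For instance, a sketch of reading a tab-delimited file (the file name here is hypothetical):

other = pd.read_csv('example.tsv', sep='\t')   # sep tells read_csv which delimiter to expect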
A very small example csv data file is included in the transportation_tutorials package:
[14]:
os.path.basename(tt.data('FL-COUNTY-POP'))
[14]:
'FL-COUNTY-POP.csv.gz'
For most csv files, you can just pass the name of the file to the read_csv function, and pandas will figure out the rest. This includes, as shown here, when a csv file is gzipped on disk, in which case it is transparently decompressed while reading.
[15]:
fl = pd.read_csv(tt.data('FL-COUNTY-POP'))
[16]:
fl
[16]:
Name | 2019 Population | Growth Since 2010 | Land Area | |
---|---|---|---|---|
0 | Miami-Dade County | 2751796 | 9.754518 | 1897.72 |
1 | Broward County | 1935878 | 10.438146 | 1209.79 |
2 | Palm Beach County | 1471150 | 11.134781 | 1969.76 |
3 | Hillsborough County | 1408566 | 14.187346 | 1020.21 |
4 | Orange County | 1348975 | 17.435335 | 903.43 |
5 | Pinellas County | 970637 | 5.908084 | 273.80 |
6 | Duval County | 937934 | 8.349131 | 762.19 |
7 | Lee County | 739224 | 19.139938 | 784.51 |
8 | Polk County | 686483 | 13.819128 | 1797.84 |
9 | Brevard County | 589162 | 8.306616 | 1015.66 |
10 | Volusia County | 538692 | 8.949042 | 1101.03 |
11 | Pasco County | 525643 | 12.922512 | 746.89 |
12 | Seminole County | 462659 | 9.358846 | 309.22 |
13 | Sarasota County | 419119 | 10.310730 | 555.87 |
14 | Manatee County | 385571 | 19.209805 | 742.93 |
15 | Collier County | 372880 | 15.585507 | 1998.32 |
16 | Marion County | 354353 | 6.944143 | 1584.55 |
17 | Osceola County | 352180 | 30.508575 | 1327.45 |
18 | Lake County | 346017 | 16.228536 | 938.38 |
19 | Escambia County | 313512 | 5.192661 | 656.46 |
20 | St. Lucie County | 313506 | 12.659688 | 571.93 |
21 | Leon County | 290292 | 5.189313 | 666.85 |
22 | Alachua County | 266944 | 7.809988 | 875.02 |
23 | St. Johns County | 243812 | 27.492731 | 600.66 |
24 | Clay County | 212230 | 10.863853 | 604.36 |
25 | Okaloosa County | 202970 | 12.315593 | 930.25 |
26 | Hernando County | 186553 | 7.857172 | 472.54 |
27 | Bay County | 183563 | 8.484283 | 758.46 |
28 | Charlotte County | 182033 | 13.863851 | 680.28 |
29 | Santa Rosa County | 174272 | 13.969564 | 1011.61 |
... | ... | ... | ... | ... |
37 | Monroe County | 77013 | 5.181715 | 983.28 |
38 | Putnam County | 73464 | -0.997251 | 727.62 |
39 | Columbia County | 69612 | 3.043401 | 797.57 |
40 | Walton County | 68376 | 23.840400 | 1037.63 |
41 | Jackson County | 48330 | -2.650767 | 917.76 |
42 | Gadsden County | 46071 | -3.607072 | 516.33 |
43 | Suwannee County | 44190 | 4.406379 | 688.55 |
44 | Okeechobee County | 41605 | 3.934549 | 768.91 |
45 | Levy County | 40355 | -0.906100 | 1118.21 |
46 | Hendry County | 40347 | 3.414072 | 1152.75 |
47 | DeSoto County | 36862 | 5.512938 | 637.06 |
48 | Wakulla County | 32120 | 4.201135 | 606.42 |
49 | Baker County | 28283 | 4.511862 | 585.23 |
50 | Hardee County | 27411 | -1.132552 | 637.78 |
51 | Bradford County | 27038 | -5.249509 | 293.96 |
52 | Washington County | 24567 | -0.622952 | 582.80 |
53 | Taylor County | 21833 | -3.338203 | 1043.31 |
54 | Holmes County | 19558 | -1.446208 | 478.78 |
55 | Madison County | 18449 | -4.151081 | 695.95 |
56 | Gilchrist County | 17743 | 4.364449 | 349.68 |
57 | Dixie County | 16673 | 1.658435 | 705.05 |
58 | Gulf County | 16160 | 2.123357 | 564.01 |
59 | Union County | 15517 | -0.180122 | 243.56 |
60 | Calhoun County | 14483 | -1.112932 | 567.33 |
61 | Hamilton County | 14184 | -3.398488 | 513.79 |
62 | Jefferson County | 14144 | -4.095471 | 598.10 |
63 | Glades County | 13754 | 6.843782 | 806.01 |
64 | Franklin County | 11727 | 1.779205 | 534.73 |
65 | Lafayette County | 8451 | -4.074915 | 543.41 |
66 | Liberty County | 8242 | -1.269765 | 835.56 |
67 rows × 4 columns
Within the Jupyter notebook, by default if you try to view a large DataFrame, the middle section is not displayed. If you look carefully above, you’ll see a row of ellipses marking the place where some rows are not shown. Still, we get the first 30 and last 30 rows of the table. If you don’t want to review so large a section of a DataFrame, you can use the head method of DataFrames to see just the first few rows.
[17]:
fl.head()
[17]:
Name | 2019 Population | Growth Since 2010 | Land Area | |
---|---|---|---|---|
0 | Miami-Dade County | 2751796 | 9.754518 | 1897.72 |
1 | Broward County | 1935878 | 10.438146 | 1209.79 |
2 | Palm Beach County | 1471150 | 11.134781 | 1969.76 |
3 | Hillsborough County | 1408566 | 14.187346 | 1020.21 |
4 | Orange County | 1348975 | 17.435335 | 903.43 |
One notable tabular file format you may encounter in transportation planning that is not directly readable by pandas is the DBF file. Fortunately, there are other related tools that can read this kind of file, including geopandas. Geopandas includes just one main reading function: read_file. When you read a DBF file with geopandas.read_file, you’ll get a GeoDataFrame object instead of a regular pandas.DataFrame – but don’t worry, everything you can do with a DataFrame you can also do with a GeoDataFrame.
[18]:
import os
import geopandas as gpd
[19]:
os.path.basename(tt.data('US-STATES'))
[19]:
'US-STATES.dbf'
[20]:
states = gpd.read_file(tt.data('US-STATES'))
states.head()
[20]:
STATEFP | STATENS | AFFGEOID | GEOID | STUSPS | NAME | LSAD | ALAND | AWATER | geometry | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 01779775 | 0400000US01 | 01 | AL | Alabama | 00 | 131173688951 | 4593686489 | None |
1 | 02 | 01785533 | 0400000US02 | 02 | AK | Alaska | 00 | 1477946266785 | 245390495931 | None |
2 | 04 | 01779777 | 0400000US04 | 04 | AZ | Arizona | 00 | 294198560125 | 1027346486 | None |
3 | 05 | 00068085 | 0400000US05 | 05 | AR | Arkansas | 00 | 134771517596 | 2960191698 | None |
4 | 06 | 01779778 | 0400000US06 | 06 | CA | California | 00 | 403501101370 | 20466718403 | None |
[21]:
import larch
An alternate DBF reading implementation is included in Larch. This optimized DBF reader can currently only handle text and numeric data types (i.e., types ‘C’, ‘F’, and ‘N’; no dates, memos, or logical fields). However, it can read data quite fast even for large files, and will extract the basic contents of the file (headers, number of rows) even without actually reading the data.
[22]:
f = larch.DBF(tt.data('US-STATES'))
To actually load the data into a DataFrame, use the load_dataframe method.
[23]:
states = f.load_dataframe()
states.set_index('STUSPS', inplace=True)
states.head()
[23]:
STATEFP | STATENS | AFFGEOID | GEOID | NAME | LSAD | ALAND | AWATER | |
---|---|---|---|---|---|---|---|---|
STUSPS | ||||||||
AL | 01 | 01779775 | 0400000US01 | 01 | Alabama | 00 | 131173688951 | 4593686489 |
AK | 02 | 01785533 | 0400000US02 | 02 | Alaska | 00 | 1477946266785 | 245390495931 |
AZ | 04 | 01779777 | 0400000US04 | 04 | Arizona | 00 | 294198560125 | 1027346486 |
AR | 05 | 00068085 | 0400000US05 | 05 | Arkansas | 00 | 134771517596 | 2960191698 |
CA | 06 | 01779778 | 0400000US06 | 06 | California | 00 | 403501101370 | 20466718403 |
Sometimes, data that is read in from raw files will require some cleaning to be usable. Consider this example of data on bridges in Florida:
[24]:
with gzip.open(tt.data('FL-BRIDGES'), 'rt') as previewfile:
print(*(next(previewfile) for x in range(6)))
County,Total #,Good #,Fair #,Poor #,SD #,Total Area,Good Area,Fair Area,Poor Area,SD Area
ALACHUA (001),111,64,47,-,-,64767,55794,8973,,
BAKER (003),89,30,52,7,8,32162,19369,12282,510,623
BAY (005),122,49,63,10,11,210039,98834,109628,1577,10120
BRADFORD (007),62,23,37,2,2,9330,5492,3217,620,620
BREVARD (009),241,160,81,-,-,364138,204179,159959,,
The data is loaded by pandas in the usual way without errors.
[25]:
bridges = pd.read_csv(tt.data('FL-BRIDGES'))
However, not all is well, and problems will appear during analysis.
[26]:
bridges['SD #'].sum()
[26]:
'-8112-711-324152218128--6-1--22913622-286111-136313-3-7-3331352-13917696256'
That is clearly not the number of structurally deficient bridges in Florida. And it’s not even a number at all, it’s a string. If you look carefully and compare against the first few rows of the DataFrame shown below, you may figure out that the string is actually a concatenation of the numbers in the column, not the sum.
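As a quick illustration of why this happens, summing a pandas Series of strings concatenates them rather than adding:

pd.Series(['1', '2', '3']).sum()   # returns '123' (concatenation), not 6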
[27]:
bridges.head()
[27]:
County | Total # | Good # | Fair # | Poor # | SD # | Total Area | Good Area | Fair Area | Poor Area | SD Area | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | ALACHUA (001) | 111 | 64 | 47 | - | - | 64767 | 55794 | 8973 | NaN | NaN |
1 | BAKER (003) | 89 | 30 | 52 | 7 | 8 | 32162 | 19369 | 12282 | 510.0 | 623.0 |
2 | BAY (005) | 122 | 49 | 63 | 10 | 11 | 210039 | 98834 | 109628 | 1577.0 | 10120.0 |
3 | BRADFORD (007) | 62 | 23 | 37 | 2 | 2 | 9330 | 5492 | 3217 | 620.0 | 620.0 |
4 | BREVARD (009) | 241 | 160 | 81 | - | - | 364138 | 204179 | 159959 | NaN | NaN |
This is happening because the values in that column are read by pandas as strings, not numbers, because the zero values are given as ‘-’ instead of 0. We can see the problem clearly if we look at the info for the bridges DataFrame.
[28]:
bridges.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 11 columns):
County 68 non-null object
Total # 68 non-null int64
Good # 68 non-null int64
Fair # 68 non-null int64
Poor # 68 non-null object
SD # 68 non-null object
Total Area 68 non-null int64
Good Area 68 non-null int64
Fair Area 68 non-null int64
Poor Area 47 non-null float64
SD Area 49 non-null float64
dtypes: float64(2), int64(6), object(3)
memory usage: 6.0+ KB
To fix this, we can use the replace method to fill all the ‘-’ values with zeros, and then change the dtype of the affected columns back to integer.
[29]:
bridges = bridges.replace('-', 0)
bridges[['Poor #', 'SD #']] = bridges[['Poor #', 'SD #']].astype(int)
Then we are able to compute the total number of structurally deficient bridges in Florida.
[30]:
bridges['SD #'].sum()
[30]:
512
Whoa, that seems high. Let’s check on the values in the table and see where all those structurally deficient bridges are. We can sort on the ‘SD #’ column and take a look at the highest few rows:
[31]:
bridges.sort_values('SD #', ascending=False).head()
[31]:
County | Total # | Good # | Fair # | Poor # | SD # | Total Area | Good Area | Fair Area | Poor Area | SD Area | |
---|---|---|---|---|---|---|---|---|---|---|---|
67 | TOTALS | 12313 | 8534 | 3545 | 233 | 256 | 16759416 | 11232342 | 5149876 | 376033.0 | 402201.0 |
14 | DUVAL (031) | 763 | 479 | 262 | 22 | 22 | 1795247 | 932820 | 811836 | 50590.0 | 50590.0 |
15 | ESCAMBIA (033) | 243 | 116 | 111 | 16 | 18 | 585435 | 259541 | 233423 | 92471.0 | 92585.0 |
42 | MIAMI-DADE (086) | 957 | 804 | 146 | 7 | 13 | 1743024 | 1315449 | 404966 | 22609.0 | 33739.0 |
45 | OKALOOSA (091) | 216 | 111 | 92 | 13 | 13 | 269312 | 77988 | 171330 | 19993.0 | 19993.0 |
Oops! The data we loaded includes a ‘TOTALS’ row, but we don’t want that row included in our DataFrame for analysis. We can drop it (using the index for that row, 67), to fix the problem:
[32]:
bridges.drop(67, inplace=True)
bridges['SD #'].sum()
[32]:
256
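As an aside, this cleanup could also have been handled at read time. A sketch using the na_values argument, so that the ‘-’ markers are parsed as missing from the start:

# parse '-' as NaN while reading, then fill with zero and restore integer dtypes
bridges_alt = pd.read_csv(tt.data('FL-BRIDGES'), na_values='-')
bridges_alt[['Poor #', 'SD #']] = bridges_alt[['Poor #', 'SD #']].fillna(0).astype(int)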
A slightly different problem occurs in the data for ‘SD Area’. In this column, the zero values were not given by ‘-’ values, but instead were actually omitted. When read in by pandas, omitted values are treated as missing (NaN), which is valid input but not the same as zero.
If the intention is that missing values are indeed missing, this behavior is expected and useful. If, however, the missing values should be interpreted as zeros, then that instruction needs to be applied to the DataFrame explicitly, or problems may arise. For example, missing values don’t count in the denominator when computing the mean for a column:
[33]:
bridges['SD Area'].mean()
[33]:
8379.208333333334
If we compute the mean manually using the actual number of counties, we get a different value:
[34]:
bridges['SD Area'].sum() / len(bridges['SD Area'])
[34]:
6003.014925373134
We can use the fillna method to set all the missing values to zero, and then we get the correct answer:
[35]:
bridges.fillna(0, inplace=True)
bridges['SD Area'].mean()
[35]:
6003.014925373134
New columns can be added to an existing DataFrame simply by assigning a value to a new column label, using plain square bracket notation. This makes it easy to add derived values to a DataFrame.
[36]:
fl['Population Density'] = fl['2019 Population'] / fl['Land Area']
[37]:
fl.head()
[37]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | |
---|---|---|---|---|---|
0 | Miami-Dade County | 2751796 | 9.754518 | 1897.72 | 1450.053749 |
1 | Broward County | 1935878 | 10.438146 | 1209.79 | 1600.176890 |
2 | Palm Beach County | 1471150 | 11.134781 | 1969.76 | 746.867639 |
3 | Hillsborough County | 1408566 | 14.187346 | 1020.21 | 1380.662805 |
4 | Orange County | 1348975 | 17.435335 | 903.43 | 1493.170473 |
Adding rows to a DataFrame can be done as well, using the loc indexer (see the next section on slicing for more about loc).
[38]:
fl.loc[999,['Name', '2019 Population']] = ('Dry Tortugas', 0)
[39]:
fl.tail()
[39]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | |
---|---|---|---|---|---|
63 | Glades County | 13754.0 | 6.843782 | 806.01 | 17.064304 |
64 | Franklin County | 11727.0 | 1.779205 | 534.73 | 21.930694 |
65 | Lafayette County | 8451.0 | -4.074915 | 543.41 | 15.551793 |
66 | Liberty County | 8242.0 | -1.269765 | 835.56 | 9.864043 |
999 | Dry Tortugas | 0.0 | NaN | NaN | NaN |
There are two main ways to slice a DataFrame: by label and by position.
Slicing by position is done using the iloc indexer. This makes the DataFrame operate in a manner similar to a numpy array.
Giving only one index or set of indexes selects from the rows only.
[40]:
fl.iloc[1:3]
[40]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | |
---|---|---|---|---|---|
1 | Broward County | 1935878.0 | 10.438146 | 1209.79 | 1600.176890 |
2 | Palm Beach County | 1471150.0 | 11.134781 | 1969.76 | 746.867639 |
Giving two sets of indexes selects from the rows and columns.
[41]:
states.iloc[1:3, 2:5]
[41]:
AFFGEOID | GEOID | NAME | |
---|---|---|---|
STUSPS | |||
AK | 0400000US02 | 02 | Alaska |
AZ | 0400000US04 | 04 | Arizona |
See more in Selection by Position in the Pandas documentation.
Slicing by label is done using the loc indexer. Instead of finding rows or columns based on the raw position in the DataFrame, the loc indexer finds rows or columns based upon the values in the index or columns attributes of the DataFrame.
[42]:
states.loc['AK':'CA']
[42]:
STATEFP | STATENS | AFFGEOID | GEOID | NAME | LSAD | ALAND | AWATER | |
---|---|---|---|---|---|---|---|---|
STUSPS | ||||||||
AK | 02 | 01785533 | 0400000US02 | 02 | Alaska | 00 | 1477946266785 | 245390495931 |
AZ | 04 | 01779777 | 0400000US04 | 04 | Arizona | 00 | 294198560125 | 1027346486 |
AR | 05 | 00068085 | 0400000US05 | 05 | Arkansas | 00 | 134771517596 | 2960191698 |
CA | 06 | 01779778 | 0400000US06 | 06 | California | 00 | 403501101370 | 20466718403 |
An important feature of the loc selector is that the resulting selection includes both the starting and the ending label. This is different from the usual Python slicing process, where the slice runs from the starting position up to but not including the ending position.
It can also be confusing to use the loc selector when the index being used has integer values. Care must be taken to select the rows you want, even when the index is a RangeIndex starting from zero. Consider these two selections, which are very similar but yield different results:
[43]:
fl.loc[1:3]
[43]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | |
---|---|---|---|---|---|
1 | Broward County | 1935878.0 | 10.438146 | 1209.79 | 1600.176890 |
2 | Palm Beach County | 1471150.0 | 11.134781 | 1969.76 | 746.867639 |
3 | Hillsborough County | 1408566.0 | 14.187346 | 1020.21 | 1380.662805 |
[44]:
fl.iloc[1:3]
[44]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | |
---|---|---|---|---|---|
1 | Broward County | 1935878.0 | 10.438146 | 1209.79 | 1600.176890 |
2 | Palm Beach County | 1471150.0 | 11.134781 | 1969.76 | 746.867639 |
See more in Selection by Label in the Pandas documentation.
Sorting must be done explicitly, using either the index or one or more columns of values.
[45]:
states.sort_index().head()
[45]:
STATEFP | STATENS | AFFGEOID | GEOID | NAME | LSAD | ALAND | AWATER | |
---|---|---|---|---|---|---|---|---|
STUSPS | ||||||||
AK | 02 | 01785533 | 0400000US02 | 02 | Alaska | 00 | 1477946266785 | 245390495931 |
AL | 01 | 01779775 | 0400000US01 | 01 | Alabama | 00 | 131173688951 | 4593686489 |
AR | 05 | 00068085 | 0400000US05 | 05 | Arkansas | 00 | 134771517596 | 2960191698 |
AS | 60 | 01802701 | 0400000US60 | 60 | American Samoa | 00 | 197759069 | 1307243753 |
AZ | 04 | 01779777 | 0400000US04 | 04 | Arizona | 00 | 294198560125 | 1027346486 |
[46]:
states.sort_values('NAME').head()
[46]:
STATEFP | STATENS | AFFGEOID | GEOID | NAME | LSAD | ALAND | AWATER | |
---|---|---|---|---|---|---|---|---|
STUSPS | ||||||||
AL | 01 | 01779775 | 0400000US01 | 01 | Alabama | 00 | 131173688951 | 4593686489 |
AK | 02 | 01785533 | 0400000US02 | 02 | Alaska | 00 | 1477946266785 | 245390495931 |
AS | 60 | 01802701 | 0400000US60 | 60 | American Samoa | 00 | 197759069 | 1307243753 |
AZ | 04 | 01779777 | 0400000US04 | 04 | Arizona | 00 | 294198560125 | 1027346486 |
AR | 05 | 00068085 | 0400000US05 | 05 | Arkansas | 00 | 134771517596 | 2960191698 |
If we have two DataFrames that contain related data, we may want to merge the contents into a single DataFrame for some kinds of analysis. For example, suppose we have a DataFrame that enumerates counties in Florida by FDOT District, like this one:
[47]:
districts = pd.read_csv(tt.data('FL-COUNTY-BY-DISTRICT'))
districts.head()
[47]:
County | District | |
---|---|---|
0 | Charlotte | 1 |
1 | Collier | 1 |
2 | DeSoto | 1 |
3 | Glades | 1 |
4 | Hardee | 1 |
This is useful information if we want to compute statistics by FDOT District instead of by county, but it will need to be merged with the other available Florida county data we have.
Pandas provides a merge function, which implements the core functionality for all standard database-style join operations between DataFrames. These operations let the analyst connect two DataFrames that are related in some systematic way.
The principal arguments to the merge function include:
  * left and right, the two DataFrames to be merged, and
  * how, the type of join to perform (‘inner’, ‘outer’, ‘left’, or ‘right’; the default is ‘inner’).
In addition, some combination of selector arguments must be given, to indicate the keys to use for the join:
  * on, naming one or more key columns that appear in both DataFrames,
  * left_on and right_on, naming key columns that have different names in the two DataFrames, or
  * left_index and right_index, to use the row labels as keys instead of columns.
Advanced users will find details on a variety of other merge arguments in the function documentation. There are also merge and join shorthand methods available on DataFrames, although each represents an application of the more general merge function illustrated here.
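As a minimal sketch of the general pattern (the DataFrames and the key name here are placeholders):

merged = pd.merge(left_df, right_df, how='inner', on='key')   # top-level function
merged = left_df.merge(right_df, how='inner', on='key')       # equivalent DataFrame method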
To merge the county data with the District numbers, we can use the merge function. In the population data file we previously loaded, the county names are given in a column named ‘Name’, while in the district membership table, the county names are given in a column named ‘County’, so we’ll use the left_on and right_on arguments to identify that the merge keys have different names.
[48]:
pd.merge(fl, districts, left_on='Name', right_on='County')
[48]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | County | District |
---|
In this instance, the ‘inner’ merge resulted in a new DataFrame, but with zero rows of data. This is because, as it turns out, the merge keys are not actually the same:
[49]:
fl.head()
[49]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | |
---|---|---|---|---|---|
0 | Miami-Dade County | 2751796.0 | 9.754518 | 1897.72 | 1450.053749 |
1 | Broward County | 1935878.0 | 10.438146 | 1209.79 | 1600.176890 |
2 | Palm Beach County | 1471150.0 | 11.134781 | 1969.76 | 746.867639 |
3 | Hillsborough County | 1408566.0 | 14.187346 | 1020.21 | 1380.662805 |
4 | Orange County | 1348975.0 | 17.435335 | 903.43 | 1493.170473 |
[50]:
districts.head()
[50]:
County | District | |
---|---|---|
0 | Charlotte | 1 |
1 | Collier | 1 |
2 | DeSoto | 1 |
3 | Glades | 1 |
4 | Hardee | 1 |
The names of the counties in the first table are given with the appellation “Name County”, while in the second table, they are identified merely as “Name”. In order to use the merge function, the keys must match exactly across the two tables. To make this work, we’ll need to strip the ‘ County’ suffix from the names in the first table. We can do so with the str.replace method, like this:
[51]:
fl['County'] = fl['Name'].str.replace(' County', '')
Then, if we attempt the merge again, we’ll get the results we expect.
[52]:
fl_2 = pd.merge(fl, districts, on='County')
fl_2.head()
[52]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | County | District | |
---|---|---|---|---|---|---|---|
0 | Miami-Dade County | 2751796.0 | 9.754518 | 1897.72 | 1450.053749 | Miami-Dade | 6 |
1 | Broward County | 1935878.0 | 10.438146 | 1209.79 | 1600.176890 | Broward | 4 |
2 | Palm Beach County | 1471150.0 | 11.134781 | 1969.76 | 746.867639 | Palm Beach | 4 |
3 | Hillsborough County | 1408566.0 | 14.187346 | 1020.21 | 1380.662805 | Hillsborough | 7 |
4 | Orange County | 1348975.0 | 17.435335 | 903.43 | 1493.170473 | Orange | 5 |
The merge here, adding the FDOT district to the county table, is a one-to-one merge (i.e., every row in the left table matches one and only one row in the right table). However, a one-to-one relationship is not required to undertake a merge. We can also merge data that has one-to-many and many-to-one relationships, and even many-to-many, although that merge type is less common in most transportation applications.
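If you want to guard against an unexpected relationship, the merge function also accepts a validate argument, which raises an error if the assumed relationship does not hold. A sketch, re-running the county merge with validation:

# raises a MergeError if 'County' is not unique in both tables
pd.merge(fl, districts, on='County', validate='one_to_one')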
As an example, consider this table that gives the FDOT District name and headquarters location for each district:
[53]:
district_info = pd.read_csv(tt.data('FL-DISTRICTS'), index_col='District')
district_info
[53]:
Name | Headquarters | |
---|---|---|
District | ||
1 | Southwest Florida | Bartow |
2 | Northeast Florida | Lake City |
3 | Northwest Florida | Chipley |
4 | Southeast Florida | Fort Lauderdale |
5 | Central Florida | DeLand |
6 | South Florida | Miami |
7 | West Central Florida | Tampa |
If we want to add a column that gives the district name to the counties table we already have, we can use a many-to-one merge. This is actually just detected automatically, so there is no extra argument to give.
[54]:
pd.merge(fl_2, district_info, on='District').head()
[54]:
Name_x | 2019 Population | Growth Since 2010 | Land Area | Population Density | County | District | Name_y | Headquarters | |
---|---|---|---|---|---|---|---|---|---|
0 | Miami-Dade County | 2751796.0 | 9.754518 | 1897.72 | 1450.053749 | Miami-Dade | 6 | South Florida | Miami |
1 | Monroe County | 77013.0 | 5.181715 | 983.28 | 78.322553 | Monroe | 6 | South Florida | Miami |
2 | Broward County | 1935878.0 | 10.438146 | 1209.79 | 1600.176890 | Broward | 4 | Southeast Florida | Fort Lauderdale |
3 | Palm Beach County | 1471150.0 | 11.134781 | 1969.76 | 746.867639 | Palm Beach | 4 | Southeast Florida | Fort Lauderdale |
4 | St. Lucie County | 313506.0 | 12.659688 | 571.93 | 548.154494 | St. Lucie | 4 | Southeast Florida | Fort Lauderdale |
Because both the left and right DataFrames have a column named ‘Name’, in the merged result the column from the left DataFrame has a ‘_x’ appended, while the column from the right DataFrame has a ‘_y’ appended. This is configurable using the suffixes argument, or by just changing the column names before merging. In fact, many merge details can be controlled easily by manipulating the left and right arguments before merging. For example, if we only want to get the district name, not the headquarters, we can slice the right DataFrame to only include what we want:
[55]:
pd.merge(fl_2, district_info[['Name']].add_prefix('District_'), on='District').head()
[55]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | County | District | District_Name | |
---|---|---|---|---|---|---|---|---|
0 | Miami-Dade County | 2751796.0 | 9.754518 | 1897.72 | 1450.053749 | Miami-Dade | 6 | South Florida |
1 | Monroe County | 77013.0 | 5.181715 | 983.28 | 78.322553 | Monroe | 6 | South Florida |
2 | Broward County | 1935878.0 | 10.438146 | 1209.79 | 1600.176890 | Broward | 4 | Southeast Florida |
3 | Palm Beach County | 1471150.0 | 11.134781 | 1969.76 | 746.867639 | Palm Beach | 4 | Southeast Florida |
4 | St. Lucie County | 313506.0 | 12.659688 | 571.93 | 548.154494 | St. Lucie | 4 | Southeast Florida |
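Alternatively, a sketch of setting the suffixes argument directly, so the overlapping ‘Name’ columns come out with more meaningful labels:

# name the overlapping columns explicitly instead of the default _x / _y
pd.merge(fl_2, district_info, on='District', suffixes=('_county', '_district')).head()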
Often data we are working with has multiple categories of columns. For example, the bridge data we worked with above has two data types: the number of bridges by condition, and the deck area by condition. We can explicitly gather together those groups of columns using a multi-level index.
To demonstrate this, we’ll make a copy of the bridges DataFrame with the county name as the index.
[56]:
bridges2 = bridges.set_index('County')
This DataFrame now has ten columns: five counts and five areas.
[57]:
bridges2.columns
[57]:
Index(['Total #', 'Good #', 'Fair #', 'Poor #', 'SD #', 'Total Area',
'Good Area', 'Fair Area', 'Poor Area', 'SD Area'],
dtype='object')
We can use a list of tuples to create a MultiIndex that gives both the column’s category and a name. Notice that we can duplicate the condition names without trouble, because the duplicate values have different categories.
[58]:
bridges2.columns = pd.MultiIndex.from_tuples(
[('Count','Total'), ('Count','Good'), ('Count','Fair'), ('Count','Poor'), ('Count','SD'),
('Area','Total'), ('Area','Good'), ('Area','Fair'), ('Area','Poor'), ('Area','SD'), ],
names=['measure', 'condition'],
)
[59]:
bridges2.head()
[59]:
measure | Count | Area | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
condition | Total | Good | Fair | Poor | SD | Total | Good | Fair | Poor | SD |
County | ||||||||||
ALACHUA (001) | 111 | 64 | 47 | 0 | 0 | 64767 | 55794 | 8973 | 0.0 | 0.0 |
BAKER (003) | 89 | 30 | 52 | 7 | 8 | 32162 | 19369 | 12282 | 510.0 | 623.0 |
BAY (005) | 122 | 49 | 63 | 10 | 11 | 210039 | 98834 | 109628 | 1577.0 | 10120.0 |
BRADFORD (007) | 62 | 23 | 37 | 2 | 2 | 9330 | 5492 | 3217 | 620.0 | 620.0 |
BREVARD (009) | 241 | 160 | 81 | 0 | 0 | 364138 | 204179 | 159959 | 0.0 | 0.0 |
MultiIndexes can be used for both the columns and the index of a DataFrame.
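Since these column labels are the Cartesian product of two short lists, an equivalent and more compact construction (a sketch producing the same labels as the from_tuples call above) is:

# builds the same 2x5 set of column labels as the from_tuples call above
bridges2.columns = pd.MultiIndex.from_product(
    [['Count', 'Area'], ['Total', 'Good', 'Fair', 'Poor', 'SD']],
    names=['measure', 'condition'],
)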
One advantage of using this multi-index approach to grouping similar data is that it makes it much easier to reshape the DataFrame for different types of analysis. For example, we can transform the bridges DataFrame to have a row for each county and condition, and just two columns (count and area), by stacking the data.
[60]:
bridges3 = bridges2.stack()
bridges3.head(12)
[60]:
measure | Area | Count | |
---|---|---|---|
County | condition | ||
ALACHUA (001) | Fair | 8973.0 | 47 |
Good | 55794.0 | 64 | |
Poor | 0.0 | 0 | |
SD | 0.0 | 0 | |
Total | 64767.0 | 111 | |
BAKER (003) | Fair | 12282.0 | 52 |
Good | 19369.0 | 30 | |
Poor | 510.0 | 7 | |
SD | 623.0 | 8 | |
Total | 32162.0 | 89 | |
BAY (005) | Fair | 109628.0 | 63 |
Good | 98834.0 | 49 |
One thing you might notice here is that the conditions have been reordered. The new order isn’t consistent with the categorical meanings, but instead is simply alphabetic. We can preserve the ordering by converting these values into ordered categorical values like this:
[61]:
conditions = pd.CategoricalIndex(
bridges2.columns.levels[1],
ordered=True,
categories=['Total', 'Good', 'Fair', 'Poor', 'SD'],
)
bridges2.columns = bridges2.columns.set_levels(conditions, level=1)
bridges2.stack().head(12)
[61]:
measure | Area | Count | |
---|---|---|---|
County | condition | ||
ALACHUA (001) | Total | 64767.0 | 111 |
Good | 55794.0 | 64 | |
Fair | 8973.0 | 47 | |
Poor | 0.0 | 0 | |
SD | 0.0 | 0 | |
BAKER (003) | Total | 32162.0 | 89 |
Good | 19369.0 | 30 | |
Fair | 12282.0 | 52 | |
Poor | 510.0 | 7 | |
SD | 623.0 | 8 | |
BAY (005) | Total | 210039.0 | 122 |
Good | 98834.0 | 49 |
That’s obviously a bit complicated, though, and it may be easier to simply rename the conditions so that alphabetic sorting works for us.
Now, for example, we can easily get the average deck area of bridges by condition.
[62]:
bridges3['AvgArea'] = bridges3['Area'] / bridges3['Count']
bridges3.head(12)
[62]:
measure | Area | Count | AvgArea | |
---|---|---|---|---|
County | condition | |||
ALACHUA (001) | Fair | 8973.0 | 47 | 190.914894 |
Good | 55794.0 | 64 | 871.781250 | |
Poor | 0.0 | 0 | NaN | |
SD | 0.0 | 0 | NaN | |
Total | 64767.0 | 111 | 583.486486 | |
BAKER (003) | Fair | 12282.0 | 52 | 236.192308 |
Good | 19369.0 | 30 | 645.633333 | |
Poor | 510.0 | 7 | 72.857143 | |
SD | 623.0 | 8 | 77.875000 | |
Total | 32162.0 | 89 | 361.370787 | |
BAY (005) | Fair | 109628.0 | 63 | 1740.126984 |
Good | 98834.0 | 49 | 2017.020408 |
We can also reverse this transformation, returning to a wider format, with the unstack method.
[63]:
bridges4 = bridges3.unstack()
bridges4.head()
[63]:
measure | Area | Count | AvgArea | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
condition | Fair | Good | Poor | SD | Total | Fair | Good | Poor | SD | Total | Fair | Good | Poor | SD | Total |
County | |||||||||||||||
ALACHUA (001) | 8973.0 | 55794.0 | 0.0 | 0.0 | 64767.0 | 47 | 64 | 0 | 0 | 111 | 190.914894 | 871.781250 | NaN | NaN | 583.486486 |
BAKER (003) | 12282.0 | 19369.0 | 510.0 | 623.0 | 32162.0 | 52 | 30 | 7 | 8 | 89 | 236.192308 | 645.633333 | 72.857143 | 77.875 | 361.370787 |
BAY (005) | 109628.0 | 98834.0 | 1577.0 | 10120.0 | 210039.0 | 63 | 49 | 10 | 11 | 122 | 1740.126984 | 2017.020408 | 157.700000 | 920.000 | 1721.631148 |
BRADFORD (007) | 3217.0 | 5492.0 | 620.0 | 620.0 | 9330.0 | 37 | 23 | 2 | 2 | 62 | 86.945946 | 238.782609 | 310.000000 | 310.000 | 150.483871 |
BREVARD (009) | 159959.0 | 204179.0 | 0.0 | 0.0 | 364138.0 | 81 | 160 | 0 | 0 | 241 | 1974.802469 | 1276.118750 | NaN | NaN | 1510.946058 |
If you unstack a DataFrame that doesn’t have a MultiIndex with at least two levels, you’ll get a one-dimensional Series instead of a DataFrame.
[64]:
bridges4.unstack()
[64]:
measure condition County
Area Fair ALACHUA (001) 8973.000000
BAKER (003) 12282.000000
BAY (005) 109628.000000
BRADFORD (007) 3217.000000
BREVARD (009) 159959.000000
...
AvgArea Total UNION (125) 447.107143
VOLUSIA (127) 1439.052863
WAKULLA (129) 355.660000
WALTON (131) 475.378995
WASHINGTON (133) 751.208333
Length: 1005, dtype: float64
The pandas documentation includes a lot more detail about reshaping data, including a variety of other tools for reshaping in other ways.
A common task in data analysis is to conduct group-wise analysis. This involves splitting a DataFrame into groups of related rows, then computing some sort of measure on each group. The results are then usually combined into a final coherent data structure, generally having one row for each original row (a transformation), or one row for each group (an aggregation). It is possible to introduce some filtering in the process so the results do not conform to one of these patterns, but that is a more advanced topic that will not be discussed here.
For example, we might group counties by district, and then total the population values, to get to total population by district:
[65]:
fl_2.head()
[65]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | County | District | |
---|---|---|---|---|---|---|---|
0 | Miami-Dade County | 2751796.0 | 9.754518 | 1897.72 | 1450.053749 | Miami-Dade | 6 |
1 | Broward County | 1935878.0 | 10.438146 | 1209.79 | 1600.176890 | Broward | 4 |
2 | Palm Beach County | 1471150.0 | 11.134781 | 1969.76 | 746.867639 | Palm Beach | 4 |
3 | Hillsborough County | 1408566.0 | 14.187346 | 1020.21 | 1380.662805 | Hillsborough | 7 |
4 | Orange County | 1348975.0 | 17.435335 | 903.43 | 1493.170473 | Orange | 5 |
[66]:
fl_2.groupby('District')[['2019 Population']].sum()
[66]:
2019 Population | |
---|---|
District | |
1 | 3048172.0 |
2 | 2139433.0 |
3 | 1468387.0 |
4 | 4034840.0 |
5 | 4227713.0 |
6 | 2828809.0 |
7 | 3237046.0 |
The splitting into groups is done using the groupby method of DataFrames, which is fairly flexible—it can be used to group rows or columns, and to group on an index or on values. Pandas provides an extensive tutorial that covers groupby in more depth. In this tutorial, we’ll focus on the most common tasks for data analysis in the transportation context (and, indeed, in most data analysis): grouping rows using one or more columns, or a function thereof, as the grouping keys.
In the example above, the groupby method is called with a single argument, ‘District’, which indicates that the groups are delineated by common values of the column named ‘District’. Grouping by an identifier like this is convenient, because the identifiers are naturally well defined and the groups are inherently meaningful.
As with merging, groupby keys need to be actually identical in order to match. Attempting to use groupby directly with continuous data that is not generally categorical in nature will not give good results – you can expect that more or less every row will be its own group. Instead, if you want to group rows in this way, you’ll need to convert the data to categorical first. This can be done using pandas.cut or pandas.qcut.
  * cut defines categorical bins either by dividing the entire range of values into a number of equal sized bins (by giving the number of bins as an integer) or by giving a list of explicit breakpoints for the bins.
  * qcut defines categorical bins either by setting the bin edges so that each bin contains a roughly equal number of observations (by giving the number of bins as an integer) or by giving a list of explicit quantile levels for the bin breakpoints (e.g., [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]).
Let’s use cut to create five categorical values of county population:
[67]:
population_binned = pd.cut(fl_2['2019 Population'], bins=5)
population_binned.head()
[67]:
0 (2203085.2, 2751796.0]
1 (1654374.4, 2203085.2]
2 (1105663.6, 1654374.4]
3 (1105663.6, 1654374.4]
4 (1105663.6, 1654374.4]
Name: 2019 Population, dtype: category
Categories (5, interval[float64]): [(5498.446, 556952.8] < (556952.8, 1105663.6] < (1105663.6, 1654374.4] < (1654374.4, 2203085.2] < (2203085.2, 2751796.0]]
[68]:
population_binned
[68]:
0 (2203085.2, 2751796.0]
1 (1654374.4, 2203085.2]
2 (1105663.6, 1654374.4]
3 (1105663.6, 1654374.4]
4 (1105663.6, 1654374.4]
...
62 (5498.446, 556952.8]
63 (5498.446, 556952.8]
64 (5498.446, 556952.8]
65 (5498.446, 556952.8]
66 (5498.446, 556952.8]
Name: 2019 Population, Length: 67, dtype: category
Categories (5, interval[float64]): [(5498.446, 556952.8] < (556952.8, 1105663.6] < (1105663.6, 1654374.4] < (1654374.4, 2203085.2] < (2203085.2, 2751796.0]]
Those bins divide the range of population values evenly into 5 bins, but the bin edges do appear a bit random. We can set more reasonable values by explicitly giving bin boundaries that are round numbers of our choosing:
[69]:
bins=[5_000, 500_000, 1_000_000, 1_500_000, 2_000_000, 3_000_000]
population_binned = pd.cut(fl_2['2019 Population'], bins=bins)
population_binned.head()
[69]:
0 (2000000, 3000000]
1 (1500000, 2000000]
2 (1000000, 1500000]
3 (1000000, 1500000]
4 (1000000, 1500000]
Name: 2019 Population, dtype: category
Categories (5, interval[int64]): [(5000, 500000] < (500000, 1000000] < (1000000, 1500000] < (1500000, 2000000] < (2000000, 3000000]]
We can then pass this categorical data directly to the groupby method, instead of naming existing columns in the DataFrame to use as the groupby keys.
[70]:
fl_2.groupby(population_binned)[['2019 Population']].sum()
[70]:
2019 Population | |
---|---|
2019 Population | |
(5000, 500000] | 7080260.0 |
(500000, 1000000] | 4987775.0 |
(1000000, 1500000] | 4228691.0 |
(1500000, 2000000] | 1935878.0 |
(2000000, 3000000] | 2751796.0 |
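For comparison, a quick qcut sketch that splits the counties into four bins, each holding roughly the same number of counties:

# quartile bins: each bin contains about a quarter of the counties
population_quartiles = pd.qcut(fl_2['2019 Population'], 4)
fl_2.groupby(population_quartiles)[['2019 Population']].count()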
The examples above all use the sum method, which is an aggregate function: it takes a set of values drawn from a group of rows, and returns a single output value. Other common aggregation functions include:
function | description |
---|---|
mean | within-group means |
std | within-group standard deviation |
min | within-group minimum |
max | within-group maximum |
first | first value in group |
last | last value in group |
size | number of group members including NaN |
count | number of group members excluding NaN |
When aggregation functions are applied, the result generally has one row per group, instead of one row per original row.
Multiple aggregation functions can be applied simultaneously using the agg method, like this:
[71]:
fl_2.groupby(population_binned)[['2019 Population']].agg([np.mean, np.std, np.size])
[71]:
2019 Population | |||
---|---|---|---|
mean | std | size | |
2019 Population | |||
(5000, 500000] | 1.287320e+05 | 130660.887798 | 55.0 |
(500000, 1000000] | 7.125393e+05 | 182201.018266 | 7.0 |
(1000000, 1500000] | 1.409564e+06 | 61093.609816 | 3.0 |
(1500000, 2000000] | 1.935878e+06 | NaN | 1.0 |
(2000000, 3000000] | 2.751796e+06 | NaN | 1.0 |
In contrast with aggregation, which returns a value for each group, transform returns a value for each row in the original DataFrame. Many of the same functions that work with agg can also be used with transform, and the results are simply broadcast back to the original rows (i.e., a copy of the relevant aggregated value is attached to each original row, so that if there are multiple rows in a group, then there are multiple copies of the aggregate output). The return value is a new Series or DataFrame, indexed the same as the original, which makes it easy to use transform to create new derived columns based on group-wise aggregations or transforms.
For example, to add a new column that contains the FDOT District total population for each county, we can give the sum function to transform, and write the result as a new column in the original DataFrame:
[72]:
pop_sum = fl_2.groupby('District')['2019 Population'].transform(sum)
pop_sum.head()
[72]:
0 2828809.0
1 4034840.0
2 4034840.0
3 3237046.0
4 4227713.0
Name: 2019 Population, dtype: float64
[73]:
fl_2.head()
[73]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | County | District | |
---|---|---|---|---|---|---|---|
0 | Miami-Dade County | 2751796.0 | 9.754518 | 1897.72 | 1450.053749 | Miami-Dade | 6 |
1 | Broward County | 1935878.0 | 10.438146 | 1209.79 | 1600.176890 | Broward | 4 |
2 | Palm Beach County | 1471150.0 | 11.134781 | 1969.76 | 746.867639 | Palm Beach | 4 |
3 | Hillsborough County | 1408566.0 | 14.187346 | 1020.21 | 1380.662805 | Hillsborough | 7 |
4 | Orange County | 1348975.0 | 17.435335 | 903.43 | 1493.170473 | Orange | 5 |
The second and third counties in the fl_2 DataFrame are both in District 4, so both rows of this output give the total population in District 4. Generating the output like this makes it super easy to attach the result as a new column in the DataFrame:
[74]:
fl_2['District Population'] = fl_2.groupby('District')['2019 Population'].transform(sum)
[75]:
fl_2.head()
[75]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | County | District | District Population | |
---|---|---|---|---|---|---|---|---|
0 | Miami-Dade County | 2751796.0 | 9.754518 | 1897.72 | 1450.053749 | Miami-Dade | 6 | 2828809.0 |
1 | Broward County | 1935878.0 | 10.438146 | 1209.79 | 1600.176890 | Broward | 4 | 4034840.0 |
2 | Palm Beach County | 1471150.0 | 11.134781 | 1969.76 | 746.867639 | Palm Beach | 4 | 4034840.0 |
3 | Hillsborough County | 1408566.0 | 14.187346 | 1020.21 | 1380.662805 | Hillsborough | 7 | 3237046.0 |
4 | Orange County | 1348975.0 | 17.435335 | 903.43 | 1493.170473 | Orange | 5 | 4227713.0 |
In addition to expanding aggregate measures to match with the rows of the original DataFrame, the transform function can also generate unique row-wise values, which can be computed relative to the group instead of relative to the entire DataFrame. For example, to compute the fraction of the population of each district living in each county:
[76]:
fl_2['Fraction of Pop'] = fl_2.groupby('District')['2019 Population'].transform(lambda x: (x / x.sum()))
fl_2.head()
[76]:
Name | 2019 Population | Growth Since 2010 | Land Area | Population Density | County | District | District Population | Fraction of Pop | |
---|---|---|---|---|---|---|---|---|---|
0 | Miami-Dade County | 2751796.0 | 9.754518 | 1897.72 | 1450.053749 | Miami-Dade | 6 | 2828809.0 | 0.972775 |
1 | Broward County | 1935878.0 | 10.438146 | 1209.79 | 1600.176890 | Broward | 4 | 4034840.0 | 0.479791 |
2 | Palm Beach County | 1471150.0 | 11.134781 | 1969.76 | 746.867639 | Palm Beach | 4 | 4034840.0 | 0.364612 |
3 | Hillsborough County | 1408566.0 | 14.187346 | 1020.21 | 1380.662805 | Hillsborough | 7 | 3237046.0 | 0.435139 |
4 | Orange County | 1348975.0 | 17.435335 | 903.43 | 1493.170473 | Orange | 5 | 4227713.0 | 0.319079 |