Using Pivot Tables¶

[1]:

import transportation_tutorials as tt
import pandas as pd
import numpy as np

Questions¶

Within the Jupiter study area, what is the average distance for bike tours to work? (Hint: It is 4.03 miles)
What tour purpose has the highest average tour distance? (Hint: Work tours)
What is the median distance of walking for all tour purposes? (Hint: 0.548 miles)

Data¶

To answer the questions, use the following data file:

[11]:

tour = pd.read_csv(tt.data('SERPM8-BASE2015-TOURS'))
tour.head()

[11]:

	hh_id	person_id	person_num	person_type	tour_id	tour_category	tour_purpose	orig_mgra	dest_mgra	start_period	end_period	tour_mode	tour_distance	tour_time	atWork_freq	num_ob_stops	util_1	util_2	util_3	util_4	util_5	util_6	util_7	util_8	util_9	util_10	util_11	util_12	util_13	util_14	util_15	util_16	util_17	util_18	util_19	util_20	prob_1	prob_3	prob_4	prob_5	prob_6	prob_9	prob_11
0	1690841	4502948	1	1	0	MANDATORY	Work	7736	9290	8	29	6	22.261	32.311001	6	0	-1.395076	-998.970825	-2.360061	-2.051275	-2.361161	-1.139561	-999.0	-999.0	-996.401001	-997.447021	-996.244019	-996.244019	-997.664978	-999.883789	-999.883789	-1001.335999	-999.883789	-999.883789	-1001.335999	-999.0	0.340576	0.000096	0.091587	0.000000	0.567742	0.000000	0.000000
1	1690841	4502948	1	1	1	AT_WORK	Work-Based	9290	7980	13	14	1	1.910	4.752000	0	0	-0.259837	-998.668518	-0.758937	-999.608398	-0.687137	-999.858276	-999.0	-999.0	-5.843162	-28.534241	-8.014107	-1024.000000	-1024.000000	-30.581329	-1024.000000	-32.709736	-30.581329	-1024.000000	-32.709736	-999.0	0.555677	0.204790	0.000000	0.236415	0.000000	0.002799	0.000319
2	1690841	4502948	1	1	2	AT_WORK	Work-Based	9290	10608	19	23	3	1.357	3.777000	0	0	-0.195235	-998.668518	-0.694335	-999.608398	-0.622535	-999.858276	-999.0	-999.0	-4.755019	-28.262205	-8.655456	-1024.000000	-1024.000000	-33.426655	-1024.000000	-31.728464	-33.426655	-1024.000000	-31.728464	-999.0	0.553007	0.203806	0.000000	0.235279	0.000000	0.007751	0.000157
3	1690841	4502949	2	2	0	MANDATORY	Work	7736	8289	27	30	4	30.930	55.431000	1	1	-2.548089	-998.970825	-3.130183	-1.918921	-3.237083	-1.397364	-999.0	-999.0	-996.401001	-997.447021	-996.244019	-996.244019	-997.664978	-999.883789	-999.883789	-1001.335999	-999.883789	-999.883789	-1001.335999	-999.0	0.068926	0.000000	0.242591	0.000000	0.688483	0.000000	0.000000
4	1690841	4502949	2	2	1	MANDATORY	Work	7736	8289	31	36	4	30.930	55.431000	1	0	-2.493357	-998.970825	-3.075933	-1.860832	-3.182833	-1.339275	-999.0	-999.0	-996.401001	-997.447021	-996.244019	-996.244019	-997.664978	-999.883789	-999.883789	-1001.335999	-999.883789	-999.883789	-1001.335999	-999.0	0.068497	0.000000	0.242703	0.000000	0.688801	0.000000	0.000000

[13]:

tour_mode_dict = {
"DRIVEALONEFREE",
"DRIVEALONEPAY",
"SHARED2GP",
"SHARED2PAY",
"SHARED3GP",
"SHARED3PAY",
"TNCALONE",
"TNCSHARED",
"WALK",
"BIKE",
"WALK_MIX",
"WALK_PRMW",
"WALK_PRMD",
"PNR_MIX",
"PNR_PRMW",
"PNR_PRMD",
"KNR_MIX",
"KNR_PRMW",
"KNR_PRMD",
"SCHBUS",
}

Solution¶

To find the answers in this exercise, we need to deal with three variables: tour_mode, tour_purpose and tour_distance. Let’s look at the data type of these three variables. We use .info() to view the data types of dataframe columns.

[6]:

tour[['tour_mode', 'tour_purpose', 'tour_distance']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47485 entries, 0 to 47484
Data columns (total 3 columns):
tour_mode        47485 non-null int64
tour_purpose     47485 non-null object
tour_distance    47485 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.1+ MB

As we can see, tour_mode column in the data is categorized as integers. It is often helpful to have string category for easier exploration. Given that we know the notations of integer categories, we can convert the integer categories to string categories using a dictionary like tour_mode_dict.

[14]:

tour['tour_mode'] = tour['tour_mode'].map(tour_mode_dict)

[15]:

tour.tour_mode.value_counts()

[15]:

DRIVEALONEFREE    25911
SHARED2GP         10013
SHARED3GP          4857
DRIVEALONEPAY      2057
SCHBUS             1590
WALK               1213
BIKE                674
SHARED2PAY          500
SHARED3PAY          465
WALK_MIX            171
PNR_PRMW             11
KNR_PRMD              9
KNR_MIX               5
KNR_PRMW              3
PNR_PRMD              3
PNR_MIX               3
Name: tour_mode, dtype: int64

[16]:

tour.tour_purpose.value_counts()

[16]:

Work             16238
Shop              6291
School            5605
Discretionary     4980
Maintenance       4787
Eating Out        3064
Visiting          2641
Work-Based        2260
University         820
Escort             799
Name: tour_purpose, dtype: int64

Now that we have the perfect data types for required variables, we create a pivot table using .pivot_table() method and set the categorical columns (tour_mode and tour_purpose) in tour dataframe as index and columns. Then, we can set the continuous variable column, tour_distance as values in the .pivot_table() method.

Please note that .pivot_table() method by default calculates the average values of the column that is set as values.

[17]:

tour.pivot_table(index = 'tour_mode', columns = 'tour_purpose', values = 'tour_distance')

[17]:

tour_purpose	Discretionary	Eating Out	Escort	Maintenance	School	Shop	University	Visiting	Work	Work-Based
tour_mode
BIKE	2.764216	3.739237	NaN	2.872279	3.100353	4.323302	4.953091	2.448940	4.034763	NaN
DRIVEALONEFREE	5.681107	6.299282	3.848886	4.993209	4.396832	8.658878	7.974719	6.533719	12.089045	3.387014
DRIVEALONEPAY	38.526133	21.259400	23.726999	22.798727	NaN	59.314263	NaN	34.439682	38.741760	19.173934
KNR_MIX	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	33.410000	NaN
KNR_PRMD	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	37.558556	NaN
KNR_PRMW	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	39.154332	NaN
PNR_MIX	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	21.311667	NaN
PNR_PRMD	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	61.115000	NaN
PNR_PRMW	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	58.980637	NaN
SCHBUS	NaN	NaN	NaN	NaN	5.816528	NaN	NaN	NaN	NaN	NaN
SHARED2GP	7.469143	6.521744	4.544895	6.026752	4.176316	9.935834	8.809288	8.371339	13.427365	4.131103
SHARED2PAY	26.186025	22.228364	23.677999	28.893666	NaN	59.844746	21.773001	33.488765	39.481301	13.410200
SHARED3GP	7.622825	6.677294	6.378542	6.842151	4.308441	10.631094	8.269645	8.709989	13.986657	4.055721
SHARED3PAY	21.648273	22.139000	25.274000	20.237000	NaN	59.383289	16.806000	38.686500	37.353821	13.419333
WALK	0.637523	0.906817	0.491000	0.782691	0.973146	0.810000	1.343474	0.612707	0.714053	0.385786
WALK_MIX	3.486778	3.483111	5.693667	4.597381	2.434500	8.946885	3.589333	3.416000	9.022067	0.754857

From the pivot table that we just created, we can see that the average distance of biking to work is 4.034763 mile, which is the answer to our first question listed above.

To answer the second question, we just need a one-dimensional pivot table that we can create by ignoring either index or columns in .pivot_table() method.

[18]:

tour.pivot_table(columns = 'tour_purpose', values = 'tour_distance')

[18]:

tour_purpose	Discretionary	Eating Out	Escort	Maintenance	School	Shop	University	Visiting	Work	Work-Based
tour_distance	6.355063	6.341543	4.266534	5.413463	4.53435	13.421542	7.796431	7.534867	15.804213	3.841985

It is pretty clear that the work tours have the highest average tour distance as shown in the table above.

If we want to create a pivot table with median values of the column assigned in values, we can use aggfunc argument to specify that requirement. Again, we can use a one-dimensional pivot table to find the answer of third question.

[11]:

tour.pivot_table(index = 'tour_mode', values = 'tour_distance', aggfunc = ['median'])

[11]:

	median
	tour_distance
tour_mode
BIKE	2.890500
DRIVEALONEFREE	6.098000
DRIVEALONEPAY	31.275000
KNR_MIX	29.497999
KNR_PRMD	36.278000
KNR_PRMW	18.084999
PNR_MIX	16.518000
PNR_PRMD	55.566002
PNR_PRMW	56.125999
SCHBUS	2.624000
SHARED2GP	5.912000
SHARED2PAY	31.558000
SHARED3GP	5.085000
SHARED3PAY	30.006001
WALK	0.548000
WALK_MIX	3.294000

Looking at the table, it is obvious that the median walking distance for all kinds of tour is almost half a mile (0.548 mile).

Table Of Contents

Using Pivot Tables¶

Questions¶

Data¶

Solution¶