[1]:
import transportation_tutorials as tt
import pandas as pd
import numpy as np
To answer the questions, use the following data file:
[11]:
tour = pd.read_csv(tt.data('SERPM8-BASE2015-TOURS'))
tour.head()
[11]:
hh_id | person_id | person_num | person_type | tour_id | tour_category | tour_purpose | orig_mgra | dest_mgra | start_period | end_period | tour_mode | tour_distance | tour_time | atWork_freq | num_ob_stops | num_ib_stops | out_btap | out_atap | in_btap | in_atap | util_1 | util_2 | util_3 | util_4 | util_5 | util_6 | util_7 | util_8 | util_9 | util_10 | util_11 | util_12 | util_13 | util_14 | util_15 | util_16 | util_17 | util_18 | util_19 | util_20 | prob_1 | prob_2 | prob_3 | prob_4 | prob_5 | prob_6 | prob_7 | prob_8 | prob_9 | prob_10 | prob_11 | prob_12 | prob_13 | prob_14 | prob_15 | prob_16 | prob_17 | prob_18 | prob_19 | prob_20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1690841 | 4502948 | 1 | 1 | 0 | MANDATORY | Work | 7736 | 9290 | 8 | 29 | 6 | 22.261 | 32.311001 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | -1.395076 | -998.970825 | -2.360061 | -2.051275 | -2.361161 | -1.139561 | -999.0 | -999.0 | -996.401001 | -997.447021 | -996.244019 | -996.244019 | -997.664978 | -999.883789 | -999.883789 | -1001.335999 | -999.883789 | -999.883789 | -1001.335999 | -999.0 | 0.340576 | 0.0 | 0.000096 | 0.091587 | 0.000000 | 0.567742 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1690841 | 4502948 | 1 | 1 | 1 | AT_WORK | Work-Based | 9290 | 7980 | 13 | 14 | 1 | 1.910 | 4.752000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.259837 | -998.668518 | -0.758937 | -999.608398 | -0.687137 | -999.858276 | -999.0 | -999.0 | -5.843162 | -28.534241 | -8.014107 | -1024.000000 | -1024.000000 | -30.581329 | -1024.000000 | -32.709736 | -30.581329 | -1024.000000 | -32.709736 | -999.0 | 0.555677 | 0.0 | 0.204790 | 0.000000 | 0.236415 | 0.000000 | 0.0 | 0.0 | 0.002799 | 0.0 | 0.000319 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 1690841 | 4502948 | 1 | 1 | 2 | AT_WORK | Work-Based | 9290 | 10608 | 19 | 23 | 3 | 1.357 | 3.777000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.195235 | -998.668518 | -0.694335 | -999.608398 | -0.622535 | -999.858276 | -999.0 | -999.0 | -4.755019 | -28.262205 | -8.655456 | -1024.000000 | -1024.000000 | -33.426655 | -1024.000000 | -31.728464 | -33.426655 | -1024.000000 | -31.728464 | -999.0 | 0.553007 | 0.0 | 0.203806 | 0.000000 | 0.235279 | 0.000000 | 0.0 | 0.0 | 0.007751 | 0.0 | 0.000157 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1690841 | 4502949 | 2 | 2 | 0 | MANDATORY | Work | 7736 | 8289 | 27 | 30 | 4 | 30.930 | 55.431000 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | -2.548089 | -998.970825 | -3.130183 | -1.918921 | -3.237083 | -1.397364 | -999.0 | -999.0 | -996.401001 | -997.447021 | -996.244019 | -996.244019 | -997.664978 | -999.883789 | -999.883789 | -1001.335999 | -999.883789 | -999.883789 | -1001.335999 | -999.0 | 0.068926 | 0.0 | 0.000000 | 0.242591 | 0.000000 | 0.688483 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 1690841 | 4502949 | 2 | 2 | 1 | MANDATORY | Work | 7736 | 8289 | 31 | 36 | 4 | 30.930 | 55.431000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | -2.493357 | -998.970825 | -3.075933 | -1.860832 | -3.182833 | -1.339275 | -999.0 | -999.0 | -996.401001 | -997.447021 | -996.244019 | -996.244019 | -997.664978 | -999.883789 | -999.883789 | -1001.335999 | -999.883789 | -999.883789 | -1001.335999 | -999.0 | 0.068497 | 0.0 | 0.000000 | 0.242703 | 0.000000 | 0.688801 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
[13]:
tour_mode_dict = {
1: "DRIVEALONEFREE",
2: "DRIVEALONEPAY",
3: "SHARED2GP",
4: "SHARED2PAY",
5: "SHARED3GP",
6: "SHARED3PAY",
7: "TNCALONE",
8: "TNCSHARED",
9: "WALK",
10: "BIKE",
11: "WALK_MIX",
12: "WALK_PRMW",
13: "WALK_PRMD",
14: "PNR_MIX",
15: "PNR_PRMW",
16: "PNR_PRMD",
17: "KNR_MIX",
18: "KNR_PRMW",
19: "KNR_PRMD",
20: "SCHBUS",
}
To find the answers in this exercise, we need to deal with three variables: tour_mode
, tour_purpose
and tour_distance
. Let’s look at the data type of these three variables. We use .info()
to view the data types of dataframe columns.
[6]:
tour[['tour_mode', 'tour_purpose', 'tour_distance']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47485 entries, 0 to 47484
Data columns (total 3 columns):
tour_mode 47485 non-null int64
tour_purpose 47485 non-null object
tour_distance 47485 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.1+ MB
As we can see, tour_mode
column in the data is categorized as integers. It is often helpful to have string category for easier exploration. Given that we know the notations of integer categories, we can convert the integer categories to string categories using a dictionary like tour_mode_dict
.
[14]:
tour['tour_mode'] = tour['tour_mode'].map(tour_mode_dict)
[15]:
tour.tour_mode.value_counts()
[15]:
DRIVEALONEFREE 25911
SHARED2GP 10013
SHARED3GP 4857
DRIVEALONEPAY 2057
SCHBUS 1590
WALK 1213
BIKE 674
SHARED2PAY 500
SHARED3PAY 465
WALK_MIX 171
PNR_PRMW 11
KNR_PRMD 9
KNR_MIX 5
KNR_PRMW 3
PNR_PRMD 3
PNR_MIX 3
Name: tour_mode, dtype: int64
[16]:
tour.tour_purpose.value_counts()
[16]:
Work 16238
Shop 6291
School 5605
Discretionary 4980
Maintenance 4787
Eating Out 3064
Visiting 2641
Work-Based 2260
University 820
Escort 799
Name: tour_purpose, dtype: int64
Now that we have the perfect data types for required variables, we create a pivot table using .pivot_table()
method and set the categorical columns (tour_mode
and tour_purpose
) in tour
dataframe as index
and columns
. Then, we can set the continuous variable column, tour_distance
as values
in the .pivot_table()
method.
Please note that .pivot_table()
method by default calculates the average values of the column that is set as values
.
[17]:
tour.pivot_table(index = 'tour_mode', columns = 'tour_purpose', values = 'tour_distance')
[17]:
tour_purpose | Discretionary | Eating Out | Escort | Maintenance | School | Shop | University | Visiting | Work | Work-Based |
---|---|---|---|---|---|---|---|---|---|---|
tour_mode | ||||||||||
BIKE | 2.764216 | 3.739237 | NaN | 2.872279 | 3.100353 | 4.323302 | 4.953091 | 2.448940 | 4.034763 | NaN |
DRIVEALONEFREE | 5.681107 | 6.299282 | 3.848886 | 4.993209 | 4.396832 | 8.658878 | 7.974719 | 6.533719 | 12.089045 | 3.387014 |
DRIVEALONEPAY | 38.526133 | 21.259400 | 23.726999 | 22.798727 | NaN | 59.314263 | NaN | 34.439682 | 38.741760 | 19.173934 |
KNR_MIX | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 33.410000 | NaN |
KNR_PRMD | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 37.558556 | NaN |
KNR_PRMW | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 39.154332 | NaN |
PNR_MIX | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 21.311667 | NaN |
PNR_PRMD | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 61.115000 | NaN |
PNR_PRMW | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 58.980637 | NaN |
SCHBUS | NaN | NaN | NaN | NaN | 5.816528 | NaN | NaN | NaN | NaN | NaN |
SHARED2GP | 7.469143 | 6.521744 | 4.544895 | 6.026752 | 4.176316 | 9.935834 | 8.809288 | 8.371339 | 13.427365 | 4.131103 |
SHARED2PAY | 26.186025 | 22.228364 | 23.677999 | 28.893666 | NaN | 59.844746 | 21.773001 | 33.488765 | 39.481301 | 13.410200 |
SHARED3GP | 7.622825 | 6.677294 | 6.378542 | 6.842151 | 4.308441 | 10.631094 | 8.269645 | 8.709989 | 13.986657 | 4.055721 |
SHARED3PAY | 21.648273 | 22.139000 | 25.274000 | 20.237000 | NaN | 59.383289 | 16.806000 | 38.686500 | 37.353821 | 13.419333 |
WALK | 0.637523 | 0.906817 | 0.491000 | 0.782691 | 0.973146 | 0.810000 | 1.343474 | 0.612707 | 0.714053 | 0.385786 |
WALK_MIX | 3.486778 | 3.483111 | 5.693667 | 4.597381 | 2.434500 | 8.946885 | 3.589333 | 3.416000 | 9.022067 | 0.754857 |
From the pivot table that we just created, we can see that the average distance of biking to work is 4.034763 mile, which is the answer to our first question listed above.
To answer the second question, we just need a one-dimensional pivot table that we can create by ignoring either index
or columns
in .pivot_table()
method.
[18]:
tour.pivot_table(columns = 'tour_purpose', values = 'tour_distance')
[18]:
tour_purpose | Discretionary | Eating Out | Escort | Maintenance | School | Shop | University | Visiting | Work | Work-Based |
---|---|---|---|---|---|---|---|---|---|---|
tour_distance | 6.355063 | 6.341543 | 4.266534 | 5.413463 | 4.53435 | 13.421542 | 7.796431 | 7.534867 | 15.804213 | 3.841985 |
It is pretty clear that the work tours have the highest average tour distance as shown in the table above.
If we want to create a pivot table with median values of the column assigned in values
, we can use aggfunc
argument to specify that requirement. Again, we can use a one-dimensional pivot table to find the answer of third question.
[11]:
tour.pivot_table(index = 'tour_mode', values = 'tour_distance', aggfunc = ['median'])
[11]:
median | |
---|---|
tour_distance | |
tour_mode | |
BIKE | 2.890500 |
DRIVEALONEFREE | 6.098000 |
DRIVEALONEPAY | 31.275000 |
KNR_MIX | 29.497999 |
KNR_PRMD | 36.278000 |
KNR_PRMW | 18.084999 |
PNR_MIX | 16.518000 |
PNR_PRMD | 55.566002 |
PNR_PRMW | 56.125999 |
SCHBUS | 2.624000 |
SHARED2GP | 5.912000 |
SHARED2PAY | 31.558000 |
SHARED3GP | 5.085000 |
SHARED3PAY | 30.006001 |
WALK | 0.548000 |
WALK_MIX | 3.294000 |
Looking at the table, it is obvious that the median walking distance for all kinds of tour is almost half a mile (0.548 mile).