[1]:
import transportation_tutorials as tt
import pandas as pd
import numpy as np
import statsmodels.api as sm
Construct an ordinary least squares (OLS) linear regression model to predict the value of time given for each individual in the Jupiter study area data as a function of:

- age,
- gender,
- full-time employment status, and
- household income.

Evaluate this model to answer the questions.

To answer the questions, use the following data files:
[2]:
per = pd.read_csv(tt.data('SERPM8-BASE2015-PERSONS'))
hh = pd.read_csv(tt.data('SERPM8-BASE2015-HOUSEHOLDS'))
[3]:
per.head()
[3]:
| | hh_id | person_id | person_num | age | gender | type | value_of_time | activity_pattern | imf_choice | inmf_choice | fp_choice | reimb_pct | wrkr_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1690841 | 4502948 | 1 | 46 | m | Full-time worker | 5.072472 | M | 1 | 1 | -1 | 0.0 | 0 |
| 1 | 1690841 | 4502949 | 2 | 47 | f | Part-time worker | 5.072472 | M | 2 | 37 | -1 | 0.0 | 0 |
| 2 | 1690841 | 4502950 | 3 | 11 | f | Student of non-driving age | 3.381665 | M | 3 | 1 | -1 | 0.0 | 0 |
| 3 | 1690841 | 4502951 | 4 | 8 | m | Student of non-driving age | 3.381665 | M | 3 | 1 | -1 | 0.0 | 0 |
| 4 | 1690961 | 4503286 | 1 | 52 | m | Part-time worker | 2.447870 | M | 1 | 2 | -1 | 0.0 | 0 |
[4]:
hh.head()
[4]:
| | Unnamed: 0 | hh_id | home_mgra | income | autos | transponder | cdap_pattern | jtf_choice | autotech | tncmemb |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 426629 | 1690841 | 7736 | 512000 | 2 | 1 | MMMM0 | 0 | 0 | 0 |
| 1 | 426630 | 1690961 | 7736 | 27500 | 1 | 0 | MNMM0 | 0 | 0 | 0 |
| 2 | 426631 | 1690866 | 7736 | 150000 | 2 | 0 | HMM0 | 0 | 0 | 0 |
| 3 | 426632 | 1690895 | 7736 | 104000 | 2 | 1 | MMMM0 | 0 | 0 | 0 |
| 4 | 426633 | 1690933 | 7736 | 95000 | 2 | 1 | MNM0 | 0 | 0 | 0 |
First, we bring household income in from the `hh` dataframe by merging it with the `per` dataframe, so that each individual record carries its household's income. We also bring in `autos` and `transponder` for later use.
[5]:
per = pd.merge(per, hh[['hh_id', 'income', 'autos', 'transponder']], on='hh_id', how='inner')
Then, we create a couple of dummy variables with binary values to include as explanatory variables in the model estimation. We create the `female` and `full_time` variables to capture the categorical effect of gender and of full-time employment status on the model outcome. We can also rescale the `income` variable to give it a more reasonable variance in the estimation; for example, we can simply divide it by 100,000 so that it is expressed in units of $100K.
[6]:
per['female'] = np.where((per.gender == 'f'), 1, 0)
per['full_time'] = np.where((per.type == 'Full-time worker'), 1, 0)
per['hh_income(100k)'] = per['income'] / 100000
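As a quick sanity check (a sketch, not part of the original notebook), we can cross-tabulate the new dummy columns against the source columns to confirm the coding is what we intended:

```python
# each original category should map onto exactly one dummy value
print(pd.crosstab(per['gender'], per['female']))
print(pd.crosstab(per['type'], per['full_time']))
```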
[7]:
per.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40683 entries, 0 to 40682
Data columns (total 19 columns):
hh_id 40683 non-null int64
person_id 40683 non-null int64
person_num 40683 non-null int64
age 40683 non-null int64
gender 40683 non-null object
type 40683 non-null object
value_of_time 40683 non-null float64
activity_pattern 40683 non-null object
imf_choice 40683 non-null int64
inmf_choice 40683 non-null int64
fp_choice 40683 non-null int64
reimb_pct 40683 non-null float64
wrkr_type 40683 non-null int64
income 40683 non-null int64
autos 40683 non-null int64
transponder 40683 non-null int64
female 40683 non-null int64
full_time 40683 non-null int64
hh_income(100k) 40683 non-null float64
dtypes: float64(3), int64(13), object(3)
memory usage: 6.2+ MB
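If we want an explicit count of missing values rather than reading it off the `.info()` output above, a quick check along these lines (a minimal sketch on the merged `per` dataframe) works:

```python
# count missing values in the columns that will enter the regression;
# all counts should be zero before estimating the model
cols = ['value_of_time', 'age', 'female', 'full_time', 'hh_income(100k)']
print(per[cols].isnull().sum())
```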
At this point, we have the dataframe ready with all explanatory variables (`age`, `female`, `full_time`, and `hh_income(100k)`) and the response variable (`value_of_time`). We check the data types of all variables and the presence of null values. If everything looks appropriate, we create a model object using `sm.OLS()`. Inside this call, we add a constant to the explanatory variables of the regression model using `sm.add_constant()`. Then, we fit the model with the `.fit()` method and store the estimation results in a variable.
[8]:
model = sm.OLS(per['value_of_time'], sm.add_constant(per[['age', 'female', 'full_time', 'hh_income(100k)']]))
result = model.fit()
We can print a summary of the model estimation using the `.summary()` method, to review a number of statistical outputs from the model, including the model coefficients.
[9]:
print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: value_of_time R-squared: 0.036
Model: OLS Adj. R-squared: 0.036
Method: Least Squares F-statistic: 384.5
Date: Thu, 08 Aug 2019 Prob (F-statistic): 0.00
Time: 15:02:14 Log-Likelihood: -1.4546e+05
No. Observations: 40683 AIC: 2.909e+05
Df Residuals: 40678 BIC: 2.910e+05
Df Model: 4
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
const 6.9651 0.116 60.046 0.000 6.738 7.192
age 0.0361 0.002 20.006 0.000 0.033 0.040
female -0.0345 0.087 -0.399 0.690 -0.204 0.135
full_time 1.7770 0.089 19.855 0.000 1.602 1.952
hh_income(100k) 0.9476 0.037 25.937 0.000 0.876 1.019
==============================================================================
Omnibus: 17005.353 Durbin-Watson: 0.353
Prob(Omnibus): 0.000 Jarque-Bera (JB): 77628.651
Skew: 2.042 Prob(JB): 0.00
Kurtosis: 8.395 Cond. No. 155.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From the model estimation summary, we see that full-time employment has a substantial positive effect on value of time. Increases in household income and age also contribute to an increase in value of time. The t-statistics for all of these parameters are much larger than 1.96, indicating they are almost certainly significant in determining a person's value of time. However, the t-statistic for gender is small, with a magnitude of only about 0.4. This suggests that gender is not a statistically significant factor in determining the value of time.
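The t-statistics and p-values quoted above can also be pulled directly from the fitted results object rather than read off the printed summary; this is a minimal sketch using the `result` variable fitted above:

```python
# per-coefficient t-statistics and two-sided p-values from the fitted model
print(result.tvalues.round(2))
print(result.pvalues.round(4))

# flag which explanatory variables are significant at the 5% level
print(result.pvalues < 0.05)
```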
If we estimate the same model but add the number of automobiles owned by the person's household as an additional explanatory factor, we can see that automobile ownership is also a relevant and statistically significant factor (with a t-statistic of 11.8).
[10]:
model = sm.OLS(
per['value_of_time'],
sm.add_constant(per[['age', 'female', 'full_time', 'hh_income(100k)', 'autos']])
)
result = model.fit()
print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: value_of_time R-squared: 0.040
Model: OLS Adj. R-squared: 0.040
Method: Least Squares F-statistic: 336.2
Date: Thu, 08 Aug 2019 Prob (F-statistic): 0.00
Time: 15:02:14 Log-Likelihood: -1.4539e+05
No. Observations: 40683 AIC: 2.908e+05
Df Residuals: 40677 BIC: 2.909e+05
Df Model: 5
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
const 5.6471 0.161 35.013 0.000 5.331 5.963
age 0.0407 0.002 22.068 0.000 0.037 0.044
female 0.0060 0.086 0.069 0.945 -0.163 0.175
full_time 1.7309 0.089 19.354 0.000 1.556 1.906
hh_income(100k) 0.8326 0.038 22.047 0.000 0.759 0.907
autos 0.6224 0.053 11.740 0.000 0.518 0.726
==============================================================================
Omnibus: 17013.391 Durbin-Watson: 0.348
Prob(Omnibus): 0.000 Jarque-Bera (JB): 77754.288
Skew: 2.043 Prob(JB): 0.00
Kurtosis: 8.401 Cond. No. 203.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
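To check that adding `autos` genuinely improves the model rather than merely adding a parameter, the two specifications can be compared directly. The sketch below assumes both models are re-fit under separate names (`result_base` and `result_autos`), since the cells above reuse the single name `result`:

```python
# re-fit both specifications under distinct names so they can be compared
X_base = sm.add_constant(per[['age', 'female', 'full_time', 'hh_income(100k)']])
X_autos = sm.add_constant(per[['age', 'female', 'full_time', 'hh_income(100k)', 'autos']])

result_base = sm.OLS(per['value_of_time'], X_base).fit()
result_autos = sm.OLS(per['value_of_time'], X_autos).fit()

# F-test of the restricted model (without autos) against the expanded model;
# a small p-value indicates the added variable carries real explanatory power
f_value, p_value, df_diff = result_autos.compare_f_test(result_base)
print(f_value, p_value, df_diff)

# adjusted R-squared also penalizes the extra parameter
print(result_base.rsquared_adj, result_autos.rsquared_adj)
```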