[1]:
import transportation_tutorials as tt
import pandas as pd
import numpy as np
import statsmodels.api as sm
A popular package for developing linear regression models in Python is statsmodels. This package includes an extensive set of tools for statistical modeling, but in this tutorial we will focus on linear regression models.
Generally, linear regression models will be developed using a pandas.DataFrame containing both independent (explanatory) and dependent (target) variables. We’ll work with data in the households and trips tables from the Jupiter study area.
[2]:
hh = pd.read_csv(tt.data('SERPM8-BASE2015-HOUSEHOLDS'), index_col=0)
hh.set_index('hh_id', inplace=True)
[3]:
trips = pd.read_csv(tt.data('SERPM8-BASE2015-TRIPS'))
If we want to develop a linear regression model to predict trip generation by households, we’ll need to merge these two data tables, tabulating the number of trips taken by each household. (See the tutorial on grouping for more details on how to do this).
[4]:
hh = hh.merge(
trips.groupby(['hh_id']).size().rename('n_trips'),
left_on=['hh_id'],
right_index=True,
)
We can review what variables we now have in the hh DataFrame:
[5]:
hh.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17260 entries, 1690841 to 1726370
Data columns (total 9 columns):
home_mgra 17260 non-null int64
income 17260 non-null int64
autos 17260 non-null int64
transponder 17260 non-null int64
cdap_pattern 17260 non-null object
jtf_choice 17260 non-null int64
autotech 17260 non-null int64
tncmemb 17260 non-null int64
n_trips 17260 non-null int64
dtypes: int64(8), object(1)
memory usage: 1.3+ MB
If we suppose that the number of trips made by a household is a function of income and the number of automobiles owned, we can create an ordinary least squares regression model and find the best-fitting parameters like this:
[6]:
mod = sm.OLS(
hh.n_trips,
sm.add_constant(hh[['autos','income']])
)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: n_trips R-squared: 0.229
Model: OLS Adj. R-squared: 0.229
Method: Least Squares F-statistic: 2563.
Date: Thu, 08 Aug 2019 Prob (F-statistic): 0.00
Time: 13:45:17 Log-Likelihood: -48167.
No. Observations: 17260 AIC: 9.634e+04
Df Residuals: 17257 BIC: 9.636e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.4460 0.073 33.694 0.000 2.304 2.588
autos 2.5804 0.039 65.547 0.000 2.503 2.658
income 1.97e-06 2.79e-07 7.049 0.000 1.42e-06 2.52e-06
==============================================================================
Omnibus: 4116.620 Durbin-Watson: 1.926
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11185.853
Skew: 1.273 Prob(JB): 0.00
Kurtosis: 6.011 Cond. No. 4.14e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.14e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Note that the hh dataframe contains a variety of other columns of data, but since we’re not interested in using them for this model, they can be implicitly omitted by creating a dataframe view that includes only the variables we do want.
Also, we use sm.add_constant, which includes a constant in the regression function. By default, the statsmodels module does not include a constant in an ordinary least squares (OLS) model, so you must explicitly add one to the explanatory variables to include it.
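For reference, sm.add_constant simply prepends a column of ones (named const) to the explanatory data; the coefficient estimated for that column is the model’s intercept. A quick illustrative check (not part of the original workflow):

# add_constant prepends a 'const' column of ones alongside the
# other explanatory variables; its fitted coefficient is the intercept.
sm.add_constant(hh[['autos', 'income']]).head()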
The output of the model summary() is relatively extensive and includes a large number of statistical measures and tests that may or may not interest you. The most important of these are the coefficient estimates shown in the center panel of the report, along with the R-squared measure at the upper right.

One of the warnings at the bottom of the summary points to a large condition number. This typically arises when the explanatory variables are on very different scales, which we can confirm by comparing their standard deviations:
[7]:
sm.add_constant(hh[['autos','income']]).std()
[7]:
const 0.000000
autos 0.801841
income 112974.383573
dtype: float64
A variance this large is not a problem in statistical theory, but it can introduce numerical stability problems when a computer fits the model. To address this, we can simply rescale one or more variables so that their variances are of more consistent magnitude:
[8]:
hh['income_100k'] = hh.income / 100_000
[9]:
mod = sm.OLS(
hh.n_trips,
sm.add_constant(hh[['autos','income_100k']])
)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: n_trips R-squared: 0.229
Model: OLS Adj. R-squared: 0.229
Method: Least Squares F-statistic: 2563.
Date: Thu, 08 Aug 2019 Prob (F-statistic): 0.00
Time: 13:45:17 Log-Likelihood: -48167.
No. Observations: 17260 AIC: 9.634e+04
Df Residuals: 17257 BIC: 9.636e+04
Df Model: 2
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
const 2.4460 0.073 33.694 0.000 2.304 2.588
autos 2.5804 0.039 65.547 0.000 2.503 2.658
income_100k 0.1970 0.028 7.049 0.000 0.142 0.252
==============================================================================
Omnibus: 4116.620 Durbin-Watson: 1.926
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11185.853
Skew: 1.273 Prob(JB): 0.00
Kurtosis: 6.011 Cond. No. 6.60
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The R-squared and t-statistics of this model are the same as before, as this is in effect the same model as above. But in this revised model, the magnitude of the income coefficient is now much closer to that of the other coefficients, and the “condition number” warning is not present in the summary.
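The key quantities from the summary are also available directly on the results object, which makes this kind of comparison easy to script. For example (a small illustrative check, not part of the original workflow):

# Key results can be read directly off the fitted results object.
print(res.rsquared)               # R-squared, as shown in the summary
print(res.params['income_100k'])  # about 0.197, i.e. the raw-income
                                  # coefficient (1.97e-06) scaled up by 100,000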
OLS linear regression models are by design linear-in-parameters models, but that does not mean the explanatory data cannot first be transformed, for example by using a piecewise-linear expansion. The larch package includes a piecewise_expansion function that can expand a single column of data in a pandas.DataFrame into multiple columns based on defined breakpoints.
[10]:
from larch.util.data_expansion import piecewise_expansion
For example, to create a piecewise-linear version of household income, with breakpoints at $25,000 and $75,000, we could write:
[11]:
piecewise_expansion(hh.income, [25_000, 75_000]).head()
[11]:
| hh_id | piece(income,None,25000) | piece(income,25000,75000) | piece(income,75000,None) |
|---|---|---|---|
| 1690841 | 25000 | 50000 | 437000 |
| 1690961 | 25000 | 2500 | 0 |
| 1690866 | 25000 | 50000 | 75000 |
| 1690895 | 25000 | 50000 | 29000 |
| 1690933 | 25000 | 50000 | 20000 |
The result is three columns of data instead of one, with the first giving income up to the lower breakpoint, the next giving income between the two breakpoints, and the last giving the amount of income above the top breakpoint.
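If larch is not available, the same expansion can be built with plain pandas. Here is a minimal sketch (the function name and column labels are chosen only to mimic the output above, and are not larch’s actual implementation):

# A rough stand-in for piecewise_expansion using only pandas.
def piecewise_linear(s, breaks):
    # Each output column holds the portion of s that falls in one
    # segment defined by the breakpoints.
    lowers = [None] + list(breaks)
    uppers = list(breaks) + [None]
    out = pd.DataFrame(index=s.index)
    for lo, hi in zip(lowers, uppers):
        piece = s.clip(lower=lo, upper=hi)
        if lo is not None:
            piece = piece - lo
        out[f'piece({s.name},{lo},{hi})'] = piece
    return out

piecewise_linear(hh.income, [25_000, 75_000]).head()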
We can readily concatenate this expanded data with any other explanatory variables by using pandas.concat. Note that by default this function concatenates dataframes vertically (aligning columns and stacking rows), but in this case we want to concatenate horizontally (aligning rows and placing the columns side by side). We can achieve this by also passing axis=1 to the function, in addition to the list of dataframes to concatenate.
[12]:
hh_edited = pd.concat([
hh.autos,
piecewise_expansion(hh.income_100k, [.25, .75]),
], axis=1)
hh_edited.head()
[12]:
| hh_id | autos | piece(income_100k,None,0.25) | piece(income_100k,0.25,0.75) | piece(income_100k,0.75,None) |
|---|---|---|---|---|
| 1690841 | 2 | 0.25 | 0.500 | 4.37 |
| 1690961 | 1 | 0.25 | 0.025 | 0.00 |
| 1690866 | 2 | 0.25 | 0.500 | 0.75 |
| 1690895 | 2 | 0.25 | 0.500 | 0.29 |
| 1690933 | 2 | 0.25 | 0.500 | 0.20 |
Then we can use this modified dataframe to construct a piecewise-linear OLS regression model. Because the original and modified dataframes have the same index (i.e., the same number and order of rows), we can mix them in the OLS definition, using the n_trips column from the original as the dependent variable and the explanatory data from the modified dataframe.
[13]:
mod = sm.OLS(
hh.n_trips,
sm.add_constant(hh_edited)
)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: n_trips R-squared: 0.231
Model: OLS Adj. R-squared: 0.231
Method: Least Squares F-statistic: 1297.
Date: Thu, 08 Aug 2019 Prob (F-statistic): 0.00
Time: 13:45:17 Log-Likelihood: -48143.
No. Observations: 17260 AIC: 9.630e+04
Df Residuals: 17255 BIC: 9.633e+04
Df Model: 4
Covariance Type: nonrobust
================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------
const 1.9177 0.153 12.542 0.000 1.618 2.217
autos 2.4769 0.042 58.560 0.000 2.394 2.560
piece(income_100k,None,0.25) 2.2983 0.722 3.181 0.001 0.882 3.714
piece(income_100k,0.25,0.75) 1.0124 0.204 4.972 0.000 0.613 1.412
piece(income_100k,0.75,None) 0.0977 0.033 3.002 0.003 0.034 0.162
==============================================================================
Omnibus: 4132.769 Durbin-Watson: 1.925
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11235.615
Skew: 1.278 Prob(JB): 0.00
Kurtosis: 6.015 Cond. No. 56.0
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In addition to piecewise-linear terms in the regression equation, standard OLS allows for arbitrary non-linear transformations of the data. Students of statistics will be familiar with fitting a polynomial function with OLS coefficients, and this can be done with statsmodels, for example, by explicitly computing the desired polynomial terms before estimating the model parameters.
[14]:
hh['autos^2'] = hh['autos'] ** 2
hh['income^2'] = hh['income_100k'] ** 2
[15]:
mod = sm.OLS(
hh.n_trips,
sm.add_constant(hh[['autos','income_100k', 'autos^2', 'income^2']])
)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: n_trips R-squared: 0.230
Model: OLS Adj. R-squared: 0.229
Method: Least Squares F-statistic: 1286.
Date: Thu, 08 Aug 2019 Prob (F-statistic): 0.00
Time: 13:45:17 Log-Likelihood: -48160.
No. Observations: 17260 AIC: 9.633e+04
Df Residuals: 17255 BIC: 9.637e+04
Df Model: 4
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
const 2.6596 0.127 20.926 0.000 2.411 2.909
autos 2.2161 0.139 15.970 0.000 1.944 2.488
income_100k 0.3748 0.067 5.578 0.000 0.243 0.507
autos^2 0.0839 0.033 2.545 0.011 0.019 0.148
income^2 -0.0314 0.011 -2.788 0.005 -0.054 -0.009
==============================================================================
Omnibus: 4098.383 Durbin-Watson: 1.925
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11134.434
Skew: 1.268 Prob(JB): 0.00
Kurtosis: 6.009 Cond. No. 46.6
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alternatively, polynomial terms can be created automatically for every column in the source data, as well as for their interactions, using the PolynomialFeatures preprocessor from the sklearn package. This tool doesn’t automatically maintain the DataFrame formatting when applied (instead it outputs a plain array of values), but it is simple to write a small function that does so.
[16]:
def polynomial(x, **kwargs):
    # Compute polynomial and interaction terms for every column of x,
    # then rebuild a DataFrame with matching feature names and index.
    from sklearn.preprocessing import PolynomialFeatures
    poly = PolynomialFeatures(**kwargs)
    arr = poly.fit_transform(x)
    # Note: newer versions of sklearn replace get_feature_names
    # with get_feature_names_out.
    return pd.DataFrame(arr, columns=poly.get_feature_names(x.columns), index=x.index)
Then we can use the function to calculate polynomial terms automatically. In this example, by setting degree to 3, we not only get squared and cubed versions of the two variables, but also all of their interactions up to degree 3.
[17]:
hh_poly = polynomial(hh[['autos','income_100k']], degree=3)
hh_poly.head()
[17]:
| hh_id | 1 | autos | income_100k | autos^2 | autos income_100k | income_100k^2 | autos^3 | autos^2 income_100k | autos income_100k^2 | income_100k^3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1690841 | 1.0 | 2.0 | 5.120 | 4.0 | 10.240 | 26.214400 | 8.0 | 20.480 | 52.428800 | 134.217728 |
| 1690961 | 1.0 | 1.0 | 0.275 | 1.0 | 0.275 | 0.075625 | 1.0 | 0.275 | 0.075625 | 0.020797 |
| 1690866 | 1.0 | 2.0 | 1.500 | 4.0 | 3.000 | 2.250000 | 8.0 | 6.000 | 4.500000 | 3.375000 |
| 1690895 | 1.0 | 2.0 | 1.040 | 4.0 | 2.080 | 1.081600 | 8.0 | 4.160 | 2.163200 | 1.124864 |
| 1690933 | 1.0 | 2.0 | 0.950 | 4.0 | 1.900 | 0.902500 | 8.0 | 3.800 | 1.805000 | 0.857375 |
Great care should be taken with this automatic polynomial expansion of the data, as it is easy to end up with an overfitted model, especially when using a tool like OLS that does not attempt to self-correct to limit overfitting. (A simple check for this is sketched after the model output below.)
[18]:
mod = sm.OLS(
hh.n_trips,
hh_poly
)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: n_trips R-squared: 0.236
Model: OLS Adj. R-squared: 0.235
Method: Least Squares F-statistic: 590.5
Date: Thu, 08 Aug 2019 Prob (F-statistic): 0.00
Time: 13:45:17 Log-Likelihood: -48093.
No. Observations: 17260 AIC: 9.621e+04
Df Residuals: 17250 BIC: 9.628e+04
Df Model: 9
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
1 3.6606 0.192 19.086 0.000 3.285 4.037
autos -0.1241 0.321 -0.386 0.699 -0.754 0.506
income_100k 0.6579 0.223 2.955 0.003 0.222 1.094
autos^2 1.2151 0.178 6.828 0.000 0.866 1.564
autos income_100k 0.5617 0.182 3.088 0.002 0.205 0.918
income_100k^2 -0.3633 0.052 -6.984 0.000 -0.465 -0.261
autos^3 -0.1641 0.030 -5.515 0.000 -0.222 -0.106
autos^2 income_100k -0.1079 0.039 -2.790 0.005 -0.184 -0.032
autos income_100k^2 -0.0216 0.016 -1.337 0.181 -0.053 0.010
income_100k^3 0.0349 0.004 8.257 0.000 0.027 0.043
==============================================================================
Omnibus: 4055.002 Durbin-Watson: 1.926
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10960.315
Skew: 1.257 Prob(JB): 0.00
Kurtosis: 5.987 Cond. No. 659.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
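As a quick guard against the overfitting concern noted above, one option is to estimate the model on a random subset of the households and compare its fit on the held-out remainder. A minimal sketch (the 50/50 split and the random_state value are arbitrary choices for illustration):

# Fit the polynomial model on half the households, then compare
# in-sample R-squared against R-squared on the held-out half.
train = hh_poly.sample(frac=0.5, random_state=0)
test = hh_poly.drop(train.index)

res_half = sm.OLS(hh.n_trips.loc[train.index], train).fit()

pred = res_half.predict(test)
y_test = hh.n_trips.loc[test.index]
r2_holdout = 1 - ((y_test - pred) ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()

print(res_half.rsquared, r2_holdout)

If the held-out R-squared is substantially lower than the in-sample value, that is a sign the polynomial terms are fitting noise rather than structure.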