Discrete choice modeling is at the heart of many transportation planning models. In this example, we will examine the development of a mode choice model for Exampville, an entirely fictional town built for the express purpose of demonstrating the use of discrete choice modeling tools for transportation planning.
In this notebook, we will walk through the creation of a tour mode choice model. This example will assume the reader is familiar with the mathematical basics of discrete choice modeling generally, and will focus on the technical aspects of estimating the parameters of a discrete choice model in Python using Larch.
[1]:
import larch, numpy, pandas, os
To begin, we’ll load some raw data. The Exampville data contains a set of files similar to what we might find for a real travel survey: network skims, and tables of households, persons, and tours.
[2]:
import larch.exampville
The skims data is a file in openmatrix format, which contains a series of two-dimensional arrays describing zone-to-zone transportation level-of-service attributes. We can see below that this file contains data on travel times and costs by auto, transit, biking, and walking, for travel to and from each of 40 travel analysis zones in Exampville. Ideally, these skim values would represent the "observed" travel times and costs for trips between each pair of zones, but generally these matrices are approximations of the real values, generated by a base-case transportation model.
[3]:
skims = larch.OMX(larch.exampville.files.skims, mode='r')
skims
[3]:
<larch.OMX> ⋯/exampville_skims.omx
| shape:(40, 40)
| data:
| AUTO_COST (float64)
| AUTO_DIST (float64)
| AUTO_TIME (float64)
| BIKE_TIME (float64)
| TRANSIT_FARE (float64)
| TRANSIT_IVTT (float64)
| TRANSIT_OVTT (float64)
| WALK_DIST (float64)
| WALK_TIME (float64)
| lookup:
| TAZ_ID (40 int64)
The other files are simple csv text files, containing row-wise data about households, persons, and tours, as might be found in the results of a household travel survey conducted in Exampville.
[4]:
hh = pandas.read_csv(larch.exampville.files.hh)
pp = pandas.read_csv(larch.exampville.files.person)
tour = pandas.read_csv(larch.exampville.files.tour)
Let’s check out what’s in each table.
[5]:
hh.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
X 5000 non-null float64
Y 5000 non-null float64
INCOME 5000 non-null int64
geometry 5000 non-null object
HOMETAZ 5000 non-null int64
HHSIZE 5000 non-null int64
HHID 5000 non-null int64
dtypes: float64(2), int64(4), object(1)
memory usage: 273.5+ KB
[6]:
pp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9146 entries, 0 to 9145
Data columns (total 8 columns):
PERSONID 9146 non-null int64
HHID 9146 non-null int64
HHIDX 9146 non-null int64
AGE 9146 non-null int64
WORKS 9146 non-null int64
N_WORK_TOURS 9146 non-null int64
N_OTHER_TOURS 9146 non-null int64
N_TOURS 9146 non-null int64
dtypes: int64(8)
memory usage: 571.7 KB
[7]:
tour.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15934 entries, 0 to 15933
Data columns (total 6 columns):
TOURID 15934 non-null int64
HHID 15934 non-null int64
PERSONID 15934 non-null int64
DTAZ 15934 non-null int64
TOURMODE 15934 non-null int64
TOURPURP 15934 non-null int64
dtypes: int64(6)
memory usage: 747.0 KB
To use these data tables for mode choice modeling, we’ll need to filter them so they only include relevant rows, and merge them into a unified composite dataset.
Mode choice models are often created separately for each tour purpose. We can review the purposes contained in our data using the statistics method, which Larch adds to pandas.Series objects:
[8]:
tour.TOURPURP.statistics()
[8]:
key | value |
---|---|
n | 15934 |
minimum | 1 |
maximum | 2 |
median | 2.0 |
histogram | (rendered as a plot) |
mean | 1.6892180243504455 |
stdev | 0.4628137198278743 |
zeros | 0 |
positives | 15934 |
negatives | 0 |
nonzero_minimum | 1 |
nonzero_maximum | 2 |
nonzero_mean | 1.6892180243504455 |
nonzero_stdev | 0.4628137198278743 |
As we can see above, in Exampville, there are only two purposes for tours. These purposes are defined as: 1 for work tours, and 2 for all other (non-work) tours.
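As a quick cross-check (an illustrative aside, not part of the original workflow), the same breakdown is available with plain pandas:

tour.TOURPURP.value_counts()  # counts of tours by purpose code

Given the mean of about 1.689 over 15,934 tours shown above, this works out to 4,952 work tours and 10,982 non-work tours, consistent with the 4,952 cases we see after filtering below.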
We want to first estimate a mode choice model for work tours, so we’ll begin by creating a working dataframe, filtering the tours data to exclude non-work tours:
[9]:
df = tour[tour.TOURPURP == 1]
We can then merge data from the three survey tables using the usual pandas syntax for merging.
[10]:
df = df.merge(hh, on='HHID').merge(pp, on=('HHID', 'PERSONID'))
Merging the skims data is more complicated, as we want to select not only the correct row, but also the correct column for each observation. We could do this by transforming the skims data in such a way that every origin-destination pair is on its own row, but for very large zone systems this can be inefficient. Larch provides a more efficient method to directly extract a DataFrame with the right information, based on the two-dimensional structure of the skims.
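For illustration, here is a minimal sketch of that row-wise alternative, assuming a hypothetical auto_time variable holding one 40×40 skim matrix as a numpy array (not a variable defined in this notebook):

import numpy
import pandas

n_zones = 40
# enumerate every origin-destination pair as one row
otaz, dtaz = numpy.meshgrid(numpy.arange(n_zones), numpy.arange(n_zones), indexing='ij')
skims_long = pandas.DataFrame({
    'OTAZi': otaz.ravel(),           # zero-based origin zone index
    'DTAZi': dtaz.ravel(),           # zero-based destination zone index
    'AUTO_TIME': auto_time.ravel(),  # hypothetical 40x40 skim matrix
})
# ...then merge skims_long onto the tours table by origin and destination zone.

With 40 zones this table has only 1,600 rows, but it grows with the square of the zone count (and multiplies across skim tables), which is why the direct two-dimensional lookup shown below is preferred.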
Our zone numbering system starts with zone 1, as is common for many TAZ numbering systems seen in practice. But for looking up data in the skims matrix using Larch, we'll need to use the zero-based numbering that is standard in Python. So we'll create two new TAZ-index columns to assist with this process.
[11]:
df["HOMETAZi"] = df["HOMETAZ"] - 1
df["DTAZi"] = df["DTAZ"] - 1
For this tour mode choice model, we can pick values out of the skims for the known O-D of the tour, so that we have access to the level-of-service data for each possible mode serving that O-D pair. We'll use the get_rc_dataframe method of the larch.OMX object, which lets us give the row (production) and column (attraction) indexes we want to look up.
[12]:
df = df.join(
    skims.get_rc_dataframe(
        df["HOMETAZi"], df["DTAZi"],
    )
)
We can review the df frame to see what variables are now included.
[13]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4952 entries, 0 to 4951
Data columns (total 29 columns):
TOURID 4952 non-null int64
HHID 4952 non-null int64
PERSONID 4952 non-null int64
DTAZ 4952 non-null int64
TOURMODE 4952 non-null int64
TOURPURP 4952 non-null int64
X 4952 non-null float64
Y 4952 non-null float64
INCOME 4952 non-null int64
geometry 4952 non-null object
HOMETAZ 4952 non-null int64
HHSIZE 4952 non-null int64
HHIDX 4952 non-null int64
AGE 4952 non-null int64
WORKS 4952 non-null int64
N_WORK_TOURS 4952 non-null int64
N_OTHER_TOURS 4952 non-null int64
N_TOURS 4952 non-null int64
HOMETAZi 4952 non-null int64
DTAZi 4952 non-null int64
AUTO_COST 4952 non-null float64
AUTO_DIST 4952 non-null float64
AUTO_TIME 4952 non-null float64
BIKE_TIME 4952 non-null float64
TRANSIT_FARE 4952 non-null float64
TRANSIT_IVTT 4952 non-null float64
TRANSIT_OVTT 4952 non-null float64
WALK_DIST 4952 non-null float64
WALK_TIME 4952 non-null float64
dtypes: float64(11), int64(17), object(1)
memory usage: 1.3+ MB
Now we can assemble this data, along with the definitions of the alternatives, into a larch.DataFrames object. This object links information about the raw data and the choices into one data structure. It can handle multiple data formats (including idco and idca format data simultaneously), but here we'll only be using a single idco format data table.
[14]:
# For clarity, we can define numbers as names for modes
DA = 1
SR = 2
Walk = 3
Bike = 4
Transit = 5
[15]:
dfs = larch.DataFrames(
    co=df,
    alt_codes=[DA, SR, Walk, Bike, Transit],
    alt_names=['DA', 'SR', 'Walk', 'Bike', 'Transit'],
    ch_name='TOURMODE',
)
Now we are ready to create our model. We'll create a larch.Model object to do so, and link it to the data we just created.
[16]:
m = larch.Model(dataservice=dfs)
m.title = "Exampville Work Tour Mode Choice v1"
We will explicitly define the set of utility functions we want to use. Because the DataFrames we are using to serve data to this model contains exclusively idco format data, we'll use the utility_co mapping, which allows us to define a unique utility function for each alternative.
Each utility function must be expressed as a linear-in-parameters function to combine raw or pre-computed data values with named model parameters. To facilitate writing these functions, Larch provides two special classes: parameter references (P) and data references (X).
[17]:
from larch import P, X
Parameter and data references can be defined using either a function-like notation or an attribute-like notation.
[18]:
P('NamedParameter')
[18]:
P.NamedParameter
[19]:
X.NamedDataValue
[19]:
X.NamedDataValue
In either case, if the named value contains any spaces or non-alphanumeric characters, it must be given in function-like notation only, as Python will not accept those characters in the attribute-like form.
[20]:
P('Named Parameter')
[20]:
P('Named Parameter')
Data references can name an exact column that appears in the DataFrames as defined above, or can include simple transformations of that data, so long as these transformations can be done without regard to any estimated parameter. For example, we can use the log of income:
[21]:
X("log(INCOME)")
[21]:
X('log(INCOME)')
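Other simple transformations that depend only on the data work the same way; for example (illustrative expressions, not ones used in this model):

X("INCOME / 10000")   # rescaling
X("AUTO_DIST ** 2")   # a polynomial term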
To write a linear-in-parameters utility function, we simply multiply together a parameter reference and a data reference, and then optionally add that to one or more similar terms.
[22]:
P.InVehTime * X.AUTO_TIME + P.Cost * X.AUTO_COST
[22]:
P.InVehTime * X.AUTO_TIME + P.Cost * X.AUTO_COST
It is permissible to omit the data reference on a term (in which case it is implicitly set to 1.0).
[23]:
P.ASC + P.InVehTime * X.AUTO_TIME + P.Cost * X.AUTO_COST
[23]:
P.ASC + P.InVehTime * X.AUTO_TIME + P.Cost * X.AUTO_COST
We can then combine these to write utility functions for each alternative in the Exampville data:
[24]:
m.utility_co[DA] = (
    + P.InVehTime * X.AUTO_TIME
    + P.Cost * X.AUTO_COST  # dollars per mile
)
m.utility_co[SR] = (
    + P.ASC_SR
    + P.InVehTime * X.AUTO_TIME
    + P.Cost * (X.AUTO_COST * 0.5)  # dollars per mile, half share
    + P("LogIncome:SR") * X("log(INCOME)")
)
m.utility_co[Walk] = (
    + P.ASC_Walk
    + P.NonMotorTime * X.WALK_TIME
    + P("LogIncome:Walk") * X("log(INCOME)")
)
m.utility_co[Bike] = (
    + P.ASC_Bike
    + P.NonMotorTime * X.BIKE_TIME
    + P("LogIncome:Bike") * X("log(INCOME)")
)
m.utility_co[Transit] = (
    + P.ASC_Transit
    + P.InVehTime * X.TRANSIT_IVTT
    + P.OutVehTime * X.TRANSIT_OVTT
    + P.Cost * X.TRANSIT_FARE
    + P("LogIncome:Transit") * X("log(INCOME)")
)
To write a nested logit model, we'll attach some nesting nodes to the model's graph. Each call to new_node allows us to define the set of codes for the child nodes (elemental alternatives, or lower-level nests), as well as giving the new nest a name and assigning a logsum parameter. The return value of this method is the node code for the newly created nest, which can then potentially be used as a child code when creating a higher-level nest. We do this below, adding the 'Car' nest into the 'Motor' nest.
[25]:
Car = m.graph.new_node(parameter='Mu:Car', children=[DA, SR], name='Car')
NonMotor = m.graph.new_node(parameter='Mu:NonMotor', children=[Walk, Bike], name='NonMotor')
Motor = m.graph.new_node(parameter='Mu:Motor', children=[Car, Transit], name='Motor')
Let’s visually check on the nesting structure.
[26]:
m.graph
[26]:
(figure: the nesting tree diagram, showing the Car, NonMotor, and Motor nests above the five elemental alternatives)
The tour mode choice model's choice variable is indicated by the code value in 'TOURMODE', and this can be defined for the model using choice_co_code.
[27]:
m.choice_co_code = 'TOURMODE'
We can also give a dictionary of availability conditions based on values in the idco data, using the availability_co_vars attribute. Alternatives that are always available can be indicated by setting the criterion to 1. For alternatives that are only sometimes available, we can give an availability criterion in the same manner as a data reference described above: either the name of a variable in the data, or an expression that can be evaluated using the data alone. Availability criteria are transformed to boolean (true/false) values, so data that evaluates as 0 will be unavailable, and data that evaluates as non-zero will be available (including, perhaps counterintuitively, negative numbers).
[28]:
m.availability_co_vars = {
    DA: 'AGE >= 16',
    SR: 1,
    Walk: 'WALK_TIME < 60',
    Bike: 'BIKE_TIME < 60',
    Transit: 'TRANSIT_FARE > 0',
}
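As an illustrative aside, you can preview how often such a criterion holds by evaluating the same expression directly with pandas:

# share of work tours for which walking is available under the 60-minute cutoff
df.eval('WALK_TIME < 60').mean()

This should agree with the availability counts reported by choice_avail_summary below (2789 of 4952 cases, about 56%).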
Then let's prepare this data for estimation. Even though the data is already in memory, the load_data method is used to pre-process the data, extracting the required values, pre-computing the values of fixed expressions, and assembling the results into contiguous arrays suitable for computing the log likelihood values efficiently.
[29]:
m.load_data()
We can check on some important statistics of this loaded and preprocessed data even before we estimate the model.
[30]:
m.dataframes.choice_avail_summary()
[30]:
| | name | chosen | available |
|---|---|---|---|
| 1 | DA | 3984.0 | 4952 |
| 2 | SR | 570.0 | 4952 |
| 3 | Walk | 114.0 | 2789 |
| 4 | Bike | 47.0 | 4952 |
| 5 | Transit | 237.0 | 2651 |
| < Total All Alternatives > | | 4952.0 | |
[31]:
m.dataframes.data_co.statistics()
[31]:
(The histogram column, rendered as inline plots in the notebook, is omitted here.)

| | n | minimum | maximum | median | mean | stdev | zeros | positives | negatives | nonzero_minimum | nonzero_maximum | nonzero_mean | nonzero_stdev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUTO_COST | 4952 | 0.194926 | 4.30796 | 0.991167 | 1.19767 | 0.746932 | 0 | 4952 | 0 | 0.194926 | 4.30796 | 1.19767 | 0.746932 |
| AUTO_COST*(0.5) | 4952 | 0.0974628 | 2.15398 | 0.495584 | 0.598834 | 0.373466 | 0 | 4952 | 0 | 0.0974628 | 2.15398 | 0.598834 | 0.373466 |
| AUTO_TIME | 4952 | 0.930008 | 30.8023 | 7.55075 | 8.24245 | 4.60459 | 0 | 4952 | 0 | 0.930008 | 30.8023 | 8.24245 | 4.60459 |
| BIKE_TIME | 4952 | 2.78465 | 52.2321 | 13.3696 | 15.8609 | 9.07942 | 0 | 4952 | 0 | 2.78465 | 52.2321 | 15.8609 | 9.07942 |
| TRANSIT_FARE | 4952 | 0 | 2.5 | 2.5 | 1.33835 | 1.24687 | 2301 | 2651 | 0 | 2.5 | 2.5 | 2.5 | 0 |
| TRANSIT_IVTT | 4952 | 0 | 12.1668 | 0.959322 | 2.51945 | 3.35336 | 2301 | 2651 | 0 | 0.959322 | 12.1668 | 4.70626 | 3.27318 |
| TRANSIT_OVTT | 4952 | 0.759059 | 127.031 | 39.2568 | 39.2398 | 24.0574 | 0 | 4952 | 0 | 0.759059 | 127.031 | 39.2398 | 24.0574 |
| WALK_TIME | 4952 | 11.1386 | 208.928 | 53.4784 | 63.4435 | 36.3177 | 0 | 4952 | 0 | 11.1386 | 208.928 | 63.4435 | 36.3177 |
| log(INCOME) | 4952 | 7.0775 | 14.8527 | 10.4983 | 10.4794 | 1.16767 | 0 | 4952 | 0 | 7.0775 | 14.8527 | 10.4794 | 1.16767 |
If we are satisfied with the statistics we see above, we can go ahead and estimate the model. Estimation is done using maximum likelihood techniques, relying on the scipy.optimize library to provide a variety of algorithms for solving this non-linear optimization problem. For nested logit models, the 'SLSQP' method often works well.
[32]:
result = m.maximize_loglike(method='slsqp')
LL = -2384.5719677817524
| | value | initvalue | nullvalue | minimum | maximum | holdfast | note | best |
|---|---|---|---|---|---|---|---|---|
| ASC_Bike | 0.059887 | 0.0 | 0.0 | -inf | inf | 0 | | 0.059887 |
| ASC_SR | 1.162299 | 0.0 | 0.0 | -inf | inf | 0 | | 1.162299 |
| ASC_Transit | 4.565357 | 0.0 | 0.0 | -inf | inf | 0 | | 4.565357 |
| ASC_Walk | 7.365430 | 0.0 | 0.0 | -inf | inf | 0 | | 7.365430 |
| Cost | -0.376209 | 0.0 | 0.0 | -inf | inf | 0 | | -0.376209 |
| InVehTime | -0.090880 | 0.0 | 0.0 | -inf | inf | 0 | | -0.090880 |
| LogIncome:Bike | -0.243509 | 0.0 | 0.0 | -inf | inf | 0 | | -0.243509 |
| LogIncome:SR | -0.267938 | 0.0 | 0.0 | -inf | inf | 0 | | -0.267938 |
| LogIncome:Transit | -0.318997 | 0.0 | 0.0 | -inf | inf | 0 | | -0.318997 |
| LogIncome:Walk | -0.468722 | 0.0 | 0.0 | -inf | inf | 0 | | -0.468722 |
| Mu:Car | 0.700454 | 1.0 | 1.0 | 0.001 | 1.0 | 0 | | 0.700454 |
| Mu:Motor | 0.691102 | 1.0 | 1.0 | 0.001 | 1.0 | 0 | | 0.691102 |
| Mu:NonMotor | 0.749037 | 1.0 | 1.0 | 0.001 | 1.0 | 0 | | 0.749037 |
| NonMotorTime | -0.235496 | 0.0 | 0.0 | -inf | inf | 0 | | -0.235496 |
| OutVehTime | -0.248537 | 0.0 | 0.0 | -inf | inf | 0 | | -0.248537 |
After we find the best-fitting parameters, we can compute some variance-covariance statistics, including standard errors of the estimates and t statistics, using calculate_parameter_covariance.
[33]:
m.calculate_parameter_covariance()
Then we can review the results in a variety of report tables.
[34]:
m.parameter_summary()
[34]:
| Parameter | | Value | Std Err | t Stat | Null Value |
|---|---|---|---|---|---|
| ASC_Bike | | 0.05989 | 1.25 | 0.05 | 0.0 |
| ASC_SR | | 1.162 | 0.784 | 1.48 | 0.0 |
| ASC_Transit | | 4.565 | 1.98 | 2.31 | 0.0 |
| ASC_Walk | | 7.365 | 1.17 | 6.29 | 0.0 |
| Cost | | -0.3762 | 0.206 | -1.83 | 0.0 |
| InVehTime | | -0.09088 | 0.0388 | -2.34 | 0.0 |
| LogIncome | Bike | -0.2435 | 0.120 | -2.02 | 0.0 |
| LogIncome | SR | -0.2679 | 0.164 | -1.64 | 0.0 |
| LogIncome | Transit | -0.319 | 0.143 | -2.23 | 0.0 |
| LogIncome | Walk | -0.4687 | 0.100 | -4.67 | 0.0 |
| Mu | Car | 0.7005 | 0.418 | -0.72 | 1.0 |
| Mu | Motor | 0.6911 | 0.269 | -1.15 | 1.0 |
| Mu | NonMotor | 0.749 | 0.129 | -1.95 | 1.0 |
| NonMotorTime | | -0.2355 | 0.0215 | -10.97 | 0.0 |
| OutVehTime | | -0.2485 | 0.0979 | -2.54 | 0.0 |
[35]:
m.estimation_statistics()
[35]:
| Statistic | Aggregate | Per Case |
|---|---|---|
| Number of Cases | 4952 | |
| Log Likelihood at Convergence | -2384.57 | -0.48 |
| Log Likelihood at Null Parameters | -6958.98 | -1.41 |
| Rho Squared w.r.t. Null Parameters | 0.657 | |
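The rho-squared value reported here follows directly from the two log likelihoods above, as one minus the ratio of the log likelihood at convergence to the log likelihood at null parameters; a quick arithmetic check:

# rho-squared w.r.t. null parameters
1 - (-2384.57) / (-6958.98)  # approximately 0.657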
If we are satisfied with this model, or if we just want to record it as part of our workflow while exploring different model structures, we can write the model out to a report. To do so, we can instantiate a larch.Reporter object.
[36]:
report = larch.Reporter(title=m.title)
Then, we can push section headings and report pieces into the report using the “<<” operator.
[37]:
report << '# Parameter Summary' << m.parameter_summary()
[37]:
| Parameter | | Value | Std Err | t Stat | Null Value |
|---|---|---|---|---|---|
| ASC_Bike | | 0.05989 | 1.25 | 0.05 | 0.0 |
| ASC_SR | | 1.162 | 0.784 | 1.48 | 0.0 |
| ASC_Transit | | 4.565 | 1.98 | 2.31 | 0.0 |
| ASC_Walk | | 7.365 | 1.17 | 6.29 | 0.0 |
| Cost | | -0.3762 | 0.206 | -1.83 | 0.0 |
| InVehTime | | -0.09088 | 0.0388 | -2.34 | 0.0 |
| LogIncome | Bike | -0.2435 | 0.120 | -2.02 | 0.0 |
| LogIncome | SR | -0.2679 | 0.164 | -1.64 | 0.0 |
| LogIncome | Transit | -0.319 | 0.143 | -2.23 | 0.0 |
| LogIncome | Walk | -0.4687 | 0.100 | -4.67 | 0.0 |
| Mu | Car | 0.7005 | 0.418 | -0.72 | 1.0 |
| Mu | Motor | 0.6911 | 0.269 | -1.15 | 1.0 |
| Mu | NonMotor | 0.749 | 0.129 | -1.95 | 1.0 |
| NonMotorTime | | -0.2355 | 0.0215 | -10.97 | 0.0 |
| OutVehTime | | -0.2485 | 0.0979 | -2.54 | 0.0 |
[38]:
report << "# Estimation Statistics" << m.estimation_statistics()
[38]:
(the estimation statistics table, as shown in the output of cell [35] above)
[39]:
report << "# Utility Functions" << m.utility_functions()
[39]:
alt | formula |
---|---|
1 | + P.InVehTime * X.AUTO_TIME + P.Cost * X.AUTO_COST |
2 | + P.ASC_SR + P.InVehTime * X.AUTO_TIME + P.Cost * X('AUTO_COST*(0.5)') + P('LogIncome:SR') * X('log(INCOME)') |
3 | + P.ASC_Walk + P.NonMotorTime * X.WALK_TIME + P('LogIncome:Walk') * X('log(INCOME)') |
4 | + P.ASC_Bike + P.NonMotorTime * X.BIKE_TIME + P('LogIncome:Bike') * X('log(INCOME)') |
5 | + P.ASC_Transit + P.InVehTime * X.TRANSIT_IVTT + P.OutVehTime * X.TRANSIT_OVTT + P.Cost * X.TRANSIT_FARE + P('LogIncome:Transit') * X('log(INCOME)') |
Once we have assembled the report, we can save the file to disk as an HTML report containing the content previously assembled. Attaching the model itself to the report as metadata can be done within the save method, which will allow us to directly reload the same model again later.
[40]:
report.save(
    '/tmp/exampville_mode_choice.html',
    overwrite=True,
    metadata=m,
)
[40]:
'/tmp/exampville_mode_choice.html'
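A minimal sketch of reloading the model later from the saved metadata; this assumes the larch.read_metadata helper (check the documentation for the Larch version you are using):

# recover the model attached as metadata to the HTML report
# (read_metadata availability may vary by Larch version)
m2 = larch.read_metadata('/tmp/exampville_mode_choice.html')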