# Data management

#### 2021-11-12

This vignette explains how to provide choice data for RprobitB via the functions prepare() (for empirical data) and simulate() (for simulated data).

As a first step, we recommend specifying the model formula.

## Specify the model formula

The model formula is specified using a formula object, let’s call it form.

The structure of form is choice ~ A | B | C, where

• choice is the discrete choice we aim to explain,

• A are alternative and choice situation specific covariates with a generic coefficient (we call them covariates of type 1),

• B are choice situation specific covariates with alternative specific coefficients (we call them covariates of type 2),

• and C are alternative and choice situation specific covariates with alternative specific coefficients (we call them covariates of type 3).

Keep the following rules in mind:

• By default, alternative specific constants are added to the model (see footnote 1). They can be removed by adding + 0 in the second spot, e.g. choice ~ A | B + 0.

• To exclude covariates of the last categories, either use 0, e.g. choice ~ A | B | 0, or leave this part out and write choice ~ A | B. To exclude covariates of earlier categories, however, we have to use 0, e.g. choice ~ A | 0 | C.

• To include more than one covariate of the same category, use +, e.g. choice ~ A1 + A2 | B.

• If we don’t want to include any covariates of the second category but we want to estimate alternative specific constants, add 1 in the second spot, e.g. choice ~ A | 1. The expression choice ~ A | 0 is interpreted as no covariates of the second category and no alternative specific constants.

To have random effects for specific variables, we need to define a character vector re of the corresponding variable names. To have random effects for the alternative specific constants, include "ASC" in re.
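To make the three-part structure concrete, the following base R sketch splits the right-hand side of such a formula at each | and returns the parts as strings. The helper split_rhs() is hypothetical and written only for illustration; it is not part of RprobitB and does not claim to mirror how the package parses form internally.

```r
# Split the right-hand side of a formula at "|" into its parts (base R only).
# `split_rhs` is a hypothetical helper for illustration, not an RprobitB function.
split_rhs <- function(form) {
  parts <- list()
  rhs <- form[[3]]
  # `A | B | C` parses as `(A | B) | C`, so peel the parts off from the right
  while (is.call(rhs) && identical(rhs[[1]], as.name("|"))) {
    parts <- c(list(rhs[[3]]), parts)
    rhs <- rhs[[2]]
  }
  parts <- c(list(rhs), parts)
  vapply(parts, function(p) paste(deparse(p), collapse = " "), character(1))
}

parts_full  <- split_rhs(choice ~ A1 + A2 | B + 0 | C)  # three parts
parts_short <- split_rhs(choice ~ A | 1)                # only two parts given
```

The first call yields one string per covariate type, while the second shows that omitted trailing parts are simply absent from the formula object.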

### Example: Simulated choice of transportation means

Say we want to explain the choice of transportation means by the variables cost, income, and travel_time. We furthermore want to add alternative specific constants.

• The cost of an alternative is obviously alternative specific. However, we can argue that it does not matter for which alternative we spend our money. Therefore, we want to estimate a generic coefficient for cost.

• The income of a decision maker is constant across alternatives, but can have a different influence on the alternatives. It is therefore a covariate of type 2.

• The travel_time is a covariate of type 3: It is alternative specific, but in contrast to the cost, we can imagine that spending time on public transportation is different from spending time in one’s own car.

Therefore, we specify:

form = choice ~ cost | income | travel_time
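Written out, this formula corresponds to a utility of roughly the following form for decision maker n, choice occasion t and alternative j. The symbols used here are notation assumed for this illustration and are not taken from the package documentation:

```latex
U_{ntj} = \alpha_j + \beta \cdot \text{cost}_{ntj} + \gamma_j \cdot \text{income}_{nt} + \delta_j \cdot \text{travel\_time}_{ntj} + \varepsilon_{ntj}
```

Here \alpha_j are the alternative specific constants, \beta is the generic coefficient of the type 1 covariate, \gamma_j are the alternative specific coefficients of the type 2 covariate (which carries no alternative index j), and \delta_j are the alternative specific coefficients of the type 3 covariate.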

We would typically expect heterogeneity in preferences regarding spending money on a transportation means, so we impose a random effect on cost:

re = "cost"

## Empirical data

This section explains how to prepare empirical data for estimation using the function prepare().

Say we have a data set with empirical choice data, let’s call it choice_data. It must meet the following requirements:

1. It must be a data frame.

2. It must be in wide format, i.e. each row represents one choice occasion.

3. It must contain a column named id, which contains a unique identifier for each decision maker.

4. It must contain a column named choice, where choice must match the name of the dependent variable in form.

5. For each alternative specific covariate p in form and each choice alternative j, choice_data must contain a column named p_j.

6. For each covariate q that is constant across alternatives (covariate of type 2), choice_data must contain a column named q.
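As a sketch of what these requirements imply, the following base R code builds a small made-up data frame for the formula choice ~ cost | income | travel_time with the two alternatives "car" and "bus", and checks that all expected columns are present. All values are invented for illustration, and the column check is a hypothetical convenience, not something prepare() requires us to run.

```r
# Hypothetical toy data for form = choice ~ cost | income | travel_time
alternatives <- c("car", "bus")
choice_data <- data.frame(
  id              = c(1, 1, 2, 2),              # decision maker identifier
  choice          = c("car", "bus", "bus", "car"),
  cost_car        = c(5.0, 5.5, 4.0, 4.5),      # type 1: one column per alternative
  cost_bus        = c(2.0, 2.0, 2.5, 2.5),
  income          = c(3000, 3000, 5000, 5000),  # type 2: a single column
  travel_time_car = c(20, 25, 30, 30),          # type 3: one column per alternative
  travel_time_bus = c(35, 35, 45, 50)
)

# Column names that the requirements imply for this formula and these alternatives
required <- c("id", "choice", "income",
              paste0("cost_", alternatives),
              paste0("travel_time_", alternatives))
missing_cols <- setdiff(required, names(choice_data))
```

Each row is one choice occasion, so the two rows with id equal to 1 are two choice occasions of the same decision maker.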

To prepare choice_data for estimation, we must call

data = prepare(form = form, choice_data = choice_data)

The function prepare() has the following optional arguments:

• alternatives: We may not want to consider all alternatives in choice_data. In that case, we can specify a character vector alternatives with selected names of alternatives.

• re: The character vector of variable names of form with random effects.

• id: A character, the name of the column in choice_data that contains a unique identifier for each decision maker. The default is "id".

• standardize: A character vector of variable names of form that get standardized, see below.

### Example: “Train” data set of the mlogit package

Let’s prepare the Train data set of the mlogit package for estimation. We consider the covariates price (type 1), time, comfort and change (each of type 3), where we link price and time to random effects (see footnote 2).

data("Train", package = "mlogit")
data = prepare(form = choice ~ price | 0 | time + comfort + change,
               choice_data = Train,
               re = c("price", "time"))

## Simulated data

This section explains how to simulate choice data using the function simulate().

If we want to simulate the choices of N decision makers in T choice occasions (see footnote 3) among J alternatives from our model formulation form, we have to call

data = simulate(form = form, N = N, T = T, J = J)

The function simulate() has the following optional arguments:

• re: The character vector of variable names of form with random effects.

• alternatives: A character vector with the names of the choice alternatives with length J.

• distr: A named list of number generation functions from which the covariates are drawn. Each element of distr must be of the form "cov" = list("name" = "<name of the number generation function>", ...), where cov is the name of the covariate (see footnote 4) and ... are required parameters for the number generation function. Covariates for which no distribution is specified are drawn from a standard normal distribution. Possible number generation functions are

• functions of the type r* from base R (e.g. rnorm) where all required parameters (except for n) must be specified,

• the function sample, where all required parameters (except for size) must be specified.

• standardize: A character vector of variable names of form that get standardized, see below.
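To illustrate the structure of a distr element, the following base R sketch turns such a specification into actual draws via do.call(). The helper draw_covariate() is hypothetical and only mimics the idea described above (the user supplies everything except the number of draws); it makes no claim about how simulate() is implemented.

```r
# `draw_covariate` is a hypothetical helper for illustration, not an RprobitB function.
draw_covariate <- function(spec, n) {
  fun  <- spec[["name"]]
  args <- spec[names(spec) != "name"]
  # r* functions take the number of draws as `n`, sample() takes it as `size`
  size_arg <- if (fun == "sample") list(size = n) else list(n = n)
  do.call(fun, c(size_arg, args))
}

set.seed(1)
cost   <- draw_covariate(list("name" = "rnorm", sd = 3), n = 5)
income <- draw_covariate(list("name" = "sample", x = (1:10) * 1e3, replace = TRUE),
                         n = 5)
```

This also shows why n and size are the only parameters left out of the specification: they are determined by the number of observations to simulate.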

We can specify true parameter values by adding values for

• alpha, the fixed coefficient vector,

• C, the number (greater than or equal to 1) of latent classes of decision makers,

• s, the vector of class weights,

• b, the matrix of class means as columns,

• Omega, the matrix of class covariance matrices as columns,

• Sigma, the differenced error term covariance matrix,

• Sigma_full, the full error term covariance matrix.
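As an illustration of the "as columns" layout, the following base R sketch stores class means and class covariance matrices for two classes with one column per class. The numbers, the choice of two random effects, and the flattening of each covariance matrix via as.vector() are assumptions made for this sketch; the package documentation is the reference for the exact layout that simulate() expects.

```r
# Hypothetical values for C = 2 latent classes and two random effects
s <- c(0.7, 0.3)                                 # class weights, summing to 1
b <- cbind(c(-1, 2), c(1, -2))                   # class means, one column per class
Omega_1 <- matrix(c(1.0, 0.3, 0.3, 0.5), 2, 2)   # covariance matrix of class 1
Omega_2 <- matrix(c(2.0, 0.0, 0.0, 1.0), 2, 2)   # covariance matrix of class 2

# Flatten each covariance matrix into one column (assumed layout)
Omega <- cbind(as.vector(Omega_1), as.vector(Omega_2))

# Recover the covariance matrix of class 2 from its column
Omega_2_back <- matrix(Omega[, 2], 2, 2)
```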

### Example: Simulated choice of transportation means

form = choice ~ cost | income | travel_time
re = "cost"

Let us now simulate the choices of N = 100 decision makers in T = 10 choice occasions on the J = 3 alternatives “car”, “bus” and “train”. We want C = 2 true latent classes and specific distributions (see footnote 5) for our covariates:

N = 100
T = 10
J = 3
alternatives = c("car", "bus", "train")
distr = list("cost" = list("name" = "rnorm", sd = 3),
             "income" = list("name" = "sample", x = (1:10)*1e3, replace = TRUE),
             "travel_time_car" = list("name" = "rlnorm", meanlog = 1),
             "travel_time_bus" = list("name" = "rlnorm", meanlog = 2))
data = simulate(form = form, N = N, T = T, J = J, re = re,
                alternatives = alternatives, distr = distr, C = 2)

## Standardize covariates

Both simulate() and prepare() have the optional input standardize, which is a character vector of names of covariates that get standardized, i.e. normalized to mean 0 and standard deviation 1. If standardize = "all", all covariates get standardized.

Covariates of type 1 or 3 have to be addressed by covariate_alternative.
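For reference, this is what standardizing a single covariate amounts to in base R. The income values are simulated here purely for illustration; the snippet is a sketch of the transformation itself, not of the package internals.

```r
set.seed(1)
# Hypothetical income values, drawn as in the simulation example
income <- sample((1:10) * 1e3, size = 100, replace = TRUE)

# Standardize: subtract the mean, divide by the standard deviation
income_z <- (income - mean(income)) / sd(income)

# Base R's scale() performs the same transformation
income_z_scale <- as.vector(scale(income))
```

After this transformation the covariate is measured in standard deviations from its mean, which should be kept in mind when interpreting the corresponding coefficient.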

### Example: Simulated choice of transportation means

In our example of the simulated choice of transportation means, scaling the income is reasonable and can improve model fitting. For demonstration purposes, we also standardize travel_time for each alternative:

standardize = c("income", "travel_time_car", "travel_time_bus",
                "travel_time_train")
data = simulate(form = form, N = N, T = T, J = J, re = re,
                alternatives = alternatives, distr = distr, C = 2,
                standardize = standardize)

## Data summary

We can check if the data preparation or simulation worked as expected using the summary() function. The columns z and re indicate standardized and random effect covariates, respectively. The rest of the output is self-explanatory.

summary(data)

1. Alternative specific constants can be interpreted as covariates of type 2. Due to the dummy variable trap, we cannot estimate alternative specific constants for all the alternatives. Therefore, they are added for all except for the last alternative.

2. Note that alternative specific constants are excluded here.

3. T can be either a positive number, representing a fixed number of choice occasions for each decision maker, or a vector of length N, i.e. a decision maker specific number of choice occasions.

4. For a covariate cov of type 1 or 3, the element of distr can be named either cov (to draw the covariate for all alternatives from the same distribution) or cov_alternative (to draw the covariate for a specific alternative from a specific distribution).

5. Note that the cost covariate for all alternatives is drawn from the same distribution. Also note that since we did not specify a distribution for travel_time_train, this covariate is drawn from a standard normal distribution.