This vignette explains how to provide choice data for **RprobitB**, either by preparing empirical data with `prepare()` or by simulating data with `simulate()`.

As a first step, we recommend specifying the model formula.

The model formula is specified using a formula object, let’s call it `form`. The structure of `form` is `choice ~ A | B | C`, where

- `choice` is the discrete choice we aim to explain,
- `A` are alternative and choice situation specific covariates with a generic coefficient (we call them covariates of type 1),
- `B` are choice situation specific covariates with alternative specific coefficients (we call them covariates of type 2), and
- `C` are alternative and choice situation specific covariates with alternative specific coefficients (we call them covariates of type 3).

Keep the following rules in mind:

- By default, alternative specific constants are added to the model^{1}. They can be removed by adding `+0` in the second spot, e.g. `choice ~ A | B + 0`.
- To exclude covariates of the backmost categories, use either `0`, e.g. `choice ~ A | B | 0`, or just leave this part out and write `choice ~ A | B`. However, to exclude covariates of front categories, we have to use `0`, e.g. `choice ~ A | 0 | C`.
- To include more than one covariate of the same category, use `+`, e.g. `choice ~ A1 + A2 | B`.
- If we don’t want to include any covariates of the second category but we want to estimate alternative specific constants, add `1` in the second spot, e.g. `choice ~ A | 1`. The expression `choice ~ A | 0` is interpreted as no covariates of the second category and no alternative specific constants.
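To make the three-part structure concrete, the following base R sketch splits the right-hand side of such a formula at `|` and fills in `0` for parts that were left out. The helper `split_types()` is hypothetical and only illustrates the rules above; it is not part of **RprobitB** and does not reflect how the package parses formulas internally.

```r
# Hypothetical helper (not part of RprobitB): split the right-hand side of a
# formula of the form `choice ~ A | B | C` into its three covariate types.
split_types <- function(form) {
  # form[[3]] is the right-hand side; deparse may return several strings
  rhs <- paste(deparse(form[[3]]), collapse = " ")
  parts <- trimws(strsplit(rhs, "|", fixed = TRUE)[[1]])
  # omitted backmost parts are treated like an explicit `0`
  length(parts) <- 3
  parts[is.na(parts)] <- "0"
  names(parts) <- c("type1", "type2", "type3")
  parts
}

split_types(choice ~ cost | income | travel_time)
split_types(choice ~ A | B)          # third part omitted
split_types(choice ~ A1 + A2 | B + 0)
```

Because `+` binds more tightly than `|` in R, `B + 0` stays together in the second part, which matches the rule for removing alternative specific constants.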

To have random effects for specific variables, we need to define a character vector `re` of the corresponding variable names. To have random effects for the alternative specific constants, include `"ASC"` in `re`.

Say we want to explain the `choice` of transportation means by the variables `cost`, `income`, and `travel_time`. We furthermore want to add alternative specific constants.

- The `cost` of an alternative is obviously alternative specific. However, we can argue that it does not matter for which alternative we spend our money. Therefore, we want to estimate a generic coefficient for `cost`, which makes it a covariate of type 1.
- The `income` of a decision maker is constant across alternatives, but can have a different influence on the alternatives. It is therefore a covariate of type 2.
- The `travel_time` is a covariate of type 3: it is alternative specific, but in contrast to `cost`, we can imagine that spending time on public transportation is different from spending time in one’s own car.

Therefore, we specify:

`form <- choice ~ cost | income | travel_time`

We would typically expect heterogeneity in preferences regarding spending money on a transportation means, so we impose a random effect on `cost`:

`re <- "cost"`
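As a quick sanity check, one can verify in base R that every name in `re` actually appears in the formula. This is only a sketch; the variable names `covariates` and `has_asc` are introduced here for illustration, and whether **RprobitB** performs a similar check internally is not stated in this vignette.

```r
form <- choice ~ cost | income | travel_time
re <- "cost"

# all.vars() lists the variable names in order of appearance;
# the first one is the dependent variable
covariates <- all.vars(form)[-1]

# "ASC" is the reserved name for alternative specific constants,
# so it is excluded from the comparison
stopifnot(all(setdiff(re, "ASC") %in% covariates))

has_asc <- "ASC" %in% re
has_asc
```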

This section explains how to prepare empirical data for estimation using the function `prepare()`.

Say we have a data set with empirical choice data, let’s call it `choice_data`. It must meet the following requirements:

- It must be a data frame.
- It must be in wide format, i.e. each row represents one choice occasion.
- It must contain a column named `id`, which contains a unique identifier for each decision maker.
- It must contain a column named `choice`, where `choice` must match the name of the dependent variable in `form`.
- For each alternative specific covariate `p` in `form` and each choice alternative `j`, `choice_data` must contain a column named `p_j`.
- For each covariate `q` that is constant across alternatives (covariate of type 2), `choice_data` must contain a column named `q`.
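The requirements above can be illustrated with a small, made-up data frame for the transportation example with two alternatives. All values are invented for illustration, and the column check is a plain base R sketch rather than something **RprobitB** provides.

```r
alternatives <- c("car", "bus")

# Toy data in wide format: one row per choice occasion (values are made up)
choice_data <- data.frame(
  id              = c(1, 1, 2),            # decision maker identifier
  choice          = c("car", "bus", "car"),
  cost_car        = c(4.0, 4.5, 3.8),      # type 1: one column per alternative
  cost_bus        = c(2.0, 2.0, 2.5),
  income          = c(3000, 3000, 5200),   # type 2: a single column
  travel_time_car = c(20, 25, 15),         # type 3: one column per alternative
  travel_time_bus = c(35, 30, 40)
)

# Column names implied by `choice ~ cost | income | travel_time`
required <- c("id", "choice",
              paste("cost", alternatives, sep = "_"),
              "income",
              paste("travel_time", alternatives, sep = "_"))

missing_cols <- setdiff(required, names(choice_data))
missing_cols  # empty if all required columns are present
```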

To prepare `choice_data` for estimation, we call

`data <- prepare(form = form, choice_data = choice_data)`

The function `prepare()` has the following optional arguments:

- `alternatives`: We may not want to consider all alternatives in `choice_data`. In that case, we can specify a character vector `alternatives` with selected names of alternatives.
- `re`: The character vector of variable names of `form` with random effects.
- `id`: A character, the name of the column in `choice_data` that contains a unique identifier for each decision maker. The default is `"id"`.
- `standardize`: A character vector of variable names of `form` that get standardized, see below.

Let’s prepare the *Train* data set of the mlogit package for estimation. We consider the covariates `price` (type 1), `time`, `comfort` and `change` (each of type 3), where we link `price` and `time` to random effects^{2}.

```
library(RprobitB)
# load the Train data set shipped with the mlogit package
data("Train", package = "mlogit")
data <- prepare(form = choice ~ price | 0 | time + comfort + change,
                choice_data = Train,
                re = c("price", "time"))
```

This section explains how to simulate choice data using the function `simulate()`.

If we want to simulate the choices of `N` decision makers in `T` choice occasions^{3} among `J` alternatives from our model formulation `form`, we call

`data <- simulate(form = form, N = N, T = T, J = J)`

The function `simulate()` has the following optional arguments:

- `re`: The character vector of variable names of `form` with random effects.
- `alternatives`: A character vector of length `J` with the names of the choice alternatives.
- `distr`: A named list of number generation functions from which the covariates are drawn. Each element of `distr` must be of the form `"cov" = list("name" = "<name of the number generation function>", ...)`, where `cov` is the name of the covariate^{4} and `...` are required parameters for the number generation function. Covariates for which no distribution is specified are drawn from a standard normal distribution. Possible number generation functions are
  - functions of the type `r*` from base R (e.g. `rnorm`), where all required parameters (except for `n`) must be specified,
  - the function `sample`, where all required parameters (except for `size`) must be specified.
- `standardize`: A character vector of variable names of `form` that get standardized, see below.
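To see why `n` and `size` are the only parameters left out of a `distr` element, the following base R sketch turns such an element into actual draws with `do.call()`. The helper `draw_covariate()` is hypothetical and is not taken from **RprobitB**; it only shows one plausible way a specification of this form can be evaluated.

```r
distr <- list(
  "cost"   = list("name" = "rnorm", sd = 3),
  "income" = list("name" = "sample", x = (1:10) * 1e3, replace = TRUE)
)

# Hypothetical helper: draw `n` values according to one element of `distr`
draw_covariate <- function(spec, n) {
  fun  <- spec[["name"]]
  args <- spec[names(spec) != "name"]
  # `sample` takes the number of draws as `size`, the r* functions as `n`
  n_arg <- if (fun == "sample") list(size = n) else list(n = n)
  do.call(fun, c(n_arg, args))
}

set.seed(1)
draws <- lapply(distr, draw_covariate, n = 5)
str(draws)
```

The number of draws is supplied by the caller, which is why the user only specifies the remaining parameters of the generating function.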

We can specify true parameter values by adding values for

- `alpha`, the fixed coefficient vector,
- `C`, the number (greater than or equal to 1) of latent classes of decision makers,
- `s`, the vector of class weights,
- `b`, the matrix of class means as columns,
- `Omega`, the matrix of class covariance matrices as columns,
- `Sigma`, the differenced error term covariance matrix,
- `Sigma_full`, the full error term covariance matrix.

We revisit our example of the simulated choice of transportation means, where we already specified:

```
form <- choice ~ cost | income | travel_time
re <- "cost"
```

Let us now simulate the choices of `N = 100` decision makers in `T = 10` choice occasions on the `J = 3` alternatives “car”, “bus” and “train”. We want `C = 2` true latent classes and specific distributions^{5} for our covariates:

```
N <- 100   # number of decision makers
T <- 10    # choice occasions per decision maker
J <- 3     # number of alternatives
alternatives <- c("car", "bus", "train")
distr <- list("cost" = list("name" = "rnorm", sd = 3),
              "income" = list("name" = "sample", x = (1:10)*1e3, replace = TRUE),
              "travel_time_car" = list("name" = "rlnorm", meanlog = 1),
              "travel_time_bus" = list("name" = "rlnorm", meanlog = 2))
data <- simulate(form = form, N = N, T = T, J = J, re = re,
                 alternatives = alternatives, distr = distr, C = 2)
```

Both `simulate()` and `prepare()` have the optional input `standardize`, which is a character vector of names of covariates that get standardized, i.e. rescaled to mean 0 and standard deviation 1. If `standardize = "all"`, all covariates get standardized.

Covariates of type 1 or 3 have to be addressed by `covariate_alternative`, e.g. `travel_time_car`.
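For reference, standardizing a column means subtracting its mean and dividing by its standard deviation. The base R sketch below applies this to made-up data; `standardize_cols()` is a hypothetical helper for illustration, not the routine used inside **RprobitB**.

```r
# Hypothetical helper: standardize the selected columns of a data frame
standardize_cols <- function(df, cols) {
  for (col in cols) {
    df[[col]] <- (df[[col]] - mean(df[[col]])) / sd(df[[col]])
  }
  df
}

# Made-up values for one type 2 and one type 3 covariate
toy <- data.frame(income          = c(3000, 5200, 4100, 2500),
                  travel_time_car = c(20, 25, 15, 30))

toy_std <- standardize_cols(toy, c("income", "travel_time_car"))
colMeans(toy_std)        # approximately 0
sapply(toy_std, sd)      # 1 for both columns
```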

In our example of the simulated choice of transportation means, scaling `income` is reasonable and can improve model fitting. For demonstration purposes, we also standardize `travel_time` for each alternative:

```
standardize <- c("income", "travel_time_car", "travel_time_bus",
                 "travel_time_train")
# same settings as in the previous call, now with standardization
data <- simulate(form = form, N = N, T = T, J = J, re = re,
                 alternatives = alternatives, distr = distr, C = 2,
                 standardize = standardize)
```

We can check if the data preparation or simulation worked as expected using the `summary()` function. The columns `z` and `re` indicate standardized and random effect covariates, respectively. The rest of the output is self-explanatory.

`summary(data)`

^{1} Alternative specific constants can be interpreted as covariates of type 2. Due to the dummy variable trap, we cannot estimate alternative specific constants for all the alternatives. Therefore, they are added for all except for the last alternative.

^{2} Note that alternative specific constants are excluded here.

^{3} `T` can be either a positive number, representing a fixed number of choice occasions for each decision maker, or a vector of length `N`, i.e. a decision maker specific number of choice occasions.

^{4} For a covariate `cov` of type 1 or 3, the element of `distr` can be named either `cov` (to draw the covariate for all alternatives from the same distribution) or `cov_alternative` (to draw the covariate for a specific alternative from a specific distribution).

^{5} Note that the `cost` covariate for all alternatives is drawn from the same distribution. Also note that since we did not specify a distribution for `travel_time_train`, this covariate is drawn from a standard normal distribution.