# Pre-normalizing a dependent variable using lcmm

Functions lcmm, multlcmm and jointlcmm handle dependent variables that are not necessarily Gaussian. These functions rely on the simultaneous normalization of the variable and estimation of the regression parameters using parameterized link functions (argument “link=”).

However in some cases, one may want to pre-normalize once for all a dependent variable so that standard methods for Gaussian outcomes can then be used without caution.

The methodology has been fully described and validated for MMSE in Philipps et al. (2014) (see https://doi.org/10.1159/000365637 )

We describe here how this can be done using CES-D example.

# CES-D example

CES-D is the scale of depressive symptomatology in the Paquid dataset made of 20 items. Its sumscore is extremely skewed with a large proportion of small values:

summary(paquid$CESD) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 0.000 2.000 6.000 8.488 12.000 52.000 146 hist(paquid$CESD, breaks=50)

# Normalizing a variable with lcmm

The first step is to normalize the variable by estimating a latent process mixed model. This model should roughly fit the data but does not need to be the perfect model or the exact same model as planned for the future complete analysis. One possibility is to define an “empty model” for the covariates (but not the time functions and random effects) :

#We recenter and scale the time variable "age" in order to avoid numerical problems
paquid$age65 <- (paquid$age-65)/10
mpreH <- lcmm(CESD ~ age65 + I(age65^2), random = ~ age65 + I(age65^2), subject = 'ID', data=paquid, link = '5-quant-splines')

Here a splines link function with 5 knots placed at the quantiles is used.

The variable “obs” of output table “mpreH$pred” includes the normalized values of CES-D for all the observations of the dataset: head(mpreH$pred)
ID     pred_m   resid_m   pred_ss   resid_ss      obs    pred_m1  pred_ss1
1  1 0.11087852 1.3084186 0.9342122  0.4850849 1.419297 0.11087852 0.9342122
2  2 0.06084686 1.2129375 1.4361563 -0.1623720 1.273784 0.06084686 1.4361563
3  2 0.13130684 1.8192143 1.5798646  0.3706565 1.950521 0.13130684 1.5798646
4  2 0.31339540 1.9993106 1.8557188  0.4569872 2.312706 0.31339540 1.8557188
5  2 0.82900686 1.9299128 2.2349562  0.5239635 2.758920 0.82900686 2.2349562
7  3 0.26327546 0.8557633 0.5912831  0.5277557 1.119039 0.26327546 0.5912831

The normalized variable (to be called for instance “normCESD”) can now be added to the dataset

paquid$normCESD <- NULL paquid$normCESD[!is.na(paquid$CESD)] <- mpreH$pred$obs for further analysis. summary(paquid[,c("CESD","normCESD")]) CESD normCESD Min. : 0.000 Min. :-1.8700 1st Qu.: 2.000 1st Qu.:-0.3916 Median : 6.000 Median : 0.5733 Mean : 8.488 Mean : 0.6016 3rd Qu.:12.000 3rd Qu.: 1.5582 Max. :52.000 Max. : 6.1425 NA's :146 NA's :146 # Comparison before and after normalization The transformation does not change the structure of the data. In particular, the spike at 0 is still present. par(mfrow=c(1,2)) hist(paquid$CESD, breaks=50, cex.main=0.9, main="Distribution of CESD")
hist(paquid\$normCESD, breaks=50, cex.main=0.9, main="Distribution of normCESD")

From the histogram, this is not clear that the normalized CESD has a Gaussian distribution. Yet, this normalization makes the use of methods for Gaussian outcomes correct.

For instance, when fitting a linear mixed model including the variable male, the subject-specific residuals plots become correct (right part):

normCESD <- hlme(normCESD ~ age65*male, random = ~ age65, subject = 'ID', data=paquid)
plot(normCESD, cex.main=0.8)

In comparison, without the normalization step, the subject-specific residuals exhibited a departure from normality.

CESD <- hlme(CESD ~ age65*male, random = ~ age65, subject = 'ID', data=paquid)
plot(CESD, cex.main=0.8)

# To go further

For future use, it can be interesting to define the metric of normCESD. Indeed, for now, its scale in not easy to understand as it depends on the data structure. Two options are possible:

## 1. Standardizing normCESD

The variable can be standardized (like for a Z-score) by removing the mean at a time and dividing by the standard deviation at the same time. This can be done if many data are observed at the same time, like at baseline. Here, with age as the time scale, we could not use that easily.

Unfortunately, baseline data is not available in the dataset! So here is a theoretical example of the computation:

## 2. Rescaling normCESD into 0 - 100

The variable can be scaled in 0-100 with 0 corresponding to the minimum value observed in the sample (usually 0) and 100 the maximum observed value. This works whatever the timescale under study:

## Example of model with normCESD100

The statistical analysis can now be performed using one of the normalized variables, normCESD, ZnormCESD or normCESD100.

With normCESD100 for example, a linear mixed model with a linear trajectory according to age with adjustment for male, education and their interaction with time as well as the birth cohort effect (age at entry) can be fitted:

Or a linear mixed model with a linear trajectory according to time since entry with adjustment for male, education and their interaction with time as well as the birth cohort effect (age at entry):

Or any other statistical method assuming normality for the outcome!