Validating cluster models

Niek Den Teuling

2020-11-06

In this vignette we demonstrate different ways in which longitudinal cluster models can be internally validated.

library(latrend)

Demonstration

We explore the latrendData dataset. This is a synthetic dataset for which the reference group of each trajectory is available, indicated by the Class column. However, in this vignette we assume that the true group assignment and the number of groups are unknown; instead, we have a candidate model that we wish to validate internally.

data(latrendData)
head(latrendData)
#>   Id      Time           Y   Class
#> 1  1 0.0000000 -1.08049205 Class 1
#> 2  1 0.2222222 -0.68024151 Class 1
#> 3  1 0.4444444 -0.65148373 Class 1
#> 4  1 0.6666667 -0.39115398 Class 1
#> 5  1 0.8888889 -0.19407876 Class 1
#> 6  1 1.1111111 -0.02991783 Class 1
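
For reference, the number of trajectories in the dataset can be counted from the Id column:

length(unique(latrendData$Id))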

For convenience, we set the package options to the trajectory identifier and time column names of the loaded dataset.

options(latrend.id = "Id", latrend.time = "Time")

The candidate model

We consider a KML model with 3 clusters as the candidate model to validate. For demonstration purposes only, we define the method with a reduced number of repeated random starts (via the nbRedrawing argument) to limit the computation time of the repeated evaluations below.

kml <- lcMethodKML("Y", nClusters = 3, nbRedrawing = 5)
kml
#> lcMethodKML specifying "longitudinal k-means (KML)"
#>  time:           getOption("latrend.time")
#>  id:             getOption("latrend.id")
#>  nClusters:      3
#>  nbRedrawing:    5
#>  maxIt:          200
#>  imputationMethod:"copyMean"
#>  distanceName:   "euclidean"
#>  power:          2
#>  distance:       function() {}
#>  centerMethod:   meanNA
#>  startingCond:   "nearlyAll"
#>  nbCriterion:    1000
#>  scale:          TRUE
#>  response:       "Y"
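
Before starting the validation, we might estimate the candidate model once on the full dataset as a sanity check. A minimal sketch, where plot() is assumed to show the fitted cluster trajectories:

# estimate the candidate model on the full dataset
model <- latrend(kml, data = latrendData)
plot(model)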

Evaluate stability of the estimation through repeated estimation

The purpose of model validation is essentially to verify that the model is robust and generalizes well to unseen data. A necessary condition for both aspects is that the model is reproducible on the original training data. This evaluation helps to ensure that the model estimation procedure is robust, i.e., that it does not yield spurious model solutions.

We can fit a method repeatedly using the latrendRep function.

repModels <- latrendRep(kml, data = latrendData, .rep = 10)
print(repModels, excludeShared = FALSE)
#> List of 10 lcModels with
#>    .name .method        data       seed time id nClusters nbRedrawing maxIt
#> 1      1     kml latrendData 1140350788 Time Id         3           5   200
#> 2      2     kml latrendData  312928385 Time Id         3           5   200
#> 3      3     kml latrendData  866248189 Time Id         3           5   200
#> 4      4     kml latrendData 1909893419 Time Id         3           5   200
#> 5      5     kml latrendData  554504146 Time Id         3           5   200
#> 6      6     kml latrendData  884616499 Time Id         3           5   200
#> 7      7     kml latrendData  803234389 Time Id         3           5   200
#> 8      8     kml latrendData 1158971242 Time Id         3           5   200
#> 9      9     kml latrendData  934673902 Time Id         3           5   200
#> 10    10     kml latrendData 1632225031 Time Id         3           5   200
#>    imputationMethod distanceName power       distance
#> 1          copyMean    euclidean     2 function () {}
#> 2          copyMean    euclidean     2 function () {}
#> 3          copyMean    euclidean     2 function () {}
#> 4          copyMean    euclidean     2 function () {}
#> 5          copyMean    euclidean     2 function () {}
#> 6          copyMean    euclidean     2 function () {}
#> 7          copyMean    euclidean     2 function () {}
#> 8          copyMean    euclidean     2 function () {}
#> 9          copyMean    euclidean     2 function () {}
#> 10         copyMean    euclidean     2 function () {}
#>                                          centerMethod startingCond nbCriterion
#> 1  function (x, ...) {    mean(x, ..., na.rm = TRUE)}    nearlyAll        1000
#> 2  function (x, ...) {    mean(x, ..., na.rm = TRUE)}    nearlyAll        1000
#> 3  function (x, ...) {    mean(x, ..., na.rm = TRUE)}    nearlyAll        1000
#> 4  function (x, ...) {    mean(x, ..., na.rm = TRUE)}    nearlyAll        1000
#> 5  function (x, ...) {    mean(x, ..., na.rm = TRUE)}    nearlyAll        1000
#> 6  function (x, ...) {    mean(x, ..., na.rm = TRUE)}    nearlyAll        1000
#> 7  function (x, ...) {    mean(x, ..., na.rm = TRUE)}    nearlyAll        1000
#> 8  function (x, ...) {    mean(x, ..., na.rm = TRUE)}    nearlyAll        1000
#> 9  function (x, ...) {    mean(x, ..., na.rm = TRUE)}    nearlyAll        1000
#> 10 function (x, ...) {    mean(x, ..., na.rm = TRUE)}    nearlyAll        1000
#>    scale response
#> 1   TRUE        Y
#> 2   TRUE        Y
#> 3   TRUE        Y
#> 4   TRUE        Y
#> 5   TRUE        Y
#> 6   TRUE        Y
#> 7   TRUE        Y
#> 8   TRUE        Y
#> 9   TRUE        Y
#> 10  TRUE        Y

A convenient way to assess the stability across repeated runs is to compare the models on one or more internal model metrics. Similar solutions should yield similar metric scores.

repSelfMetrics <- metric(repModels, name = c("BIC", "WMAE", "APPA"))
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.

head(repSelfMetrics)
#>        BIC      WMAE APPA
#> 1 655.5420 0.1925632   NA
#> 2 655.5420 0.1925632   NA
#> 3 655.5420 0.1925632   NA
#> 4 655.0419 0.1924808   NA
#> 5 655.5420 0.1925632   NA
#> 6 655.0419 0.1924808   NA
summary(repSelfMetrics[, "WMAE"])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.1925  0.1925  0.1925  0.1925  0.1926  0.1926

As the metrics show, the models are practically identical in terms of model fit and measurement error. The APPA cluster separation metric is not defined for this model type, hence the NA values.
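
We could likewise quantify the spread of each metric across the repetitions, analogous to the bootstrap summary further below; for example:

apply(repSelfMetrics, 2, sd, na.rm = TRUE)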

Comparison between fits

Alternatively, we can select the model with the best fit, and compare it against the other fitted models.

bestRepModel <- min(repModels, "BIC")
externalMetric(repModels, bestRepModel, name = "adjustedRand")
#>  [1] 1 1 1 1 1 1 1 1 1 1

As indicated by the adjusted Rand index, the cluster assignments are identical across the repeated runs (a score of 1 indicates perfect agreement).

Similarly, we can compute the adjusted Rand index between all pairs of fitted models, resulting in a similarity matrix:

simMat <- externalMetric(repModels, name = "adjustedRand")
round(simMat, 2)
#>    1 2 3 4 5 6 7 8 9
#> 2  1                
#> 3  1 1              
#> 4  1 1 1            
#> 5  1 1 1 1          
#> 6  1 1 1 1 1        
#> 7  1 1 1 1 1 1      
#> 8  1 1 1 1 1 1 1    
#> 9  1 1 1 1 1 1 1 1  
#> 10 1 1 1 1 1 1 1 1 1
summary(simMat)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       1       1       1       1       1       1

Evaluating replicability and stability through bootstrapping

We can further assess the stability of the candidate model by fitting the method on bootstrap samples of the trajectories, using the latrendBoot function.

bootModels <- latrendBoot(kml, data = latrendData, samples = 10)
bootModels
#> List of 10 lcModels with
#>    .name .method                                       data       seed
#> 1      1     kml   bootSample(latrendData, "Id", 64231719L) 1126264256
#> 2      2     kml  bootSample(latrendData, "Id", 893996438L)   62640092
#> 3      3     kml bootSample(latrendData, "Id", 1434113967L)  359523700
#> 4      4     kml  bootSample(latrendData, "Id", 958577579L)  320237104
#> 5      5     kml bootSample(latrendData, "Id", 2079738042L)  439046689
#> 6      6     kml bootSample(latrendData, "Id", 2012583691L) 1515304167
#> 7      7     kml  bootSample(latrendData, "Id", 520205446L) 1617984507
#> 8      8     kml  bootSample(latrendData, "Id", 648143680L)   43470232
#> 9      9     kml bootSample(latrendData, "Id", 2127214623L)  446397691
#> 10    10     kml  bootSample(latrendData, "Id", 882537923L)   50033203
bootMetrics <- metric(bootModels, name = c("BIC", "WMAE", "APPA"))
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.

bootMetrics
#>         BIC      WMAE APPA
#> 1  386.5510 0.1874964   NA
#> 2  430.5063 0.1901289   NA
#> 3  374.7246 0.1859108   NA
#> 4  544.3614 0.1984359   NA
#> 5  480.6107 0.1986547   NA
#> 6  461.5246 0.1910348   NA
#> 7  410.5406 0.1845597   NA
#> 8  372.2148 0.1864856   NA
#> 9  466.8833 0.1950178   NA
#> 10 417.8994 0.1902601   NA
colMeans(bootMetrics)
#>         BIC        WMAE        APPA 
#> 434.5816773   0.1907985          NA
apply(bootMetrics, 2, sd)
#>          BIC         WMAE         APPA 
#> 54.356727198  0.005065835           NA
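
Assuming the bootstrap metrics reasonably approximate the sampling distribution, a percentile interval gives a rough sense of the uncertainty; for example, for the WMAE:

# 95% percentile interval across the bootstrap fits (illustrative)
quantile(bootMetrics[, "WMAE"], probs = c(0.025, 0.975))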

Ten-fold cross validation

Lastly, we can fit models using k-fold cross-validation to validate the models on previously unseen data from the test folds.

trainModels <- latrendCV(kml, data = latrendData, folds = 10, seed = 1)
trainModels
#> List of 10 lcModels with
#>    .name .method                                           data       seed
#> 1      1     kml  trainFold(latrendData, fold = 1, "Id", 10, 1)  778402354
#> 2      2     kml  trainFold(latrendData, fold = 2, "Id", 10, 1) 1162807070
#> 3      3     kml  trainFold(latrendData, fold = 3, "Id", 10, 1)  117301597
#> 4      4     kml  trainFold(latrendData, fold = 4, "Id", 10, 1) 1602292808
#> 5      5     kml  trainFold(latrendData, fold = 5, "Id", 10, 1)  595415630
#> 6      6     kml  trainFold(latrendData, fold = 6, "Id", 10, 1)  665002458
#> 7      7     kml  trainFold(latrendData, fold = 7, "Id", 10, 1)  950483445
#> 8      8     kml  trainFold(latrendData, fold = 8, "Id", 10, 1)  564460360
#> 9      9     kml  trainFold(latrendData, fold = 9, "Id", 10, 1) 1404535239
#> 10    10     kml trainFold(latrendData, fold = 10, "Id", 10, 1)  779295076
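
The fold models can be summarized with internal metrics in the same way as before. Note that these are computed on the respective training folds; for example:

metric(trainModels, name = "WMAE")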

Manual cross validation

Alternatively, we can generate the training data folds ourselves, and fit models using latrendBatch.

dataFolds <- createTrainDataFolds(latrendData, folds = 10)
foldModels <- latrendBatch(kml, data = dataFolds)
foldModels
#> List of 10 lcModels with
#>    .name .method            data       seed
#> 1      1     kml  dataFolds[[1]]   26926021
#> 2      2     kml  dataFolds[[2]] 1943230904
#> 3      3     kml  dataFolds[[3]] 1181228795
#> 4      4     kml  dataFolds[[4]] 1907309274
#> 5      5     kml  dataFolds[[5]] 1135519048
#> 6      6     kml  dataFolds[[6]] 1705419739
#> 7      7     kml  dataFolds[[7]]  955514422
#> 8      8     kml  dataFolds[[8]]  376276510
#> 9      9     kml  dataFolds[[9]]  225308591
#> 10    10     kml dataFolds[[10]]  778273294

The corresponding list of test data folds is obtained using:

testDataFolds <- createTestDataFolds(latrendData, dataFolds)
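
Each entry of testDataFolds contains the trajectories that were held out from the corresponding training fold:

# one test set per training fold
length(testDataFolds)
head(testDataFolds[[1]])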