In this vignette we demonstrate different ways in which longitudinal cluster models can be internally validated.
library(latrend)
We explore the latrendData dataset. This is a synthetic dataset for which the reference group of each trajectory is available, indicated by the Class column. However, in this vignette we will assume that the true group specification and the number of groups are unknown. Instead, we have a candidate model which we wish to validate internally.
data(latrendData)
head(latrendData)
#> Id Time Y Class
#> 1 1 0.0000000 -1.08049205 Class 1
#> 2 1 0.2222222 -0.68024151 Class 1
#> 3 1 0.4444444 -0.65148373 Class 1
#> 4 1 0.6666667 -0.39115398 Class 1
#> 5 1 0.8888889 -0.19407876 Class 1
#> 6 1 1.1111111 -0.02991783 Class 1
Specify the package options to adopt the respective column names of the loaded dataset, for convenience.
options(latrend.id = "Id", latrend.time = "Time")
We consider a KML model with 3 clusters as the candidate model to validate. We define the method with a reduced number of repeated random starts (via the nbRedrawing argument) in order to reduce the computation time of the repeated evaluations below. This is done for demonstration purposes only.
kml <- lcMethodKML("Y", nClusters = 3, nbRedrawing = 5)
kml
#> lcMethodKML specifying "longitudinal k-means (KML)"
#> time: getOption("latrend.time")
#> id: getOption("latrend.id")
#> nClusters: 3
#> nbRedrawing: 5
#> maxIt: 200
#> imputationMethod: "copyMean"
#> distanceName: "euclidean"
#> power: 2
#> distance: function() {}
#> centerMethod: meanNA
#> startingCond: "nearlyAll"
#> nbCriterion: 1000
#> scale: TRUE
#> response: "Y"
The purpose of model validation is essentially to establish that the model is robust and generalizes well to unseen data. A necessary condition for both aspects is that the model is reproducible on the original training data. This evaluation helps to ensure that the model estimation procedure is robust, i.e., that it does not yield spurious model solutions.
We can fit the method repeatedly to the same data using the latrendRep function.
repModels <- latrendRep(kml, data = latrendData, .rep = 10)
print(repModels, excludeShared = FALSE)
#> List of 10 lcModels with
#> .name .method data seed time id nClusters nbRedrawing maxIt
#> 1 1 kml latrendData 1140350788 Time Id 3 5 200
#> 2 2 kml latrendData 312928385 Time Id 3 5 200
#> 3 3 kml latrendData 866248189 Time Id 3 5 200
#> 4 4 kml latrendData 1909893419 Time Id 3 5 200
#> 5 5 kml latrendData 554504146 Time Id 3 5 200
#> 6 6 kml latrendData 884616499 Time Id 3 5 200
#> 7 7 kml latrendData 803234389 Time Id 3 5 200
#> 8 8 kml latrendData 1158971242 Time Id 3 5 200
#> 9 9 kml latrendData 934673902 Time Id 3 5 200
#> 10 10 kml latrendData 1632225031 Time Id 3 5 200
#> imputationMethod distanceName power distance
#> 1 copyMean euclidean 2 function () {}
#> 2 copyMean euclidean 2 function () {}
#> 3 copyMean euclidean 2 function () {}
#> 4 copyMean euclidean 2 function () {}
#> 5 copyMean euclidean 2 function () {}
#> 6 copyMean euclidean 2 function () {}
#> 7 copyMean euclidean 2 function () {}
#> 8 copyMean euclidean 2 function () {}
#> 9 copyMean euclidean 2 function () {}
#> 10 copyMean euclidean 2 function () {}
#> centerMethod startingCond nbCriterion
#> 1 function (x, ...) { mean(x, ..., na.rm = TRUE)} nearlyAll 1000
#> 2 function (x, ...) { mean(x, ..., na.rm = TRUE)} nearlyAll 1000
#> 3 function (x, ...) { mean(x, ..., na.rm = TRUE)} nearlyAll 1000
#> 4 function (x, ...) { mean(x, ..., na.rm = TRUE)} nearlyAll 1000
#> 5 function (x, ...) { mean(x, ..., na.rm = TRUE)} nearlyAll 1000
#> 6 function (x, ...) { mean(x, ..., na.rm = TRUE)} nearlyAll 1000
#> 7 function (x, ...) { mean(x, ..., na.rm = TRUE)} nearlyAll 1000
#> 8 function (x, ...) { mean(x, ..., na.rm = TRUE)} nearlyAll 1000
#> 9 function (x, ...) { mean(x, ..., na.rm = TRUE)} nearlyAll 1000
#> 10 function (x, ...) { mean(x, ..., na.rm = TRUE)} nearlyAll 1000
#> scale response
#> 1 TRUE Y
#> 2 TRUE Y
#> 3 TRUE Y
#> 4 TRUE Y
#> 5 TRUE Y
#> 6 TRUE Y
#> 7 TRUE Y
#> 8 TRUE Y
#> 9 TRUE Y
#> 10 TRUE Y
A convenient way to assess the stability across repeated runs is to compare the models on one or more internal model metrics. Similar solutions should yield similar metric scores.
repSelfMetrics <- metric(repModels, name = c("BIC", "WMAE", "APPA"))
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
head(repSelfMetrics)
#> BIC WMAE APPA
#> 1 655.5420 0.1925632 NA
#> 2 655.5420 0.1925632 NA
#> 3 655.5420 0.1925632 NA
#> 4 655.0419 0.1924808 NA
#> 5 655.5420 0.1925632 NA
#> 6 655.0419 0.1924808 NA
summary(repSelfMetrics[, "WMAE"])
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.1925 0.1925 0.1925 0.1925 0.1926 0.1926
As can be seen from these numbers, the models are (practically) identical in terms of model fit and measurement error (the cluster separation metric APPA is not available for this model).
Alternatively, we can select the model with the best fit, and compare it against the other fitted models.
bestRepModel <- min(repModels, "BIC")
externalMetric(repModels, bestRepModel, name = "adjustedRand")
#> [1] 1 1 1 1 1 1 1 1 1 1
As indicated by the adjusted Rand index, the partitions found in the repeated runs are in perfect agreement with the best model (a score of 1 indicates identical cluster assignments).
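To inspect the agreement between two runs directly, one could cross-tabulate their trajectory assignments. A minimal sketch, not part of the original analysis, using the trajectoryAssignments function:
# Cross-tabulate the cluster assignments of the best model against those of
# another repeated run; off-diagonal counts would indicate disagreements
# (up to a relabeling of the clusters).
table(
  Best = trajectoryAssignments(bestRepModel),
  Other = trajectoryAssignments(repModels[[2]])
)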
Similarly, we can compute the pairwise adjusted Rand indices, resulting in a similarity matrix:
simMat <- externalMetric(repModels, name = "adjustedRand")
round(simMat, 2)
#> 1 2 3 4 5 6 7 8 9
#> 2 1
#> 3 1 1
#> 4 1 1 1
#> 5 1 1 1 1
#> 6 1 1 1 1 1
#> 7 1 1 1 1 1 1
#> 8 1 1 1 1 1 1 1
#> 9 1 1 1 1 1 1 1 1
#> 10 1 1 1 1 1 1 1 1 1
summary(simMat)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1 1 1 1 1 1
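For larger numbers of repetitions it can help to inspect the structure of the similarity matrix visually. A sketch, assuming simMat is returned as a dist object of pairwise adjusted Rand indices (an illustration, not part of the original vignette):
# Convert the similarity matrix into a dissimilarity matrix and apply
# hierarchical clustering; runs that merge at a low height produced
# (nearly) identical partitions.
repDissim <- as.dist(1 - as.matrix(simMat))
plot(hclust(repDissim, method = "average"))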
Another way to assess the stability of the candidate method is to fit it on bootstrap samples of the original dataset, using the latrendBoot function.
bootModels <- latrendBoot(kml, data = latrendData, samples = 10)
bootModels
#> List of 10 lcModels with
#> .name .method data seed
#> 1 1 kml bootSample(latrendData, "Id", 64231719L) 1126264256
#> 2 2 kml bootSample(latrendData, "Id", 893996438L) 62640092
#> 3 3 kml bootSample(latrendData, "Id", 1434113967L) 359523700
#> 4 4 kml bootSample(latrendData, "Id", 958577579L) 320237104
#> 5 5 kml bootSample(latrendData, "Id", 2079738042L) 439046689
#> 6 6 kml bootSample(latrendData, "Id", 2012583691L) 1515304167
#> 7 7 kml bootSample(latrendData, "Id", 520205446L) 1617984507
#> 8 8 kml bootSample(latrendData, "Id", 648143680L) 43470232
#> 9 9 kml bootSample(latrendData, "Id", 2127214623L) 446397691
#> 10 10 kml bootSample(latrendData, "Id", 882537923L) 50033203
bootMetrics <- metric(bootModels, name = c("BIC", "WMAE", "APPA"))
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
#> Warning in FUN(X[[i]], ...): Internal metric(s) "APPA" are not defined.
#> Returning NA.
bootMetrics
#> BIC WMAE APPA
#> 1 386.5510 0.1874964 NA
#> 2 430.5063 0.1901289 NA
#> 3 374.7246 0.1859108 NA
#> 4 544.3614 0.1984359 NA
#> 5 480.6107 0.1986547 NA
#> 6 461.5246 0.1910348 NA
#> 7 410.5406 0.1845597 NA
#> 8 372.2148 0.1864856 NA
#> 9 466.8833 0.1950178 NA
#> 10 417.8994 0.1902601 NA
colMeans(bootMetrics)
#> BIC WMAE APPA
#> 434.5816773 0.1907985 NA
apply(bootMetrics, 2, sd)
#> BIC WMAE APPA
#> 54.356727198 0.005065835 NA
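The spread of the bootstrap metrics can also be summarized by percentile intervals; a minimal sketch using base R (not part of the original vignette):
# 2.5% and 97.5% percentiles of the bootstrapped WMAE, giving a rough
# percentile interval for the measurement error of the candidate method.
quantile(bootMetrics[, "WMAE"], probs = c(0.025, 0.975))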
Lastly, we can fit models using k-fold cross-validation to validate the models on previously unseen data from the test folds.
trainModels <- latrendCV(kml, data = latrendData, folds = 10, seed = 1)
trainModels
#> List of 10 lcModels with
#> .name .method data seed
#> 1 1 kml trainFold(latrendData, fold = 1, "Id", 10, 1) 778402354
#> 2 2 kml trainFold(latrendData, fold = 2, "Id", 10, 1) 1162807070
#> 3 3 kml trainFold(latrendData, fold = 3, "Id", 10, 1) 117301597
#> 4 4 kml trainFold(latrendData, fold = 4, "Id", 10, 1) 1602292808
#> 5 5 kml trainFold(latrendData, fold = 5, "Id", 10, 1) 595415630
#> 6 6 kml trainFold(latrendData, fold = 6, "Id", 10, 1) 665002458
#> 7 7 kml trainFold(latrendData, fold = 7, "Id", 10, 1) 950483445
#> 8 8 kml trainFold(latrendData, fold = 8, "Id", 10, 1) 564460360
#> 9 9 kml trainFold(latrendData, fold = 9, "Id", 10, 1) 1404535239
#> 10 10 kml trainFold(latrendData, fold = 10, "Id", 10, 1) 779295076
Alternatively, we can generate the training data folds ourselves and fit models on them using latrendBatch.
dataFolds <- createTrainDataFolds(latrendData, folds = 10)
foldModels <- latrendBatch(kml, data = dataFolds)
foldModels
#> List of 10 lcModels with
#> .name .method data seed
#> 1 1 kml dataFolds[[1]] 26926021
#> 2 2 kml dataFolds[[2]] 1943230904
#> 3 3 kml dataFolds[[3]] 1181228795
#> 4 4 kml dataFolds[[4]] 1907309274
#> 5 5 kml dataFolds[[5]] 1135519048
#> 6 6 kml dataFolds[[6]] 1705419739
#> 7 7 kml dataFolds[[7]] 955514422
#> 8 8 kml dataFolds[[8]] 376276510
#> 9 9 kml dataFolds[[9]] 225308591
#> 10 10 kml dataFolds[[10]] 778273294
The corresponding list of test data folds is obtained using:
testDataFolds <- createTestDataFolds(latrendData, dataFolds)
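As a basic sanity check, one could verify that the trajectories in a test fold do not occur in the corresponding training fold. A small sketch (not part of the original vignette):
# The Ids in the first test fold should not overlap with those in the
# corresponding training fold; the intersection is expected to be empty.
intersect(unique(testDataFolds[[1]]$Id), unique(dataFolds[[1]]$Id))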