This is a step-by-step guide to explain the procedure proposed in the paper “Estimation of planned and unplanned missing individual scores in longitudinal designs using continuous-time state-space models” (https://doi.org/10.1037/met0000664). This guide aims to explain how to estimate a CT-SSM-LCS model and how to compute the RTS smoothed Kalman scores for data sets with missing data.
IMPORTANT: The R syntax and the structure of the data set should be accommodated to the objective of the particular study where this methodology is going to be applied.
In order to reproduce this example, you need to download this simulated data set: https://github.com/jamartinezhuertas/RTS-KS/blob/main/data.rds.
library(OpenMx)
library(dplyr)
First, we need to prepare invariant part of the OpenMx SSM models. The model includes seven parameters (see the main manuscript for an explanation of their interpretation. See Hunter, 2018, for an explanation of the model matrices):
A
with name b_y
.x0
with name y0mn
.x0
with name yAmn
.P0
with name y0v
.P0
with name yAv
.P0
with name y0Acv
.R
with name MerY
.opmxL <- list()
# Matrix A. Latent dynamics
opmxL$amat_ct <- mxMatrix(name = "A", "Full", 2, 2, free = c(TRUE,FALSE,FALSE,FALSE),
values = c(-.2,0,1,0),
dimnames = list( c("y0", "yA"), c("y0", "yA") ),
labels = c("b_y", NA,NA,NA),
lbound = c(-1, NA,NA,NA),
ubound = c(0, NA,NA,NA))
# Matrix B. Effects of time-varying covariates on latent variables (null in our model)
opmxL$bmat <- mxMatrix(name = "B", "Zero", 2, 1)
# Matrix C. Measurement model linking latent and observed variables
opmxL$cmat <- mxMatrix(name = "C", "Full", 1, 2, free = FALSE,
values = c(1,0),
dimnames = list( c("y"), c("y0", "yA") ),
labels = c(NA, NA) )
# Matrix D. Effects of time-varying covariates on observed variables (null in our model)
opmxL$dmat <- mxMatrix("Zero", 1, 1, name = "D")
# Matrix Q. Covariance matrix for the dynamic error (i.e., latent innovations)
opmxL$qmat <- mxMatrix("Zero", 2, 2, name = "Q")
# Matrix R. Covariance matrix for the measurement error
opmxL$rmat <- mxMatrix("Diag", 1, 1, TRUE, 2,
name = "R", labels = "MerY")
# Matrix x0. Means of the initial latent levels
opmxL$xmat <- mxMatrix(name = "x0", "Full", 2, 1, free = TRUE,
values = c(12, 7),
labels = c("y0mn", "yAmn"))
# Matrix P0. Covariance of the initial latent levels
opmxL$pmat <- mxMatrix(name = "P0", "Symm", 2, 2, TRUE,
values = c(25, 3, .4),
labels = c("y0v", "y0Acv", "yAv"),
lbound = c(0, NA, 0))
# Matrix u. Values of the time-varying covariates (null in our model)
opmxL$umat <- mxMatrix("Zero", 1, 1, name = "u")
# Matrix t. Variable in the data set with the times at which measurement happened.
opmxL$tmat <- mxMatrix('Full', 1, 1, name='time', labels='data.age')
Once the SSM matrices have been defined, we store them in the object
opmxL$modL_ct
. Then we create the expectation
continuous-time SSM object, specifying the names of the matrices in
opmxL$expODE
.
opmxL$modL_ct <- with(opmxL, list(amat_ct, bmat, cmat, dmat,
qmat, rmat, xmat, pmat,
umat, tmat))
opmxL$expODE <- mxExpectationStateSpaceContinuousTime(A = "A", B = "B",
C = "C", D = "D",
Q = "Q", R = "R",
x0 = "x0", P0 = "P0",
u = "u", t = "time")
So far, we have defined a n=1 model. The following function will be used later for creating creating this model for each individual in the sample.
genMxIndModels_ct <- function(x, dwork, modNames_ct) {
DataSetForSubjectK <- dwork[[x]]
ctModel <- opmxL$modL_ct
indivmodels <- mxModel(name = modNames_ct[x],
ctModel,
opmxL$expODE,
mxFitFunctionML(),
mxData(DataSetForSubjectK, type ='raw') ) }
Now we load the data set we want to analyze. In this example, we use a simulated data set with 200 participants in an accelerated longitudinal designs (ALD) with no unplanned data missingness (it can be downloaded at the beginning of this document) :
data <- readRDS("data.rds")
The data
object is a list; each element of the list
contains the one data frame for each participant. Lets explore the
contents of the data frames of two different individuals (individual #1
of the first cohort of the study, and individual #100 of the second
cohort).
Three characteristics of these example data are worth noting:
The simulated accelerated longitudinal design has three observations per individual, are measured in three alternate years. See the main text for a full explanation of this design.
The information of the individual measurements is: (1) the time
point at which the measurement was taken (age
, with
age=0 for five years), and (2) the observed score of the
variable (y
).
In addition to the rows with observed measures, we included
several rows with additional time points without observed information
(i.e., missing observations in variable y
). We will impute
the latent true levels at these specific time points t. Note
that the rows need to be ordered by the time variable.
Consider individual #1. This person was measured at ages 5.6730,
7.4038 and 9.1923. After centering age at t=0, the times are
rescaled to 0.6730, 2.4038 and 4.1923. The observed scores are: 4.1926,
6.0996 and 10.6837, respectively. If we had the chance to follow this
individual case during the complete trajectory, we would have also
observed this participant at other ages, e.g.: 6.9038, 8.5577, 10.0577…
These are the time points at which we will impute the missing data for
y
using Kalman scores. See the main text for a full
explanation of the simulated cohort-sequential design.
data[[1]]
Now, we estimate the CT-SSM-LCS model.
# Create individual models
cases <- seq(1,length(data))
modNames_ct <- paste0("i", cases, "ODE")
indivmodels_ct <- lapply(cases, genMxIndModels_ct, data, modNames_ct)
# Create multiple-subject model
multiSubjODE <- mxModel(name = "MultiODE", indivmodels_ct,
mxFitFunctionMultigroup(modNames_ct))
# Fit the model
multiSubjODERun <- mxRun(multiSubjODE)
# Examine group-level results
multiSubjODEsumm <- summary(multiSubjODERun)
multiSubjODEsumm$parameters
# Obtain individual-level Kalman Scores
ks <- lapply(X = multiSubjODERun@submodels,
FUN = mxKalmanScores, frontend=FALSE )
In this example, the names of the individual models have the format
ixODE
, where x
represents the ID of the
participant. For example, the results for individuals 1 and 2 are
labeled as i1ODE
and i2ODE
, respectively. More
info can be found in the documentation of the functions
mxKalmanScores
and
mxExpectationStateSpaceContinuousTime
, and in Hunter
(2018). The function mxKalmanScores
return, for each
individual, a list with several objects. Of interest here are
Rauch-Tung-Striebel (RTS) smoothed latent scores which are the estimated
Kalman scores, stored in xSmoothed
, and their corresponding
error covariance matrices PSmoothed
, which provide a
measure of the estimation uncertainty.
str(ks$i1ODE)
## List of 10
## $ xPredicted: num [1:16, 1:2] 9.12 10.3 9.03 6.87 6.1 ...
## $ PPredicted: num [1:2, 1:2, 1:16] 24.66 3.98 3.98 1.36 23.78 ...
## $ xUpdated : num [1:16, 1:2] 9.12 1.79 4.56 4.3 4.08 ...
## $ PUpdated : num [1:2, 1:2, 1:16] 24.66 3.98 3.98 1.36 2.03 ...
## $ xSmoothed : num [1:16, 1:2] -2.65 -1.71 1.99 2.43 2.47 ...
## $ PSmoothed : num [1:2, 1:2, 1:16] 3.325 0.536 0.536 0.809 1.567 ...
## $ m2ll : num [1:16] 0 5.11 0 4.76 0 ...
## $ L : num [1:16] 1 0.0776 1 0.0926 1 ...
## $ yPredicted: num [1:16, 1] NA 10.3 9.03 6.87 6.1 ...
## $ SPredicted: num [1, 1, 1:16] NA 26 5 3.95 3.68 ...
Now we illustrate how to compute the RTS smoothed Kalman scores for
the first individual in the data set. To do so, we need to specify the
index or name of the individual in the following function. In
model
we provide the estimated model for the individual of
interest. In data
we provide the dataset with the observed
scores and the additional rows with the time points at which the latent
variables should be estimated. Note that our model includes two latent
variables: time-varying the level (first column) and the time-invariant
additive component (second column). Therefore, estimates of the
unobserved ability level are stored in the first column.
id <- 1
(cKsId <- mxKalmanScores(model = multiSubjODERun@submodels[[id]], data = data[[id]])$xSmoothed)
## Computing Kalman scores in frontend R. This may take a few seconds.
## [,1] [,2]
## [1,] 8.055989 7.308151
## [2,] 9.210740 7.294178
## [3,] 15.209991 6.973313
## [4,] 17.203232 6.740263
## [5,] 19.597583 6.760903
## [6,] 21.080069 6.780692
## [7,] 22.157716 6.780692
## [8,] 23.432075 6.780692
## [9,] 24.500958 6.780692
## [10,] 25.242781 6.780692
## [11,] 25.877797 6.780692
## [12,] 25.951360 6.780692
## [13,] 26.381499 6.780692
## [14,] 26.779519 6.780692
## [15,] 26.850196 6.780692
## [16,] 27.145425 6.780692
Now we combine the Kalman scores with the original data set for the
individual, where age
is the age at the different time
points, y_observed
stores the observed values (in this
case, we have only three observations per individual), and
y_KS
stores the predicted RTS smoothed Kalman score for
each time point.
data_ind <- as.data.frame(cbind(data[[id]]$age, data[[id]]$y, cKsId[-1,1])) # The first row of cKsId must be removed
colnames(data_ind) <- c("age", "y_observed", "y_KS")
(data_ind)
The function mxKalmanScores
also provides the error
associated to each Kalman score in the object PSmoothed
.
These errors can be used to compute the confidence interval for each
time point.
alpha <- .05 # (1 - confidence level)
qZ <- qnorm(p = 1-(alpha/2))
CI <- matrix(data = NA, ncol = 2, nrow = nrow(data_ind))
errors <- sqrt(mxKalmanScores(model = multiSubjODERun@submodels[[id]],
data = data[[id]])$PSmoothed[1,1,2:(nrow(data_ind)+1)])
CI[,1] <- data_ind$y_KS + qZ*errors
CI[,2] <- data_ind$y_KS - qZ*errors
CI <- as.data.frame(CI)
CI <- cbind(CI, data_ind$age)
colnames(CI) <- c("Upp", "Low", "age")
(CI)
First we combine the rows of all the individuals into a single data frame.
data_long <- as.data.frame(do.call(rbind, data))
data_long <- data_long[!is.na(data_long$y),]
Now we present a plot that includes, for the individual of interest:
library(ggplot2)
ggplot() +
# Group trajectories
geom_line(data = data_long, aes(x=age+5, y=y, group=id), colour = "dark grey", alpha=0.2)+
# CI for KS
geom_line(data=CI, aes(x=data_ind$age+5, y=Upp), size = 0.2, color='blue') +
geom_line(data=CI, aes(x=data_ind$age+5, y=Low), size = 0.2, color='blue') +
geom_ribbon(aes(x=CI$age+5, ymax= CI$Upp, ymin= CI$Low),
fill="blue", alpha=.15) +
# KS estimated latent scores
geom_line(data=data_ind, aes(x=age+5, y=y_KS), size = .8, color='blue', linetype = "dashed") +
## Observed scores
geom_point(data=data_ind, aes(x=age+5, y=y_observed), size = 2) +
geom_line(data=data_ind[!is.na(data_ind$y_observed),], aes(x=age+5, y=y_observed, group=1)) +
scale_x_continuous(breaks=seq(5, 20, 1)) +
ylim(0,45) + ylab("y scores") + xlab("age") +
theme_classic(base_size = 12, base_line_size = 0.5)