Step-by-step guide in R: Estimation of planned and unplanned missing individual scores in longitudinal designs using continuous-time state-space models

This is a step-by-step guide to explain the procedure proposed in the paper “Estimation of planned and unplanned missing individual scores in longitudinal designs using continuous-time state-space models” (https://doi.org/10.1037/met0000664). This guide aims to explain how to estimate a CT-SSM-LCS model and how to compute the RTS smoothed Kalman scores for data sets with missing data.

IMPORTANT: The R syntax and the structure of the data set should be accommodated to the objective of the particular study where this methodology is going to be applied.

0. Download the simulated data set

In order to reproduce this example, you need to download this simulated data set: https://github.com/jamartinezhuertas/RTS-KS/blob/main/data.rds.

1. Load the required R packages

library(OpenMx)
library(dplyr)

2. Prepare the matrices for the CT-LCS-SSM

First, we need to prepare invariant part of the OpenMx SSM models. The model includes seven parameters (see the main manuscript for an explanation of their interpretation. See Hunter, 2018, for an explanation of the model matrices):

Self-feedback $\beta$, included in matrix A with name b_y.
Mean of the initial latent level $μ_{y0}$, included in matrix x0 with name y0mn.
Mean of the latent additive component $μ_{ya}$, included in matrix x0 with name yAmn.
Variance of the initial level $σ^2_{y0}$, included in matrix P0 with name y0v.
Variance of the additive component $σ^2_{ya}$, included in matrix P0 with name yAv.
Covariance between initial level and additive component $σ_{0,a}$, included in matrix P0 with name y0Acv.
Measurement error variance $σ^2_e$, included in matrix R with name MerY.

opmxL <- list()

# Matrix A. Latent dynamics
opmxL$amat_ct <- mxMatrix(name = "A", "Full", 2, 2, free = c(TRUE,FALSE,FALSE,FALSE),
                          values = c(-.2,0,1,0),
                          dimnames = list( c("y0", "yA"), c("y0", "yA") ),
                          labels = c("b_y", NA,NA,NA),
                          lbound = c(-1, NA,NA,NA),
                          ubound = c(0, NA,NA,NA))

# Matrix B. Effects of time-varying covariates on latent variables (null in our model)
opmxL$bmat <- mxMatrix(name = "B", "Zero", 2, 1)

# Matrix C. Measurement model linking latent and observed variables
opmxL$cmat <- mxMatrix(name = "C", "Full", 1, 2, free = FALSE,
                       values = c(1,0), 
                       dimnames = list( c("y"), c("y0", "yA") ),
                       labels = c(NA, NA)  )

# Matrix D. Effects of time-varying covariates on observed variables (null in our model)
opmxL$dmat <- mxMatrix("Zero", 1, 1, name = "D")

# Matrix Q. Covariance matrix for the dynamic error (i.e., latent innovations)
opmxL$qmat <- mxMatrix("Zero", 2, 2, name = "Q")

# Matrix R. Covariance matrix for the measurement error
opmxL$rmat <- mxMatrix("Diag", 1, 1, TRUE, 2,
                       name = "R", labels = "MerY")

# Matrix x0. Means of the initial latent levels
opmxL$xmat <- mxMatrix(name = "x0", "Full", 2, 1, free = TRUE,
                       values = c(12, 7),
                       labels = c("y0mn", "yAmn"))

# Matrix P0. Covariance of the initial latent levels
opmxL$pmat <- mxMatrix(name = "P0", "Symm", 2, 2, TRUE,
                       values = c(25, 3, .4),
                       labels = c("y0v", "y0Acv", "yAv"),
                       lbound = c(0, NA, 0))

# Matrix u. Values of the time-varying covariates (null in our model)
opmxL$umat <- mxMatrix("Zero", 1, 1, name = "u")

# Matrix t. Variable in the data set with the times at which measurement happened.
opmxL$tmat <- mxMatrix('Full', 1, 1, name='time', labels='data.age')

Once the SSM matrices have been defined, we store them in the object opmxL$modL_ct. Then we create the expectation continuous-time SSM object, specifying the names of the matrices in opmxL$expODE.

opmxL$modL_ct <- with(opmxL, list(amat_ct, bmat, cmat, dmat,
                                  qmat, rmat, xmat, pmat,
                                  umat, tmat))

opmxL$expODE <- mxExpectationStateSpaceContinuousTime(A = "A", B = "B",
                                                      C = "C", D = "D",
                                                      Q = "Q", R = "R",
                                                      x0 = "x0", P0 = "P0",
                                                      u = "u", t = "time")

So far, we have defined a n=1 model. The following function will be used later for creating creating this model for each individual in the sample.

genMxIndModels_ct <- function(x, dwork, modNames_ct) {
  DataSetForSubjectK <- dwork[[x]]
  ctModel <- opmxL$modL_ct
  indivmodels <- mxModel(name = modNames_ct[x],
                         ctModel,
                         opmxL$expODE,
                         mxFitFunctionML(),
                         mxData(DataSetForSubjectK, type ='raw')  )  }

3. Read your data set (and tips for imputing missing data points)

Now we load the data set we want to analyze. In this example, we use a simulated data set with 200 participants in an accelerated longitudinal designs (ALD) with no unplanned data missingness (it can be downloaded at the beginning of this document) :

data <- readRDS("data.rds")

The data object is a list; each element of the list contains the one data frame for each participant. Lets explore the contents of the data frames of two different individuals (individual #1 of the first cohort of the study, and individual #100 of the second cohort).

Three characteristics of these example data are worth noting:

The simulated accelerated longitudinal design has three observations per individual, are measured in three alternate years. See the main text for a full explanation of this design.
The information of the individual measurements is: (1) the time point at which the measurement was taken (age, with age=0 for five years), and (2) the observed score of the variable (y).
In addition to the rows with observed measures, we included several rows with additional time points without observed information (i.e., missing observations in variable y). We will impute the latent true levels at these specific time points t. Note that the rows need to be ordered by the time variable.

Consider individual #1. This person was measured at ages 5.6730, 7.4038 and 9.1923. After centering age at t=0, the times are rescaled to 0.6730, 2.4038 and 4.1923. The observed scores are: 4.1926, 6.0996 and 10.6837, respectively. If we had the chance to follow this individual case during the complete trajectory, we would have also observed this participant at other ages, e.g.: 6.9038, 8.5577, 10.0577… These are the time points at which we will impute the missing data for y using Kalman scores. See the main text for a full explanation of the simulated cohort-sequential design.

data[[1]]

4. Estimate the CT-SSM-LCS model

Now, we estimate the CT-SSM-LCS model.

# Create individual models
cases <- seq(1,length(data))
modNames_ct <- paste0("i", cases, "ODE") 
indivmodels_ct <- lapply(cases, genMxIndModels_ct, data, modNames_ct)
# Create multiple-subject model
multiSubjODE <- mxModel(name = "MultiODE", indivmodels_ct,
                        mxFitFunctionMultigroup(modNames_ct))
# Fit the model
multiSubjODERun <- mxRun(multiSubjODE)
# Examine group-level results
multiSubjODEsumm <- summary(multiSubjODERun)
multiSubjODEsumm$parameters

# Obtain individual-level Kalman Scores
ks <- lapply(X = multiSubjODERun@submodels,
             FUN = mxKalmanScores, frontend=FALSE )

In this example, the names of the individual models have the format ixODE, where x represents the ID of the participant. For example, the results for individuals 1 and 2 are labeled as i1ODE and i2ODE, respectively. More info can be found in the documentation of the functions mxKalmanScores and mxExpectationStateSpaceContinuousTime, and in Hunter (2018). The function mxKalmanScores return, for each individual, a list with several objects. Of interest here are Rauch-Tung-Striebel (RTS) smoothed latent scores which are the estimated Kalman scores, stored in xSmoothed, and their corresponding error covariance matrices PSmoothed, which provide a measure of the estimation uncertainty.

str(ks$i1ODE)

## List of 10
##  $ xPredicted: num [1:16, 1:2] 9.12 10.3 9.03 6.87 6.1 ...
##  $ PPredicted: num [1:2, 1:2, 1:16] 24.66 3.98 3.98 1.36 23.78 ...
##  $ xUpdated  : num [1:16, 1:2] 9.12 1.79 4.56 4.3 4.08 ...
##  $ PUpdated  : num [1:2, 1:2, 1:16] 24.66 3.98 3.98 1.36 2.03 ...
##  $ xSmoothed : num [1:16, 1:2] -2.65 -1.71 1.99 2.43 2.47 ...
##  $ PSmoothed : num [1:2, 1:2, 1:16] 3.325 0.536 0.536 0.809 1.567 ...
##  $ m2ll      : num [1:16] 0 5.11 0 4.76 0 ...
##  $ L         : num [1:16] 1 0.0776 1 0.0926 1 ...
##  $ yPredicted: num [1:16, 1] NA 10.3 9.03 6.87 6.1 ...
##  $ SPredicted: num [1, 1, 1:16] NA 26 5 3.95 3.68 ...

5. Compute the RTS smoothed Kalman scores

Now we illustrate how to compute the RTS smoothed Kalman scores for the first individual in the data set. To do so, we need to specify the index or name of the individual in the following function. In model we provide the estimated model for the individual of interest. In data we provide the dataset with the observed scores and the additional rows with the time points at which the latent variables should be estimated. Note that our model includes two latent variables: time-varying the level (first column) and the time-invariant additive component (second column). Therefore, estimates of the unobserved ability level are stored in the first column.

id <- 1
(cKsId <- mxKalmanScores(model = multiSubjODERun@submodels[[id]], data = data[[id]])$xSmoothed)

## Computing Kalman scores in frontend R.  This may take a few seconds.

##            [,1]     [,2]
##  [1,]  8.055989 7.308151
##  [2,]  9.210740 7.294178
##  [3,] 15.209991 6.973313
##  [4,] 17.203232 6.740263
##  [5,] 19.597583 6.760903
##  [6,] 21.080069 6.780692
##  [7,] 22.157716 6.780692
##  [8,] 23.432075 6.780692
##  [9,] 24.500958 6.780692
## [10,] 25.242781 6.780692
## [11,] 25.877797 6.780692
## [12,] 25.951360 6.780692
## [13,] 26.381499 6.780692
## [14,] 26.779519 6.780692
## [15,] 26.850196 6.780692
## [16,] 27.145425 6.780692

Now we combine the Kalman scores with the original data set for the individual, where age is the age at the different time points, y_observed stores the observed values (in this case, we have only three observations per individual), and y_KS stores the predicted RTS smoothed Kalman score for each time point.

data_ind <- as.data.frame(cbind(data[[id]]$age, data[[id]]$y, cKsId[-1,1])) # The first row of cKsId must be removed
colnames(data_ind) <- c("age", "y_observed", "y_KS")
(data_ind)

The function mxKalmanScores also provides the error associated to each Kalman score in the object PSmoothed. These errors can be used to compute the confidence interval for each time point.

alpha <- .05 # (1 - confidence level)
qZ <- qnorm(p = 1-(alpha/2))
CI <- matrix(data = NA, ncol = 2, nrow = nrow(data_ind))
errors <- sqrt(mxKalmanScores(model = multiSubjODERun@submodels[[id]],
                              data = data[[id]])$PSmoothed[1,1,2:(nrow(data_ind)+1)])
CI[,1] <- data_ind$y_KS + qZ*errors
CI[,2] <- data_ind$y_KS - qZ*errors
CI <- as.data.frame(CI)
CI <- cbind(CI, data_ind$age)
colnames(CI) <- c("Upp", "Low", "age")
(CI)

6. Graphical representation of the RTS smoothed Kalman scores

First we combine the rows of all the individuals into a single data frame.

data_long <- as.data.frame(do.call(rbind, data))
data_long <- data_long[!is.na(data_long$y),]

Now we present a plot that includes, for the individual of interest:

The RTS smoothed Kalman scores estimates (blue dashed line)
The confidence interval (blue shaded region)
The observed scores (black line and points)
For the rest of the sample, observed trajectories (grey lines).

library(ggplot2)

ggplot() + 
    
    # Group trajectories
    geom_line(data = data_long, aes(x=age+5, y=y, group=id), colour = "dark grey", alpha=0.2)+
    
    # CI for KS
    geom_line(data=CI, aes(x=data_ind$age+5, y=Upp), size = 0.2, color='blue') + 
    geom_line(data=CI, aes(x=data_ind$age+5, y=Low), size = 0.2, color='blue') +
    geom_ribbon(aes(x=CI$age+5, ymax= CI$Upp, ymin= CI$Low),
                fill="blue", alpha=.15) +
    
    # KS estimated latent scores
    geom_line(data=data_ind, aes(x=age+5, y=y_KS), size = .8, color='blue', linetype = "dashed") +
    
    ## Observed scores
    geom_point(data=data_ind, aes(x=age+5, y=y_observed), size = 2) + 
    geom_line(data=data_ind[!is.na(data_ind$y_observed),], aes(x=age+5, y=y_observed, group=1)) + 
    
    scale_x_continuous(breaks=seq(5, 20, 1)) +
    ylim(0,45) + ylab("y scores") + xlab("age") +
    theme_classic(base_size = 12, base_line_size = 0.5)