Primary Submission Category: Applications in Health and Biology
Handling Informative Missing Data in Electronic Health Records: Imputation with the Semiparametric Gaussian Copula Model
Authors: Yongjun Lee, Anna Guo, Daniel Scharfstein, Razieh Nabi
Presenting Author: Yongjun Lee*
Emulating a clinical trial based on electronic health records requires careful adjustment for pre-treatment confounders. However, these variables are often subject to informative missingness, and assumptions about the missing data are untestable.
Missing data directed acyclic graphs (mDAGs) provide a framework to represent these assumptions and can be used to evaluate whether the conditional distribution of the missing data given the observed data is identified. If identified, this conditional distribution can be used to impute the missing data.
In this paper, we propose a novel imputation framework based on the Semiparametric Gaussian Copula Model (SGCM). The model is built on a latent multivariate Gaussian structure that is linked to the variables of interest through monotonically increasing transformations for continuous variables and a set of cut points for ordered categorical variables.
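As a rough illustration of the latent structure described above, the following minimal sketch draws data from a toy two-variable Gaussian copula model. It is not the authors' implementation: the function name `sample_sgcm`, the correlation parameter `rho`, the choice of `exp` as the increasing transformation, and the cut points are all assumptions made for the example. In the semiparametric model the transformations are left unspecified rather than fixed in advance.

```python
import bisect
import math
import random


def sample_sgcm(n, rho, cuts=(-0.5, 0.8), seed=0):
    """Draw n rows from a toy two-variable Gaussian copula model.

    Hypothetical helper for illustration only; rho is the latent
    correlation and cuts are assumed cut points on the latent scale.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Latent layer: bivariate standard normal with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Continuous variable: increasing transformation of z1
        # (exp is an assumed stand-in for the unspecified transformation).
        x_cont = math.exp(z1)
        # Ordered categorical variable: cut points partition the z2 axis
        # into levels 0, 1, ..., len(cuts).
        x_ord = bisect.bisect_right(cuts, z2)
        rows.append((z1, z2, x_cont, x_ord))
    return rows


rows = sample_sgcm(200, rho=0.6)
```

Because the observed variables depend on the latent Gaussian only through order-preserving maps, the ranking of each observed variable matches the ranking of its latent counterpart, which is what allows the dependence structure to be described on the latent scale.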
We illustrate our approach by imputing missing confounders in a target trial emulation designed to evaluate treatments for patients with non-small cell lung cancer, using data from the Flatiron Health electronic health records database. We also conduct a simulation study to investigate the performance of our approach.
