Primary Submission Category: Machine Learning and Causal Inference
Precise Estimates from Safe Data Integration in a Paired, Cluster-Randomized Field Trial
Authors: Adam Sales, Johann Gagnon-Bartsch, Mingyu Feng, Kevin Huang
Presenting Author: Adam Sales*
Randomized controlled trials (RCTs) in social or biomedical science often rely on covariate and outcome data drawn from a larger administrative database. For instance, a recent paired, clustered RCT of an online homework platform used covariates—demographics and prior achievement measures—and outcomes—standardized test scores—from the state’s educational database. To avoid confounding, typical analyses use only data from randomized subjects, discarding the remainder of the database.
In this study, we show how to substantially improve the precision of effect estimates by using the entire database, including both randomized and non-randomized subjects, without introducing any new confounding and while relying largely on standard modeling techniques. We demonstrate the method by re-analyzing data from the online homework study. First, we used observational data to train a random forest model predicting student test scores from student- and school-level covariates, and used it to generate predicted outcomes for students in the randomized schools. Then, we modified the hierarchical linear model from the original analysis by adding the predicted outcomes as an additional regressor. This reduced the standard error of the effect estimate by 15%, the improvement expected from increasing the sample size by over a third. Because it requires few additional modeling skills and no additional assumptions, the method can be easily adapted wherever auxiliary observational data are available.
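The workflow described above can be illustrated with a minimal sketch on simulated data. It is not the analysis from the study. It rests on several simplifying assumptions: the data are synthetic, an ordinary least-squares fit stands in for the random forest, a plain regression stands in for the hierarchical linear model, and the pairing and clustering of the design are ignored. All names (`simulate`, `ols`, `equivalent_sample_size_factor`, `tau`) are hypothetical. The final function reproduces the arithmetic linking a 15% reduction in standard error to the equivalent gain in sample size, assuming the standard error scales as one over the square root of the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Synthetic covariates and untreated outcomes (illustrative only)."""
    X = rng.normal(size=(n, 3))
    y0 = X @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=0.5, size=n)
    return X, y0

def ols(design, y):
    """Least-squares coefficients and classical standard errors."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = design.shape[0] - design.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta, np.sqrt(np.diag(cov))

def equivalent_sample_size_factor(se_reduction):
    """Sample-size multiplier giving the same SE, assuming SE ~ 1/sqrt(n)."""
    return 1.0 / (1.0 - se_reduction) ** 2

# Step 1: fit an outcome model using ONLY the non-randomized subjects.
# (Assumption: OLS is used here in place of the random forest.)
X_rem, y_rem = simulate(5000)
coef_rem, _ = ols(np.column_stack([np.ones(len(y_rem)), X_rem]), y_rem)

# Step 2: simulate a small trial and predict outcomes for its subjects.
n = 200
tau = 0.3                                        # hypothetical true effect
X_rct, y0 = simulate(n)
z = rng.permutation(np.repeat([0, 1], n // 2))   # randomized assignment
y = y0 + tau * z
y_hat = np.column_stack([np.ones(n), X_rct]) @ coef_rem

# Step 3: estimate the effect without and with the prediction as a regressor.
ones = np.ones(n)
beta_unadj, se_unadj = ols(np.column_stack([ones, z]), y)
beta_adj, se_adj = ols(np.column_stack([ones, z, y_hat]), y)

se_ratio = se_adj[1] / se_unadj[1]
print("unadjusted estimate, SE:", beta_unadj[1], se_unadj[1])
print("adjusted estimate, SE:  ", beta_adj[1], se_adj[1])
print("SE ratio (adjusted / unadjusted):", se_ratio)
print("sample-size factor for a 15% SE reduction:",
      equivalent_sample_size_factor(0.15))
```

Because the prediction model never sees outcomes from randomized subjects, `y_hat` behaves like any other pre-treatment covariate in this sketch. The precision gain in the simulation is far larger than in the study because the synthetic outcomes are, by construction, highly predictable.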
