Primary Submission Category: Design-Based Causal Inference

Improving efficiency in double-sampling strategies for informative missing data in treatment effect estimation

Authors: Shuo (Mila) Sun, Alex Levis, Rajarshi Mukherjee, Rui Wang, Sebastien Haneuse

Presenting Author: Shuo (Mila) Sun*

Missing or incomplete data are a widespread challenge in observational studies, especially when the data at hand were not originally collected for research purposes. Such data may be particularly susceptible to being missing-not-at-random (MNAR). To mitigate bias due to MNAR data, we propose a double-sampling strategy, through which the otherwise missing data are ascertained on a sub-sample of study units. We generalize the nonparametric estimation results to the case where the data are initially subject to arbitrary coarsening, and develop nonparametric efficient estimators of any smooth full-data functional of interest. Since the double-sampling strategy can be planned from the beginning, it provides an opportunity to allocate resources effectively within a fixed budget. Motivated by this, we derive the optimal sampling rule that minimizes the semiparametric efficiency bound, subject to a budget constraint. The optimal double-sampling rules generally depend on the unknown full-data distribution. To address this, we conduct a pilot study to estimate the unknown quantities and investigate asymptotic properties, using average treatment effects (ATEs) as an example, considering both fixed pilot sample sizes and cases where the sample size approaches zero at a specific rate. Two simulation studies, assuming Hölder-smooth functions and sparse functions, respectively, verify the efficiency of the proposed optimal sampling rules in finite samples.