Primary Submission Category: Machine Learning and Causal Inference
Optimal allocation of data-collection resources in partially identified causal systems
Authors: Dean Knox, Guilherme Duarte
Presenting Author: Dean Knox*
Complications in applied research often prevent point identification of causal estimands using cheaply available data; at best, researchers can report sharp bounds containing a range of possible values. To make progress, researchers frequently collect more information by (1) re-cleaning existing datasets, (2) gathering secondary datasets, or (3) pursuing entirely new designs. Common examples include manually correcting missingness, recontacting attrited units, validating proxies with ground-truth data, finding new instrumental variables, and conducting follow-up experiments. These auxiliary tasks are costly, forcing a tradeoff against (4) simply drawing larger samples under the original approach.
We define each task’s efficiency as its expected information gain per unit cost. Gain is formalized as the narrowing of the confidence region on the sharp bounds, which captures two kinds of benefit: point-identifying new aspects of the causal system and reducing statistical uncertainty. Leveraging recent advances in automatic bounding (Duarte et al., 2022), we prove that efficiency is computable for essentially any discrete causal system, estimand, and auxiliary data task.
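The efficiency definition above can be illustrated with a minimal sketch. It assumes the expected width of the confidence region after each task is already available as a number; in the method described here that quantity would come from the automatic bounding machinery, which is not reproduced. The names `Task`, `efficiency`, and `rank_tasks`, and all figures, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """A candidate data-collection task (hypothetical container)."""
    name: str
    expected_width_after: float  # assumed: expected confidence-region width after the task
    cost: float                  # total cost of carrying out the task


def efficiency(task: Task, current_width: float) -> float:
    """Expected narrowing of the confidence region per unit cost."""
    if task.cost <= 0:
        raise ValueError("cost must be positive")
    return (current_width - task.expected_width_after) / task.cost


def rank_tasks(tasks: list[Task], current_width: float) -> list[Task]:
    """Order tasks from most to least efficient."""
    return sorted(tasks, key=lambda t: efficiency(t, current_width), reverse=True)


# Illustrative numbers only, not results from the paper.
tasks = [
    Task("recontact attrited units", expected_width_after=0.40, cost=100.0),
    Task("larger original sample", expected_width_after=0.55, cost=10.0),
]
ranked = rank_tasks(tasks, current_width=0.60)
```

In this toy comparison the cheaper task ranks first even though it narrows the region less, because efficiency is measured per unit cost.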
We propose a method for optimal adaptive allocation of data-collection resources. Users first input a causal graph, estimand, and past data. They then enumerate the distributions from which future samples can be drawn, along with each one's fixed and per-sample costs, and state any prior beliefs. Our method automatically derives and sequentially updates the optimal data-collection strategy.
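The sequential workflow above can be sketched as a simple loop. This is a greedy, one-step-ahead heuristic under stated assumptions, not the optimal strategy derived by the method: it uses a toy width model in which each data source contributes `scale / sqrt(n + 1)` to the confidence-region width, and it requires strictly positive per-sample costs. The functions `toy_width` and `greedy_allocate`, the source names, and all numbers are hypothetical.

```python
from math import sqrt


def toy_width(counts, scales):
    """Toy stand-in for confidence-region width (an assumption, not the paper's model)."""
    return sum(s / sqrt(counts[k] + 1) for k, s in scales.items())


def greedy_allocate(scales, fixed_cost, per_sample_cost, budget, batch=10):
    """Repeatedly buy the batch with the largest width reduction per unit cost."""
    counts = {k: 0 for k in scales}
    plan = []
    while True:
        width = toy_width(counts, scales)
        best, best_eff, best_cost = None, 0.0, 0.0
        for k in scales:
            # The fixed cost is paid only the first time a source is used.
            cost = batch * per_sample_cost[k] + (fixed_cost[k] if counts[k] == 0 else 0.0)
            if cost > budget:
                continue
            trial = dict(counts)
            trial[k] += batch
            eff = (width - toy_width(trial, scales)) / cost
            if eff > best_eff:
                best, best_eff, best_cost = k, eff, cost
        if best is None:  # nothing affordable remains
            break
        counts[best] += batch
        budget -= best_cost
        plan.append(best)
    return counts, plan, budget


# Illustrative inputs only.
scales = {"original": 1.0, "validation": 2.0}
fixed_cost = {"original": 0.0, "validation": 50.0}
per_sample_cost = {"original": 1.0, "validation": 2.0}
counts, plan, remaining = greedy_allocate(scales, fixed_cost, per_sample_cost, budget=200.0)
```

Re-evaluating efficiencies after every batch mirrors the sequential updating described above; a full implementation would replace `toy_width` with widths computed from the causal graph, estimand, and accumulated data.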