
Primary Submission Category: Machine Learning and Causal Inference

Proximal Causal Inference With Text Data

Authors: Jacob M. Chen, Rohit Bhattacharya, Katherine A. Keith

Presenting Author: Jacob M. Chen

Recently, researchers have proposed causal inference methods that mitigate confounding bias by using unstructured text data as proxies for confounding variables that are only partially or imperfectly measured. These approaches assume analysts have supervised labels of the confounders given text for at least a subset of instances, a requirement that data privacy or labeling cost can make infeasible. In this work, we address settings in which an important confounding variable is completely unobserved. We propose a new causal inference method that splits each instance's pre-treatment text, infers two proxies by applying two different zero-shot models to the separate splits, and plugs these proxies into the proximal g-formula. We prove that this method of text-based proxy generation satisfies the identification conditions required by the proximal g-formula, while other seemingly reasonable proposals do not. We evaluate our method in fully synthetic and semi-synthetic settings. In the semi-synthetic setting, we use real-world clinical notes from the MIMIC-III dataset and Flan-T5, an instruction-tuned large language model, to infer proxies in a zero-shot manner. We find that our procedure yields causal estimates with low bias, whereas alternative procedures do not. To our knowledge, this combination of proximal causal inference and zero-shot classifiers is novel, and it expands the set of text-specific causal methods available to practitioners.
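To make the pipeline concrete, the Python sketch below illustrates the procedure under simplifying assumptions: binary treatment A, binary outcome Y, and binary proxies W and Z already inferred by two different zero-shot classifiers applied to the two disjoint halves of each instance's pre-treatment text. It uses the standard discrete (matrix) solution of the proximal g-formula rather than the authors' released code, and the zero-shot labeling step itself (e.g., prompting a model like Flan-T5 on each text half) is left abstract.

import numpy as np

def split_text(text):
    # Split one document into two halves so that each proxy is
    # inferred from disjoint text, as the abstract describes.
    words = text.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])

def proximal_g_formula_binary(A, Y, W, Z):
    # Discrete proximal g-formula for binary A, Y, W, Z
    # (a sketch of the plug-in step, not the authors' estimator).
    # For each treatment level a, solve for the outcome bridge
    # function h(w, a) in
    #     P(Y=1 | Z=z, A=a) = sum_w h(w, a) * P(W=w | Z=z, A=a),
    # then average h over the marginal distribution of W:
    #     E[Y(a)] = sum_w h(w, a) * P(W=w).
    mean_potential = {}
    for a in (0, 1):
        on_a = A == a
        # b[z] = P(Y=1 | Z=z, A=a)
        b = np.array([Y[on_a & (Z == z)].mean() for z in (0, 1)])
        # M[z, w] = P(W=w | Z=z, A=a)
        M = np.array([[np.mean(W[on_a & (Z == z)] == w) for w in (0, 1)]
                      for z in (0, 1)])
        # Invertibility of M plays the role of the completeness
        # condition needed for proximal identification.
        h = np.linalg.solve(M, b)
        mean_potential[a] = sum(h[w] * np.mean(W == w) for w in (0, 1))
    # Average causal effect E[Y(1)] - E[Y(0)]
    return mean_potential[1] - mean_potential[0]

In this sketch, the two proxies must come from different zero-shot models applied to the separate text splits (the arrangement the abstract says is proven to satisfy the proximal identification conditions); arrays A, Y, W, and Z are assumed to be NumPy integer vectors of equal length, one entry per instance.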