Primary Submission Category: Machine Learning and Causal Inference
Text-as-Treatment Causal Estimation with Sparse Autoencoders
Authors: Amar Venugopal, Amir Feder, Omri Feldman, Jann Spiess
Presenting Author: Amar Venugopal*
Large language models (LLMs) have rich internal representations of language, and studying these representations can enable the design of controlled experiments with latent language treatments. Recent work on sparse autoencoders (SAEs) makes it possible to intervene on specific concepts embedded in text, generating new texts that vary in the intensity of those concepts. However, these methods are highly sensitive to the choice of concepts and hyperparameters. In this paper, we present a novel hypothesis generation methodology that discovers concepts of interest in labeled textual data and identifies the optimal SAE features and layers for such interventions. Using semi-synthetic datasets, we show that the downstream experiments used to validate these hypotheses present a unique challenge for causal inference with latent treatments. Specifically, we demonstrate that estimation of the conditional average treatment effect (CATE) suffers from substantial bias due to inherent positivity violations and treatment leakage. We characterize the estimation bias induced in this setting and propose a solution based on covariate residualization. Our results show that this approach effectively mitigates estimation error, providing a robust foundation for causal effect estimation in text-as-treatment settings.
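To make the idea of covariate residualization concrete, the following is a minimal sketch of the generic partialling-out technique on simulated data. It is not the estimator proposed in this work: the linear data-generating process, the function names (`residualize`, `simulate`, `slope`), and all coefficients are assumptions chosen purely for illustration. The sketch shows that regressing the outcome on a treatment intensity that is correlated with covariates gives a biased slope, while first removing the covariate-explained part of both treatment and outcome recovers the simulated effect.

```python
import numpy as np

def residualize(target, covariates):
    """Return the part of `target` not linearly explained by `covariates`."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

def simulate(n=5000, tau=2.0, seed=0):
    """Hypothetical linear setting: covariates drive both treatment and outcome."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 3))                            # covariates
    t = x @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)  # treatment intensity
    y = tau * t + x @ np.array([3.0, 1.0, -2.0]) + rng.normal(size=n)
    return x, t, y

def slope(t, y):
    """Least-squares slope of y on t through the origin (inputs are centered)."""
    return float(t @ y / (t @ t))

x, t, y = simulate()

# Naive estimate: ignores the covariates, so it absorbs their effect on y.
naive = slope(t - t.mean(), y - y.mean())

# Residualized estimate: partial the covariates out of both t and y first.
adjusted = slope(residualize(t, x), residualize(y, x))

print(f"true effect: 2.0, naive: {naive:.2f}, residualized: {adjusted:.2f}")
```

In this simulated setting the residualized slope coincides with the treatment coefficient from a joint regression of the outcome on treatment and covariates (the Frisch-Waugh-Lovell result). Handling positivity violations and treatment leakage with latent text treatments would require more than this linear sketch captures.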
