Primary Submission Category: Machine Learning and Causal Inference

Causally Sufficient Dimension Reduction for Text Data

Authors: Zach Wood-Doughty, Kayla Schroeder, Razieh Nabi

Presenting Author: Zach Wood-Doughty*

Statistical modeling of text is complicated by the data’s high dimensionality. Topic models are a central tool for representing text documents in low-dimensional space, and have been used in thousands of analyses across domains including literature and healthcare. While probabilistic topic models allow for human interpretation of large text corpora, the representations learned by these methods are inherently associational. Topic models reveal underlying structure within a corpus, but that structure does not necessarily reflect the causal structure connecting the text to other variables. When incorporating topic models into larger analyses, it is common for researchers to manually interpret the topics and then fit models that link those topics to other variables of interest. However, if our goal is to understand causal relations between the text and other variables, we need to be explicit about the assumptions required to produce unbiased estimates of causal effects. This work applies causally sufficient dimension reduction to text data, enabling causal analyses of textual treatments.
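
To make the two-stage pipeline the abstract critiques concrete, below is a minimal sketch, not the authors' method: an associational topic model is fit first, and its topic proportions are then linked to an outcome in a separate regression. The library choices (scikit-learn's LatentDirichletAllocation and LinearRegression), the toy documents, and the outcome values are all illustrative assumptions, not from the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LinearRegression

# Hypothetical corpus and per-document outcome, for illustration only.
docs = [
    "patient reports mild pain after treatment",
    "novel therapy shows strong response in trial",
    "no adverse events observed during follow up",
    "symptoms worsened despite standard care",
]
outcome = np.array([0.2, 0.9, 0.7, 0.1])

# Stage 1: associational dimension reduction via LDA topic proportions.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_props = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

# Stage 2: link the low-dimensional representation to the outcome.
# Nothing in this pipeline guarantees the topics capture the causally
# relevant structure -- the gap a causally sufficient reduction targets.
model = LinearRegression().fit(topic_props, outcome)
print(model.coef_)
```

The point of the sketch is that the dimension reduction in Stage 1 is chosen without reference to the causal question asked in Stage 2, which is precisely why the learned representation may fail to support unbiased effect estimates.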