Primary Submission Category: Applications in Health and Biology
Causal Inference for Noisy Sequence Count Data
Authors: Tinghua Chen, Justin Silverman
Presenting Author: Tinghua Chen*
Modern sequencing technologies, including microbiome profiling, single-cell RNA sequencing, and bulk gene expression assays, generate high-dimensional count data that serve as noisy, indirect measurements of latent biological quantities such as microbial abundance or gene expression. Analyses of these data are challenged by compositional constraints, sparsity, sampling variability, and pervasive confounding. Yet existing causal inference methods typically define potential outcomes on the observed count scale, even though treatments act on latent biological states rather than on stochastic, technology-dependent counts. The resulting causal effects lack a clear biological interpretation. We propose a causal inference framework that explicitly targets the latent potential outcomes underlying sequencing data by jointly modeling latent biological quantities and the measurement processes that generate observed counts. This approach allows causal estimands to be defined on the latent biological scale, characterizes identifiability and uncertainty under realistic sequencing models, and supports scalable estimation of latent average treatment effects using a combination of parametric and flexible nonparametric components. We evaluate the proposed methods through simulations and applications to microbiome, single-cell, and gene expression datasets. Together, these contributions provide a principled foundation for biologically interpretable causal inference from sequencing data.
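
To make the distinction between count-scale and latent-scale effects concrete, the following toy simulation may help. It is only an illustrative sketch and not the proposed framework: it assumes a randomized binary treatment, three taxa with log-normal latent abundances, and a fixed-depth multinomial measurement model, and all names (`simulate`, `diff_in_means`, `depth`) are hypothetical. The treatment changes only the first taxon on the latent scale, yet a difference in means on the observed counts suggests an effect on the others as well, because the counts are compositional.

```python
import numpy as np

def simulate(n=2000, depth=5000, seed=0):
    """Toy data: latent log-abundances plus fixed-depth multinomial counts (assumed model)."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)  # randomized binary treatment
    base = np.array([5.0, 5.0, 5.0])            # baseline latent log-abundance per taxon
    effect = np.array([np.log(4.0), 0.0, 0.0])  # treatment acts on taxon 0 only
    log_w = base + np.outer(z, effect) + rng.normal(0.0, 0.1, size=(n, 3))
    w = np.exp(log_w)
    p = w / w.sum(axis=1, keepdims=True)        # sequencing only sees proportions
    y = np.vstack([rng.multinomial(depth, pi) for pi in p])
    return z, log_w, y

def diff_in_means(x, z):
    """Per-taxon difference in means between treated and control units."""
    return x[z == 1].mean(axis=0) - x[z == 0].mean(axis=0)

z, log_w, y = simulate()
latent_ate = diff_in_means(log_w, z)           # effect on the latent log scale
count_ate = diff_in_means(y.astype(float), z)  # effect on the observed count scale

print("latent-scale effect:", np.round(latent_ate, 3))
print("count-scale effect: ", np.round(count_ate, 1))
```

In this sketch the latent-scale contrast for the second and third taxa is close to zero, while the count-scale contrast for the same taxa is strongly negative, an artifact of fixed sequencing depth rather than a biological effect. In real data the latent abundances are unobserved, which is why a joint model of the latent quantities and the measurement process is needed.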
