Primary Submission Category: Applications in Health and Biology
Causal Inference with Protein Sequences: Biologically Plausible Dimension Reduction Using AlphaFold
Authors: Matthew Laws, Rohit Bhattacharya
Presenting Author: Matthew Laws*
Identifying disease-causing mutations is crucial for designing targeted treatments. However, randomized experiments for this purpose are often prohibitively slow and expensive, and they are typically conducted in model organisms, which limits scalability and generalizability. This work explores how observational causal inference can address this challenge. The primary obstacle is the high dimensionality of the treatment variable: the mutation space of a protein sequence. While high-dimensional confounding is well studied in causal inference, handling high-dimensional treatments remains relatively underexplored.
We introduce a novel, tailored approach that utilizes protein folding as a biologically plausible method for dimensionality reduction. Using AlphaFold3 (Abramson et al., 2024), we fold mutated sequences into their 3D protein structures and align them with healthy counterparts to generate a continuous misalignment score. Then, applying continuous treatment methods, we estimate the causal effects of misaligned protein structures on disease development. Assuming mutations disrupt protein alignment, our method provides bounds on the effects of genetic mutations on disease progression.
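As an illustration of the alignment step, the sketch below computes one possible misalignment score: the root-mean-square deviation (RMSD) between two sets of atomic coordinates after optimal rigid superposition with the Kabsch algorithm. The abstract does not state which score is used, so RMSD is an assumption here, as are the function name `kabsch_rmsd` and the premise that matched coordinates (for example, one C-alpha atom per residue) have already been extracted from the predicted structures.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal rigid alignment.

    Hypothetical helper: P and Q are assumed to hold matched atoms
    (row i of P corresponds to row i of Q).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove translation by centering both structures at the origin.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Cross-covariance matrix and its singular value decomposition.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)

    # Correct for a possible reflection so the result is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Rotate P onto Q (row-vector convention) and measure the residual.
    P_rot = Pc @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))


if __name__ == "__main__":
    # Synthetic coordinates standing in for a healthy and a mutated structure.
    rng = np.random.default_rng(0)
    healthy = rng.normal(size=(50, 3))
    mutated = healthy + rng.normal(scale=0.3, size=(50, 3))
    print(kabsch_rmsd(healthy, mutated))
```

A score of this kind is zero when the two structures differ only by a rotation and translation, and grows as the mutated structure deviates from the healthy one, which is what lets it serve as a scalar continuous treatment.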
We evaluate our methodology using both real and semi-synthetic datasets, with a particular focus on mutations in BRCA1, a gene strongly associated with breast cancer development. We also compare our method with more generic approaches to handling high-dimensional treatments and show that ours is preferable.