Primary Submission Category: Sensitivity Analysis
Using large language models for sensitivity analysis in causal inference: case studies on Cornfield inequalities and E-values
Authors: Qingyan Xiang, Jiahao Zhang, Bojian Feng
Presenting Author: Qingyan Xiang*
Unmeasured confounding is a central challenge in causal inference from observational studies. Sensitivity analysis methods such as Cornfield inequalities and E-values assess robustness to unmeasured confounding, but they are often difficult for interdisciplinary researchers to compute and interpret. Recent advances in large language models (LLMs) offer accessible tools to support sensitivity analyses, yet their reliability for this task has not been evaluated. We assess four LLMs (ChatGPT, Claude, DeepSeek, and Gemini) using case studies from smoking, back pain, Alzheimer’s disease, and environmental health research. Performance is evaluated on three criteria: (1) accuracy of E-value calculation, (2) qualitative interpretation of robustness to unmeasured confounding, and (3) identification of plausible unmeasured confounders. ChatGPT, Claude, and Gemini accurately reproduce the reported E-values, whereas DeepSeek shows small biases. All models generate conclusions consistent with the effect sizes and identify biologically plausible unmeasured confounders. To our knowledge, this is the first work to use case studies to evaluate the performance of LLMs on sensitivity analysis. The results suggest that structured prompting enables LLMs to support sensitivity analysis, which can in turn help researchers improve study design and decision-making in observational studies.
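
For readers unfamiliar with the quantity being checked, the following is a minimal sketch of the standard E-value formula for a risk ratio (VanderWeele and Ding, 2017), E = RR + sqrt(RR × (RR − 1)), with RR inverted first when it is below 1. The function names `e_value` and `e_value_ci` are illustrative and are not taken from the study; the abstract does not state how the reference E-values were computed, so this is an assumption about the usual approach rather than the authors' code.

```python
import math


def e_value(rr: float) -> float:
    """E-value for a risk-ratio point estimate (illustrative helper)."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    # For protective associations (RR < 1), work with the reciprocal
    # so that the ratio is on the >= 1 scale.
    if rr < 1:
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))


def e_value_ci(lower: float, upper: float) -> float:
    """E-value for the confidence limit closest to the null (illustrative helper).

    Returns 1.0 when the interval already includes the null value of 1.
    """
    if lower <= 1 <= upper:
        return 1.0
    # Pick whichever limit lies nearer to 1.
    limit = lower if lower > 1 else upper
    return e_value(limit)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to exercise the functions.
    print(round(e_value(2.0), 2))
    print(round(e_value_ci(1.5, 3.0), 2))
```

An E-value of this kind is read as the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need to have with both exposure and outcome to fully explain away the observed association.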
