Primary Submission Category: Machine Learning and Causal Inference
Comparing Causal Forest and BART for Estimating Treatment Effect Heterogeneity with Clustered Data: An Application to LLM Evaluation
Authors: JIA QUAN, Walter Leite
Presenting Author: JIA QUAN*
Causal forests (Wager & Athey, 2018) and causal inference based on Bayesian Additive Regression Trees (BART; Hahn, Murray, & Carvalho, 2020) are widely used to estimate conditional average treatment effects (CATEs), yet they operationalize heterogeneity differently. As large language models (LLMs) are increasingly deployed in high-stakes domains, rigorous causal evaluation of model choices becomes essential. We apply both methods to data from an educational application that generates decodable reading passages, estimating the heterogeneous effects of a fine-tuned versus an off-the-shelf LLM on story quality across prompt types, grade levels, and linguistic benchmarks. This application involves data conditions frequently encountered in AI evaluation research: clustered observations, high-dimensional moderators, skewed outcomes, and complex interactions. We identify three implementation issues that critically affect BART's performance: outcome-scale sensitivity, cluster parameterization, and estimand specification, in which marginalizing over random effects attenuates individual CATEs. After resolving these issues, both methods recovered nearly identical average treatment effects (ATEs; causal forest: g = 0.213; BART: g = 0.210), with strong agreement in individual treatment effect (ITE) rankings and moderator importance.
Our findings demonstrate how modern causal machine learning methods can provide principled estimates of heterogeneous treatment effects for generative AI systems, offering a framework for cross-disciplinary evaluation of model deployment decisions.
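The effect sizes above are reported on the Hedges' g scale. For readers outside the causal inference literature, a minimal sketch of the standard Hedges' g computation follows; the means, standard deviations, and sample sizes used in the example call are illustrative placeholders, not values from this study.

```python
import math

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Hedges' g: Cohen's d with a small-sample bias correction."""
    # Pooled standard deviation across treatment and control groups
    sp = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    d = (mean_t - mean_c) / sp
    # Small-sample correction factor J = 1 - 3 / (4*df - 1), df = n_t + n_c - 2
    j = 1 - 3 / (4 * (n_t + n_c) - 9)
    return d * j

# Illustrative values only (not the study's data):
g = hedges_g(mean_t=0.5, mean_c=0.0, sd_t=1.0, sd_c=1.0, n_t=100, n_c=100)
```

With equal group sizes of 100 and unit variances, a raw mean difference of 0.5 yields g ≈ 0.498, since the correction factor shrinks Cohen's d slightly toward zero.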
