Primary Submission Category: Machine Learning and Causal Inference
Replay and Ground: Causal Offline Evaluation of Language Models
Authors: Jikai Jin, Vasilis Syrgkanis
Presenting Author: Jikai Jin*
Evaluating language models offline is challenging when deployment logs are confounded: routing decisions depend on latent user or task factors that also influence quality outcomes, making naive observational comparisons unreliable. We address this problem in a setting where three data sources are available: a large confounded observational log (OBS), a small randomized experiment (EXP), and an offline replay simulator. The evaluation target is each language model's causal value, defined as the expected reward under a policy that routes all traffic to that model.
We make two main contributions. First, we show that causal value is nonparametrically identified by combining replay-generated mediators with a reward surface fit on randomized EXP data, without requiring the observational log to be unconfounded. Second, building on this identification result, we develop hybrid reward-model estimators that exploit OBS at scale, either by learning a deconfounded representation from OBS auxiliary labels or by grounding an OBS-trained reward model with a small EXP-estimated bias correction, and we pair these with both a direct plug-in and a doubly robust value estimator. Empirically, no single estimator uniformly dominates: hybrid methods exhibit systematic, predictable performance crossovers as reward nonlinearity, confounding strength, and EXP budget vary.
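To make the plug-in and doubly robust value estimators concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. Everything in it is an assumption for illustration: the `replay` function standing in for the replay simulator, the per-model linear reward surface, the uniform EXP randomization over three models, the sample split used for the correction term, and the premise that the context pool and EXP share the same context distribution. The hybrid OBS-based reward models described above are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ARMS = 3  # hypothetical number of candidate language models


def replay(x, a):
    """Hypothetical replay simulator: mediator produced by model `a` on context `x`."""
    return np.tanh(x + 0.5 * a)


def fit_reward_surface(a, m, r):
    """Per-model least squares of reward on [1, mediator]; returns an (N_ARMS, 2) array."""
    coefs = np.zeros((N_ARMS, 2))
    for k in range(N_ARMS):
        mask = a == k
        design = np.column_stack([np.ones(mask.sum()), m[mask]])
        coefs[k] = np.linalg.lstsq(design, r[mask], rcond=None)[0]
    return coefs


def predict(coefs, k, m):
    """Fitted reward surface evaluated for model `k` at mediator values `m`."""
    return coefs[k, 0] + coefs[k, 1] * m


# Synthetic EXP data: the model is assigned uniformly at random (assumed design).
n_exp = 2000
x_exp = rng.normal(size=n_exp)
a_exp = rng.integers(0, N_ARMS, size=n_exp)
m_exp = replay(x_exp, a_exp)
r_exp = 1.0 + 2.0 * m_exp + 0.3 * a_exp + rng.normal(scale=0.1, size=n_exp)

# Large pool of contexts (e.g. drawn from OBS) on which every model can be replayed.
x_pool = rng.normal(size=50_000)

# Fit the reward surface on one half of EXP; keep the other half for the correction.
fit_idx = np.arange(n_exp) < n_exp // 2
coefs = fit_reward_surface(a_exp[fit_idx], m_exp[fit_idx], r_exp[fit_idx])


def plug_in_value(k):
    """Direct plug-in: average fitted reward over replayed mediators for model `k`."""
    return float(np.mean(predict(coefs, k, replay(x_pool, k))))


def doubly_robust_value(k):
    """Plug-in plus an inverse-propensity-weighted residual term on held-out EXP data."""
    held = ~fit_idx
    resid = r_exp[held] - predict(coefs, k, m_exp[held])
    weights = (a_exp[held] == k) / (1.0 / N_ARMS)  # known randomization probability
    return plug_in_value(k) + float(np.mean(weights * resid))


v_plug = [plug_in_value(k) for k in range(N_ARMS)]
v_dr = [doubly_robust_value(k) for k in range(N_ARMS)]
print("plug-in:      ", np.round(v_plug, 3))
print("doubly robust:", np.round(v_dr, 3))
```

In this toy setup the reward is linear in the mediator, so the fitted surface is well specified and the two estimates should nearly coincide. The residual correction matters when the reward surface is misspecified, and the sample split is there because in-sample least-squares residuals with an intercept average to zero within each model, which would make the correction vanish.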
