Primary Submission Category: Policy Learning
Off-policy evaluation using debiased calibration
Authors: Jae-kwang Kim, Yuyang Li, Yumou Qiu
Presenting Author: Jae-kwang Kim*
Off-policy evaluation (OPE) estimates the expected performance of an evaluation policy using data collected under a distinct behavior policy, enabling policy assessment when direct experimentation is infeasible. Standard OPE methods typically assume stable covariate distributions across datasets, a condition often violated in practice due to temporal or environmental data shifts.
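As a point of reference for the setting described above, a minimal self-normalized importance-sampling OPE estimator (the standard baseline in the stationary, no-shift case) can be sketched as follows. The toy behavior policy, reward model, and sample size are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged data: binary actions taken under a behavior policy.
n = 5000
x = rng.normal(size=n)                            # covariates
pi_b = 1.0 / (1.0 + np.exp(-x))                   # behavior policy P(a=1 | x)
a = rng.binomial(1, pi_b)                         # logged actions
r = a * (1.0 + x) + rng.normal(scale=0.1, size=n) # observed rewards

# Evaluation policy: deterministically take action 1, so pi_e(1 | x) = 1.
pi_e = np.ones(n)

# Importance ratios pi_e(a|x) / pi_b(a|x); zero where the logged action
# disagrees with the evaluation policy.
w = np.where(a == 1, pi_e / pi_b, 0.0)

# Self-normalized (Hajek) estimate of the evaluation policy's value.
v_hat = np.sum(w * r) / np.sum(w)
# True value here is E[1 + x] = 1 under the toy reward model.
```

Self-normalization divides by the realized sum of weights rather than n, which stabilizes the estimator when importance ratios are heavy-tailed; the paper's GEC weighting can be viewed as replacing these raw ratios with calibrated weights.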
This paper introduces a generalized entropy calibration (GEC) weighting framework to improve OPE under covariate shift. A self-normalized estimator with GEC weighting is first developed for stationary settings, and its efficiency and double-robustness properties are established. Building on this foundation, a doubly-weighted estimator is proposed to further correct for selection bias induced by covariate shift, using two sets of calibration weights to sequentially adjust for policy and covariate discrepancies across datasets.
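The calibration-weighting idea behind GEC can be illustrated with a generic entropy-balancing (exponential-tilting) sketch, a well-known special case of entropy-based calibration: weights on the source sample are chosen so that weighted covariate moments match those of the shifted target sample. The moment functions, shift magnitude, and plain gradient-descent solver here are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def entropy_balance(g, target, iters=200, lr=0.5):
    """Exponential-tilting calibration weights (entropy-balancing sketch).

    Finds w_i proportional to exp(lam @ g_i) on the source sample so that
    the weighted moments of g match `target` (e.g. target-sample means),
    by gradient descent on the dual objective log-sum-exp(g @ lam) - lam @ target.
    """
    n, p = g.shape
    lam = np.zeros(p)
    for _ in range(iters):
        u = np.exp(g @ lam)
        w = u / u.sum()          # normalized tilted weights
        grad = w @ g - target    # current moment imbalance
        lam -= lr * grad         # dual gradient step
    return w

rng = np.random.default_rng(1)
# Hypothetical covariate shift: source ~ N(0, 1), target ~ N(0.5, 1).
x_src = rng.normal(0.0, 1.0, size=(2000, 1))
x_tgt = rng.normal(0.5, 1.0, size=(2000, 1))

g = np.hstack([x_src, x_src**2])                      # calibrate two moments
target = np.array([x_tgt.mean(), (x_tgt**2).mean()])  # target moments
w = entropy_balance(g, target)                        # calibrated weights
```

After solving, the reweighted source sample reproduces the target's first two covariate moments; composing such covariate-calibration weights with policy-calibration weights mirrors, at a high level, the sequential adjustment described above.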
The resulting multi-calibrated framework accommodates multiple data-transfer mechanisms and attains multiple robustness. Extending beyond conventional doubly robust estimators, the proposed method achieves table-wise quadratic robustness, ensuring consistency when any one of four model combinations is correctly specified. We also provide an alternative proof of the semiparametric efficiency bound for OPE under covariate shift and show that the proposed estimator attains this bound when all models are correctly specified.
Theoretical results and numerical studies are provided to validate the proposed estimators.
