Primary Submission Category: Policy Learning

Automatic Double Reinforcement Learning in Semiparametric Markov Decision Processes with Applications to Long-Term Causal Inference

Authors: Lars van der Laan, David Hubbard, Allen Tran, Nathan Kallus, Aurelien Bibaut

Presenting Author: Lars van der Laan*

Double reinforcement learning (DRL) (Kallus and Uehara, 2020) enables statistically efficient inference on the value of one policy in a nonparametric Markov Decision Process (MDP) given trajectories generated by another policy, but this necessarily requires stringent overlap between the state distributions, which is often violated in practice. To relax this requirement and extend DRL, we study efficient inference on linear functionals of the Q-function (of which policy value is a special case) in infinite-horizon, time-invariant MDPs under semiparametric restrictions on the Q-function. These restrictions can relax the overlap requirement and lower the efficiency bound, yielding more precise estimates. As an important example, we study evaluation of long-term value under domain adaptation, given a few short trajectories from the new domain and restrictions on the difference between the domains, which can be used for long-term causal inference. Our method combines flexible estimates of the Q-function and of the Riesz representer of the functional of interest (e.g., the stationary state density ratio for policy value) and is automatic in that we do not need to know the form of the latter, only the functional we care about. To address potential model misspecification bias, we extend the adaptive debiased machine learning (ADML) framework of van der Laan et al. (2023) to construct nonparametrically valid and superefficient estimators that are adaptive to the functional form of the Q-function.
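
For orientation, a rough sketch of the style of debiased estimator this line of work builds on; the notation below is an illustrative assumption in the spirit of Kallus and Uehara (2020), not taken from the abstract itself. A flexible estimate \hat{Q} of the Q-function is combined with an estimate \hat{\alpha} of the Riesz representer of the target functional m:

\[
\hat{\theta} \;=\; \frac{1}{n}\sum_{i=1}^{n}
\Big\{ m\big(O_i; \hat{Q}\big)
\;+\; \hat{\alpha}(S_i, A_i)\,
\big( R_i + \gamma\, \hat{Q}(S_i', \pi) - \hat{Q}(S_i, A_i) \big) \Big\},
\]

where O_i = (S_i, A_i, R_i, S_i') is a transition, \hat{Q}(s, \pi) = \sum_a \pi(a \mid s)\, \hat{Q}(s, a), and, in the special case of policy value, m(O; Q) = (1 - \gamma)\, Q(S_0, \pi) with \hat{\alpha} reducing to the stationary state-action density ratio mentioned in the abstract.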