Primary Submission Category: Machine Learning and Causal Inference
Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization
Authors: Sanath Kumar Krishnamurthy, Ruohan Zhan, Susan Athey, Emma Brunskill
Presenting Author: Sanath Kumar Krishnamurthy*
We design adaptive experimentation (contextual bandit) algorithms to learn a targeted treatment assignment policy. Our design focuses on two important objectives: optimizing outcomes for the subjects in the experiment (cumulative regret minimization) and gathering data that will be useful for learning a policy with high expected reward (simple regret minimization). Recently, [Li et al., 2022] formally showed that no algorithm can simultaneously achieve minimax-optimal cumulative regret guarantees and instance-optimal simple regret guarantees. They also proposed a novel algorithm that is the first to achieve instance-optimal simple regret guarantees, but it is impractical to use due to computational considerations: its exploration policy relies on explicitly constructing a distribution over a potentially large set of policies. To address these issues, we propose a novel family of contextual bandit algorithms that are simple (easy to implement), flexible (compatible with any model and policy class), statistically efficient, and computationally efficient. Our family of algorithms comes with a parameter that allows flexible trading off between guarantees on the two competing objectives, simple and cumulative regret minimization. Technically, our algorithm is enabled by the construction, at every context, of a set of arms that, with high probability over the context distribution, contains the arm recommended by the optimal policy.
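To make the final idea concrete, the following is a minimal sketch of one plausible instantiation, not the authors' exact algorithm: given hypothetical per-context reward estimates (`mu_hat`) and confidence widths (`width`) from some fitted model, it forms a candidate arm set by keeping every arm whose optimistic value reaches the best pessimistic value, then uses an assumed `tradeoff` parameter to interpolate between uniform exploration within that set (favoring simple regret) and greedy play within it (favoring cumulative regret). All names and the specific set construction here are illustrative assumptions.

```python
import numpy as np

def plausible_arm_set(mu_hat, width):
    """Arms whose upper confidence bound reaches the best lower bound.

    mu_hat : estimated mean reward per arm for the current context
    width  : confidence width per arm
    (Both are hypothetical outputs of some fitted reward model.)
    """
    lower = mu_hat - width
    upper = mu_hat + width
    return np.flatnonzero(upper >= lower.max())

def select_arm(mu_hat, width, tradeoff, rng):
    """Choose an arm for one context.

    tradeoff in [0, 1] is an assumed knob: the probability of exploring
    uniformly within the plausible set (simple-regret-oriented) versus
    playing the greedy arm within that set (cumulative-regret-oriented).
    """
    candidates = plausible_arm_set(mu_hat, width)
    if rng.random() < tradeoff:
        return rng.choice(candidates)                      # uniform over the set
    return candidates[np.argmax(mu_hat[candidates])]       # greedy within the set

# Illustrative usage: five arms, made-up estimates for a single context.
rng = np.random.default_rng(0)
mu_hat = np.array([0.10, 0.55, 0.52, 0.30, 0.48])
width = np.full(5, 0.05)
arm = select_arm(mu_hat, width, tradeoff=0.3, rng=rng)
```

Under this sketch, `tradeoff = 0` plays greedily among the plausible arms while `tradeoff = 1` explores them uniformly, mirroring in spirit the single parameter the abstract describes for trading off the two regret objectives.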