Primary Submission Category: Instrumental Variables
Two-Stage Machine Learning for Instrumental Variable Regression
Authors: David Bruns-Smith
Presenting Author: David Bruns-Smith*
Instrumental variable (IV) regression with two-stage least squares (2SLS) is a widely used estimator for observational causal inference. Unfortunately, modern non-linear machine learning (ML) models cannot be directly plugged into 2SLS, a fact sometimes called the "forbidden regression" problem. Existing extensions such as nonparametric IV (NPIV) and kernel IV (KIV) have limited performance in practice, and newer approaches based on trees or neural networks require iterative or adversarial training, introducing thorny convergence issues.
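For readers unfamiliar with the baseline, a minimal NumPy sketch of classical 2SLS on synthetic confounded data (the simulation and variable names are illustrative, not from the paper; note it also shows why naive OLS is biased here):

```python
# Classical 2SLS: stage 1 regresses treatment on instrument, stage 2 regresses
# outcome on the stage-1 fitted values. Synthetic data with a hidden confounder U.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
Z = rng.normal(size=n)                  # instrument
U = rng.normal(size=n)                  # unobserved confounder
X = 0.8 * Z + U + rng.normal(size=n)    # treatment, confounded by U
Y = 2.0 * X + U + rng.normal(size=n)    # outcome; true causal effect is 2

# Stage 1: regress X on Z (with intercept), keep fitted values X_hat.
Zc = np.column_stack([np.ones(n), Z])
X_hat = Zc @ np.linalg.lstsq(Zc, X, rcond=None)[0]

# Stage 2: regress Y on X_hat; the slope is the 2SLS estimate.
beta_2sls = np.linalg.lstsq(np.column_stack([np.ones(n), X_hat]), Y, rcond=None)[0][1]

# Naive OLS of Y on X is biased upward by the confounder U.
beta_ols = np.linalg.lstsq(np.column_stack([np.ones(n), X]), Y, rcond=None)[0][1]
print(beta_2sls, beta_ols)
```

The "forbidden regression" issue arises when the linear stage-2 fit on `X_hat` is naively replaced by a non-linear ML model, which breaks the consistency argument behind this procedure.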
We propose two-stage machine learning (2SML), a simple two-stage IV procedure compatible with arbitrary ML models, and implementable in a few lines of code. Our key insight is that the second stage of 2SLS can be reformulated as a least squares problem where the predictions are projected onto a certain feature expansion, obtained via a first-stage regression of the outcome on the instruments. For kernel methods, 2SML is numerically equivalent to NPIV/KIV, but with a markedly different interpretation of the first stage. However, our framework also seamlessly generalizes to support tree ensembles and neural networks.
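In the linear case, the claim that the 2SLS second stage is itself a least-squares problem over projected quantities has a classical counterpart: the 2SLS coefficient solves argmin_b ||P_Z (Y - X b)||^2, where P_Z projects onto the instrument space. The check below verifies only this standard identity; it is not the paper's 2SML construction, and the simulation is our own:

```python
# Numerical check of a classical identity: regressing y on P_Z X (standard 2SLS)
# gives the same coefficients as the fully projected least-squares problem
# of regressing P_Z y on P_Z X, because P_Z is symmetric and idempotent.
import numpy as np

rng = np.random.default_rng(1)
n = 500
Z = np.column_stack([np.ones(n), rng.normal(size=n)])   # instruments (with intercept)
U = rng.normal(size=n)                                   # unobserved confounder
x = Z[:, 1] + U + rng.normal(size=n)                     # treatment
y = 1.5 * x + U + rng.normal(size=n)                     # outcome
X = np.column_stack([np.ones(n), x])

# Orthogonal projection onto the column space of Z.
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)

# Classical 2SLS: regress y on the first-stage fitted values P @ X.
beta_2sls = np.linalg.lstsq(P @ X, y, rcond=None)[0]

# Projected least squares: regress P @ y on P @ X.
beta_proj = np.linalg.lstsq(P @ X, P @ y, rcond=None)[0]

print(np.allclose(beta_2sls, beta_proj))
```

2SML, as described in the abstract, appears to exploit a reformulation in this spirit while replacing the linear pieces with arbitrary ML regressions; the exact construction is given in the paper, not here.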
We establish L2 convergence and evaluate 2SML on a novel, challenging IV benchmark, demonstrating significant performance gains over existing approaches. Additionally, we develop a complementary debiasing procedure that yields valid confidence intervals for linear functional estimands.