Primary Submission Category: Machine Learning and Causal Inference
Random Forest Counterfactual Similarity for Causal Inference
Authors: Bernardo Modenesi, Sima Najafzadehkhoei
Presenting Author: Bernardo Modenesi*
Causal inference fundamentally relies on the ability to construct credible counterfactuals, i.e., to identify which units are “similar enough” in pre-treatment covariates to support causal comparisons. In practice, similarity is often imposed using Euclidean distances on standardized covariates, linear regression adjustment, or one-dimensional balancing summaries such as the propensity score. These choices can be brittle in tabular data with nonlinear interactions, heterogeneous feature relevance, mixed variable types, and missingness, where small changes in preprocessing or feature scaling can substantially alter which counterfactuals are deemed comparable. We propose a proximity-based notion of counterfactual similarity learned from random forests, yielding a data-adaptive metric that emphasizes covariate dimensions that matter for partitioning the population while down-weighting irrelevant variation. We show how this learned similarity can be used as a unifying primitive for several causal workflows: (i) proximity-based matching and stratification, (ii) proximity-weighted estimators that localize adjustment to data-supported neighborhoods, and (iii) estimation of heterogeneous treatment effects by comparing proximity-defined counterfactual outcome models across treatment groups. We further introduce diagnostics for counterfactual quality based on local overlap and neighborhood stability, enabling transparent assessment of where causal conclusions are well supported.
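
To make the idea of forest-based proximity concrete, the following is a minimal sketch of workflow (i), proximity-based matching, and not the authors' implementation. It assumes scikit-learn, a forest fit to the treatment indicator (the abstract does not say what the forest is trained on), the common definition of proximity as the fraction of trees in which two units share a leaf, and synthetic data. The function names `forest_proximity` and `proximity_matched_att` are hypothetical. The best-match proximity per treated unit is returned as a rough stand-in for a local overlap diagnostic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def forest_proximity(forest, X_a, X_b):
    """Fraction of trees in which a row of X_a and a row of X_b share a leaf."""
    leaves_a = forest.apply(X_a)  # leaf index per tree, shape (n_a, n_trees)
    leaves_b = forest.apply(X_b)  # shape (n_b, n_trees)
    # Broadcast to (n_a, n_b, n_trees), then average agreement over trees.
    return (leaves_a[:, None, :] == leaves_b[None, :, :]).mean(axis=2)


def proximity_matched_att(X, t, y, n_trees=200, seed=0):
    """One-to-one matching (with replacement) of treated units to controls."""
    # Assumption: the forest is trained to predict treatment from covariates.
    forest = RandomForestClassifier(
        n_estimators=n_trees, min_samples_leaf=5, random_state=seed
    )
    forest.fit(X, t)

    treated, control = t == 1, t == 0
    prox = forest_proximity(forest, X[treated], X[control])

    best = prox.argmax(axis=1)    # most proximate control for each treated unit
    max_prox = prox.max(axis=1)   # low values flag weakly supported matches
    att = float(np.mean(y[treated] - y[control][best]))
    return att, max_prox, forest


# Synthetic illustration only: nonlinear outcome, confounded treatment.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))
y = X[:, 0] ** 2 + 2.0 * t + rng.normal(scale=0.5, size=300)

att, max_prox, forest = proximity_matched_att(X, t, y)
print(f"matched ATT estimate: {att:.3f}")
print(f"share of treated units with best proximity < 0.1: {(max_prox < 0.1).mean():.2f}")
```

Because the proximity depends only on which leaves units fall into, it is unchanged by monotone rescaling of individual covariates, which is one reason such a metric may be less sensitive to preprocessing than Euclidean distance on standardized covariates.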
