Online Program

Tuesday, September 24
Tue, Sep 24, 10:15 AM - 11:30 AM
Thurgood Marshall Ballroom
Plenary Session 2

Targeted Machine Learning for Causal Inference based on Real World Data (301053)

*Mark van der Laan, UC Berkeley 

We discuss a general roadmap for generating causal inferences from observational studies used to generate real-world evidence. This roadmap involves 1) causal models; 2) defining counterfactual outcomes indexed by choice of intervention; 3) defining the causal contrast of interest; 4) identification of this causal quantity from observed data under non-testable assumptions, and thereby definition of the target estimand of the data distribution. At this stage the statistical estimation problem is defined by the target estimand and our knowledge about the data-generating distribution, so that the remaining steps are 5) specification of an a priori defined estimator and 6) statistical inference, possibly augmented with a sensitivity analysis.
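As an illustration of steps 2 through 4, consider the average causal effect of a binary point treatment. The notation below is an assumption introduced for this sketch and does not appear in the abstract: the observed data are O = (W, A, Y) with covariates W, treatment A, and outcome Y; Y_a denotes the counterfactual outcome under intervention A = a; and P_0 is the true data distribution.

```latex
% Assumed notation: O = (W, A, Y), A in {0,1}, Y_a = counterfactual outcome under A = a.
\begin{align*}
\text{Causal contrast (step 3):}\quad
  & \psi^{F} = E[Y_1] - E[Y_0] \\
\text{Non-testable assumptions (step 4):}\quad
  & Y = Y_A, \qquad Y_a \perp A \mid W, \qquad 0 < P_0(A = 1 \mid W) < 1 \\
\text{Target estimand (step 4):}\quad
  & \Psi(P_0) = E_0\big[\, E_0(Y \mid A = 1, W) - E_0(Y \mid A = 0, W) \,\big]
\end{align*}
```

Under the three stated assumptions (consistency, no unmeasured confounding, and positivity), the causal contrast equals the estimand, which depends only on the distribution of the observed data. Steps 5 and 6 then concern estimating this estimand and quantifying uncertainty.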

Regarding steps 5 and 6, we review targeted minimum loss estimation (TMLE), which provides a general template for the construction of asymptotically efficient plug-in estimators of a target estimand for realistic (i.e., infinite-dimensional) statistical models. TMLE is a two-stage procedure that first involves using ensemble machine learning, termed super-learning, to estimate the relevant stochastic relations between the treatment, censoring, covariates, and outcome of interest. The super-learner allows one to fully utilize all the advances in machine learning (in addition to more conventional parametric model-based estimators) to build a single most powerful machine learning algorithm. In the second step, the TMLE involves maximizing a parametric likelihood along a so-called least favorable parametric model through the super-learner fit of the relevant stochastic relations in the observed data. This second step bridges the state of the art in machine learning to estimators of target estimands for which statistical inference is available (i.e., confidence intervals, p-values, etc.). The asymptotic normality and efficiency of the TMLE rely on the asymptotic negligibility of a second-order remainder term. This typically requires the initial (super-learner) estimator to converge at a rate faster than n^{-1/4} in sample size n. We show that a new Highly Adaptive LASSO (HAL) estimator of the data distribution and its functionals indeed converges at a sufficient rate regardless of the dimensionality of the data/model, under almost no additional regularity. This allows us to propose a general TMLE, using a super-learner whose library includes HAL, that is asymptotically normal and efficient in great generality.
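To see where the n^{-1/4} rate comes from, it may help to write the second-order remainder out for one simple case: the treated mean E_0[E_0(Y | A = 1, W)] with a binary treatment. The symbols below are assumptions made for this sketch: g(W) = P(A = 1 | W) is the treatment mechanism, \bar{Q}(W) = E(Y | A = 1, W) is the outcome regression, a subscript 0 marks the truth, a subscript n marks an estimator, \|\cdot\| is the L^2(P_0) norm, and \delta is an assumed lower bound on g_n.

```latex
% Assumed notation: g = treatment mechanism, \bar{Q} = outcome regression,
% subscript 0 = truth, subscript n = estimator, \delta = positivity bound.
\begin{align*}
R_2(P_n, P_0)
  &= E_0\!\left[ \frac{g_0(W) - g_n(W)}{g_n(W)}
       \,\big( \bar{Q}_0(W) - \bar{Q}_n(W) \big) \right] \\
|R_2(P_n, P_0)|
  &\le \frac{1}{\delta}\, \| g_n - g_0 \| \, \| \bar{Q}_n - \bar{Q}_0 \|
  \qquad \text{(Cauchy--Schwarz, with } g_n \ge \delta > 0\text{)} \\
\| g_n - g_0 \| = o_P(n^{-1/4}),\ \
\| \bar{Q}_n - \bar{Q}_0 \| = o_P(n^{-1/4})
  &\;\Longrightarrow\; R_2(P_n, P_0) = o_P(n^{-1/2})
\end{align*}
```

The remainder is a product of two estimation errors, so it is negligible relative to the n^{-1/2} scale of the leading term whenever both initial estimators converge faster than n^{-1/4}. The control arm contributes an analogous term.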

We demonstrate the practical performance of the corresponding HAL-TMLE (and its confidence intervals) for the average causal effect for dimensions up to 10, based on simulations that randomly generate data distributions. We also discuss a nonparametric bootstrap method for inference that takes into account the higher-order contributions of the HAL-TMLE, providing excellent robust coverage. We also demonstrate the TMLE on real-world data sets, and discuss how a user can use simulations, past studies, and outcome-blind data from the current study to define the precise specification of the target estimand and TMLE. Finally, we compare TMLE with methods such as inverse probability of treatment and censoring weighting and propensity score matching, among others, and also show that TMLE can be naturally combined with matching.
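The two-stage procedure can be sketched on a toy simulation for the average causal effect. This is a minimal illustration under stated assumptions, not the speaker's implementation. The data-generating process, the function names (`simulate`, `tmle_ate`), and the truncation bounds are hypothetical. Simple stratified means stand in for the super-learner and HAL, and the confidence interval is the influence-curve-based Wald interval, not the nonparametric bootstrap mentioned above. The initial outcome regression deliberately ignores the confounder W, so that the effect of the targeting step is visible.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def clip(p, lo=0.01, hi=0.99):
    # Truncate probabilities away from 0 and 1 (assumed bounds).
    return min(max(p, lo), hi)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def simulate(n, seed=0):
    """Hypothetical data: binary W confounds treatment A and outcome Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = int(rng.random() < 0.5)
        a = int(rng.random() < (0.7 if w else 0.3))
        y = int(rng.random() < expit(-1.0 + 1.0 * a + 1.5 * w))
        rows.append((w, a, y))
    return rows


def tmle_ate(rows):
    n = len(rows)
    # Stage 1: initial fits. The outcome regression is misspecified on
    # purpose (it ignores W); the treatment mechanism g(w) is stratified.
    q_init = {a: clip(mean(y for _, ai, y in rows if ai == a)) for a in (0, 1)}
    g = {w: clip(mean(a for wi, a, _ in rows if wi == w)) for w in (0, 1)}

    def h(a, w):
        # "Clever covariate" defining the one-dimensional fluctuation model.
        return a / g[w] - (1 - a) / (1 - g[w])

    # Stage 2: maximize the Bernoulli likelihood over epsilon in the model
    # logit Q_eps = logit Q_init + eps * H, using Newton's method.
    eps = 0.0
    for _ in range(100):
        score = info = 0.0
        for w, a, y in rows:
            p = expit(logit(q_init[a]) + eps * h(a, w))
            score += h(a, w) * (y - p)
            info += h(a, w) ** 2 * p * (1.0 - p)
        step = score / info
        eps += step
        if abs(step) < 1e-10:
            break

    def q_star(a, w):
        return expit(logit(q_init[a]) + eps * h(a, w))

    # Plug-in estimate and influence-curve-based 95% Wald interval.
    psi = mean(q_star(1, w) - q_star(0, w) for w, _, _ in rows)
    ic = [h(a, w) * (y - q_star(a, w)) + q_star(1, w) - q_star(0, w) - psi
          for w, a, y in rows]
    se = math.sqrt(mean(v * v for v in ic) / n)
    return {
        "naive": q_init[1] - q_init[0],  # unadjusted difference in means
        "tmle": psi,
        "ci": (psi - 1.96 * se, psi + 1.96 * se),
    }


result = tmle_ate(simulate(5000, seed=1))
print(result)
```

In this simulation the true average causal effect is about 0.21, while the unadjusted difference in means targets a value near 0.35 because of confounding by W. The targeting step pulls the misspecified initial fit toward the truth because the treatment mechanism is estimated consistently, which illustrates the double robustness of the estimator in this simple setting.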

Biographical Sketch

Mark van der Laan, Ph.D., is the Jiann-Ping Hsu/Karl E. Peace Professor of Biostatistics and Statistics at the University of California, Berkeley. He has made contributions to survival analysis, semiparametric statistics, multiple testing, and causal inference. He also developed the targeted maximum likelihood methodology and general theory for super-learning. He is a founding editor of the Journal of Causal Inference and the International Journal of Biostatistics. He has authored 4 books on targeted learning, censored data, and multiple testing, authored over 300 publications, and graduated 45 Ph.D. students.

He received his Ph.D. from Utrecht University in 1993 with a dissertation titled "Efficient and Inefficient Estimation in Semiparametric Models". He received the COPSS Presidents' Award in 2005, the Mortimer Spiegelman Award in 2004, and the van Dantzig Award in 2005.