Vasco Yasenov

A Simple Estimation Recipe for Targeted Learning

causal inference
machine learning
Published

June 12, 2026

6 min read

Background

Targeted learning is widely used in public health and epidemiology, yet it has never really made its way into economics and econometrics. It was introduced and extended by Mark van der Laan and coauthors in a long series of papers and books. This class of methods has attractive theoretical properties: it is doubly robust, it is semiparametrically efficient under the right conditions, and it is method-agnostic in that it can plug in a wide range of machine learning algorithms.

This note aims to demystify targeted learning for a broader audience. From a practical standpoint, I find the best way to do that is through the lens of estimation, since that is where practitioners actually spend their time with a method. It is also the easiest way to compare it against the causal inference tools you already know — be it propensity score methods, double machine learning, or regression adjustment.

My focus is on the how-to and less on theoretical derivations or guarantees, and the overview borrows heavily from Schuler and Rose (2017). I deliberately stay with the basic version of targeted learning — targeted maximum likelihood estimation (TMLE) for the average treatment effect in a cross-sectional observational study. The framework extends to many other estimands, longitudinal data, and beyond, but those are out of scope here.

Notation

Suppose we observe \(n\) i.i.d. draws of \(O = (X, A, Y)\), where:

  • \(A \in \{0, 1\}\) is a binary treatment (the “exposure”),
  • \(X\) is a vector of confounders, and
  • \(Y\) is a continuous outcome.

Each unit has a pair of potential outcomes \(Y(1)\) and \(Y(0)\), and the target is the average treatment effect (ATE),

\[\psi = \mathbb{E}\big[Y(1) - Y(0)\big].\]

Two nuisance functions do all the work. The outcome regression is the conditional mean of the outcome,

\[\bar{Q}(A, X) = \mathbb{E}[Y \mid A, X],\]

and the propensity score is the conditional probability of treatment,

\[g(X) = \mathbb{P}(A = 1 \mid X).\]

For \(\psi\) to carry a causal interpretation we need the usual trio of assumptions — no unmeasured confounding, positivity (\(0 < g(X) < 1\)), and SUTVA — which I will take as given since the focus here is estimation, not identification.
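As a reminder of what these assumptions buy (a standard identification result, stated here in the notation above rather than quoted from the paper), the ATE can be rewritten purely in terms of the observed data, once through each nuisance function:

\[\psi = \mathbb{E}\big[\bar{Q}(1, X) - \bar{Q}(0, X)\big] = \mathbb{E}\left[\frac{A\,Y}{g(X)} - \frac{(1 - A)\,Y}{1 - g(X)}\right].\]

The first form motivates outcome-regression estimators and the second motivates weighting estimators.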

A concrete example to fix ideas, taken from the paper: \(A\) is whether a person exercises regularly, \(Y\) is a depression score, and \(X\) collects sex, use of psychosocial therapy, and antidepressant use.

A Closer Look

Two Nuisances, Three Estimators

The reason TMLE is easy to place on a mental map is that it shares its ingredients with methods you already use. G-computation (regression adjustment) models only the outcome regression \(\bar{Q}\) and then averages the predicted difference. Inverse probability weighting (IPW) models only the propensity score \(g\) and reweights the observed outcomes. Each leans entirely on getting one of the two nuisances right.
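To make the two single-nuisance estimators concrete, here is a minimal sketch in Python with NumPy. The data-generating process, the linear outcome model, and the helper names `expit` and `fit_logistic` are my own illustrative assumptions, not taken from the paper; the true ATE in this simulation is 2 by construction.

```python
import numpy as np

def expit(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def fit_logistic(Z, a, iters=25):
    # Logistic regression by Newton's method; Z must include an intercept column.
    coef = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = expit(Z @ coef)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z
        coef += np.linalg.solve(hess, Z.T @ (a - p))
    return coef

# Hypothetical simulated data: one confounder X, true ATE = 2.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=n)
A = rng.binomial(1, expit(0.5 * X))
Y = 1.0 + 2.0 * A + X + rng.normal(size=n)
ones = np.ones(n)

# G-computation: model only the outcome regression, then average
# the predicted difference between A = 1 and A = 0 for every unit.
beta, *_ = np.linalg.lstsq(np.column_stack([ones, A, X]), Y, rcond=None)
q1 = beta[0] + beta[1] + beta[2] * X
q0 = beta[0] + beta[2] * X
ate_gcomp = np.mean(q1 - q0)

# IPW: model only the propensity score, then reweight observed outcomes.
Z = np.column_stack([ones, X])
g_hat = expit(Z @ fit_logistic(Z, A))
ate_ipw = np.mean(A * Y / g_hat - (1 - A) * Y / (1.0 - g_hat))

print(f"G-computation: {ate_gcomp:.3f}, IPW: {ate_ipw:.3f}")
```

Both models are correctly specified here, so both estimates should land near 2. The interesting cases are the ones where one of the two models is wrong.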

TMLE uses both. It starts from an outcome-regression prediction and then nudges it using the propensity score, in a way designed specifically for the ATE. That extra step is what buys double robustness: the final estimate is consistent if either \(\bar{Q}\) or \(g\) is consistently estimated, not necessarily both. If both are, the estimator is also semiparametrically efficient. This is the practical payoff, and it is why the method is worth the extra bookkeeping.


The Recipe: Four Steps

Here is the whole procedure for the ATE, stripped to its essentials. One technical preliminary: because the targeting step runs through a logistic regression, the continuous outcome \(Y\) is first rescaled to the unit interval \([0, 1]\) (for example, by subtracting its minimum and dividing by its range), and the final estimate is transformed back to the original scale at the end.

Algorithm: TMLE for the ATE
  1. Initial outcome model. Fit \(\bar{Q}(A, X)\) and use it to predict each unit’s pair of potential outcomes, \(\bar{Q}(1, X_i)\) and \(\bar{Q}(0, X_i)\).
  2. Propensity model. Fit \(g(X) = \mathbb{P}(A = 1 \mid X)\), giving \(\hat{\pi}_1 = \hat g(X_i)\) and \(\hat{\pi}_0 = 1 - \hat g(X_i)\).
  3. Targeting step. Form the “clever covariate” \(H(A, X) = \frac{A}{\hat{\pi}_1} - \frac{1 - A}{\hat{\pi}_0}\), run a one-parameter logistic regression (no intercept) of the rescaled \(Y\) on \(H\) with \(\text{logit}\,\bar{Q}(A, X)\) as an offset to estimate a fluctuation \(\hat{\varepsilon}\), and update the predicted potential outcomes on the logit scale to obtain \(\bar{Q}^*(1, X_i)\) and \(\bar{Q}^*(0, X_i)\).
  4. Plug-in estimate. Average the difference between the updated potential outcomes: \(\hat{\psi} = \frac{1}{n}\sum_i \big[\bar{Q}^*(1, X_i) - \bar{Q}^*(0, X_i)\big]\).
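The four steps can be sketched in a few dozen lines of Python with NumPy. This is a minimal illustration under my own assumptions, not the paper's code: the simulated data (true ATE of 2), the linear initial outcome model, the min-max rescaling, the clipping bounds, and the helper names are all hypothetical choices, and the one-parameter logistic fit is done by hand with Newton steps instead of a regression library.

```python
import numpy as np

def expit(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def logit(p):
    return np.log(p / (1.0 - p))

def clip01(q, bound=1e-3):
    # Keep predictions strictly inside (0, 1) so that logit stays finite.
    return np.clip(q, bound, 1.0 - bound)

def fit_logistic(Z, a, iters=25):
    # Logistic regression by Newton's method; Z must include an intercept column.
    coef = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = expit(Z @ coef)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z
        coef += np.linalg.solve(hess, Z.T @ (a - p))
    return coef

# Hypothetical simulated data: one confounder X, true ATE = 2.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=n)
A = rng.binomial(1, expit(0.5 * X))
Y = 1.0 + 2.0 * A + X + rng.normal(size=n)
ones = np.ones(n)

# Preliminary: rescale the outcome to [0, 1].
y_min, y_max = Y.min(), Y.max()
Ys = (Y - y_min) / (y_max - y_min)

# Step 1: initial outcome model (plain OLS here) and both potential-outcome predictions.
beta, *_ = np.linalg.lstsq(np.column_stack([ones, A, X]), Ys, rcond=None)
Q1 = clip01(beta[0] + beta[1] + beta[2] * X)
Q0 = clip01(beta[0] + beta[2] * X)
QA = np.where(A == 1, Q1, Q0)

# Step 2: propensity model.
Z = np.column_stack([ones, X])
g_hat = expit(Z @ fit_logistic(Z, A))

# Step 3: clever covariate and one-parameter logistic fluctuation with offset logit(QA).
H1, H0 = 1.0 / g_hat, -1.0 / (1.0 - g_hat)
HA = np.where(A == 1, H1, H0)
eps = 0.0
for _ in range(25):  # Newton steps on the logistic score in eps
    p = expit(logit(QA) + eps * HA)
    eps += np.sum(HA * (Ys - p)) / np.sum(HA**2 * p * (1.0 - p))
Q1_star = expit(logit(Q1) + eps * H1)
Q0_star = expit(logit(Q0) + eps * H0)

# Step 4: plug-in estimate, mapped back to the original outcome scale.
ate_tmle = (y_max - y_min) * np.mean(Q1_star - Q0_star)
print(f"TMLE estimate of the ATE: {ate_tmle:.3f}")
```

In real applications you would replace the OLS and logistic fits in steps 1 and 2 with flexible learners and rely on a maintained implementation; the point of the sketch is only that step 3 is a single extra regression with one free parameter.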

The only genuinely new object is the clever covariate \(H\). Note that it looks a lot like an inverse-probability weight — and that is no accident — but here it enters as a regressor in the fluctuation rather than as a weight. Step 3 is the “targeting”: it tilts the initial fit just enough to remove first-order bias in the direction that matters for the ATE, while leaving the rest of the outcome model alone.
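Written out, and using the logit of the initial prediction as the regression offset, the update in step 3 is (on the rescaled outcome, with \(H(1, X) = 1/\hat{\pi}_1\) and \(H(0, X) = -1/\hat{\pi}_0\)):

\[\text{logit}\,\bar{Q}^*(a, X) = \text{logit}\,\bar{Q}(a, X) + \hat{\varepsilon}\, H(a, X), \qquad a \in \{0, 1\}.\]

Because \(\hat{\varepsilon}\) is the maximizer of a logistic likelihood with a single regressor, its first-order condition forces the updated fit to satisfy

\[\frac{1}{n}\sum_i H(A_i, X_i)\,\big[Y_i - \bar{Q}^*(A_i, X_i)\big] = 0.\]

This is one way to read the phrase "removes first-order bias": after targeting, the propensity-weighted residuals of the outcome model average to exactly zero.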


Why Bother With the Targeting Step?

A fair question is why we do not simply stop after step 1 and report the average difference in predicted outcomes — that is exactly G-computation. The answer is robustness. A plain outcome-model plug-in is only as good as that one model; if it is misspecified, the estimate is biased, full stop. The targeting step injects information from the propensity score so that a mistake in one of the two models can be absorbed by the other. In the paper’s simulation, when either the outcome or the exposure model is deliberately misspecified, TMLE stays essentially unbiased, while G-computation and IPW pick up large bias from the same misspecification.
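A small simulation makes the same point. The sketch below is my own toy setup, not a replication of the paper's simulation: the outcome model is deliberately wrong (it ignores the confounder entirely), the propensity model is correct, and the true ATE is 2. The fluctuation parameter is found by bisection on the logistic score, which is simply a robust way of solving the same one-parameter regression.

```python
import numpy as np

def expit(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def logit(p):
    return np.log(p / (1.0 - p))

# Hypothetical simulated data with strong confounding; true ATE = 2.
rng = np.random.default_rng(1)
n = 20000
X = rng.normal(size=n)
A = rng.binomial(1, expit(X))
Y = 1.0 + 2.0 * A + 2.0 * X + rng.normal(size=n)

y_min, y_max = Y.min(), Y.max()
Ys = (Y - y_min) / (y_max - y_min)

# Misspecified outcome model: ignores X, so the prediction is just the arm mean.
Q1 = np.full(n, Ys[A == 1].mean())
Q0 = np.full(n, Ys[A == 0].mean())
ate_gcomp = (y_max - y_min) * np.mean(Q1 - Q0)

# Correctly specified propensity model, fit by Newton's method.
Z = np.column_stack([np.ones(n), X])
coef = np.zeros(2)
for _ in range(25):
    p = expit(Z @ coef)
    coef += np.linalg.solve((Z * (p * (1.0 - p))[:, None]).T @ Z, Z.T @ (A - p))
g_hat = expit(Z @ coef)

# Targeting step: solve the one-parameter logistic score equation for eps.
H1, H0 = 1.0 / g_hat, -1.0 / (1.0 - g_hat)
HA = np.where(A == 1, H1, H0)
offset = logit(np.where(A == 1, Q1, Q0))

def score(eps):
    # Derivative of the logistic log-likelihood in eps; strictly decreasing in eps.
    return np.sum(HA * (Ys - expit(offset + eps * HA)))

lo, hi = -10.0, 10.0
for _ in range(80):  # bisection on the monotone score
    mid = 0.5 * (lo + hi)
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
eps = 0.5 * (lo + hi)

Q1_star = expit(logit(Q1) + eps * H1)
Q0_star = expit(logit(Q0) + eps * H0)
ate_tmle = (y_max - y_min) * np.mean(Q1_star - Q0_star)

print(f"G-computation (wrong outcome model): {ate_gcomp:.2f}")
print(f"TMLE (same outcome model, correct propensity): {ate_tmle:.2f}")
```

With the outcome model this badly wrong, G-computation reduces to a raw difference in means and inherits the full confounding bias, while the targeted estimate should sit close to 2.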

Two practical notes follow from this. First, TMLE is a substitution (plug-in) estimator — the ATE is computed by plugging fitted values into the target functional — which makes it more stable than estimating-equation methods like IPW when propensity scores get close to \(0\) or \(1\). Second, because the recipe never cares how \(\bar{Q}\) and \(g\) are estimated, you should estimate them with flexible machine learning rather than hand-specified regressions. In practice this means an ensemble (the super learner), which lets the data choose among a library of ML methods instead of betting on one functional form.
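To give a feel for "letting the data choose", here is a deliberately simplified sketch of the selection idea. It is a discrete super learner (pick the single candidate with the lowest cross-validated risk), not the full super learner, which forms a weighted combination of candidates. The library of polynomial regressions, the simulated data, and the function names are all hypothetical.

```python
import numpy as np

def poly_design(x, degree):
    # Design matrix with columns 1, x, x^2, ..., x^degree.
    return np.column_stack([x**d for d in range(degree + 1)])

def cv_risk(x, y, degree, folds):
    # Cross-validated mean squared error of a polynomial regression.
    errors = []
    for k in np.unique(folds):
        train, test = folds != k, folds == k
        beta, *_ = np.linalg.lstsq(poly_design(x[train], degree), y[train], rcond=None)
        errors.append(np.mean((y[test] - poly_design(x[test], degree) @ beta) ** 2))
    return float(np.mean(errors))

# Hypothetical data with a nonlinear conditional mean.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(2.0 * x) + 0.3 * rng.normal(size=n)
folds = rng.integers(0, 5, size=n)  # random assignment to 5 folds

# Candidate library: polynomial regressions of increasing flexibility.
library = {"poly1": 1, "poly3": 3, "poly5": 5}
risks = {name: cv_risk(x, y, degree, folds) for name, degree in library.items()}
best = min(risks, key=risks.get)

for name, risk in risks.items():
    print(f"{name}: CV risk = {risk:.3f}")
print("selected:", best)
```

The same loop applies to either nuisance function: swap in a library of learners for \(\bar{Q}\) or \(g\), score each on held-out folds, and feed the winner (or a weighted blend) into the recipe.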

Bottom Line

  • Targeted learning, and TMLE in particular, is widely used in the public health and epidemiology communities and underused elsewhere.
  • It is doubly robust — consistent if either the outcome regression or the propensity score is right — and semiparametrically efficient when both are.
  • It is flexible: the recipe is agnostic to how the two nuisance functions are estimated, so it accommodates a wide range of machine learning algorithms (ideally an ensemble).
  • In essence, TMLE is outcome-regression plus propensity-score modeling, with one extra step that removes bias from the outcome model through a clever use of the estimated propensity score.

References

  • Schuler, M. S., & Rose, S. (2017). Targeted maximum likelihood estimation for causal inference in observational studies. American Journal of Epidemiology, 185(1), 65–73.

  • van der Laan, M. J., & Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer.

  • van der Laan, M. J., & Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1), Article 11.

© 2025 Vasco Yasenov

 
