The Many Flavors of Matching for Causal Inference

Categories: causal inference, matching methods

Published: May 27, 2025

Background

If you’ve worked on causal inference with observational data, you’ve likely faced the fundamental challenge: the treated and control groups often look very different. Matching methods aim to fix that. The idea is simple and intuitive—let’s compare treated units to similar control units and mimic the conditions of a randomized experiment as best as we can.

But here’s the twist: there are multiple ways to define “similar.” Should we look for exact matches? Should we match on covariates directly or on some summary score like the propensity score? Should we optimize the matches globally or locally? Over the years, researchers have developed a wide variety of matching methods, each with its own advantages and pitfalls. The landscape can be overwhelming, especially if you’re new to causal inference.

In this article, I’ll walk through the most popular matching strategies for causal inference. I’ll talk about what each method does, when to use it, and where it might lead you astray. The focus is on the intuition and technical description—not on the code. Whether you’re doing matching for the first time or looking to expand your toolkit, you will find something useful here.

Notation

Let’s set up the basic framework with minimal fluff. Suppose we have \(n\) units indexed by \(i = 1, \dots, n\). Each unit has:

  • A binary treatment indicator \(D_i \in \{0, 1\}\), where \(D_i = 1\) for treated units and \(D_i = 0\) for controls.
  • A vector of observed covariates \(X_i\).
  • Potential outcomes \(Y_i(1)\) and \(Y_i(0)\), where \(Y_i(1)\) is the outcome if treated, and \(Y_i(0)\) if untreated. We observe only their realized outcome \(Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)\).

We impose the usual assumptions of unconfoundedness (treatment assignment is independent of the potential outcomes given covariates) and overlap (every unit has a strictly positive probability of receiving either treatment given its covariates, so the two groups share common support).

Our goal is to estimate treatment effects like the Average Treatment Effect (ATE) or the ATE on the Treated (ATT): \[ \text{ATT} = \mathbb{E}[Y(1) - Y(0) \mid D = 1]. \]

The core idea behind matching is to find comparable untreated units for each treated unit so we can approximate \(Y(0)\) for the treated group. We then discard the unmatched units and look at the difference in outcomes between the treated units and their matched controls to estimate the treatment effect.

Let’s abuse notation a bit and define the sample-analogue of the ATT as:

\[\widehat{\text{ATT}} = \frac{1}{N_{\text{treated}}} \sum_{i: D_i = 1} \left[ Y_i(1) - \hat{Y}^{\text{imputed}}_i(0) \right].\]
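To make the estimator concrete, here is a minimal Python sketch of the plug-in ATT, assuming we have already found, for each treated unit, a matched control whose outcome plays the role of \(\hat{Y}^{\text{imputed}}_i(0)\). The array names (`y`, `d`, `matched_control_idx`) are illustrative, not from any particular package.

```python
import numpy as np

def att_estimate(y, d, matched_control_idx):
    """Plug-in ATT: mean of (treated outcome - matched control outcome).

    y                   -- observed outcomes, shape (n,)
    d                   -- binary treatment indicator, shape (n,)
    matched_control_idx -- for each treated unit (in index order), the row
                           index of its matched control
    """
    y = np.asarray(y, dtype=float)
    treated_idx = np.flatnonzero(np.asarray(d) == 1)
    return np.mean(y[treated_idx] - y[matched_control_idx])
```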

These methods can be, and often are, combined with regression adjustments to reduce bias and improve efficiency and robustness, but I will leave that aside here.

A Closer Look

We are now ready to go through seven of the most popular matching approaches.

Exact Matching

Exact matching is the simplest—and most restrictive—matching approach:

Match treated and control units exactly on all observed covariates \(X\).

That is, if a treated unit has \(X = x\), we look for control units with the exact same \(X = x\). While this method is conceptually elegant and easy to understand, it’s rarely practical.

Exact matches become increasingly unlikely in high-dimensional settings or when covariates are continuous, where no two units are likely to be identical. In those cases, exact matching often fails to find matches for many treated units, leading to loss of sample size or biased estimates. Despite its limitations, exact matching is an important baseline: it helps clarify the assumptions behind more flexible methods.
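As a toy illustration (not a full implementation), exact matching with discrete covariates amounts to indexing controls by their covariate combination and looking up each treated unit’s combination. The function and variable names below are hypothetical.

```python
import numpy as np

def exact_match(X, d, seed=0):
    """For each treated unit, draw one control with identical covariate values.

    X -- (n, k) array of discrete covariates; d -- binary treatment indicator.
    Returns {treated_row: control_row}; treated units with no exact
    counterpart are left unmatched (and would be dropped from the analysis).
    """
    rng = np.random.default_rng(seed)
    X, d = np.asarray(X), np.asarray(d)
    keys = [tuple(row) for row in X]

    # index control rows by their exact covariate combination
    pools = {}
    for pos, (key, di) in enumerate(zip(keys, d)):
        if di == 0:
            pools.setdefault(key, []).append(pos)

    # look up each treated unit's combination and sample a control from the pool
    matches = {}
    for pos, (key, di) in enumerate(zip(keys, d)):
        if di == 1 and pools.get(key):
            matches[pos] = int(rng.choice(pools[key]))
    return matches
```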

When to use it? When covariates are discrete, there aren’t too many of them, and there is substantial overlap between the treated and control groups.

Strengths: Conceptually clear, no modeling assumptions, perfect balance on matched covariates.

Weaknesses: Infeasible with continuous variables or many covariates (the curse of dimensionality); can lead to lots of unmatched units.


Mahalanobis Distance Matching

Instead of requiring exact equality between covariates, Mahalanobis matching

Uses a continuous distance metric to find treated and control units that are similar in terms of their covariate values.

The Mahalanobis distance between two units \(i\) and \(j\), with covariates \(X_i\) and \(X_j\), is defined as:

\[ d(X_i, X_j) = \sqrt{(X_i - X_j)^\top S^{-1} (X_i - X_j)}, \]

where \(S\) is the sample covariance matrix of the covariates \(X\).

This metric accounts for both the scale and the correlation structure of the covariates. Unlike Euclidean distance, which treats each covariate as equally important and independent, Mahalanobis distance adjusts for the fact that some variables may be more variable than others, or may be correlated.

Intuitively, Mahalanobis distance answers the question: how many standard deviations apart are these two vectors, once we’ve accounted for the spread and correlation of the variables? A small Mahalanobis distance indicates that the two units are close in the joint covariate space, even if they differ somewhat along individual dimensions. It does, however, become less reliable in high dimensions, where all units tend to be far from one another.
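Here is a minimal sketch of nearest-neighbor matching on Mahalanobis distance using SciPy—the greedy, with-replacement variant, with illustrative names.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mahalanobis_nn_match(X, d):
    """Match each treated unit to its closest control in Mahalanobis distance.

    X -- (n, k) array of covariates; d -- binary treatment indicator.
    Returns, for each treated unit, the row index of its nearest control
    (matching is with replacement, so a control can be reused).
    """
    X, d = np.asarray(X, dtype=float), np.asarray(d)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance of X

    dist = cdist(X[d == 1], X[d == 0], metric="mahalanobis", VI=S_inv)

    control_idx = np.flatnonzero(d == 0)
    return control_idx[dist.argmin(axis=1)]          # nearest control per treated unit
```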

When to use it? When covariates are continuous and moderately low-dimensional.

Strengths: Handles continuous variables well, accounts for correlation between covariates.

Weaknesses: Sensitive to high dimensionality; doesn’t work well with mixed discrete and continuous variables.


Propensity Score Matching

Propensity Score Matching (PSM) is one of the most influential ideas in observational causal inference. Rosenbaum and Rubin’s foundational result shows that if treatment assignment is unconfounded given covariates \(X\), then it is also unconfounded given the propensity score:

\[ e(X) = \mathbb{P}(D = 1 \mid X), \]

the probability of receiving treatment conditional on observed covariates. In other words,

Instead of matching on the full covariate vector \(X\), we can just match on a single scalar summary—the estimated propensity score.

This is the key idea: propensity scores reduce the curse of dimensionality. By summarizing the information in \(X\) into one number that captures the likelihood of treatment, we make matching more feasible and scalable, especially when \(X\) includes many variables.

In practice, the propensity score is rarely known and must be estimated—typically with logistic regression, probit models, or machine learning methods like random forests or gradient boosting. Once estimated, treated and control units are matched based on the closeness of their propensity scores, often using nearest-neighbor matching, caliper matching, or kernel methods. Trimming is also an important part of the process: units with very high or very low estimated scores are excluded to enforce overlap and reduce bias.
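A bare-bones sketch of that pipeline with scikit-learn, just to fix ideas: a logistic propensity model, symmetric trimming, and 1-nearest-neighbor matching with replacement. The trimming thresholds and names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_nearest_neighbor(X, d, trim=(0.05, 0.95)):
    """Estimate e(X), trim extreme scores, and 1-NN match treated to controls.

    X -- (n, k) covariate matrix; d -- binary treatment indicator.
    Returns aligned arrays (treated_idx, matched_control_idx).
    """
    X, d = np.asarray(X, dtype=float), np.asarray(d)

    # 1. estimate the propensity score (any probabilistic classifier works here)
    e_hat = LogisticRegression(max_iter=1000).fit(X, d).predict_proba(X)[:, 1]

    # 2. trim units with extreme estimated scores to enforce overlap
    keep = (e_hat > trim[0]) & (e_hat < trim[1])
    treated_idx = np.flatnonzero(keep & (d == 1))
    control_idx = np.flatnonzero(keep & (d == 0))

    # 3. nearest-neighbor matching on the scalar score (with replacement)
    nn = NearestNeighbors(n_neighbors=1).fit(e_hat[control_idx].reshape(-1, 1))
    _, pos = nn.kneighbors(e_hat[treated_idx].reshape(-1, 1))
    return treated_idx, control_idx[pos.ravel()]
```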

PSM improves comparability between groups by balancing the covariates in expectation, but it comes with trade-offs. Matching on the propensity score alone does not guarantee covariate balance in any particular dataset, so it’s important to assess and diagnose balance post-matching. Moreover, PSM is sensitive to model misspecification and can perform poorly if the propensity score is estimated inaccurately or if the overlap between groups is weak.

Despite these caveats, PSM remains a popular and conceptually powerful tool, especially when combined with diagnostics and robustness checks.

When to use it? When the number of covariates is large or mostly continuous.

Strengths: Reduces a multivariate problem to a one-dimensional one. Flexible.

Weaknesses: Balance on the propensity score does not guarantee balance on covariates. Sensitive to propensity model misspecification.


Coarsened Exact Matching

Coarsened Exact Matching (CEM) offers a practical compromise between the rigidity of exact matching and the flexibility needed for real-world data. The core idea is to

Coarsen continuous covariates into broader, meaningful categories and then perform exact matching on these coarsened values.

Formally, each covariate is discretized into bins, and treated and control units are matched only if they fall into the same bin across all coarsened covariates. This process reduces the granularity of the match criteria, increasing the likelihood of finding matches, while still ensuring comparability within the matched groups. For example, age might be coarsened into 5-year intervals and income into quantile-based brackets.
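A compact sketch of the idea: coarsen each covariate into quantile bins and keep only the strata that contain both treated and control units. The number of bins and the names are illustrative.

```python
import numpy as np

def coarsened_exact_match(X, d, bins=5):
    """Coarsen covariates into quantile bins, then exact-match on the bin signature.

    X -- (n, k) array of covariates; d -- binary treatment indicator.
    Returns {bin_signature: (control_rows, treated_rows)} for strata that contain
    units from both groups; all other strata are discarded.
    """
    X, d = np.asarray(X, dtype=float), np.asarray(d).astype(int)

    # quantile-based coarsening, column by column
    coarse = np.column_stack([
        np.digitize(col, np.quantile(col, np.linspace(0, 1, bins + 1)[1:-1]))
        for col in X.T
    ])

    # group units by their coarsened covariate signature
    strata = {}
    for pos, key in enumerate(map(tuple, coarse)):
        strata.setdefault(key, ([], []))[d[pos]].append(pos)

    # keep only strata with at least one control (index 0) and one treated (index 1) unit
    return {k: v for k, v in strata.items() if v[0] and v[1]}
```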

By construction, CEM guarantees balance on the coarsened covariates—unlike propensity score matching, where balance must be checked and cannot be guaranteed a priori. CEM also allows researchers to control the level of approximation: the finer the bins, the closer it is to exact matching; the coarser the bins, the more matches you retain but the more heterogeneity you permit within matched pairs. Researchers can apply finer coarsening to critical variables and coarser groupings to less central ones.

However, CEM’s effectiveness depends heavily on the choice of binning. Poorly chosen coarsening can either lead to very few matches (if too fine) or poor covariate balance (if too coarse). There is a trade-off between retaining sample size and improving covariate similarity, and CEM makes this trade-off explicit and user-controllable.

When to use it? When you can meaningfully coarsen your covariates and want to retain interpretability.

Strengths: Improves balance while avoiding the curse of dimensionality. You control the coarsening.

Weaknesses: Requires subjective choices about how to coarsen. Coarsening too much can reduce matching quality; coarsening too little can lead to few matches.


Optimal Matching

Optimal matching takes a global approach to the matching problem. Rather than matching each treated unit to its nearest control in isolation (as in nearest neighbor matching), it

Finds the set of matched pairs that minimizes the total distance across all matched units.

Formally, it solves:

\[ \min_{\text{matching}} \sum_{(i, j) \in \text{pairs}} d(X_i, X_j), \]

where \(d(X_i, X_j)\) is a distance measure between treated unit \(i\) and control unit \(j\).
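This is an assignment problem. Purely as a sketch, we can hand a precomputed distance matrix to SciPy’s Hungarian-algorithm solver; full implementations typically rely on network-flow algorithms that also handle 1:k matching. Names here are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def optimal_pair_match(X_treated, X_control):
    """Optimal 1:1 matching: the pairing of treated to controls that
    minimizes the total Mahalanobis distance across all pairs.

    Assumes at least as many controls as treated units.
    Returns (treated_rows, control_rows) as aligned index arrays.
    """
    X_treated = np.asarray(X_treated, dtype=float)
    X_control = np.asarray(X_control, dtype=float)

    # pooled inverse covariance for the Mahalanobis metric
    S_inv = np.linalg.inv(np.cov(np.vstack([X_treated, X_control]), rowvar=False))
    D = cdist(X_treated, X_control, metric="mahalanobis", VI=S_inv)

    # Hungarian algorithm: minimizes sum(D[rows, cols]) over all one-to-one pairings
    rows, cols = linear_sum_assignment(D)
    return rows, cols
```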

The key benefit is that it avoids poor global matches that can arise when matching is done greedily or locally, one unit at a time. Optimal matching is especially useful when treatment and control groups differ significantly in size or distribution, and when you want to minimize overall imbalance rather than optimize matches for individual units.

However, because it solves a global optimization problem, it can be computationally intensive for large datasets. Also, while it minimizes overall distance, it doesn’t necessarily guarantee good covariate balance unless combined with preprocessing (e.g., matching on propensity scores or coarsened covariates).

Still, optimal matching is a powerful and principled method, particularly when used with careful distance choices and diagnostics.

When to use it? When you care about global match quality rather than greedy, local matching.

Strengths: Finds globally optimal matches (not just nearest neighbor).

Weaknesses: Computationally intensive for large datasets. Sensitive to distance metric choice.


Genetic Matching

Genetic matching is an advanced matching method that uses a genetic algorithm to find an optimal weighting of covariates in the distance metric. The idea is to

Automate the process of choosing how much weight each covariate should receive when determining similarity between treated and control units.

Rather than manually selecting a distance metric like Mahalanobis or Euclidean, genetic matching searches over a space of weighted Mahalanobis distances, adjusting the weights to minimize covariate imbalance after matching. The result is a customized distance metric that gives more weight to variables that are harder to balance and less to those that are already balanced.
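The canonical implementation is GenMatch in the R Matching package. Purely to illustrate the logic, here is a toy Python sketch that swaps the genetic algorithm for a plain random search over diagonal covariate weights, scoring each candidate by the worst post-match standardized mean difference. Everything below—the weight range, the number of iterations, the names—is an illustrative assumption, not the actual algorithm.

```python
import numpy as np
from scipy.spatial.distance import cdist

def post_match_imbalance(X, d, w):
    """Worst absolute standardized mean difference after weighted 1-NN matching."""
    Xt, Xc = X[d == 1], X[d == 0]
    # scaling each column by sqrt(w) turns Euclidean distance into a diagonal-weighted metric
    dist = cdist(Xt * np.sqrt(w), Xc * np.sqrt(w))
    matched_controls = Xc[dist.argmin(axis=1)]
    smd = (Xt.mean(axis=0) - matched_controls.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    return np.max(np.abs(smd))

def search_covariate_weights(X, d, n_iter=200, seed=0):
    """Stand-in for the genetic search: sample weight vectors at random and
    keep the one yielding the best post-match covariate balance."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # standardize covariates
    d = np.asarray(d)

    best_w, best_loss = np.ones(X.shape[1]), np.inf
    for _ in range(n_iter):
        w = rng.uniform(0.1, 10.0, size=X.shape[1])       # candidate covariate weights
        loss = post_match_imbalance(X, d, w)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w
```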

Genetic matching can be used with or without propensity score preprocessing, and can accommodate interactions or higher-order terms. It’s especially powerful in settings with many covariates or complex imbalance patterns that simple metrics fail to capture.

However, the method is computationally intensive, often requiring many iterations of matching and balance assessment. Its performance also depends on the choice of balance metrics and tuning parameters in the genetic algorithm.

When to use it? When achieving good balance is critical and standard matching methods struggle.

Strengths: Excellent balance across covariates; data-driven weighting.

Weaknesses: Computationally heavy. May require tuning. Randomness in the genetic algorithm introduces some variability.


Caliper Matching

Caliper matching introduces a distance threshold to restrict which treated and control units can be matched. Specifically,

A treated unit is only matched to a control unit if the distance between them is within a pre-specified caliper.

For example, when matching on propensity scores, a common rule of thumb is to match only if the absolute difference in estimated scores falls below a small threshold such as 0.1:

\[ |e(X_i^{\text{treated}}) - e(X_j^{\text{control}})| < \text{caliper}. \]

This constraint helps avoid poor matches, especially when treated and control groups have limited overlap. Without calipers, nearest neighbor matching might pair units with very different covariate profiles, particularly in the tails of the propensity score distribution. These poor matches can increase bias and undermine the credibility of causal estimates.
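As a sketch, the caliper is just an extra check layered on nearest-neighbor matching; here it is applied to propensity scores. The names and the default caliper value are illustrative.

```python
import numpy as np

def caliper_match(scores, d, caliper=0.1):
    """1-NN matching on the propensity score, discarding pairs outside the caliper.

    scores -- estimated propensity scores, shape (n,); d -- treatment indicator.
    Returns {treated_row: control_row}; treated units with no control within
    the caliper stay unmatched (matching is with replacement).
    """
    scores, d = np.asarray(scores, dtype=float), np.asarray(d)
    treated_idx, control_idx = np.flatnonzero(d == 1), np.flatnonzero(d == 0)

    matches = {}
    for i in treated_idx:
        gaps = np.abs(scores[control_idx] - scores[i])
        j = gaps.argmin()
        if gaps[j] < caliper:                 # enforce the caliper constraint
            matches[int(i)] = int(control_idx[j])
    return matches
```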

Caliper matching is not a matching method on its own but rather a modification to existing strategies—most often to nearest neighbor matching. It can also be combined with optimal matching or Mahalanobis distance.

Choosing the right caliper width is important: too wide, and the constraint has little effect; too narrow, and many treated units may be left unmatched, reducing sample size and precision.

Caliper matching is particularly useful when the common support assumption is questionable—i.e., when treated and control groups do not overlap well in covariate space. In such cases, calipers serve as a safeguard to maintain the quality of matches by explicitly enforcing local comparability.

When to use it? When you want to avoid poor matches with big distance gaps.

Strengths: Prevents bad matches. Easy to implement alongside other methods.

Weaknesses: May leave some treated units unmatched if no control falls within the caliper.


Bottom Line

  • Matching methods are powerful tools for causal inference.
  • They come in many flavors, each with its own strengths and weaknesses.
  • No single method is best for all situations; the choice depends on the data, the research question, and the assumptions you are willing to make.

Where to Learn More

The book Causal Inference for Statistics, Social, and Biomedical Sciences by Imbens and Rubin (2015) provides excellent coverage of matching and its theoretical underpinnings. I also recommend the seminal review paper by Stuart (2010) cited below. The documentation for the MatchIt and Matching R packages is another goldmine of practical implementation details.

References

Abadie, A., & Imbens, G. W. (2016). Matching on the estimated propensity score. Econometrica, 84(2), 781–807.

Ben-Michael, E., Feller, A., Hirshberg, D. A., & Zubizarreta, J. R. (2021). The balancing act in causal inference. arXiv preprint arXiv:2110.14831.

Diamond, A., & Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics, 95(3), 932–945.

Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political Analysis, 20(1), 1–24.

Imbens, G. W. (2015). Matching methods in practice: Three examples. Journal of Human Resources, 50(2), 373–419.

Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.

Rosenbaum, P. R. (2002). Observational Studies. Springer.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1–21.
