Vasco Yasenov

6 Underrated Plot Types

Mon, 06 Apr 2026 07:00:00 GMT

7 min read

Background

Most data science workflows rely on a familiar trio of plots: histograms, scatterplots, and boxplots. They are useful, but they leave a lot of structure hidden in the data.

There are several plots that statisticians use regularly but that rarely show up in typical data science notebooks. Many of these are extremely informative for diagnostics, distribution comparison, or exploring high-dimensional relationships.

In this post I’ll look at six of them. To keep things simple I will use the same dataset throughout: the classic iris dataset. The goal is not mathematical rigor but practical intuition and code you can reuse. All examples below are shown in R and Python.

A Closer Look

Let’s start by loading the data.

library(ggplot2)
library(ggridges)  # install.packages("ggridges")
library(hexbin)    # install.packages("hexbin")
library(corrplot)  # install.packages("corrplot")

data(iris)

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats
from sklearn.datasets import load_iris

iris_bunch = load_iris(as_frame=True)
iris = iris_bunch.frame.copy()
iris["species"] = iris["target"].map(dict(enumerate(iris_bunch.target_names)))
iris = iris.rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
    "petal length (cm)": "petal_length",
    "petal width (cm)": "petal_width",
})

Q-Q Plot

A Q-Q plot compares sample quantiles to theoretical quantiles from a reference distribution. In practice that reference is usually the normal distribution, which makes the plot a fast diagnostic for residual checks and distributional shape. If the points line up, the sample is broadly consistent with the reference. If they bend away from the line, that tells you where the mismatch lives: skewness shows up as asymmetric curvature, while heavy tails pull the extremes away from the line. One can also use Q-Q plots to compare two empirical distributions, but I’d argue there are better ways to do that.

What I like about Q-Q plots is that they force you to think about where a distribution departs from a model, not just whether a normality test rejects. The downside is that they are easy to overread in small samples and less useful if you do not have a meaningful reference distribution in mind. Unlike traditional statistical tests, Q-Q plots do not spit out a p-value, so you have to interpret the plot yourself.

ggplot(iris, aes(sample = Sepal.Length)) +
  stat_qq(color = "#66c2a5", size = 2) +
  stat_qq_line(color = "black", linewidth = 0.8) +
  theme_minimal() +
  labs(
    title = "Q-Q Plot of Sepal Length",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )

stats.probplot(iris["sepal_length"], dist="norm", plot=plt)
plt.title("Q-Q Plot of Sepal Length")
plt.xlabel("Theoretical Quantiles")
plt.ylabel("Sample Quantiles")
plt.show()
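
To make the mechanics concrete, here is a minimal sketch of what a normal Q-Q plot computes under the hood. It uses a synthetic sample rather than iris, and the plotting positions `(i - 0.5) / n` are one common convention that I am assuming here; library functions may use slightly different ones.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for a measured variable (not the iris data)
sample = rng.normal(loc=5.8, scale=0.8, size=150)

# Sample quantiles are simply the sorted observations
sample_q = np.sort(sample)

# Plotting positions: an assumed convention, others exist
n = sample_q.size
probs = (np.arange(1, n + 1) - 0.5) / n

# Theoretical standard-normal quantiles at those probabilities
theory_q = stats.norm.ppf(probs)

# For a roughly normal sample the points lie near a straight line whose
# slope and intercept approximate the standard deviation and the mean
slope, intercept = np.polyfit(theory_q, sample_q, deg=1)
corr = np.corrcoef(theory_q, sample_q)[0, 1]
print(f"slope={slope:.2f}, intercept={intercept:.2f}, corr={corr:.3f}")
```

Plotting `theory_q` against `sample_q` reproduces the picture; systematic curvature away from the fitted line is what signals skewness or heavy tails.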

Violin Plot

A violin plot combines a boxplot with a smoothed (symmetric) density estimate. That makes it useful when a plain boxplot feels too compressed. Two groups can have similar medians and quartiles but very different shapes, and a violin plot makes that visible immediately. In the iris data, it is a quick way to see that species differ not only in central tendency but in how concentrated or dispersed their sepal lengths are.

The main drawback is that the density is smoothed, so small samples can look more structured than they really are. It can also be sensitive to the smoothing parameters (bandwidth more than kernel type). Violins also become noisy if you cram in too many categories. Still, when I want a compact distribution comparison across a handful of groups, violin plots are often a strict upgrade over boxplots.

ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.12, fill = "white", outlier.shape = NA) +
  theme_minimal() +
  labs(title = "Violin Plot of Sepal Length by Species")

sns.violinplot(data=iris, x="species", y="sepal_length", inner="box")
plt.title("Violin Plot of Sepal Length by Species")
plt.xlabel("")
plt.ylabel("Sepal Length")
plt.show()

ECDF Plot

The empirical cumulative distribution function shows the share of observations less than or equal to a given value. That sounds modest, but it is one of the cleanest ways to compare distributions because it avoids arbitrary bin choices and displays the full sample directly. When one ECDF sits to the right of another, you can read that as a first-order stochastic dominance story, at least visually.

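The ECDF is defined as

$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\},$$

where $n$ is the sample size and $\mathbf{1}\{\cdot\}$ is the indicator function.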

Do you remember that the PDF is the derivative of the CDF? The CDF really is central to probability theory and to understanding any variable at hand. In microeconomic theory classes, CDFs are used to establish stochastic dominance relationships. I like ECDFs because they are honest: they show every observation’s contribution to the distribution without smoothing it away. The tradeoff is that they are less familiar to many audiences and can look busy when too many groups are overlaid. For side-by-side distribution comparison, though, they are hard to beat.

ggplot(iris, aes(Sepal.Length, color = Species)) +
  stat_ecdf(linewidth = 1) +
  theme_minimal() +
  labs(
    title = "ECDF of Sepal Length by Species",
    x = "Sepal Length",
    y = "Empirical CDF"
  )

sns.ecdfplot(data=iris, x="sepal_length", hue="species")
plt.title("ECDF of Sepal Length by Species")
plt.xlabel("Sepal Length")
plt.ylabel("Empirical CDF")
plt.show()
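
The ECDF is simple enough to compute by hand, which is a useful sanity check on what the plotting functions draw. The sketch below uses hypothetical helper names (`ecdf`, `ecdf_at`) and a tiny made-up sample with a tie.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the ECDF evaluated at each of them."""
    x = np.sort(np.asarray(values, dtype=float))
    # side="right" counts observations <= x, so ties are handled correctly
    y = np.searchsorted(x, x, side="right") / x.size
    return x, y

def ecdf_at(values, t):
    """Share of observations less than or equal to t."""
    x = np.sort(np.asarray(values, dtype=float))
    return np.searchsorted(x, t, side="right") / x.size

sample = np.array([5.1, 4.9, 5.0, 5.0, 6.3])
x, y = ecdf(sample)
# x is [4.9, 5.0, 5.0, 5.1, 6.3] and y is [0.2, 0.6, 0.6, 0.8, 1.0]
print(x, y)
print(ecdf_at(sample, 5.0))  # three of five observations are <= 5.0
```

No bins and no bandwidth appear anywhere, which is the sense in which the ECDF involves no tuning choices.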

Ridgeline Plot

Ridgeline plots stack several density curves vertically, which makes them especially useful when you want to compare many related distributions at once. The variables, however, need to be on more-or-less the same scale for the plot to make sense. They are common in cohort analysis and time-based comparisons, but they also work well for grouped exploratory analysis like the species differences in iris.

Their advantage is compactness: you can compare several distributions without the visual clutter of heavy overlap. Their weakness is that they are still density plots, so the same caution about smoothing applies. I use ridgelines when I want a plot that is more expressive than small multiples but less chaotic than overlaying five or six densities in one panel.

ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) +
  geom_density_ridges(alpha = 0.7, color = "white") +
  theme_ridges() +
  labs(
    title = "Ridgeline Plot of Sepal Length by Species",
    x = "Sepal Length",
    y = NULL
  )

species_order = ["setosa", "versicolor", "virginica"]
x_grid = np.linspace(iris["sepal_length"].min() - 0.3,
                     iris["sepal_length"].max() + 0.3, 300)
offsets = [0.0, 1.0, 2.0]

fig, ax = plt.subplots(figsize=(7, 5))
for offset, species in zip(offsets, species_order):
    subset = iris.loc[iris["species"] == species, "sepal_length"]
    kde = stats.gaussian_kde(subset)
    density = kde(x_grid)
    density = density / density.max() * 0.8
    ax.fill_between(x_grid, offset, offset + density, alpha=0.7)
    ax.plot(x_grid, offset + density, color="black", linewidth=0.8)
    ax.text(x_grid.min() - 0.02, offset + 0.12, species, ha="right")

ax.set_title("Ridgeline Plot of Sepal Length by Species")
ax.set_xlabel("Sepal Length")
ax.set_yticks([])
plt.show()

Hexbin Plot

Scatterplots are great until they are not. Have you tried a scatterplot with a million points? It’s slow and it’s hard to see anything. Once the sample gets large enough, overplotting hides the very structure you want to see. Hexbin plots solve that by aggregating points into small hexagonal cells and coloring those cells by count. You give up the exact point cloud, but in return you get a much clearer view of where the data are concentrated.

The iris data are too small to truly need a hexbin, which is worth saying out loud. But the plot still illustrates the logic well. On genuinely large datasets, this is often the right substitute for a scatterplot. The cost is that rare points and local outliers become less visible, so it is better for density structure than for point-level inspection.

ggplot(iris, aes(Sepal.Length, Petal.Length)) +
  geom_hex() +
  theme_minimal() +
  labs(
    title = "Hexbin Plot of Sepal vs Petal Length",
    x = "Sepal Length",
    y = "Petal Length"
  )

plt.hexbin(
    iris["sepal_length"],
    iris["petal_length"],
    gridsize=14,
    cmap="YlOrRd"
)
plt.xlabel("Sepal Length")
plt.ylabel("Petal Length")
plt.title("Hexbin Plot of Sepal vs Petal Length")
plt.colorbar(label="Count")
plt.show()

Corrgram

A corrgram turns a correlation matrix into something you can actually read. Before fitting a regression, building a clustering pipeline, or running PCA, I almost always want to know which variables are moving together and which are largely independent. A corrgram gives that answer in a single glance.

The upside is speed: strong blocks, redundancies, and likely multicollinearity jump out immediately. The downside is that (Pearson) correlation is a blunt summary. It only captures linear association, ignores conditional relationships, and can be badly distorted by outliers. Presumably, one can move away from Pearson correlation and do the same plot with other correlation measures. Corrgrams also don’t work well with too many variables. So I treat corrgrams as a screening device, not as evidence of mechanism. Used that way, they are extremely effective.

corr_matrix <- cor(iris[, 1:4])

corrplot(
  corr_matrix,
  method = "color",
  type = "upper",
  tl.col = "black",
  tl.srt = 45
)

corr = iris.drop(columns=["species", "target"]).corr()

sns.heatmap(corr, annot=True, cmap="RdBu_r", center=0)
plt.title("Correlation Matrix (Corrgram)")
plt.show()

In the iris data, the corrgram immediately tells you that petal length and petal width are carrying very similar information. That is exactly the kind of thing you want to know before moving on to feature engineering, PCA, or a predictive model.
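
On the point about moving away from Pearson: Spearman correlation is just Pearson correlation computed on ranks, so swapping it in is cheap. The sketch below uses synthetic data (not iris) with a monotone but strongly nonlinear relationship to show how far the two measures can diverge.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
# Column 1 is a monotone, highly nonlinear function of column 0;
# column 2 is unrelated noise
data = np.column_stack([x, np.exp(3 * x), rng.normal(size=200)])

# Pearson correlation matrix on the raw values
pearson = np.corrcoef(data, rowvar=False)

# Spearman correlation matrix: rank each column, then apply Pearson
ranks = np.apply_along_axis(stats.rankdata, 0, data)
spearman = np.corrcoef(ranks, rowvar=False)

print("Pearson  (x, exp(3x)):", round(pearson[0, 1], 3))
print("Spearman (x, exp(3x)):", round(spearman[0, 1], 3))
```

Either matrix can be passed to the same heatmap call. With a pandas data frame, `DataFrame.corr(method="spearman")` gives the rank-based version directly.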

Bottom Line

Q-Q plots are among the fastest ways to diagnose whether a distributional assumption is wrong and where it fails.
Violin plots and ECDFs are often better than boxplots and histograms when the goal is comparing full distributions across groups.
Ridgeline plots are excellent for compact multi-group distribution comparisons, especially when overlaid densities start to look messy.
Hexbin plots are the right replacement for scatterplots once overplotting becomes a real problem.
Corrgrams are simple but high-value screening tools before modeling, especially when redundancy and multicollinearity are on the table.

Where to Learn More

Wilke’s Fundamentals of Data Visualization is what I have on my bookshelf, but I admit I don’t reach for it very often. Novice data scientists will surely benefit from it, though.

References

Wilke, C. O. (2019). Fundamentals of Data Visualization. O’Reilly Media.

The Many Flavors of Principal Component Analysis

Sun, 05 Apr 2026 07:00:00 GMT

7 min read

Background

Principal component analysis (PCA) is one of those methods that everyone learns early and then quietly keeps using for years. The appeal is obvious: take a high-dimensional data matrix, rotate it into orthogonal directions of maximum variance, and keep only the first few directions. That gives you compression, visualization, denoising, and sometimes a useful preprocessing step for downstream models.

The common misconception is that PCA is a generic tool for finding the “most important” variables or the “true latent factors” in the data. It is neither. Classical PCA finds directions of high variance. That is often useful, but it is not the same thing as finding predictive features, interpretable components, or nonlinear structure. Once you keep that distinction straight, the many PCA variants make much more sense: each flavor modifies classical PCA to target a different practical goal.

In this post I will use the standard PCA formulation as the baseline and then focus on four variants that I think matter most in applied work. The goal is to get a broad sense of some of the most popular ways PCA has evolved over the years.

Notation

Let $X \in \mathbb{R}^{n \times p}$ be a data matrix with rows as observations and columns as variables. Assume the columns have been centered so that

$$\frac{1}{n} \sum_{i=1}^{n} x_{ij} = 0 \quad \text{for } j = 1, \dots, p.$$

When variables are on very different scales, it is often better to standardize them as well and work with the correlation matrix rather than the covariance matrix. I will write the empirical covariance matrix as

$$S = \frac{1}{n - 1} X^\top X.$$

The first principal component loading vector solves

$$v_1 = \arg\max_{\|v\|_2 = 1} v^\top S v.$$

Subsequent components solve the same problem subject to orthogonality constraints. If $V_k = [v_1, \dots, v_k]$, the corresponding score matrix is

$$Z_k = X V_k.$$

Equivalently, if $X = U D V^\top$ is the singular value decomposition, the columns of $V$ are the loading vectors and the diagonal entries of $D^2 / (n - 1)$ are the explained variances.
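
The equivalence between the covariance eigendecomposition and the SVD is easy to check numerically. This is a small sketch on simulated data, assuming the $1/(n-1)$ normalization for the covariance matrix; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1988)
n, p = 300, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)  # center the columns

# Route 1: eigendecomposition of the empirical covariance matrix
S = X.T @ X / (n - 1)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]  # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: singular value decomposition of the centered data
U, d, Vt = np.linalg.svd(X, full_matrices=False)
var_from_svd = d**2 / (n - 1)

# Scores: projecting X onto the loadings equals U scaled by d
Z = X @ Vt.T

print(np.round(eigvals, 4))
print(np.round(var_from_svd, 4))
```

The loading vectors from the two routes agree only up to sign, which is why comparisons should use absolute values or align signs first.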

A Closer Look

Classical PCA

Classical PCA is the benchmark because its optimization problem is clean and its geometry is transparent. The first component is the unit vector that captures the most sample variance; the second is the best such vector orthogonal to the first; and so on. If the singular values of $X$ are $d_1 \ge d_2 \ge \dots \ge d_r$, then the proportion of variance explained by the first $k$ components is

$$\frac{\sum_{j=1}^{k} d_j^2}{\sum_{j=1}^{r} d_j^2}.$$

In practice, two issues matter more than the derivation. First, PCA is extremely sensitive to scaling. If one variable is measured in dollars and another in percentages, the dollar variable may dominate the first component unless the data are standardized. Second, variance is not the same thing as signal. A noisy feature with large variance can easily drive the first component. I treat classical PCA as a compression tool, not as an automatic discovery engine.

library(stats)

set.seed(1988)
n <- 300
p <- 8

# Simulate a low-rank signal with two latent factors
latent_factors <- matrix(rnorm(n * 2), n, 2)
loadings_true <- matrix(c(
  0.9,  0.0,
  0.8,  0.1,
  0.7, -0.1,
  0.6,  0.2,
  0.0,  0.8,
  0.1,  0.7,
 -0.1,  0.6,
  0.2,  0.5
), nrow = p, byrow = TRUE)

X <- latent_factors %*% t(loadings_true) + matrix(rnorm(n * p, sd = 0.3), n, p)
colnames(X) <- paste0("feature_", 1:p)

# Standardize before PCA because variables may be on different scales
pca_fit <- prcomp(X, center = TRUE, scale. = TRUE)

# Explained variance ratio
explained_var <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
round(explained_var[1:4], 3)

# First two loading vectors
round(pca_fit$rotation[, 1:2], 3)

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

np.random.seed(1988)
n = 300
p = 8

latent_factors = np.random.randn(n, 2)
loadings_true = np.array([
    [0.9,  0.0],
    [0.8,  0.1],
    [0.7, -0.1],
    [0.6,  0.2],
    [0.0,  0.8],
    [0.1,  0.7],
    [-0.1, 0.6],
    [0.2,  0.5],
])

X = latent_factors @ loadings_true.T + np.random.randn(n, p) * 0.3

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

pca = PCA(n_components=4)
pca.fit(X_std)

print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("First two loading vectors:\n", np.round(pca.components_[:2].T, 3))
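
The scaling issue is worth seeing once with numbers. In this sketch, two made-up variables on very different scales (dollars and percentage points, both hypothetical) are run through PCA with and without standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
income_dollars = rng.normal(50_000, 15_000, size=n)  # hypothetical variable
rate_percent = rng.normal(5, 2, size=n)              # hypothetical variable
X = np.column_stack([income_dollars, rate_percent])

def first_loading(M):
    """First principal component loading vector of a data matrix."""
    M = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0]

# Without standardization the dollar variable dominates the first component
raw = first_loading(X)

# After standardization both variables get equal weight in absolute value
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std = first_loading(X_std)

print("Raw loadings:         ", np.round(raw, 4))
print("Standardized loadings:", np.round(std, 4))
```

With only two standardized variables the first loading vector always puts equal absolute weight on each, so the contrast with the raw case is purely a units effect.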

Sparse PCA

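Sparse PCA modifies the loading vectors so that many coordinates are exactly zero. A convenient way to write the idea is

$$\max_{v} \; v^\top S v \quad \text{subject to} \quad \|v\|_2 \le 1, \quad \|v\|_1 \le c,$$

or, equivalently, with an $\ell_1$ penalty on the loadings, where $S$ is the empirical covariance matrix and a smaller $c$ forces more zeros. The point is not to improve the mathematics of PCA. The point is to make the components readable.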

This matters when $p$ is large and the classical loading vectors spread small weight across almost every variable. In genomics, marketing, or text applications, that is often useless from a substantive perspective. Sparse PCA forces each component to be built from a smaller set of variables. The tradeoff is that you lose some variance explained, orthogonality becomes less clean, and the components can be more sensitive to tuning choices. In practice, I reach for Sparse PCA when interpretation matters at least as much as compression.

# install.packages("elasticnet")
library(elasticnet)

set.seed(1988)
n <- 200
p <- 12

latent_factor <- rnorm(n)
X <- cbind(
  latent_factor + rnorm(n, sd = 0.2),
  0.9 * latent_factor + rnorm(n, sd = 0.2),
  0.8 * latent_factor + rnorm(n, sd = 0.2),
  0.7 * latent_factor + rnorm(n, sd = 0.2),
  matrix(rnorm(n * (p - 4)), n, p - 4)
)

X <- scale(X)

# Ask for two sparse components with at most 4 nonzero loadings each
spca_fit <- spca(X, K = 2, type = "predictor", sparse = "varnum", para = c(4, 4))

round(spca_fit$loadings[, 1:2], 3)

import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.preprocessing import StandardScaler

np.random.seed(1988)
n = 200
p = 12

latent_factor = np.random.randn(n)
X = np.column_stack([
    latent_factor + np.random.randn(n) * 0.2,
    0.9 * latent_factor + np.random.randn(n) * 0.2,
    0.8 * latent_factor + np.random.randn(n) * 0.2,
    0.7 * latent_factor + np.random.randn(n) * 0.2,
    np.random.randn(n, p - 4)
])

X_std = StandardScaler().fit_transform(X)

spca = SparsePCA(n_components=2, alpha=1.0, random_state=1988)
spca.fit(X_std)

print("Sparse loadings:\n", np.round(spca.components_.T, 3))

Kernel PCA

Kernel PCA keeps the variance-maximization logic but applies it in a nonlinear feature space. Instead of diagonalizing the covariance matrix of $X$, we diagonalize a centered kernel matrix

$$K_{ij} = k(x_i, x_j),$$

where $k$ might be a radial basis function kernel or a polynomial kernel. PCA is then performed on the centered version of $K$ rather than on the original variables.

This is useful when the data lie on a curved manifold rather than near a linear subspace. The classic example is concentric circles: ordinary PCA sees almost no useful low-dimensional linear structure, while Kernel PCA can often unfold the geometry. The price is interpretability. Classical PCA gives loading vectors in the original variables; Kernel PCA gives components in an implicit feature space. In practice, that makes it more of a nonlinear embedding method than a variable-summary tool. It is also sensitive to kernel choice and scale, so I do not treat it as a push-button replacement for standard PCA.

# install.packages("kernlab")
library(kernlab)

set.seed(1988)
n <- 300
angles <- runif(n, 0, 2 * pi)
radius <- rep(c(1, 2), each = n / 2) + rnorm(n, sd = 0.05)

X_circle <- cbind(
  radius * cos(angles),
  radius * sin(angles)
)

kpca_fit <- kpca(
  x = X_circle,
  kernel = "rbfdot",
  kpar = list(sigma = 5),
  features = 2
)

head(rotated(kpca_fit))

import numpy as np
from sklearn.decomposition import KernelPCA

np.random.seed(1988)
n = 300
angles = np.random.uniform(0, 2 * np.pi, size=n)
radius = np.repeat([1.0, 2.0], repeats=n // 2) + np.random.randn(n) * 0.05

X_circle = np.column_stack([
    radius * np.cos(angles),
    radius * np.sin(angles),
])

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=5)
X_embedded = kpca.fit_transform(X_circle)

print(np.round(X_embedded[:5], 3))
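
For readers who want to see what "diagonalize a centered kernel matrix" means in practice, here is a minimal NumPy sketch for the RBF kernel. It is an illustration of the computation under my own choice of names and of the parameterization $k(x, y) = \exp(-\gamma \|x - y\|^2)$, not a replacement for the library implementations.

```python
import numpy as np

rng = np.random.default_rng(1988)
n = 200
angles = rng.uniform(0, 2 * np.pi, size=n)
radius = np.repeat([1.0, 2.0], n // 2) + rng.normal(scale=0.05, size=n)
X = np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])

# RBF kernel matrix from pairwise squared distances
gamma = 5.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

# Double-centering: equivalent to centering in the implicit feature space
H = np.eye(n) - np.ones((n, n)) / n
K_centered = H @ K @ H

# Leading eigenpairs of the centered kernel matrix
eigvals, eigvecs = np.linalg.eigh(K_centered)
order = np.argsort(eigvals)[::-1][:2]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Embedding of the training points: eigenvectors scaled by sqrt(eigenvalue)
scores = eigvecs * np.sqrt(eigvals)
print(np.round(scores[:5], 3))
```

Notice that no loading vectors in the original two variables ever appear. Everything is expressed through the $n \times n$ kernel matrix, which is also why the method gets expensive as $n$ grows.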

Probabilistic PCA

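Probabilistic PCA (PPCA) replaces the deterministic projection view with a latent variable model:

$$x_i = W z_i + \mu + \varepsilon_i, \qquad z_i \sim N(0, I_k), \qquad \varepsilon_i \sim N(0, \sigma^2 I_p).$$

Here $z_i$ is a $k$-dimensional latent factor and $\varepsilon_i$ is isotropic Gaussian noise. Under maximum likelihood, the columns of the estimated $W$ span the classical principal subspace, and the classical PCA projection is recovered in the limit $\sigma^2 \to 0$. The formulation also buys you something important: a likelihood, uncertainty quantification, and a principled way to deal with missing values.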

That makes PPCA attractive when PCA is part of a generative modeling workflow rather than just a preprocessing step. I especially like it when the data matrix has moderate missingness and I do not want to impute first and hope for the best. The main caveat is the isotropic-noise assumption. If feature-specific noise levels differ substantially, PPCA can be too restrictive and factor analysis may be the better model.

# install.packages("pcaMethods")
library(pcaMethods)

set.seed(1988)
n <- 150
p <- 6

latent_factors <- matrix(rnorm(n * 2), n, 2)
loadings_true <- matrix(rnorm(p * 2), p, 2)
X <- latent_factors %*% t(loadings_true) + matrix(rnorm(n * p, sd = 0.2), n, p)

# Introduce missing values
missing_index <- sample(length(X), size = 0.1 * length(X))
X[missing_index] <- NA

ppca_fit <- pca(X, method = "ppca", nPcs = 2, seed = 1988)

# Completed data and estimated scores
X_completed <- completeObs(ppca_fit)
scores(ppca_fit)

# pip install ppca-py
import numpy as np
from ppca import PPCA

np.random.seed(1988)
n = 150
p = 6

latent_factors = np.random.randn(n, 2)
loadings_true = np.random.randn(p, 2)
X = latent_factors @ loadings_true.T + np.random.randn(n, p) * 0.2

# Introduce missing values
missing_mask = np.random.rand(n, p) < 0.10
X[missing_mask] = np.nan

ppca = PPCA(n_components=2)
ppca.fit(X)

scores, score_cov = ppca.posterior_latent(X)
X_imputed = ppca.sample_missing(X, n_draws=1)[0]

print("Estimated noise variance:", round(ppca.noise_variance_, 4))
print("First five latent scores:\n", np.round(scores[:5], 3))
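
When the data are complete, the maximum-likelihood solution from Tipping and Bishop (1999) has a closed form, which makes the link to classical PCA explicit. The sketch below applies that closed form to simulated data; it does not handle missing values (those require an EM-type algorithm), and the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(1988)
n, p, k = 2000, 6, 2
noise_sd = 0.2
W_true = rng.normal(size=(p, k))
X = rng.normal(size=(n, k)) @ W_true.T + rng.normal(scale=noise_sd, size=(n, p))

# Sample mean and covariance (maximum likelihood uses the 1/n normalization)
mu = X.mean(axis=0)
S = (X - mu).T @ (X - mu) / n

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Noise variance: average of the discarded eigenvalues
sigma2_hat = eigvals[k:].mean()

# Loadings: leading eigenvectors, scaled after removing the noise floor
# (identified only up to an arbitrary rotation)
W_hat = eigvecs[:, :k] * np.sqrt(eigvals[:k] - sigma2_hat)

# Posterior mean of the latent factors given the data
M = W_hat.T @ W_hat + sigma2_hat * np.eye(k)
scores = np.linalg.solve(M, W_hat.T @ (X - mu).T).T

print("Estimated noise variance:", round(float(sigma2_hat), 4))
print("True noise variance:     ", noise_sd**2)
```

The columns of `W_hat` are the classical loading vectors rescaled, so the subspace is the same as in ordinary PCA; what is new is the explicit noise variance and the shrinkage of the scores toward zero.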

Truncated PCA

This last flavor is a little different. Truncated PCA does not change the statistical target. It changes the computation. Instead of computing the full singular value decomposition, we directly approximate the top $k$ singular vectors:

$$X \approx U_k D_k V_k^\top.$$

When $n$ and $p$ are large, or when $X$ is sparse, that distinction matters a lot. If all you want are the first few components, computing the full decomposition is wasted effort.

For practitioners, this is often the most useful PCA variant of all because it makes the classical method scale. The catch is conceptual rather than mathematical: randomized or truncated PCA is not discovering a different notion of component. It is approximating the same principal subspace more cheaply. If the approximation error is small, great. If not, you have a computational shortcut, not a new estimator.

# install.packages("irlba")
library(irlba)

set.seed(1988)
n <- 1000
p <- 200

latent_factors <- matrix(rnorm(n * 5), n, 5)
loadings_true <- matrix(rnorm(p * 5), p, 5)
X_large <- latent_factors %*% t(loadings_true) + matrix(rnorm(n * p, sd = 0.5), n, p)

# Fast approximation to the first 5 principal components
pca_fast <- prcomp_irlba(X_large, n = 5, center = TRUE, scale. = TRUE)

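# Share of total variance: after scaling, every column has unit variance,
# so the total variance equals the number of columns. Dividing by
# sum(pca_fast$sdev^2) would only give shares among the 5 retained components.
pca_fast$sdev^2 / ncol(X_large)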

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

np.random.seed(1988)
n = 1000
p = 200

latent_factors = np.random.randn(n, 5)
loadings_true = np.random.randn(p, 5)
X_large = latent_factors @ loadings_true.T + np.random.randn(n, p) * 0.5

X_large = StandardScaler().fit_transform(X_large)

# Randomized SVD computes an approximate leading subspace
pca_fast = PCA(n_components=5, svd_solver="randomized", random_state=1988)
pca_fast.fit(X_large)

print(np.round(pca_fast.explained_variance_ratio_, 3))
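
To show that the randomized approach targets the same subspace, here is a bare-bones sketch of a randomized SVD in the spirit of Halko, Martinsson, and Tropp (2011). The function name and the default oversampling and power-iteration settings are my own illustrative choices, not tuned recommendations.

```python
import numpy as np

def randomized_svd(A, k, n_oversamples=10, n_power_iter=2, seed=0):
    """Approximate the top-k singular triplets of A via random projection."""
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]
    # Random test matrix with a few extra columns for safety
    Omega = rng.normal(size=(n_cols, k + n_oversamples))
    Y = A @ Omega
    # Power iterations sharpen the separation between leading and trailing
    # singular values; re-orthonormalize each time for numerical stability
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(Y)
        Y = A @ (A.T @ Q)
    Q, _ = np.linalg.qr(Y)
    # Exact SVD of the small projected matrix
    B = Q.T @ A
    U_small, d, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], d[:k], Vt[:k]

rng = np.random.default_rng(1988)
n, p, k = 1000, 200, 5
A = rng.normal(size=(n, 5)) @ rng.normal(size=(5, p)) + 0.5 * rng.normal(size=(n, p))
A = A - A.mean(axis=0)

U_k, d_k, Vt_k = randomized_svd(A, k)
d_exact = np.linalg.svd(A, compute_uv=False)[:k]

print("Randomized:", np.round(d_k, 2))
print("Exact:     ", np.round(d_exact, 2))
```

The approximation is good here because the simulated matrix has a large gap between the fifth and sixth singular values. When the spectrum decays slowly, more oversampling or more power iterations may be needed.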

Bottom Line

Classical PCA is a variance-maximizing compression tool, not a generic device for finding the “most important” variables or latent causes.
Sparse PCA is the right upgrade when interpretability matters and dense loading vectors are getting in the way.
Kernel PCA is useful for nonlinear geometry, but you give up the clean loading-vector interpretation that makes ordinary PCA attractive.
Probabilistic PCA is worth using when likelihood, uncertainty, or missing data matter; otherwise classical PCA is usually simpler.
Truncated PCA is often the most practical choice on large matrices because it targets the same principal subspace at a much lower computational cost.

Where to Learn More

For the classical theory, Jolliffe’s Principal Component Analysis is still the standard reference and Jolliffe and Cadima (2016) is a concise modern review. Zou, Hastie, and Tibshirani (2006) is the canonical sparse PCA paper. Schölkopf, Smola, and Müller (1998) remains the core reference for Kernel PCA, while Tipping and Bishop (1999) is the paper to read for the probabilistic view. If your main concern is computation at scale, Halko, Martinsson, and Tropp (2011) is the right randomized linear algebra entry point.

References

Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288.

Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202.

Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319.

Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B, 61(3), 611-622.

Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265-286.

Brief Overview of Treatment Effect Bounds

Thu, 02 Apr 2026 07:00:00 GMT

7 min read

Background

In applied causal work, the real problem is often not estimation but identification. Attrition, imperfect take-up, endogenous selection, and missing outcomes can all make the average treatment effect impossible to point-identify from the data at hand. In those settings, a precise estimate is not a sign of rigor. It is usually a sign that strong assumptions have been smuggled in.

Bounding methods take a more honest route. Rather than asking for the exact value of a treatment effect, they ask which values remain consistent with the observed data and a stated set of assumptions. The answer is an interval, not a point. That interval may be wide, but its width is itself informative: it tells you how much the design really buys you before additional structure is imposed.

This is why I think treatment effect bounds are worth knowing even for practitioners who usually work with point estimators. They are useful both as primary estimands and as a diagnostic. If weak-assumption bounds are already tight, your design is doing real work. If they are wide, that is a warning against overconfident causal claims.

Notation

For each unit $i$, let $Y_i(1)$ and $Y_i(0)$ denote the potential outcomes under treatment and control, and let $D_i \in \{0, 1\}$ be the treatment indicator. When needed, I use $Z_i$ for an ordered instrument or covariate. The observed outcome is

$$Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0).$$

The target parameter is the average treatment effect

$$\tau = E[Y_i(1) - Y_i(0)].$$

When $\tau$ is not point-identified, the object of interest becomes an identified set

$$\tau \in [\tau_L, \tau_U],$$

where the endpoints depend on the observed distribution and the maintained assumptions. A bound is sharp if every value in that interval is attainable under some data-generating process consistent with those assumptions. Sharp is always good!

A Closer Look

Manski Bounds

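Manski (1990) is the natural starting point because it assumes almost nothing beyond bounded outcomes. Suppose $Y_i(d) \in [y_L, y_U]$ for $d \in \{0, 1\}$, let $p = P(D_i = 1)$, and define

$$\mu_1 = E[Y_i \mid D_i = 1], \qquad \mu_0 = E[Y_i \mid D_i = 0].$$

Then the missing counterfactual means satisfy

$$y_L \le E[Y_i(1) \mid D_i = 0] \le y_U$$

and

$$y_L \le E[Y_i(0) \mid D_i = 1] \le y_U.$$

Combining them gives sharp bounds on the ATE:

$$\tau_L \le \tau \le \tau_U,$$

where

$$\tau_L = \left[p \mu_1 + (1 - p)\, y_L\right] - \left[(1 - p) \mu_0 + p\, y_U\right]$$

and

$$\tau_U = \left[p \mu_1 + (1 - p)\, y_U\right] - \left[(1 - p) \mu_0 + p\, y_L\right].$$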
These bounds are usually wide, and that is exactly the point. Manski bounds tell you what the data alone can support before you add structure. In practice, I treat them as the baseline honesty check.
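
The worst-case bounds take only a few lines to compute. The sketch below uses a hypothetical `manski_bounds` helper and simulated data with selection into treatment; because the potential outcomes are simulated, the true effect can be compared against the interval.

```python
import numpy as np

def manski_bounds(y, d, y_min, y_max):
    """Worst-case bounds on the ATE, assuming only y_min <= Y(d) <= y_max."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d).astype(bool)
    p = d.mean()
    mu1, mu0 = y[d].mean(), y[~d].mean()
    # Fill each missing counterfactual mean with its extreme values
    ey1_lo = p * mu1 + (1 - p) * y_min
    ey1_hi = p * mu1 + (1 - p) * y_max
    ey0_lo = (1 - p) * mu0 + p * y_min
    ey0_hi = (1 - p) * mu0 + p * y_max
    return ey1_lo - ey0_hi, ey1_hi - ey0_lo

# Simulated data: a latent trait drives both treatment take-up and outcomes
rng = np.random.default_rng(0)
n = 5000
trait = rng.normal(size=n)
d = trait + rng.normal(size=n) > 0
y0 = (trait + rng.normal(size=n) > 0.5).astype(float)
y1 = (trait + rng.normal(size=n) > 0.0).astype(float)
y = np.where(d, y1, y0)

lo, hi = manski_bounds(y, d, y_min=0.0, y_max=1.0)
true_ate = (y1 - y0).mean()
naive = y[d].mean() - y[~d].mean()
print(f"Bounds: [{lo:.3f}, {hi:.3f}]  true ATE: {true_ate:.3f}  naive: {naive:.3f}")
```

The interval always has width $y_U - y_L$, so for a binary outcome it has width one and necessarily contains zero. That is the sense in which the data alone cannot sign the effect without further assumptions.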

Tightening Manski: MTR, MTS, and MIV

The usual next step is to ask whether credible qualitative restrictions can narrow the interval. Manski and Pepper (2000) study three of the most useful ones. My first job market paper as a PhD candidate employed these restrictions to tighten the Manski bounds in the context of the labor market impact of immigration.

First, under Monotone Treatment Response (MTR), treatment weakly helps everyone:

$$Y_i(1) \ge Y_i(0) \quad \text{for all } i.$$

MTR tightens the bounds by ruling out any configuration in which treatment hurts some units, so the lower bound rises and negative treatment effects become harder or impossible to sustain. For example, under MTR, $E[Y_i(1) \mid D_i = 0]$ cannot be below $E[Y_i \mid D_i = 0]$ (each control’s missing $Y_i(1)$ is at least that unit’s observed $Y_i(0)$), not merely the outcome floor $y_L$; and $E[Y_i(0) \mid D_i = 1]$ cannot exceed $E[Y_i \mid D_i = 1]$.

Second, under Monotone Treatment Selection (MTS), treated units are systematically stronger than untreated units in terms of their potential outcomes, that is, $E[Y_i(d) \mid D_i = 1] \ge E[Y_i(d) \mid D_i = 0]$ for each $d$. MTS tightens the bounds by imposing an ordering on who selects into treatment, so the observed outcomes in one group become informative about the missing potential outcomes in the other. For example, under MTS, $E[Y_i(0) \mid D_i = 1]$ is bounded below by $E[Y_i \mid D_i = 0]$, not merely the outcome floor $y_L$.

Third, under a Monotone Instrumental Variable (MIV) assumption, an ordered variable shifts potential outcomes in a known direction:

$$z \le z' \implies E[Y_i(d) \mid Z_i = z] \le E[Y_i(d) \mid Z_i = z'] \quad \text{for } d \in \{0, 1\}.$$

In words, MIV lets us use the ordering in $Z_i$ to intersect bounds across instrument values, which can noticeably shrink the identified set. These assumptions become more powerful when combined, and in some cases the resulting interval is narrow enough to be informative.
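
A small numerical sketch shows how much MTR and MTS can buy. The function below is hypothetical and works from summary statistics only: `p` is the treated share, `mu1` and `mu0` are the observed group means, and MTS is taken in the positive-selection direction described above. MIV is not covered, since it requires intersecting bounds across instrument values.

```python
def monotone_bounds(p, mu1, mu0, y_min, y_max, mtr=False, mts=False):
    """ATE bounds under optional MTR and (positive) MTS restrictions."""
    ey = p * mu1 + (1 - p) * mu0  # mean of the observed outcome

    # Worst-case intervals for E[Y(1)] and E[Y(0)]
    ey1_lo, ey1_hi = p * mu1 + (1 - p) * y_min, p * mu1 + (1 - p) * y_max
    ey0_lo, ey0_hi = (1 - p) * mu0 + p * y_min, (1 - p) * mu0 + p * y_max

    if mtr:
        # Y(1) >= Y(0) for everyone implies E[Y(0)] <= E[Y] <= E[Y(1)]
        ey1_lo = max(ey1_lo, ey)
        ey0_hi = min(ey0_hi, ey)
    if mts:
        # Treated units have weakly higher mean potential outcomes,
        # so E[Y(1)] <= mu1 and E[Y(0)] >= mu0
        ey1_hi = min(ey1_hi, mu1)
        ey0_lo = max(ey0_lo, mu0)

    return ey1_lo - ey0_hi, ey1_hi - ey0_lo

# Made-up summary statistics for a binary outcome
stats_in = dict(p=0.4, mu1=0.7, mu0=0.5, y_min=0.0, y_max=1.0)

no_assumption = monotone_bounds(**stats_in)
mtr_only = monotone_bounds(**stats_in, mtr=True)
mts_only = monotone_bounds(**stats_in, mts=True)
mtr_mts = monotone_bounds(**stats_in, mtr=True, mts=True)

for name, b in [("none", no_assumption), ("MTR", mtr_only),
                ("MTS", mts_only), ("MTR+MTS", mtr_mts)]:
    print(f"{name:8s} [{b[0]:.2f}, {b[1]:.2f}]")
```

With these inputs the interval shrinks from roughly $[-0.42, 0.58]$ with no assumptions to roughly $[0, 0.20]$ under MTR and MTS together: MTR pins the lower end at zero and MTS caps the upper end at the naive difference in means.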

Balke-Pearl Bounds for Noncompliance

Balke and Pearl (1997) address randomized assignment with imperfect compliance. Instead of jumping directly to LATE under exclusion and monotonicity, they ask a broader question: what does the observed joint distribution of $(Z_i, D_i, Y_i)$ imply about the population treatment effect under weaker assumptions?

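The answer is a sharp nonparametric bound obtained by optimizing over all latent compliance-response types consistent with the observed data:

$$\tau_L = \min_{q \in \mathcal{Q}} \sum_{t} q_t \left[y_t(1) - y_t(0)\right], \qquad \tau_U = \max_{q \in \mathcal{Q}} \sum_{t} q_t \left[y_t(1) - y_t(0)\right],$$

where $q$ is a probability distribution over latent types $t$, $y_t(d)$ is the outcome of type $t$ under treatment status $d$, and $\mathcal{Q}$ is the set of distributions that reproduce the observed $P(Y_i, D_i \mid Z_i)$.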

This is best viewed as a separation between what the experiment identifies and what extra assumptions identify. Balke-Pearl bounds are often much wider than a LATE estimate, but they answer a different question. LATE is a point-identified effect for compliers under stronger structure. Balke-Pearl bounds are partial-identification statements about broader causal quantities. When the policy question is about the full eligible population rather than compliers, that distinction matters.
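
For binary assignment, treatment, and outcome, the optimization is a small linear program over 16 latent types. The sketch below is one way to set it up, assuming random assignment and exclusion (the outcome depends on $Z$ only through $D$) but no monotonicity. The function names are mine, and the observed probabilities are generated from a simulated type distribution so the answer can be checked.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# A latent type is (D(0), D(1), Y(0), Y(1)), all binary: 16 types in total
types = list(itertools.product([0, 1], repeat=4))

def constraint_matrix():
    """Rows map type probabilities to P(D=d, Y=y | Z=z) for each (z, d, y)."""
    rows, labels = [], []
    for z, d, y in itertools.product([0, 1], repeat=3):
        # A type contributes if it takes treatment d under assignment z
        # and has outcome y under treatment status d
        rows.append([1.0 if (t[z] == d and t[2 + d] == y) else 0.0
                     for t in types])
        labels.append((z, d, y))
    return np.array(rows), labels

def balke_pearl_bounds(p_obs):
    """p_obs[(z, d, y)] = P(D=d, Y=y | Z=z). Returns (lower, upper) for the ATE."""
    A, labels = constraint_matrix()
    b = np.array([p_obs[label] for label in labels])
    c = np.array([t[3] - t[2] for t in types], dtype=float)  # Y(1) - Y(0)
    low = linprog(c, A_eq=A, b_eq=b, bounds=(0, 1), method="highs")
    high = linprog(-c, A_eq=A, b_eq=b, bounds=(0, 1), method="highs")
    return low.fun, -high.fun

# Simulated type distribution, used only to produce valid observed data
rng = np.random.default_rng(0)
q_true = rng.dirichlet(np.ones(16))
A, labels = constraint_matrix()
p_obs = dict(zip(labels, A @ q_true))

lower, upper = balke_pearl_bounds(p_obs)
true_ate = float(sum(q * (t[3] - t[2]) for q, t in zip(q_true, types)))
print(f"Bounds: [{lower:.3f}, {upper:.3f}]  true ATE: {true_ate:.3f}")
```

Because the simulated type distribution includes defiers, a LATE interpretation would not be valid here, yet the bounds still contain the true population effect.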

Lee Bounds for Sample Selection

Lee (2009) is the method I see most often in practice because the intuition is so transparent. Suppose treatment is randomized, but outcomes are only observed for selected units. Wages observed only for employed workers is the canonical example. If treatment changes employment, comparing observed wages across treatment arms is contaminated by selection.

Lee’s key assumption is monotone selection: treatment can move selection in only one direction for every unit. If treatment raises the probability of observation, then the treated group contains some “extra” observed units relative to control. Those units must be trimmed away from one tail or the other of the treated outcome distribution.

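Let $S_i$ indicate whether the outcome is observed and suppose $P(S_i = 1 \mid D_i = 1) > P(S_i = 1 \mid D_i = 0)$. The excess selected share in the treated group is

$$q = \frac{P(S_i = 1 \mid D_i = 1) - P(S_i = 1 \mid D_i = 0)}{P(S_i = 1 \mid D_i = 1)}.$$

Trimming a fraction $q$ from the upper tail of the treated outcomes gives the lower bound; trimming it from the lower tail gives the upper bound.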

Algorithm:

Compute the selection rate in each treatment arm.
Identify the arm with the higher selection rate.
Trim the excess share from one tail and then the other of that arm’s observed outcome distribution.
Compare the trimmed means to the mean outcome in the arm with the lower selection rate.
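
The steps above can be sketched in a few lines. This is a bare-bones illustration with a hypothetical `lee_bounds` function: no covariates, no inference, and a simple floor rule for the number of trimmed observations. The simulated data satisfy monotone selection by construction.

```python
import numpy as np

def lee_bounds(y, d, s):
    """Trimming bounds on the effect for always-observed units.

    y: outcomes (ignored where s is False), d: treatment, s: outcome observed.
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(d).astype(bool)
    s = np.asarray(s).astype(bool)
    p1, p0 = s[d].mean(), s[~d].mean()  # selection rate in each arm
    y1 = np.sort(y[d & s])
    y0 = np.sort(y[~d & s])

    def trimmed_means(sorted_y, share):
        k = int(np.floor(share * sorted_y.size))  # observations to trim
        mean_drop_top = sorted_y[: sorted_y.size - k].mean()
        mean_drop_bottom = sorted_y[k:].mean()
        return mean_drop_top, mean_drop_bottom

    if p1 >= p0:
        # Treated arm has excess selection: trim treated outcomes
        low_mean, high_mean = trimmed_means(y1, (p1 - p0) / p1)
        return low_mean - y0.mean(), high_mean - y0.mean()
    # Control arm has excess selection: trim control outcomes instead
    low_mean, high_mean = trimmed_means(y0, (p0 - p1) / p0)
    return y1.mean() - high_mean, y1.mean() - low_mean

# Simulation: 60% always observed, 20% observed only if treated, 20% never
rng = np.random.default_rng(0)
n = 4000
d = rng.random(n) < 0.5
group = rng.choice(["always", "induced", "never"], size=n, p=[0.6, 0.2, 0.2])
base_wage = np.where(group == "always", 10.0, 9.0) + rng.normal(size=n)
effect = 1.0
s = (group == "always") | ((group == "induced") & d)
y = np.where(s, base_wage + effect * d, np.nan)

lo, hi = lee_bounds(y, d, s)
naive = y[d & s].mean() - y[~d & s].mean()
print(f"Lee bounds: [{lo:.2f}, {hi:.2f}]  naive: {naive:.2f}  true: {effect}")
```

In this simulation the naive comparison of observed wages is biased downward because treatment pulls lower-wage workers into the sample, while the trimming interval still covers the true effect for the always-observed group.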

I like Lee bounds because they are easy to explain and easy to audit. The practical warning is equally simple: if treatment plausibly pushes some units into the sample and others out, the monotone-selection logic breaks.

Bottom Line

Bounds are not a consolation prize. They are the right estimand when the data do not support point identification.
Manski bounds are the benchmark because they show what your design identifies before assumptions start doing the heavy lifting.
Monotonicity restrictions, Lee trimming, and Balke-Pearl bounds can be very informative, but only when their substantive assumptions are defensible.
Wide bounds are often the most important empirical result in the paper because they reveal how little the design alone can rule out.

Where to Learn More

For a broad introduction, I would start with Manski’s Partial Identification of Probability Distributions, which remains the cleanest entry point into the logic of identification regions. Manski and Pepper (2000) is the canonical reference for monotone restrictions such as MTR and MIV. Balke and Pearl (1997) is still the core paper for noncompliance bounds, while Lee (2009) is the practical workhorse for attrition and sample selection.

References

Balke, A., & Pearl, J. (1997). Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439), 1171-1176.

Kowalski, A. E. (2016). Doing more when you’re running LATE: Applying marginal treatment effect methods to examine treatment effect heterogeneity in experiments. American Economic Journal: Applied Economics, 8(2), 1-17.

Lee, D. S. (2009). Training, wages, and sample selection: Estimating sharp bounds on treatment effects. Review of Economic Studies, 76(3), 1071-1102.

Manski, C. F. (1990). Nonparametric bounds on treatment effects. American Economic Review, 80(2), 319-323.

Manski, C. F. (2003). Partial Identification of Probability Distributions. Springer.

Manski, C. F., & Pepper, J. V. (2000). Monotone instrumental variables: With an application to the returns to schooling. Econometrica, 68(4), 997-1010.

What OLS Estimates in Causal Inference

Wed, 01 Apr 2026 07:00:00 GMT

7 min read

Background

OLS is still the default causal estimator in a surprising amount of applied work. That is often understandable. Regression is simple, transparent, and often a reasonable first pass. The problem is interpretation. Once we move beyond randomized experiments with additive constant effects, the coefficient on treatment is not automatically the average treatment effect (ATE), or even an average treatment effect for a population we care about.

What makes this topic tricky is that there are really two separate questions. First, what population quantity does the OLS coefficient target? Second, under what assumptions can that quantity be interpreted causally? OLS itself does not assume a potential outcomes framework. It solves a least-squares projection problem. Potential outcomes enter only when we try to map that projection coefficient to objects like the ATE, ATT, or ATU.

Several somewhat related papers sharpen this distinction. This note provides a brief overview of some of the key developments in our understanding of OLS in causal inference. Taken together, these results explain both why OLS can be useful and why its causal interpretation is often more delicate than practitioners realize.

Notation

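Let $Y$ be the observed outcome, $D \in \{0, 1\}$ a treatment indicator, and $X$ a vector of covariates. Potential outcomes are $Y(1)$ and $Y(0)$, so

$$Y = D\,Y(1) + (1 - D)\,Y(0).$$

Define the conditional mean functions

$$\mu_d(x) = E[Y(d) \mid X = x], \quad d \in \{0, 1\},$$

and the usual causal targets

$$\text{ATE} = E[Y(1) - Y(0)], \quad \text{ATT} = E[Y(1) - Y(0) \mid D = 1], \quad \text{ATU} = E[Y(1) - Y(0) \mid D = 0].$$

Now consider the linear regression

$$Y = \alpha + \tau D + X'\beta + \varepsilon.$$

The coefficient $\tau$ is the population linear projection coefficient on $D$. By Frisch-Waugh-Lovell,

$$\tau = \frac{E[(D - L(D \mid X))\,Y]}{E[(D - L(D \mid X))^2]},$$

where $L(D \mid X)$ is the best linear predictor of $D$ using $X$ (and a constant). This expression is purely statistical.

The causal question is whether $\tau$ coincides with a treatment effect parameter under additional assumptions.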
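
To make the Frisch-Waugh-Lovell step concrete, here is a small numerical check in its sample analogue. The simulated data and all variable names are my own illustrative assumptions. The only point is that the coefficient on the treatment from the full regression equals the one obtained by first residualizing the treatment on the covariates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 2))
D = (X[:, 0] + rng.normal(size=n) > 0).astype(float)
Y = 1.0 + 2.0 * D + X @ np.array([0.5, -0.3]) + rng.normal(size=n)

# Full regression of Y on (1, D, X); the treatment coefficient is entry 1
Z = np.column_stack([np.ones(n), D, X])
tau_full = np.linalg.lstsq(Z, Y, rcond=None)[0][1]

# FWL: residualize D on (1, X), then regress Y on that residual alone
W = np.column_stack([np.ones(n), X])
D_resid = D - W @ np.linalg.lstsq(W, D, rcond=None)[0]
tau_fwl = (D_resid @ Y) / (D_resid @ D_resid)

print(tau_full, tau_fwl)
```

Nothing in this calculation refers to potential outcomes, which is exactly why the expression is "purely statistical."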

A Closer Look

Regression Is a Projection, Not a Causal Model

This is the first point I would emphasize in practice. Writing down

$$Y = \alpha + \tau D + X'\beta + \varepsilon$$

does not, by itself, assume homogeneous treatment effects or even invoke potential outcomes. It simply defines the best linear predictor of $Y$ given $D$ and $X$. If the goal is prediction, that is the end of the story.

For causal interpretation, however, we need more. Under random assignment or selection on observables, plus enough structure on how outcomes vary with $X$, the projection coefficient may line up with a causal estimand. Under constant treatment effects and correct linear adjustment, that estimand is the ATE. Once treatment effects vary with $X$, the coefficient generally becomes a weighted average of heterogeneous effects rather than their simple average.

Aronow and Samii: Asymptotic View

Aronow and Samii (2016) show that regression-adjusted estimators need not be representative of the sample as a whole. In large samples, the estimand targeted by regression can be written as a weighted average of conditional treatment effects, where the weights depend on how treatment assignment varies with covariates and on the linear adjustment built into the regression.

The key practical point is that OLS does not weight covariate strata equally. These weights are proportional to residualized treatment variation (via FWL), not to the precision of outcome estimates. In particular, they do not correspond to inverse-variance weights in general. So even under ignorability, the regression coefficient need not correspond to the ATE for the empirical covariate distribution. It is often better understood as an ATE for an implicit reweighted population. That is a subtle point, but it matters whenever overlap is uneven or the linear model fits some regions of the covariate space much better than others.

Chattopadhyay and Zubizarreta: Finite-Sample View

One limitation of the Aronow-Samii perspective is that it is asymptotic. Chattopadhyay and Zubizarreta (2023) go further by showing that common linear regression estimators admit exact finite-sample weighting representations. For a regression-adjusted ATE estimator,

$$\hat{\tau} = \sum_{i: D_i = 1} w_i Y_i - \sum_{i: D_i = 0} w_i Y_i,$$

where the weights $w_i$ are functions of only $X$ and $D$, not the realized outcomes.

This is useful for two reasons. First, it makes regression adjustment look less mysterious: OLS is implicitly constructing a weighted comparison between treated and control outcomes. Second, the implied weights can be inspected directly. In their framework, the weights clarify when regression adjustment achieves exact balance on included covariates, how dispersed the weights are, and whether the regression is targeting a population that still looks like the observed sample. That is a much more practical diagnostic than simply reporting a coefficient table.
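
A quick way to see these implied weights is to build them from the residualized treatment. The sketch below is my own FWL-based construction for the single-regression case, not code from the paper, and the simulated data and variable names are assumptions. It checks three things: the weighted contrast reproduces the OLS coefficient, the weights sum to one within each arm, and the weighted covariate means match exactly across arms.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
X = rng.normal(size=(n, 2))
D = (0.8 * X[:, 0] + rng.normal(size=n) > 0.5).astype(float)
Y = 1.0 + D * (1.0 + X[:, 0]) + X[:, 1] + rng.normal(size=n)

# Residualize treatment on (1, X); the outcome Y is never used here
W = np.column_stack([np.ones(n), X])
r = D - W @ np.linalg.lstsq(W, D, rcond=None)[0]
signed_w = r / (r @ r)

# Split into per-arm weights (control weights flip sign)
w_treated = signed_w[D == 1]
w_control = -signed_w[D == 0]

# Weighted treated-minus-control contrast versus the OLS coefficient
tau_weights = w_treated @ Y[D == 1] - w_control @ Y[D == 0]
tau_ols = np.linalg.lstsq(np.column_stack([W, D]), Y, rcond=None)[0][-1]

# Difference in weighted covariate means between arms
balance_gap = w_treated @ X[D == 1] - w_control @ X[D == 0]

print(tau_weights, tau_ols)
print(w_treated.sum(), w_control.sum(), balance_gap)
print("negative weights:", (w_treated < 0).sum() + (w_control < 0).sum())
```

The last line is the diagnostic I find most useful in practice: negative or highly dispersed weights are a sign that the regression is extrapolating rather than comparing like with like.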

Słoczyński: Heterogeneous Effects View

Słoczyński (2022) asks what the OLS coefficient means when treatment effects are heterogeneous. His central result is that the coefficient on treatment is generally not the ATE. Instead, it is a convex combination of two group-specific effect parameters that, under additional conditions, can be interpreted as the ATT and the ATU. The striking part is the weighting: the smaller treatment arm gets the larger implicit weight.

So if treated units are rare, OLS tends to lean toward effects for treated units. If treated units are common, it leans toward effects for untreated units. The exact formula depends on the specification and on how treatment assignment varies with covariates, but the qualitative message is robust: heterogeneity changes the target, and OLS can overweight the effect for the smaller group.

This is one of those results that sounds surprising at first and obvious in hindsight. Regression learns treatment effects from residual variation in treatment status. When one group is small, comparisons involving that group carry disproportionate identifying content. The practical implication is straightforward: if you care specifically about the ATE or ATT, you should not assume OLS is giving it to you just because the regression includes controls.
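
A small simulation makes the direction of the tilt visible. The design below is entirely my own assumption: one binary covariate, treatment that is rare overall, and unit-level effects that differ by covariate value. In the population this design has ATE = 2, ATT ≈ 2.67, and ATU ≈ 1.88, while the OLS estimand is about 2.60.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.binomial(1, 0.5, size=n)
p = np.where(x == 1, 0.25, 0.05)          # treated units are rare overall
d = rng.binomial(1, p)
effect = np.where(x == 1, 3.0, 1.0)       # heterogeneous unit-level effects
y0 = 0.5 * x + rng.normal(size=n)
y = y0 + d * effect

# OLS of y on (1, d, x); the treatment coefficient is entry 1
Z = np.column_stack([np.ones(n), d, x])
tau_ols = np.linalg.lstsq(Z, y, rcond=None)[0][1]

# Sample analogues of the causal targets (known here because effects are simulated)
ate = effect.mean()
att = effect[d == 1].mean()
atu = effect[d == 0].mean()

print(f"OLS: {tau_ols:.3f}  ATE: {ate:.3f}  ATT: {att:.3f}  ATU: {atu:.3f}")
```

With only 15 percent of units treated, the OLS coefficient sits close to the ATT and far from both the ATE and the ATU, even though the regression controls for the only confounder.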

Angrist and Pischke: Saturated Model View

The cleanest interpretation of regression comes from saturated models with discrete covariates, an approach emphasized by Angrist and coauthors. If $X$ takes only a small number of values and the regression fully saturates those cells, then OLS is just averaging within-cell treatment-control differences. In that case, regression is a dressed-up version of exact matching.

That perspective is helpful because it shows where the causal content comes from. The coefficient is credible when comparisons are being made within genuinely comparable covariate cells. But it also shows the limitation immediately: with continuous or high-dimensional covariates, literal saturation is impossible and the argument breaks down. At that point, OLS is no longer exact within-cell adjustment. It is a parametric approximation that extrapolates across covariate values. That is often reasonable, but it is no longer harmless.
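
The saturated case can be verified directly. In the sketch below (simulated data and names are my own assumptions), the coefficient from a regression on treatment plus a full set of cell dummies matches a weighted average of within-cell differences in means, with each cell weighted by its size times the within-cell variance of treatment.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3_000
cell = rng.integers(0, 3, size=n)                   # discrete covariate with 3 cells
p_cell = np.array([0.2, 0.5, 0.8])[cell]
d = rng.binomial(1, p_cell)
y = cell + d * np.array([1.0, 2.0, 4.0])[cell] + rng.normal(size=n)

# Saturated-in-X regression: treatment indicator plus one dummy per cell
dummies = (cell[:, None] == np.arange(3)).astype(float)
tau_ols = np.linalg.lstsq(np.column_stack([d, dummies]), y, rcond=None)[0][0]

# Within-cell differences in means, weighted by n_x * p_x * (1 - p_x)
diffs, weights = [], []
for c in range(3):
    in_cell = cell == c
    p_hat = d[in_cell].mean()
    diffs.append(y[in_cell & (d == 1)].mean() - y[in_cell & (d == 0)].mean())
    weights.append(in_cell.sum() * p_hat * (1 - p_hat))
tau_weighted = np.average(diffs, weights=weights)

print(tau_ols, tau_weighted)
```

So even in the cleanest case the average is not a simple one: cells where treatment is closer to a coin flip count for more.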

Bottom Line

OLS does not inherently estimate a causal effect. It estimates a linear projection coefficient that becomes causal only under additional assumptions.
Aronow and Samii show that regression adjustment targets a weighted causal estimand in large samples rather than automatically targeting the sample ATE.
Chattopadhyay and Zubizarreta make this weighting interpretation exact in finite samples and turn it into a useful diagnostic tool.
With heterogeneous treatment effects, Słoczyński shows that OLS becomes a weighted average of group-specific effects, often interpretable as ATT- and ATU-type objects, and the smaller treatment arm gets more weight.
Saturated regressions with discrete covariates are the clean benchmark. With continuous $X$, standard OLS necessarily relies on approximation and implicit weighting.

Where to Learn More

Aronow and Samii (2016) is the right place to start if you want the representativeness argument behind regression adjustment. Chattopadhyay and Zubizarreta (2023) is the most useful paper for understanding exact implied weights in finite samples. Słoczyński (2022) is now the canonical reference on how heterogeneous treatment effects distort the interpretation of the OLS coefficient. For the saturated-regression perspective, I would still point readers to Angrist and Pischke (2009), which makes clear why exact matching logic breaks down once covariates become continuous.

References

Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.

Aronow, P. M., & Samii, C. (2016). Does regression produce representative estimates of causal effects? American Journal of Political Science, 60(1), 250-267.

Chattopadhyay, A., & Zubizarreta, J. R. (2023). On the implied weights of linear regression for causal inference. Biometrika, 110(3), 615-629.

Słoczyński, T. (2022). Interpreting OLS estimands when treatment effects are heterogeneous: Smaller groups get larger weights. Review of Economics and Statistics, 104(3), 501-509.

The Many Flavors of Lasso

Sat, 14 Mar 2026 07:00:00 GMT

9 min read

Background

The Lasso (Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani in 1996, has become one of the go-to tools for variable selection and shrinkage in regression problems. But the classic Lasso is just the starting point. Over the years, researchers have developed many variants of Lasso, each designed to address specific limitations or tailor the method to different kinds of data structures.

This article provides a tour of the most popular flavors of Lasso, from standard $\ell_1$-penalized regression to modern adaptations like Adaptive Lasso, Elastic Net, Square-root Lasso, and more. For each version, I’ll lay out the objective function, describe when it’s applicable, and summarize its key characteristics.

Notation

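Before diving into the variants, let’s revisit what makes Lasso special. In a standard linear regression setup, we model

$$y = X\beta + \varepsilon,$$

where:

$y \in \mathbb{R}^n$ is the outcome,
$X \in \mathbb{R}^{n \times p}$ is our design matrix,
$\beta \in \mathbb{R}^p$ are the coefficients, and
$\varepsilon \in \mathbb{R}^n$ is the error term.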

Traditional ordinary least squares (OLS) minimizes the sum of squared residuals without any constraint on the coefficients.

A Closer Look

Standard Lasso

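The standard Lasso solves the following optimization problem:

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1,$$

where $\lambda \geq 0$ controls the strength of the penalty.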

The appeal of Lasso is straightforward: it trades a convex penalty for exact zeros in the solution. In moderately high dimensions, this often works surprisingly well as a first pass.

The main issue shows up when predictors are correlated. Lasso will typically pick one variable from a correlated group and ignore the rest, and which one it picks can be unstable across folds or small perturbations of the data. At the same time, all coefficients are shrunk, including the large ones, which introduces bias that doesn’t go away even with large samples.

In practice, I treat standard Lasso as a baseline rather than a final model. If it’s stable and predictive, great. If not, it’s usually pointing to a structural issue in the design.
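
Before turning to library code, it can help to see where the exact zeros come from. The sketch below is a bare-bones cyclic coordinate descent for the objective scaled by $1/(2n)$, with no intercept and no standardization. The function names and simulated data are my own assumptions, and this is not how glmnet or scikit-learn are implemented in detail. Each coordinate update is a soft-thresholding step, and any coefficient whose partial correlation with the residual falls below $\lambda$ is set exactly to zero.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()                                  # residual for b = 0
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]                   # remove feature j's contribution
            rho = X[:, j] @ resid / n                 # partial correlation with residual
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * b[j]                   # add the updated contribution back
    return b

rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(size=n)

b_hat = lasso_cd(X, y, lam=0.2)
print(np.round(b_hat, 3))
```

The printed coefficients also show the shrinkage bias discussed above: the large coefficients are pulled toward zero by roughly $\lambda$.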

library(glmnet)

# Simulate data
set.seed(1988)
n <- 100
p <- 20
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 1.5, rep(0, p - 3))  # Only 3 non-zero coefficients
y <- X %*% beta_true + rnorm(n)

# Fit standard Lasso
lasso_fit <- glmnet(X, y, alpha = 1)  # alpha = 1 for Lasso

# Cross-validation to select lambda
cv_fit <- cv.glmnet(X, y, alpha = 1)
lambda_opt <- cv_fit$lambda.min

# Get coefficients at optimal lambda
coef(cv_fit, s = "lambda.min")

from sklearn.linear_model import Lasso, LassoCV
from sklearn.datasets import make_regression
import numpy as np

# Simulate data
np.random.seed(1988)
X, y, coef_true = make_regression(n_samples=100, n_features=20, 
                                   n_informative=3, coef=True, 
                                   noise=1.0, random_state=123)

# Fit Lasso with cross-validation
lasso = LassoCV(cv=5, random_state=123)
lasso.fit(X, y)

# Display results
print(f"Optimal lambda: {lasso.alpha_:.4f}")
print(f"Number of non-zero coefficients: {np.sum(lasso.coef_ != 0)}")
print(f"Selected coefficients:\n{lasso.coef_[lasso.coef_ != 0]}")

Adaptive Lasso

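Adaptive Lasso extends the standard Lasso by using data-driven weights for each coefficient:

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j|,$$

where $w_j = 1 / |\tilde{\beta}_j|^{\gamma}$ and $\tilde{\beta}$ comes from an initial estimator like OLS or Ridge.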

The idea here is to penalize coefficients unevenly. Variables that look important in a first-stage model get penalized less, while weaker ones get pushed harder toward zero. This reduces the bias on large coefficients and improves variable selection consistency under certain conditions.

In practice, Adaptive Lasso is less about prediction and more about recovering a meaningful support. If you care about which variables are selected—not just the predictive accuracy—it’s often worth the extra step.

# Continue from previous example
library(glmnet)

# Step 1: Get initial estimates using Ridge
ridge_fit <- glmnet(X, y, alpha = 0)  # alpha = 0 for Ridge
cv_ridge <- cv.glmnet(X, y, alpha = 0)
beta_init <- as.vector(coef(cv_ridge, s = "lambda.min"))[-1]  # Remove intercept

# Step 2: Compute adaptive weights
gamma <- 1  # Common choice
weights <- 1 / (abs(beta_init) + 1e-8)^gamma  # Add small constant to avoid division by zero

# Step 3: Fit Adaptive Lasso
adaptive_lasso <- glmnet(X, y, alpha = 1, penalty.factor = weights)
cv_adaptive <- cv.glmnet(X, y, alpha = 1, penalty.factor = weights)

# Compare coefficients
cat("Standard Lasso non-zero:", sum(coef(cv_fit, s = "lambda.min")[-1] != 0), "\n")
cat("Adaptive Lasso non-zero:", sum(coef(cv_adaptive, s = "lambda.min")[-1] != 0), "\n")

from sklearn.linear_model import Ridge, Lasso

# Step 1: Get initial estimates using Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
beta_init = ridge.coef_

# Step 2: Compute adaptive weights
gamma = 1
weights = 1 / (np.abs(beta_init) + 1e-8)**gamma

# Step 3: Fit Adaptive Lasso (manual implementation via weighted penalty)
# Scale features by weights
X_weighted = X / weights

# Fit Lasso on weighted features
adaptive_lasso = Lasso(alpha=0.1)
adaptive_lasso.fit(X_weighted, y)

# Transform back to original scale
adaptive_coef = adaptive_lasso.coef_ / weights

print(f"Standard Lasso non-zero: {np.sum(lasso.coef_ != 0)}")
print(f"Adaptive Lasso non-zero: {np.sum(adaptive_coef != 0)}")

Relaxed Lasso

Relaxed Lasso separates selection from estimation. First, run Lasso to pick variables; then refit on that subset, either with OLS or partial shrinkage via a parameter $\phi \in [0, 1]$. At $\phi = 1$ you recover Lasso, and at $\phi = 0$ you get post-selection OLS.

The point is to reduce shrinkage bias. Lasso is good at finding the support but tends to underestimate large coefficients. Relaxing the penalty after selection keeps sparsity while improving estimates.

In practice, this works well when you trust the selected variables but want better coefficient accuracy. The main risk is overfitting if too many variables are selected, so it’s worth tuning both $\lambda$ and $\phi$.

library(glmnet)  # relaxed Lasso is built in via relax = TRUE; no extra package needed

# Fit relaxed Lasso using glmnet (has built-in support)
relaxed_fit <- glmnet(X, y, alpha = 1, relax = TRUE)
cv_relaxed <- cv.glmnet(X, y, alpha = 1, relax = TRUE)

# Manual two-stage approach
# Stage 1: Standard Lasso selection
lasso_coef <- coef(cv_fit, s = "lambda.min")[-1]
selected <- which(lasso_coef != 0)

# Stage 2: OLS on selected variables
if (length(selected) > 0) {
  X_selected <- X[, selected]
  ols_fit <- lm(y ~ X_selected)
  
  # Compare coefficients
  cat("Lasso coefficients (selected):\n")
  print(lasso_coef[selected])
  cat("\nRelaxed (OLS) coefficients:\n")
  print(coef(ols_fit)[-1])
}

# Manual two-stage relaxed Lasso
from sklearn.linear_model import LinearRegression

# Stage 1: Lasso selection
lasso_coef = lasso.coef_
selected = np.where(lasso_coef != 0)[0]

print(f"Lasso selected {len(selected)} variables")

# Stage 2: OLS on selected variables
if len(selected) > 0:
    X_selected = X[:, selected]
    ols = LinearRegression()
    ols.fit(X_selected, y)
    
    # Compare coefficient magnitudes
    print(f"\nLasso coefficients (mean abs): {np.abs(lasso_coef[selected]).mean():.4f}")
    print(f"Relaxed coefficients (mean abs): {np.abs(ols.coef_).mean():.4f}")
    
    # Often relaxed coefficients are larger in magnitude

Square-root Lasso

Square-root Lasso, also known as Scaled Lasso, modifies the objective function to:

$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2 + \lambda \|\beta\|_1.$$

The crucial difference from standard Lasso is using the $\ell_2$ norm directly (without squaring) in the loss term. This seemingly small change has important consequences: the estimator becomes scale-invariant, meaning you don’t need to estimate or know the error variance to set the penalty parameter appropriately. In standard Lasso, the optimal choice of $\lambda$ depends on the unknown noise level, but square-root Lasso eliminates this dependence.

This variant is particularly valuable when you have unknown or heteroskedastic error variance, making it robust to variance misspecification. The scale-invariance also simplifies tuning: you can use theoretically-motivated choices for $\lambda$ without prior knowledge of the noise level. In practice, this often translates to more stable selection across different datasets and makes the method especially appealing in settings where variance estimation is challenging or the homoskedasticity assumption is questionable.

library(scalreg)  # For square-root Lasso

# Fit square-root Lasso
sqrt_lasso <- scalreg(X, y)

# Compare with standard Lasso
cat("Standard Lasso selected:", sum(coef(cv_fit, s = "lambda.min")[-1] != 0), "variables\n")
cat("Square-root Lasso selected:", sum(sqrt_lasso$coefficients != 0), "variables\n")

# Square-root Lasso is not in sklearn, so we solve the convex program
# directly with CVXPY (if available), reusing X and y from the earlier example
try:
    import cvxpy as cp
    
    # Define variables
    beta = cp.Variable(X.shape[1])
    
    # Define objective: ||y - X*beta||_2 + lambda * ||beta||_1
    lambda_sqrt = 0.1
    objective = cp.Minimize(cp.norm(y - X @ beta, 2) + lambda_sqrt * cp.norm(beta, 1))
    
    # Solve
    prob = cp.Problem(objective)
    prob.solve()
    
    sqrt_lasso_coef = beta.value
    print(f"Square-root Lasso selected {np.sum(np.abs(sqrt_lasso_coef) > 1e-6)} variables")
    
except ImportError:
    print("Square-root Lasso requires cvxpy package")
    print("Install with: pip install cvxpy")

Elastic Net

Elastic Net blends $\ell_1$ and $\ell_2$ regularization by minimizing:

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2.$$

This is often reparametrized as $\lambda \left[ \alpha \|\beta\|_1 + \tfrac{1 - \alpha}{2} \|\beta\|_2^2 \right]$, where $\alpha \in [0, 1]$ controls the mixing between $\ell_1$ and $\ell_2$ penalties.

Elastic Net fixes a key issue with Lasso: when predictors are highly correlated, Lasso tends to pick one arbitrarily and ignore the rest. Adding an $\ell_2$ penalty induces a grouping effect, so correlated variables enter or leave together, while the $\ell_1$ term still enforces sparsity.

This makes it a better default in settings with multicollinearity, which is common in practice. The mixing parameter $\alpha$ controls the trade-off: closer to 1 behaves like Lasso, closer to 0 like Ridge. In practice, moderate values (e.g. $\alpha = 0.5$) work well, with cross-validation refining the choice.

library(glmnet)

# Create correlated predictors to demonstrate Elastic Net advantage
set.seed(1988)
n <- 100
X_base <- matrix(rnorm(n * 5), n, 5)
# Add correlated predictors
X_corr <- cbind(X_base, X_base[, 1:2] + matrix(rnorm(n * 2, sd = 0.1), n, 2))
beta_true <- c(2, -1.5, 0, 0, 0, 2.2, -1.3)  # True coefficients for correlated pairs
y_corr <- X_corr %*% beta_true + rnorm(n)

# Fit Elastic Net with alpha = 0.5 (equal mix of L1 and L2 penalties)
elastic_fit <- cv.glmnet(X_corr, y_corr, alpha = 0.5)

# Compare with pure Lasso (alpha = 1)
lasso_corr <- cv.glmnet(X_corr, y_corr, alpha = 1)

cat("Elastic Net coefficients:\n")
print(coef(elastic_fit, s = "lambda.min"))
cat("\nLasso coefficients:\n")
print(coef(lasso_corr, s = "lambda.min"))

from sklearn.linear_model import ElasticNet, ElasticNetCV

# Create correlated predictors
np.random.seed(1988)
n = 100
X_base = np.random.randn(n, 5)
X_corr = np.hstack([X_base, X_base[:, :2] + np.random.randn(n, 2) * 0.1])
beta_true = np.array([2, -1.5, 0, 0, 0, 2.2, -1.3])
y_corr = X_corr @ beta_true + np.random.randn(n)

# Fit Elastic Net with l1_ratio = 0.5 (equal mix)
elastic = ElasticNetCV(l1_ratio=0.5, cv=5)
elastic.fit(X_corr, y_corr)

# Compare with Lasso
lasso_corr = LassoCV(cv=5)
lasso_corr.fit(X_corr, y_corr)

print("Elastic Net coefficients:")
print(elastic.coef_)
print(f"\nElastic Net selected {np.sum(elastic.coef_ != 0)} variables")
print(f"Lasso selected {np.sum(lasso_corr.coef_ != 0)} variables")

Group Lasso

Group Lasso extends the penalty to operate on predefined groups of variables:

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \sum_{g=1}^{G} \|\beta_g\|_2,$$

where $\beta_g$ represents the coefficients belonging to group $g$, and $\|\cdot\|_2$ is the $\ell_2$ norm applied within each group.

The key insight is that the $\ell_2$ norm within groups combined with summation across groups creates a sparsity-inducing penalty at the group level. Either all coefficients in a group are set to zero, or all are kept (though possibly shrunk). This “all or nothing” behavior respects the natural grouping structure in your data.

Group Lasso is useful when variables come in meaningful groups. A common example is categorical features encoded as dummies—you usually want to include or exclude the whole variable, not individual levels. Similar structure appears in multi-task settings or grouped scientific measurements.

Instead of sparsity at the coefficient level, Group Lasso selects entire groups while allowing dense coefficients within them. This makes the model align better with how features are constructed.

library(grpreg)

# Create data with natural groups
# Suppose we have 3 categorical variables with 3, 4, and 5 levels
set.seed(1988)
n <- 100
X1 <- model.matrix(~ factor(sample(1:3, n, replace = TRUE)) - 1)
X2 <- model.matrix(~ factor(sample(1:4, n, replace = TRUE)) - 1)
X3 <- model.matrix(~ factor(sample(1:5, n, replace = TRUE)) - 1)
X_grouped <- cbind(X1, X2, X3)

# Define groups (which columns belong to which group)
groups <- c(rep(1, 3), rep(2, 4), rep(3, 5))

# True model: only group 1 and 3 are relevant
beta_true <- c(2, -1, 1.5, rep(0, 4), 1, -0.5, 0.8, 1.2, -1)
y_grouped <- X_grouped %*% beta_true + rnorm(n)

# Fit Group Lasso
group_lasso <- cv.grpreg(X_grouped, y_grouped, group = groups, penalty = "grLasso")

cat("Group Lasso coefficients by group:\n")
coefs <- coef(group_lasso)[-1]  # coef() on a cv.grpreg fit defaults to lambda.min
for (g in unique(groups)) {
  cat(sprintf("Group %d: %d non-zero out of %d\n", 
              g, sum(coefs[groups == g] != 0), sum(groups == g)))
}

# Note: scikit-learn has no Group Lasso estimator; a true Group Lasso
# requires a specialized package. Here we only set up the grouped data.

# Simulate grouped structure
np.random.seed(1988)
n = 100
# Create 3 groups with 3, 4, 5 features each
X_grouped = np.random.randn(n, 12)
groups = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3])

# True coefficients (group 1 and 3 active, group 2 zero)
beta_true = np.array([2, -1, 1.5, 0, 0, 0, 0, 1, -0.5, 0.8, 1.2, -1])
y_grouped = X_grouped @ beta_true + np.random.randn(n)

# For true Group Lasso, would need package like 'group-lasso' or 'celer'
# Here we show conceptual grouping with manual implementation
print("For Python Group Lasso, install specialized packages:")
print("  pip install group-lasso")
print("  pip install celer")
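
Since the Python block above stops short of fitting anything, here is a from-scratch sketch of Group Lasso using proximal gradient descent in NumPy. The solver choice, function names, and tuning values are my own assumptions for illustration. A dedicated package will be faster and more careful about details such as group-size weights and intercepts. The key step is group soft-thresholding, which either zeroes out a whole group or shrinks it as a block.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Shrink the whole vector v toward zero; return exact zeros if ||v|| <= t."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1 - t / norm) * v

def group_lasso_pg(X, y, groups, lam, n_iter=2000):
    """Proximal gradient for (1/2n)||y - Xb||^2 + lam * sum_g ||b_g||_2."""
    n, p = X.shape
    b = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2              # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - step * grad                           # plain gradient step
        for g in np.unique(groups):
            idx = groups == g
            b[idx] = group_soft_threshold(z[idx], step * lam)
    return b

rng = np.random.default_rng(5)
n = 200
groups = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3])
X_grouped = rng.normal(size=(n, 12))
beta_true = np.array([2, -1, 1.5, 0, 0, 0, 0, 1, -0.5, 0.8, 1.2, -1])
y_grouped = X_grouped @ beta_true + rng.normal(size=n)

b_group = group_lasso_pg(X_grouped, y_grouped, groups, lam=0.4)
for g in np.unique(groups):
    idx = groups == g
    print(f"Group {g}: {np.sum(b_group[idx] != 0)} non-zero out of {idx.sum()}")
```

The printout makes the "all or nothing" behavior visible: within a group, coefficients are either all zero or all non-zero.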

Fused Lasso

Fused Lasso adds a penalty on differences between adjacent coefficients:

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|.$$

This method introduces two types of penalties: the standard $\ell_1$ penalty encourages overall sparsity (setting coefficients to zero), while the fusion penalty encourages adjacent coefficients to be equal. Nearby coefficients in the ordering are pulled toward each other, creating piecewise-constant patterns in the coefficient profile.

Fused Lasso is useful when features have a natural ordering and coefficients are expected to vary smoothly or in blocks, rather than independently of their neighbors.

This shows up in time series, spatial data, or ordered genomic features. The two penalties control the trade-off: $\lambda_1$ drives sparsity, while $\lambda_2$ controls how strongly adjacent coefficients are fused.

library(genlasso)

# Simulate data with ordered features (e.g., time series or spatial)
set.seed(1988)
n <- 100
p <- 50

# Create design matrix with ordered features
X_ordered <- matrix(rnorm(n * p), n, p)

# True coefficients with piecewise constant structure
beta_true <- c(rep(0, 10), rep(2, 15), rep(0, 10), rep(-1.5, 10), rep(0, 5))
y_ordered <- X_ordered %*% beta_true + rnorm(n)

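# Fit Fused Lasso. fusedlasso() needs a penalty matrix D or a graph,
# so use the 1d wrapper for features ordered along a line
# (gamma = 0, the default, applies the fusion penalty only)
fused_fit <- fusedlasso1d(as.numeric(y_ordered), X = X_ordered)

# Get coefficients at a specific lambda along the solution path
lambda_idx <- min(50, length(fused_fit$lambda))  # Example index
coefs_fused <- as.numeric(coef(fused_fit, lambda = fused_fit$lambda[lambda_idx])$beta)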

# Visualize coefficient profile
plot(coefs_fused, type = "s", 
     main = "Fused Lasso Coefficient Profile",
     xlab = "Feature Index", ylab = "Coefficient",
     col = "blue", lwd = 2)
lines(beta_true, col = "red", lty = 2, lwd = 2)
legend("topright", c("Estimated", "True"), 
       col = c("blue", "red"), lty = c(1, 2))

# Standard Lasso baseline on ordered features (sklearn has no fused penalty)
from sklearn.linear_model import Lasso
import matplotlib.pyplot as plt

# Simulate ordered features
np.random.seed(1988)
n, p = 100, 50
X_ordered = np.random.randn(n, p)

# Piecewise constant true coefficients
beta_true = np.concatenate([
    np.zeros(10), np.full(15, 2), np.zeros(10), 
    np.full(10, -1.5), np.zeros(5)
])
y_ordered = X_ordered @ beta_true + np.random.randn(n)

# Standard Lasso (for comparison)
lasso_ordered = Lasso(alpha=0.1)
lasso_ordered.fit(X_ordered, y_ordered)

# This comparison plot shows the standard Lasso fit, not a fused fit;
# a true Fused Lasso needs a solver that handles the fusion penalty
plt.figure(figsize=(10, 5))
plt.plot(beta_true, 'r--', label='True', linewidth=2)
plt.plot(lasso_ordered.coef_, 'b-', label='Standard Lasso', alpha=0.7)
plt.xlabel('Feature Index')
plt.ylabel('Coefficient Value')
plt.title('Coefficient Profile: Standard Lasso vs. True Piecewise-Constant Signal')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

print("For a true Fused Lasso in Python, one option is to write the")
print("objective in cvxpy and add the fusion penalty explicitly.")
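
For readers who want an actual fused fit without extra dependencies, the sketch below solves the Fused Lasso objective with ADMM in plain NumPy. ADMM is my own choice here (it is not the path algorithm used by genlasso), and the function names, penalty values, and iteration count are illustrative assumptions rather than tuned settings.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fused_lasso_admm(X, y, lam1, lam2, rho=1.0, n_iter=2000):
    """ADMM for (1/2n)||y - Xb||^2 + lam1*||b||_1 + lam2*sum_j |b_j - b_{j-1}|."""
    n, p = X.shape
    D = np.diff(np.eye(p), axis=0)                    # (p-1) x p first differences
    A = np.vstack([np.eye(p), D])                     # stacks b and its differences
    thresh = np.concatenate([np.full(p, lam1), np.full(p - 1, lam2)]) / rho
    M = np.linalg.inv(X.T @ X / n + rho * A.T @ A)
    Xty = X.T @ y / n
    z = np.zeros(2 * p - 1)
    u = np.zeros(2 * p - 1)
    for _ in range(n_iter):
        b = M @ (Xty + rho * A.T @ (z - u))           # ridge-like update for b
        z = soft_threshold(A @ b + u, thresh)         # sparsify b and its differences
        u = u + A @ b - z                             # scaled dual update
    return b

def fused_objective(X, y, b, lam1, lam2):
    n = X.shape[0]
    loss = np.sum((y - X @ b) ** 2) / (2 * n)
    return loss + lam1 * np.abs(b).sum() + lam2 * np.abs(np.diff(b)).sum()

rng = np.random.default_rng(6)
n, p = 100, 50
X_ordered = rng.normal(size=(n, p))
beta_true = np.concatenate([
    np.zeros(10), np.full(15, 2.0), np.zeros(10),
    np.full(10, -1.5), np.zeros(5)
])
y_ordered = X_ordered @ beta_true + rng.normal(size=n)

lam1, lam2 = 0.05, 0.2
b_fused = fused_lasso_admm(X_ordered, y_ordered, lam1, lam2)
b_ols = np.linalg.lstsq(X_ordered, y_ordered, rcond=None)[0]

print("Objective at fused fit:", fused_objective(X_ordered, y_ordered, b_fused, lam1, lam2))
print("Objective at OLS fit:  ", fused_objective(X_ordered, y_ordered, b_ols, lam1, lam2))
```

The returned coefficients are only approximately piecewise constant because ADMM enforces the structure through an auxiliary variable. Plotting `b_fused` against `beta_true` should still show the block pattern far more clearly than the standard Lasso profile.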

Graphical Lasso

Graphical Lasso applies $\ell_1$ penalization to the estimation of precision matrices (inverse covariance matrices):

$$\hat{\Theta} = \arg\max_{\Theta \succ 0} \; \log\det\Theta - \mathrm{tr}(S\Theta) - \lambda \|\Theta\|_1,$$

where $\Theta$ is the precision matrix, $S$ is the sample covariance matrix, and $\Theta \succ 0$ ensures positive definiteness.

Graphical Lasso shifts the focus from regression to covariance structure, estimating a sparse precision matrix. Under joint Gaussianity, a zero entry $\Theta_{ij} = 0$ means variables $i$ and $j$ are conditionally independent given the rest, so the model directly encodes a network of relationships.

This is useful when the goal is to recover dependency structure rather than predict an outcome, which is common in genomics, finance, or neuroscience. The $\ell_1$ penalty enforces sparsity, leading to interpretable graphs where most connections are absent. In practice, the main challenge is tuning $\lambda$ to balance fit and sparsity.
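
The link between zeros in the precision matrix and conditional independence is easy to check by hand. The toy matrix below is my own example: variables 0 and 2 have no direct edge, so their partial correlation is zero, yet they remain marginally correlated through variable 1.

```python
import numpy as np

# Toy precision matrix for a chain 0 - 1 - 2 (no direct 0 - 2 edge)
Theta = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.4],
    [0.0, 0.4, 1.0],
])

# Partial correlations: rho_ij = -Theta_ij / sqrt(Theta_ii * Theta_jj)
scale = np.sqrt(np.diag(Theta))
partial_corr = -Theta / np.outer(scale, scale)
np.fill_diagonal(partial_corr, 1.0)

# Implied covariance matrix: the (0, 2) entry is generally non-zero
Sigma = np.linalg.inv(Theta)

print("Partial correlation (0, 2):", partial_corr[0, 2])
print("Marginal covariance (0, 2):", Sigma[0, 2])
```

This is why thresholding a correlation matrix is not a substitute for estimating the precision matrix: marginal correlation picks up indirect paths.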

library(glasso)
library(igraph)

# Simulate multivariate data
set.seed(1988)
n <- 100
p <- 10

# Create a sparse precision matrix (true network structure)
Theta_true <- matrix(0, p, p)
diag(Theta_true) <- 1
# Add some conditional dependencies
Theta_true[1, 2] <- Theta_true[2, 1] <- 0.5
Theta_true[2, 3] <- Theta_true[3, 2] <- 0.4
Theta_true[4, 5] <- Theta_true[5, 4] <- 0.6
Theta_true[7, 8] <- Theta_true[8, 7] <- 0.3

# Generate data from this precision matrix
Sigma <- solve(Theta_true)
X_network <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

# Compute sample covariance
S <- cov(X_network)

# Fit Graphical Lasso
glasso_fit <- glasso(S, rho = 0.1)  # rho is the penalty parameter

# Extract estimated precision matrix
Theta_est <- glasso_fit$wi

# Visualize network
# Create adjacency matrix (thresholded)
adj_matrix <- (abs(Theta_est) > 0.01) * 1
diag(adj_matrix) <- 0

# Plot network
graph_obj <- graph_from_adjacency_matrix(adj_matrix, mode = "undirected")
plot(graph_obj, 
     main = "Estimated Conditional Dependence Network",
     vertex.size = 20,
     vertex.label.cex = 0.8)

from sklearn.covariance import GraphicalLassoCV
import networkx as nx
import matplotlib.pyplot as plt

# Simulate multivariate data
np.random.seed(1988)
n, p = 100, 10

# True sparse precision matrix
Theta_true = np.eye(p)
Theta_true[0, 1] = Theta_true[1, 0] = 0.5
Theta_true[1, 2] = Theta_true[2, 1] = 0.4
Theta_true[3, 4] = Theta_true[4, 3] = 0.6
Theta_true[6, 7] = Theta_true[7, 6] = 0.3

# Generate data
Sigma = np.linalg.inv(Theta_true)
X_network = np.random.multivariate_normal(np.zeros(p), Sigma, size=n)

# Fit Graphical Lasso with cross-validation
glasso = GraphicalLassoCV(cv=5)
glasso.fit(X_network)

# Get estimated precision matrix
Theta_est = glasso.precision_

# Visualize network
plt.figure(figsize=(10, 5))

# Create adjacency matrix (thresholded)
adj_matrix = (np.abs(Theta_est) > 0.01).astype(int)
np.fill_diagonal(adj_matrix, 0)

# Plot using networkx
G = nx.from_numpy_array(adj_matrix)
pos = nx.spring_layout(G, seed=123)

plt.subplot(1, 2, 1)
nx.draw(G, pos, with_labels=True, node_color='lightblue', 
        node_size=500, font_size=10, font_weight='bold')
plt.title('Estimated Network Structure')

# Show precision matrix heatmap
plt.subplot(1, 2, 2)
plt.imshow(Theta_est, cmap='RdBu_r', vmin=-1, vmax=1)
plt.colorbar(label='Precision Matrix Entry')
plt.title('Estimated Precision Matrix')
plt.tight_layout()
plt.show()

print(f"Sparsity: {np.sum(np.abs(Theta_est) < 0.01) / p**2:.2%}")

Bottom Line

The Lasso family has expanded to include specialized methods (e.g., Adaptive, Elastic Net, Group Lasso) that address unique challenges like bias reduction, feature correlation, grouping structures, and network discovery.
Selection depends on data characteristics—correlated predictors (Elastic Net), grouped features (Group Lasso), ordered data (Fused Lasso), or bias concerns (Adaptive/Relaxed Lasso)—while all share a core principle of sparsity-promoting penalization.
Despite their differences, all variants rely on penalized optimization to achieve simplicity, offering tailored solutions for different modeling needs.
Modern tools (R: glmnet, grpreg; Python: scikit-learn, group-lasso) make these methods widely available.

Where to Learn More

For a comprehensive treatment of penalized regression methods, see “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman (2009), which covers Lasso and many variants in detail. “Statistical Learning with Sparsity” by Hastie, Tibshirani, and Wainwright (2015) provides a more recent and focused treatment. For theoretical properties and high-dimensional asymptotics, Bühlmann and van de Geer’s “Statistics for High-Dimensional Data” (2011) is excellent, but too technical and dense for most readers.

References

Belloni, A., Chernozhukov, V., & Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika, 98(4), 791–806.
Bühlmann, P., & van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.
Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.
Meinshausen, N. (2007). Relaxed lasso. Computational Statistics & Data Analysis, 52(1), 374–393.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91–108.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.

The Oracle Property: What It Promises (and What It Doesn’t)

Fri, 13 Mar 2026 07:00:00 GMT

4 min read

Background

In high-dimensional regression, we sometimes hear that a method possesses the oracle property. The phrase sounds impressive: it suggests that an estimator behaves as if the true sparsity pattern were known in advance—hence the name, as though an oracle had revealed the true support beforehand.

This note explains what the oracle property actually means, why it is considered desirable, and where its practical relevance is limited. The goal is to distinguish asymptotic guarantees from practical performance. As usual, I introduce some notation so that the discussion rests on a clear mathematical foundation and a shared framework.

Notation

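Consider the linear model

\[ y = X\beta^* + \varepsilon, \]

with \(X \in \mathbb{R}^{n \times p}\) and \(p\) potentially large. Let the true parameter vector \(\beta^*\) be sparse:

\[ S = \{ j : \beta^*_j \neq 0 \}, \qquad s = |S| \ll p. \]

Put simply, \(S\) is the set of variables that are non-zero in the true parameter vector \(\beta^*\), and \(s\) is the number of non-zero variables.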

A Closer Look

Definition

An estimator \(\hat\beta\) is said to have the oracle property if it can do two things:

Selection consistency: \(P(\hat S = S) \to 1\) as \(n \to \infty\), where \(\hat S = \{ j : \hat\beta_j \neq 0 \}\).
Asymptotic efficiency: \(\sqrt{n}\,(\hat\beta_S - \beta^*_S) \xrightarrow{d} N(0, \sigma^2 \Sigma_S^{-1})\) with \(\Sigma_S = \lim X_S^\top X_S / n\), which is the same limiting distribution as the OLS estimator that knows \(S\) in advance.

If the support were known, estimation reduces to low-dimensional OLS on \(X_S\). That estimator is unbiased, efficient, and easy to analyze. Some of you will remember the Gauss-Markov theorem from your econometrics course, which states that the OLS estimator is the best linear unbiased estimator (BLUE) under homoskedasticity.

Oracle Property Definition

Can a data-driven procedure simultaneously discover \(S\) and then estimate \(\beta^*_S\) as efficiently as if \(S\) were given?

This is an appealing theoretical benchmark for sparse estimators. You can hardly do better than that.

Which Methods Achieve It

Classical LASSO does not generally satisfy the oracle property. Its penalty introduces shrinkage bias that persists asymptotically.

Nonconvex penalties (e.g., SCAD and MCP) were explicitly designed to achieve the oracle property under regularity conditions. Adaptive LASSO can also achieve it when weights are constructed from a root-\(n\) consistent pilot estimator.

The key mechanism is reduced shrinkage for large coefficients while still penalizing small ones.
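To make the mechanism concrete, here is a minimal sketch that compares plain LASSO with adaptive LASSO on simulated data. Everything in it is an assumption for illustration: the simulated design, the penalty level `lam = 0.2`, the OLS pilot estimator, the weights `1 / |beta_ols|`, and the small coordinate-descent helper `weighted_lasso` (written from scratch here, not taken from any library).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def weighted_lasso(X, y, lam, weights, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j weights[j] * |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]            # partial residual without feature j
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam * weights[j]) / col_scale[j]
            resid -= X[:, j] * beta[j]
    return beta

# Hypothetical sparse design: 3 active predictors out of 10
rng = np.random.default_rng(1988)
n, p = 2000, 10
beta_true = np.array([3.0, 1.5, 2.0] + [0.0] * (p - 3))
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

lam = 0.2
beta_lasso = weighted_lasso(X, y, lam, np.ones(p))      # plain LASSO: equal weights

# Pilot estimator: OLS is root-n consistent here because p is much smaller than n
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_adaptive = weighted_lasso(X, y, lam, 1.0 / np.abs(beta_ols))

support = beta_true != 0
err_lasso = np.abs(beta_lasso - beta_true)[support].sum()
err_adaptive = np.abs(beta_adaptive - beta_true)[support].sum()
print("LASSO    error on true support:", round(err_lasso, 3))
print("Adaptive error on true support:", round(err_adaptive, 3))
print("Adaptive selected set:", np.flatnonzero(beta_adaptive))
```

Large pilot coefficients receive small weights and are barely shrunk, while near-zero pilot coefficients receive very large weights and are set exactly to zero. A single simulated dataset like this illustrates the mechanism; it says nothing about finite-sample guarantees.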

Practical Implications

The oracle property is always asymptotic. There are never such guarantees in finite samples. It requires conditions such as:

correct model specification,
suitable signal strength (minimum nonzero coefficient size),
regularity conditions on the design matrix,
appropriate tuning parameter rates.

In finite samples, especially when signals are weak or predictors are highly correlated, procedures that theoretically satisfy the oracle property may not outperform simpler methods. In practice, prediction risk often matters more than exact support recovery.

There is also a conceptual point: the oracle benchmark assumes that the “true” model is sparse and well-defined. In many modern applications, sparsity is an approximation rather than a literal truth.

Bottom Line

The oracle property means consistent variable selection plus asymptotically efficient estimation on the true support.
Nonconvex penalties and adaptive LASSO can achieve it; standard LASSO typically does not.
The property is asymptotic and depends on strong conditions (signal strength, design assumptions, tuning rates).
In practice, predictive performance and stability often matter more than satisfying oracle-style guarantees.

Where to Learn More

Fan and Li (2001) introduced SCAD and formalized the oracle property in penalized likelihood estimation. Zou (2006) shows how adaptive LASSO can achieve oracle behavior. Bühlmann and van de Geer’s Statistics for High-Dimensional Data provides a modern, rigorous treatment of sparsity, regularization paths, and inference in high-dimensional regimes.

References

Bühlmann, P., & van de Geer, S. (2011). Statistics for High-Dimensional Data.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties.

Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty.

Zou, H. (2006). The adaptive LASSO and its oracle properties.

Why Some Confidence Intervals Are Not Symmetric

Tue, 10 Mar 2026 07:00:00 GMT

4 min read

Background

Most of us were trained to think of a confidence interval as

\[ \text{point estimate} \pm \text{margin of error}. \]

That template is deeply ingrained. It works beautifully for estimators whose sampling distributions are symmetric and well behaved. But have you ever come across a confidence interval with an off-center point estimate?

The “\(\pm\) margin of error” representation is not a defining property of confidence intervals. It is a consequence of symmetry. Once symmetry disappears because of skewed sampling distributions, nonlinear transformations, boundary constraints, or small-sample behavior, the interval need not be centered around the point estimate.

The goal of this note is to unpack where asymmetry comes from, when it is expected, and how different construction principles lead to intervals that look very different from the textbook symmetric interval. I will also illustrate the phenomenon with a bootstrap example in R and Python.

Notation

Let \(\theta\) denote a scalar parameter of interest, and let \(\hat\theta\) be an estimator.

A \((1-\alpha)\) confidence interval is a random set \(C_n\) such that, by definition,

\[ P_\theta(\theta \in C_n) \ge 1 - \alpha \quad \text{for all } \theta. \]

When \(\hat\theta\) satisfies an asymptotic normality result, a Wald-type interval takes the familiar form

\[ \hat\theta \pm z_{1-\alpha/2}\, \widehat{\mathrm{se}}(\hat\theta), \]

where the critical value \(z_{1-\alpha/2}\) is the \((1-\alpha/2)\) quantile of the standard normal.

This interval is symmetric around \(\hat\theta\) by construction. Its symmetry is inherited from the symmetry of the limiting Gaussian distribution. Remove that symmetry or step outside the world where the approximation is valid, and the interval will generally no longer be symmetric.

A Closer Look

I will now examine four common sources of confidence interval asymmetry.

Skewed Sampling Distributions

Symmetry of the interval reflects symmetry of the sampling distribution, not symmetry of the data.

Consider estimating a proportion \(p\) from a binomial model, \(X \sim \mathrm{Binomial}(n, p)\). The MLE is \(\hat p = X / n\). For moderate \(n\) and \(p\) near \(0\) or \(1\), the distribution of \(\hat p\) is visibly skewed. A Wald interval,

\[ \hat p \pm z_{1-\alpha/2} \sqrt{\hat p (1 - \hat p) / n}, \]

may extend below \(0\) or above \(1\). That is a red flag: the procedure ignores the geometry of the parameter space.

Score intervals and logit-transformed intervals are asymmetric in \(p\) precisely because they respect this skewness and the \([0, 1]\) constraint. The asymmetry is not a flaw; it is the correction.
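A quick numerical check makes the contrast visible. The sketch below computes a Wald interval and a Wilson score interval for a hypothetical sample with 1 success in 20 trials; the function names and the data are illustrative choices, and the score interval uses the standard closed-form Wilson formula.

```python
import math

def wald_interval(x, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(x, n, z=1.96):
    """Wilson score interval, obtained by inverting the score test."""
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

x, n = 1, 20                       # hypothetical data: 1 success in 20 trials
p_hat = x / n
wald = wald_interval(x, n)
wilson = wilson_interval(x, n)

print("p_hat :", p_hat)
print("Wald  :", tuple(round(v, 4) for v in wald))
print("Wilson:", tuple(round(v, 4) for v in wilson))
```

The Wald lower limit falls below zero, while the Wilson interval stays inside the unit interval and reaches much further above the point estimate than below it.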

Nonlinear Transformations

Suppose \(\psi = g(\theta)\) for a nonlinear \(g\). Even if \(\hat\theta\) is approximately normal, the distribution of

\[ \hat\psi = g(\hat\theta) \]

is generally not symmetric in finite samples.

A first-order delta method approximation gives

\[ g(\hat\theta) \approx N\big( g(\theta),\ [g'(\theta)]^2\, \mathrm{Var}(\hat\theta) \big), \]

which suggests a symmetric interval in \(\psi\)-space. However, mapping that interval back to \(\theta\)-space via \(g^{-1}\) typically produces asymmetry.

This is routine in practice. Log-scale confidence intervals for positive parameters (e.g., rate ratios, hazard ratios) are symmetric in \(\log\theta\) but asymmetric in \(\theta\). The asymmetry reflects curvature in \(g\).
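As a small illustration, the sketch below builds a symmetric interval on the log scale and maps it back with the exponential function. The point estimate and standard error are hypothetical numbers chosen only for the example.

```python
import math

hr_hat = 2.0      # hypothetical hazard ratio estimate
se_log = 0.30     # hypothetical standard error of log(hr_hat)
z = 1.96

# Symmetric interval on the log scale
log_lo = math.log(hr_hat) - z * se_log
log_hi = math.log(hr_hat) + z * se_log

# Back-transform to the original scale
lo, hi = math.exp(log_lo), math.exp(log_hi)

print(f"log scale: [{log_lo:.3f}, {log_hi:.3f}] around {math.log(hr_hat):.3f}")
print(f"original : [{lo:.3f}, {hi:.3f}] around {hr_hat:.3f}")
print(f"distances: lower = {hr_hat - lo:.3f}, upper = {hi - hr_hat:.3f}")
```

On the original scale the interval is symmetric only in a multiplicative sense: the two limits are the estimate divided and multiplied by the same factor, so the upper arm is longer.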

Likelihood-Based Intervals

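Likelihood-ratio intervals solve

\[ 2\,[\ell(\hat\theta) - \ell(\theta)] \le \chi^2_{1,\,1-\alpha}, \]

where \(\ell\) is the log-likelihood. When \(\ell\) is not quadratic, as is common in small samples or near boundaries, the resulting set is not symmetric around \(\hat\theta\).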

The quadratic approximation that underlies Wald intervals is a second-order Taylor expansion. If the likelihood is skewed, the quadratic approximation inherits bias, and symmetric intervals misrepresent uncertainty.
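Here is a minimal sketch of a likelihood-ratio interval for the mean of an exponential distribution, compared with the Wald interval. The sample size, the sample mean, and the helper names (`deviance`, `bisect`) are assumptions made for illustration; the endpoints are found with a plain bisection search rather than a library root finder.

```python
import math

CHI2_1_95 = 3.841458820694124   # 0.95 quantile of the chi-square with 1 df

def deviance(theta, n, xbar):
    """2 * [loglik(theta_hat) - loglik(theta)] for an Exponential with mean theta."""
    def loglik(t):
        return -n * math.log(t) - n * xbar / t
    return 2.0 * (loglik(xbar) - loglik(theta))   # the MLE is theta_hat = xbar

def bisect(f, a, b, tol=1e-10):
    """Root of f on [a, b], assuming f(a) and f(b) have opposite signs."""
    fa = f(a)
    while b - a > tol:
        m = 0.5 * (a + b)
        fm = f(m)
        if (fa > 0) == (fm > 0):
            a, fa = m, fm
        else:
            b = m
    return 0.5 * (a + b)

n, xbar = 20, 1.0               # hypothetical sample size and sample mean

def g(t):
    return deviance(t, n, xbar) - CHI2_1_95

lr_lo = bisect(g, 1e-6 * xbar, xbar)    # endpoint below the MLE
lr_hi = bisect(g, xbar, 100 * xbar)     # endpoint above the MLE

se = xbar / math.sqrt(n)                # Wald SE, using Var(theta_hat) = theta^2 / n
wald = (xbar - 1.96 * se, xbar + 1.96 * se)

print(f"LR interval  : [{lr_lo:.3f}, {lr_hi:.3f}]")
print(f"Wald interval: [{wald[0]:.3f}, {wald[1]:.3f}]")
```

The likelihood-ratio interval extends further above the MLE than below it, which mirrors the skewness of the log-likelihood in this model.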

Bootstrap Percentile Intervals

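Bootstrap percentile intervals are defined directly from empirical quantiles of the bootstrap distribution:

\[ \big[ \hat\theta^*_{(\alpha/2)},\ \hat\theta^*_{(1-\alpha/2)} \big], \]

where \(\hat\theta^*_1, \dots, \hat\theta^*_B\) are bootstrap replicates and \(\hat\theta^*_{(q)}\) is their empirical \(q\)-quantile.

No symmetry is imposed. If the empirical distribution of \(\hat\theta^*\) is skewed, the interval is skewed. This is often desirable: the procedure adapts to the shape of the sampling distribution.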

Algorithm: Percentile Bootstrap CI

Draw \(B\) bootstrap samples by resampling with replacement.
Compute \(\hat\theta^*_b\) for each resample.
Form the interval from the empirical \(\alpha/2\) and \(1-\alpha/2\) quantiles of \(\{\hat\theta^*_b\}_{b=1}^B\).

The percentile method is not universally optimal, but it makes the asymmetry explicit instead of suppressing it.

An Example

We simulate from an exponential distribution, which is right-skewed. Even the (asymptotically normal) sample mean can have a noticeably skewed sampling distribution at moderate \(n\).

set.seed(1988)

# Generate 50 observations from Exp(1)
x <- rexp(50, rate = 1)
sample_mean <- mean(x)

# Bootstrap distribution of the mean
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

# Percentile CI
ci_percentile <- quantile(boot_means, c(0.025, 0.975))

# Symmetric normal approximation
se <- sd(x) / sqrt(length(x))
ci_symmetric <- c(sample_mean - 1.96 * se,
                  sample_mean + 1.96 * se)

lower_distance <- sample_mean - ci_percentile[1]
upper_distance <- ci_percentile[2] - sample_mean

# print() treats its second argument as `digits`, so combine the values first
print(c(lower_distance, upper_distance))

import numpy as np

np.random.seed(1988)

# Generate 50 observations from Exp(1)
x = np.random.exponential(scale=1.0, size=50)
sample_mean = np.mean(x)

# Bootstrap distribution of the mean
boot_means = [
    np.mean(np.random.choice(x, size=50, replace=True))
    for _ in range(10000)
]

# Percentile CI
ci_percentile = np.percentile(boot_means, [2.5, 97.5])

# Symmetric normal approximation
se = np.std(x, ddof=1) / np.sqrt(len(x))
ci_symmetric = [
    sample_mean - 1.96 * se,
    sample_mean + 1.96 * se
]

lower_distance = sample_mean - ci_percentile[0]
upper_distance = ci_percentile[1] - sample_mean

print(lower_distance, upper_distance)

In typical runs, the upper distance exceeds the lower distance. The right tail of the exponential distribution propagates into the bootstrap distribution of the mean. The percentile interval reflects that skewness; the Wald interval does not.

As \(n\) grows, the central limit theorem compresses this asymmetry. At \(n = 50\), it is still visible. At much larger \(n\), it is largely gone. The interval geometry tracks the sampling distribution geometry.
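One way to see how fast the asymmetry fades is to look at the skewness of the sampling distribution itself rather than at a single bootstrap run. The sketch below relies on a standard fact: the mean of \(n\) i.i.d. Exp(1) draws follows a Gamma distribution with shape \(n\) and scale \(1/n\), whose skewness is \(2/\sqrt{n}\). The sample sizes and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1988)
reps = 200_000

def skewness(a):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    c = a - a.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5

results = {}
for n in (10, 50, 500, 5000):
    # Draw the sample mean directly from its exact Gamma(shape=n, scale=1/n) law
    means = rng.gamma(shape=n, scale=1.0 / n, size=reps)
    results[n] = (skewness(means), 2 / np.sqrt(n))
    print(f"n = {n:5d}   simulated skewness = {results[n][0]:.3f}   theory = {results[n][1]:.3f}")
```

Because the skewness decays at rate \(1/\sqrt{n}\), multiplying the sample size by 100 cuts it by a factor of 10.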

Bottom Line

Symmetric intervals arise from symmetric (often Gaussian) approximations; they are not a universal property of confidence intervals.
Skewness, nonlinear transformations, and boundary constraints naturally induce asymmetric intervals.
Likelihood-based and bootstrap methods often expose asymmetry that Wald intervals conceal.
If the parameter space or sampling distribution is asymmetric, an asymmetric interval is typically more faithful to the underlying uncertainty.

References

Casella, G., & Berger, R. L. (2002). Statistical Inference.

Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap.

OLS with Fixed vs Random \(X\): What Actually Changes?

Sun, 08 Mar 2026 08:00:00 GMT

3 min read

Background

In regression courses, you will eventually hear the phrase: “OLS works whether \(X\) is fixed or random.” That statement is correct, but dangerously compressed.

The distinction between fixed and random regressors is not about how you compute \(\hat\beta\). The algebra is identical. The difference is in what is random, what we condition on, and how we interpret sampling statements.

The goal of this note is to make that distinction precise, and to clarify what does, and does not, depend on treating \(X\) as fixed.

Notation

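Consider the well-known linear model

\[ y = X\beta + \varepsilon, \]

where:

\(y \in \mathbb{R}^n\),
\(X \in \mathbb{R}^{n \times k}\) with full column rank,
\(\beta \in \mathbb{R}^k\),
\(\varepsilon \in \mathbb{R}^n\) with \(E[\varepsilon \mid X] = 0\) and \(\mathrm{Var}(\varepsilon \mid X) = \sigma^2 I_n\).

The standard OLS estimator is

\[ \hat\beta = (X^\top X)^{-1} X^\top y. \]

The key question is: are we conditioning on \(X\), or is \(X\) itself a random object in the data-generating process?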

A Closer Look

Let’s take a closer look at the two cases.

Fixed \(X\): Classical Linear Model

In the classical setup, \(X\) is treated as fixed (non-stochastic). Then, all randomness comes from the error term \(\varepsilon\).

Conditional on \(X\),

\[ E[\hat\beta \mid X] = \beta, \qquad \mathrm{Var}(\hat\beta \mid X) = \sigma^2 (X^\top X)^{-1}. \]

Inference is therefore conditional inference. Confidence intervals and \(t\)-tests are statements about the distribution of \(\hat\beta\) given this specific design matrix, \(X\).

This framework is natural in designed experiments, where is literally chosen by the researcher.

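Random \(X\): Econometric View

In most observational settings, \(X\) is random. We observe i.i.d. draws \((y_i, x_i)\) from an unknown joint distribution. Under standard regularity conditions, the same OLS estimator, \(\hat\beta\), satisfies

\[ \sqrt{n}\,(\hat\beta - \beta) \xrightarrow{d} N(0, V). \]

The asymptotic variance becomes

\[ V = E[x_i x_i^\top]^{-1}\, E[\varepsilon_i^2\, x_i x_i^\top]\, E[x_i x_i^\top]^{-1}. \]

Under homoskedasticity, this simplifies to

\[ V = \sigma^2\, E[x_i x_i^\top]^{-1}. \]

The algebra mirrors the fixed-\(X\) case, but the interpretation changes: we are no longer conditioning on a specific realization of \(X\); we are averaging over its distribution.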

What Actually Changes?

Three things matter.

First, the object of inference. With fixed \(X\), inference is conditional on the design. With random \(X\), inference is about repeated sampling of \((y, X)\).

Second, exogeneity assumptions. In the fixed-\(X\) model, we require \(E[\varepsilon \mid X] = 0\). In the random-\(X\) case, we need the same condition, but it now constrains the joint distribution: it says that once we know the regressors, there is no systematic remaining signal in the error term. Violations become statements about endogeneity, meaning \(X\) is statistically related to omitted factors inside \(\varepsilon\).

Third, robustness. Heteroskedasticity-robust standard errors are naturally derived in the random-\(X\) framework, where the conditional variance \(\mathrm{Var}(\varepsilon_i \mid x_i)\) may depend on \(x_i\). In other words, different parts of the regressor distribution can come with different noise levels, so inference has to account for that variation rather than rely on a single common variance.

What does not change is the formula for \(\hat\beta\). Nor does unbiasedness depend on \(X\) being fixed; it depends on the conditional mean-zero assumption.
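The third point can be illustrated with a short simulation. This is a sketch under assumed conditions: the data-generating process (noise standard deviation `0.5 + |x|`) is hypothetical, and the robust variance is the basic sandwich (HC0) form of the asymptotic variance, without any small-sample correction.

```python
import numpy as np

rng = np.random.default_rng(1988)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])              # intercept and one regressor

# Hypothetical heteroskedastic errors: the noise level grows with |x|
eps = rng.normal(size=n) * (0.5 + np.abs(x))
y = 1.0 + 2.0 * x + eps

# Same OLS formula, whatever the interpretation of X
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical variance: sigma^2 (X'X)^{-1}
sigma2_hat = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# Sandwich (HC0) variance: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}
meat = (X * resid[:, None] ** 2).T @ X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("beta_hat    :", np.round(beta_hat, 3))
print("classical SE:", np.round(se_classical, 4))
print("robust SE   :", np.round(se_robust, 4))
```

The point estimate is the same under both views; only the standard error changes. In this design the robust slope standard error is noticeably larger, because the noisiest observations are also the ones with the most leverage.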

Bottom Line

The OLS estimator is algebraically identical whether \(X\) is fixed or random.
Fixed-\(X\) inference is conditional; random-\(X\) inference averages over the joint distribution.
Consistency hinges on \(E[\varepsilon \mid X] = 0\), not on whether \(X\) is stochastic.
Robust variance formulas arise naturally once \(X\) is treated as random.

Where to Learn More

For a classical treatment, see Greene’s Econometric Analysis, which clearly distinguishes fixed and stochastic regressors. Wooldridge’s Econometric Analysis of Cross Section and Panel Data provides a modern random-\(X\) perspective with emphasis on exogeneity conditions and robust inference.

References

Greene, W. H. (2018). Econometric Analysis.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.

Logistic Regression in Randomized Trials?

Tue, 17 Feb 2026 08:00:00 GMT

4 min read

Background

Randomized controlled trials (RCTs) are the gold standard for causal inference. Random assignment guarantees that treatment is independent of potential outcomes. As a result, simple differences in observed outcomes identify causal effects without requiring outcome modeling.

With binary outcomes, however, data scientists often default to logistic regression. That instinct feels natural: the outcome is binary, the logit model is standard, and regression allows covariate adjustment. But does logistic regression actually respect what randomization gives us?

Freedman (2008) argues that it does not. Randomization justifies design-based estimators. Logistic regression introduces additional modeling assumptions that randomization does not validate. When those assumptions fail, the regression coefficient on treatment need not estimate the causal quantity of interest—even in large samples.

Notation

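Let there be \(n\) subjects indexed by \(i = 1, \dots, n\). Each subject has:

Treatment assignment \(X_i \in \{0, 1\}\)
Binary outcome \(Y_i \in \{0, 1\}\)
Covariates \(Z_i\)

Each unit has two potential outcomes: \(Y_i^T\) (under treatment) and \(Y_i^C\) (under control). Define the finite-population averages

\[ \alpha_T = \frac{1}{n} \sum_{i=1}^n Y_i^T, \qquad \alpha_C = \frac{1}{n} \sum_{i=1}^n Y_i^C. \]

The causal contrast of interest is the difference in log-odds:

\[ \Delta = \log\frac{\alpha_T}{1 - \alpha_T} - \log\frac{\alpha_C}{1 - \alpha_C}. \]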

A Closer Look

What Randomization Identifies

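Because treatment is randomized, the sample analogues

\[ \hat\alpha_T = \frac{1}{n_T} \sum_{i : X_i = 1} Y_i, \qquad \hat\alpha_C = \frac{1}{n_C} \sum_{i : X_i = 0} Y_i, \]

where \(n_T\) and \(n_C\) are the treated and control group sizes, are unbiased for \(\alpha_T\) and \(\alpha_C\). The plug-in estimator

\[ \hat\Delta = \log\frac{\hat\alpha_T}{1 - \hat\alpha_T} - \log\frac{\hat\alpha_C}{1 - \hat\alpha_C} \]

is therefore consistent and justified purely by the design.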

No outcome model is required.

What Logistic Regression Assumes

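A logistic regression specifies

\[ \log\frac{P(Y_i = 1 \mid X_i, Z_i)}{1 - P(Y_i = 1 \mid X_i, Z_i)} = \beta_1 + \beta_2 X_i + \beta_3 Z_i. \]

The coefficient \(\beta_2\) is typically interpreted as the treatment effect. This interpretation relies on strong assumptions:

The conditional log-odds is linear in \(X_i\) and \(Z_i\).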
The functional form is correctly specified.
The model captures the true dependence of outcomes on covariates.

Randomization does not validate any of these assumptions. It guarantees independence of treatment assignment—not correctness of the logit specification.

If the model is misspecified, the maximum likelihood estimator converges to a pseudo-true parameter: the value that best fits the assumed model, not necessarily the causal estimand \(\Delta\).

The Non-Collapsibility Problem

There is a deeper issue. The logistic coefficient \(\beta_2\) represents a conditional log odds ratio. The estimand \(\Delta\) is a marginal contrast. These quantities are generally not equal.

Odds ratios are non-collapsible: adding covariates changes the estimated coefficient even when there is no confounding. As a result, adjusting for \(Z\) in a logit model can change the treatment coefficient even in a perfectly randomized experiment.

This is not bias from confounding. It is a structural property of the odds ratio. Thus, even with infinite data, \(\hat\beta_2\) need not converge to \(\Delta\).
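Non-collapsibility can be verified with exact arithmetic, no simulation needed. The sketch below assumes a correctly specified logit model with hypothetical coefficients and a binary covariate \(Z\) that is Bernoulli(0.5) and independent of treatment, so there is no confounding by construction.

```python
import math

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical coefficients: intercept, treatment, covariate
b1, b2, b3 = -1.5, 1.0, 3.0

def marginal_prob(x):
    """P(Y = 1 | X = x), averaging over Z ~ Bernoulli(0.5) independent of X."""
    return 0.5 * expit(b1 + b2 * x) + 0.5 * expit(b1 + b2 * x + b3)

conditional_log_or = b2   # identical within each stratum of Z
marginal_log_or = logit(marginal_prob(1)) - logit(marginal_prob(0))

print("conditional log odds ratio:", conditional_log_or)
print("marginal log odds ratio   :", round(marginal_log_or, 4))
```

The treatment log odds ratio equals 1 within each stratum of \(Z\), yet the marginal log odds ratio is clearly smaller. The gap arises purely from averaging probabilities on a nonlinear scale.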

A Safer Use of Logistic Regression

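If logistic regression is used, the coefficient itself should not be interpreted as the estimand. Instead, compute model-based plug-in predictions:

Fit the logistic model and obtain \(\hat\beta = (\hat\beta_1, \hat\beta_2, \hat\beta_3)\).
Predict probabilities under treatment and control for every unit: \(\hat p_i^T = \operatorname{expit}(\hat\beta_1 + \hat\beta_2 + \hat\beta_3 Z_i)\) and \(\hat p_i^C = \operatorname{expit}(\hat\beta_1 + \hat\beta_3 Z_i)\), where \(\operatorname{expit}(u) = 1 / (1 + e^{-u})\).
Average predicted probabilities: \(\tilde\alpha_T = \frac{1}{n} \sum_{i=1}^n \hat p_i^T\) and \(\tilde\alpha_C = \frac{1}{n} \sum_{i=1}^n \hat p_i^C\).
Form \(\tilde\Delta = \log\frac{\tilde\alpha_T}{1 - \tilde\alpha_T} - \log\frac{\tilde\alpha_C}{1 - \tilde\alpha_C}\).

This estimator targets the correct marginal quantity. Even if the logit model is misspecified, it remains consistent under randomization. The coefficient \(\hat\beta_2\) does not share this guarantee.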

An Example

We illustrate with a small randomized experiment. There are \(n = 200\) units; half are assigned to treatment (\(X_i = 1\)) and half to control (\(X_i = 0\)) by complete randomization. Each unit has a binary outcome \(Y_i\) and a single covariate \(Z_i\). We compute three quantities: the design-based plug-in estimator \(\hat\Delta\), the logistic regression coefficient \(\hat\beta_2\) on treatment, and the adjusted estimator \(\tilde\Delta\) that uses the fitted logit model to predict probabilities under treatment and control for every unit, then marginalizes and forms the log-odds contrast.

The code below generates data (with a true conditional treatment effect of 0.8 on the log-odds scale), fits a logistic regression of \(Y\) on \(X\) and \(Z\), and reports \(\hat\Delta\), \(\hat\beta_2\), and \(\tilde\Delta\). In general these three numbers differ; \(\hat\Delta\) and \(\tilde\Delta\) target the marginal causal contrast, while \(\hat\beta_2\) is a conditional parameter.

set.seed(1988)
n <- 200
x <- sample(rep(c(1, 0), each = n / 2))   # complete randomization
z <- rnorm(n, mean = 0, sd = 1)

# True P(Y=1) depends on X and Z (logistic); treatment increases log-odds by 0.8
beta_true <- c(0, 0.8, 0.3)   # intercept, treatment, covariate
eta <- beta_true[1] + beta_true[2] * x + beta_true[3] * z
p <- 1 / (1 + exp(-eta))
y <- rbinom(n, size = 1, prob = p)

# --- Design-based plug-in: delta ---
alpha_T_hat <- mean(y[x == 1])
alpha_C_hat <- mean(y[x == 0])
delta_hat <- log(alpha_T_hat / (1 - alpha_T_hat)) - log(alpha_C_hat / (1 - alpha_C_hat))

# --- Logistic regression: beta_2 (coefficient on treatment) ---
fit <- glm(y ~ x + z, family = binomial)
beta_2 <- coef(fit)["x"]

# --- Adjusted estimator: marginalize fitted probs, then log-odds contrast ---
p_under_treat <- predict(fit, newdata = data.frame(x = 1, z = z), type = "response")
p_under_control <- predict(fit, newdata = data.frame(x = 0, z = z), type = "response")
alpha_T_tilde <- mean(p_under_treat)
alpha_C_tilde <- mean(p_under_control)
delta_tilde <- log(alpha_T_tilde / (1 - alpha_T_tilde)) - log(alpha_C_tilde / (1 - alpha_C_tilde))

cat("Design-based delta_hat:  ", round(delta_hat, 4), "\n")
cat("Logistic coef (beta_2):  ", round(beta_2, 4), "\n")
cat("Adjusted delta_tilde:    ", round(delta_tilde, 4), "\n")

import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(1988)
n = 200
x = np.array([1] * (n // 2) + [0] * (n - n // 2))
np.random.shuffle(x)
z = np.random.normal(0, 1, n)

# True P(Y=1) depends on X and Z (logistic); treatment increases log-odds by 0.8
beta_true = np.array([0, 0.8, 0.3])   # intercept, treatment, covariate
eta = beta_true[0] + beta_true[1] * x + beta_true[2] * z
p = 1 / (1 + np.exp(-eta))
y = np.random.binomial(1, p, n)

# Design-based plug-in: delta
alpha_T_hat = y[x == 1].mean()
alpha_C_hat = y[x == 0].mean()
delta_hat = np.log(alpha_T_hat / (1 - alpha_T_hat)) - np.log(alpha_C_hat / (1 - alpha_C_hat))

# Logistic regression: beta_2 (coefficient on treatment)
X_design = np.column_stack([np.ones(n), x, z])
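# Very large C makes the default L2 penalty negligible; X_design already carries
# the intercept column, so the built-in intercept is switched off.
fit = LogisticRegression(C=1e10, fit_intercept=False).fit(X_design, y)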
beta_2 = fit.coef_[0][1]

# Adjusted estimator: marginalize fitted probs, then log-odds contrast
p_under_treat = fit.predict_proba(np.column_stack([np.ones(n), np.ones(n), z]))[:, 1]
p_under_control = fit.predict_proba(np.column_stack([np.ones(n), np.zeros(n), z]))[:, 1]
alpha_T_tilde = p_under_treat.mean()
alpha_C_tilde = p_under_control.mean()
delta_tilde = np.log(alpha_T_tilde / (1 - alpha_T_tilde)) - np.log(alpha_C_tilde / (1 - alpha_C_tilde))

print("Design-based delta_hat:  ", round(delta_hat, 4))
print("Logistic coef (beta_2):  ", round(beta_2, 4))
print("Adjusted delta_tilde:    ", round(delta_tilde, 4))

Bottom Line

Randomization identifies causal effects without modeling.
Design-based estimators and plug-in approaches respect the randomized design. The logit coefficient does not.
Logistic regression introduces functional-form assumptions that randomization does not justify.
The treatment coefficient estimates a conditional odds ratio, not the marginal causal contrast defined by the experiment.
The logistic regression coefficient generally differs from the experimental estimand—even in large samples.

Reference

Freedman, D. A. (2008). Randomization Does Not Justify Logistic Regression. Statistical Science, 23(2), 237–249. https://doi.org/10.1214/08-STS262

Randomization Inference: A Gentle Introduction

Thu, 12 Feb 2026 08:00:00 GMT

6 min read

Background

Randomization inference offers a refreshing alternative to traditional parametric inference, providing exact control over Type I error rates without relying on large-sample approximations or strict distributional assumptions. Born out of Fisher’s famous tea-tasting experiment, the approach leverages the symmetry and structure induced by randomization itself to test hypotheses.

This blog post unpacks the theory and intuition behind randomization inference, drawing on the excellent review by Ritzwoller, Romano, and Shaikh (2025). I’ll cover the key ideas, notation, and algorithms involved, and also touch on modern applications like two-sample tests, regression, and conformal inference. Throughout, I’ll emphasize the practical considerations — when it works, why it works, and where caution is needed.

Notation

Let \(W = (W_1, \dots, W_n)\) denote the treatment assignment vector and \(Y = (Y_1, \dots, Y_n)\) the observed outcomes. In potential-outcomes notation, each unit has \(Y_i(1)\) and \(Y_i(0)\), and we observe

\[ Y_i = W_i\, Y_i(1) + (1 - W_i)\, Y_i(0). \]

The assignment mechanism is known. For example, under complete randomization with \(n_1\) treated units, \(W\) is uniformly distributed over all binary vectors with exactly \(n_1\) ones.

Let \(T = T(W, Y)\) be a test statistic computed from the observed data \((W, Y)\).

Let \(\mathcal{G}\) denote the set of transformations consistent with the design (e.g., all treatment permutations preserving the treated count). Under a valid randomization hypothesis,

\[ (gW, Y) \overset{d}{=} (W, Y) \quad \text{for all } g \in \mathcal{G}. \]

This invariance is the engine of randomization inference.

A Closer Look

Sharp vs Regular Null Hypotheses

The most important distinction in randomization inference is between sharp null hypotheses (which fully determine the unobserved potential outcomes) and regular/weak null hypotheses (which do not).

A sharp null specifies the treatment effect for every unit. The canonical example is Fisher’s no-effect null: \(H_0: Y_i(1) = Y_i(0)\) for all \(i\). Under this null, the missing potential outcomes are imputable from the observed outcomes. That is what makes exact finite-sample randomization tests possible: for each candidate assignment \(w\), you can reconstruct the outcomes that would have been observed under \(w\) and recompute the test statistic.

A regular/weak null is something like “the average treatment effect is zero,” or a regression-style null about a parameter in a model. This null does not let you impute all missing potential outcomes, so the randomization distribution of a non-studentized statistic typically depends on nuisance features (e.g., heteroskedasticity). In that setting, exactness generally fails, and validity is recovered (when it is) by using statistics that are asymptotically pivotal, often via studentization.

The \(p\)-values from the two kinds of test are not comparable, since they are based on different null hypotheses.

Exact Randomization Tests

If the randomization hypothesis holds, we can compute the distribution of \(T\) by applying all transformations in \(\mathcal{G}\) to the data. The \(p\)-value is simply the proportion of these transformed test statistics that are as extreme or more extreme than the observed \(T\):

\[ \hat p = \frac{1}{|\mathcal{G}|} \sum_{g \in \mathcal{G}} \mathbf{1}\{ T(gW, Y) \ge T(W, Y) \}. \]

Because the null implies invariance under \(\mathcal{G}\), this procedure achieves exact finite-sample control of the Type I error rate.

Algorithm: Randomization Test

Choose a test statistic \(T\).
Define the group \(\mathcal{G}\) of transformations.
Compute \(T\) on the observed data.
Apply all (or a random sample of) transformations to the data and recompute \(T\).
Calculate the \(p\)-value as the proportion of transformed statistics as or more extreme than the observed \(T\).

In practice, \(|\mathcal{G}|\) can be large. Monte Carlo sampling of transformations provides an accurate approximation, with a simple +1 adjustment (adding one to both the numerator and the denominator of the \(p\)-value) preserving finite-sample validity under random sampling.
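When the experiment is tiny, full enumeration is feasible and no Monte Carlo step is needed. The sketch below assumes a hypothetical experiment with 8 units, 4 of them treated, and tests Fisher's sharp null by enumerating all 70 possible assignments. The outcome values are made up for illustration.

```python
from itertools import combinations

# Hypothetical observed outcomes; units 0-3 were treated
y = [5.1, 4.8, 6.0, 5.5, 3.9, 4.2, 3.5, 4.0]
treated_obs = (0, 1, 2, 3)

def diff_in_means(treated):
    """Difference in means for a given set of treated indices."""
    t = [y[i] for i in treated]
    c = [y[i] for i in range(len(y)) if i not in treated]
    return sum(t) / len(t) - sum(c) / len(c)

t_obs = diff_in_means(treated_obs)

# Under the sharp null the outcomes are fixed, so every assignment with
# 4 treated units yields a statistic we can compute exactly.
all_assignments = list(combinations(range(len(y)), len(treated_obs)))
t_all = [diff_in_means(a) for a in all_assignments]

# Two-sided exact p-value; the small tolerance guards against floating-point ties
p_exact = sum(abs(t) >= abs(t_obs) - 1e-12 for t in t_all) / len(t_all)

print("number of assignments:", len(all_assignments))
print("observed statistic   :", round(t_obs, 3))
print("exact p-value        :", round(p_exact, 4))
```

Here the treated units happen to be the four largest outcomes, so only the observed assignment and its mirror image are as extreme, giving the smallest two-sided \(p\)-value possible with 70 assignments.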

When Exactness Fails

Under weak nulls, permutation tests are no longer automatically valid. The permutation distribution of a statistic may not match its true sampling distribution.

The difference in means illustrates the issue. If treatment and control variances differ, the raw difference in means can severely over-reject under permutation. The statistic is not pivotal.

Studentization resolves the problem. Scaling by an estimated standard error produces an asymptotically pivotal statistic whose limiting null distribution does not depend on nuisance parameters. Rank-based procedures (e.g., Wilcoxon–Mann–Whitney) achieve a similar goal.

The general principle is simple: asymptotic validity requires asymptotic pivotality.
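A small simulation shows what studentization buys. The sketch below is an illustration under assumed conditions: a two-sample setting with equal means, unequal variances, and unbalanced group sizes (all hypothetical choices), comparing permutation tests based on the raw difference in means and on a Welch-type studentized statistic. The simulation is deliberately small so that it runs in a few seconds.

```python
import numpy as np

def diff_in_means(y, w):
    return y[w == 1].mean() - y[w == 0].mean()

def welch_t(y, w):
    """Difference in means scaled by an unequal-variance standard error."""
    y1, y0 = y[w == 1], y[w == 0]
    se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return (y1.mean() - y0.mean()) / se

def perm_pvalue(y, w, stat, b, rng):
    """Two-sided Monte Carlo permutation p-value with the +1 adjustment."""
    t_obs = abs(stat(y, w))
    t_perm = np.array([abs(stat(y, rng.permutation(w))) for _ in range(b)])
    return (1 + np.sum(t_perm >= t_obs)) / (b + 1)

rng = np.random.default_rng(1988)
n1, n0 = 10, 90                                   # small, noisy group vs large, quiet group
w = np.array([1] * n1 + [0] * n0)
sd = np.where(w == 1, 5.0, 1.0)                   # equal means, unequal variances

n_sims, b, alpha = 200, 99, 0.05
reject_raw = reject_stud = 0
for _ in range(n_sims):
    y = rng.normal(size=n1 + n0) * sd             # the weak null (equal means) holds
    reject_raw += perm_pvalue(y, w, diff_in_means, b, rng) <= alpha
    reject_stud += perm_pvalue(y, w, welch_t, b, rng) <= alpha

rate_raw = reject_raw / n_sims
rate_stud = reject_stud / n_sims
print("rejection rate, raw difference in means:", rate_raw)
print("rejection rate, studentized statistic  :", rate_stud)
```

The raw statistic rejects far more often than the nominal 5 percent, because the permutation distribution pools the two variances. The studentized version is much closer to nominal, although with only 10 units in the small group it need not be exact.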

Strengths and Limitations

Randomization inference is particularly powerful when the randomization scheme is known and controlled, as in experiments, when the test statistic is chosen to be pivotal, and when exact finite-sample error control is important.

However, it becomes less effective when covariates are correlated with treatment assignment but not properly accounted for, or when the sample size is too small to approximate the randomization distribution reliably through subsampling.

An Example

We illustrate the procedure with a small randomized experiment. There are \(n = 20\) units; exactly \(n_1 = 10\) receive treatment under complete randomization, so \(W\) is uniformly distributed over all binary vectors with ten ones. Each unit has potential outcomes \(Y_i(0)\) and \(Y_i(1) = Y_i(0) + \tau\) with a constant effect \(\tau = 1\), and we observe \(Y_i = Y_i(0) + \tau W_i\). The test statistic is the difference in means, \(T = \bar Y_1 - \bar Y_0\), where \(\bar Y_1\) and \(\bar Y_0\) are the treated and control sample means.

We test Fisher’s sharp null of no effect: \(Y_i(1) = Y_i(0)\) for all \(i\). Under this null, the observed \(Y\) would be the same under any assignment, so we can build the randomization distribution by repeatedly permuting \(W\) (keeping the number of treated units fixed), recomputing \(T\) for each permuted assignment, and then computing the proportion of those values that are as or more extreme than the observed \(T\). That proportion is the randomization \(p\)-value.

The code below does exactly that, using \(B = 5000\) random permutations and a standard +1 adjustment for the Monte Carlo \(p\)-value.

set.seed(1988)
n <- 20
n_treat <- 10

# Potential outcomes (fixed) and assignment (random)
y0 <- rnorm(n, mean = 0, sd = 1)
tau <- 1.0
w <- sample(c(rep(1, n_treat), rep(0, n - n_treat)))
y <- y0 + tau * w

# Test statistic: difference in means
t_obs <- mean(y[w == 1]) - mean(y[w == 0])

# Randomization distribution under Fisher sharp null of no effect:
# under H0, y(1)=y(0), so observed outcomes are invariant to assignment.
b <- 5000
t_perm <- replicate(b, {
  w_perm <- sample(w)  # preserves treated count (complete randomization)
  mean(y[w_perm == 1]) - mean(y[w_perm == 0])
})

# Two-sided p-value with a +1 adjustment (Monte Carlo exactness)
p_value <- (1 + sum(abs(t_perm) >= abs(t_obs))) / (b + 1)
p_value

import numpy as np

np.random.seed(1988)
n = 20
n_treat = 10

y0 = np.random.normal(0, 1, n)
tau = 1.0
w = np.array([1] * n_treat + [0] * (n - n_treat))
w = np.random.permutation(w)
y = y0 + tau * w

def diff_in_means(y, w):
    return y[w == 1].mean() - y[w == 0].mean()

t_obs = diff_in_means(y, w)

b = 5000
t_perm = np.empty(b)
for i in range(b):
    w_perm = np.random.permutation(w)  # preserves treated count
    t_perm[i] = diff_in_means(y, w_perm)

p_value = (1 + np.sum(np.abs(t_perm) >= np.abs(t_obs))) / (b + 1)
print(p_value)

Bottom Line

Randomization inference provides exact finite-sample error control when the randomization hypothesis holds.
Asymptotic validity can often be rescued by choosing asymptotically pivotal (studentized) test statistics.
Without studentization, permutation tests may fail badly in the presence of unequal variances.
Randomization tests are flexible and nonparametric, making them attractive for experimental data and beyond.

Where to Learn More

The best starting point is the recent review by Ritzwoller, Romano, and Shaikh (2025). For foundational treatments on nonparametric inference, the go-to is Lehmann & Romano’s two-volume door-stopper Testing Statistical Hypotheses. The practical guide by Good (2005) on permutation tests is also highly recommended.

References

Ritzwoller, D. M., Romano, J. P., & Shaikh, A. M. (2025). Randomization Inference: Theory and Applications.
Lehmann, E. L., & Romano, J. P. (2022). Testing Statistical Hypotheses. Springer.
Good, P. (2005). Permutation, Parametric, and Bootstrap Tests of Hypotheses. Springer.

Generalized Additive Models: What You Need to Know

Thu, 12 Feb 2026 08:00:00 GMT

6 min read

Background

Generalized Additive Models (GAMs) are one of the most powerful and flexible tools in a data scientist’s toolbox for modeling complex, nonlinear relationships between covariates and an outcome. They generalize linear models by allowing smooth, nonparametric functions of the predictors while still maintaining interpretability and manageable computation. The core idea is simple: instead of forcing relationships to be straight lines, let the data speak for itself.

This article explains what you really need to know about GAMs, following the excellent review by Simon Wood (2025). I’ll go over the basics of how GAMs work, how smoothness is controlled, the computational strategies involved, and key pitfalls to watch out for. I’ll also walk through a code example in both R and Python to show how to fit and interpret these models in practice.

Notation

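Consider an outcome variable \(y\) and predictors \(x_1, \dots, x_p\). The simplest linear model is:

\[ y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon. \]

The GAM replaces the linear terms with smooth functions \(f_j\):

\[ y = \beta_0 + f_1(x_1) + \dots + f_p(x_p) + \varepsilon. \]

More generally, for non-Gaussian outcomes, GAMs use a link function \(g\):

\[ g\big(E[y \mid x_1, \dots, x_p]\big) = \beta_0 + \sum_{j=1}^p f_j(x_j). \]

Each \(f_j\) is estimated from the data and constrained to be “smooth” through penalization.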

A Closer Look

What Makes a GAM?

The backbone of a GAM is its smooth terms. These are typically represented using splines — basis functions that piece together polynomials smoothly at specified knots. But not just any spline will do! In GAMs, smoothness is enforced through penalty terms that discourage excessive wiggliness.

For example, for a cubic spline, the penalty is usually the integral of the squared second derivative:

$$\int f''(x)^2 \, dx$$

In coefficient form, estimation solves

$$\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert^2 + \lambda \, \beta^\top S \beta$$

where $X$ is the matrix of basis functions evaluated at the data, $S$ is the penalty matrix, and $\lambda$ is the smoothing parameter.

Everything in a GAM flows from this penalized least-squares (or penalized likelihood) objective. The balance between fitting the data and keeping the function smooth is controlled by smoothing parameters ($\lambda$). This is regularization: in particular, the standard spline roughness penalties are quadratic (ridge-like). A higher $\lambda$ makes the function flatter; a lower $\lambda$ allows more flexibility.
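To make the objective concrete, here is a minimal NumPy sketch of penalized least squares. It is not how mgcv does it: I assume a simple piecewise-linear "tent" basis on equally spaced knots and use a discrete second-difference penalty as a stand-in for the integrated squared second derivative. The helper names (`hat_basis`, `second_diff_penalty`, `fit_penalized`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1988)
n = 200
x = np.sort(rng.uniform(0, 10, n))
y = np.sin(x) + rng.normal(0, 0.3, n)

def hat_basis(x, knots):
    """Piecewise-linear 'tent' basis: column j peaks at knots[j] (equally spaced knots assumed)."""
    h = knots[1] - knots[0]
    return np.clip(1 - np.abs(x[:, None] - knots[None, :]) / h, 0, None)

def second_diff_penalty(k):
    """S = D'D, where D takes second differences of adjacent coefficients."""
    D = np.diff(np.eye(k), n=2, axis=0)
    return D.T @ D

def fit_penalized(B, y, S, lam):
    """Solve min_beta ||y - B beta||^2 + lam * beta' S beta in closed form."""
    return np.linalg.solve(B.T @ B + lam * S, B.T @ y)

knots = np.linspace(0, 10, 20)
B = hat_basis(x, knots)
S = second_diff_penalty(len(knots))

def roughness(lam):
    """Value of the penalty beta' S beta at the fitted coefficients."""
    beta = fit_penalized(B, y, S, lam)
    return float(beta @ S @ beta)

rough_small = roughness(0.01)    # light penalty: wiggly fit
rough_large = roughness(1000.0)  # heavy penalty: nearly linear fit
print(rough_small, rough_large)
```

The fitted roughness shrinks as the smoothing parameter grows, which is the "higher means flatter" behavior described above.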

How Smoothness Is Estimated

Model selection in GAMs involves three related but distinct questions:

How smooth should each function be? (smoothing parameter selection, $\lambda$)
How flexible is the basis? (choice of basis dimension $k$)
Which smooth terms should be included at all? (term selection, $f_j$)

The basis dimension $k$ controls the maximum possible flexibility (how rich the spline basis is), while the smoothing parameter $\lambda$ controls how much of that flexibility is actually used. Intuitively, $k$ sets the size of the function space you search over; $\lambda$ determines the effective degrees of freedom (wiggliness) within that space. In practice, you choose $k$ “large enough” and let $\lambda$ do the regularization; if $k$ is too small, the smooth can be forced to underfit no matter how you tune $\lambda$.
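One way to pin down "effective degrees of freedom" is through the trace of the smoother matrix. Writing $X$ for the basis matrix of a single smooth and $S$ for its penalty matrix (notation I am assuming here), the fitted values of the penalized least-squares problem are linear in $y$:

$$\hat{y} = A_\lambda y, \qquad A_\lambda = X (X^\top X + \lambda S)^{-1} X^\top, \qquad \mathrm{edf}(\lambda) = \mathrm{tr}(A_\lambda).$$

When the smoothing parameter goes to zero and $X$ has full column rank, the edf approaches the basis dimension. As the smoothing parameter grows, the edf falls toward the dimension of the penalty's null space (2 for a second-derivative penalty, since straight lines are never penalized).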

There are two main strategies to estimate $\lambda$:

Cross-validation (CV): Minimize prediction error by holding out parts of the data. You are familiar with this from traditional machine learning models.
Marginal likelihood (REML): An empirical Bayes approach that tends to perform well in practice.

The marginal likelihood approach treats smooth coefficients as random effects with Gaussian priors (a mixed-model representation), and often yields better-behaved uncertainty quantification than ad hoc tuning.
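The cross-validation route is easy to sketch by hand. Below is a minimal example that picks the smoothing parameter on a grid by minimizing the generalized cross-validation (GCV) score, a standard approximation to leave-one-out CV. The tent basis, the second-difference penalty, and the grid are my own simplifying assumptions, not what mgcv uses internally.

```python
import numpy as np

rng = np.random.default_rng(1988)
n = 200
x = np.sort(rng.uniform(0, 10, n))
y = np.sin(x) + rng.normal(0, 0.3, n)

# Tent basis on equally spaced knots plus a second-difference penalty (assumed setup)
knots = np.linspace(0, 10, 20)
h = knots[1] - knots[0]
B = np.clip(1 - np.abs(x[:, None] - knots[None, :]) / h, 0, None)
D = np.diff(np.eye(len(knots)), n=2, axis=0)
S = D.T @ D

def gcv_score(lam):
    """Return (GCV score, effective degrees of freedom) for a given penalty weight."""
    A = B @ np.linalg.solve(B.T @ B + lam * S, B.T)  # smoother ("hat") matrix
    resid = y - A @ y
    edf = np.trace(A)
    return n * float(resid @ resid) / (n - edf) ** 2, edf

lams = 10.0 ** np.arange(-3, 5)            # candidate smoothing parameters
scores = [gcv_score(lam)[0] for lam in lams]
best_lam = lams[int(np.argmin(scores))]
print(best_lam, gcv_score(best_lam)[1])
```

In real work you would let the software optimize this (or the REML criterion) numerically rather than scan a coarse grid.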

Similarly, there are two common tools for model selection. The well-known Akaike Information Criterion (AIC) controls the trade-off between goodness of fit and model complexity. Alternatively, one can employ hypothesis testing to check whether each $f_j$ is significantly different from zero.

With $\lambda$, $k$, and $f_j$ selected, we can fit the GAM and make predictions. Let’s shift the focus to a few more nuanced, but important, topics.

Why Rank Reduction Matters

Full spline bases can be large and computationally expensive. To address this, GAMs often use low-rank spline bases (e.g., thin plate regression splines): you represent each smooth with a modest number of basis functions (controlled by $k$), rather than using a very large “full” basis. This keeps computation tractable while retaining most of the flexibility practitioners want. Consequently, GAM fitting scales better to larger datasets while preserving interpretability.

Beyond the Mean

GAMs aren’t limited to modeling the mean and naturally extend to modeling other aspects of the distribution. They can handle location, scale, and shape modeling — meaning that the variance, skewness, or other distributional parameters can also depend on smooth functions of predictors. This generalization brings GAMs into the world of generalized additive models for location, scale, and shape (GAMLSS).

They can even be extended to quantile regression and non-exponential family distributions, making them incredibly versatile. However, while GAMs allow flexible modeling of conditional expectations, they do not by themselves address common thorny issues such as endogeneity, causal identification, or selection bias. They simply allow for more depth in modeling the relationship between the outcome and the covariates and thus should be utilized in the context of machine learning/prediction.

Hypothesis Testing

Testing whether a smooth term is zero corresponds to testing whether its associated function is identically zero. Because smooth terms are penalized, the effective degrees of freedom are estimated from the data, and the resulting test statistics rely on large-sample approximations. The reported $p$-values are therefore approximate and should be interpreted as heuristic diagnostics rather than exact finite-sample guarantees.

An Example

library(mgcv)
set.seed(1988)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, 0, 0.3)
model <- gam(y ~ s(x), method = "REML")
summary(model)
plot(model, residuals = TRUE)

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.gam.api import GLMGam, BSplines

np.random.seed(1988)
n = 200
x = np.random.uniform(0, 10, n)
y = np.sin(x) + np.random.normal(0, 0.3, n)

# Build a cubic B-spline basis for x
X = x[:, None]
bs = BSplines(X, df=[10], degree=[3], knot_kwds=[{"lower_bound": x.min(), "upper_bound": x.max()}])

# Gaussian GAM (identity link) via the GLM-GAM interface
exog = np.ones((n, 1))  # intercept only
gam = GLMGam(y, smoother=bs, exog=exog).fit()
print(gam.summary())

plt.figure()
XX = np.linspace(x.min(), x.max(), 200)[:, None]
exog_pred = np.ones((len(XX), 1))
plt.plot(XX[:, 0], gam.predict(exog=exog_pred, exog_smooth=XX), label="GAM fit")
plt.scatter(x, y, alpha=0.3)
plt.legend()
plt.show()

Bottom Line

GAMs allow flexible, nonlinear modeling while retaining interpretability.
Smoothness is controlled by penalties, estimated via CV or marginal likelihood (REML).
Rank reduction makes GAMs computationally feasible even with large datasets.
GAMs generalize beyond means to scale, shape, and quantile modeling.

Where to Learn More

The recent review by Simon Wood (2025) is the most comprehensive and readable guide to modern GAMs. For practical hands-on work, Wood’s book Generalized Additive Models: An Introduction with R (2017) remains the go-to resource. See also Hastie (2017). For Bayesian extensions check Rue et al. (2009).

References

Hastie, T. J. (2017). Generalized additive models. Statistical models in S, 249-307.
Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(2), 319-392.
Wood, S. N. (2025). Generalized Additive Models. Annual Review of Statistics and Its Application, 12, 497–526.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R. CRC Press.

Understanding Correlated Random Effects Models

Wed, 11 Feb 2026 08:00:00 GMT

8 min read

Background

For decades, panel data analysis has largely revolved around a familiar dichotomy: fixed effects (FE) versus random effects (RE). More recently, generalized fixed effects and difference-in-differences designs have surged in popularity, particularly in causal inference. Yet between FE and RE lies a more general and conceptually illuminating framework: the correlated random effects (CRE) model. Although it receives less attention today, CRE remains a powerful tool for understanding the foundations of panel data methods.

Fixed effects models eliminate all time-invariant unobserved heterogeneity but sacrifice the ability to estimate the effects of time-invariant covariates. Random effects models, by contrast, retain those variables but rely on a strong assumption: that the unobserved individual-specific effects are uncorrelated with the regressors. When this assumption fails, as it often does, the RE estimator becomes biased. The correlated random effects (CRE) model, also known as the hybrid model, relaxes this assumption by explicitly modeling the potential correlation.

In this article, I examine the intuition behind the CRE model, explain how it bridges FE and RE, and show how it decomposes within- and between-unit variation. I conclude with a hands-on implementation in both R and Python to demonstrate how the model works in practice. The focus is on the linear versions of these models, and extending these ideas to nonlinear models is not always straightforward.

Notation

Let us consider a standard panel data setup where we observe units $i = 1, \dots, N$ over time periods $t = 1, \dots, T$. The outcome is $y_{it}$, and $x_{it}$ is a vector of time-varying covariates.

The linear panel data model is:

$$y_{it} = x_{it}^\top \beta + \alpha_i + \varepsilon_{it}$$

where $\alpha_i$ is the individual-specific effect and $\varepsilon_{it}$ is the idiosyncratic error term. Our goal is to consistently estimate the causal effect of time-varying regressors (a component of $\beta$) when unobserved heterogeneity may be correlated with them.

The core differences between FE and RE models lie in the way they handle $\alpha_i$, and the assumptions they make about the relationship between $\alpha_i$ and $x_{it}$.

A Closer Look

Refresher on Fixed and Random Effects

In panel data models, the goal is often to account for unobserved heterogeneity across units (e.g., individuals, firms, regions). Two popular approaches to handle this are fixed effects (FE) and random effects (RE) models. Understanding these two approaches is critical before we dive into correlated random effects.

Fixed Effects (FE) Model

The fixed effects model controls for all time-invariant characteristics of the units by allowing each unit to have its own intercept. The key feature of FE models is that $\alpha_i$ is treated as a set of unknown parameters to be estimated (or differenced out). Importantly, $\alpha_i$ is allowed to be correlated with the regressors (i.e., $\mathrm{Cov}(\alpha_i, x_{it}) \neq 0$). This addresses endogeneity driven by time-invariant omitted variables, but it does not, by itself, resolve endogeneity arising from time-varying confounding, simultaneity, or reverse causality (which lives in $\varepsilon_{it}$).

Fixed effects estimation often proceeds by demeaning the data within each unit (also known as the “within transformation”), removing $\alpha_i$:

$$y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)^\top \beta + (\varepsilon_{it} - \bar{\varepsilon}_i)$$

where $\bar{y}_i$ and $\bar{x}_i$ are the within-unit means. This is convenient but comes at the cost of not estimating the effects of time-invariant covariates, which can be of interest in many applications. Even if one attempts to consistently estimate the $\alpha_i$ parameters, this is usually not feasible due to the relatively short panels typically used in empirical work.
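Here is a minimal NumPy sketch of the within transformation on simulated data. The data-generating process (a unit effect that also shifts the regressor, true slope of 0.5) is an assumption chosen purely for illustration, and `group_means` is a made-up helper.

```python
import numpy as np

rng = np.random.default_rng(1988)
n_units, n_periods = 500, 5
ids = np.repeat(np.arange(n_units), n_periods)

alpha = rng.normal(size=n_units)               # unit effect
x = alpha[ids] + rng.normal(size=ids.size)     # regressor correlated with the unit effect
y = 1 + 0.5 * x + alpha[ids] + rng.normal(size=ids.size)

def group_means(v, ids, n_groups):
    """Mean of v within each unit, returned as one value per unit."""
    counts = np.bincount(ids, minlength=n_groups)
    return np.bincount(ids, weights=v, minlength=n_groups) / counts

# Within transformation: subtract unit-specific means
x_w = x - group_means(x, ids, n_units)[ids]
y_w = y - group_means(y, ids, n_units)[ids]
beta_fe = (x_w @ y_w) / (x_w @ x_w)

# Pooled OLS slope for comparison (ignores the unit effect)
xc, yc = x - x.mean(), y - y.mean()
beta_pooled = (xc @ yc) / (xc @ xc)

print(f"FE: {beta_fe:.3f}, pooled OLS: {beta_pooled:.3f}")
```

Demeaning wipes out the unit effect, so the within estimate lands near 0.5, while pooled OLS absorbs the correlation between the regressor and the unit effect and overshoots.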

Fixed effects are especially popular in causal inference because they remove bias from any time-invariant omitted variables. They can be seen as a generalization of the familiar difference-in-differences (DiD) approach, which is just a special case of FE with two time periods and a treatment indicator. They can also fairly easily be extended to triple difference designs, staggered adoption designs, and other more complex causal inference settings.

An example would be an analysis of state-level minimum wage changes on employment outcomes. Different states adopted minimum wage changes at different times, so a simple difference-in-differences analysis would be inappropriate. However, a fixed effects model can be used to estimate the effect of the minimum wage on employment outcomes, holding constant the state-specific time-invariant characteristics (e.g., state-level demographics, permanent economic conditions, policy environment, etc.).

Random Effects (RE) Model

In the RE model, $\alpha_i$ is treated as a random variable drawn from a distribution (usually assumed to be normal):

$$\alpha_i \sim N(0, \sigma_\alpha^2)$$

The crucial assumption in RE models is:

$$\mathbb{E}[\alpha_i \mid x_{i1}, \dots, x_{iT}] = 0$$

Equivalently, RE assumes $\mathbb{E}[\alpha_i \mid X_i] = 0$, where $X_i = (x_{i1}, \dots, x_{iT})$. This allows for more efficient estimation through Generalized Least Squares (GLS), but if the assumption fails, the RE estimates will be biased and inconsistent. The RE model is not commonly used in causal inference because, unlike the FE model, it rules out correlation between covariates and time-invariant unobserved heterogeneity. In short, the FE model is robust but discards between-unit variation, while the RE model is more efficient but relies on a strong independence assumption between covariates and unobserved heterogeneity. The Hausman test evaluates whether the additional orthogonality restrictions imposed by the random effects model are supported by the data.
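For reference, the textbook form of the Hausman statistic compares the two coefficient vectors on the time-varying regressors (the hats and subscripts are notation I am assuming here):

$$H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})^\top \left[ \widehat{\mathrm{Var}}(\hat{\beta}_{FE}) - \widehat{\mathrm{Var}}(\hat{\beta}_{RE}) \right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE})$$

Under the null that the RE assumption holds, $H$ is asymptotically chi-squared with degrees of freedom equal to the number of coefficients compared. A large value says the FE and RE estimates differ by more than sampling noise can explain.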

Correlated Random Effects (CRE) Model

Intuition

The correlated random effects (CRE) model differs from standard fixed and random effects by explicitly modeling the correlation between the unit-specific effects $\alpha_i$ and the covariates $x_{it}$. Instead of assuming independence (as in RE) or differencing out the effects entirely (as in FE), CRE includes the unit-level means of the covariates as additional regressors, allowing for consistent estimation while still retaining the ability to estimate time-invariant variables.

The correlated random effects (CRE) model offers a middle ground between FE and RE approaches. Traditional RE models assume that unobserved heterogeneity is uncorrelated with covariates. FE models remove all unit-level heterogeneity but cannot estimate time-invariant covariates. CRE models address these limitations by including group means of time-varying covariates, decomposing variation into within and between components. Instead of pretending the individual effect is unrelated to observed covariates, we model exactly how it is related — through the individual’s average covariate values.

Estimation and Inference

One way to motivate CRE (Mundlak) is to model the conditional mean of the unit effect as a function of unit-level covariate averages. In the linear case, write:

$$\alpha_i = \bar{x}_i^\top \gamma + u_i$$

where $\bar{x}_i$ is the individual mean of $x_{it}$. Substituting into the outcome equation yields:

$$y_{it} = x_{it}^\top \beta + \bar{x}_i^\top \gamma + u_i + \varepsilon_{it}$$

In practice, you include the unit means for each time-varying regressor (and for any transformations/interactions you want the CRE adjustment to apply to). Estimation uses RE-style methods on this augmented specification; the mean terms absorb the part of $\alpha_i$ that is correlated with $x_{it}$, leaving $u_i$ orthogonal. This also makes it easy to compare within and between effects (for a scalar $x$, the between effect is $\beta + \gamma$).
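A quick way to see why the Mundlak device works: once the unit means are in the regression, the coefficient on the time-varying regressor is identified from within-unit variation only. The sketch below checks this numerically. To keep it short I fit the augmented regression by pooled OLS rather than RE-GLS, which is a swap on my part, and the simulated data are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1988)
n_units, n_periods = 100, 5
ids = np.repeat(np.arange(n_units), n_periods)

z = rng.normal(size=n_units)                   # latent unit-level driver
x = z[ids] + rng.normal(size=ids.size)
alpha = 0.7 * z + rng.normal(size=n_units)     # unit effect correlated with x
y = 1 + 0.5 * x + alpha[ids] + rng.normal(size=ids.size)

# Unit means, broadcast back to the observation level
x_bar = (np.bincount(ids, weights=x) / np.bincount(ids))[ids]
y_bar = (np.bincount(ids, weights=y) / np.bincount(ids))[ids]

# Within (FE) estimator
beta_within = ((x - x_bar) @ (y - y_bar)) / ((x - x_bar) @ (x - x_bar))

# Mundlak regression by pooled OLS: y on [1, x, x_bar]
X = np.column_stack([np.ones_like(x), x, x_bar])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_mundlak, gamma_hat = coef[1], coef[2]
between_effect = beta_mundlak + gamma_hat

print(f"within: {beta_within:.4f}, Mundlak: {beta_mundlak:.4f}, between: {between_effect:.4f}")
```

The first two numbers agree to machine precision, and the sum of the two Mundlak coefficients recovers the between effect.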

Advantages and Challenges

The CRE model offers several advantages. It allows estimation of time-invariant variables, decomposes effects into within- and between-unit components, improves efficiency under relaxed assumptions, and provides a diagnostic check on the plausibility of random effects assumptions.

It is well suited for repeated measures data where both time-varying and time-invariant predictors matter, especially when there is potential endogeneity between covariates and individual effects. Typical applications include policy evaluation, health research, and education studies.

However, CRE models still rely on the random intercept assumption, do not address endogeneity driven by time-varying unobservables (e.g., simultaneity or reverse causality), require care with interaction terms, and may produce biased estimates when the number of clusters is small.

An Example

library(plm)
library(dplyr)

set.seed(1988)
n <- 100
t <- 5
data <- data.frame(
  id = rep(1:n, each = t),
  time = rep(1:t, n)
)
data <- data %>%
  group_by(id) %>%
  mutate(
    z = rnorm(1),
    x = rnorm(n(), mean = z),
    alpha = 0.7 * z + rnorm(1),
    eps = rnorm(n(), sd = 1),
    y = 1 + 0.5 * x + alpha + eps
  )

pdata <- pdata.frame(data, index = c("id", "time"))

fe_model <- plm(y ~ x, data = pdata, model = "within")
re_model <- plm(y ~ x, data = pdata, model = "random")
pdata$mean_x <- ave(pdata$x, pdata$id, FUN = mean)
cre_model <- plm(y ~ x + mean_x, data = pdata, model = "random")

summary(fe_model)
summary(re_model)
summary(cre_model)

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(1988)
n, t = 100, 5
df = pd.DataFrame({
    'id': np.repeat(np.arange(1, n+1), t),
    'time': np.tile(np.arange(1, t+1), n)
})

# Induce correlation between x_it and alpha_i via an id-level latent z_i
z = np.random.randn(n)
df['z'] = np.repeat(z, t)
df['x'] = df['z'] + np.random.randn(n*t)
alpha = 0.7 * z + np.random.randn(n)
df['alpha'] = np.repeat(alpha, t)

df['eps'] = np.random.randn(n*t)
df['y'] = 1 + 0.5 * df['x'] + df['alpha'] + df['eps']
df['mean_x'] = df.groupby('id')['x'].transform('mean')

# Random-intercept RE model
model_re = smf.mixedlm("y ~ x", df, groups=df["id"]).fit(reml=False)

# CRE (Mundlak) model: RE with unit means included
model_cre = smf.mixedlm("y ~ x + mean_x", df, groups=df["id"]).fit(reml=False)

print(model_re.summary())
print(model_cre.summary())

Bottom Line

CRE models relax the strict RE assumptions by modeling the correlation between unit effects and covariates.
They provide within and between estimates while allowing time-invariant variables.
Appropriate for longitudinal, multilevel, and policy evaluation studies.

Where to Learn More

“Microeconometrics: Methods and Applications” by one of my PhD advisors, Colin Cameron, and his long-time coauthor Trivedi, is a classic textbook on panel data models with which I have spent countless hours. It’s a great starting point for most of the material in my blog. Schunck (2013) provides a comprehensive overview of CRE models. Mundlak’s foundational work is essential for understanding the theoretical basis. Tools like R’s plm and Python’s statsmodels can implement these models with the correct transformations.

References

Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: methods and applications. Cambridge university press.
Schunck, R. (2013). Within and between estimates in random-effects models: Advantages and drawbacks of correlated random effects and hybrid models. The Stata Journal, 13(1), 65-76.
Mundlak, Y. (1978). On the pooling of time series and cross section data. Econometrica, 46(1), 69–85.

The Many Flavors of Propensity Score Methods In Causal Inference

Thu, 22 Jan 2026 08:00:00 GMT

10 min read

Background

Introduced by Rosenbaum and Rubin in 1983, the propensity score, the probability of receiving treatment given observed covariates, has become the workhorse for handling confounding in observational studies.

But here’s the thing: the propensity score itself is just the starting point. It designates an entire class of statistical methods for treatment effect estimation. In practice, there are tons of ways to use propensity scores. You can match on them, stratify your sample, weight your observations, or plug them into doubly robust estimators that combine modeling of both the treatment and the outcome. You can tweak how you weight the units—downweighting those with extreme scores or focusing on the region where treated and control groups overlap.

In this post, I’ll explore the many flavors of propensity score methods. As always, the focus is on the intuition, the basic math, and practical considerations. Oh, there is also some R and Python code.

Notation

We’re operating in the familiar causal inference setup:

$D$: treatment indicator.
$X$: observed covariates.
$Y(1), Y(0)$: potential outcomes.

We conveniently invoke the traditional identification assumptions – conditional ignorability, overlap and SUTVA. As a refresher, the propensity score is simply:

$$e(X) = P(D = 1 \mid X)$$

The key seminal result from Rosenbaum and Rubin (1983) states:

$$(Y(1), Y(0)) \perp D \mid e(X)$$

meaning that, conditional on the propensity score, treatment assignment is as good as random. The main implication of this theorem is dimensionality reduction – the propensity score alone is “enough” to adjust for bias between the treatment and control groups.

A crucial but often overlooked point is that different propensity score–based estimators target different causal estimands (e.g., the Average Treatment Effect (ATE), the Average Treatment Effect on the Treated (ATT), or effects defined on overlap populations), so choosing a method implicitly means choosing which population’s effect you want to estimate.

A Closer Look

Propensity Score Estimation

First things first. Before we even begin to discuss propensity score methods, we need to estimate the propensity score itself. This is commonly done via logistic regression (probit has really gone out of fashion). In very, very rare cases the propensity score is known and this step can be skipped. Occasionally, machine learning methods can be employed as well, but one has to be careful there. The subtlety is that, contrary to a traditional machine learning setup, our goal here is not finding the best fit. This is where machine learning methods can mislead us. Instead, we are after controlling for in-sample bias between the treatment and control groups.

The following examples apply several popular propensity score methods to the Iris dataset using both R and Python. For demonstration, we define an artificial binary treatment (D) based on Petal.Length. The outcome variable is Sepal.Length, and the predictors are the remaining covariates.

# Load necessary libraries
library(MatchIt)

# Load iris dataset and create treatment variable
data(iris)
iris$D <- ifelse(iris$Petal.Length > 3, 1, 0)

# Fit propensity score model using logistic regression
ps_model <- glm(D ~ Sepal.Width + Petal.Width, 
                data = iris, 
                family = binomial(link = "logit"))

summary(ps_model)

# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Load iris dataset and create treatment variable
iris = load_iris(as_frame=True).frame
iris['D'] = (iris['petal length (cm)'] > 3).astype(int)

# Prepare features and fit propensity score model
X = iris[['sepal width (cm)', 'petal width (cm)']]
y = iris['D']
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print(f"Intercept: {model.intercept_[0]:.3f}, Coef: {model.coef_.flatten()}")

Nearest Neighbor Matching

Target Estimand: Typically ATT.

This is often the first method people try after estimating the propensity score. Once $\hat{e}(X)$ is estimated, treated units are matched to control units with the closest propensity scores (nearest neighbor). You can match one-to-one, one-to-many, with or without replacement.

This class of methods tends to work well when the number of controls is large enough to find good matches for treated units. The approach is simple and intuitive, reducing high-dimensional matching to a single dimension. However, it’s worth noting that balance on the propensity score doesn’t guarantee balance on covariates, and the method can be sensitive to poor matches when suitable controls are scarce.

Lastly, inference after matching is subtle; standard errors must account for the matching procedure, and naïve bootstrap methods are generally invalid. Matching with replacement introduces some additional complexity since some data points are used more than once.

matchit_nn <- matchit(D ~ Sepal.Width + Petal.Width, data = iris, method = "nearest")
summary(matchit_nn)

from causalinference import CausalModel

# Prepare data for CausalModel
Y = iris['sepal length (cm)'].values
T = iris['D'].values
X = iris[['sepal width (cm)', 'petal width (cm)']].values

# Fit causal model with nearest neighbor matching
# (note: est_via_matching matches on the covariates in X, not on the estimated propensity score)
cm = CausalModel(Y=Y, D=T, X=X)
cm.est_propensity_s()
cm.est_via_matching(bias_adj=True)
print(cm.estimates)
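If you want to see the mechanics without a library, here is a minimal sketch of 1:1 nearest neighbor matching on the propensity score, with replacement. Everything about the setup is assumed for illustration: the data are simulated, the treatment effect is a constant 2, and the true propensity score is treated as known instead of being estimated.

```python
import numpy as np

rng = np.random.default_rng(1988)
n = 2000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-(x - 1.0)))     # true propensity score (assumed known here)
d = rng.binomial(1, ps)
y = 2.0 * d + x + rng.normal(size=n)  # constant treatment effect of 2

treated = np.flatnonzero(d == 1)
control = np.flatnonzero(d == 0)

# For each treated unit, find the control unit with the closest propensity score
dist = np.abs(ps[treated][:, None] - ps[control][None, :])
match = control[np.argmin(dist, axis=1)]   # controls can be reused (with replacement)

att_matched = np.mean(y[treated] - y[match])
naive_diff = y[treated].mean() - y[control].mean()

print(f"Matched ATT: {att_matched:.3f}, naive difference: {naive_diff:.3f}")
```

The naive difference in means is contaminated by the confounder, while the matched estimate sits close to the true effect. Standard errors are deliberately left out, for the reasons discussed above.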

Caliper Matching

Target Estimand: Typically ATT.

Caliper matching adds a threshold: only match treated and control units if their propensity scores are within a specified distance (the caliper). Often the caliper is set to 0.2 times the standard deviation of the logit of the propensity score (Austin 2010).

This approach is particularly useful when you want to avoid bad matches that can arise in standard nearest neighbor matching. By imposing a maximum allowable distance, caliper matching prevents extreme mismatches and generally improves balance between treatment and control groups. The main tradeoff is that it may discard treated units if no control unit falls within the caliper distance, potentially reducing sample size and raising questions about external validity for the excluded observations.

# Estimate propensity scores for caliper calculation
ps_for_caliper <- glm(D ~ Sepal.Width + Petal.Width, data = iris, family = binomial)
ps_vals <- predict(ps_for_caliper, type = "response")
logit_ps <- log(ps_vals / (1 - ps_vals))

# Calculate caliper (0.2 * SD of logit PS, as recommended by Austin 2010)
caliper_width <- 0.2 * sd(logit_ps)

# Perform caliper matching
matchit_caliper <- matchit(D ~ Sepal.Width + Petal.Width, 
                           data = iris, 
                           method = "nearest", 
                           distance = "glm",
                           caliper = caliper_width,
                           std.caliper = FALSE)
summary(matchit_caliper)

# Calculate propensity scores
ps = model.predict_proba(X)[:, 1]

# Calculate caliper (0.2 * SD of logit of propensity score)
logit_ps = np.log(ps / (1 - ps + 1e-10))  # Small constant to avoid division by zero
caliper = 0.2 * np.std(logit_ps)

# Simplified 1:1 caliper matching (with replacement)
matched_pairs = []
treated_idx = iris[iris['D'] == 1].index
control_idx = iris[iris['D'] == 0].index

for t_idx in treated_idx:
    # Calculate distances in logit space
    t_logit = logit_ps[t_idx]
    c_logits = logit_ps[control_idx]
    distances = np.abs(t_logit - c_logits)
    
    min_dist = distances.min()
    if min_dist <= caliper:
        min_dist_idx = control_idx[np.argmin(distances)]
        matched_pairs.append((t_idx, min_dist_idx))

print(f"Matched {len(matched_pairs)} out of {len(treated_idx)} treated units within caliper")

Stratification / Blocking

Target Estimand: Typically ATE.

Here, the range of propensity scores is divided into strata (often quintiles), and treatment effects are estimated within each stratum, then averaged across strata.

Stratification is particularly appealing when matching isn’t feasible or when you prefer a more aggregate approach to adjustment. The method is straightforward to implement and achieves balance on average within each stratum. However, because it discretizes the propensity score into bins, the adjustment can be somewhat coarse, and bias may not be fully eliminated within each stratum, especially if there’s substantial heterogeneity in propensity scores within a given stratum.
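In symbols, with $K$ strata, $n_k$ units in stratum $k$, and $\bar{Y}_{1k}$ and $\bar{Y}_{0k}$ the treated and control outcome means in that stratum (notation assumed here), the stratified ATE estimator is:

$$\hat{\tau}_{\text{strat}} = \sum_{k=1}^{K} \frac{n_k}{n} \left( \bar{Y}_{1k} - \bar{Y}_{0k} \right)$$

With equal-sized strata such as quintiles, the weights reduce to $1/K$ and this is just a simple average of the stratum-level differences. Weighting by each stratum's share of treated units instead targets the ATT.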

matchit_strat <- matchit(D ~ Sepal.Width + Petal.Width, data = iris, method = "subclass", subclass = 5)
md <- match.data(matchit_strat)
stratum_effects <- sapply(1:max(md$subclass, na.rm = TRUE), function(s) {
  sub <- md[md$subclass == s, ]
  if (sum(sub$D) > 0 && sum(1 - sub$D) > 0)
    mean(sub$Sepal.Length[sub$D == 1]) - mean(sub$Sepal.Length[sub$D == 0]) else NA
})
ate_stratified <- mean(stratum_effects, na.rm = TRUE)
print(paste("Stratified ATE:", round(ate_stratified, 3)))

# Calculate propensity scores
if 'ps' not in iris.columns:
    iris['ps'] = model.predict_proba(X)[:, 1]

# Stratification by propensity score quintiles
iris['ps_stratum'] = pd.qcut(iris['ps'], q=5, labels=False, duplicates='drop')

# Estimate treatment effect within each stratum
stratum_effects = []
for stratum in iris['ps_stratum'].unique():
    stratum_data = iris[iris['ps_stratum'] == stratum]
    treated = stratum_data[stratum_data['D'] == 1]['sepal length (cm)']
    control = stratum_data[stratum_data['D'] == 0]['sepal length (cm)']
    if len(treated) > 0 and len(control) > 0:
        effect = treated.mean() - control.mean()
        stratum_effects.append(effect)

# Overall effect (simple average across strata)
ate_stratified = np.mean(stratum_effects)
print(f"Stratified ATE: {ate_stratified:.3f}")

Inverse Probability Weighting (IPW)

Target Estimand: ATT/ATE.

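IPW turns the propensity score into weights:

$$w_i = \frac{D_i}{e(X_i)} + \frac{1 - D_i}{1 - e(X_i)}$$

These are the ATE weights; for the ATT, treated units get weight 1 and control units get $e(X_i) / (1 - e(X_i))$. This reweights the sample so that treated and control groups resemble each other on observed covariates.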

This method is ideal when you want to utilize the entire dataset without discarding any units. IPW is conceptually simple and makes full use of all available observations. The main challenge, however, is its sensitivity to extreme propensity scores near 0 or 1. When units have very low or very high probabilities of treatment, the inverse weighting can produce extremely large weights, leading to unstable estimates with high variance. This is why trimming or other stabilization techniques are often employed alongside IPW.

iris$ps <- predict(ps_model, type = "response")
iris$weights <- ifelse(iris$D == 1, 1 / iris$ps, 1 / (1 - iris$ps))
summary(iris$weights)

iris['ps'] = model.predict_proba(X)[:,1]
iris['weights'] = np.where(iris['D'] == 1, 1 / iris['ps'], 1 / (1 - iris['ps']))
iris['weights'].describe()
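The snippets above stop at the weights. To show how they turn into an estimate, here is a small sketch on simulated data with normalized weights (each group's weights are rescaled to sum to one). The simulation design, the known propensity score, and the helper `weighted_diff` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1988)
n = 5000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-0.5 * x))       # true propensity score (assumed known here)
d = rng.binomial(1, ps)
y = 2.0 * d + x + rng.normal(size=n)  # constant treatment effect of 2

def weighted_diff(y, d, w):
    """Difference of weighted outcome means between treated and control units."""
    t, c = d == 1, d == 0
    return np.average(y[t], weights=w[t]) - np.average(y[c], weights=w[c])

w_ate = np.where(d == 1, 1 / ps, 1 / (1 - ps))   # ATE weights
w_att = np.where(d == 1, 1.0, ps / (1 - ps))     # ATT weights

ate_ipw = weighted_diff(y, d, w_ate)
att_ipw = weighted_diff(y, d, w_att)
naive = y[d == 1].mean() - y[d == 0].mean()

print(f"IPW ATE: {ate_ipw:.3f}, IPW ATT: {att_ipw:.3f}, naive: {naive:.3f}")
```

Because the simulated effect is constant, the ATE and ATT coincide and both weighted estimates land near 2, while the naive comparison is biased upward.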

Augmented IPW (AIPW) / Doubly Robust Estimators

Target Estimand: ATT/ATE.

Many modern estimators combine propensity score weighting with outcome modeling. The key appeal: if either the propensity score model or the outcome model is correctly specified (but not necessarily both), the estimator remains consistent. This is called the doubly robust property.

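The AIPW estimator for the ATE looks like:

$$\hat{\tau}_{AIPW} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i \, (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - D_i) \, (Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right]$$

where $\hat{\mu}_d(X)$ is the predicted outcome for treatment group $d$.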

This approach is particularly valuable when you want robust estimation but are uncertain about whether your propensity score model or outcome model is correctly specified. The doubly robust property provides a safety net: you only need one of the two models to be correct. Additionally, AIPW makes efficient use of the available data. The cost is increased computational complexity, since both the treatment and outcome models must be estimated, and careful attention must be paid to how these models interact.

# Manual AIPW implementation
# Step 1: Estimate propensity scores
ps_fit <- glm(D ~ Sepal.Width + Petal.Width, data = iris, family = binomial)
iris$ps <- predict(ps_fit, type = "response")

# Step 2: Estimate outcome models for each treatment group
outcome_treated <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, 
                      data = iris[iris$D == 1, ])
outcome_control <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, 
                      data = iris[iris$D == 0, ])

# Step 3: Predict potential outcomes for all units
iris$mu1 <- predict(outcome_treated, newdata = iris)
iris$mu0 <- predict(outcome_control, newdata = iris)

# Step 4: Calculate AIPW estimator
aipw_component <- with(iris, 
  (D * (Sepal.Length - mu1) / ps) - 
  ((1 - D) * (Sepal.Length - mu0) / (1 - ps)) + 
  (mu1 - mu0)
)
ate_aipw <- mean(aipw_component)
print(paste("AIPW ATE:", round(ate_aipw, 3)))

# Using EconML for Doubly Robust estimation
from econml.dr import DRLearner
from sklearn.linear_model import LinearRegression

# Prepare data
Y = iris['sepal length (cm)'].values
T = iris['D'].values
X = iris[['sepal width (cm)', 'petal width (cm)']].values
W = X  # Covariates for confounding

# Fit doubly robust learner
dr = DRLearner(
    model_propensity=LogisticRegression(max_iter=1000),
    model_regression=LinearRegression(),
    model_final=LinearRegression(),
    cv=3
)
dr.fit(Y, T, X=None, W=W)  # X=None for constant treatment effect

# Estimate ATE (ate() may return array; take scalar for display)
ate_result = dr.ate(X=None)
ate_est = float(np.asarray(ate_result).flatten()[0])
print(f"Doubly Robust ATE: {ate_est:.3f}")
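To see the doubly robust property in action without any library, here is a hand-rolled sketch on simulated data. I deliberately pair a correct propensity score (assumed known) with a badly misspecified outcome model that ignores the confounder entirely; the design and the constant effect of 2 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1988)
n = 5000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-0.5 * x))       # true propensity score (assumed known here)
d = rng.binomial(1, ps)
y = 2.0 * d + x + rng.normal(size=n)  # constant treatment effect of 2

# Deliberately misspecified outcome model: group means that ignore x
mu1 = np.full(n, y[d == 1].mean())
mu0 = np.full(n, y[d == 0].mean())

# AIPW: outcome-model contrast plus inverse-probability-weighted residual corrections
aipw_terms = (
    mu1 - mu0
    + d * (y - mu1) / ps
    - (1 - d) * (y - mu0) / (1 - ps)
)
ate_aipw = aipw_terms.mean()
ate_outcome_only = (mu1 - mu0).mean()  # what the bad outcome model alone would give

print(f"AIPW ATE: {ate_aipw:.3f}, outcome model only: {ate_outcome_only:.3f}")
```

The outcome model alone just reproduces the biased difference in means, but the weighted residual correction pulls the AIPW estimate back toward the true effect.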

Covariate Balancing Propensity Score (CBPS)

Target Estimand: Typically ATE.

CBPS, introduced by Imai and Ratkovic (2014), directly estimates the propensity score while optimizing covariate balance. Instead of fitting a logistic regression and then checking balance, CBPS ensures balance is achieved as part of the estimation process. Example below in R (CBPS package); Python users can look to balance-focused weighting in other libraries.

This method shines when standard propensity score estimation leads to poor covariate balance. Rather than the typical iterate-and-check workflow, CBPS achieves good balance without requiring manual tuning, working directly toward the ultimate goal of creating comparable groups. The main drawbacks are that it’s more complex to implement than standard logistic regression and less widely available in standard statistical packages, though dedicated R packages do exist.

library(CBPS)
cbps_fit <- CBPS(D ~ Sepal.Width + Petal.Width, data = iris)
summary(cbps_fit)

Overlap Weights

Target Estimand: Overlap-weighted ATE.

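Overlap weighting focuses on the region of common support, where treated and control units both exist, by assigning each treated unit the weight $1 - e(X_i)$ and each control unit the weight $e(X_i)$, where $e(X_i)$ is the estimated propensity score. This downweights units with extreme scores near 0 or 1 and emphasizes comparability.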

This weighting scheme is ideal when you want to avoid extrapolation and focus inference on the region where treated and control units truly overlap. The approach naturally sidesteps the instability that plagues standard IPW when propensity scores approach the boundaries, and it targets what’s sometimes called the “overlap population.” The key consideration is that the resulting estimate represents the treatment effect for this overlap population, which may differ from the overall ATE or the ATT, depending on how representative the overlap region is of the full sample.

# Calculate overlap weights
iris$overlap_weights <- ifelse(iris$D == 1, 1 - iris$ps, iris$ps)
summary(iris$overlap_weights)

# Calculate propensity scores if not already done
if 'ps' not in iris.columns:
    iris['ps'] = model.predict_proba(X)[:, 1]

# Calculate overlap weights
iris['overlap_weights'] = np.where(iris['D'] == 1, 1 - iris['ps'], iris['ps'])
iris['overlap_weights'].describe()
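The weights above are only an intermediate step. Here is a minimal Python sketch of how they might feed into an effect estimate, using simulated data rather than iris. The helper `overlap_weighted_effect` and the toy data-generating process are my own illustrative assumptions, and the true propensity score is plugged in where an estimate would normally go.

```python
import numpy as np

def overlap_weighted_effect(y, d, ps):
    """Difference in overlap-weighted outcome means (treated minus control)."""
    y, d, ps = map(np.asarray, (y, d, ps))
    w = np.where(d == 1, 1 - ps, ps)  # overlap weights
    treated_mean = np.average(y[d == 1], weights=w[d == 1])
    control_mean = np.average(y[d == 0], weights=w[d == 0])
    return treated_mean - control_mean

# Toy data (hypothetical): constant effect of 2, treatment more likely when x is large
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
ps_true = 1 / (1 + np.exp(-1.5 * x))
d = rng.binomial(1, ps_true)
y = 2.0 * d + x + rng.normal(size=2000)

naive = y[d == 1].mean() - y[d == 0].mean()
ow = overlap_weighted_effect(y, d, ps_true)
print(f"Naive difference: {naive:.3f}")
print(f"Overlap-weighted: {ow:.3f}")
```

Because the treatment effect is constant in this simulation, the overlap-weighted estimate should land near 2, while the naive difference absorbs the confounding through `x`.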

Entropy Balancing

Target Estimand: ATT/ATE.

Entropy balancing directly reweights the control group so that the moments of the covariates (mean, variance, etc.) match exactly between treated and control groups. Instead of matching or stratifying, this solves a constrained optimization problem that minimizes the Kullback-Leibler divergence of weights subject to balance constraints. Example below in R (ebal package).

This method is particularly useful when balance proves difficult to achieve with traditional weighting schemes. Entropy balancing guarantees exact balance on the chosen covariate moments and fully utilizes all available data without discarding observations. The analyst must specify which moments (typically means, and sometimes variances and skewness) should be balanced, and the results can be sensitive to these choices. Nevertheless, the method offers strong guarantees and has gained popularity for applications where precise balance is paramount.

library(ebal)

# Fit entropy balancing (balance covariates between treated and control)
eb_fit <- ebalance(Treatment = iris$D, X = as.matrix(iris[, c("Sepal.Width", "Petal.Width")]))
summary(eb_fit)
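For Python users, the core optimization can be sketched in a few lines. This is not the `ebal` implementation. It is a minimal version that balances means only, targets the ATT, and solves the convex dual problem with SciPy. The function name `entropy_balance_weights` and the simulated covariates are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def entropy_balance_weights(x_control, target_means):
    """Control weights (summing to 1) whose weighted covariate means hit target_means."""
    z = np.asarray(x_control, dtype=float) - np.asarray(target_means, dtype=float)

    def dual(lam):
        # Convex dual objective; its gradient is the weighted mean of z
        return logsumexp(z @ lam)

    def grad(lam):
        scores = z @ lam
        w = np.exp(scores - logsumexp(scores))
        return w @ z

    res = minimize(dual, x0=np.zeros(z.shape[1]), jac=grad, method="BFGS")
    scores = z @ res.x
    return np.exp(scores - logsumexp(scores))

# Simulated covariates (hypothetical): treated units are shifted relative to controls
rng = np.random.default_rng(1)
x_c = rng.normal(0.0, 1.0, size=(500, 2))
x_t = rng.normal(0.5, 1.0, size=(200, 2))

w = entropy_balance_weights(x_c, x_t.mean(axis=0))
print("Treated means:          ", x_t.mean(axis=0))
print("Reweighted control means:", w @ x_c)
```

The weights take the exponential-tilting form, which is what minimizing the Kullback-Leibler divergence from uniform weights subject to the mean constraints implies. Balancing variances as well would amount to adding squared covariates as extra columns.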

Bottom Line

Propensity score methods provide diverse approaches to estimate causal effects from observational data, each with unique strengths and trade-offs.
Some methods prioritize simplicity (e.g., stratification) or data retention (e.g., IPW), while others focus on robustness or balance (AIPW, CBPS).
Doubly robust methods like AIPW offer reliability even when one model is misspecified, while others (e.g., entropy balancing) guarantee perfect balance through optimization.
No single method is universally best. The choice hinges on practical considerations: sample size, covariate overlap between groups, and whether exact balance or data efficiency is prioritized.
All else equal, doubly robust estimators offer extra protection against biased results.

Where to Learn More

For the original introduction to propensity scores, see Rosenbaum and Rubin’s (1983) landmark paper. Imai and Ratkovic’s (2014) work on CBPS is a must-read for understanding balance-focused estimation. The textbook Causal Inference for Statistics, Social, and Biomedical Sciences by Imbens and Rubin (2015) provides excellent coverage of these methods. There are also great tutorials and vignettes in R packages like MatchIt, twang, and WeightIt.

References

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
Imai, K., & Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1), 243–263.
Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
Hainmueller, J. (2012). Entropy balancing for causal effects. Political Analysis, 20(1), 25–46.

The Wilcoxon-Mann-Whitney Test is Not a Test of Medians

Tue, 20 Jan 2026 08:00:00 GMT

5 min read

Background

Nonparametric tests like the Wilcoxon-Mann-Whitney (WMW) are among the most popular alternatives to parametric tests such as the two-sample $t$-test in settings where normality assumptions break down. Often described as a “test of medians,” WMW is used when comparing two independent groups without making strong assumptions about the underlying distributions. It is also known as the Mann-Whitney-Wilcoxon (MWW) test or the Wilcoxon rank-sum test.

Despite this common interpretation, the WMW test is not a test of medians—at least not in general. Divine et al. (2018) dive deep into this misconception and show convincingly how the WMW test can lead you astray if you’re specifically interested in comparing medians.

This article explains why that happens, provides some intuition and math, and shows you how to think more clearly about what the WMW test actually does.

Notation

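Let $X_1, \dots, X_{n_1}$ and $Y_1, \dots, Y_{n_2}$ be two independent random samples from distributions $F$ and $G$, respectively. The Wilcoxon-Mann-Whitney statistic is based on the probability:

$$\theta = P(X > Y), \quad X \sim F, \; Y \sim G.$$

This quantity is sometimes referred to as the probability of superiority.

Let $m_F$ and $m_G$ denote the medians of $F$ and $G$. We often want to test:

$$H_0: m_F = m_G \quad \text{versus} \quad H_1: m_F \neq m_G.$$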

But WMW does not directly test this hypothesis unless very specific conditions are met.

A Closer Look

What Does WMW Actually Test?

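The WMW test assesses whether one distribution tends to produce larger values than the other. More formally, for independent draws $X \sim F$ and $Y \sim G$ from continuous distributions, it tests:

$$H_0: P(X > Y) = \tfrac{1}{2} \quad \text{versus} \quad H_1: P(X > Y) \neq \tfrac{1}{2}.$$

This is equivalent to testing whether the distributions are stochastically equal, not whether the medians are equal.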

The WMW test can be performed via rank sums. After combining both samples, we rank all observations from smallest to largest. The test statistic is the sum of ranks assigned to the first sample:

$$W = \sum_{i=1}^{n_1} R_i,$$

where $R_i$ is the rank of $X_i$ in the combined sample and $n_1$ and $n_2$ are the two sample sizes.

This rank-based formulation is mathematically equivalent to counting how many pairs $(X_i, Y_j)$ have $X_i > Y_j$, which relates to the probability interpretation above. Under the null hypothesis, the expected rank sum is $n_1(n_1 + n_2 + 1)/2$.
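The equivalence between the rank-sum form and the pair-counting form is easy to check numerically. The sketch below uses simulated exponential samples (an arbitrary choice of mine) and relies on the identity that, without ties, the rank sum equals the pair count plus $n_1(n_1 + 1)/2$.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(42)
x = rng.exponential(scale=1.0, size=30)
y = rng.exponential(scale=2 / 3, size=40)
n1, n2 = len(x), len(y)

# Rank-sum form: rank the pooled sample, then add up the ranks of x
ranks = rankdata(np.concatenate([x, y]))
w = ranks[:n1].sum()

# Pair-counting form: number of (x_i, y_j) pairs with x_i > y_j
u = (x[:, None] > y[None, :]).sum()

print(w, u + n1 * (n1 + 1) / 2)  # the two numbers agree
print(u / (n1 * n2))             # sample estimate of P(X > Y)
```

Dividing the pair count by $n_1 n_2$ gives the sample version of the probability of superiority, which is the quantity the test is really about.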

Understanding Stochastic Dominance

When we say the WMW test examines “stochastic dominance,” we mean it tests whether values from one distribution tend to exceed values from the other. Specifically, distribution $F$ stochastically dominates distribution $G$ if their CDFs satisfy:

$$F(t) \leq G(t) \quad \text{for all } t,$$

with strict inequality for at least some values of $t$. Intuitively, this means a randomly selected value from $F$ is more likely to be larger than a randomly selected value from $G$.

This is quite different from comparing medians. Two distributions can have identical medians but exhibit stochastic dominance, or they can have different medians but neither stochastically dominates the other.

When Does It Coincide with a Median Test?

The WMW test only functions as a test of medians under symmetric distributions with equal shape and spread. If the shapes differ—say, one is skewed left and the other right—then even if the medians are the same, WMW can reject the null. Worse, it might fail to reject when the medians are different but the distributions have similar overall ranks.
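To make the equal-medians case concrete, here is a small simulation sketch. The construction is my own: an exponential variable shifted to have median zero, and its mirror image. Both populations have median exactly 0, yet the probability that a draw from the first exceeds a draw from the second is about 0.60.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
n = 5000
shift = np.log(2)                    # median of an Exponential(1) variable
x = rng.exponential(size=n) - shift  # right-skewed, population median 0
y = shift - rng.exponential(size=n)  # left-skewed mirror image, population median 0

print("Median x:", np.median(x))
print("Median y:", np.median(y))
print("Share of pairs with x > y:", (x[:, None] > y[None, :]).mean())
print(mannwhitneyu(x, y, alternative="two-sided"))
```

The sample medians are nearly identical, but the test rejects decisively because the two shapes induce a clear ordering between random draws.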

Alternative Tests

If your research question specifically concerns differences in medians, more appropriate tests include:

Mood’s median test: A true test of median equality that uses contingency tables based on counts above and below the combined median.
Quantile regression: For more complex designs, quantile regression directly models the median (or other quantiles) and tests differences between groups.
Bootstrap confidence intervals: Calculating confidence intervals for the difference in medians via bootstrapping provides both a test and measure of uncertainty.

These approaches directly address median differences rather than the stochastic ordering tested by WMW.
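As one illustration of the bootstrap option, here is a minimal percentile-bootstrap sketch for the difference in medians. The helper name `bootstrap_median_diff_ci`, the number of resamples, and the simulated data are all assumptions for illustration. Other bootstrap variants (BCa, for instance) may behave better in small samples.

```python
import numpy as np

def bootstrap_median_diff_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for median(x) - median(y), resampling each group separately."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(x, size=x.size, replace=True)
        yb = rng.choice(y, size=y.size, replace=True)
        diffs[b] = np.median(xb) - np.median(yb)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

rng = np.random.default_rng(123)
x = rng.exponential(scale=1.0, size=100)
y = rng.exponential(scale=2 / 3, size=100)
lo, hi = bootstrap_median_diff_ci(x, y)
print(f"Observed difference in medians: {np.median(x) - np.median(y):.3f}")
print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, that is evidence of a difference in medians specifically, which is the claim WMW cannot support on its own.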

An Example

Let’s see this in action with a small simulation.

set.seed(123)
x <- rexp(100, rate = 1)         # Right-skewed
y <- rexp(100, rate = 1.5)       # Also right-skewed, different rate

median(x)  # Median of x
median(y)  # Median of y

wilcox.test(x, y)

import numpy as np
from scipy.stats import mannwhitneyu

np.random.seed(123)
x = np.random.exponential(scale=1.0, size=100)
y = np.random.exponential(scale=2/3, size=100)  # scale = 1 / rate, so rate 1.5 -> scale 2/3

print("Median x:", np.median(x))
print("Median y:", np.median(y))

res = mannwhitneyu(x, y, alternative='two-sided')
print(res)

This example demonstrates our point perfectly: The medians are clearly different (0.6334 vs. 0.4865), and the WMW test correctly rejects the null hypothesis (p = 0.004). However, this rejection occurs because the exponential distributions with different rates create a consistent stochastic ordering, not because it’s specifically testing the medians.

More generally, the WMW test might fail to reject the null despite different medians, or it might reject because of shape differences rather than a difference in medians.

Bottom Line

The Wilcoxon-Mann-Whitney test is not a general test of medians.
It tests for stochastic dominance or shift in distribution, not specifically median difference.
It behaves like a median test only under certain conditions (e.g., identical shape).
Be cautious interpreting WMW results as saying something about medians unless distributional assumptions are met.

Where to Learn More

For a deeper dive, read the original Divine et al. (2018) paper. You might also want to look at literature on robust location tests or permutation-based alternatives that better target the median.

References

Divine, G. W., Norton, H. J., Barón, A. E., & Juarez-Colunga, E. (2018). The Wilcoxon–Mann–Whitney procedure fails as a test of medians. The American Statistician, 72(3), 278–286.

Hollander, M., Wolfe, D. A., & Chicken, E. (2013). Nonparametric Statistical Methods (3rd ed.). Wiley.

Unconditional Quantile Regression and Treatment Effects

Sun, 21 Dec 2025 08:00:00 GMT

9 min read

Background

Quantile regression has become a widely used tool in econometrics and statistics, thanks to its ability to model the entire distribution of an outcome variable rather than just its mean. Traditional quantile regression (Koenker and Bassett, 1978), however, is conditional as it models quantiles of the outcome given a set of covariates. But in many policy and causal inference applications, we are interested in changes to the unconditional distribution of the outcome variable.

For example, suppose we want to understand the effect of a job training program on wage inequality. A standard quantile regression would tell us how the program shifts quantiles given certain characteristics like education or experience. In other words, the focus is on the quantile of the residual term in the linear model. But we might instead want to estimate how the program shifts quantiles in the population as a whole. This is where Unconditional Quantile Regression (UQR) comes in.

The key breakthrough in this space was provided by Firpo, Fortin, and Lemieux (2009), who introduced a method based on the Recentered Influence Function (RIF). This allows us to estimate the effect of covariates on unconditional quantiles using simple linear regressions. Later, Frölich and Melly (2013) extended this framework to account for endogeneity, providing a way to estimate Unconditional Quantile Treatment Effects (UQTEs) in settings where treatment is not randomly assigned.

In this article, I’ll unpack the key ideas behind UQR, discuss how to estimate unconditional quantile treatment effects, and illustrate these concepts with an example in R and Python.

Notation

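As a refresher, the $\tau$-th quantile of an outcome variable $Y$ is defined as:

$$q_\tau = \inf\{y : F_Y(y) \geq \tau\},$$

where $F_Y$ is the unconditional distribution function of $Y$. UQR allows us to estimate how covariates influence these unconditional quantiles.

Now consider adding a set of covariates $X$. In traditional quantile regression, we estimate the conditional quantile function:

$$Q_\tau(Y \mid X) = \inf\{y : F_{Y \mid X}(y \mid X) \geq \tau\}.$$

This tells us how the $\tau$-th quantile of $Y$ given $X$ changes with $X$. Traditional quantile regression models the impact on this conditional quantile.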

A Closer Look

Unconditional Quantile Regression

Definition

Firpo et al. (2009) introduced an elegant way to estimate UQR using influence functions. The influence function of a statistic measures how much that statistic changes when an observation is perturbed. The recentered influence function (RIF) for a quantile is given by:

$$\text{RIF}(Y; q_\tau) = q_\tau + \frac{\tau - \mathbb{1}\{Y \leq q_\tau\}}{f_Y(q_\tau)}.$$

Here, $f_Y(q_\tau)$ is the density of $Y$ at $q_\tau$, which can be estimated nonparametrically. This is often done via kernel density estimation, which may be imprecise in the tails.

Firpo et al. showed that regressing $\text{RIF}(Y; q_\tau)$ on covariates $X$ via OLS provides a valid estimate of how $X$ affects the $\tau$-th quantile of $Y$. This method is remarkably simple but powerful: it transforms a quantile regression problem into a standard linear regression problem.

This idea also generalizes to other distributional statistics (Gini, variance) by using the corresponding influence functions.

Estimation

The estimation proceeds in three steps:

Algorithm:

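Estimate the sample quantile $\hat{q}_\tau$.
Estimate the density $\hat{f}_Y(\hat{q}_\tau)$, typically via kernel density estimation.
Construct the RIF for each observation and regress it on the covariates.

The basic regression is:

$$\text{RIF}(Y_i; \hat{q}_\tau) = X_i'\beta_\tau + \varepsilon_i,$$

where $\beta_\tau$ now captures the effect of $X$ on the $\tau$-th unconditional quantile.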

The most common implementation is RIF-OLS, though alternatives include RIF-Logit and nonparametric first stages (RIF-NP).

Inference

Density estimation is a critical step that affects the quality of inference as poor estimates at the quantile point can lead to noisy estimates. Because of the multi-step estimation (quantile, density, RIF), standard error computation is more complex. Bootstrapping is commonly used and has been shown to perform well in practice.
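Here is a minimal sketch of what bootstrapping the full pipeline might look like. Each resample re-estimates the quantile, the density, and the regression, so the standard error reflects all three steps. The helper names `rif_ols_slope` and `bootstrap_se`, the single-covariate setup, and the simulated location-shift data are assumptions for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def rif_ols_slope(y, x, tau):
    """Slope from regressing the RIF of the tau-th quantile of y on x."""
    q = np.quantile(y, tau)
    f = gaussian_kde(y)(q)[0]  # density of y at q
    rif = q + (tau - (y <= q)) / f
    slope, _intercept = np.polyfit(x, rif, deg=1)
    return slope

def bootstrap_se(y, x, tau, n_boot=200, seed=0):
    """Nonparametric bootstrap: redo quantile, density, and regression per resample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        slopes[b] = rif_ols_slope(y[idx], x[idx], tau)
    return slopes.std(ddof=1)

# Simulated data (hypothetical): location-shift model with slope 2
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 1.0 + 2.0 * x + rng.normal(size=500)
est = rif_ols_slope(y, x, tau=0.5)
se = bootstrap_se(y, x, tau=0.5)
print(f"RIF-OLS slope at the median: {est:.3f} (bootstrap SE {se:.3f})")
```

In a pure location-shift model like this one, the unconditional effect at the median equals the regression slope, so the estimate should sit near 2.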

Challenges

RIF-OLS is a linear model and as such it assumes a linear relationship between the RIF and covariates. If the true relationship is nonlinear, flexible methods (logit, nonparametric) are preferred. UQR is especially appealing for estimating treatment effects on the distribution of outcomes in quasi-experimental settings. When treatment is exogenous (conditional on the covariates), including treatment indicators in the RIF regression yields estimates of the treatment effect at various unconditional quantiles. This is a perfect segue for the next section.

Unconditional Quantile Treatment Effects

One limitation of UQR as formulated by Firpo et al. is that it assumes covariates are exogenous. But in many causal inference settings, treatment assignment is endogenous (e.g., workers self-select into training programs). Frölich and Melly (2013) extended the UQR framework to handle endogeneity using instrumental variables (IV). The authors built on earlier work by Chernozhukov and Hansen (2005) which pioneered the estimation of (conditional) quantile treatment effects in the presence of endogeneity.

Frölich and Melly showed that under standard IV assumptions—relevance and exclusion—the unconditional quantile treatment effect (UQTE) can be estimated using a two-step approach:

Algorithm:

Estimate a propensity score model (or an instrumented version of the treatment) to account for selection bias.
Use IV-based weighting to recover the counterfactual unconditional outcome distributions for compliers, and apply RIF methods to estimate UQTEs.

This approach provides a way to estimate distributional treatment effects while addressing selection bias—a crucial tool in policy evaluation and applied econometrics.

Rank Invariance in QTEs

A crucial assumption often invoked in the estimation of quantile treatment effects (QTEs) is rank invariance. This assumption states that units maintain their rank in the outcome distribution after receiving the treatment. In other words, if a treated unit was at the 30th percentile of the untreated outcome distribution, it would remain at the 30th percentile of the treated distribution.

While this assumption simplifies identification and interpretation of QTEs, it can be highly restrictive. It rules out the possibility that treatment reshuffles individuals across the distribution—a scenario that might be not only plausible but central in many applications.

Consider a school voucher program that offers private school access to low-income students. The effect of such a program may be heterogeneous: for high-performing students, access might enhance performance due to better environments. But for low-performing students, the same access could lead to worse outcomes due to higher academic pressure or poor fit. As a result, the program could re-rank students in the outcome distribution, violating rank invariance.

In such settings, assuming rank invariance could lead to misleading conclusions about who benefits and who loses from treatment. Alternative approaches, like those based on quantile treatment effect bounds (e.g., Melly, 2005; Chernozhukov & Hansen, 2005), are more robust to such violations.

Examples

Bitler et al. (2006)

When evaluating the effects of welfare reform, traditional analyses often focus on mean impacts, which can obscure critical insights into the distributional effects of policy changes. Quantile Treatment Effects (QTE) provide a powerful tool for understanding how reforms impact different segments of the population, revealing heterogeneity that mean impacts fail to capture. For example, the study “What Mean Impacts Miss: Distributional Effects of Welfare Reform Experiments” by Bitler, Gelbach, and Hoynes uses QTE to analyze Connecticut’s Jobs First program, a welfare reform initiative.

The authors find that while mean impacts suggest modest income gains, QTE reveal substantial variation: earnings effects are zero at the bottom, positive in the middle, and negative at the top of the distribution before time limits take effect. After time limits, income effects are mixed, with gains concentrated in higher quantiles and losses at the lower end. This nuanced approach highlights the importance of QTE in uncovering the true breadth of policy impacts, enabling data scientists to better inform decision-making and address equity concerns in policy design.

Code

Let’s illustrate these ideas with an example in R and Python. We’ll use the iris dataset to estimate the effect of Sepal.Length on different quantiles of Petal.Length using UQR.

rm(list=ls())
library(quantreg)

# Load dataset
data(iris)

# Estimate unconditional quantiles
taus <- c(0.25, 0.50, 0.75)
q_vals <- quantile(iris$Petal.Length, probs = taus)  # Estimate quantiles
f_hat <- density(iris$Petal.Length)

# Compute RIF values
rif_values <- lapply(1:3, function(i) {
  q <- q_vals[i]
  f <- f_hat$y[which.min(abs(f_hat$x - q))]
  q + ((taus[i] - (iris$Petal.Length <= q)) / f)
})

# Run RIF regression
models <- lapply(rif_values, function(rif) lm(rif ~ Sepal.Length, data = iris))

# Print results
lapply(models, summary)

import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde
from sklearn.linear_model import LinearRegression

# Load dataset
from sklearn.datasets import load_iris
iris_data = load_iris(as_frame=True)
iris = iris_data['data']
iris.columns = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']

# Estimate unconditional quantiles
taus = [0.25, 0.50, 0.75]
q_vals = np.quantile(iris['Petal.Length'], taus)  # Estimate quantiles
f_hat = gaussian_kde(iris['Petal.Length'])

# Compute RIF values
rif_values = []
for i, tau in enumerate(taus):
    q = q_vals[i]
    f = f_hat(q)[0]  # Density at the quantile (gaussian_kde returns a length-1 array)
    rif = q + ((tau - (iris['Petal.Length'] <= q).astype(int)) / f)
    rif_values.append(rif)

# Run RIF regression
models = []
for rif in rif_values:
    model = LinearRegression(fit_intercept=True)
    model.fit(iris[['Sepal.Length']], rif)
    models.append(model)

# Print results
for i, model in enumerate(models):
    print(f"Model {i + 1}:")
    print(f"Coefficient for Sepal.Length: {model.coef_[0]}")
    print(f"Intercept: {model.intercept_}")

This simple example demonstrates how to estimate the effect of a covariate on unconditional quantiles using the RIF regression approach.

Bottom Line

UQR allows us to estimate the effect of covariates on unconditional quantiles, capturing total effects.
The RIF regression method transforms a quantile regression problem into a simple linear regression.
Frölich and Melly (2013) extend UQR to address endogeneity using instrumental variables.
These tools are invaluable for policy evaluation and causal inference.

Where to Learn More

For a deeper dive into these methods, the foundational paper by Firpo, Fortin, and Lemieux (2009) provides a detailed introduction to UQR, while Frölich and Melly (2013) extend the framework to address endogeneity concerns. For a broader perspective on quantile regression, Koenker’s book Quantile Regression (2005) is a must-read.

References

Alejo, J., Favata, F., Montes-Rojas, G., & Trombetta, M. (2021). Conditional vs unconditional quantile regression models: A guide to practitioners. Economía, 44(88), 76-93.

Bitler, M. P., Gelbach, J. B., & Hoynes, H. W. (2006). What mean impacts miss: Distributional effects of welfare reform experiments. American Economic Review, 96(4), 988-1012.

Borah, B. J., & Basu, A. (2013). Highlighting differences between conditional and unconditional quantile regression approaches through an application to assess medication adherence. Health economics, 22(9), 1052-1070.

Borgen, N. T. (2016). Fixed effects in unconditional quantile regression. The Stata Journal, 16(2), 403-415.

Chernozhukov, V., & Hansen, C. (2005). An IV model of quantile treatment effects. Econometrica, 73(1), 245-261.

Firpo, S., Fortin, N. M., & Lemieux, T. (2009). Unconditional quantile regressions. Econometrica, 77(3), 953-973.

Frölich, M., & Melly, B. (2013). Unconditional quantile treatment effects under endogeneity. Journal of Business & Economic Statistics, 31(3), 346-357.

Koenker, R. (2017). Quantile regression: 40 years on. Annual review of economics, 9(1), 155-176.

Koenker, R., & Hallock, K. F. (2001). Quantile regression. Journal of economic perspectives, 15(4), 143-156.

Koenker, R., & Bassett Jr, G. (1978). Regression quantiles. Econometrica, 46(1), 33-50.

Sasaki, Y., Ura, T., & Zhang, Y. (2022). Unconditional quantile regression with high‐dimensional data. Quantitative Economics, 13(3), 955-978.

The Many Flavors of Matching for Causal Inference

Tue, 27 May 2025 07:00:00 GMT

12 min read

Background

If you’ve worked on causal inference with observational data, you’ve likely faced the fundamental challenge: the treated and control groups often look very different. Matching methods aim to fix that. The idea is simple and intuitive—let’s compare treated units to similar control units and mimic the conditions of a randomized experiment as best as we can.

But here’s the twist: there are multiple ways to define “similar.” Should we look for exact matches? Should we match on covariates directly or on some summary score like the propensity score? Should we optimize the matches globally or locally? Over the years, researchers have developed a wide variety of matching methods, each with its own advantages and pitfalls. The landscape can be overwhelming, especially if you’re new to causal inference.

In this article, I’ll walk through the most popular matching strategies for causal inference. I’ll talk about what each method does, when to use it, and where it might lead you astray. The focus is on the intuition and technical description—not on the code. Whether you’re doing matching for the first time or looking to expand your toolkit, you will find something useful here.

Notation

Let’s set up the basic framework with minimal fluff. Suppose we have $n$ units indexed by $i = 1, \dots, n$. Each unit has:

A binary treatment indicator $D_i$, where $D_i = 1$ for treated units and $D_i = 0$ for controls.
A vector of observed covariates $X_i$.
Potential outcomes $Y_i(1)$ and $Y_i(0)$, where $Y_i(1)$ is the outcome if treated, and $Y_i(0)$ if untreated. We observe only the realized outcome $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$.

We impose the usual assumptions of unconfoundedness (treatment assignment is independent of potential outcomes given covariates) and overlap (every covariate profile has a positive probability of being both treated and untreated).

Our goal is to estimate treatment effects like the Average Treatment Effect (ATE) or the ATE on the Treated (ATT):

$$\text{ATE} = E[Y_i(1) - Y_i(0)], \qquad \text{ATT} = E[Y_i(1) - Y_i(0) \mid D_i = 1].$$

The core idea behind matching is to find comparable untreated units for each treated unit so we can approximate $Y_i(0)$ for the treated group. We then discard the unmatched units and look at the difference in outcomes between treated and matched controls to estimate the treatment effect.

Let’s abuse notation a bit and define the sample-analogue of the ATT as:

$$\widehat{\text{ATT}} = \frac{1}{n_1} \sum_{i : D_i = 1} \left( Y_i - \hat{Y}_i(0) \right),$$

where $n_1$ is the number of matched treated units and $\hat{Y}_i(0)$ is the average outcome among the controls matched to unit $i$.

These methods can be, and often are, combined with regression adjustments to reduce bias and improve efficiency and robustness, but I will leave that aside here.

A Closer Look

We are now ready to go through seven of the most popular matching approaches.

Exact Matching

Exact matching is the simplest—and most restrictive—approach to causal inference:

Match treated and control units exactly on all observed covariates $X_i$.

That is, if a treated unit has $X_i = x$, we look for control units with the exact same $X_j = x$. While this method is conceptually elegant and easy to understand, it’s rarely practical.

Exact matches become increasingly unlikely in high-dimensional settings or when covariates are continuous, where no two units are likely to be identical. In those cases, exact matching often fails to find matches for many treated units, leading to loss of sample size or biased estimates. Despite its limitations, exact matching is an important baseline: it helps clarify the assumptions behind more flexible methods.

Exact matching works when covariates are discrete, there aren’t too many of them, and there is decent overlap between the treated and control groups. Outside that setting, it tends to leave many units unmatched, and those units are discarded.

Mahalanobis Distance Matching

Instead of requiring exact equality between covariates, Mahalanobis matching

Uses a continuous distance metric to find treated and control units that are similar in terms of their covariate values.

The Mahalanobis distance between two units $i$ and $j$, with covariates $X_i$ and $X_j$, is defined as:

$$d_M(X_i, X_j) = \sqrt{(X_i - X_j)' S^{-1} (X_i - X_j)},$$

where $S$ is the sample covariance matrix of the covariates $X$.

This metric accounts for both the scale and the correlation structure of the covariates. Unlike Euclidean distance, which treats each covariate as equally important and independent, Mahalanobis distance adjusts for the fact that some variables may be more variable than others, or may be correlated.

Intuitively, Mahalanobis distance answers the question: how many standard deviations apart are these two vectors, once we’ve accounted for the spread and correlation of the variables? A small Mahalanobis distance indicates that the two units are close in the joint covariate space, even if they differ somewhat along individual dimensions. It still becomes less reliable in high dimensions, where all units tend to be far from one another.

Unlike exact matching, Mahalanobis matching can handle continuous covariates, including mixes of discrete and continuous variables. It works best when the number of covariates is small to moderate.
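Although this article is not about code, a short sketch may help make the distance concrete. It computes pairwise Mahalanobis distances with SciPy and does nearest-neighbor matching with replacement. The simulated covariates and the choice to pool both groups when estimating the covariance matrix are assumptions on my part.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
# Hypothetical covariates: two correlated variables on very different scales
cov = np.array([[1.0, 8.0], [8.0, 100.0]])
x_treated = rng.multivariate_normal([0.5, 5.0], cov, size=50)
x_control = rng.multivariate_normal([0.0, 0.0], cov, size=200)

# Inverse of the pooled sample covariance matrix
s_inv = np.linalg.inv(np.cov(np.vstack([x_treated, x_control]), rowvar=False))

# Pairwise Mahalanobis distances: rows are treated units, columns are controls
dist = cdist(x_treated, x_control, metric="mahalanobis", VI=s_inv)

# Nearest-neighbor matching with replacement
match_idx = dist.argmin(axis=1)
print("Control matched to first treated unit:", match_idx[0])
print("Mean matched distance:", dist.min(axis=1).mean())
```

With plain Euclidean distance, the second covariate would dominate simply because of its larger scale. The covariance adjustment removes that artifact.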

Propensity Score Matching

Propensity Score Matching (PSM) is one of the most influential ideas in observational causal inference. Rosenbaum and Rubin’s foundational result shows that if treatment assignment is unconfounded given covariates $X_i$, then it is also unconfounded given the propensity score:

$$e(X_i) = P(D_i = 1 \mid X_i),$$

the probability of receiving treatment conditional on observed covariates. In other words,

$$(Y_i(1), Y_i(0)) \perp D_i \mid e(X_i).$$

Instead of matching on the full covariate vector $X_i$, we can just match on a single scalar summary: the estimated propensity score.

This is the key idea: propensity scores reduce the curse of dimensionality. By summarizing the information in $X_i$ into one number that captures the likelihood of treatment, we make matching more feasible and scalable, especially when $X_i$ includes many variables.

In practice, the propensity score is rarely known and must be estimated, typically using logistic regression, probit models, or machine learning methods like random forests or gradient boosting. Once estimated, treated and control units are matched based on the closeness of their propensity scores, often using nearest-neighbor matching, caliper matching, or kernel methods. Trimming is often part of this process: units with very high or very low propensity scores are excluded to improve balance and reduce bias.

PSM improves comparability between groups by balancing the covariates in expectation, but it comes with trade-offs. Matching on the propensity score alone does not guarantee covariate balance in any particular dataset, so it’s important to assess and diagnose balance post-matching. Moreover, PSM is sensitive to model misspecification and can perform poorly if the propensity score is estimated inaccurately or if the overlap between groups is weak.

Despite these caveats, PSM remains a popular and conceptually powerful tool, especially when combined with diagnostics and robustness checks. It can be particularly helpful when the number of covariates is large or mostly continuous.

Coarsened Exact Matching

Coarsened Exact Matching (CEM) offers a practical compromise between the rigidity of exact matching and the flexibility needed for real-world data. The core idea is to

Coarsen continuous covariates into broader, meaningful categories and then perform exact matching on these coarsened values.

Formally, each covariate is discretized into bins, and treated and control units are matched only if they fall into the same bin across all coarsened covariates. This process reduces the granularity of the match criteria, increasing the likelihood of finding matches, while still ensuring comparability within the matched groups. Examples are turning age into 5-year intervals or income into quantile-based brackets.

By construction, CEM guarantees balance on the coarsened covariates—unlike propensity score matching, where balance must be checked and cannot be guaranteed a priori. CEM also allows researchers to control the level of approximation: the finer the bins, the closer it is to exact matching; the coarser the bins, the more matches you retain but the more heterogeneity you permit within matched pairs. Researchers can apply finer coarsening to critical variables and coarser groupings to less central ones.

However, CEM’s effectiveness depends heavily on the choice of binning. Poorly chosen coarsening can either lead to very few matches (if too fine) or poor covariate balance (if too coarse). There is a trade-off between retaining sample size and improving covariate similarity, and CEM makes this trade-off explicit and user-controllable.
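The coarsen-then-match logic can be sketched with pandas. Everything here is illustrative: the simulated data, the variable names, the bin choices (5-year age bands and income quartiles), and the simple stratum-weighted ATT estimate are assumptions of mine, not the procedure of any particular CEM package.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
# Hypothetical data: age affects both treatment and outcome
df = pd.DataFrame({
    "age": rng.integers(20, 65, size=n),
    "income": rng.lognormal(mean=10.5, sigma=0.5, size=n),
})
df["d"] = rng.binomial(1, np.where(df["age"] > 40, 0.6, 0.3))
df["y"] = 1.0 * df["d"] + 0.05 * df["age"] + rng.normal(size=n)

# Step 1: coarsen (5-year age bands, income quartiles), stored as integer codes
df["age_bin"] = pd.cut(df["age"], bins=np.arange(20, 70, 5), right=False, labels=False)
df["income_bin"] = pd.qcut(df["income"], q=4, labels=False)

# Step 2: keep only strata that contain both treated and control units
keys = ["age_bin", "income_bin"]
matched = df[df.groupby(keys)["d"].transform("nunique") == 2]

# Step 3: ATT as the treated-count-weighted average of within-stratum differences
cell_means = matched.groupby(keys + ["d"])["y"].mean().unstack("d")
stratum_diff = cell_means[1] - cell_means[0]
n_treated = matched.loc[matched["d"] == 1].groupby(keys).size()
att = np.average(stratum_diff, weights=n_treated.reindex(stratum_diff.index))
print(f"Matched units: {len(matched)} of {n}; CEM estimate of the ATT: {att:.3f}")
```

Widening or narrowing the bins in step 1 is exactly the trade-off described above: coarser bins keep more units but allow more within-stratum differences in age and income.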

Optimal Matching

Optimal matching takes a global approach to the matching problem. Rather than matching each treated unit to its nearest control in isolation (as in nearest neighbor matching), it

Finds the set of matched pairs that minimizes the total distance across all matched units.

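Formally, it solves:

$$\min_{M} \sum_{(i, j) \in M} d(i, j),$$

where $M$ is a set of treated-control pairs (for example, one distinct control per treated unit) and $d(i, j)$ is a distance measure between treated unit $i$ and control unit $j$.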

The key benefit is that it avoids poor global matches that can arise when matching is done greedily or locally, one unit at a time. Optimal matching is especially useful when treatment and control groups differ significantly in size or distribution, and when you want to minimize overall imbalance rather than optimize matches for individual units.

However, because it solves a global optimization problem, it can be computationally intensive for large datasets. Also, while it minimizes overall distance, it doesn’t necessarily guarantee good covariate balance unless combined with preprocessing (e.g., matching on propensity scores or coarsened covariates).

Still, optimal matching is a powerful and principled method, particularly when used with careful distance choices and diagnostics.
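The difference between greedy and optimal matching is easy to see on a tiny example. The sketch below uses made-up one-dimensional scores and SciPy's `linear_sum_assignment` to solve the 1:1 assignment problem. That is one way to compute an optimal pair matching, and not necessarily what dedicated matching packages do internally.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical one-dimensional covariate (e.g., a propensity score).
treated = np.array([0.50, 0.60])
controls = np.array([0.58, 0.30, 0.95])

# Distance matrix d(i, j) = |treated_i - control_j|.
dist = np.abs(treated[:, None] - controls[None, :])

# Greedy nearest neighbor matching without replacement,
# processing treated units in their given order.
available = list(range(len(controls)))
greedy_total = 0.0
for i in range(len(treated)):
    j = min(available, key=lambda c: dist[i, c])
    greedy_total += dist[i, j]
    available.remove(j)

# Optimal 1:1 matching: minimize the total distance over all pairs.
rows, cols = linear_sum_assignment(dist)
optimal_total = dist[rows, cols].sum()

print(f"greedy total distance:  {greedy_total:.2f}")
print(f"optimal total distance: {optimal_total:.2f}")
```

Here the first treated unit greedily takes the control that the second treated unit needs much more, so the greedy total (0.38) ends up larger than the optimal total (0.22).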

Genetic Matching

Genetic matching is an advanced matching method that uses a genetic algorithm to find an optimal weighting of covariates in the distance metric. The idea is to

Automate the process of choosing how much weight each covariate should receive when determining similarity between treated and control units.

Rather than manually selecting a distance metric like Mahalanobis or Euclidean, genetic matching searches over a space of weighted Mahalanobis distances, adjusting the weights to minimize covariate imbalance after matching. The optimization goal is to improve covariate balance. The result is a customized distance metric that gives higher weight to variables that are harder to balance and less to those that are already balanced.

Genetic matching can be used with or without propensity score preprocessing, and can accommodate interactions or higher-order terms. It’s especially powerful in settings with many covariates or complex imbalance patterns that simple metrics fail to capture.

However, the method is computationally intensive, often requiring many iterations of matching and balance assessment. Its performance also depends on the choice of balance metrics and tuning parameters in the genetic algorithm.

Caliper Matching

Caliper matching introduces a distance threshold to restrict which treated and control units can be matched. Specifically,

A treated unit is only matched to a control unit if the distance between them is within a pre-specified caliper.

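That is, a pair is allowed only if the distance between the two units falls below a set limit. For example, when matching on propensity scores, a common rule is to match only if the absolute difference in propensity scores is less than 0.1:

$$|\hat{e}(X_i) - \hat{e}(X_j)| < 0.1,$$

where $\hat{e}(X_i)$ and $\hat{e}(X_j)$ denote the estimated propensity scores of treated unit $i$ and control unit $j$.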

This constraint helps avoid poor matches, especially when treated and control groups have limited overlap. Without calipers, nearest neighbor matching might pair units with very different covariate profiles, particularly in the tails of the propensity score distribution. These poor matches can increase bias and undermine the credibility of causal estimates.

Caliper matching is not a matching method on its own but rather a modification to existing strategies—most often to nearest neighbor matching. It can also be combined with optimal matching or Mahalanobis distance.

Choosing the right caliper width is important: too wide, and the constraint has little effect; too narrow, and many treated units may be left unmatched, reducing sample size and precision.

Caliper matching is particularly useful when the common support assumption is questionable—i.e., when treated and control groups do not overlap well in covariate space. In such cases, calipers serve as a safeguard to maintain the quality of matches by explicitly enforcing local comparability.
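As a rough illustration of how a caliper modifies nearest neighbor matching, here is a small Python sketch. The function name `caliper_nn_match` and the propensity scores are hypothetical; the logic is a plain greedy 1:1 match without replacement that rejects any pair whose score difference is not below the caliper.

```python
import numpy as np

def caliper_nn_match(ps_treated, ps_control, caliper=0.1):
    """Greedy 1:1 nearest neighbor matching on propensity scores with a caliper.

    Returns a list of (treated_index, control_index) pairs. Treated units with
    no unused control within the caliper are left unmatched.
    """
    available = set(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        if not available:
            break
        # Nearest unused control for this treated unit
        j = min(available, key=lambda c: abs(ps_control[c] - p))
        # Accept the pair only if it satisfies the caliper
        if abs(ps_control[j] - p) < caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs

ps_treated = np.array([0.30, 0.62, 0.95])
ps_control = np.array([0.28, 0.55, 0.60, 0.40])
pairs = caliper_nn_match(ps_treated, ps_control, caliper=0.1)
print(pairs)  # [(0, 0), (1, 2)]
```

The third treated unit (score 0.95) sits in the tail where no control is close, so it is dropped rather than paired with a poor match.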

Bottom Line

Matching methods are powerful tools for causal inference.
They come in many flavors, each with its own strengths and weaknesses.
No single method is best for all situations; the choice depends on the data, the research question, and the assumptions you are willing to make.

Where to Learn More

The book Causal Inference for Statistics, Social, and Biomedical Sciences by Imbens and Rubin (2015) provides excellent coverage of matching and its theoretical underpinnings. I also recommend Stuart (2010)’s seminal review paper cited below. The documentation for the MatchIt and Matching R packages is also a goldmine of practical implementation details.

References

Abadie, A., & Imbens, G. W. (2016). Matching on the estimated propensity score. Econometrica, 84(2), 781–807.

Ben-Michael, E., Feller, A., Hirshberg, D. A., & Zubizarreta, J. R. (2021). The balancing act in causal inference. arXiv preprint arXiv:2110.14831.

Diamond, A., & Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics, 95(3), 932–945.

Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political Analysis, 20(1), 1–24.

Imbens, G. W. (2015). Matching methods in practice: Three examples. Journal of Human Resources, 50(2), 373–419.

Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.

Rosenbaum, P. R. (2002). Observational Studies. Springer.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1–21.

The Many Flavors of Variable Selection

Mon, 26 May 2025 07:00:00 GMT

11 min read

Background

If you’ve ever worked with high-dimensional data, you’ve likely faced a familiar challenge: too many variables. Some features are pure noise, others are redundant or collinear, and only a handful truly matter. The question is: how do you tell the difference? This challenge lies at the heart of what we call variable selection.

Over time, statisticians and machine learning researchers have created a diverse toolbox of techniques to tackle this problem—each rooted in different ideas, with its own strengths and trade-offs. Some methods apply penalties to shrink coefficients, like Lasso and Ridge. Others use geometric insights, like Principal Components Analysis (PCA). There are methods built on randomization, like Model-X Knockoffs, and some that rely on greedy or stepwise searches, such as Forward Selection and Least Angle Regression (LAR).

In this post, I’ll take a guided tour through these approaches—what they do, when to use them, and why they work. I’ll also explore their limitations, because no method is a silver bullet. The goal isn’t to pick a winner, but to help you figure out which tool fits your problem. Think of it as a field guide to variable selection, focused on ideas and intuition—so you can navigate the landscape with more confidence and clarity. And, yes, there will be plenty of R and Python code snippets to illustrate each method in action.

Notation

Suppose we observe data $(X, y)$, where $y \in \mathbb{R}^n$ is the outcome vector and $X \in \mathbb{R}^{n \times p}$ is the matrix of predictors (covariates, features, regressors: pick your favorite term).

We’re interested in estimating a relationship like:

$$y = X\beta + \varepsilon,$$

where $\beta \in \mathbb{R}^p$ is the vector of coefficients and $\varepsilon$ is the error term.

In high-dimensional settings, $p$ may be large, possibly even larger than $n$. The core task of variable selection is to identify which components of $\beta$ are nonzero (or, more generally, which features matter for predicting $y$).

(Distinguishing prediction and inference is crucial here: we focus on the former, so we ignore things like confidence intervals or $p$-values for coefficients altogether. The latter is a much more complex problem.)

A Closer Look

We begin by loading the data.

Python

from sklearn.datasets import load_iris
import pandas as pd

# Load iris data
iris = load_iris(as_frame=True)
df = iris.frame
X = df[['sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = df['sepal length (cm)']

Now let’s examine each of the methods in turn.

Stepwise Selection (Forward, Backward, Both)

The classic workhorse of variable selection, stepwise procedures iteratively add or remove variables based on some criterion like AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), or $p$-values. In forward selection, you start with no variables and add the one that improves the model the most. In backward elimination, which requires $p < n$ so that the full model can be fit, you start with all variables and remove the least significant one at each step. The two can also be combined in a bidirectional stepwise approach. In every case, you stop when adding or removing variables no longer sufficiently improves the model according to your chosen criterion.

Stepwise selection can work well for smaller problems where computational cost is low and interpretability is key (although we have recently made some progress on the computation side). However, it is unstable and prone to overfitting.

We are now ready to start with the modeling part.

library(MASS)
full_model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
step_model <- stepAIC(full_model, direction = "both")
summary(step_model)

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

# Stepwise selection (both directions)
sfs = SFS(LinearRegression(),
          k_features='best',  # Select best number of features
          forward=True,
          floating=True,      # Enables bidirectional selection
          scoring='neg_mean_squared_error',
          cv=0)               # No cross-validation, like stepAIC

sfs = sfs.fit(X, y)

# Selected features
print('Selected features:', list(sfs.k_feature_names_))

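# Fit final model on the selected features only
selected_X = X[list(sfs.k_feature_names_)]
model = LinearRegression().fit(selected_X, y)
# Label coefficients with the selected columns (not all of X) so lengths match
print(pd.Series(model.coef_, index=selected_X.columns))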

Lasso (aka $\ell_1$ Regularization)

Lasso introduced the big idea of sparsity: only some variables enter the model. It penalizes the sum of the absolute values of the coefficients:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\},$$

where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ and $\lambda \geq 0$ controls the strength of the penalty.

The magic of the $\ell_1$ penalty is that it can shrink some coefficients exactly to zero, performing variable selection as part of the estimation. Over the years, Lasso has become a staple in the variable selection toolkit. Its theoretical properties have been studied extensively, and it has been shown to work well in many practical scenarios.

Part of its appeal is computational efficiency: modern algorithms can compute the entire regularization path efficiently. Lasso comes in a wide variety of flavors, including group lasso, adaptive lasso, and fused lasso, which I will probably cover in a future blog post. Be careful, though: the same shrinkage that produces sparsity also biases the nonzero coefficients toward zero, so lasso is great for prediction, but don’t take its coefficients at face value.

Lasso is a good idea when you believe that only a subset of predictors is relevant and you want an interpretable model. It can struggle with groups of correlated predictors, where it tends to pick one arbitrarily.

library(glmnet)
data(iris)
X <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])
Y <- iris$Sepal.Length
fit <- cv.glmnet(X, Y, alpha = 1)
coef(fit, s = "lambda.min")

from sklearn.linear_model import LassoCV
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

iris = load_iris(as_frame=True).frame
X = iris[['sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = iris['sepal length (cm)']
lasso = LassoCV(cv=5).fit(X, y)
print(pd.Series(lasso.coef_, index=X.columns))

Ridge Regression (aka $\ell_2$ Regularization)

Ridge regression doesn’t exactly select variables; it shrinks them. The idea is to add a penalty on the size of the coefficients:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\}.$$

Here, $\lambda \geq 0$ is a tuning parameter that controls the strength of the penalty. As $\lambda$ increases, the solution is increasingly biased toward zero, but the variance decreases, which can improve out-of-sample performance.

Unlike the lasso, Ridge regression does not produce sparse solutions: none of the coefficients are exactly zero. Instead, it distributes shrinkage smoothly across all variables, which can be helpful when all predictors contribute weakly and roughly equally.

Ridge is also computationally convenient. The modified normal equations involve the matrix $X^\top X + \lambda I$, which is always invertible when $\lambda > 0$, even if $X^\top X$ is singular. As a result, Ridge provides a unique and stable solution even in high-dimensional settings where $p > n$, a situation where ordinary least squares (OLS) fails due to non-identifiability.

Ridge is especially good when multicollinearity is a problem; when you prefer stability over sparsity; or when many small effects contribute to the outcome.
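The invertibility point can be checked numerically. The sketch below uses simulated data with more predictors than observations (all names and sizes are made up for illustration) and solves the ridge system $(X^\top X + \lambda I)\beta = X^\top y$ directly with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # more predictors than observations
X_sim = rng.normal(size=(n, p))
y_sim = rng.normal(size=n)
lam = 1.0

# X'X is p x p but has rank at most n < p, so it is singular.
gram = X_sim.T @ X_sim
print("rank of X'X:", np.linalg.matrix_rank(gram), "out of", p)

# Adding lam * I makes the system solvable and the solution unique.
beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X_sim.T @ y_sim)
print("number of ridge coefficients:", beta_ridge.shape[0])
```

Trying `np.linalg.solve(gram, X_sim.T @ y_sim)` without the penalty term would either fail or return numerically meaningless values, which is the OLS non-identifiability problem in action.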

library(glmnet)
data(iris)
X <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])
Y <- iris$Sepal.Length
fit <- cv.glmnet(X, Y, alpha = 0)
coef(fit, s = "lambda.min")

from sklearn.linear_model import RidgeCV
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

iris = load_iris(as_frame=True).frame
X = iris[['sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = iris['sepal length (cm)']
ridge = RidgeCV(cv=5).fit(X, y)
print(pd.Series(ridge.coef_, index=X.columns))

Elastic Net

Elastic Net combines the strengths of both Ridge and Lasso by blending their penalties into a single regularization framework:

$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \right\}.$$

This formulation retains the sparsity-inducing property of the Lasso via the $\ell_1$ penalty while incorporating the stabilizing effect of Ridge regression through the $\ell_2$ penalty. The result is a model that not only performs variable selection but also handles groups of correlated predictors more gracefully than Lasso alone, which tends to pick one variable from a group and ignore the rest.

Elastic Net is especially helpful in high-dimensional settings where predictors are strongly correlated or when $p > n$. The two tuning parameters, $\lambda_1$ and $\lambda_2$, control the trade-off between sparsity and smooth shrinkage. In practice, these are often reparameterized using a single penalty term $\lambda$ and a mixing proportion $\alpha \in [0, 1]$ (as in many software packages, up to constant scaling factors), where:

$$\lambda_1 = \alpha \lambda, \qquad \lambda_2 = (1 - \alpha) \lambda.$$

This makes it easy to interpolate between Ridge ($\alpha = 0$) and Lasso ($\alpha = 1$), giving you a continuum of models with different regularization characteristics.

library(glmnet)
data(iris)
X <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])
Y <- iris$Sepal.Length
fit <- cv.glmnet(X, Y, alpha = 0.5)
coef(fit, s = "lambda.min")

from sklearn.linear_model import ElasticNetCV
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

iris = load_iris(as_frame=True).frame
X = iris[['sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = iris['sepal length (cm)']
enet = ElasticNetCV(cv=5).fit(X, y)
print(pd.Series(enet.coef_, index=X.columns))

Principal Components Regression (PCR)

Principal Components Analysis (PCA) finds linear combinations of the original variables that explain the most variance of the entire dataset.

In Principal Components Regression, we regress $y$ on the top $k$ principal components of $X$ instead of on the original variables.

PCA is among the most popular methods for dimensionality reduction even among junior data scientists, so I won’t spend too much time on it here. PCA has a dual nature, with one foot in unsupervised learning (finding components) and the other in supervised learning (variable selection). Note that the term variable selection is used loosely here, since PCR selects combinations of variables, not individual variables. Its main strength is its versatility and ability to handle high-dimensional data, but its output can be challenging to interpret.

library(pls)
pcr_model <- pcr(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris, scale = TRUE, validation = "CV")
summary(pcr_model)

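from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Standardize first so the components are comparable to the R call (scale = TRUE)
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Regress the outcome on the top two principal components
reg = LinearRegression().fit(X_pca, y)
print(pd.Series(reg.coef_, index=pca.get_feature_names_out()))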

Least Angle Regression (LAR)

Least Angle Regression (LAR) is a greedy, stepwise variable selection algorithm that adds predictors to a linear model incrementally. At each step, it moves in the direction of the predictor most correlated with the current residual, just like forward selection, but with a twist: it adjusts the direction gradually as more variables become equally correlated with the residuals.

Algorithm:

Start with all coefficients set to zero.
Find the predictor most correlated with the current residual.
Move the coefficient of that variable in the direction of its sign until another predictor becomes equally correlated with the residual.
Continue in a “least angle” direction, adjusting the path to include both predictors, and so on.

The result is a sequence of models, each with one more active variable—just like in forward stepwise regression, but using geometry rather than brute force.

Geometrically, LAR moves along piecewise linear paths toward the least squares solution, and its trajectory closely tracks that of Lasso. In fact, with a small modification, LAR can be used to compute the entire Lasso solution path.
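To see the LAR and Lasso paths side by side, one option is scikit-learn's `lars_path`, which exposes both variants through its `method` argument. The sketch below standardizes the three iris predictors and centers the outcome first; that preprocessing is my own choice for the illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import lars_path
from sklearn.preprocessing import StandardScaler

iris_df = load_iris(as_frame=True).frame
X_raw = iris_df[["sepal width (cm)", "petal length (cm)", "petal width (cm)"]]
X_std = StandardScaler().fit_transform(X_raw)
y_raw = iris_df["sepal length (cm)"].to_numpy()
y_ctr = y_raw - y_raw.mean()

# method="lar" traces the plain LAR path; method="lasso" applies the
# modification that drops a variable when its coefficient hits zero.
_, order_lar, coefs_lar = lars_path(X_std, y_ctr, method="lar")
_, order_lasso, coefs_lasso = lars_path(X_std, y_ctr, method="lasso")

print("LAR entry order:", [X_raw.columns[j] for j in order_lar])
print("LAR path shape (features x steps):  ", coefs_lar.shape)
print("Lasso path shape (features x steps):", coefs_lasso.shape)
```

Each column of `coefs_lar` is one breakpoint of the piecewise linear path: the first column is all zeros and the last one is the least squares fit on the standardized predictors.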

library(lars)
lar_model <- lars(X, Y, type = "lar")
print(lar_model)

from sklearn.linear_model import Lars
lar = Lars().fit(X, y)
print(pd.Series(lar.coef_, index=X.columns))

SCAD (Smoothly Clipped Absolute Deviation)

SCAD (Smoothly Clipped Absolute Deviation) is a non-convex penalty introduced by Fan and Li (2001) to address a key limitation of the Lasso: its tendency to over-shrink large coefficients, leading to biased estimates for important variables.

The SCAD penalty is designed to encourage sparsity like the Lasso for small coefficients, but to relax the penalty for larger ones. In other words, it behaves like Lasso near zero—pushing small coefficients toward zero—but reduces shrinkage as coefficients grow, effectively preserving the size of large signals.

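Mathematically, the derivative of the SCAD penalty is defined, for $\beta > 0$, as:

$$p'_{\lambda}(\beta) = \lambda \left\{ I(\beta \leq \lambda) + \frac{(a\lambda - \beta)_+}{(a - 1)\lambda} \, I(\beta > \lambda) \right\},$$

where $a > 2$ (typically $a = 3.7$) and $(\cdot)_+$ denotes the positive part. This piecewise definition ensures a smooth transition:

For small coefficients ($|\beta| \leq \lambda$), it behaves like the Lasso.
For moderate coefficients ($\lambda < |\beta| \leq a\lambda$), the marginal penalty decreases gradually.
For large coefficients ($|\beta| > a\lambda$), the penalty becomes flat, effectively applying no further shrinkage.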

This adaptive behavior helps SCAD achieve a balance between sparsity and unbiasedness, and under certain conditions the estimator enjoys oracle-like properties. The price is non-convexity. The penalty is continuous and piecewise smooth, so local coordinate descent algorithms still apply, but the objective can have multiple local minima, which makes optimization more delicate and computationally intensive than with Lasso or Ridge.
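The three regimes are easiest to see by evaluating the derivative of the SCAD penalty from Fan and Li (2001) on a few points. The helper name `scad_derivative` below is mine, and the grid values are arbitrary.

```python
import numpy as np

def scad_derivative(beta, lam, a=3.7):
    """Derivative of the SCAD penalty evaluated at |beta| (Fan and Li, 2001)."""
    b = np.abs(np.asarray(beta, dtype=float))
    # Lasso-like region: constant marginal penalty lam
    lasso_part = (b <= lam).astype(float)
    # Tapering region: marginal penalty falls linearly to zero at a * lam
    taper_part = np.maximum(a * lam - b, 0.0) / ((a - 1.0) * lam) * (b > lam)
    return lam * (lasso_part + taper_part)

lam = 1.0
grid = np.array([0.5, 1.0, 2.0, 3.7, 5.0])
for b, d in zip(grid, scad_derivative(grid, lam)):
    print(f"|beta| = {b:.1f} -> marginal penalty = {d:.3f}")
```

With $\lambda = 1$, the marginal penalty is 1 up to $|\beta| = 1$ (the Lasso rate), about 0.63 at $|\beta| = 2$, and exactly 0 from $|\beta| = 3.7$ onward, which is why large coefficients are left essentially unshrunk.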

library(ncvreg)
data(iris)
X <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])
Y <- iris$Sepal.Length

# Fit SCAD-penalized regression
scad_fit <- ncvreg(X, Y, penalty = "SCAD")

# Plot cross-validated error
cv <- cv.ncvreg(X, Y, penalty = "SCAD")
plot(cv)

# Coefficients at the lambda that minimizes cross-validated error
# (this is what coef() returns by default for a cv.ncvreg object)
coef(cv)

from skglm import GeneralizedLinearEstimator
from skglm.datafits import Quadratic
from skglm.penalties import SCAD

scad = GeneralizedLinearEstimator(Quadratic(), SCAD(alpha=0.1, gamma=3.7))
scad.fit(X, y)
print(pd.Series(scad.coef_, index=X.columns))

Knockoffs

Knockoffs, introduced by Barber and Candès (2015), is a clever framework for variable selection with false discovery rate (FDR) control. The method constructs “knockoff copies” of each feature—artificial variables that mimic the correlation structure of the real ones but are known to be null. Then it tests whether the real variables outperform their knockoffs.

I have written about knockoffs in more detail in previous posts, so I won’t go into the details here. Just like PCA, knockoffs have a dual nature, with one foot in the multiple testing literature (constructing knockoffs) and the other in the supervised learning world (variable selection).


# Clear workspace
rm(list = ls())
library(knockoff)
library(glmnet)
library(dplyr)

# Load data
data(iris)

# Step 1: Prepare the data (binary classification)
iris_binary <- iris %>% filter(Species != "setosa")
X <- as.matrix(iris_binary[, 1:4])  # numeric predictors
y <- as.numeric(iris_binary$Species == "virginica")  # binary target: virginica vs versicolor

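# Step 2: Create knockoff copies
# create.fixed() builds fixed-X knockoffs and returns a list with the
# normalized design matrix (X) and its knockoffs (Xk).
# (For model-X knockoffs, create.second_order(X) returns the knockoff matrix directly.)
knockoffs <- create.fixed(X)

X_orig <- knockoffs$X    # normalized copy of X that the knockoffs were built for
X_knock <- knockoffs$Xk

# Step 3: Combine X and knockoffs and fit a Lasso model
X_combined <- cbind(X_orig, X_knock)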
fit <- cv.glmnet(X_combined, y, family = "binomial", alpha = 1)

# Step 4: Compute importance statistics (lasso coefficients at lambda.min)
coefs <- coef(fit, s = "lambda.min")[-1]  # remove intercept
p <- ncol(X)

W <- abs(coefs[1:p]) - abs(coefs[(p+1):(2*p)])  # feature importance W-statistic

# Step 5: Apply knockoff threshold to select features
threshold <- knockoff.threshold(W, fdr = 0.1)  # control FDR at 10%
selected <- which(W >= threshold)

# Step 6: Print results
feature_names <- colnames(X)
cat("Selected features controlling FDR at 10%:\n")
print(feature_names[selected])
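The thresholding step is the heart of the procedure, so here is a small Python sketch of it. The function `knockoff_threshold` is my own transcription of the data-dependent threshold in Barber and Candès (2015), with `offset=1` giving the more conservative "knockoff+" version. The statistics in `W_demo` are made up.

```python
import numpy as np

def knockoff_threshold(W, fdr=0.1, offset=1):
    """Smallest t among the nonzero |W_j| whose estimated FDP is at most fdr."""
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Large negative W_j stand in for false discoveries among large positive W_j
        ratio = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if ratio <= fdr:
            return t
    return np.inf  # no threshold achieves the target: select nothing

# Hypothetical W-statistics: eleven clearly positive, one slightly negative.
W_demo = np.array([3.0, 2.5, 2.0, 1.8, 1.5, 1.2, 1.0, 0.9, 0.8, 0.7, 0.6, -0.1])
t_demo = knockoff_threshold(W_demo, fdr=0.1)
selected_demo = np.where(W_demo >= t_demo)[0]
print("threshold:", t_demo)
print("number selected:", len(selected_demo))
```

One consequence worth noting: with `offset=1` the ratio can never fall below one over the number of features, so with only four predictors (as in the iris example) a 10% target cannot be met and nothing gets selected. The method is really designed for problems with many candidate variables.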

FOCI (Feature Ordering by Conditional Independence)

FOCI is a recent, information-theoretic method that orders features by how much conditional mutual information they contribute to the outcome. It’s model-free and does not assume a particular parametric form. I have also written about FOCI in a previous post, so I won’t repeat the details here.

Bottom Line

Lasso, Ridge, and Elastic Net are the go-to penalized regression methods, with Lasso giving sparsity, Ridge providing stability, and Elastic Net blending the two.
Non-convex penalties like SCAD address Lasso’s bias issue but at a computational cost.
PCA-based methods reduce dimensionality but don’t directly select variables.
Knockoffs offer strong statistical guarantees like FDR control but require careful implementation.
Modern approaches like FOCI expand the toolkit to nonlinear and information-theoretic settings.

Where to Learn More

For a great introduction to penalized regression methods, The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman is a classic. As always, you can reach for Computer Age Statistical Inference or All of Statistics and they won’t let you down.

References

Barber, R. F., & Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Annals of Statistics, 43(5), 2055–2085.

Efron, B., & Hastie, T. (2021). Computer age statistical inference, student edition: algorithms, evidence, and data science (Vol. 6). Cambridge University Press.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1), 267-288.

Wasserman, L. (2004). All of statistics: a concise course in statistical inference. Springer Science & Business Media.

The Many Flavors of Bootstrap

Mon, 26 May 2025 07:00:00 GMT

8 min read

Background

At its heart, the bootstrap poses a simple yet powerful question: “What if we could resample from our existing data, treating it as a stand-in for the population?” By doing so, we can estimate variability, build confidence intervals, and carry out hypothesis tests—without leaning heavily on strong parametric assumptions. It’s especially useful in situations where analytic solutions exist in theory but are too complex to derive or even implement in practice.

But here’s the thing: there isn’t just one bootstrap. Over the years, statisticians have developed many flavors of the bootstrap to address different challenges in different settings. Some handle small samples better. Some are designed for dependent data like time series. Others shine when the assumptions of classic bootstrapping crumble (think clustered data or heteroskedasticity).

In this article, I’ll take a tour through the zoo of bootstrap methods: from the classic nonparametric bootstrap to the jackknife, parametric bootstrap, Bayesian bootstrap, wild bootstrap, moving block bootstrap, and more. I’ll explore where each method shines, where it stumbles, and how to pick the right one for your problem. As usual, I won’t just throw formulas at you. The focus here is on understanding why these methods work, not just how to mechanically apply them. There is also plenty of R and Python code to illustrate each method in action.

Notation

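We have data $X_1, \dots, X_n$, where the $X_i$ are independent and identically distributed (i.i.d.) random variables drawn from some unknown distribution $F$. We’re interested in estimating some parameter $\theta = T(F)$ like the mean, median, regression coefficients, or a more complicated functional.

Our estimator of $\theta$ from the observed sample is $\hat{\theta} = T(\hat{F}_n)$, where $\hat{F}_n$ is the empirical distribution function that puts mass $1/n$ on each observed data point.

The big question is: How variable is $\hat{\theta}$? And that’s where the bootstrap comes in. Regardless of the type of bootstrap, given a bunch of bootstrap estimates $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$ of $\theta$, the variance of $\hat{\theta}$ is estimated as:

$$\widehat{\text{Var}}(\hat{\theta}) = \frac{1}{B - 1} \sum_{b=1}^{B} \left( \hat{\theta}^*_b - \bar{\theta}^* \right)^2,$$

where $\bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b$ is the average of the bootstrap estimates.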

A Closer Look

The Jackknife

Let’s start with the jackknife, developed back in the 1950s by Quenouille and popularized by Tukey. The jackknife isn’t technically a bootstrap, but it’s often the gateway to resampling methods. Here is how it works:

Algorithm:

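For $i = 1, \dots, n$:

Drop observation $i$ from the sample.
Compute the jackknife estimate, $\hat{\theta}_{(i)} = T(\hat{F}_{n,(i)})$, on the remaining $n - 1$ observations.

Here $\hat{F}_{n,(i)}$ is the empirical distribution leaving out the $i$-th observation. We then use the variability across these “leave-one-out” estimates to approximate the variance of $\hat{\theta}$. Because any two leave-one-out samples share all but two observations, these estimates vary far less than bootstrap replicates do, so the sum of squared deviations is multiplied by $(n - 1)/n$ rather than averaged:

$$\widehat{\text{Var}}_{\text{jack}}(\hat{\theta}) = \frac{n - 1}{n} \sum_{i=1}^{n} \left( \hat{\theta}_{(i)} - \bar{\theta}_{(\cdot)} \right)^2,$$

where $\bar{\theta}_{(\cdot)}$ is the average of the leave-one-out estimates.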

The jackknife works well for smooth statistics like the mean or regression coefficients. But it can fail miserably for non-smooth functionals like the median or quantiles.

Strengths: Fast, easy to implement, no randomness involved.

Weaknesses: Limited to statistics that are smooth in the data. Doesn’t handle complex dependency structures or non-smooth parameters well.

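set.seed(1988)
y <- rnorm(100)
n <- length(y)
jackknife_estimates <- sapply(1:n, function(i) mean(y[-i]))
# Jackknife variance: (n - 1) / n times the SUM of squared deviations
# (using var() here would divide by n - 1 once more and understate the variance)
jackknife_variance <- (n - 1) / n * sum((jackknife_estimates - mean(jackknife_estimates))^2)
print(jackknife_variance)

import numpy as np
np.random.seed(1988)
y = np.random.normal(size=100)
n = len(y)
jackknife_estimates = np.array([np.mean(np.delete(y, i)) for i in range(n)])
# Jackknife variance: (n - 1) / n times the SUM of squared deviations
jackknife_variance = (n - 1) / n * np.sum((jackknife_estimates - jackknife_estimates.mean()) ** 2)
print(jackknife_variance)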
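The claim that the jackknife struggles with the median is easy to probe with a quick experiment. The sketch below is an illustration on simulated standard normal data. It compares the jackknife and nonparametric bootstrap variance estimates for the sample median with the large-sample value $\pi / (2n)$, which is specific to the normal case assumed here.

```python
import numpy as np

rng = np.random.default_rng(1988)
sample = rng.normal(size=101)
n = len(sample)

# Jackknife: leave-one-out medians can take only a handful of distinct values,
# because removing one point barely moves the middle order statistics.
loo_medians = np.array([np.median(np.delete(sample, i)) for i in range(n)])
jack_var = (n - 1) / n * np.sum((loo_medians - loo_medians.mean()) ** 2)

# Nonparametric bootstrap for comparison.
B = 2000
boot_medians = np.array(
    [np.median(rng.choice(sample, size=n, replace=True)) for _ in range(B)]
)
boot_var = boot_medians.var(ddof=1)

# Large-sample variance of the median for N(0, 1) data.
asymptotic_var = np.pi / (2 * n)

print("distinct leave-one-out medians:", len(np.unique(loo_medians)))
print(f"jackknife variance:  {jack_var:.4f}")
print(f"bootstrap variance:  {boot_var:.4f}")
print(f"asymptotic variance: {asymptotic_var:.4f}")
```

The jackknife estimate is driven entirely by the gaps between two or three central order statistics, so rerunning with different seeds tends to make it jump around much more than the bootstrap estimate does.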

Classic Nonparametric Bootstrap

The classic bootstrap, introduced by Bradley Efron in 1979, takes the idea of resampling and turns it up a notch. Instead of dropping one observation at a time, we repeatedly resample with replacement from our data to create many “new” datasets, each the same size as the original.

Algorithm:

For each bootstrap sample $b = 1, \dots, B$:

Sample $n$ observations with replacement from your data.
Compute the statistic $\hat{\theta}^*_b$.

Strengths: Flexible, broadly applicable, works well for non-smooth statistics.

Weaknesses: Can struggle with small samples or dependent data (like time series). Resampling with replacement assumes independence.

set.seed(1988)
y <- rnorm(100)
B <- 1000
boot_means <- replicate(B, mean(sample(y, replace = TRUE)))
boot_variance <- var(boot_means)
print(boot_variance)

np.random.seed(1988)
B = 1000
boot_means = [np.mean(np.random.choice(y, size=len(y), replace=True)) for _ in range(B)]
boot_variance = np.var(boot_means, ddof=1)
print(boot_variance)

Parametric Bootstrap

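The parametric bootstrap is a natural extension of the classic idea with a minor twist. Instead of sampling from the empirical distribution $\hat{F}_n$, you assume a parametric model for the data, fit it to the sample, and then generate new data from the fitted model.

Algorithm:

For each bootstrap sample $b = 1, \dots, B$:

Sample $n$ observations from the fitted model $F_{\hat{\eta}}$, where $\hat{\eta}$ denotes the estimated model parameters.
Compute the statistic $\hat{\theta}^*_b$.

For example, if you assume $X_i \sim N(\mu, \sigma^2)$, estimate $\hat{\mu}$ and $\hat{\sigma}^2$, and then generate bootstrap samples from $N(\hat{\mu}, \hat{\sigma}^2)$.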

The parametric bootstrap can be a good idea when you trust your parametric model (or at least trust it more than the empirical distribution) and want to leverage that structure.

Strengths: More efficient than nonparametric bootstrap if the model is well-specified. Can handle small samples better.

Weaknesses: Garbage in, garbage out—if the parametric model is wrong, so are your bootstrap results.

set.seed(1988)
y <- rnorm(100)
mu_hat <- mean(y)
sigma_hat <- sd(y)
param_boot_means <- replicate(B, mean(rnorm(100, mu_hat, sigma_hat)))
param_boot_variance <- var(param_boot_means)
print(param_boot_variance)

mu_hat = np.mean(y)
sigma_hat = np.std(y, ddof=1)
param_boot_means = [np.mean(np.random.normal(mu_hat, sigma_hat, size=len(y))) for _ in range(B)]
param_boot_variance = np.var(param_boot_means, ddof=1)
print(param_boot_variance)

Bayesian Bootstrap

Invented by Rubin in 1981, the Bayesian bootstrap doesn’t resample data points directly. Instead, it puts a Dirichlet prior on the weights assigned to each observation.

Whereas the classical bootstrap simulates sampling from a population by creating new samples from the observed data, the Bayesian bootstrap simulates uncertainty about the population distribution itself using the Bayesian framework—specifically by placing a nonparametric prior over the unknown distribution (implicitly, a Dirichlet process prior).

Algorithm:

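For each bootstrap replicate $b = 1, \dots, B$:

Draw weights $(w_1, \dots, w_n) \sim \text{Dirichlet}(1, \dots, 1)$.
Construct the weighted empirical distribution $\hat{F}^*_b = \sum_{i=1}^{n} w_i \delta_{X_i}$, where $\delta_{X_i}$ is a point mass at observation $X_i$.
Compute the weighted statistic: $\hat{\theta}^*_b = T(\hat{F}^*_b)$.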

Strengths: Smooth, avoids ties from discrete resampling, easy to implement.

Weaknesses: Interpretation may feel less intuitive if you’re used to classical frequentist bootstrap.

library(MCMCpack)  # for rdirichlet
set.seed(1988)
y <- rnorm(100)
B <- 1000
bayes_boot_means <- replicate(B, {
  weights <- as.numeric(rdirichlet(1, rep(1, length(y))))
  sum(weights * y)
})
var(bayes_boot_means)

from scipy.stats import dirichlet
bayes_boot_means = []
for _ in range(B):
    weights = dirichlet.rvs([1] * len(y))[0]
    bayes_boot_means.append(np.sum(weights * y))
print(np.var(bayes_boot_means, ddof=1))

Wild Bootstrap

The wild bootstrap is a lifesaver when dealing with heteroskedasticity or few clusters. Rather than resampling entire observations (which breaks the structure of heteroskedastic errors), the wild bootstrap keeps the design matrix fixed and perturbs only the residuals—in a way that maintains heteroskedasticity-consistent variability. Some versions modify the score function instead of the residuals.

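Suppose you’re estimating a regression model:

$$y_i = x_i^\top \beta + \varepsilon_i.$$

Then, after fitting the model once to obtain $\hat{\beta}$ and the residuals $\hat{\varepsilon}_i = y_i - x_i^\top \hat{\beta}$, you proceed as follows:

Algorithm: Wild Bootstrap

For each bootstrap replicate $b = 1, \dots, B$:

Generate a new outcome variable by perturbing the residuals:

$$y_i^* = x_i^\top \hat{\beta} + v_i \hat{\varepsilon}_i,$$

where $v_i$ are random variables with mean zero and variance one (e.g., Rademacher random variables taking values $\pm 1$ with probability $1/2$ each).

Refit the model using the perturbed outcomes and compute the statistic $\hat{\theta}^*_b$.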

Strengths: Handles heteroskedasticity gracefully, robust in small-sample settings.

Weaknesses: Mostly designed for regression contexts. Choice of perturbation distribution matters.

set.seed(1988)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = abs(x))
model <- lm(y ~ x)
residuals <- resid(model)
predicted <- fitted(model)
B <- 1000
wild_coefs <- replicate(B, {
  v <- sample(c(-1, 1), length(residuals), replace = TRUE)  # Rademacher weights
  y_star <- predicted + v * residuals
  coef(lm(y_star ~ x))[2]  # slope from the refitted model
})
var(wild_coefs)

from sklearn.linear_model import LinearRegression
x = np.random.normal(size=100).reshape(-1, 1)
y = 2 * x.flatten() + np.random.normal(scale=np.abs(x.flatten()))
model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)
predicted = model.predict(x)
wild_boot_coefs = []
for _ in range(B):
    v = np.random.choice([-1, 1], size=len(residuals))
    y_star = predicted + v * residuals
    coef = LinearRegression().fit(x, y_star).coef_[0]
    wild_boot_coefs.append(coef)
print(np.var(wild_boot_coefs, ddof=1))

Cluster Bootstrap

The cluster bootstrap is essential when working with clustered data, where observations within the same group (e.g., students in schools, workers in firms) may be correlated. Unlike standard bootstrap methods that resample individuals, the cluster bootstrap resamples entire clusters, preserving the internal dependence structure of the data.

Suppose you’re estimating a model like:

$$y_{ig} = x_{ig}'\beta + \varepsilon_{ig},$$

where $g$ indexes clusters and $i$ indexes observations within cluster $g$.

The cluster bootstrap generates resampled datasets as follows:

Algorithm:

For each bootstrap sample $b = 1, \dots, B$:

Sample $G$ clusters (as many as in the original data) with replacement and include all observations from each selected cluster.
Compute the statistic $\hat{\theta}^{*}_b$.

Strengths: Simple to implement, preserves cluster dependence, consistent under many forms of within-cluster correlation.

Weaknesses: Requires a reasonably large number of clusters (typically ). Can be biased or unstable with few clusters. (Luckily, the United States was broken down into 50 states. The Swiss were not as fortunate.)
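Here is a minimal Python sketch of the cluster bootstrap steps above, applied to the slope of a simple regression. The simulated data (40 clusters of 10 observations with a shared cluster shock) and the helper name `cluster_bootstrap_slopes` are my own assumptions for illustration.

```python
import numpy as np

def cluster_bootstrap_slopes(x, y, cluster, n_boot=500, seed=0):
    """Resample whole clusters with replacement and refit the slope each time."""
    rng = np.random.default_rng(seed)
    ids = np.unique(cluster)
    # Row indices of each cluster, so a drawn cluster brings all its observations
    rows = {g: np.flatnonzero(cluster == g) for g in ids}
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        drawn = rng.choice(ids, size=len(ids), replace=True)
        idx = np.concatenate([rows[g] for g in drawn])
        slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]  # [slope, intercept]
    return slopes

# Simulated clustered data: regressor and error both contain a cluster component
rng = np.random.default_rng(1988)
G, m = 40, 10
cluster = np.repeat(np.arange(G), m)
x = rng.normal(size=G * m) + rng.normal(size=G)[cluster]
y = 2 * x + rng.normal(size=G)[cluster] + rng.normal(size=G * m)

slopes = cluster_bootstrap_slopes(x, y, cluster)
print(np.var(slopes, ddof=1))
```

A cluster drawn twice simply appears twice in the resampled dataset, so the within-cluster dependence is carried along untouched.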

Moving Block Bootstrap

If your data are dependent, like time series, the classic bootstrap fails because it breaks the correlation structure. The moving block bootstrap fixes this by resampling blocks of adjacent observations instead of individual data points. You can easily see how this makes sense for time series: you want to maintain the local dependence structure while still resampling.

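You choose a block length $l$ and create overlapping blocks of data:

$$(y_1, \dots, y_l),\ (y_2, \dots, y_{l+1}),\ \dots,\ (y_{n-l+1}, \dots, y_n).$$

Then, you proceed as follows:

Algorithm:

For each bootstrap sample $b = 1, \dots, B$:

Sample roughly $n/l$ of these blocks with replacement and concatenate them to form a new dataset of length $n$.
Compute the statistic $\hat{\theta}^{*}_b$.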

Strengths: Maintains local dependence within blocks.

Weaknesses: Choice of block size can be tricky; too small loses dependence, too big reduces variability.

library(boot)
set.seed(1988)
y <- arima.sim(model = list(ar = 0.7), n = 100)
block_length <- 5
B <- 1000
block_boot_means <- tsboot(y, statistic = function(x) mean(x), R = B, l = block_length, sim = "fixed")
var(block_boot_means$t)

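from arch.bootstrap import MovingBlockBootstrap
np.random.seed(1988)
# AR(1) series with coefficient 0.7, mirroring the R example
e = np.random.normal(size=100)
y = np.zeros(100)
for t in range(1, 100):
    y[t] = 0.7 * y[t - 1] + e[t]
block_length = 5
B = 1000
bs = MovingBlockBootstrap(block_length, y)
# each draw is (positional_data, keyword_data); the series is the first positional entry
boot_means = np.array([np.mean(data[0][0]) for data in bs.bootstrap(B)])
print(np.var(boot_means, ddof=1))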
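If you prefer to see the mechanics without a library, here is a small NumPy sketch of a single moving block resample. The function name `moving_block_resample` is hypothetical, and the input is a simple ramp so that the block structure is visible in the output.

```python
import numpy as np

def moving_block_resample(y, block_length, rng):
    y = np.asarray(y)
    n = len(y)
    # All overlapping blocks: row j holds y[j], ..., y[j + block_length - 1]
    blocks = np.lib.stride_tricks.sliding_window_view(y, block_length)
    n_blocks = int(np.ceil(n / block_length))
    picks = rng.integers(0, blocks.shape[0], size=n_blocks)
    # Concatenate the chosen blocks and trim to the original length
    return blocks[picks].ravel()[:n]

rng = np.random.default_rng(1988)
y = np.arange(100.0)  # ramp 0, 1, ..., 99
y_star = moving_block_resample(y, 5, rng)
print(y_star[:10])
```

Within each block the values are consecutive, which is exactly the local dependence the method is designed to keep. The jumps happen only at block boundaries.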

Bottom Line

The bootstrap is not a single method—it’s a whole family of techniques, each with its own sweet spot.
The jackknife is fast and simple but struggles with non-smooth statistics.
The classic bootstrap works great for i.i.d. data and smooth or non-smooth statistics, but fails with dependence or small samples.
Specialized bootstraps (wild, block, Bayesian, subsampling) handle heteroskedasticity, clustering, dependence, and other real-world challenges that trip up the classic approach.

Where to Learn More

Careful readers of this blog may have noticed that I frequently recommend Efron and Hastie’s Computer Age Statistical Inference for its modern perspective on statistical methods, including bootstrapping. While it’s an excellent and insightful text, it can be a bit too technical for many applied practitioners. If you’re looking for more approachable resources, I recommend exploring how various statistical software packages implement the bootstrap—Stata, in particular, offers great documentation and examples. You’ll also find high-quality lecture notes from advanced econometrics courses online that treat these topics with a contemporary lens. Finally, any of the references listed below will give you a solid grounding in bootstrap techniques.

References

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics, 90(3), 414–427.

Davidson, R., & Flachaire, E. (2008). The wild bootstrap, tamed at last. Journal of Econometrics, 146(1), 162-169.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1–26.

Efron, B., & Hastie, T. (2021). Computer age statistical inference, student edition: algorithms, evidence, and data science (Vol. 6). Cambridge University Press.

Lahiri, S. N. (2003). Resampling Methods for Dependent Data. Springer.

Rubin, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9(1), 130–134.

The Secret Life of Correlation: Myths and Thirteen Views

Sat, 24 May 2025 07:00:00 GMT

7 min read

Background

Statistical correlation has long captivated me—it’s probably the topic I’ve written about most on this blog. What makes it so compelling is the combination of theoretical richness and deceptive simplicity. In an age dominated by deep learning and opaque models, correlation remains a refreshingly transparent and interpretable quantity. When I encounter a new dataset, it’s often the first tool I reach for to explore relationships among variables.

Despite its familiarity, correlation is also one of the most frequently misunderstood and misapplied concepts in statistics. It seems straightforward: a value between –1 and 1 that quantifies the strength and direction of a relationship between two variables. But beneath that tidy number lies a complex web of assumptions, limitations, and interpretations—many of which are overlooked even by seasoned practitioners.

In this article, I revisit two insightful papers—van den Heuvel and Zhan (2022), and Rodgers and Nicewander (1988)—that peel back the layers of meaning surrounding correlation. My aim is to deepen our intuition and clear up common misconceptions about three of the most widely used correlation measures: Pearson’s r, Spearman’s ρ, and Kendall’s τ. Along the way, I’ll explore thirteen different lenses through which correlation can be understood.

Notation

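Let $X$ and $Y$ be two random variables with realizations $(x_i, y_i)$ for a random sample indexed by $i = 1, \dots, n$. I assume all variables are centered (i.e., de-meaned) unless stated otherwise.

As a refresher, here are the three correlation coefficients I’ll focus on:

Pearson’s $r$ is defined as:
$$r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
Spearman’s $\rho$ is Pearson’s $r$ computed on the ranks of the data:
$$\rho = r\big(R(x), R(y)\big),$$
where $R(x_i)$ is the rank of $x_i$ in the sample (and the ranks are centered before applying the formula for $r$).
Kendall’s $\tau$ is based on the number of concordant and discordant pairs:
$$\tau = \frac{C - D}{n(n-1)/2},$$
where $C$ and $D$ count the concordant and discordant pairs among all $n(n-1)/2$ pairs of observations.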

Concordant pairs of observations refer to pairs where the ranks of both variables move in the same direction. For example, if one observation is higher than another in both variables, they are concordant. Conversely, discordant pairs occur when the ranks of the variables move in opposite directions; one observation is higher in one variable but lower in the other.
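To make the three definitions concrete, here is a short sketch that computes each coefficient by hand and compares it with the SciPy implementation. The simulated data and variable names are my own. There are no ties in this sample, so the pair-counting version of Kendall’s coefficient matches SciPy’s default.

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(1988)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Pearson: cross-product of the centered variables, scaled by their norms
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

# Spearman: Pearson computed on the ranks
rho_manual = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]

# Kendall: (concordant - discordant) / number of pairs
pairs = combinations(range(len(x)), 2)
s = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j]) for i, j in pairs)
tau_manual = s / (len(x) * (len(x) - 1) / 2)

r_scipy, _ = stats.pearsonr(x, y)
rho_scipy, _ = stats.spearmanr(x, y)
tau_scipy, _ = stats.kendalltau(x, y)
print(r_manual, r_scipy)
print(rho_manual, rho_scipy)
print(tau_manual, tau_scipy)
```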

A Closer Look

Some Myths

Pearson’s $r$ is traditionally described as a measure of linear association, while Spearman’s $\rho$ and Kendall’s $\tau$ are thought to capture monotonic relationships. This textbook distinction often leads analysts to default to rank-based methods when faced with nonlinear relationships. But as appealing as this neat categorization may be, it oversimplifies the reality.

Van den Heuvel and Zhan (2022) challenge this conventional wisdom. They argue that none of these three correlation coefficients is intrinsically limited to detecting “linear” or “monotonic” associations. Instead, their sensitivity depends on the underlying distributional structure, presence of heteroskedasticity, and even how the data were transformed. Through carefully constructed counterexamples, they demonstrate that Pearson’s $r$ can sometimes outperform Spearman’s $\rho$ and Kendall’s $\tau$ even when the association is nonlinear. Conversely, rank-based methods can be more powerful than $r$ even when the association is linear, particularly in distributions outside the bivariate normal family.

Another persistent myth is that rank correlations are categorically “more robust.” While it’s true that $\rho$ and $\tau$ are less sensitive to outliers in marginal distributions, this robustness has limits. Rank-based methods can still underperform or behave erratically in the presence of non-monotonic relationships or certain forms of heteroskedasticity. For instance, a U-shaped relationship will likely elude all three measures.
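A quick way to see the U-shaped case is to take a symmetric grid and a quadratic. In the sketch below (my own toy example), one variable is a deterministic function of the other, yet all three coefficients come out as zero up to floating-point error.

```python
import numpy as np
from scipy import stats

x = np.arange(-100, 101) / 100.0  # symmetric grid on [-1, 1]
y = x ** 2                         # fully determined by x, but U-shaped

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)
print(r, rho, tau)
```

The dependence could hardly be stronger, but the increasing and decreasing halves cancel exactly, which is why a plot should come before any coefficient.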

New Framework for Association

To overcome these misconceptions and some of the counterexamples previously suggested in the literature, the authors propose a more nuanced framework for understanding linear and monotonic associations. They developed the following extended definitions:

Linear Association: $X$ and $Y$ are linearly associated if there exist known monotone functions and such that:

Similarly,

Monotonic Association: $X$ and $Y$ are monotonically associated if there exist two potentially unknown monotonic functions and such that

Under these updated definitions, the conventional understanding of which correlation coefficient is best suited for linear or monotonic relationships holds better ground. These definitions capture a richer set of relationships by accounting for transformations, rather than relying on raw scale comparisons. They also emphasize the importance of conditional expectation as the lens through which to define association, rather than relying solely on scatter plot geometry or regression output.

Overall, what becomes clear is that no correlation coefficient offers a complete or universally superior summary of association. Each captures different aspects of dependence. They are tools, not truths—and should be interpreted in context. Visualizations and complementary diagnostic tests remain indispensable.

Thirteen Ways to Look at Pearson’s $r$

If this wasn’t enough for you, Rodgers and Nicewander (1988) offer a brilliant framing of correlation by listing thirteen distinct ways to interpret Pearson’s $r$. Here’s a quick tour, each providing a slightly different angle:

As a measure of standardized covariance, it tells you how two variables co-vary after accounting for their units.
As a regression slope between standardized variables, it equals the slope of the line predicting $z$-scored $Y$ from $z$-scored $X$.
As the centered and standardized sum of cross-products of two variables. This is merely the definition of Pearson’s $r$ shown above.
As the cosine of the angle between two vectors, showing their geometric alignment.
As a geometric mean of the two regression slopes. It equals the square root of the product of the slopes of the regression of $Y$ on $X$ and $X$ on $Y$, with the sign of those slopes.
As a square root of the ratio of two variances, where $r^2$ is the proportion of variance in $Y$ explained by $X$ through linear regression.
As a function of the angle between the two standardized regression lines, where it equals the reciprocal of the cosine plus or minus the tangent of the angle between the two lines.
As an average cross-product of standardized variables, which is obtained by dividing both the numerator and the denominator by the product of the two sample standard deviations.
As a rescaled variance of the difference between standardized scores.
As a balloon rule: A visual approximation of $r$ using the ellipse-shaped scatterplot “balloon” width and height.
As a geometric property of elliptical contours (isoconcentration ellipses) in a bivariate distribution, which are essentially more precise versions of the “balloon” idea from the prior rule.
As a test statistic in randomized experiments, $r$ can be computed from a t-statistic or F-statistic (e.g., from ANOVA).
As a ratio of two means following Galton, $r$ reflects how the mean of $Y$ changes with selected values of $X$.
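Several of these views are easy to verify numerically. The sketch below checks six of them on simulated data and confirms that each reproduces the usual value of the coefficient. The data and the dictionary labels are my own. The remaining views (balloon rule, ellipses, test statistics, ratio of means) are not covered here.

```python
import numpy as np

rng = np.random.default_rng(1988)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(size=500)

r = np.corrcoef(x, y)[0, 1]
xc, yc = x - x.mean(), y - y.mean()
zx, zy = xc / x.std(ddof=1), yc / y.std(ddof=1)
b_yx = np.polyfit(x, y, 1)[0]  # slope of Y on X
b_xy = np.polyfit(y, x, 1)[0]  # slope of X on Y

views = {
    "standardized covariance": np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1)),
    "slope between z-scores": np.polyfit(zx, zy, 1)[0],
    "cosine of the angle": (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)),
    "geometric mean of slopes": np.sign(b_yx) * np.sqrt(b_yx * b_xy),
    "mean cross-product of z-scores": (zx @ zy) / (len(x) - 1),
    "rescaled variance of z-score difference": 1 - np.var(zy - zx, ddof=1) / 2,
}
for name, value in views.items():
    print(f"{name}: {value:.6f} (r = {r:.6f})")
```

The geometric mean needs the sign attached by hand because the product of the two slopes equals the squared coefficient and is never negative.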

Each interpretation highlights a different trade-off or caveat. For example, the geometric view gives a great intuition, but the regression slope interpretation connects more directly to causal inference. And perhaps most importantly, several of these views are not invariant to nonlinear transformations, which matters a lot in real data.

Bottom Line

Pearson’s $r$, Spearman’s $\rho$, and Kendall’s $\tau$ measure different aspects of association, and none is a catch-all indicator.
The “monotonic vs. linear” framing is a helpful heuristic, but it can break down in some real-world scenarios.
Rodgers and Nicewander’s thirteen perspectives on correlation reveal its multifaceted nature and limitations.
Always visualize your data—correlation coefficients should not replace your eyes or your understanding of the domain.

References

van den Heuvel, E., & Zhan, Z. (2022). Myths about linear and monotonic associations: Pearson’s $r$, Spearman’s $\rho$, and Kendall’s $\tau$. The American Statistician, 76(1), 44–52.

Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1), 59–66.

The Kolmogorov–Smirnov Test as a Goodness-of-fit

Mon, 05 May 2025 07:00:00 GMT

4 min read

Background

The Kolmogorov–Smirnov (KS) test is a staple in the statistical toolbox for checking how well data fit a hypothesized distribution. It comes in both a one-sample and a two-sample version. A common application in causal inference is checking covariate balance between the treatment and control groups. It’s nonparametric, straightforward to compute, and implemented in just about every statistical software package. But (and this is a big but) using the KS test naively can lead to some serious misinterpretations, especially when parameters are estimated from the data.

This article is based on the 2024 paper by Zeimbekakis, Schifano, and Yan, which takes a hard look at the common misuses of the one-sample KS test. I’ll walk through what the KS test is supposed to do, when it goes wrong, and how to think more clearly about assessing goodness-of-fit.

Notation

Let $X_1, \dots, X_n$ be i.i.d. random variables with unknown distribution function $F$. We want to test whether $F = F_0$, for some known distribution function $F_0$.

The empirical distribution function (EDF) is:

$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}$$

You are probably familiar with this. It is a step function that estimates the true cumulative distribution function of a random variable based on a sample. At any point $x$, the EDF gives the proportion of observations in the sample that are less than or equal to $x$. It is the nonparametric maximum likelihood estimator of the cumulative distribution function (CDF).

The KS statistic is:

$$D_n = \sup_{x} \left| \hat{F}_n(x) - F_0(x) \right|$$

Under the null hypothesis, the scaled statistic $\sqrt{n}\, D_n$ converges in distribution to the Kolmogorov distribution, a distribution with no closed-form density but a known CDF. This is under the assumption that $F_0$ is fully specified, i.e., no parameters have been estimated from the data.
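As a sanity check on the definition, the KS statistic can be computed directly from the sorted sample and compared with `scipy.stats.kstest`. This is a small sketch on simulated standard normal data. The variable names are my own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1988)
x = np.sort(rng.normal(size=100))
n = len(x)
F0 = stats.norm.cdf(x)  # hypothesized CDF, fully specified as N(0, 1)

# The EDF only jumps at the sorted data points, so the largest gap occurs
# either right at a jump (d_plus) or just before one (d_minus)
d_plus = np.max(np.arange(1, n + 1) / n - F0)
d_minus = np.max(F0 - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

res = stats.kstest(x, "norm")
print(d_manual, res.statistic, res.pvalue)
```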

A Closer Look

A Refresher on KS

Intuitively, the KS test statistic measures the largest vertical distance between the EDF and the hypothesized CDF $F_0$. Because it reduces the entire comparison to a single maximum gap, it is a global measure of discrepancy, not a local one. In the tails both the EDF and $F_0$ are squeezed toward 0 or 1, so large vertical gaps rarely occur there, and the test is less powerful for detecting issues like tail misspecification or multimodality. This matters because in many applications, such as risk modeling or extreme value analysis, tail behavior is critically important.

A well-known limitation of the KS test is that with small samples, it has limited power to detect distributional differences, while with very large samples, it may detect statistically significant but practically trivial deviations from the hypothesized distribution. This problem in the context of “big data” is obviously broader and goes beyond the KS test.

The Problem

Here’s the catch: the null distribution of the KS statistic assumes $F_0$ is fully known. But in practice, people often use the test to evaluate model fit after estimating parameters, e.g., fitting a normal distribution by MLE and then checking fit with KS.

That invalidates the test.

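Why? Because the null distribution of the KS statistic changes when parameters are estimated. The fitted distribution is pulled toward the sample it was estimated from, so the distance between the EDF and the fitted CDF tends to be smaller than it would be against a fixed $F_0$, and the standard critical values become too large. This leads to a deflated Type I error rate: you reject a true null less often than the nominal level suggests. In other words, the test is too conservative, which also costs power against real departures from the hypothesized distribution.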
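A small simulation makes the deflation visible. In the sketch below the null is true in every replication. The only difference is whether the normal parameters are fixed in advance or estimated from the same sample. The sample size, number of replications, and parameter values are arbitrary choices of mine.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1988)
n, reps, alpha = 50, 2000, 0.05
reject_known = 0
reject_fitted = 0

for _ in range(reps):
    x = rng.normal(loc=1.0, scale=2.0, size=n)  # data really are normal
    # (a) hypothesized distribution fully specified in advance
    p_known = stats.kstest(x, "norm", args=(1.0, 2.0)).pvalue
    # (b) parameters estimated from the same sample, standard p-value reused
    p_fitted = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))).pvalue
    reject_known += p_known < alpha
    reject_fitted += p_fitted < alpha

rate_known = reject_known / reps
rate_fitted = reject_fitted / reps
print(rate_known, rate_fitted)
```

The first rate should land near the nominal 5%, while the second should fall far below it, which is the conservativeness described above.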

Better Alternatives

When parameters are estimated, we need modified procedures:

Lilliefors test: An adaptation of the KS test that adjusts the null distribution when testing for normality with estimated parameters.
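Parametric bootstrap: Simulate the null distribution of the test statistic by repeatedly drawing samples from the fitted model, re-estimating the parameters on each simulated sample, and computing the KS statistic against the refitted distribution.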
Other GOF tests: Anderson-Darling and Cramér-von Mises tests have versions that handle estimated parameters more gracefully.
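Here is a minimal sketch of the parametric bootstrap idea for testing normality with estimated mean and standard deviation. The function names, the number of replicates, and the simulated inputs are my own assumptions, and this is an illustration of the general recipe, not the procedure from any specific package.

```python
import numpy as np
from scipy import stats

def ks_normal_fitted(x):
    """KS distance between the sample and a normal fitted to that same sample."""
    return stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))).statistic

def ks_parametric_bootstrap(x, n_boot=499, seed=0):
    rng = np.random.default_rng(seed)
    d_obs = ks_normal_fitted(x)
    mu, sigma = x.mean(), x.std(ddof=1)
    d_star = np.empty(n_boot)
    for b in range(n_boot):
        x_star = rng.normal(mu, sigma, size=len(x))  # simulate under the fitted null
        d_star[b] = ks_normal_fitted(x_star)         # re-estimate on each replicate
    p_value = (1 + np.sum(d_star >= d_obs)) / (n_boot + 1)
    return d_obs, p_value

rng = np.random.default_rng(1988)
d_norm, p_norm = ks_parametric_bootstrap(rng.normal(size=200))
d_exp, p_exp = ks_parametric_bootstrap(rng.exponential(size=200))
print(d_norm, p_norm)  # normal data
print(d_exp, p_exp)    # strongly skewed data
```

The key design choice is that the parameters are re-estimated inside the loop, so the simulated statistics go through the same fit-then-test pipeline as the observed one.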

Bottom Line

The KS test is a popular and flexible method for testing differences between statistical distributions.
It assumes no parameters are estimated—violating this leads to invalid inference.
Estimating parameters from the same data used in the test deflates Type I error.
Use alternatives like the Lilliefors test or bootstrap methods when parameters are estimated.

References

Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62(318), 399–402.

Zeimbekakis, A., Schifano, E. D., & Yan, J. (2024). On Misuses of the Kolmogorov–Smirnov Test for One-Sample Goodness-of-Fit. The American Statistician, 78(4), 481-487.