Randomization Inference: A Gentle Introduction
Background
Randomization inference offers a refreshing alternative to traditional parametric inference, providing exact control over Type I error rates without relying on large-sample approximations or strict distributional assumptions. Born out of Fisher’s famous tea-tasting experiment, the approach leverages the symmetry and structure induced by randomization itself to test hypotheses.
This blog post unpacks the theory and intuition behind randomization inference, drawing on the excellent review by Ritzwoller, Romano, and Shaikh (2025). We’ll cover the key ideas, notation, and algorithms involved, and also touch on modern applications like two-sample tests, regression, and conformal inference. Throughout, we’ll emphasize the practical considerations — when it works, why it works, and where caution is needed.
Notation
Let \(W \in \{0,1\}^n\) denote the treatment assignment vector and \(Y\) the observed outcomes. In potential-outcomes notation, each unit has \(Y_i(1)\) and \(Y_i(0)\), and we observe
\[ Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0). \]
The assignment mechanism is known. For example, under complete randomization with \(n_T\) treated units, \(W\) is uniformly distributed over all binary vectors with exactly \(n_T\) ones.
Let \(T(X)\) be a test statistic computed from the observed data \(X = (Y, W)\).
Let \(G\) denote the set of transformations consistent with the design (e.g., all treatment permutations preserving the treated count). Under a valid randomization hypothesis,
\[ gX \overset{d}{=} X \quad \text{for all } g \in G. \]
This invariance is the engine of randomization inference.
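To make the group \(G\) concrete, here is a minimal sketch (not from the review itself) that enumerates every assignment vector consistent with complete randomization for a tiny design with \(n = 4\) units and \(n_T = 2\) treated. Each element of \(G\) relabels which units are treated while preserving the treated count, and under the design each is equally likely.

```python
from itertools import combinations
import numpy as np

# Complete randomization with n = 4 units, n_treat = 2 treated:
# G consists of all binary vectors with exactly two ones,
# so |G| = C(4, 2) = 6.
n, n_treat = 4, 2
assignments = []
for idx in combinations(range(n), n_treat):
    w = np.zeros(n, dtype=int)
    w[list(idx)] = 1
    assignments.append(w)

print(len(assignments))  # 6 equally likely assignment vectors
```

For realistic sample sizes \(|G|\) grows combinatorially, which is why the Monte Carlo version discussed below is used in practice.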
A Closer Look
Exact Randomization Tests
If the randomization hypothesis holds, we can compute the distribution of \(T(X)\) by applying all transformations in \(G\) to the data. The \(p\)-value is simply the proportion of these transformed test statistics that are as extreme or more extreme than the observed \(T(X)\):
\[ \hat{p} = \frac{1}{|G|} \sum_{g \in G} I\{ T(gX) \geq T(X) \}. \]
Because the null implies invariance under \(G\), this procedure achieves exact finite-sample control of the Type I error rate. In outline, the test proceeds as follows:
- Choose a test statistic \(T(X)\).
- Define the group \(G\) of transformations.
- Compute \(T(X)\) on the observed data.
- Apply all (or a random sample of) transformations \(g \in G\) to the data and recompute \(T(gX)\).
- Calculate the \(p\)-value as the proportion of transformed statistics as or more extreme than \(T(X)\).
In practice, \(|G|\) can be large. Monte Carlo sampling of transformations provides an accurate approximation, with a simple +1 adjustment ensuring exactness under random sampling.
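The steps above can be sketched directly. The function below (an illustrative implementation, not code from the review) computes the exact two-sided \(p\)-value by full enumeration of \(G\), which is feasible only when \(\binom{n}{n_T}\) is small; note that the identity transformation is always in \(G\), so the \(p\)-value is at least \(1/|G|\).

```python
from itertools import combinations
import numpy as np

def exact_pvalue(y, w):
    """Two-sided randomization p-value by full enumeration of G.
    Feasible only for small n, since |G| = C(n, n_treat)."""
    t_obs = abs(y[w == 1].mean() - y[w == 0].mean())
    n, n_t = len(y), int(w.sum())
    extreme, total = 0, 0
    for idx in combinations(range(n), n_t):
        g = np.zeros(n, dtype=int)
        g[list(idx)] = 1
        t_g = abs(y[g == 1].mean() - y[g == 0].mean())
        extreme += t_g >= t_obs  # counts the identity too
        total += 1
    return extreme / total

rng = np.random.default_rng(1)
y = rng.normal(size=8)
w = np.array([1, 1, 1, 1, 0, 0, 0, 0])
p = exact_pvalue(y, w)  # exact: no Monte Carlo error
```

Because every assignment in \(G\) is visited exactly once, there is no Monte Carlo error and no need for the +1 adjustment used with random sampling.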
When Exactness Fails
Under weak nulls (hypotheses about average treatment effects, rather than the sharp null that fixes every unit's potential outcomes), permutation tests are no longer automatically valid: the permutation distribution of a statistic may not match its true sampling distribution.
The difference in means illustrates the issue. If treatment and control variances differ, the raw difference in means can severely over-reject under permutation. The statistic is not pivotal.
Studentization resolves the problem. Scaling by an estimated standard error produces an asymptotically pivotal statistic whose limiting null distribution does not depend on nuisance parameters. Rank-based procedures (e.g., Wilcoxon–Mann–Whitney) achieve a similar goal.
The general principle is simple: asymptotic validity requires asymptotic pivotality.
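The fix is easy to implement. The sketch below (illustrative code, assuming the standard Welch-type standard error) runs a permutation test with the studentized difference in means in a setting with unequal group sizes and unequal variances, exactly where the raw difference in means can over-reject.

```python
import numpy as np

def studentized_diff(y, w):
    """Difference in means scaled by a Welch-type standard error,
    making the statistic asymptotically pivotal."""
    y1, y0 = y[w == 1], y[w == 0]
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return (y1.mean() - y0.mean()) / se

rng = np.random.default_rng(7)
n1, n0 = 10, 40
# Unequal group sizes and variances: the raw difference in means
# can over-reject under permutation; the studentized version
# remains asymptotically valid.
y = np.concatenate([rng.normal(0, 3, n1), rng.normal(0, 1, n0)])
w = np.concatenate([np.ones(n1, dtype=int), np.zeros(n0, dtype=int)])

t_obs = studentized_diff(y, w)
b = 2000
t_perm = np.array([studentized_diff(y, rng.permutation(w))
                   for _ in range(b)])
p_value = (1 + np.sum(np.abs(t_perm) >= np.abs(t_obs))) / (b + 1)
```

Only the test statistic changes relative to the basic procedure; the permutation machinery is identical.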
Strengths and Limitations
Randomization inference is particularly powerful when the randomization scheme is known and controlled, as in experiments, when the test statistic is chosen to be pivotal, and when exact finite-sample error control is important.
However, it becomes less effective when covariates are correlated with treatment assignment but not properly accounted for, or when the sample size is so small that the randomization distribution has too few support points to yield informative \(p\)-values.
An Example
We illustrate the procedure with a small randomized experiment. There are \(n = 20\) units; exactly \(n_{\text{treat}} = 10\) receive treatment under complete randomization, so \(W\) is uniformly distributed over all binary vectors with ten ones. Each unit has potential outcomes \(Y_i(0)\) and \(Y_i(1) = Y_i(0) + \tau\) with a constant effect \(\tau = 1\), and we observe \(Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0)\). The test statistic is the difference in means, \(T = \bar{Y}_1 - \bar{Y}_0\).
We test Fisher’s sharp null of no effect: \(Y_i(1) = Y_i(0)\) for all \(i\). Under this null, the observed \(Y\) would be the same under any assignment, so we can build the randomization distribution by repeatedly permuting \(W\) (keeping the number of treated units fixed), recomputing \(T\) for each permuted assignment, and then computing the proportion of those values that are as or more extreme than the observed \(T\). That proportion is the randomization \(p\)-value.
The code below does exactly that, using \(B = 5000\) random permutations and a standard +1 adjustment for the Monte Carlo \(p\)-value.
set.seed(1988)
n <- 20
n_treat <- 10
# Potential outcomes (fixed) and assignment (random)
y0 <- rnorm(n, mean = 0, sd = 1)
tau <- 1.0
w <- sample(c(rep(1, n_treat), rep(0, n - n_treat)))
y <- y0 + tau * w
# Test statistic: difference in means
t_obs <- mean(y[w == 1]) - mean(y[w == 0])
# Randomization distribution under Fisher sharp null of no effect:
# under H0, y(1)=y(0), so observed outcomes are invariant to assignment.
b <- 5000
t_perm <- replicate(b, {
w_perm <- sample(w) # preserves treated count (complete randomization)
mean(y[w_perm == 1]) - mean(y[w_perm == 0])
})
# Two-sided p-value with a +1 adjustment (Monte Carlo exactness)
p_value <- (1 + sum(abs(t_perm) >= abs(t_obs))) / (b + 1)
p_value
The same procedure in Python:
import numpy as np
np.random.seed(1988)
n = 20
n_treat = 10
y0 = np.random.normal(0, 1, n)
tau = 1.0
w = np.array([1] * n_treat + [0] * (n - n_treat))
w = np.random.permutation(w)
y = y0 + tau * w
def diff_in_means(y, w):
return y[w == 1].mean() - y[w == 0].mean()
t_obs = diff_in_means(y, w)
b = 5000
t_perm = np.empty(b)
for i in range(b):
w_perm = np.random.permutation(w) # preserves treated count
t_perm[i] = diff_in_means(y, w_perm)
p_value = (1 + np.sum(np.abs(t_perm) >= np.abs(t_obs))) / (b + 1)
print(p_value)
Bottom Line
- Randomization inference provides exact finite-sample error control when the randomization hypothesis holds.
- Asymptotic validity can often be rescued by choosing asymptotically pivotal (studentized) test statistics.
- Without studentization, permutation tests may fail badly in the presence of unequal variances.
- Randomization tests are flexible and nonparametric, making them attractive for experimental data and beyond.
Where to Learn More
The best starting point is the recent review by Ritzwoller, Romano, and Shaikh (2025). For foundational treatments of nonparametric inference, the go-to is Lehmann & Romano's two-volume door-stopper Testing Statistical Hypotheses. The practical guide by Good (2005) on permutation tests is also highly recommended.
References
Ritzwoller, D. M., Romano, J. P., & Shaikh, A. M. (2025). Randomization Inference: Theory and Applications.
Lehmann, E. L., & Romano, J. P. (2022). Testing Statistical Hypotheses. Springer.
Good, P. (2005). Permutation, Parametric, and Bootstrap Tests of Hypotheses. Springer.