Randomization Inference: A Gentle Introduction
Background
Randomization inference offers a refreshing alternative to traditional parametric inference, providing exact control over Type I error rates without relying on large-sample approximations or strict distributional assumptions. Born out of Fisher’s famous tea-tasting experiment, the approach leverages the symmetry and structure induced by randomization itself to test hypotheses.
This blog post unpacks the theory and intuition behind randomization inference, drawing on the excellent review by Ritzwoller, Romano, and Shaikh (2025). We’ll cover the key ideas, notation, and algorithms involved, and also touch on modern applications like two-sample tests, regression, and conformal inference. Throughout, we’ll emphasize the practical considerations — when it works, why it works, and where caution is needed.
Notation
Let \(W \in \{0,1\}^n\) denote the treatment assignment vector and \(Y\) the observed outcomes. In potential-outcomes notation, each unit has \(Y_i(1)\) and \(Y_i(0)\), and we observe
\[ Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0). \]
The assignment mechanism is known. For example, under complete randomization with \(n_T\) treated units, \(W\) is uniformly distributed over all binary vectors with exactly \(n_T\) ones.
Let \(T(X)\) be a test statistic computed from the observed data \(X = (Y, W)\).
Let \(G\) denote the set of transformations consistent with the design (e.g., all treatment permutations preserving the treated count). Under a valid randomization hypothesis,
\[ gX \overset{d}{=} X \quad \text{for all } g \in G. \]
This invariance is the engine of randomization inference.
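To make the group \(G\) concrete, here is a minimal sketch (not from the review itself) that enumerates every assignment vector consistent with complete randomization for a tiny design with \(n = 4\) units and \(n_T = 2\) treated. Each element of \(G\) relabels which units are treated while preserving the treated count, and under the design each is equally likely.

```python
from itertools import combinations
import numpy as np

# Complete randomization with n = 4 units, n_treat = 2 treated:
# G consists of all binary vectors with exactly two ones,
# so |G| = C(4, 2) = 6.
n, n_treat = 4, 2
assignments = []
for idx in combinations(range(n), n_treat):
    w = np.zeros(n, dtype=int)
    w[list(idx)] = 1
    assignments.append(w)

print(len(assignments))  # 6 equally likely assignment vectors
```

For realistic sample sizes \(|G|\) grows combinatorially, which is why the Monte Carlo version discussed below is used in practice.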
A Closer Look
Exact Randomization Tests
If the randomization hypothesis holds, we can compute the distribution of \(T(X)\) by applying all transformations in \(G\) to the data. The \(p\)-value is simply the proportion of these transformed test statistics that are as extreme or more extreme than the observed \(T(X)\):
\[ \hat{p} = \frac{1}{|G|} \sum_{g \in G} I\{ T(gX) \geq T(X) \}. \]
Because the null implies invariance under \(G\), this procedure achieves exact finite-sample control of the Type I error rate. In outline, the test proceeds as follows:
- Choose a test statistic \(T(X)\).
- Define the group \(G\) of transformations.
- Compute \(T(X)\) on the observed data.
- Apply all (or a random sample of) transformations \(g \in G\) to the data and recompute \(T(gX)\).
- Calculate the \(p\)-value as the proportion of transformed statistics as or more extreme than \(T(X)\).
In practice, \(|G|\) can be large. Monte Carlo sampling of transformations provides an accurate approximation, with a simple +1 adjustment ensuring exactness under random sampling.
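The steps above can be sketched directly. The function below (an illustrative implementation, not code from the review) computes the exact two-sided \(p\)-value by full enumeration of \(G\), which is feasible only when \(\binom{n}{n_T}\) is small; note that the identity transformation is always in \(G\), so the \(p\)-value is at least \(1/|G|\).

```python
from itertools import combinations
import numpy as np

def exact_pvalue(y, w):
    """Two-sided randomization p-value by full enumeration of G.
    Feasible only for small n, since |G| = C(n, n_treat)."""
    t_obs = abs(y[w == 1].mean() - y[w == 0].mean())
    n, n_t = len(y), int(w.sum())
    extreme, total = 0, 0
    for idx in combinations(range(n), n_t):
        g = np.zeros(n, dtype=int)
        g[list(idx)] = 1
        t_g = abs(y[g == 1].mean() - y[g == 0].mean())
        extreme += t_g >= t_obs  # counts the identity too
        total += 1
    return extreme / total

rng = np.random.default_rng(1)
y = rng.normal(size=8)
w = np.array([1, 1, 1, 1, 0, 0, 0, 0])
p = exact_pvalue(y, w)  # exact: no Monte Carlo error
```

Because every assignment in \(G\) is visited exactly once, there is no Monte Carlo error and no need for the +1 adjustment used with random sampling.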
When Exactness Fails
Under weak nulls (hypotheses about average treatment effects, rather than the sharp null that fixes every unit's potential outcomes), permutation tests are no longer automatically valid: the permutation distribution of a statistic may not match its true sampling distribution.
The difference in means illustrates the issue. If treatment and control variances differ, the raw difference in means can severely over-reject under permutation. The statistic is not pivotal.
Studentization resolves the problem. Scaling by an estimated standard error produces an asymptotically pivotal statistic whose limiting null distribution does not depend on nuisance parameters. Rank-based procedures (e.g., Wilcoxon–Mann–Whitney) achieve a similar goal.
The general principle is simple: asymptotic validity requires asymptotic pivotality.
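The fix is easy to implement. The sketch below (illustrative code, assuming the standard Welch-type standard error) runs a permutation test with the studentized difference in means in a setting with unequal group sizes and unequal variances, exactly where the raw difference in means can over-reject.

```python
import numpy as np

def studentized_diff(y, w):
    """Difference in means scaled by a Welch-type standard error,
    making the statistic asymptotically pivotal."""
    y1, y0 = y[w == 1], y[w == 0]
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return (y1.mean() - y0.mean()) / se

rng = np.random.default_rng(7)
n1, n0 = 10, 40
# Unequal group sizes and variances: the raw difference in means
# can over-reject under permutation; the studentized version
# remains asymptotically valid.
y = np.concatenate([rng.normal(0, 3, n1), rng.normal(0, 1, n0)])
w = np.concatenate([np.ones(n1, dtype=int), np.zeros(n0, dtype=int)])

t_obs = studentized_diff(y, w)
b = 2000
t_perm = np.array([studentized_diff(y, rng.permutation(w))
                   for _ in range(b)])
p_value = (1 + np.sum(np.abs(t_perm) >= np.abs(t_obs))) / (b + 1)
```

Only the test statistic changes relative to the basic procedure; the permutation machinery is identical.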
Strengths and Limitations
Randomization inference is particularly powerful when the randomization scheme is known and controlled, as in experiments, when the test statistic is chosen to be pivotal, and when exact finite-sample error control is important.
However, it becomes less effective when covariates are correlated with treatment assignment but not properly accounted for, or when the sample size is so small that the randomization distribution has too few support points to yield informative \(p\)-values.
An Example
We illustrate the procedure with a small randomized experiment. There are \(n = 20\) units; exactly \(n_{\text{treat}} = 10\) receive treatment under complete randomization, so \(W\) is uniformly distributed over all binary vectors with ten ones. Each unit has potential outcomes \(Y_i(0)\) and \(Y_i(1) = Y_i(0) + \tau\) with a constant effect \(\tau = 1\), and we observe \(Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0)\). The test statistic is the difference in means, \(T = \bar{Y}_1 - \bar{Y}_0\).
We test Fisher’s sharp null of no effect: \(Y_i(1) = Y_i(0)\) for all \(i\). Under this null, the observed \(Y\) would be the same under any assignment, so we can build the randomization distribution by repeatedly permuting \(W\) (keeping the number of treated units fixed), recomputing \(T\) for each permuted assignment, and then computing the proportion of those values that are as or more extreme than the observed \(T\). That proportion is the randomization \(p\)-value.
The code below does exactly that, using \(B = 5000\) random permutations and a standard +1 adjustment for the Monte Carlo \(p\)-value.
set.seed(1988)
n <- 20
n_treat <- 10
# Potential outcomes (fixed) and assignment (random)
y0 <- rnorm(n, mean = 0, sd = 1)
tau <- 1.0
w <- sample(c(rep(1, n_treat), rep(0, n - n_treat)))
y <- y0 + tau * w
# Test statistic: difference in means
t_obs <- mean(y[w == 1]) - mean(y[w == 0])
# Randomization distribution under Fisher sharp null of no effect:
# under H0, y(1)=y(0), so observed outcomes are invariant to assignment.
b <- 5000
t_perm <- replicate(b, {
w_perm <- sample(w) # preserves treated count (complete randomization)
mean(y[w_perm == 1]) - mean(y[w_perm == 0])
})
# Two-sided p-value with a +1 adjustment (Monte Carlo exactness)
p_value <- (1 + sum(abs(t_perm) >= abs(t_obs))) / (b + 1)
p_value
The same procedure in Python:
import numpy as np
np.random.seed(1988)
n = 20
n_treat = 10
y0 = np.random.normal(0, 1, n)
tau = 1.0
w = np.array([1] * n_treat + [0] * (n - n_treat))
w = np.random.permutation(w)
y = y0 + tau * w
def diff_in_means(y, w):
return y[w == 1].mean() - y[w == 0].mean()
t_obs = diff_in_means(y, w)
b = 5000
t_perm = np.empty(b)
for i in range(b):
w_perm = np.random.permutation(w) # preserves treated count
t_perm[i] = diff_in_means(y, w_perm)
p_value = (1 + np.sum(np.abs(t_perm) >= np.abs(t_obs))) / (b + 1)
print(p_value)
Bottom Line
- Randomization inference provides exact finite-sample error control when the randomization hypothesis holds.
- Asymptotic validity can often be rescued by choosing asymptotically pivotal (studentized) test statistics.
- Without studentization, permutation tests may fail badly in the presence of unequal variances.
- Randomization tests are flexible and nonparametric, making them attractive for experimental data and beyond.
Where to Learn More
The best starting point is the recent review by Ritzwoller, Romano, and Shaikh (2025). For foundational treatments of nonparametric inference, the go-to is Lehmann & Romano's two-volume door-stopper Testing Statistical Hypotheses. The practical guide by Good (2005) on permutation tests is also highly recommended.
References
Ritzwoller, D. M., Romano, J. P., & Shaikh, A. M. (2025). Randomization Inference: Theory and Applications.
Lehmann, E. L., & Romano, J. P. (2022). Testing Statistical Hypotheses. Springer.
Good, P. (2005). Permutation, Parametric, and Bootstrap Tests of Hypotheses. Springer.