The Many Flavors of Bootstrap
Background
At its heart, the bootstrap poses a simple yet powerful question: “What if we could resample from our existing data, treating it as a stand-in for the population?” By doing so, we can estimate variability, build confidence intervals, and carry out hypothesis tests—without leaning heavily on strong parametric assumptions. It’s especially useful in situations where analytic solutions exist in theory but are too complex to derive or even implement in practice.
But here’s the thing: there isn’t just one bootstrap. Over the years, statisticians have developed many flavors of the bootstrap to address different challenges in different settings. Some handle small samples better. Some are designed for dependent data like time series. Others shine when the assumptions of classic bootstrapping crumble (think clustered data or heteroskedasticity).
In this article, I’ll take a tour through the zoo of bootstrap methods: from the classic nonparametric bootstrap to the jackknife, parametric bootstrap, Bayesian bootstrap, wild bootstrap, moving block bootstrap, and more. We’ll explore where each method shines, where it stumbles, and how to pick the right one for your problem. As usual, I won’t just throw formulas at you. The focus here is on understanding why these methods work, not just how to mechanically apply them. There is also plenty of R and Python code to illustrate each method in action.
Notation
We have data \(\{Y_1, Y_2, \dots, Y_n\}\), where \(Y_i\) are independent and identically distributed (i.i.d.) random variables drawn from some unknown distribution \(F\). We’re interested in estimating some parameter \[\theta = T(F),\] like the mean, median, regression coefficients, or a more complicated functional.
Our estimator of \(\theta\) from the observed sample is \[\hat{\theta} = T(\hat{F}_n),\] where \(\hat{F}_n\) is the empirical distribution function that puts mass \(1/n\) on each observed data point.
The big question is: How variable is \(\hat{\theta}\)? And that’s where the bootstrap comes in. Regardless of the flavor of bootstrap, given \(B\) bootstrap estimates \(\hat{\theta}^*_1, \dots, \hat{\theta}^*_B\), the variance of \(\hat{\theta}\) is estimated as: \[ \hat{V}_{\text{boot}} = \frac{1}{B - 1} \sum_{b=1}^B \left( \hat{\theta}^*_b - \bar{\theta}^* \right)^2, \]
where \(\bar{\theta}^* = \frac{1}{B} \sum_{b=1}^B \hat{\theta}^*_b\) is the average of the bootstrap estimates.
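In other words, it’s just the sample variance of the bootstrap estimates, which is why the code examples below simply call var() in R (and np.var with ddof=1 in Python). As a minimal sketch, assuming theta_star is a hypothetical vector holding the \(B\) bootstrap estimates:
boot_var <- function(theta_star) {
  B <- length(theta_star)
  sum((theta_star - mean(theta_star))^2) / (B - 1)  # identical to var(theta_star)
}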
A Closer Look
The Jackknife
Let’s start with the jackknife, developed back in the 1950s by Quenouille and popularized by Tukey. The jackknife isn’t technically a bootstrap, but it’s often the gateway to resampling methods. Here is how it works:
For \(i = 1, \dots, n\):
- Drop observation \(i\) from the sample.
- Compute the leave-one-out estimate \(\hat{\theta}_{(i)} = T(\hat{F}_{n,-i})\) on the remaining \(n-1\) observations.
Here \(\hat{F}_{n,-i}\) is the empirical distribution leaving out the \(i\)-th observation. We then use the variability across these “leave-one-out” estimates to approximate the variance of \(\hat{\theta}\). Because leave-one-out estimates vary far less than estimates computed on independent samples, the jackknife uses its own scaling rather than the bootstrap formula above: \[ \hat{V}_{\text{jack}} = \frac{n-1}{n} \sum_{i=1}^n \left( \hat{\theta}_{(i)} - \bar{\theta}_{(\cdot)} \right)^2, \] where \(\bar{\theta}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^n \hat{\theta}_{(i)}\).
When to use it? The jackknife works well for smooth statistics like the mean or regression coefficients. But it can fail miserably for non-smooth functionals like the median or quantiles.
Strengths: Fast, easy to implement, no randomness involved.
Weaknesses: Limited to statistics that are smooth in the data. Doesn’t handle complex dependency structures or non-smooth parameters well.
set.seed(1988)
y <- rnorm(100)
# leave-one-out estimates of the mean
jackknife_estimates <- sapply(1:length(y), function(i) mean(y[-i]))
# jackknife variance: (n - 1) / n times the sum of squared deviations
jackknife_variance <- (length(y) - 1) / length(y) *
  sum((jackknife_estimates - mean(jackknife_estimates))^2)
print(jackknife_variance)
import numpy as np

np.random.seed(1988)
y = np.random.normal(size=100)
# leave-one-out estimates of the mean
jackknife_estimates = np.array([np.mean(np.delete(y, i)) for i in range(len(y))])
# jackknife variance: (n - 1) / n times the sum of squared deviations
jackknife_variance = (len(y) - 1) / len(y) * np.sum((jackknife_estimates - jackknife_estimates.mean()) ** 2)
print(jackknife_variance)
Classic Nonparametric Bootstrap
The classic bootstrap, introduced by Bradley Efron in 1979, takes the idea of resampling and turns it up a notch. Instead of dropping one observation at a time, we repeatedly resample with replacement from our data to create many “new” datasets, each the same size as the original.
For each bootstrap sample \(b = 1, \dots, B\):
- Sample \(n\) observations with replacement from your data.
- Compute the statistic \(\hat{\theta}^*_b = T(\hat{F}^*_b)\).
When to use it? Whenever the sample size is moderate or large and you can safely assume i.i.d. observations.
Strengths: Flexible, broadly applicable, works well for non-smooth statistics.
Weaknesses: Can struggle with small samples or dependent data (like time series). Resampling with replacement assumes independence.
set.seed(1988)
y <- rnorm(100)
B <- 1000
boot_means <- replicate(B, mean(sample(y, replace = TRUE)))
boot_variance <- var(boot_means)
print(boot_variance)
np.random.seed(1988)
B = 1000
boot_means = [np.mean(np.random.choice(y, size=len(y), replace=True)) for _ in range(B)]
boot_variance = np.var(boot_means, ddof=1)
print(boot_variance)
Parametric Bootstrap
The parametric bootstrap is a natural extension of the classic idea with a minor twist. Instead of sampling from the empirical distribution \(\hat{F}_n\), you assume a parametric model \(F_\theta\) for the data, fit it to the sample, and then generate new data from the fitted model.
For each bootstrap sample \(b = 1, \dots, B\):
- Sample \(n\) observations from the fitted model \(F_{\hat{\theta}}\).
- Compute the statistic \(\hat{\theta}^*_b = T(\hat{F}^*_b)\).
For example, if you assume \(Y_i \sim N(\mu, \sigma^2)\), estimate \(\hat{\mu}\) and \(\hat{\sigma}^2\), and then generate bootstrap samples from \(N(\hat{\mu}, \hat{\sigma}^2)\).
When to use it? When you trust your parametric model (or at least trust it more than the empirical distribution) and want to leverage that structure.
Strengths: More efficient than nonparametric bootstrap if the model is well-specified. Can handle small samples better.
Weaknesses: Garbage in, garbage out—if the parametric model is wrong, so are your bootstrap results.
set.seed(1988)
y <- rnorm(100)
mu_hat <- mean(y)
sigma_hat <- sd(y)
param_boot_means <- replicate(B, mean(rnorm(100, mu_hat, sigma_hat)))
param_boot_variance <- var(param_boot_means)
print(param_boot_variance)
mu_hat = np.mean(y)
sigma_hat = np.std(y, ddof=1)
param_boot_means = [np.mean(np.random.normal(mu_hat, sigma_hat, size=len(y))) for _ in range(B)]
param_boot_variance = np.var(param_boot_means, ddof=1)
print(param_boot_variance)
Bayesian Bootstrap
Invented by Rubin in 1981, the Bayesian bootstrap doesn’t resample data points directly. Instead, it puts a Dirichlet prior on the weights assigned to each observation.
Whereas the classical bootstrap simulates sampling from a population by creating new samples from the observed data, the Bayesian bootstrap simulates uncertainty about the population distribution itself using the Bayesian framework—specifically by placing a nonparametric prior over the unknown distribution (implicitly, a Dirichlet process prior).
For each bootstrap replicate \(b = 1, \dots, B\):
- Draw weights \((w_1, \dots, w_n) \sim \text{Dirichlet}(1, \dots, 1)\).
- Construct the weighted empirical distribution \(\hat{F}^*_b = \sum_{i=1}^n w_i^{(b)} \delta_{Y_i}\), where \(\delta_{Y_i}\) is a point mass at observation \(Y_i\).
- Compute the weighted statistic: \(\hat{\theta}^*_b = T(\hat{F}^*_b)\).
When to use it? When you’re in a Bayesian mood or want a resampling scheme without discrete resampling (i.e., no repeated observations).
Strengths: Smooth, avoids ties from discrete resampling, easy to implement.
Weaknesses: Interpretation may feel less intuitive if you’re used to classical frequentist bootstrap.
library(MCMCpack)  # for rdirichlet
set.seed(1988)
y <- rnorm(100)
B <- 1000
bayes_boot_means <- replicate(B, {
  weights <- as.numeric(rdirichlet(1, rep(1, length(y))))
  sum(weights * y)
})
var(bayes_boot_means)
from scipy.stats import dirichlet

bayes_boot_means = []
for _ in range(B):
    weights = dirichlet.rvs([1] * len(y))[0]
    bayes_boot_means.append(np.sum(weights * y))
np.var(bayes_boot_means, ddof=1)
Wild Bootstrap
The wild bootstrap is a lifesaver when dealing with heteroskedasticity or few clusters. Rather than resampling entire observations (which breaks the structure of heteroskedastic errors), the wild bootstrap keeps the design matrix \(X\) fixed and perturbs only the residuals, in a way that preserves heteroskedasticity-consistent variability. Some versions perturb the score function instead of the residuals.
Suppose you’re estimating a regression model: \[ Y_i = X_i \beta + \varepsilon_i. \]
Then, you proceed as follows:
For each bootstrap replicate \(b = 1, \dots, B\):
- Generate a new outcome variable by perturbing the residuals: \[ Y^*_i = X_i \hat{\beta} + v_i \hat{\varepsilon}_i, \]
where \(v_i\) are random variables with mean zero and variance one (e.g., Rademacher random variables taking values \(\pm1\) with probability \(0.5\)).
- Refit the model using the perturbed outcomes and compute the statistic \(\hat{\theta}^*_b = T(\hat{F}^*_b)\).
When to use it? Heteroskedastic models, clustered data with few clusters.
Strengths: Handles heteroskedasticity gracefully, robust in small-sample settings.
Weaknesses: Mostly designed for regression contexts. Choice of perturbation distribution matters.
set.seed(1988)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = abs(x))  # heteroskedastic errors
model <- lm(y ~ x)
residuals <- resid(model)
predicted <- fitted(model)
B <- 1000
wild_boot_coefs <- replicate(B, {
  v <- sample(c(-1, 1), length(residuals), replace = TRUE)  # Rademacher weights
  y_star <- predicted + v * residuals
  coef(lm(y_star ~ x))[2]
})
var(wild_boot_coefs)
from sklearn.linear_model import LinearRegression

x = np.random.normal(size=100).reshape(-1, 1)
y = 2 * x.flatten() + np.random.normal(scale=np.abs(x.flatten()))
model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)
predicted = model.predict(x)
wild_boot_coefs = []
for _ in range(B):
    v = np.random.choice([-1, 1], size=len(residuals))
    y_star = predicted + v * residuals
    coef = LinearRegression().fit(x, y_star).coef_[0]
    wild_boot_coefs.append(coef)
np.var(wild_boot_coefs, ddof=1)
Cluster Bootstrap
The cluster bootstrap is essential when working with clustered data, where observations within the same group (e.g., students in schools, workers in firms) may be correlated. Unlike standard bootstrap methods that resample individuals, the cluster bootstrap resamples entire clusters, preserving the internal dependence structure of the data.
Suppose you’re estimating a model like:
\[ Y_{ig} = X_{ig} \beta + \varepsilon_{ig}, \]
where \(g\) indexes clusters and \(i\) indexes observations within cluster \(g\).
The cluster bootstrap generates resampled datasets as follows.
For each bootstrap sample \(b = 1, \dots, B\):
- Sample clusters with replacement from your data and include all observations \(i\) from each selected cluster.
- Compute the statistic \(\hat{\theta}^*_b = T(\hat{F}^*_b)\).
When to use it? Any time your data is grouped and intra-cluster correlation may bias standard errors (e.g., multi-site studies, geographic clustering, panel data with group effects).
Strengths: Simple to implement, preserves cluster dependence, consistent under many forms of within-cluster correlation.
Weaknesses: Requires a reasonably large number of clusters (typically \(G \geq 30\)). Can be biased or unstable with few clusters. (Luckily, the United States is divided into 50 states. The Swiss were not as fortunate.)
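For completeness, here is a minimal R sketch on simulated clustered data; the data-generating process, cluster sizes, and variable names are purely illustrative:
set.seed(1988)
G <- 50                              # number of clusters
n_g <- 20                            # observations per cluster
cluster_id <- rep(1:G, each = n_g)
u <- rnorm(G)                        # cluster-level random effect
x <- rnorm(G * n_g)
y <- 2 * x + u[cluster_id] + rnorm(G * n_g)
B <- 1000
cluster_boot_coefs <- replicate(B, {
  sampled_clusters <- sample(1:G, G, replace = TRUE)                    # resample whole clusters
  idx <- unlist(lapply(sampled_clusters, function(g) which(cluster_id == g)))
  coef(lm(y[idx] ~ x[idx]))[2]
})
var(cluster_boot_coefs)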
Moving Block Bootstrap
If your data are dependent, like time series, the classic bootstrap fails because it breaks the correlation structure. The moving block bootstrap fixes this by resampling blocks of adjacent observations instead of individual data points. You can easily see how this makes sense for time series: you want to maintain the local dependence structure while still resampling.
You choose a block length \(l\) and create overlapping blocks of data: \[ \{Y_1, \dots, Y_l\}, \{Y_2, \dots, Y_{l+1}\}, \dots, \{Y_{n-l+1}, \dots, Y_n\}. \]
Then, you proceed as follows:
For each bootstrap sample \(b = 1, \dots, B\):
- Sample these blocks with replacement and concatenate them, trimming to length \(n\), to form a new dataset.
- Compute the statistic \(\hat{\theta}^*_b = T(\hat{F}^*_b)\).
When to use it? Time series or spatial data with short-range dependence.
Strengths: Maintains local dependence within blocks.
Weaknesses: Choice of block size can be tricky; too small and you lose the dependence, too large and you have too few distinct blocks to resample from.
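To make the mechanics concrete, here is a hand-rolled R sketch of the moving block bootstrap (the AR(1) series and block length are just for illustration); the boot and arch versions below do the same thing with less bookkeeping.
set.seed(1988)
y <- arima.sim(model = list(ar = 0.7), n = 100)
n <- length(y)
l <- 5
blocks <- sapply(1:(n - l + 1), function(s) y[s:(s + l - 1)])  # l x (n - l + 1) matrix of overlapping blocks
B <- 1000
mbb_means <- replicate(B, {
  k <- ceiling(n / l)                                # blocks needed to reach length n
  chosen <- sample(ncol(blocks), k, replace = TRUE)  # draw blocks with replacement
  y_star <- as.vector(blocks[, chosen])[1:n]         # concatenate and trim to length n
  mean(y_star)
})
var(mbb_means)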
library(boot)
set.seed(1988)
y <- arima.sim(model = list(ar = 0.7), n = 100)
block_length <- 5
B <- 1000
block_boot_means <- tsboot(y, statistic = function(x) mean(x), R = B, l = block_length, sim = "fixed")
var(block_boot_means$t)
from arch.bootstrap import MovingBlockBootstrap

np.random.seed(1988)
# simulate an AR(1) series with phi = 0.7 to mirror the R example
e = np.random.normal(size=100)
y = np.zeros(100)
for t in range(1, 100):
    y[t] = 0.7 * y[t - 1] + e[t]
block_length = 5
bs = MovingBlockBootstrap(block_length, y)
boot_means = np.array([np.mean(data[0][0]) for data in bs.bootstrap(B)])
np.var(boot_means, ddof=1)
Bottom Line
- The bootstrap is not a single method—it’s a whole family of techniques, each with its own sweet spot.
- The jackknife is fast and simple but struggles with non-smooth statistics.
- The classic bootstrap works great for i.i.d. data and smooth or non-smooth statistics, but fails with dependence or small samples.
- Specialized bootstraps (parametric, Bayesian, wild, cluster, block) handle heteroskedasticity, clustering, dependence, and other real-world challenges that trip up the classic approach.
Where to Learn More
Careful readers of this blog may have noticed that I frequently recommend Efron and Hastie’s Computer Age Statistical Inference for its modern perspective on statistical methods, including bootstrapping. While it’s an excellent and insightful text, it can be a bit too technical for many applied practitioners. If you’re looking for more approachable resources, I recommend exploring how various statistical software packages implement the bootstrap—Stata, in particular, offers great documentation and examples. You’ll also find high-quality lecture notes from advanced econometrics courses online that treat these topics with a contemporary lens. Finally, any of the references listed below will give you a solid grounding in bootstrap techniques.
References
Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics, 90(3), 414–427.
Davidson, R., & Flachaire, E. (2008). The wild bootstrap, tamed at last. Journal of Econometrics, 146(1), 162–169.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge University Press.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1–26.
Efron, B., & Hastie, T. (2021). Computer Age Statistical Inference, Student Edition: Algorithms, Evidence, and Data Science (Vol. 6). Cambridge University Press.
Lahiri, S. N. (2003). Resampling Methods for Dependent Data. Springer.
Rubin, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9(1), 130–134.