The Wilcoxon-Mann-Whitney Test is Not a Test of Medians
Background
Nonparametric tests like the Wilcoxon-Mann-Whitney (WMW) are among the most popular alternatives to the \(t\)- and \(z\)-tests in settings where normality assumptions break down. Often described as a “test of medians,” WMW is used when comparing two independent groups without making strong assumptions about the underlying distributions. It is also known as the Mann-Whitney-Wilcoxon (MWW) test or the Wilcoxon rank-sum test.
Despite this common interpretation, the WMW test is not a test of medians—at least not in general. Divine et al. (2018) dive deep into this misconception and show convincingly how the WMW test can lead you astray if you’re specifically interested in comparing medians.
This article explains why that happens, provides some intuition and math, and shows you how to think more clearly about what the WMW test actually does.
Notation
Let \(X_1, \ldots, X_m \sim F\) and \(Y_1, \ldots, Y_n \sim G\) be two independent random samples from distributions \(F\) and \(G\), respectively. The Wilcoxon-Mann-Whitney statistic is based on the probability:
\[P(X < Y) + \frac{1}{2}P(X = Y)\]
This quantity is sometimes referred to as the probability of superiority.
Let \(\theta_F\) and \(\theta_G\) denote the medians of \(F\) and \(G\). We often want to test:
\[H_0: \theta_F = \theta_G\]
But WMW does not directly test this hypothesis unless very specific conditions are met.
A Closer Look
What Does WMW Actually Test?
The WMW test assesses whether one distribution tends to produce larger values than the other. More formally, it tests:
\[H_0: P(X < Y) + \frac{1}{2}P(X = Y) = 0.5\]
This is equivalent to testing whether the distributions are stochastically equal, not whether the medians are equal.
The WMW test can be performed via rank sums. After combining both samples, we rank all observations from smallest to largest. The test statistic \(W\) is the sum of ranks assigned to the first sample:
\[W = \sum_{i=1}^m R(X_i)\]
where \(R(X_i)\) is the rank of \(X_i\) in the combined sample.
This rank-based formulation is mathematically equivalent to counting how many pairs \((X_i, Y_j)\) have \(X_i < Y_j\), which relates to the probability interpretation above. Under the null hypothesis, the expected rank sum is approximately \(m(m+n+1)/2\).
Understanding Stochastic Dominance
When we say the WMW test examines “stochastic dominance,” we mean it tests whether values from one distribution tend to exceed values from the other. Specifically, distribution \(G\) stochastically dominates distribution \(F\) if:
\[G(x) \leq F(x) \text{ for all } x\]
with strict inequality for at least some values of \(x\). Intuitively, this means a randomly selected value from \(G\) is more likely to be larger than a randomly selected value from \(F\).
This is quite different from comparing medians. Two distributions can have identical medians but exhibit stochastic dominance, or they can have different medians but neither stochastically dominates the other.
When Does It Coincide with a Median Test?
The WMW test only functions as a test of medians under symmetric distributions with equal shape and spread. If the shapes differ—say, one is skewed left and the other right—then even if the medians are the same, WMW can reject the null. Worse, it might fail to reject when the medians are different but the distributions have similar overall ranks.
Alternative Tests
If your research question specifically concerns differences in medians, more appropriate tests include:
- Mood’s median test: A true test of median equality that uses contingency tables based on counts above and below the combined median.
- Quantile regression: For more complex designs, quantile regression directly models the median (or other quantiles) and tests differences between groups.
- Bootstrap confidence intervals: Calculating confidence intervals for the difference in medians via bootstrapping provides both a test and measure of uncertainty.
These approaches directly address median differences rather than the stochastic ordering tested by WMW.
An Example
Let’s see this in action with a small simulation.
set.seed(123)
x <- rexp(100, rate = 1) # Right-skewed
y <- rexp(100, rate = 1.5) # Also right-skewed, different rate
median(x) # Median of x
median(y) # Median of y
wilcox.test(x, y)import numpy as np
from scipy.stats import mannwhitneyu
np.random.seed(123)
x = np.random.exponential(scale=1.0, size=100)
y = np.random.exponential(scale=2/3, size=100) # Higher scale = lower rate
print("Median x:", np.median(x))
print("Median y:", np.median(y))
res = mannwhitneyu(x, y, alternative='two-sided')
print(res)This example demonstrates our point perfectly: The medians are clearly different (0.6334 vs. 0.4865), and the WMW test correctly rejects the null hypothesis (p = 0.004). However, this rejection occurs because the exponential distributions with different rates create a consistent stochastic ordering, not because it’s specifically testing the medians.
Despite different medians, the WMW test might not reject the null. Or it might reject it because of shape differences, not the medians.
Bottom Line
- The Wilcoxon-Mann-Whitney test is not a general test of medians.
- It tests for stochastic dominance or shift in distribution, not specifically median difference.
- It behaves like a median test only under certain conditions (e.g., identical shape).
- Be cautious interpreting WMW results as saying something about medians unless distributional assumptions are met.
Where to Learn More
For a deeper dive, read the original Divine et al. (2018) paper. You might also want to look at literature on robust location tests or permutation-based alternatives that better target the median.
References
Divine, G. W., Norton, H. J., Barón, A. E., & Juarez-Colunga, E. (2018). The Wilcoxon–Mann–Whitney procedure fails as a test of medians. The American Statistician, 72(3), 278–286.
Hollander, M., Wolfe, D. A., & Chicken, E. (2013). Nonparametric Statistical Methods (3rd ed.). Wiley.