Vasco Yasenov


Second-Generation \(p\)-Values: A Worthy Idea That Did Not Catch On

hypothesis testing
statistical inference
Published

June 11, 2026

7 min read

Background

Hardly a data science interview goes by without the person in power bringing up the definition of the infamous \(p\)-value. Normally, chaos ensues. For good or ill, \(p\)-values are omnipresent in a data scientist’s world. A grad school professor of mine once described them as summaries of an entire dataset into a single number — or even, I would add, into a single yes/no decision.

The downsides of \(p\)-values have been well understood and widely publicized for decades. They conflate effect size and precision, they are routinely misinterpreted, and they say nothing about scientific relevance or practical importance. Famously, the American Statistical Association issued a formal statement on their use and misuse (Wasserstein and Lazar, 2016). Predictably, there has been no shortage of proposed alternatives — \(e\)-values, Bayes factors, posterior probabilities, and more.

In this note I describe one such alternative: the second-generation \(p\)-value (SGPV) of Blume et al. (2019). The point is to broaden how we think about using data to test scientific claims, and to get acquainted with a genuinely different approach. I keep the mathematical rigor to a minimum and focus on the intellectual idea behind the proposal.

Notation

Suppose we are interested in some parameter \(\theta\) — a difference in means, a log odds ratio, a regression coefficient, whatever the scientific question demands. Classical testing pits a simple (point) null hypothesis against everything else:

\[H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0.\]

A simple hypothesis pins \(\theta\) to a single value; a composite hypothesis allows a whole set of values. The key move in what follows is to replace the point null with a composite one.

Let \(I = [\theta_\ell, \theta_u]\) denote an interval estimate of \(\theta\) — for concreteness, a 95% confidence interval — with length \(|I| = \theta_u - \theta_\ell\).

Also, let \[H_0 = [\theta_\ell^0, \theta_u^0]\]

denote an interval null hypothesis: the range of effect sizes that are too small to matter scientifically, with length \(|H_0|\). This interval includes the exact null \(\theta_0\) but widens it to absorb effects that are technically non-zero yet practically trivial. Choosing its width is a scientific judgment, made before seeing the data.

A Closer Look

The First-Generation P-Value

Start with the object everyone already knows. We observe a dataset, and the \(p\)-value is the probability of seeing data at least as extreme as (loosely speaking, at least as weird as) what we actually observed, if the null hypothesis were true:

\[ p = P(\text{data at least as extreme as observed} \mid H_0) \]

A large \(p\)-value means the data are unsurprising under the null and therefore offer no evidence against it. Famously, this is not the probability that the null is true; getting at that requires a Bayesian apparatus and a prior. This single number has carried an enormous load for nearly a century, and that is precisely the source of trouble.


The Problem

The classical \(p\)-value conflates two things a practitioner usually wants to keep separate: the size of an effect and the precision with which it is estimated. A tiny, scientifically meaningless effect will produce an arbitrarily small \(p\)-value once the sample is large enough, because precision alone drives significance. In the era of big data this is not a hypothetical — it is the default failure mode.
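To see the mechanism in numbers, here is a minimal sketch. It assumes a two-sample z-test for a difference in means with a known, common standard deviation of 1 and equal group sizes; the function name `two_sided_p` and the effect size of 0.01 standard deviations are illustrative choices, not anything taken from the SGPV papers.

```python
import math


def two_sided_p(effect, n_per_group, sigma=1.0):
    """Two-sided z-test p-value for a difference in means between two
    equally sized groups with known, common standard deviation (assumed)."""
    se = sigma * math.sqrt(2.0 / n_per_group)   # standard error of the difference
    z = abs(effect) / se
    return math.erfc(z / math.sqrt(2.0))        # equals 2 * (1 - Phi(z))


# Hypothetical effect: one hundredth of a standard deviation, held fixed.
tiny_effect = 0.01
sample_sizes = [100, 10_000, 1_000_000]
p_values = [two_sided_p(tiny_effect, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

The effect never changes, yet the \(p\)-value moves from clearly non-significant at the smallest sample size to far below \(0.05\) at the largest. Only the standard error moved.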

The deeper issue is the point null itself. The hypothesis \(\theta = \theta_0\) exactly is almost never true and rarely even interesting. Two treatments are essentially never identical to infinite precision; the relevant question is whether they differ by an amount that matters. Statistical significance, as classically defined, simply does not speak to that question.


The Main Idea

The SGPV, which I will denote \(p_\delta\), replaces the point null with the interval null \(H_0\) and then measures how much the interval estimate \(I\) overlaps it. It is a proportion, not a tail-area probability:

\[ p_\delta = \frac{|I \cap H_0|}{|I|} \times \max\left\{ \frac{|I|}{2|H_0|},\, 1 \right\}, \]

where \(|I \cap H_0|\) is the length of the overlap between the two intervals. When the interval estimate is reasonably precise (specifically, \(|I| \leq 2|H_0|\)), the correction term is \(1\) and the formula reduces to the clean fraction \(p_\delta = |I \cap H_0| / |I|\), the share of the interval estimate that falls inside the null zone. The correction factor only kicks in when \(I\) is more than twice as wide as the null: the formula then becomes \(p_\delta = |I \cap H_0| / (2|H_0|)\), which cannot exceed \(1/2\), reflecting that the data are simply too imprecise to say much.

The interpretation is direct:

  • \(p_\delta = 0\): the interval estimate does not overlap the null at all. The data are incompatible with a trivial effect and support a scientifically meaningful one.
  • \(p_\delta = 1\): the interval estimate lies entirely inside the null. The data support only the null premise — something a classical \(p\)-value can never tell you.
  • \(0 < p_\delta < 1\): the data are inconclusive, with \(p_\delta \approx 1/2\) being the maximally inconclusive case.
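The formula and its three readings translate almost line for line into code. The sketch below is my own minimal implementation, not the authors' software; the names `sgpv` and `interpret`, and the toy intervals at the bottom, are hypothetical and chosen only for illustration.

```python
def sgpv(est_lo, est_hi, null_lo, null_hi):
    """Second-generation p-value for an interval estimate [est_lo, est_hi]
    and an interval null [null_lo, null_hi], both of positive length."""
    if est_hi <= est_lo or null_hi <= null_lo:
        raise ValueError("both intervals must have positive length")
    est_len = est_hi - est_lo
    null_len = null_hi - null_lo
    # Length of the intersection; zero when the intervals are disjoint.
    overlap = max(0.0, min(est_hi, null_hi) - max(est_lo, null_lo))
    # Correction term: equals 1 unless the estimate is more than
    # twice as wide as the null.
    correction = max(est_len / (2.0 * null_len), 1.0)
    return (overlap / est_len) * correction


def interpret(p_delta, tol=1e-12):
    """Map an SGPV to the three readings described above."""
    if p_delta <= tol:
        return "supports a meaningful effect"
    if p_delta >= 1.0 - tol:
        return "supports the null"
    return "inconclusive"


# Toy intervals against an interval null of [-1, 1] (illustrative only).
for lo, hi in [(2.0, 3.0), (-0.5, 0.5), (0.5, 1.5), (-10.0, 10.0)]:
    p = sgpv(lo, hi, -1.0, 1.0)
    print(f"I = [{lo}, {hi}]: p_delta = {p:.3f} ({interpret(p)})")
```

The last toy interval covers the null entirely but is ten times wider than it. Without the correction term its overlap fraction would be small and could be mistaken for evidence of a meaningful effect; with the correction it lands at \(1/2\), the maximally inconclusive value.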

Quick Example

A quick worked example from the paper makes it concrete. In a study of 100 smokers and 100 non-smokers, suppose 65 smokers and 50 non-smokers develop lung cancer. The odds ratio is \(1.86\) with a 95% confidence interval of \([1.03, 3.36]\). Take the interval null to be odds ratios between \(0.9\) and \(1.1\) — associations too weak to care about. The interval estimate overlaps the null only slightly, and the SGPV works out to \(p_\delta = 0.175\): suggestive, but not conclusive. Bump the smokers’ cancer count from 65 to 70 and the odds ratio rises to \(2.33\) with interval \([1.27, 4.27]\), which no longer touches the null. Now \(p_\delta = 0\), and we would report a scientifically meaningful association.
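The arithmetic behind the first figure can be checked by hand. The sketch below assumes interval lengths are measured directly on the odds ratio scale (not the log scale), which is the reading that reproduces the reported \(0.175\):

\[ |I \cap H_0| = 1.1 - 1.03 = 0.07, \qquad |I| = 3.36 - 1.03 = 2.33, \qquad |H_0| = 1.1 - 0.9 = 0.2. \]

Because \(|I| = 2.33 > 2|H_0| = 0.4\), the correction factor is active:

\[ p_\delta = \frac{0.07}{2.33} \times \frac{2.33}{2 \times 0.2} = \frac{0.07}{0.4} = 0.175. \]

In the second scenario the lower limit \(1.27\) exceeds the null's upper limit \(1.1\), so \(|I \cap H_0| = 0\) and \(p_\delta = 0\).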


Pros and Cons

The appeal is that the SGPV bakes scientific relevance into the inference itself, rather than leaving it as an afterthought to be eyeballed from a confidence interval. By construction it can support the null, it indicates when the data are inconclusive instead of forcing a verdict, and — a genuinely attractive frequency property — its Type I error rate shrinks toward zero as the sample grows, instead of hovering at \(\alpha\) forever. That directly attacks the big-data significance problem and tends to yield lower false discovery rates.
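The shrinking Type I error rate can be illustrated in a deliberately simple setting. The sketch below assumes a one-sample normal mean with known standard deviation, a true effect of exactly zero, a symmetric interval null \([-\delta, \delta]\), and a 95% interval \(\bar{x} \pm 1.96\,\sigma/\sqrt{n}\). Under those assumptions \(p_\delta = 0\) happens exactly when \(|\bar{x}| > \delta + 1.96\,\sigma/\sqrt{n}\), so its probability has a closed form. The function names and the choice \(\delta = 0.1\) are hypothetical.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def sgpv_type1_rate(n, delta, sigma=1.0, z_crit=1.959964):
    """P(p_delta = 0) when the true mean is 0: the chance that the whole
    95% interval falls outside the interval null [-delta, delta]."""
    se = sigma / math.sqrt(n)
    return 2.0 * upper_tail(delta / se + z_crit)


sample_sizes = [10, 100, 1_000, 10_000]
rates = [sgpv_type1_rate(n, delta=0.1) for n in sample_sizes]
for n, r in zip(sample_sizes, rates):
    print(f"n = {n:>6,}: P(p_delta = 0 | theta = 0) = {r:.3g}")
```

Setting `delta=0` collapses the interval null to the point null and recovers the usual fixed rate of about \(0.05\) at every sample size. Any positive `delta` makes the rate fall as `n` grows.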

The costs are real, though. The interval null must be specified in advance, and pinning down the width of “scientifically trivial” is a real judgment call that invites disagreement and can feel arbitrary. The statistic is a proportion rather than a probability, which makes it unfamiliar and harder to slot into existing reporting conventions. And it competes in a crowded field: practitioners who care about false discoveries already reach for false discovery rate (FDR) or family-wise error rate (FWER) procedures.

Bottom Line

  • \(P\)-values are among the most widely used and most widely misused tools in applied statistics; their core flaw is conflating effect size with precision.
  • The second-generation \(p\)-value is an alternative worth knowing when false discovery is a concern.
  • Its main idea is simple: measure the overlap between an interval estimate and an interval null that absorbs both the exact null and all scientifically trivial effects.
  • Despite its elegance, the SGPV never gained traction. FDR/FWER control and equivalence testing remain the dominant ways practitioners guard against spurious findings.

Where to Learn More

See the papers referenced below and look around my blog for more details on multiple testing and false discovery rate control.

References

  • Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.

  • Blume, J. D., D’Agostino McGowan, L., Dupont, W. D., & Greevy, R. A. (2018). Second-generation p-values: Improved rigor, reproducibility, and transparency in statistical analyses. PLOS ONE, 13(3), e0188299.

  • Blume, J. D., Greevy, R. A., Welty, V. F., Smith, J. R., & Dupont, W. D. (2019). An introduction to second-generation p-values. The American Statistician, 73(sup1), 157–167.

  • Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133.

  • Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05”. The American Statistician, 73(sup1), 1–19.

© 2025 Vasco Yasenov

 
