Bayesian Statistics

Lindley's Paradox

Lindley's paradox demonstrates that for large sample sizes, a frequentist test can produce a highly significant p-value (rejecting the null) while the Bayesian analysis simultaneously provides strong evidence in favor of the null hypothesis — exposing a fundamental tension between the two inferential frameworks.

As n → ∞ with z fixed: the p-value stays constant (and significant) while BF₀₁ → ∞ (favoring H₀)

Lindley's paradox, first described by Dennis Lindley in 1957, arises in the following scenario. Consider testing a point null hypothesis H₀: θ = θ₀ against an alternative H₁: θ ≠ θ₀. For large samples, there exist data sets for which the frequentist p-value is very small (say, p = 0.01, leading to rejection at conventional levels) while the Bayes factor simultaneously favors H₀ strongly. The two procedures reach opposite conclusions from identical data.

This is not a mathematical error but a genuine divergence between the inferential philosophies. It arises because the p-value measures the probability of data as extreme or more extreme under H₀, while the Bayes factor compares the predictive performance of H₀ and H₁ — and under a diffuse prior on θ, the alternative H₁ makes poor predictions because its probability mass is spread over a wide range.

Setup
Data: ȳ ~ Normal(θ, σ²/n)
H₀: θ = θ₀      H₁: θ ~ Normal(θ₀, τ²)   (diffuse prior, τ² large)

p-value: p = 2Φ(−|z|), where z = √n · (ȳ − θ₀)/σ

Bayes factor in favor of H₀: BF₀₁ = √(1 + nτ²/σ²) · exp(−z² · (nτ²/σ²) / (2(1 + nτ²/σ²)))
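Under the stated assumptions (normal data, known σ, normal prior on θ under H₁), both quantities are easy to compute directly from the formulas above. A minimal sketch with illustrative values (z = 2.5, n = 10,000, σ = τ = 1):

```python
from math import sqrt, exp, erfc

def p_value(z):
    """Two-sided p-value for a z statistic: 2Φ(−|z|)."""
    return erfc(abs(z) / sqrt(2))  # erfc(x/√2) equals 2Φ(−x) for x ≥ 0

def bf01(z, n, sigma=1.0, tau=1.0):
    """Bayes factor for H0: θ = θ0 against H1: θ ~ N(θ0, τ²)."""
    r = n * tau**2 / sigma**2
    return sqrt(1 + r) * exp(-z**2 * r / (2 * (1 + r)))

z, n = 2.5, 10_000
print(f"p    = {p_value(z):.4f}")   # ≈ 0.0124: significant at the 5% level
print(f"BF01 = {bf01(z, n):.2f}")   # ≈ 4.4: the same data favor H0
```

The same z-score that rejects H₀ at the 5% level yields a Bayes factor of about 4.4 in favor of H₀ — the paradox in two lines of arithmetic.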

The Mechanism of the Paradox

The paradox arises when z is moderate (say, z ≈ 2.5) but n is very large. A moderate z-score gives a small p-value regardless of sample size. But the Bayes factor depends on the trade-off between fit and complexity. Under H₁ with a diffuse prior (τ² large), the alternative must spread its probability over a vast parameter range. When the observed effect size δ = (ȳ − θ₀) is small (which it must be if z is moderate and n is large, since δ = zσ/√n), the data are nearly as consistent with H₀ as with any particular value under H₁. The Bayes factor penalizes H₁ for wasting predictive probability on remote parameter values that did not materialize.

More precisely, as n → ∞ with z fixed (so the effect size shrinks like 1/√n), the Bayes factor BF₀₁ grows as √n, favoring H₀ more and more strongly. The p-value, meanwhile, remains constant. The result is an ever-widening gap between the two conclusions.
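The √n growth is easy to verify numerically. In the sketch below (illustrative values: z = 2.5 held fixed, σ = τ = 1), each 100-fold increase in n multiplies BF₀₁ by roughly √100 = 10:

```python
from math import sqrt, exp

def bf01(z, n, sigma=1.0, tau=1.0):
    # BF01 = √(1 + nτ²/σ²) · exp(−z²·(nτ²/σ²)/(2(1 + nτ²/σ²)))
    r = n * tau**2 / sigma**2
    return sqrt(1 + r) * exp(-z**2 * r / (2 * (1 + r)))

z = 2.5  # p ≈ 0.0124 at every sample size
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  BF01 = {bf01(z, n):8.2f}")
# The same z-score swings from mild evidence against H0 (n = 100)
# to strong evidence for H0 (n = 1,000,000), while p never changes.
```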

Not a Contradiction, but a Lesson

Lindley's paradox is sometimes presented as a "flaw" in either framework. It is better understood as a lesson about what each framework measures. The p-value answers: "How surprising are these data if H₀ is true?" The Bayes factor answers: "Which hypothesis predicted these data better?" For large n, a tiny true effect can produce surprising data under H₀ (small p-value) while being indistinguishable from zero when compared against a diffuse alternative (large BF₀₁). The paradox thus highlights the importance of effect sizes, the sensitivity of Bayes factors to prior specification, and the dangers of conflating statistical significance with practical importance.

The Role of the Prior

The paradox is most acute when the prior under H₁ is very diffuse. If the prior on θ under H₁ is concentrated near θ₀ (small τ²), the Bayes factor becomes more favorable to H₁ and the paradox diminishes. This observation has led to extensive work on "default" or "objective" Bayes factors that calibrate the prior spread to a reasonable scale.
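This sensitivity to the prior scale can be seen numerically. A sketch with illustrative values (n = 50, σ = 0.02, observed mean 0.008, so z ≈ 2.83 and p ≈ 0.005): shrinking τ toward the measurement scale flips the Bayes factor from favoring H₀ to favoring H₁.

```python
from math import sqrt, exp

n, sigma, xbar = 50, 0.02, 0.008
z = sqrt(n) * xbar / sigma  # ≈ 2.83

for tau in (0.02, 0.1, 1.0, 10.0):
    r = n * tau**2 / sigma**2
    bf01 = sqrt(1 + r) * exp(-z**2 * r / (2 * (1 + r)))
    verdict = "favors H0" if bf01 > 1 else "favors H1"
    print(f"tau = {tau:5.2f}  BF01 = {bf01:7.2f}  ({verdict})")
# A prior matched to the measurement scale (tau = sigma = 0.02) favors H1;
# diffuse priors (tau = 1 or 10) favor H0 from the same data, and BF01
# grows roughly linearly in tau once n*tau²/sigma² is large.
```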

Jeffreys' approach recommends using a Cauchy prior centered at θ₀, which has heavier tails than the normal and is more robust. The BIC (Bayesian Information Criterion) approximation to the Bayes factor sidesteps prior specification entirely and approximates the Bayes factor using only the likelihood ratio and a sample-size penalty. The BIC-based conclusion can also diverge from the p-value, confirming that the paradox is not an artifact of a particular prior choice.
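The BIC route can be sketched as follows. For a point null with one extra free parameter under H₁, the standard BIC approximation gives BF₀₁ ≈ exp((ln n − z²)/2) = √n · e^(−z²/2) — no prior appears anywhere, yet the same √n divergence from the constant p-value emerges (illustrative values below, with z = 2.5 fixed):

```python
from math import exp, log

def bf01_bic(z, n):
    """BIC approximation to the Bayes factor for a point null with one
    extra parameter under H1: BF01 ≈ exp((ln n − z²)/2)."""
    return exp((log(n) - z**2) / 2)

z = 2.5  # p ≈ 0.0124 regardless of n
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  BIC-based BF01 ≈ {bf01_bic(z, n):7.2f}")
# For large n the BIC-based Bayes factor again favors H0 despite the
# 'significant' p-value, confirming the paradox is not a prior artifact.
```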

Historical Context and Impact

1957

Dennis Lindley publishes "A statistical paradox" in Biometrika, demonstrating the conflict between significance tests and Bayesian evidence for large samples. The paper catalyzes decades of debate between Bayesian and frequentist statisticians.

1961

Harold Jeffreys, in the third edition of Theory of Probability, provides an extensive discussion of the paradox and proposes Cauchy priors for testing as a partial resolution.

1987

James Berger and Thomas Sellke show that even from a perspective sympathetic to testing, p-values overstate the evidence against H₀: over the class of symmetric, unimodal priors, a p-value of 0.05 corresponds to a Bayes factor against H₀ of at most about 2.5.

2001

Sellke, Bayarri, and Berger propose calibrating p-values by converting them to lower bounds on the posterior probability of H₀, providing a practical bridge between the two frameworks.
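Their calibration maps a p-value p < 1/e to the lower bound −e·p·ln(p) on BF₀₁, and hence (with equal prior odds) to a lower bound on the posterior probability of H₀. A minimal sketch of that calibration:

```python
from math import e, log

def bf01_lower_bound(p):
    """Sellke–Bayarri–Berger lower bound on BF01, valid for p < 1/e."""
    assert 0 < p < 1 / e
    return -e * p * log(p)

def posterior_h0_lower_bound(p, prior_odds=1.0):
    """Lower bound on P(H0 | data) implied by the BF01 bound."""
    b = prior_odds * bf01_lower_bound(p)
    return b / (1 + b)

for p in (0.05, 0.01, 0.005):
    print(f"p = {p:<6}  BF01 ≥ {bf01_lower_bound(p):.3f}  "
          f"P(H0|data) ≥ {posterior_h0_lower_bound(p):.3f}")
# At p = 0.05 the evidence against H0 is at most about 2.5 : 1, and H0
# retains posterior probability of at least about 0.29 under equal odds.
```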

Implications for Scientific Practice

Lindley's paradox has had lasting influence on debates about statistical significance, particularly in the replication crisis era. The observation that a "significant" p-value may correspond to Bayesian evidence for the null has been cited in discussions of overpublished false positives in psychology, genomics, and medicine. Benjamin et al. (2018) proposed lowering the significance threshold to p < 0.005 partly because, at this level, the Bayes factor almost always favors the alternative, reducing the incidence of the paradox in practice.

"It is possible to have a result that is significant at the 5% level, and yet for which the posterior probability of the null hypothesis, under any reasonable prior, exceeds 50%." — Dennis V. Lindley, Biometrika (1957)

The paradox remains one of the most important results in the philosophy of statistics, forcing practitioners to confront what they mean by "evidence" and whether a single number — whether p-value or Bayes factor — can adequately summarize the strength of a scientific conclusion.

Worked Example: The Paradox with n = 50

We observe n = 50 values with a sample mean slightly above zero and test H₀: μ = 0. Lindley's paradox occurs when the frequentist p-value is significant but the Bayes factor favors H₀.

Given n = 50, x̄ = 0.008, s² = 0.0004 (s = 0.02)
Prior under H₁: μ ~ N(0, τ² = 1)

Step 1: Frequentist z-test
SE = s/√n = 0.02/√50 = 0.00283
z = x̄/SE = 0.008/0.00283 = 2.83
p-value = 2Φ(−2.83) ≈ 0.0047 (significant at the 1% level)

Step 2: Bayes Factor
BF₀₁ = √(1 + nτ²/s²) · exp(−z² · (nτ²/s²) / (2(1 + nτ²/s²)))
Here nτ²/s² = 50/0.0004 = 125,000, so the complexity factor √(1 + 125,000) ≈ 354 is large.
BF₀₁ ≈ 354 × exp(−2.83²/2) ≈ 354 × 0.018 ≈ 6.5
BF₀₁ > 1 (Bayes factor supports H₀)

Step 3: The Paradox
Frequentist: p ≈ 0.005 → Reject H₀ at conventional levels
Bayesian: BF₀₁ ≈ 6.5 → Moderate evidence supports H₀

This is Lindley's paradox in action. The frequentist test finds the data significant because x̄ is nearly three standard errors from zero. But the Bayesian analysis, with a prior (τ² = 1) that is diffuse relative to the measurement scale (s = 0.02), penalizes H₁ for "wasting" probability mass over a huge range when the observed effect is tiny. The paradox highlights that statistical significance and evidential support are fundamentally different concepts.
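A short self-contained check of this kind of reversal, using n = 50, s = 0.02, τ² = 1 and an observed mean of 0.008 (about 2.8 standard errors above zero, an illustrative value):

```python
from math import sqrt, exp, erfc

n, s, tau2 = 50, 0.02, 1.0
xbar = 0.008                        # observed mean, ≈ 2.8 SEs above zero

se = s / sqrt(n)                    # standard error ≈ 0.00283
z = xbar / se                       # ≈ 2.83
p = erfc(abs(z) / sqrt(2))          # two-sided p-value = 2Φ(−|z|)

r = n * tau2 / s**2                 # nτ²/s² = 125,000
bf01 = sqrt(1 + r) * exp(-z**2 * r / (2 * (1 + r)))

print(f"z = {z:.2f}, p = {p:.4f}")  # significant at the 1% level
print(f"BF01 = {bf01:.1f}")         # > 1: the same data favor H0
```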

