
Jeffreys Prior

The Jeffreys prior is a non-informative prior derived from the Fisher information matrix, designed to be invariant under reparameterization of the statistical model.

π(θ) ∝ √det(I(θ)), where I(θ) is the Fisher information matrix

Harold Jeffreys proposed his eponymous prior in 1946 as a principled solution to a fundamental problem in Bayesian inference: how should one encode ignorance? A flat prior over θ is not flat over a transformed parameter φ = g(θ), so "non-informativeness" depends on the parameterization chosen. Jeffreys recognized that the Fisher information provides a natural Riemannian metric on the parameter space, and that using the square root of its determinant as a prior density yields a distribution that transforms correctly under any smooth reparameterization.

The result is a prior that is intrinsic to the statistical model — it depends on the likelihood function but not on any arbitrary choice of parameterization. This makes it one of the most theoretically motivated default priors in Bayesian statistics.

Jeffreys Prior — Scalar Case

π(θ) ∝ √I(θ)

where the Fisher information is

I(θ) = −E[∂² log p(x|θ) / ∂θ²] = E[(∂ log p(x|θ) / ∂θ)²]
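
As a concrete illustration, the scalar rule can be carried out symbolically. The following sketch (assuming SymPy is available; the variable names are chosen for the example) derives I(p) and √I(p) for a single Bernoulli observation. Because the second derivative of the log-likelihood is linear in x, taking the expectation reduces to substituting E[x] = p.

# Sketch: deriving the scalar Jeffreys prior for the Bernoulli model with SymPy.
import sympy as sp

p, x = sp.symbols('p x', positive=True)

# Log-likelihood of a single Bernoulli observation x in {0, 1}
log_lik = x * sp.log(p) + (1 - x) * sp.log(1 - p)

# Fisher information: negative expected second derivative.
# The second derivative is linear in x, so E[.] amounts to substituting x -> E[x] = p.
second_deriv = sp.diff(log_lik, p, 2)
fisher = sp.simplify(-(second_deriv.subs(x, p)))   # equals 1/(p*(1 - p))

jeffreys = sp.sqrt(fisher)   # proportional to the Beta(1/2, 1/2) density
print(fisher, jeffreys)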

Invariance Under Reparameterization

The key property of the Jeffreys prior is its invariance. Suppose φ = g(θ) is a smooth one-to-one transformation. Under the change-of-variable formula, any density transforms as π(φ) = π(θ) · |dθ/dφ|. The Fisher information, for its part, transforms as I(φ) = I(θ) · (dθ/dφ)², so √I(φ) = √I(θ) · |dθ/dφ|, which is exactly the change-of-variable factor. The Jeffreys prior on φ is therefore the change-of-variable transform of the Jeffreys prior on θ: computing the prior in either parameterization yields the same distribution. No other general-purpose prior construction has this property.

This invariance resolves the classic paradox of choosing between a flat prior on a probability p and a flat prior on the log-odds log(p/(1−p)). The Jeffreys prior gives a unique, parameterization-independent answer: for the Bernoulli model, it is the Beta(1/2, 1/2) distribution.
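
The invariance can be checked numerically for this Bernoulli example. The sketch below (plain NumPy; the grid of log-odds values is arbitrary) pushes the Beta(1/2, 1/2) density through the change of variables to φ = log(p/(1−p)) and compares it with the Jeffreys prior computed directly from the Fisher information in the φ parameterization, which is I(φ) = p(1−p).

# Sketch: reparameterization invariance for the Bernoulli model, checked numerically.
import numpy as np

phi = np.linspace(-6.0, 6.0, 13)            # a few log-odds values
p = 1.0 / (1.0 + np.exp(-phi))              # inverse transform: p = sigmoid(phi)

# Route 1: Beta(1/2, 1/2) density on p, pushed to phi via |dp/dphi| = p*(1 - p)
pushed_forward = (1.0 / (np.pi * np.sqrt(p * (1 - p)))) * p * (1 - p)

# Route 2: Jeffreys prior computed directly in phi.
# I(phi) = p*(1 - p), and the integral of sqrt(I(phi)) over the real line is pi.
direct = np.sqrt(p * (1 - p)) / np.pi

print(np.allclose(pushed_forward, direct))  # True: both routes give the same density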

Multivariate Jeffreys Prior

π(θ₁, …, θₖ) ∝ √det(I(θ))

where I(θ) is the k × k Fisher information matrix with entries

I(θ)ᵢⱼ = −E[∂² log p(x|θ) / (∂θᵢ ∂θⱼ)]

Examples Across Common Models

For the Bernoulli model with parameter p, the Fisher information is I(p) = 1/(p(1−p)), giving the Jeffreys prior π(p) ∝ p⁻¹/² (1−p)⁻¹/², which is the Beta(1/2, 1/2) distribution — the arcsine distribution. For a Normal distribution with known variance σ² and unknown mean μ, I(μ) = 1/σ² (a constant), so the Jeffreys prior is flat: π(μ) ∝ 1. For the scale parameter σ of a Normal with known mean, I(σ) ∝ 1/σ², giving π(σ) ∝ 1/σ — the celebrated "reference prior" for scale parameters, equivalent to a flat prior on log σ.

For the Poisson model with rate λ, I(λ) = 1/λ, giving π(λ) ∝ λ⁻¹/². Formally this is a Gamma(1/2, 0) distribution — an improper prior that nonetheless yields a proper Gamma(Σxᵢ + 1/2, n) posterior as soon as a single observation is available.
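
All of these results follow from the same recipe and can be reproduced symbolically. In the sketch below (assuming SymPy; the expectations are substituted by hand using the known moments of each model, e.g. E[x] = p for the Bernoulli and E[(x − μ)²] = σ² for the Normal scale case), each print statement recovers one of the priors quoted above.

# Sketch: Jeffreys priors for the models discussed above, via SymPy.
import sympy as sp

x, p, mu, sigma, lam = sp.symbols('x p mu sigma lam', positive=True)

# Bernoulli(p): the second derivative is linear in x, so E[.] sets x -> p
bern = -(sp.diff(x * sp.log(p) + (1 - x) * sp.log(1 - p), p, 2).subs(x, p))
print(sp.sqrt(sp.simplify(bern)))        # equals 1/sqrt(p*(1 - p)): the Beta(1/2, 1/2) shape

# Normal mean with sigma known: the second derivative is constant in x and mu
norm_mu = -sp.diff(-(x - mu)**2 / (2 * sigma**2), mu, 2)
print(sp.sqrt(sp.simplify(norm_mu)))     # equals 1/sigma, constant in mu, so pi(mu) ∝ 1

# Normal scale with mu known: substitute E[(x - mu)^2] = sigma^2
norm_sigma = -sp.diff(-sp.log(sigma) - (x - mu)**2 / (2 * sigma**2), sigma, 2)
norm_sigma = norm_sigma.subs((x - mu)**2, sigma**2)
print(sp.sqrt(sp.simplify(norm_sigma)))  # equals sqrt(2)/sigma, so pi(sigma) ∝ 1/sigma

# Poisson(lam): linear in x again, so E[.] sets x -> lam
pois = -(sp.diff(x * sp.log(lam) - lam, lam, 2).subs(x, lam))
print(sp.sqrt(sp.simplify(pois)))        # equals 1/sqrt(lam)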

Impropriety and Propriety of the Posterior

The Jeffreys prior is often improper — it does not integrate to a finite value. This is not inherently problematic, provided the posterior is proper (integrates to one). For most well-behaved models with sufficient data, the Jeffreys prior yields a proper posterior. However, in multiparameter models, the Jeffreys prior can sometimes produce improper posteriors, which is one reason why Berger, Bernardo, and others developed reference priors as a refinement.
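
The Poisson case makes the distinction concrete: the prior λ⁻¹/² has an infinite integral over (0, ∞), yet after n observations the unnormalized posterior λ^(Σxᵢ − 1/2) e^(−nλ) is a proper Gamma(Σxᵢ + 1/2, n) density. A minimal numerical check with SciPy, using made-up counts for illustration:

# Sketch: improper Jeffreys prior, proper posterior, for the Poisson model.
import numpy as np
from scipy import integrate
from scipy.special import gamma as gamma_fn

data = np.array([0, 2, 1, 0, 3])   # hypothetical Poisson counts
n, s = len(data), int(data.sum())

def unnorm_posterior(lam):
    # prior lambda^(-1/2) times likelihood lambda^s * exp(-n*lambda), constants dropped
    return lam**(s - 0.5) * np.exp(-n * lam)

# The integral is finite and equals the Gamma(s + 1/2, rate=n) normalizing constant.
norm_const, _ = integrate.quad(unnorm_posterior, 0.0, np.inf)
print(np.isclose(norm_const, gamma_fn(s + 0.5) / n**(s + 0.5)))   # True

# The prior alone integrates to infinity: the antiderivative of lambda^(-1/2)
# is 2*sqrt(lambda), which is unbounded as lambda grows.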

The Multiparameter Problem

While the Jeffreys prior works elegantly in one-parameter models, it can behave poorly in multiparameter settings. Jeffreys himself recognized this. For the Normal model with both μ and σ unknown, the joint Jeffreys prior is π(μ, σ) ∝ 1/σ², but many statisticians prefer the "independence Jeffreys prior" π(μ, σ) ∝ 1/σ, obtained by applying the Jeffreys rule to each parameter separately. This preference led to the development of reference priors by Jose Bernardo (1979) and the subsequent Berger-Bernardo reference prior algorithm, which orders parameters by inferential importance and derives marginal priors sequentially.
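
The 1/σ² result quoted above comes from the determinant of the 2 × 2 Fisher information matrix, and it can be reproduced symbolically. A sketch with SymPy, again taking expectations by hand by writing x = μ + z with E[z] = 0 and E[z²] = σ²:

# Sketch: joint Jeffreys prior for Normal(mu, sigma) with both parameters unknown.
import sympy as sp

x, mu, z = sp.symbols('x mu z', real=True)
sigma = sp.Symbol('sigma', positive=True)
log_pdf = -sp.log(sigma) - (x - mu)**2 / (2 * sigma**2)   # log density up to a constant

params = (mu, sigma)
info = sp.zeros(2, 2)
for i, a in enumerate(params):
    for j, b in enumerate(params):
        entry = -sp.diff(log_pdf, a, b)
        # Expectation by hand: write x = mu + z, then use E[z^2] = sigma^2 and E[z] = 0
        entry = sp.expand(entry.subs(x, mu + z))
        entry = entry.subs(z**2, sigma**2).subs(z, 0)
        info[i, j] = sp.simplify(entry)

print(info)                                # diag(1/sigma^2, 2/sigma^2), zero off-diagonal
print(sp.simplify(sp.sqrt(info.det())))    # sqrt(2)/sigma^2, i.e. pi(mu, sigma) ∝ 1/sigma^2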

Connections to Information Geometry

The Fisher information matrix defines a Riemannian metric on the parameter space — the Fisher-Rao metric. The Jeffreys prior is the volume element of this metric. In this geometric view, the Jeffreys prior assigns equal prior mass to regions of parameter space that are equally statistically distinguishable. Regions where the model changes rapidly with the parameter (high Fisher information) pack more distinguishable distributions into each unit of θ and therefore receive more prior density, while regions where nearby parameter values are hard to tell apart receive less. This interpretation gives the Jeffreys prior a deep connection to information geometry, the differential-geometric study of statistical models pioneered by C. R. Rao and later formalized by Shun-ichi Amari.
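
The Bernoulli model illustrates this directly: the Fisher information 1/(p(1−p)) is largest near p = 0 and p = 1, and the Beta(1/2, 1/2) density is correspondingly highest there. A small numerical sketch:

# Sketch: high Fisher information goes with high Jeffreys prior density (Bernoulli case).
import numpy as np

p = np.array([0.01, 0.1, 0.5, 0.9, 0.99])
fisher = 1.0 / (p * (1 - p))          # I(p): roughly 25x larger at p = 0.01 than at p = 0.5
jeffreys = np.sqrt(fisher) / np.pi    # normalized Beta(1/2, 1/2) density, largest at the edges
print(np.round(fisher, 1))
print(np.round(jeffreys, 2))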

"The rule thus obtained is not only independent of the particular parameters used but also of whether we are considering parameters or their functions." — Harold Jeffreys, Theory of Probability (1961, 3rd ed.)

Historical Context

1939

Jeffreys publishes Theory of Probability, introducing the idea that a non-informative prior should be derived from the structure of the model.

1946

Jeffreys formalizes the rule π(θ) ∝ √I(θ) in a paper that establishes the invariance principle for prior construction.

1979

Jose Bernardo proposes reference priors as a refinement of the Jeffreys approach for multiparameter models, addressing its known limitations.

1992

Berger and Bernardo develop a general algorithm for constructing reference priors, which reduces to the Jeffreys prior in one-parameter problems.

Practical Usage

Despite its theoretical elegance, the Jeffreys prior is not always the best choice in practice. It can conflict with other desirable properties — for example, it may produce posteriors with poor frequentist coverage in some models. In modern applied work, weakly informative priors (such as those advocated by Andrew Gelman and the Stan development team) are often preferred, as they encode mild domain knowledge while remaining broadly diffuse. Nevertheless, the Jeffreys prior remains the gold standard for theoretical discussions of non-informativeness and serves as a benchmark against which other priors are compared.

Example: Comparing Priors for a Coin

You flip a coin 20 times and observe 14 heads (s = 14, f = 6). How does the choice of non-informative prior affect your inference?

Three priors, one dataset:

Jeffreys: Beta(0.5, 0.5) → Posterior: Beta(14.5, 6.5) → Mean: 0.6905
Flat: Beta(1, 1) → Posterior: Beta(15, 7) → Mean: 0.6818
Haldane: Beta(ε, ε) → Posterior: Beta(14+ε, 6+ε) → Mean: ≈ 0.7000

With 20 observations, the three posteriors are close but distinguishable. The Jeffreys prior produces a result between the Flat and Haldane priors. Its key advantage is reparameterization invariance: if we transform to log-odds φ = log(p/(1−p)), the Jeffreys posterior for φ is exactly what we would have obtained by computing the Jeffreys prior directly in the φ parameterization. The Flat prior lacks this property — a uniform prior on p is not uniform on log-odds. With 200 observations, all three priors produce virtually identical posteriors, demonstrating that non-informative priors matter most when data are sparse.
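
The following sketch with SciPy reproduces these numbers and adds 95% credible intervals; the Haldane prior is approximated by a small ε (here 10⁻³), since Beta(0, 0) itself is improper.

# Sketch: three "non-informative" priors applied to the same coin data.
import numpy as np
from scipy import stats

s, f = 14, 6                                  # 14 heads, 6 tails in 20 flips
eps = 1e-3                                    # stands in for the Haldane Beta(eps, eps)
priors = {"Jeffreys": (0.5, 0.5), "Flat": (1.0, 1.0), "Haldane": (eps, eps)}

for name, (a, b) in priors.items():
    post = stats.beta(a + s, b + f)           # Beta is conjugate to the Binomial likelihood
    lo, hi = post.ppf(0.025), post.ppf(0.975)
    print(f"{name:8s} Beta({a + s:g}, {b + f:g})  "
          f"mean={post.mean():.4f}  95% CI=({lo:.3f}, {hi:.3f})")

# Log-odds view of the Jeffreys posterior: by invariance, transforming draws of p
# to phi = log(p/(1-p)) matches what the Jeffreys prior derived directly in phi would give.
rng = np.random.default_rng(0)
p_draws = stats.beta(0.5 + s, 0.5 + f).rvs(size=100_000, random_state=rng)
phi_draws = np.log(p_draws / (1.0 - p_draws))
print(f"Jeffreys posterior mean of log-odds: {phi_draws.mean():.2f}")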

Interactive Calculator

The calculator takes a sequence of individual outcomes (successes and failures) and compares three priors for Binomial data: Jeffreys Beta(0.5, 0.5), Flat Beta(1, 1), and Haldane Beta(ε, ε). It reports the posterior parameters, means, and 95% credible intervals under each prior, and it illustrates reparameterization invariance by showing the results on both the proportion and the log-odds scale.

