Bayesian Statistics

Posterior Probability

The posterior probability distribution is the end product of Bayesian inference — the updated distribution over unknown quantities that results from combining prior beliefs with observed data through Bayes' theorem.

π(θ | x) = L(θ; x) · π(θ) / ∫ L(θ; x) · π(θ) dθ

The posterior distribution π(θ | x) is the central object of Bayesian statistics. It represents the complete state of knowledge about an unknown parameter θ after observing data x, synthesizing everything the analyst knew before the experiment (the prior) with everything the experiment revealed (the likelihood). Every Bayesian inference — point estimates, interval estimates, predictions, model comparisons, decisions — is extracted from the posterior.

Where frequentist methods produce a point estimate and a confidence interval, the posterior provides a full probability distribution. This distribution answers any question one might ask about the parameter: What is its most likely value? How uncertain are we? What is the probability it exceeds a clinically meaningful threshold? What outcome should we expect for the next observation? The posterior contains all of these answers simultaneously.

Posterior Distribution:   π(θ | x) = L(θ; x) · π(θ) / P(x)

Proportional Form:        π(θ | x) ∝ L(θ; x) · π(θ)

In Words:                 Posterior ∝ Likelihood × Prior
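
To make the proportional form concrete, the following Python sketch (assuming NumPy and SciPy are available) evaluates an unnormalized posterior on a grid of θ values and normalizes it numerically. The Beta(2, 2) prior and the data of 7 successes in 10 trials are illustrative assumptions, not values used elsewhere in this article.

# A minimal sketch: posterior ∝ likelihood × prior, normalized on a grid.
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

theta = np.linspace(0.001, 0.999, 999)        # grid of parameter values
prior = stats.beta.pdf(theta, 2, 2)           # π(θ), an assumed Beta(2, 2) prior
likelihood = stats.binom.pmf(7, 10, theta)    # L(θ; x) for 7 successes in 10 trials

unnormalized = likelihood * prior             # L(θ; x) · π(θ)
evidence = trapezoid(unnormalized, theta)     # ∫ L(θ; x) · π(θ) dθ, i.e. P(x)
posterior = unnormalized / evidence           # π(θ | x), integrates to 1

print("normalization check:", trapezoid(posterior, theta))       # ≈ 1.0
print("posterior mean:     ", trapezoid(theta * posterior, theta))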

Anatomy of the Posterior

The posterior is shaped by the interplay of the likelihood and the prior. When the data are abundant and informative, the likelihood dominates and the posterior is concentrated near the maximum likelihood estimate, regardless of the prior. When the data are sparse, the prior exerts more influence, pulling the posterior toward regions of prior plausibility. This trade-off is one of the most important and intuitive features of Bayesian inference.

Conjugate Example: Beta-Binomial

Prior:      θ ~ Beta(α, β)
Data:       k successes in n trials
Posterior:  θ | k, n ~ Beta(α + k, β + n − k)

Posterior Mean:  E[θ | k, n] = (α + k) / (α + β + n)

Interpretation:  The posterior mean is a weighted average of the prior mean α/(α+β) and the MLE k/n, with weights proportional to the "pseudo-counts" from the prior (α+β) and the actual data (n).

This weighted-average property is not special to the Beta-Binomial case — it holds qualitatively for all Bayesian analyses. The posterior always compromises between the prior and the data, with the balance shifting toward the data as n increases.
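
The conjugate update and the weighted-average identity above can be checked with a few lines of arithmetic. In the sketch below (plain Python, no libraries), the Beta(2, 2) prior and the 70% observed success rate are assumptions made for illustration; holding the prior fixed while n grows shows the data weight n/(α+β+n) approaching 1.

# Conjugate Beta-Binomial update: Beta(α, β) prior plus k successes in n
# trials gives a Beta(α + k, β + n − k) posterior. Values are illustrative.
alpha, beta = 2.0, 2.0                    # assumed prior pseudo-counts

for n in (20, 200, 2000):                 # growing samples with the same MLE
    k = int(0.7 * n)                      # 70% observed success rate
    post_mean = (alpha + k) / (alpha + beta + n)

    # Weighted-average identity: posterior mean = w_prior·prior_mean + w_data·MLE
    w_prior = (alpha + beta) / (alpha + beta + n)
    w_data = n / (alpha + beta + n)
    prior_mean, mle = alpha / (alpha + beta), k / n
    assert abs(post_mean - (w_prior * prior_mean + w_data * mle)) < 1e-12

    print(f"n={n:5d}  data weight={w_data:.3f}  posterior mean={post_mean:.3f}")

As n grows, the data weight approaches 1 and the posterior mean approaches the MLE of 0.7, which is exactly the shrinkage behavior described above.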

Summarizing the Posterior

Point Estimates

Three standard point summaries are extracted from the posterior. The posterior mean E[θ | x] minimizes expected squared error loss. The posterior median minimizes expected absolute error loss. The posterior mode (the maximum a posteriori or MAP estimate) maximizes the posterior density; with a flat prior, it coincides with the maximum likelihood estimate. The choice among them depends on the loss function appropriate to the problem.
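
For a posterior that is available in closed form, all three summaries can be computed directly. The sketch below assumes SciPy and reuses the illustrative Beta(16, 8) posterior from the n = 20 case above; the MAP formula (a − 1)/(a + b − 2) is the standard mode of a Beta density with a, b > 1.

# Point summaries of the illustrative Beta(16, 8) posterior.
from scipy import stats

a, b = 16.0, 8.0
posterior = stats.beta(a, b)

post_mean = posterior.mean()             # minimizes expected squared-error loss
post_median = posterior.median()         # minimizes expected absolute-error loss
post_mode = (a - 1) / (a + b - 2)        # MAP estimate; Beta mode for a, b > 1

print(f"mean={post_mean:.3f}  median={post_median:.3f}  MAP={post_mode:.3f}")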

Credible Intervals

A credible interval is a Bayesian interval estimate: an interval [a, b] such that P(a ≤ θ ≤ b | x) = 0.95 (or whatever coverage is desired). Unlike a frequentist confidence interval, a credible interval has a direct probabilistic interpretation: "Given the data and the prior, there is a 95% probability that θ lies in this interval." The highest posterior density (HPD) interval is the shortest interval with the desired coverage — it includes only the most probable parameter values.
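
Both interval types are easy to compute for the same illustrative Beta(16, 8) posterior, assuming SciPy. The HPD interval here is found by a simple search over candidate intervals with the required coverage rather than by any dedicated library routine.

# Equal-tailed vs. highest-posterior-density (HPD) 95% intervals for the
# illustrative Beta(16, 8) posterior.
import numpy as np
from scipy import stats

a, b, coverage = 16.0, 8.0, 0.95
posterior = stats.beta(a, b)

# Equal-tailed interval: cut 2.5% of posterior mass from each tail.
equal_tailed = posterior.ppf([(1 - coverage) / 2, (1 + coverage) / 2])

# HPD interval: among all intervals holding 95% of the mass, take the shortest.
lower_probs = np.linspace(0, 1 - coverage, 10_000)
lowers = posterior.ppf(lower_probs)
uppers = posterior.ppf(lower_probs + coverage)
shortest = np.argmin(uppers - lowers)
hpd = (lowers[shortest], uppers[shortest])

print("equal-tailed:", np.round(equal_tailed, 3))
print("HPD:         ", np.round(hpd, 3))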

Posterior Predictive Distribution

The posterior also generates predictions for future observations. The posterior predictive distribution integrates the likelihood of a new observation x̃ over the posterior uncertainty in θ:

Posterior Predictive Distribution:  P(x̃ | x) = ∫ P(x̃ | θ) · π(θ | x) dθ

Interpretation:  The prediction accounts for both the noise in the data-generating process and the uncertainty about the parameter. Predictions are wider than those from a plug-in estimate because parameter uncertainty is propagated.
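
The integral can be approximated by simulation: draw θ from the posterior, then draw x̃ from the likelihood at each draw. The sketch below assumes NumPy, the illustrative Beta(16, 8) posterior used above, and a hypothetical next batch of 10 binary observations.

# Posterior predictive by simulation: P(x̃ | x) = ∫ P(x̃ | θ) π(θ | x) dθ.
# The Beta(16, 8) posterior and the batch size m = 10 are assumptions.
import numpy as np

rng = np.random.default_rng(0)
a, b, m = 16.0, 8.0, 10

theta_draws = rng.beta(a, b, size=100_000)     # θ ~ π(θ | x)
x_tilde = rng.binomial(m, theta_draws)         # x̃ ~ P(x̃ | θ), one per draw

# The plug-in predictive at the posterior mean ignores parameter uncertainty,
# so its spread is narrower than the true posterior predictive.
plug_in = rng.binomial(m, a / (a + b), size=100_000)
print("posterior predictive sd:", round(x_tilde.std(), 3))
print("plug-in predictive sd:  ", round(plug_in.std(), 3))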

Sequential Updating

A defining feature of the posterior is that it can serve as the prior for the next round of inference. When new data x₂ arrive, the analyst updates the current posterior π(θ | x₁) using the new likelihood:

Sequential Bayes:  π(θ | x₁, x₂) ∝ L(θ; x₂) · π(θ | x₁)

General Sequential Form:  π(θ | x₁, …, xₙ) ∝ [∏ᵢ L(θ; xᵢ)] · π(θ)

The final posterior is the same regardless of whether the data are processed all at once or one observation at a time. This coherence property is a direct consequence of the probability axioms and is unique to Bayesian updating. It makes Bayesian methods naturally suited to streaming data, adaptive experiments, and real-time monitoring.
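
For the conjugate model this coherence can be verified directly: processing the observations one at a time ends in exactly the same Beta posterior as a single batch update. The Beta(1, 1) prior and the ten-observation data stream below are assumptions of the sketch.

# Sequential vs. batch updating for the Beta-Binomial model.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # 1 = success, 0 = failure

# Batch update: all observations at once.
alpha0, beta0 = 1.0, 1.0
k, n = sum(data), len(data)
batch = (alpha0 + k, beta0 + n - k)

# Sequential update: yesterday's posterior is today's prior.
a, b = alpha0, beta0
for x in data:
    a, b = a + x, b + (1 - x)
sequential = (a, b)

assert batch == sequential
print("posterior parameters:", batch)   # identical either way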

The Posterior as a Complete Summary

In frequentist statistics, different questions (estimation, testing, prediction) require different procedures, each with its own theory and assumptions. In Bayesian statistics, the posterior answers all of them. Want a point estimate? Take the mean or mode. Want uncertainty quantification? Read off a credible interval. Want a hypothesis test? Compute the posterior probability of the hypothesis. Want a prediction? Integrate over the posterior. This unification is one of the deepest appeals of the Bayesian framework.

Computation

For conjugate models, the posterior is available in closed form. For most realistic models, it is not. The computational challenge of Bayesian statistics is the challenge of characterizing the posterior when the normalizing constant ∫ L(θ; x) · π(θ) dθ is intractable.

Markov chain Monte Carlo (MCMC) methods — the Metropolis-Hastings algorithm, Gibbs sampling, Hamiltonian Monte Carlo — generate samples from the posterior without computing the normalizing constant. These samples can be used to estimate any posterior quantity: means, quantiles, probabilities, predictive distributions. The development of MCMC in the 1990s made Bayesian inference practical for complex models and drove the Bayesian revolution in applied statistics.
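
As a sketch of how little machinery a basic sampler needs, the following is a minimal random-walk Metropolis implementation targeting an unnormalized Bernoulli-model posterior with a flat prior; the data, proposal scale, chain length, and burn-in are all illustrative choices, and a production sampler would add tuning and convergence diagnostics.

# Minimal random-walk Metropolis sampler for an unnormalized posterior.
import numpy as np

rng = np.random.default_rng(1)
k, n = 14, 20                            # illustrative successes / trials

def log_unnormalized_posterior(theta):
    if not 0.0 < theta < 1.0:
        return -np.inf                   # flat prior on (0, 1), zero outside
    # Binomial log-likelihood up to a constant.
    return k * np.log(theta) + (n - k) * np.log(1.0 - theta)

samples, theta = [], 0.5
for _ in range(20_000):
    proposal = theta + rng.normal(0.0, 0.05)            # random-walk proposal
    log_ratio = (log_unnormalized_posterior(proposal)
                 - log_unnormalized_posterior(theta))
    if np.log(rng.uniform()) < log_ratio:               # Metropolis acceptance
        theta = proposal
    samples.append(theta)

draws = np.array(samples[2_000:])                       # drop burn-in
print("MCMC posterior mean ≈", draws.mean())            # exact value: (k+1)/(n+2)

Note that the sampler only ever evaluates the unnormalized posterior, which is exactly why the intractable normalizing constant is not an obstacle.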

Variational inference offers a faster alternative: approximate the posterior with a simpler distribution by minimizing the Kullback-Leibler divergence. This trades exactness for speed and scalability, making it the method of choice for large-scale machine learning applications such as variational autoencoders and probabilistic topic models.
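
The following toy sketch illustrates only the objective being minimized: it fits a Normal(μ, σ) approximation to a one-dimensional Beta posterior by minimizing KL(q ‖ p) evaluated numerically on a grid, assuming NumPy and SciPy. Real variational inference optimizes richer families with stochastic gradients; the Beta(16, 8) target and the optimizer settings here are assumptions for illustration.

# Toy variational approximation: fit Normal(μ, σ) to a Beta(a, b) posterior
# by minimizing KL(q ‖ p) on a grid. All numbers are illustrative.
import numpy as np
from scipy import stats, optimize
from scipy.integrate import trapezoid

a, b = 16.0, 8.0
theta = np.linspace(1e-4, 1 - 1e-4, 4_000)
log_p = stats.beta.logpdf(theta, a, b)           # target log-density

def kl_q_from_p(params):
    mu, log_sigma = params
    q = stats.norm.pdf(theta, mu, np.exp(log_sigma))
    q = q / trapezoid(q, theta)                  # renormalize on the grid
    mask = q > 0
    return trapezoid(q[mask] * (np.log(q[mask]) - log_p[mask]), theta[mask])

result = optimize.minimize(kl_q_from_p, x0=[0.5, np.log(0.1)], method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"variational fit: Normal(mean={mu_hat:.3f}, sd={sigma_hat:.3f})")
print(f"exact posterior: mean={a / (a + b):.3f}, sd={stats.beta.std(a, b):.3f}")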

Asymptotic Behavior

The Bernstein–von Mises theorem guarantees that, under regularity conditions, the posterior concentrates around the true parameter value as the sample size grows. Specifically, the posterior becomes approximately normal with mean at the MLE and variance equal to the inverse Fisher information divided by n. This result implies that Bayesian and frequentist methods agree asymptotically — the posterior credible interval and the frequentist confidence interval converge to the same interval as n → ∞.
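
The agreement can be seen numerically for a Bernoulli model: with a flat prior, the exact Beta posterior quantiles nearly match those of a normal distribution centered at the MLE with variance θ̂(1 − θ̂)/n, the inverse Fisher information divided by n. The sample of 560 successes in 1000 trials below is an illustrative assumption, and SciPy is assumed available.

# Bernstein–von Mises check: exact flat-prior Beta posterior vs. the normal
# approximation N(MLE, MLE·(1 − MLE)/n).
import numpy as np
from scipy import stats

k, n = 560, 1000
mle = k / n
se = np.sqrt(mle * (1 - mle) / n)            # sqrt(inverse Fisher info / n)

exact = stats.beta(1 + k, 1 + n - k)         # posterior under a Uniform(0, 1) prior
approx = stats.norm(mle, se)

for q in (0.025, 0.5, 0.975):
    print(f"quantile {q:.3f}:  exact {exact.ppf(q):.4f}   normal {approx.ppf(q):.4f}")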

Posterior consistency — the property that the posterior mass eventually concentrates on any neighborhood of the true parameter — is guaranteed under broad conditions, including for many nonparametric models. Failures of consistency can occur when the prior assigns zero probability to neighborhoods of the truth (violating Cromwell's Rule) or when the model is misspecified in certain pathological ways.

Historical Development

The concept of updating beliefs in light of evidence is implicit in the work of Bayes and Laplace. Laplace routinely computed posterior distributions for astronomical parameters using uniform priors and Gaussian likelihoods. But the modern understanding of the posterior as the fundamental output of inference crystallized with Jeffreys' Theory of Probability (1939) and Savage's The Foundations of Statistics (1954).

"The posterior distribution is the Bayesian's complete answer to any inferential question. It is not an intermediate step on the way to a point estimate or a test — it is the inference." — Dennis Lindley, Understanding Uncertainty (2006)

The computational revolution of the 1990s — Gelfand and Smith's demonstration of Gibbs sampling (1990), the development of BUGS and later Stan — transformed the posterior from a theoretical ideal into a practical tool. Today, posterior distributions are computed routinely for models with thousands of parameters, enabling the rich, uncertainty-aware inferences that define modern Bayesian practice.

Example: Detecting a Biased Coin at a Casino

A casino regulator suspects a roulette wheel may be biased. She spins it 200 times and observes 118 reds and 82 blacks. A fair wheel should produce roughly 50/50 (ignoring green for simplicity). What should she believe about the wheel's true bias after seeing this data?

From Prior to Posterior

Before the test, she assumes the wheel is probably fair but allows for the possibility of bias. She models her prior belief about the probability of red (θ) using a Beta(50, 50) distribution — centered on 0.50 with moderate confidence.

Prior:      θ ~ Beta(50, 50), prior mean = 0.50, concentrated around fairness

Data:       118 reds out of 200 spins

Posterior (Beta-Binomial conjugacy):
            θ | data ~ Beta(50 + 118, 50 + 82) = Beta(168, 132)
            Posterior mean = 168 / 300 = 0.56

What the Posterior Tells Us

The posterior distribution is centered at 0.56, shifted away from the fair value of 0.50 but not as far as the raw data proportion of 0.59. The prior pulled the estimate back toward fairness, reflecting the regulator's initial belief. A 95% credible interval runs from roughly 0.50 to 0.62, with its lower endpoint sitting just above 0.50.

Since the interval only just excludes 0.50, the regulator has moderate evidence of bias. If she observed another 200 spins with a similar pattern, the posterior would tighten further and the evidence would become much stronger: each new observation narrows the posterior, concentrating belief around the true value.

Posterior vs. Point Estimate

A frequentist would report the sample proportion (0.59) and a p-value. The Bayesian posterior gives much more: a full probability distribution over every possible value of θ. The regulator can directly answer questions like "What is the probability the wheel favors red by more than 5%?" by computing the area under the posterior curve above 0.55. This is the power of the posterior — it is the complete inference, not a summary of it.
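
The regulator's numbers can be reproduced from the Beta(168, 132) posterior with a few lines of SciPy; the probability that the wheel favors red by more than five points is just the posterior mass above 0.55.

# Reproducing the casino example: Beta(50, 50) prior plus 118 reds in 200 spins.
from scipy import stats

posterior = stats.beta(50 + 118, 50 + 82)          # Beta(168, 132)

print("posterior mean:", round(posterior.mean(), 3))
print("95% credible interval:",
      [round(q, 3) for q in posterior.ppf([0.025, 0.975])])
print("P(θ > 0.55 | data):", round(posterior.sf(0.55), 3))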

