Bayesian Statistics
A Complete Reference


Bayesian statistics is a framework for reasoning under uncertainty. Where classical methods treat parameters as fixed unknowns, the Bayesian approach assigns them probability distributions—encoding what is known before the data arrive, then updating those beliefs rigorously as evidence accumulates.

The engine is Bayes’ Theorem: prior belief, multiplied by the likelihood of the observed data, yields a posterior that synthesizes knowledge with evidence. This single mechanism unifies estimation, prediction, model comparison, and decision-making into one coherent system.

This reference covers the full landscape—from foundational theorems and prior distributions through inference methods, computational algorithms, and modern applications in machine learning, medicine, finance, and the physical sciences.

P(θ|D) = P(D|θ) · P(θ) / P(D)
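As a minimal numerical sketch of this update, the theorem can be applied to a discrete two-hypothesis version of the coin problem used throughout this reference. The prior of 0.8 on fairness and the biased value θ = 0.7 are illustrative assumptions:

```python
from math import comb

# Two hypotheses about a coin: fair (theta = 0.5) or biased toward
# heads (theta = 0.7). The prior and the bias value are illustrative.
prior = {"fair": 0.8, "biased": 0.2}
theta = {"fair": 0.5, "biased": 0.7}

heads, flips = 7, 10

def likelihood(p, k=heads, n=flips):
    """Binomial probability of k heads in n flips for a coin of bias p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Bayes' theorem: posterior is proportional to likelihood x prior
unnorm = {h: likelihood(theta[h]) * prior[h] for h in prior}
evidence = sum(unnorm.values())                  # P(D), the normalizer
posterior = {h: u / evidence for h, u in unnorm.items()}
```

After 7 heads in 10 flips, the posterior probability of fairness drops from 0.8 to roughly 0.64: the data pull belief toward the biased hypothesis, but the prior still dominates.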

Variables & Notation

Core quantities and symbols used throughout Bayesian inference.

P(θ)

Prior Probability

The probability of a hypothesis before observing data. Encodes existing beliefs or knowledge about the parameter.

P(θ)
Example

Before flipping a coin, you believe P(fair) = 0.8 based on experience.

P(D|θ)

Likelihood

The probability of observing the data given a specific hypothesis or parameter value.

P(D|θ) = ∏ P(xᵢ|θ)
Example

Given a fair coin (θ=0.5), the probability of seeing 7 heads in 10 flips.

P(θ|D)

Posterior Probability

The updated probability of a hypothesis after observing data. The central quantity in Bayesian inference.

P(θ|D) = P(D|θ)·P(θ) / P(D)
Example

After seeing 7/10 heads, updated belief that the coin is fair.

P(D)

Marginal Likelihood (Evidence)

The total probability of the data across all possible hypotheses. Acts as a normalizing constant.

P(D) = ∫ P(D|θ)·P(θ) dθ
Example

The overall probability of getting 7/10 heads across all possible coin biases.
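The integral above can be approximated numerically for the coin example. A uniform prior over the bias is assumed here, under which the evidence has the closed form 1/(n + 1):

```python
from math import comb

def binom_lik(theta, k=7, n=10):
    """Binomial likelihood of k heads in n flips at bias theta."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Midpoint-rule approximation of the integral of likelihood x prior
# over theta in [0, 1], with a uniform prior (an illustrative choice).
N = 10_000
evidence = sum(binom_lik((i + 0.5) / N) for i in range(N)) / N

# With a uniform prior the exact answer is 1 / (n + 1) = 1/11.
```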

P(x̃|D)

Posterior Predictive

The predicted probability of future observations given the observed data, integrating over parameter uncertainty.

P(x̃|D) = ∫ P(x̃|θ)·P(θ|D) dθ
Example

Predicting the probability of the next coin flip being heads after observing data.
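For conjugate models this integral is available in closed form. A sketch for the coin, assuming a uniform Beta(1, 1) prior (an illustrative choice):

```python
# Beta(alpha, beta) is conjugate to the binomial likelihood, so after
# k heads in n flips the posterior is Beta(alpha + k, beta + n - k).
alpha, beta = 1.0, 1.0    # uniform prior, an illustrative assumption
k, n = 7, 10

alpha_post = alpha + k        # 8
beta_post = beta + n - k      # 4

# For a single future flip, integrating P(heads|theta) against the
# posterior yields exactly the posterior mean of theta.
p_next_heads = alpha_post / (alpha_post + beta_post)
```

Here the predictive probability of heads is 8/12 ≈ 0.667, Laplace's rule of succession.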

P(x̃)

Prior Predictive

The predicted probability of data before observing anything, based only on the prior.

P(x̃) = ∫ P(x̃|θ)·P(θ) dθ
Example

Expected data distribution before running an experiment.

BF₁₂

Bayes Factor

The ratio of marginal likelihoods for two competing models. Quantifies relative evidence.

BF₁₂ = P(D|M₁) / P(D|M₂)
Example

BF₁₂ = 10 means the data are 10× more likely under model 1 than under model 2.
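A small sketch comparing two coin models on the 7-of-10 data; the choice of models (exactly fair vs. unknown bias with a uniform prior) is an illustrative assumption:

```python
from math import comb

k, n = 7, 10

# M1: the coin is exactly fair (theta = 0.5)
p_d_m1 = comb(n, k) * 0.5**n

# M2: theta unknown with a uniform prior; the marginal likelihood of
# a Beta(1, 1)-binomial model is 1 / (n + 1)
p_d_m2 = 1 / (n + 1)

bf12 = p_d_m1 / p_d_m2   # close to 1: these data barely discriminate
```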

α, β, ...

Hyperparameter

Parameters of the prior distribution. In hierarchical models, these may themselves have priors (hyperpriors).

θ ~ Prior(α, β)
Example

The shape (α) and rate (β) of a Gamma prior on a Poisson rate parameter.
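The Gamma-Poisson pair is conjugate, so the role of the hyperparameters can be shown in a few lines; the hyperparameter values and counts below are illustrative assumptions:

```python
# Gamma(alpha, beta) is conjugate to the Poisson likelihood: after
# observing counts x_1..x_n, the posterior is
# Gamma(alpha + sum(x), beta + n).
alpha, beta = 2.0, 1.0          # shape and rate hyperparameters
counts = [3, 5, 4, 6, 2]        # observed Poisson counts

alpha_post = alpha + sum(counts)            # 2 + 20 = 22
beta_post = beta + len(counts)              # 1 + 5 = 6
post_mean_rate = alpha_post / beta_post     # posterior mean of the rate
```

The hyperparameters act like "pseudo-observations": α prior events over β prior intervals, blended with the real data.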

z

Latent Variable

An unobserved variable that influences the observed data. Must be inferred from observations.

P(z|x) ∝ P(x|z)·P(z)
Example

Cluster assignments in a Gaussian mixture model.
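The cluster-assignment case can be sketched directly from the proportionality above; the mixture weights, component means, and observed point are illustrative assumptions:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    return exp(-((x - mu)**2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Two-component Gaussian mixture with illustrative parameters.
weights = [0.5, 0.5]            # prior P(z) over cluster assignments
mus, sigma = [-2.0, 2.0], 1.0

x = 1.0                         # one observed data point
unnorm = [w * normal_pdf(x, mu, sigma) for w, mu in zip(weights, mus)]
total = sum(unnorm)
resp = [u / total for u in unnorm]   # P(z|x), likelihood x prior, normalized
```

The point x = 1.0 sits much closer to the second component, so nearly all posterior mass lands on that cluster assignment.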

CI

Credible Interval

An interval in which the parameter lies with a given probability, according to the posterior distribution.

P(a ≤ θ ≤ b | D) = 0.95
Example

95% credible interval: the true coin bias lies between 0.45 and 0.85.
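An equal-tailed credible interval can be read off a grid approximation of the posterior CDF; the Beta(8, 4) posterior here assumes the 7-of-10 coin data with a uniform prior:

```python
# Grid approximation of the Beta(8, 4) coin posterior (7 heads in
# 10 flips under a uniform prior -- an illustrative assumption).
N = 50_000
grid = [(i + 0.5) / N for i in range(N)]
dens = [t**7 * (1 - t)**3 for t in grid]   # unnormalized posterior
total = sum(dens)

# Walk the CDF to find the equal-tailed 95% credible interval
cdf, lo, hi = 0.0, None, None
for t, d in zip(grid, dens):
    cdf += d / total
    if lo is None and cdf >= 0.025:
        lo = t
    if hi is None and cdf >= 0.975:
        hi = t
# (lo, hi) now brackets the central 95% of posterior mass
```

Unlike a frequentist confidence interval, this statement is directly about the parameter: given the data and prior, θ lies in (lo, hi) with probability 0.95.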

θ̂_MAP

MAP Estimate

Maximum A Posteriori — the single most probable parameter value under the posterior.

θ̂_MAP = argmax_θ P(θ|D)
Example

The most likely coin bias given the observed flips.
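For a Beta posterior the argmax has a closed form, so no numerical optimization is needed; the uniform prior is an illustrative assumption:

```python
# With a Beta(alpha, beta) prior and k heads in n flips, the posterior
# is Beta(alpha + k, beta + n - k); its mode is the MAP estimate.
alpha, beta = 1.0, 1.0    # uniform prior, an illustrative assumption
k, n = 7, 10

a_post, b_post = alpha + k, beta + n - k
# Mode of Beta(a, b) for a, b > 1; with a flat prior this equals the MLE
theta_map = (a_post - 1) / (a_post + b_post - 2)
```

With a flat prior the MAP estimate coincides with the maximum likelihood estimate, 7/10; an informative prior would pull it away from the raw frequency.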

E[θ|D]

Posterior Mean

The expected value of the parameter under the posterior distribution. Minimizes squared error loss.

E[θ|D] = ∫ θ·P(θ|D) dθ
Example

The average coin bias weighted by the posterior distribution.

Var(θ|D)

Posterior Variance

The spread of uncertainty in the parameter estimate under the posterior.

Var(θ|D) = E[θ²|D] − (E[θ|D])²
Example

How uncertain we remain about the coin's bias after observing the data.
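Both of these posterior summaries have closed forms for a Beta posterior. A sketch for the coin example, assuming the Beta(8, 4) posterior from 7 heads in 10 flips under a uniform prior:

```python
# Closed-form moments of the Beta(8, 4) coin posterior
# (an illustrative assumption: uniform prior, 7 heads in 10 flips).
a, b = 8.0, 4.0
post_mean = a / (a + b)                           # E[theta|D] = 2/3
post_var = a * b / ((a + b)**2 * (a + b + 1))     # Var(theta|D)
post_sd = post_var**0.5                           # spread of belief
```

Note that the posterior mean (2/3) differs from the MAP estimate (0.7); they coincide only when the posterior is symmetric.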

n_eff

Effective Sample Size

In MCMC, the number of independent samples to which a correlated chain is equivalent. Accounts for autocorrelation between successive draws.

n_eff = N / (1 + 2·Σ ρₖ)
Example

A chain of 10,000 samples might have n_eff = 3,000 due to autocorrelation.
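The effect can be reproduced on a synthetic chain. An AR(1) process is used here as a stand-in for MCMC output; its correlation parameter, chain length, and the truncation threshold for the autocorrelation sum are illustrative choices:

```python
import random

random.seed(0)

# Simulate an AR(1) chain x_t = phi*x_{t-1} + noise, which mimics the
# autocorrelation of MCMC output (phi and N are illustrative choices).
phi, N = 0.8, 20_000
x = [0.0]
for _ in range(N - 1):
    x.append(phi * x[-1] + random.gauss(0.0, 1.0))

mean = sum(x) / N
var = sum((v - mean)**2 for v in x) / N

def rho(k):
    """Estimate the lag-k autocorrelation of the chain."""
    c = sum((x[i] - mean) * (x[i + k] - mean) for i in range(N - k)) / N
    return c / var

# n_eff = N / (1 + 2*sum(rho_k)); truncate once correlations are small
s = 0.0
for k in range(1, 200):
    r = rho(k)
    if r < 0.05:
        break
    s += r
n_eff = N / (1 + 2 * s)   # far fewer than N "independent" samples
```

For an AR(1) chain the theoretical ratio is n_eff/N = (1 − φ)/(1 + φ), about 0.11 here, so 20,000 draws carry the information of only a couple of thousand independent samples.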

R̂

R-hat (Convergence Diagnostic)

Compares between-chain and within-chain variance to assess MCMC convergence. Values near 1 indicate convergence.

R̂ = √(V̂/W)
Example

R̂ = 1.01 suggests chains have converged; R̂ > 1.1 signals problems.
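The diagnostic can be sketched on two synthetic chains drawn from the same distribution, where R̂ should land very close to 1. Chain length, chain count, and the target distribution are illustrative choices, and this is the basic (unsplit) form of the statistic:

```python
import random

random.seed(1)

# Two chains sampling the same distribution, so R-hat should be near 1.
N, M = 5_000, 2
chains = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]

means = [sum(c) / N for c in chains]
grand = sum(means) / M

# Between-chain variance B and mean within-chain variance W
B = N / (M - 1) * sum((m - grand)**2 for m in means)
W = sum(sum((v - m)**2 for v in c) / (N - 1)
        for c, m in zip(chains, means)) / M

# Pooled variance estimate, then R-hat = sqrt(V_hat / W)
V_hat = (N - 1) / N * W + B / N
r_hat = (V_hat / W)**0.5
```

If one chain were stuck far from the others, B would dwarf W and R̂ would rise well above 1, flagging the failure to converge.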