Bayesian Statistics

Expectation Propagation

Expectation Propagation (EP) is a deterministic approximate inference algorithm that iteratively refines local factor approximations by matching moments, using the inclusive (forward) KL divergence rather than the exclusive divergence of variational Bayes.

q(θ) ∝ ∏ᵢ f̃ᵢ(θ), where each f̃ᵢ approximates the true factor fᵢ by locally minimizing the inclusive divergence KL(p ‖ q)

Expectation Propagation, introduced by Thomas Minka in 2001, occupies a distinctive niche in the landscape of approximate Bayesian inference. Where variational Bayes minimizes KL(q ‖ p) and thereby seeks modes, EP minimizes KL(p ‖ q) locally at each factor, producing approximations that tend to cover the posterior's mass — matching its moments rather than concentrating on its peaks.
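The practical consequence of the KL direction is easy to see numerically. The sketch below is a generic illustration rather than part of EP itself: it fits a single Gaussian to an arbitrarily chosen bimodal mixture in both ways. The inclusive fit KL(p ‖ q) is obtained by exact moment matching and spreads over both modes, while the exclusive fit KL(q ‖ p), found by numerical optimization, locks onto one mode.

```python
# Illustrative contrast between inclusive and exclusive KL fits of a single
# Gaussian to a bimodal target (the mixture is chosen arbitrarily for the demo).
import numpy as np
from scipy import stats, optimize

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
# bimodal target p(x): equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)
p = 0.5 * stats.norm.pdf(x, -2, 0.5) + 0.5 * stats.norm.pdf(x, 2, 0.5)
p /= p.sum() * dx

# Inclusive KL(p || q): the optimal Gaussian matches the moments of p (mass-covering)
m_inc = np.sum(p * x) * dx
v_inc = np.sum(p * (x - m_inc) ** 2) * dx

# Exclusive KL(q || p): numerically minimize E_q[log q - log p] (mode-seeking)
def kl_q_p(params):
    m, log_s = params
    q = stats.norm.pdf(x, m, np.exp(log_s))
    return np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dx

res = optimize.minimize(kl_q_p, x0=[1.5, np.log(0.7)])
m_exc, s_exc = res.x[0], np.exp(res.x[1])

print(f"inclusive fit:  N({m_inc:.2f}, {v_inc:.2f})")        # ~N(0.00, 4.25), covers both modes
print(f"exclusive fit: N({m_exc:.2f}, {s_exc ** 2:.2f})")    # hugs the mode near +2
```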

The Factored Posterior

Consider a posterior that factorizes as p(θ | y) ∝ p(θ) ∏ᵢ p(yᵢ | θ) = ∏ᵢ fᵢ(θ). EP approximates each factor fᵢ(θ) with a simpler term f̃ᵢ(θ) from an exponential family, so that the global approximation is q(θ) ∝ ∏ᵢ f̃ᵢ(θ). Since the product of exponential-family terms remains in the exponential family, q(θ) is tractable.
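Since this closure property is what keeps q(θ) tractable, here is a tiny Gaussian illustration (the three site values below are made up for the example): an unnormalized Gaussian site exp(νθ − ½τθ²) is described by its natural parameters (ν, τ), and multiplying sites simply adds them.

```python
import numpy as np

# Three hypothetical Gaussian sites, each stored as (nu_i, tau_i) =
# (precision-times-mean, precision).  Their product is again Gaussian.
sites = np.array([[2.0, 1.0], [-0.5, 0.3], [1.2, 2.5]])
nu, tau = sites.sum(axis=0)          # product of sites = sum of natural parameters
mean, var = nu / tau, 1.0 / tau
print(f"q(theta) = N({mean:.3f}, {var:.3f})")   # N(0.711, 0.263)
```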

EP Update for Factor i

1. Compute the cavity distribution: q₋ᵢ(θ) ∝ q(θ) / f̃ᵢ(θ)
2. Form the tilted distribution: p̂ᵢ(θ) ∝ q₋ᵢ(θ) · fᵢ(θ)
3. Project: find f̃ᵢ_new(θ) such that q₋ᵢ(θ) · f̃ᵢ_new(θ) has the same moments as p̂ᵢ(θ)
4. Update: q(θ) ∝ q₋ᵢ(θ) · f̃ᵢ_new(θ)

The key step is the moment-matching projection (step 3), which is equivalent to minimizing KL(p̂ᵢ ‖ q_new), the inclusive KL divergence. For Gaussian approximations, this amounts to matching the mean and covariance of the tilted distribution. EP iterates through all factors until convergence.
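To make the four steps concrete, the following sketch runs EP for a one-dimensional Normal mean under a heavy-tailed Student-t likelihood, an assumption made here purely for illustration. The function name ep_normal_mean, the hyperparameters sigma = 0.5 and df = 4, and the quadrature-based moment matching are choices of this sketch rather than part of Minka's original presentation. Each site f̃ᵢ is an unnormalized Gaussian stored by its natural parameters, and steps 1–4 appear as numbered comments in the loop.

```python
import numpy as np
from scipy import stats

def ep_normal_mean(y, prior_mean=0.0, prior_var=10.0, sigma=0.5, df=4,
                   n_sweeps=30, damping=0.8):
    """EP sketch: Gaussian prior on the mean, Student-t likelihood factors (1-D)."""
    n = len(y)
    tau_prior = 1.0 / prior_var            # prior precision
    nu_prior = prior_mean / prior_var      # prior precision-times-mean
    tau_site = np.zeros(n)                 # site precisions of the f~_i
    nu_site = np.zeros(n)                  # site precision-times-means

    for _ in range(n_sweeps):
        for i in range(n):
            tau_q = tau_prior + tau_site.sum()
            nu_q = nu_prior + nu_site.sum()
            # 1. cavity: divide the current q by site i
            tau_cav = tau_q - tau_site[i]
            nu_cav = nu_q - nu_site[i]
            m_cav, v_cav = nu_cav / tau_cav, 1.0 / tau_cav
            # 2. tilted: cavity times the exact Student-t factor, on a grid
            sd_cav = np.sqrt(v_cav)
            grid = np.linspace(m_cav - 10 * sd_cav, m_cav + 10 * sd_cav, 2001)
            log_tilted = (stats.norm.logpdf(grid, m_cav, sd_cav)
                          + stats.t.logpdf(y[i], df, loc=grid, scale=sigma))
            w = np.exp(log_tilted - log_tilted.max())
            w /= w.sum()
            # 3. project: match mean and variance of the tilted distribution
            m_new = np.sum(w * grid)
            v_new = np.sum(w * (grid - m_new) ** 2)
            # 4. update site i so that cavity * site has the matched moments
            #    (clamping is a simplification: full EP allows negative site precisions)
            tau_target = max(1.0 / v_new - tau_cav, 1e-8)
            nu_target = m_new / v_new - nu_cav
            tau_site[i] = damping * tau_target + (1 - damping) * tau_site[i]
            nu_site[i] = damping * nu_target + (1 - damping) * nu_site[i]

    tau_q = tau_prior + tau_site.sum()
    nu_q = nu_prior + nu_site.sum()
    return nu_q / tau_q, 1.0 / tau_q       # mean and variance of q(theta)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(2.77, 0.42, size=20)
    y[0] = 15.0                            # inject a single gross outlier
    m, v = ep_normal_mean(y)
    print(f"EP posterior approximation: N({m:.2f}, {v:.4f})")
```

Moment matching is done numerically on a grid here because the Student-t tilted distribution has no closed-form moments; in models such as probit Gaussian process classification, the tilted moments are available analytically.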

Relationship to Other Methods

EP can be viewed as a generalization of several classical algorithms. With a single likelihood factor, one EP update reduces to a single moment-matching projection of the exact posterior onto the approximating family. On discrete graphical models with fully factorized approximations, EP recovers loopy belief propagation. In the context of Gaussian process classification, EP provides more accurate marginal likelihood estimates than the Laplace approximation or variational bounds. The assumed density filtering (ADF) algorithm is the "one-pass" version of EP that processes each factor exactly once.
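A usage note on the sketch above: its n_sweeps parameter (a name introduced in that sketch) makes the ADF relationship explicit. Setting n_sweeps = 1 performs the single ADF pass; additional sweeps revisit each site in light of the others, which is exactly the refinement that distinguishes EP from ADF.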

Convergence Caveats

Unlike variational Bayes, EP is not guaranteed to converge. The algorithm can oscillate or diverge, particularly when likelihood factors are highly non-Gaussian. Damping (taking convex combinations of old and new updates) and power EP (raising each factor to a fractional power α) are common stabilization techniques. Despite these issues, EP often provides superior approximations when it does converge, particularly for binary classification and robust regression.
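Damping is simple enough to state in code. The helper below (its name and signature are ours, not a standard API) applies the convex combination to a site's natural parameters, which is what the damping argument in the earlier sketch does. Power EP is not shown, but it works with fractional powers fᵢ^α of each factor and interpolates toward variational-style updates as α → 0.

```python
def damped_site_update(old_params, matched_params, step=0.5):
    """Convex combination of old and freshly matched site natural parameters.

    step = 1.0 is the undamped EP update; smaller values slow convergence
    but often suppress the oscillation and divergence described above.
    """
    return tuple((1.0 - step) * o + step * m
                 for o, m in zip(old_params, matched_params))
```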

Historical Context

1998

Opper and Winther developed adaptive TAP (Thouless–Anderson–Palmer) methods for probabilistic models, foreshadowing EP's moment-matching approach.

2001

Thomas Minka introduced Expectation Propagation in his doctoral work at MIT, unifying ADF, loopy belief propagation, and moment matching into a single framework.

2006

Rasmussen and Williams' textbook Gaussian Processes for Machine Learning established EP as the method of choice for GP classification, bringing it to a wide audience.

2016–2020

Connections between EP and stochastic gradient methods were explored; stochastic and black-box EP variants extended the algorithm to large datasets and more complex models.

Applications

EP is particularly well-suited to Gaussian process classification, where the posterior over latent function values is non-Gaussian due to a probit or logistic link function. It also excels in sparse Bayesian learning, Bayesian linear mixed models, and mixture-of-experts models. In telecommunications, EP has been applied to turbo decoding problems, where its message-passing structure maps naturally onto factor graphs.

In Bayesian neural networks and deep learning, EP ideas have influenced modern posterior approximation techniques, including natural-gradient variational inference, whose updates take a closely related local natural-parameter form. The connections between EP, belief propagation, and variational methods make it a theoretically rich algorithm that bridges multiple inference paradigms.

"Expectation Propagation finds a middle ground between the computational simplicity of variational methods and the accuracy of sampling, by replacing global KL minimization with a series of local moment-matching operations."— Thomas Minka, 2001

Worked Example: EP vs Laplace for Normal Mean Estimation

We compare three posterior approximation methods for estimating a Normal mean. With clean data, all methods agree. We then introduce an outlier to see how EP's moment-matching provides robustness.

Given: clean data (n=20) with sample mean = 2.77 and SD = 0.42 (so s² ≈ 0.176); the observation variance is treated as known.
Prior: μ ~ N(0, 10)

Step 1: Clean Data — All Methods Agree
Posterior precision: 1/10 + 20/0.176 = 0.1 + 113.6 = 113.7
Posterior mean: (0 + 20·2.77/0.176)/113.7 = 2.77
Exact = Laplace = EP = N(2.77, 0.0088)
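The Step 1 arithmetic can be checked with a few lines (a sketch using the figures above, with the observation variance treated as known):

```python
# Conjugate Normal-mean update with known variance, using the numbers from Step 1.
prior_mean, prior_var = 0.0, 10.0
n, xbar, s2 = 20, 2.77, 0.176                # s2 = 0.42**2, rounded as above

post_prec = 1 / prior_var + n / s2           # 0.1 + 113.6 = 113.7
post_mean = (prior_mean / prior_var + n * xbar / s2) / post_prec
post_var = 1 / post_prec

print(round(post_mean, 2), round(post_var, 4))   # 2.77 0.0088
```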

Step 2: Add Outlier (replace x₁ with 15.0)
Contaminated mean = 3.38, contaminated s² = 7.90
Exact posterior mean: 3.37 (pulled toward outlier)

Step 3: EP with Down-Weighted Outlier
EP assigns weight 0.5 to the outlier via moment matching
EP effective n = 19 + 0.5 = 19.5
EP posterior mean ≈ 2.80 (close to the clean-data estimate)
Shift from clean: Exact = 0.60, EP = 0.03

Under a robustified (heavy-tailed) likelihood, EP's local moment matching naturally down-weights observations that are inconsistent with the majority of the data. While the exact Normal-likelihood posterior shifts by 0.60 due to the outlier, EP shifts by only 0.03. The Laplace approximation, being mode-based, is pulled toward the outlier as well. This demonstrates a practical advantage of EP: robustness to outliers emerges automatically from the iterative refinement of the site approximations.

Interactive Calculator

Each input row is a single numeric observation. The calculator compares three posterior approximation methods for a Normal mean with known variance: Exact (conjugate), Laplace (mode + curvature), and EP (moment matching). For Normal models all three coincide exactly; the differences emerge via robustified likelihoods.

