The Watanabe–Akaike Information Criterion (WAIC), also known as the widely applicable information criterion (Watanabe's own name for it), was developed by Sumio Watanabe as a Bayesian generalization of AIC that is valid even for singular statistical models — models where the Fisher information matrix is degenerate. Unlike DIC, which relies on a point estimate (the posterior mean), WAIC integrates over the full posterior distribution, providing a more principled measure of out-of-sample predictive accuracy.
Definition and Components
Log pointwise predictive density:
lppd = ∑ᵢ₌₁ⁿ log E_θ|y[p(yᵢ | θ)] = ∑ᵢ₌₁ⁿ log ∫ p(yᵢ | θ) p(θ | y) dθ
Effective number of parameters (variance form):
p_WAIC = ∑ᵢ₌₁ⁿ Var_θ|y[log p(yᵢ | θ)]
WAIC = −2(lppd − p_WAIC)
The lppd (log pointwise predictive density) measures how well the model predicts each observed data point, averaged over the posterior. The penalty p_WAIC captures the effective complexity by summing, for each observation, the posterior variance of its log predictive density. A large variance indicates that the model's predictions for that point are sensitive to the particular parameter values — a sign of overfitting.
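To make the penalty concrete, here is a minimal sketch in Python (three hypothetical posterior draws with made-up density values, not output from any real model) of one observation's contribution to lppd and to p_WAIC:

```python
import numpy as np

# Hypothetical predictive densities p(y_i | theta^(s)) for a single observation
# y_i under three posterior draws (made-up numbers, for illustration only).
dens = np.array([0.5, 0.4, 0.1])

lppd_i = np.log(dens.mean())          # log of the posterior-averaged density
p_waic_i = np.log(dens).var(ddof=1)   # sample variance of the log densities

print(lppd_i)    # ~ -1.10: this point's contribution to lppd
print(p_waic_i)  # ~ 0.76: this point's contribution to p_WAIC
```

The spread of the log densities across draws, not their level, drives the penalty: a point whose predictive density barely changes from draw to draw contributes almost nothing to p_WAIC.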
Computation from MCMC Samples
Given S posterior draws {θ⁽¹⁾, …, θ⁽ˢ⁾}, WAIC is computed as:
lppd ≈ ∑ᵢ log(1/S ∑ₛ p(yᵢ | θ⁽ˢ⁾)) — the log of the average likelihood for each data point.
p_WAIC ≈ ∑ᵢ Var_s[log p(yᵢ | θ⁽ˢ⁾)] — the sample variance of the log-likelihood for each point across posterior draws.
This computation is straightforward and, like DIC, works directly from standard MCMC output. The key advantage is that it uses the full posterior, not just a point estimate.
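As a sketch of this recipe (assuming the pointwise log-likelihoods have already been evaluated and stored as an S × n matrix with one row per posterior draw; the function name and array layout here are illustrative, not taken from any particular package):

```python
import numpy as np

def waic_from_log_lik(log_lik):
    """Compute lppd, p_WAIC, and WAIC from an (S, n) matrix of pointwise
    log-likelihoods, log_lik[s, i] = log p(y_i | theta^(s))."""
    S = log_lik.shape[0]
    # lppd_i = log( (1/S) * sum_s p(y_i | theta^(s)) ), accumulated in log
    # space (logsumexp) so that very negative log-likelihoods do not underflow.
    lppd_i = np.logaddexp.reduce(log_lik, axis=0) - np.log(S)
    # p_WAIC_i = sample variance across draws of log p(y_i | theta^(s))
    p_waic_i = log_lik.var(axis=0, ddof=1)
    lppd, p_waic = lppd_i.sum(), p_waic_i.sum()
    return lppd, p_waic, -2.0 * (lppd - p_waic)

# Simulated draws standing in for real MCMC output (4000 draws, 10 data points).
rng = np.random.default_rng(0)
fake_log_lik = rng.normal(loc=-1.2, scale=0.3, size=(4000, 10))
print(waic_from_log_lik(fake_log_lik))
```

Accumulating in log space matters in practice: exponentiating large negative log-likelihoods directly can underflow to zero and silently corrupt the lppd term.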
WAIC is asymptotically equivalent to Bayesian leave-one-out cross-validation (LOO-CV) and to the Bayesian predictive information criterion (BPIC). Vehtari, Gelman, and Gabry (2017) showed that Pareto-smoothed importance sampling LOO (PSIS-LOO) provides a more robust estimate than WAIC when individual terms in the lppd sum are highly variable, because importance sampling diagnostics can flag problematic observations. In current best practice (e.g., the loo R package), PSIS-LOO is generally preferred over WAIC, though both target the same quantity: expected log predictive density (elpd).
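For intuition about how closely WAIC and LOO-CV are related, here is a minimal sketch of plain importance-sampling LOO computed from the same log-likelihood matrix (my own illustration, not the PSIS algorithm of Vehtari et al.; PSIS additionally Pareto-smooths the raw importance weights, which is exactly what stabilizes the cases this naive estimator handles poorly):

```python
import numpy as np

def naive_is_loo(log_lik):
    """Plain importance-sampling LOO (no Pareto smoothing) from an (S, n)
    matrix of pointwise log-likelihoods. With the full posterior as proposal,
    the raw importance weight of draw s for point i is 1 / p(y_i | theta^(s)),
    so the LOO predictive density reduces to a harmonic-mean-style average."""
    S = log_lik.shape[0]
    # elpd_loo_i = -log( (1/S) * sum_s exp(-log_lik[s, i]) )
    elpd_loo_i = -(np.logaddexp.reduce(-log_lik, axis=0) - np.log(S))
    return elpd_loo_i.sum()
```

The raw weights can have very heavy tails when a single observation is highly influential, which is precisely what the Pareto-k diagnostic in PSIS-LOO flags.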
Theoretical Foundations
Watanabe's singular learning theory provides the mathematical underpinning. In regular models (where the Fisher information is positive definite), WAIC is asymptotically equivalent to AIC up to lower-order terms. In singular models — including mixture models, hidden Markov models, Bayesian neural networks, and many hierarchical models — the effective dimensionality is not a whole number, and AIC/BIC can fail. Watanabe showed that WAIC remains valid in this setting, where the asymptotic learning behavior is governed by the real log canonical threshold (RLCT), a geometric invariant also known as the model's learning coefficient.
Historical Context
Akaike's AIC (1974) established information-theoretic model selection for regular models.
DIC (Spiegelhalter et al., 2002) was introduced for Bayesian model comparison via MCMC, but with known limitations for non-Gaussian posteriors.
Sumio Watanabe (2010) published WAIC as part of his broader theory of singular learning, proving its validity for singular models.
Gelman, Hwang, and Vehtari (2014) brought WAIC into mainstream applied statistics, comparing it to DIC and LOO-CV and advocating its use as a principled Bayesian model comparison tool.
Practical Recommendations
WAIC (and the closely related PSIS-LOO) should be used when the goal is to compare models' predictive performance, which is the most common model comparison objective. For comparing nested models with the same data, Bayes factors remain the gold standard for hypothesis testing. For non-nested models or when prior sensitivity is a concern, robust Bayesian analysis should complement any information criterion. WAIC is implemented in Stan (via the loo package), PyMC, and other modern Bayesian software.
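For example, with ArviZ in Python (a minimal sketch: the fake log-likelihood arrays below stand in for real PyMC or Stan output, and any fit that stores a pointwise log_likelihood group in its InferenceData works the same way):

```python
import arviz as az
import numpy as np

# Fake pointwise log-likelihoods with shape (chains, draws, observations),
# standing in for real sampler output.
rng = np.random.default_rng(0)
idata_1 = az.from_dict(posterior={"mu": rng.normal(0, 1, (4, 1000))},
                       log_likelihood={"y": rng.normal(-1.05, 0.3, (4, 1000, 10))})
idata_2 = az.from_dict(posterior={"mu": rng.normal(0, 1, (4, 1000))},
                       log_likelihood={"y": rng.normal(-1.21, 0.3, (4, 1000, 10))})

print(az.waic(idata_1))  # elpd_waic, p_waic, and a standard error
print(az.loo(idata_1))   # PSIS-LOO with Pareto-k diagnostics
print(az.compare({"model_1": idata_1, "model_2": idata_2}, ic="waic"))
```

Note that ArviZ reports these quantities on the elpd (log) scale by default; multiply by −2 to match the deviance-scale formula used above.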
"WAIC bridges information theory and Bayesian statistics, providing a model comparison criterion that respects the full posterior and remains valid even when classical regularity conditions fail."— Sumio Watanabe, 2010
Worked Example: WAIC for Two Competing Models
We compute WAIC from pointwise log predictive densities for two models fit to the same 10 observations. Throughout, WAIC uses the full posterior, not a plug-in point estimate.
Pointwise values log(1/S ∑ₛ p(yᵢ | θ⁽ˢ⁾)) for Model 2: −1.5, −1.0, −1.3, −1.1, −1.4, −1.2, −1.0, −1.6, −1.1, −0.9. (Model 1's pointwise values are not listed; only the totals below enter the comparison.)
Step 1: lppd (log pointwise predictive density)
lppd₁ = ∑ᵢ log(1/S ∑ₛ p(yᵢ | θ⁽ˢ⁾)) = −10.5
lppd₂ = ∑ᵢ log(1/S ∑ₛ p(yᵢ | θ⁽ˢ⁾)) = −12.1
Step 2: p_WAIC (effective number of parameters)
p_WAIC = ∑ᵢ Var_θ|y[log p(yᵢ | θ)]
p_WAIC₁ ≈ 0.82
p_WAIC₂ ≈ 0.45
Step 3: WAIC = −2(lppd − p_WAIC)
WAIC₁ = −2(−10.5 − 0.82) = 22.64
WAIC₂ = −2(−12.1 − 0.45) = 25.10
ΔWAIC = WAIC₁ − WAIC₂ = −2.46
Model 1 has the lower WAIC (22.64 vs 25.10), indicating better expected out-of-sample predictive performance. The difference of 2.46 is modest; in practice one would also examine the standard error of the pointwise difference (as reported by loo or ArviZ) before treating it as decisive. Model 1 achieves this despite having a higher effective parameter count (p_WAIC = 0.82 vs 0.45) because its log pointwise predictive density is substantially better (−10.5 vs −12.1). Unlike DIC, WAIC averages over the full posterior distribution rather than plugging in a point estimate, which makes it more reliable when posteriors are far from Normal.
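A quick check of the arithmetic, plugging in the lppd and p_WAIC totals from the worked example:

```python
# Reproduce Step 3 from the summary quantities in the worked example.
lppd = {"model_1": -10.5, "model_2": -12.1}
p_waic = {"model_1": 0.82, "model_2": 0.45}

waic = {m: -2 * (lppd[m] - p_waic[m]) for m in lppd}
print(waic)                               # {'model_1': 22.64, 'model_2': 25.1}
print(waic["model_1"] - waic["model_2"])  # ~ -2.46: Model 1 is preferred
```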