
Bayesian Interpretation of Kernel Regularization

The Bayesian interpretation of kernel regularization reveals that penalized regression in a reproducing kernel Hilbert space (RKHS) is equivalent to maximum a posteriori (MAP) estimation under a Gaussian process (GP) prior, unifying the frequentist and Bayesian perspectives on function estimation.

f* = argmin Σᵢ L(yᵢ, f(xᵢ)) + λ‖f‖²_ℋ ⟺ MAP estimate under GP prior

Regularization is the standard frequentist remedy for ill-posed or overfit function estimation: add a penalty term that discourages overly complex solutions. Kernel regularization — penalizing the RKHS norm of the estimated function — is the foundation of support vector machines, kernel ridge regression, and smoothing splines. The remarkable result connecting these methods to Bayesian statistics is that the regularized solution is precisely the maximum a posteriori (MAP) estimate under a Gaussian process prior whose covariance function is the reproducing kernel of the RKHS.

The Equivalence

Regularized Optimization (Frequentist)
f* = argmin_{f ∈ ℋ} [ Σᵢ₌₁ⁿ L(yᵢ, f(xᵢ)) + λ ‖f‖²_ℋ ]

MAP Estimation (Bayesian)
f_MAP = argmax_{f} [ log P(y|f) + log P(f) ]
      = argmin_{f} [ −log P(y|f) − log P(f) ]

Correspondence
P(f) = GP(0, k)  ⟹  −log P(f) ∝ ‖f‖²_ℋ
Gaussian likelihood with noise variance σ²  ⟹  λ = σ² (regularization parameter)

When the loss L is squared error, corresponding to a Gaussian likelihood, the negative log-posterior is, up to additive constants and an overall scale, exactly the regularized least-squares objective. The regularization parameter λ plays the role of the noise variance σ², and the RKHS norm penalty ‖f‖²_ℋ is (up to a factor of ½) the negative log-prior under a zero-mean GP with kernel k. The representer theorem guarantees that the solution lies in the span of the kernel functions at the training points, f*(x) = Σᵢ αᵢ k(x, xᵢ), and for squared-error loss the coefficients take the familiar closed form α = (K + λI)⁻¹ y, where K is the kernel matrix on the training inputs.
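
As a concrete check, the following sketch assumes squared-error loss, an RBF kernel, and a small synthetic one-dimensional data set (all illustrative choices). It computes the regularized fit through the representer-theorem coefficients α = (K + λI)⁻¹ y and compares it with the GP posterior mean k(x*, X)(K + σ²I)⁻¹ y, which coincide when λ = σ².

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix k(a, b) = exp(-(a - b)^2 / (2 l^2))."""
    sq = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=20)            # training inputs
y = np.sin(X) + 0.1 * rng.normal(size=20)  # noisy observations
Xs = np.linspace(-3, 3, 5)                 # test inputs

noise_var = 0.01   # sigma^2, the assumed Gaussian noise variance
lam = noise_var    # lambda = sigma^2 under the equivalence

K = rbf_kernel(X, X)
Ks = rbf_kernel(Xs, X)

# Frequentist route: the representer theorem gives f*(x) = sum_i alpha_i k(x, x_i),
# with alpha = (K + lambda I)^{-1} y minimizing the penalized squared error.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
f_ridge = Ks @ alpha

# Bayesian route: GP posterior mean at the test points, k_*^T (K + sigma^2 I)^{-1} y.
f_gp = Ks @ np.linalg.solve(K + noise_var * np.eye(len(X)), y)

print(np.allclose(f_ridge, f_gp))  # True: the two estimates coincide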

1970

Kimeldorf and Wahba establish the correspondence between spline smoothing and Bayesian estimation, showing that cubic splines are the MAP estimates under a GP prior with a specific covariance structure.

1990

Grace Wahba's Spline Models for Observational Data develops the Bayesian interpretation of smoothing splines in depth, connecting cross-validation to marginal likelihood estimation.

1998–2001

The connection between SVMs and GP classification is explored by several authors. Sollich (1999) and Opper and Winther (2000) analyze the Bayesian interpretation of kernel classifiers.

2006

Rasmussen and Williams provide a unified treatment in Gaussian Processes for Machine Learning, making the RKHS-GP equivalence accessible to a broad audience and demonstrating its practical implications.

Beyond MAP: The Full Bayesian Advantage

The equivalence between regularization and MAP estimation is illuminating but also reveals the limitations of regularization viewed in isolation. MAP estimation discards posterior uncertainty — it produces a point estimate but no error bars. The full Bayesian GP approach goes further: it computes the posterior distribution over functions, providing predictive variance at every test point. This uncertainty quantification is not available from the regularized solution alone.

Moreover, the Bayesian approach provides a principled way to select λ (equivalently, σ²) and the kernel hyperparameters. The marginal likelihood — obtained by integrating over the function space — balances data fit against model complexity automatically. Maximizing the marginal likelihood (type-II maximum likelihood) is the Bayesian counterpart of cross-validation, and in practice the two often agree, but the marginal likelihood is typically cheaper to compute (no held-out folds or repeated refits are required) and provides a smooth, differentiable objective in the hyperparameters.
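
To make concrete what the full treatment adds, here is a minimal sketch of the standard Cholesky-based computation for a zero-mean GP with Gaussian noise; it returns the posterior mean, the predictive variance at each test point, and the log marginal likelihood. The function and its arguments are illustrative, not tied to any particular library.

import numpy as np

def gp_posterior_and_evidence(K, Ks, Kss_diag, y, noise_var):
    """Posterior mean, predictive variance, and log marginal likelihood
    for a zero-mean GP with Gaussian observation noise."""
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    mean = Ks @ alpha                                      # same as the MAP / kernel ridge fit
    V = np.linalg.solve(L, Ks.T)
    var = Kss_diag - np.sum(V**2, axis=0) + noise_var      # predictive variance (incl. noise)
    log_evidence = (-0.5 * y @ alpha
                    - np.sum(np.log(np.diag(L)))
                    - 0.5 * n * np.log(2 * np.pi))
    return mean, var, log_evidence

# Usage with the toy data from the earlier sketch (for that RBF kernel, k(x, x) = 1):
# mean, var, lml = gp_posterior_and_evidence(K, Ks, np.ones(len(Xs)), y, noise_var)

Maximizing the returned log marginal likelihood over the kernel hyperparameters and noise_var implements the type-II maximum likelihood procedure described above; the predictive variance is exactly the uncertainty that the regularized point estimate alone does not provide.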

Regularization Paths and Prior Families

Different regularization penalties correspond to different priors. L₂ regularization (ridge regression, Tikhonov regularization) corresponds to a Gaussian prior — the GP setting. L₁ regularization (LASSO) corresponds to a Laplace prior, promoting sparsity. Elastic net (L₁ + L₂) corresponds to a prior whose negative log-density is a weighted sum of Laplace and Gaussian terms. Group lasso corresponds to priors that induce sparsity at the group level. In each case, the regularized solution is the MAP estimate under the corresponding prior, and the full Bayesian treatment generalizes the regularization approach by computing the entire posterior rather than just its mode.
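
To make the penalty-to-prior dictionary concrete, the following sketch (with an arbitrary coefficient vector beta as an illustrative stand-in for the model parameters) writes each penalty as the negative log of its corresponding prior density, up to additive constants that do not affect the argmin.

import numpy as np

def neg_log_gaussian_prior(beta, lam):
    """L2 / ridge penalty: Gaussian prior with precision proportional to lam."""
    return lam * np.sum(beta**2)

def neg_log_laplace_prior(beta, lam):
    """L1 / lasso penalty: Laplace prior with rate proportional to lam."""
    return lam * np.sum(np.abs(beta))

def neg_log_elastic_net_prior(beta, lam1, lam2):
    """Elastic net: negative log-prior is a weighted sum of Laplace and Gaussian terms."""
    return lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(beta**2)

def neg_log_group_prior(beta, groups, lam):
    """Group lasso: a group-sparse prior penalizing each group's L2 norm."""
    return lam * sum(np.linalg.norm(beta[g]) for g in groups)

beta = np.array([0.0, 1.5, -0.2, 0.0, 3.0])
print(neg_log_gaussian_prior(beta, 1.0), neg_log_laplace_prior(beta, 1.0))
print(neg_log_group_prior(beta, [[0, 1], [2, 3, 4]], 1.0))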

RKHS and Function-Space Priors

The reproducing kernel Hilbert space ℋ associated with kernel k is built from functions of the form f(x) = Σᵢ αᵢ k(x, xᵢ) (finite expansions and their limits), with norm ‖f‖²_ℋ = αᵀKα for such expansions, where Kᵢⱼ = k(xᵢ, xⱼ). A GP with covariance k places most of its prior mass on functions with small RKHS norm, that is, on smooth, regular functions: heuristically, log P(f) ≈ −½ ‖f‖²_ℋ + const, so functions with large RKHS norm (rough, rapidly varying) receive exponentially less prior probability. The RKHS norm thus quantifies the "complexity" of a function as measured by the prior.
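
A short sketch, assuming an RBF kernel and randomly chosen expansion coefficients, that computes the RKHS norm of a finite kernel expansion and the corresponding unnormalized log-prior under this heuristic.

import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=10)   # centers of the kernel expansion
alpha = rng.normal(size=10)       # coefficients: f(.) = sum_i alpha_i k(., x_i)

K = rbf(x, x)                     # kernel matrix K_ij = k(x_i, x_j)
rkhs_norm_sq = alpha @ K @ alpha  # ||f||_H^2 = alpha^T K alpha

# Rougher functions (larger RKHS norm) are exponentially less probable a priori.
unnormalized_log_prior = -0.5 * rkhs_norm_sq
print(rkhs_norm_sq, unnormalized_log_prior)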

Subtly, when the RKHS is infinite-dimensional, GP sample paths almost surely do not lie in the RKHS (they are rougher than any RKHS function). The RKHS contains the posterior mean and the MAP estimate, but not the typical sample path. This distinction matters for understanding the difference between MAP estimation (which returns an RKHS function) and full posterior inference (which produces sample paths of different regularity).

Practical Implications

The Bayesian interpretation enriches standard kernel methods in several ways. It provides uncertainty quantification, enables principled hyperparameter selection, and supports model comparison through Bayes factors. For practitioners, it means that the familiar regularized regression formulas they already use are implicitly Bayesian — and that going fully Bayesian typically requires only modest additional computation while providing substantially richer output.

"Every regularized estimator is secretly a MAP estimator. Making the prior explicit doesn't change the point estimate — but it opens the door to everything else Bayesian inference has to offer." — Grace Wahba, on the connection between splines and Bayes
