Bayesian Statistics

Kernel (Statistics)

In Bayesian statistics, a kernel function defines the covariance structure of a Gaussian process, encoding prior assumptions about smoothness, periodicity, and length-scale of the unknown function, and serving as the bridge between nonparametric function estimation and reproducing kernel Hilbert spaces.

k(x, x′) = σ² · exp(−‖x − x′‖² / (2ℓ²)) [squared exponential]

A kernel function k(x, x′) takes two inputs and returns a real number measuring their "similarity" in a sense relevant to the problem at hand. In the context of Bayesian nonparametric modeling — particularly Gaussian processes — the kernel serves as the covariance function of a prior distribution over functions. The choice of kernel encodes all prior assumptions about the function to be learned: its smoothness, its typical amplitude, its characteristic length-scales, whether it exhibits periodicity, and how it behaves at large distances.
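As a concrete illustration, the following minimal sketch (assuming NumPy; the function and variable names are illustrative, not from any particular library) implements the squared exponential kernel above and evaluates it on a handful of inputs to form a covariance matrix.

    import numpy as np

    def squared_exponential(x, x_prime, signal_var=1.0, length_scale=1.0):
        # k(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 * l^2))
        sq_dist = np.sum((np.atleast_1d(x) - np.atleast_1d(x_prime)) ** 2)
        return signal_var * np.exp(-sq_dist / (2.0 * length_scale ** 2))

    # Evaluate the kernel on a few 1-D inputs to form the covariance (Gram) matrix.
    X = [0.0, 0.5, 1.0, 2.0]
    K = np.array([[squared_exponential(a, b) for b in X] for a in X])
    print(K.round(3))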

Properties of Valid Kernels

Positive Semi-Definiteness (Mercer's Condition)
Σᵢ Σⱼ cᵢcⱼ k(xᵢ, xⱼ) ≥ 0    for all finite {xᵢ} and real {cᵢ}

Common Kernel Functions
Squared Exponential: k(x,x′) = σ² exp(−‖x−x′‖² / (2ℓ²))
Matérn-ν: k(r) = σ² · (2^{1−ν}/Γ(ν)) · (√(2ν)r/ℓ)^ν · K_ν(√(2ν)r/ℓ)
Periodic: k(x,x′) = σ² exp(−2 sin²(π|x−x′|/p) / ℓ²)
Linear: k(x,x′) = σ² · (x−c)ᵀ(x′−c)

A function k is a valid kernel if and only if the Gram matrix K with entries Kᵢⱼ = k(xᵢ, xⱼ) is positive semi-definite for every finite set of inputs. Mercer's theorem establishes that any such kernel corresponds to an inner product in some (possibly infinite-dimensional) feature space: k(x, x′) = ⟨φ(x), φ(x′)⟩. This is the "kernel trick" — computations that depend only on inner products can be carried out in the high-dimensional feature space without ever computing the feature mapping φ explicitly.
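A quick numerical sanity check of Mercer's condition is to form the Gram matrix on a finite input set and inspect its eigenvalues. A minimal sketch, assuming NumPy and the illustrative helper names below:

    import numpy as np

    def se_kernel(x, x_prime, signal_var=1.0, length_scale=1.0):
        return signal_var * np.exp(-(x - x_prime) ** 2 / (2.0 * length_scale ** 2))

    # Mercer's condition on a finite set: the Gram matrix of a valid kernel
    # has no eigenvalue below zero (up to floating-point round-off).
    X = np.linspace(0.0, 3.0, 8)
    K = np.array([[se_kernel(a, b) for b in X] for a in X])
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True for a valid kernel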

1909

James Mercer proves his theorem on the eigenfunction expansion of positive-definite kernels, laying the mathematical foundation for kernel methods.

1950s–1960s

Aronszajn develops the theory of reproducing kernel Hilbert spaces (RKHS). Parzen connects kernels to density estimation. The mathematical infrastructure for kernel methods matures.

1992–1995

Boser, Guyon, and Vapnik introduce the support vector machine (SVM), popularizing the kernel trick for classification. Simultaneously, Neal and Williams develop the GP perspective, connecting kernels to Bayesian priors.

2006

Rasmussen and Williams publish Gaussian Processes for Machine Learning, providing a comprehensive treatment of kernels as covariance functions in the Bayesian nonparametric setting.

Kernels as Bayesian Priors

In GP regression, placing a GP prior f ~ GP(0, k) over an unknown function is equivalent to specifying a prior distribution over functions whose sample paths have regularity determined by k. The squared exponential kernel produces infinitely differentiable sample paths. The Matérn-ν kernel produces paths that are ⌈ν⌉−1 times (mean-square) differentiable: the Matérn-1/2 gives Ornstein-Uhlenbeck (continuous but rough) paths, the Matérn-3/2 gives once-differentiable paths, and as ν → ∞ the Matérn converges to the squared exponential. The choice of kernel thus encodes a meaningful scientific prior about the expected regularity of the underlying process.
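The effect of the smoothness parameter can be seen by drawing prior sample paths. The following sketch (assuming NumPy; the kernel helpers are written out by hand rather than taken from any library) draws one sample f ~ GP(0, k) for three kernels of increasing smoothness:

    import numpy as np

    def matern12(r, l=0.5):
        return np.exp(-r / l)                                              # Ornstein-Uhlenbeck: rough
    def matern32(r, l=0.5):
        return (1.0 + np.sqrt(3) * r / l) * np.exp(-np.sqrt(3) * r / l)    # once differentiable
    def sq_exp(r, l=0.5):
        return np.exp(-r ** 2 / (2 * l ** 2))                              # infinitely differentiable

    x = np.linspace(0.0, 5.0, 200)
    r = np.abs(x[:, None] - x[None, :])                                    # pairwise distances
    rng = np.random.default_rng(0)
    for kern in (matern12, matern32, sq_exp):
        K = kern(r) + 1e-8 * np.eye(len(x))                                # jitter for numerical stability
        f = rng.multivariate_normal(np.zeros(len(x)), K)                   # one draw from the GP prior
        print(kern.__name__, f[:3].round(2))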

Kernel Composition

Kernels can be combined to build more expressive covariance functions. The sum of two valid kernels is valid (modeling additive contributions). The product of two valid kernels is valid (modeling interactions). A kernel applied to a transformed input is valid. This compositional algebra enables practitioners to construct kernels encoding complex prior knowledge: a sum of a periodic kernel and a squared exponential captures a signal with both periodic and smooth aperiodic components; a product of a linear kernel and a periodic kernel captures periodic patterns whose amplitude grows linearly with time. The "automatic statistician" project of Duvenaud, Lloyd, Grosse, Tenenbaum, and Ghahramani uses this compositionality to search over kernel structures, interpreting the resulting models in natural language.
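A minimal sketch of this compositional algebra (NumPy only; function names are illustrative) builds the two composites mentioned above and confirms that their Gram matrices remain positive semi-definite:

    import numpy as np

    def sq_exp(x, y, l=1.0):
        return np.exp(-(x - y) ** 2 / (2 * l ** 2))
    def periodic(x, y, p=1.0, l=1.0):
        return np.exp(-2 * np.sin(np.pi * np.abs(x - y) / p) ** 2 / l ** 2)
    def linear(x, y, c=0.0):
        return (x - c) * (y - c)

    def k_sum(x, y):  return periodic(x, y) + sq_exp(x, y, l=5.0)   # periodic + smooth aperiodic trend
    def k_prod(x, y): return linear(x, y) * periodic(x, y)          # periodicity with growing amplitude

    # Both composites are still valid covariance functions:
    X = np.linspace(0.1, 4.0, 6)
    for k in (k_sum, k_prod):
        K = np.array([[k(a, b) for b in X] for a in X])
        print(k.__name__, np.linalg.eigvalsh(K).min() >= -1e-10)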

Hyperparameter Learning

The kernel hyperparameters — length-scales ℓ, signal variance σ², periodicity p, and smoothness ν — control the prior over functions. In the Bayesian framework, these can be set by maximizing the marginal likelihood (type-II maximum likelihood or empirical Bayes), which automatically balances data fit against model complexity. Alternatively, a fully Bayesian treatment places priors on hyperparameters and marginalizes over them using MCMC or variational inference, producing richer uncertainty estimates.
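As a sketch of type-II maximum likelihood under a squared exponential kernel with Gaussian noise (assuming NumPy and SciPy; the toy data and function names are for illustration only), one minimizes the negative log marginal likelihood −log p(y | X, θ) = ½ yᵀK_y⁻¹y + ½ log|K_y| + (n/2) log 2π, where K_y = K_θ + σ_n²I:

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_marginal_likelihood(log_theta, X, y):
        l, sf2, sn2 = np.exp(log_theta)                 # length-scale, signal var, noise var
        r2 = (X[:, None] - X[None, :]) ** 2
        K = sf2 * np.exp(-r2 / (2 * l ** 2)) + sn2 * np.eye(len(X))
        L = np.linalg.cholesky(K)                       # K = L Lᵀ
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

    rng = np.random.default_rng(1)
    X = np.linspace(0.0, 5.0, 30)
    y = np.sin(X) + 0.1 * rng.standard_normal(30)       # toy data, assumed for illustration
    result = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), args=(X, y))
    print(np.exp(result.x))                             # learned hyperparameters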

Beyond Euclidean Inputs

Kernels can be defined on structured inputs: strings (string kernels), graphs (graph kernels), sets, distributions, and manifolds. This flexibility allows GP models to be applied to molecules (predicting drug activity from molecular graphs), text (modeling documents as distributions over words), and spatial data on the sphere (climate modeling on Earth's surface). Each kernel encodes a domain-specific notion of similarity, and the GP posterior provides calibrated uncertainty regardless of the input type.
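As one concrete example of a kernel on structured inputs, the sketch below implements a simple 3-spectrum string kernel that counts shared length-3 substrings; it is valid because it is an explicit inner product of substring-count feature vectors (the function names are illustrative, not from any particular library):

    from collections import Counter

    def spectrum_features(s, k=3):
        # Count every length-k substring of s.
        return Counter(s[i:i + k] for i in range(len(s) - k + 1))

    def spectrum_kernel(s, t, k=3):
        fs, ft = spectrum_features(s, k), spectrum_features(t, k)
        return sum(fs[sub] * ft[sub] for sub in fs)   # inner product in feature space

    print(spectrum_kernel("GATTACA", "GATTTACA"))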

"The kernel is the soul of a Gaussian process. Choose the kernel well and the model captures the structure of the problem; choose poorly and no amount of data will save you." — Carl Edward Rasmussen, Gaussian Processes for Machine Learning (2006)

Related Topics