Bayesian Statistics

Machine Learning & Deep Learning

Bayesian machine learning treats model parameters as random variables with posterior distributions, providing principled uncertainty quantification, automatic regularization, and model comparison — addressing the overconfidence and opacity that plague standard deep learning.

P(w | D) ∝ P(D | w) · P(w)

Machine learning and deep learning have achieved remarkable performance in tasks from image recognition to language understanding. But standard neural networks produce point predictions without reliable uncertainty estimates, are prone to overconfident extrapolation, and require ad hoc regularization and hyperparameter tuning. Bayesian machine learning addresses these limitations by maintaining a full posterior distribution over model parameters, treating prediction as marginalization over this posterior, and using the marginal likelihood for principled model selection.
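To make the posterior update concrete, here is a minimal NumPy sketch of Bayesian linear regression, the simplest model where the update P(w | D) ∝ P(D | w) · P(w) has a closed form. The toy data, prior precision alpha, and noise precision beta are illustrative assumptions, not part of the text above.

import numpy as np

# Toy data: y = 2x - 1 plus Gaussian noise (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = 2.0 * x - 1.0 + 0.1 * rng.normal(size=20)

Phi = np.column_stack([np.ones_like(x), x])   # design matrix with a bias feature
alpha, beta = 1.0, 100.0                      # prior precision, noise precision (assumed)

# Conjugate update: P(w | D) is Gaussian with covariance S_N and mean m_N.
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

# Predictive mean and variance at a new input follow by integrating the
# linear model against the Gaussian weight posterior.
phi_star = np.array([1.0, 0.5])
pred_mean = phi_star @ m_N
pred_var = 1.0 / beta + phi_star @ S_N @ phi_star

print("posterior mean weights:", m_N)
print("predictive mean, std at x* = 0.5:", pred_mean, np.sqrt(pred_var))

In neural networks this closed form is lost, which is what motivates the approximate inference methods described below.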

Bayesian Neural Networks

A Bayesian neural network (BNN) places prior distributions over the weights and biases, computes (or approximates) the posterior distribution given training data, and makes predictions by integrating over the weight posterior. The predictive distribution naturally captures two types of uncertainty: epistemic uncertainty (uncertainty about the model) that decreases with more data, and aleatoric uncertainty (inherent noise in the data) that does not.

Bayesian Predictive Distribution

P(y* | x*, D) = ∫ P(y* | x*, w) · P(w | D) dw

The integral over the weight posterior P(w | D) yields predictions that reflect both the best-fit model and the remaining uncertainty about the model parameters.
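In practice this integral is rarely tractable and is approximated by a Monte Carlo average over samples from (an approximation to) the weight posterior. A minimal sketch under toy assumptions: the "network" here is a one-weight sigmoid and the posterior samples are drawn from a stand-in Gaussian rather than a real inference procedure.

import numpy as np

# Monte Carlo approximation of the predictive integral:
# P(y* | x*, D) ≈ (1/S) Σ_s P(y* | x*, w_s),  with w_s ~ P(w | D).
rng = np.random.default_rng(0)

def predict(x, w):
    """Toy one-weight 'network': Bernoulli class probability via a sigmoid."""
    return 1.0 / (1.0 + np.exp(-w * x))

posterior_samples = rng.normal(loc=1.5, scale=0.5, size=1000)   # stand-in posterior
x_star = 2.0
probs = predict(x_star, posterior_samples)   # P(y* = 1 | x*, w_s) for each sample

predictive_mean = probs.mean()    # marginalized class probability
epistemic_spread = probs.std()    # disagreement across weight samples
print(predictive_mean, epistemic_spread)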

Exact Bayesian inference in neural networks is intractable due to the high dimensionality and complex geometry of the weight space. Practical approximations include variational inference (Bayes by Backprop), Monte Carlo dropout, stochastic gradient MCMC (SGLD, SGHMC), Laplace approximation, and deep ensembles. Each trades off fidelity to the true posterior against computational cost.
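Of these, Monte Carlo dropout is the easiest to retrofit onto an existing network: dropout stays active at prediction time and several stochastic forward passes are averaged. A minimal PyTorch sketch; the architecture, dropout rate, and sample count are illustrative choices rather than prescriptions.

import torch
import torch.nn as nn

# A small classifier with dropout; keeping dropout active at test time turns
# each forward pass into an approximate sample from the weight posterior.
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 3),
)

def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # keep dropout stochastic (MC dropout), even for prediction
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(x), dim=-1) for _ in range(n_samples)
        ])                            # shape: (n_samples, batch, classes)
    mean = probs.mean(dim=0)          # approximate predictive distribution
    std = probs.std(dim=0)            # spread across passes ~ epistemic uncertainty
    return mean, std

x = torch.randn(4, 10)                # dummy batch of inputs
mean, std = mc_dropout_predict(model, x)
print(mean.shape, std.shape)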

Bayesian Optimization

Bayesian optimization is a global optimization strategy for expensive black-box functions — particularly hyperparameter tuning of machine learning models. A surrogate model (typically a Gaussian process) maintains a posterior distribution over the objective function, and an acquisition function (expected improvement, probability of improvement, or upper confidence bound) selects the next evaluation point by balancing exploration and exploitation. Bayesian optimization has been shown to find good hyperparameters in far fewer evaluations than grid search or random search.
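A sketch of one step of this loop, using scikit-learn's Gaussian process regressor as the surrogate and expected improvement (for minimization) as the acquisition function; the objective function, kernel, and candidate grid are toy assumptions.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Stand-in for an expensive black-box function (e.g. validation loss)."""
    return np.sin(3 * x) + 0.1 * x**2

# A few initial evaluations of the expensive objective.
X = np.array([[-2.0], [0.0], [2.0]])
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# Expected improvement over a candidate grid.
candidates = np.linspace(-3, 3, 500).reshape(-1, 1)
mu, sigma = gp.predict(candidates, return_std=True)
best = y.min()
improvement = best - mu
z = improvement / np.maximum(sigma, 1e-9)
ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)

x_next = candidates[np.argmax(ei)]          # next point to evaluate: trades off
print("next evaluation point:", x_next)     # low predicted loss vs. high uncertainty

In a full loop, the objective is evaluated at x_next, the new observation is added to the data, and the surrogate is refit before the next acquisition step.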

Why Uncertainty Matters in Deep Learning

A standard neural network trained to classify images will assign high confidence to images far from its training distribution — including adversarial examples, out-of-distribution inputs, and corrupted data. Bayesian neural networks, by contrast, report high epistemic uncertainty for unfamiliar inputs, enabling downstream systems to detect when the model should not be trusted. This is critical for safety-relevant applications: medical diagnosis, autonomous driving, and financial decision-making all demand knowing what the model doesn't know.
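One way to act on this in practice is to score each input by the uncertainty of its Monte Carlo predictive distribution and flag high scorers for review. A sketch of such a scoring function, assuming per-pass class probabilities like those produced by the MC dropout example above:

import numpy as np

def uncertainty_scores(probs):
    """probs: array of shape (n_samples, n_classes) of per-pass class probabilities."""
    mean = probs.mean(axis=0)
    # Total predictive uncertainty: entropy of the averaged distribution.
    predictive_entropy = -np.sum(mean * np.log(mean + 1e-12))
    # Expected aleatoric part: average entropy of the individual passes.
    expected_entropy = -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))
    # Mutual information (BALD): the epistemic remainder, high for unfamiliar inputs.
    mutual_information = predictive_entropy - expected_entropy
    return predictive_entropy, mutual_information

# Agreeing passes (confident) vs. disagreeing passes (epistemically uncertain).
confident = np.array([[0.95, 0.03, 0.02]] * 20)
uncertain = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]] * 10)
print(uncertainty_scores(confident))
print(uncertainty_scores(uncertain))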

Bayesian Model Selection and Regularization

The marginal likelihood — the probability of the data under the model after integrating out all parameters — provides a principled criterion for model comparison that automatically penalizes complexity (Occam's razor). Weight decay regularization in neural networks is equivalent to a Gaussian prior on the weights, making every regularized neural network an approximate Bayesian model. Bayesian methods make this connection explicit and allow the regularization strength to be determined by the data through empirical Bayes or full hierarchical modeling.
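The equivalence follows directly from the posterior in the formula above: with a Gaussian prior w ~ N(0, σ²I), taking negative logarithms gives

−log P(w | D) = −log P(D | w) + (1 / 2σ²) · ‖w‖² + const

which is the usual training loss plus an L2 (weight decay) penalty of strength λ = 1/σ². A tighter prior (smaller σ²) corresponds to stronger regularization, and maximizing the posterior recovers standard weight-decay training as a MAP estimate.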

Gaussian Processes and Bayesian Nonparametrics

Gaussian processes (GPs) are a Bayesian nonparametric approach to regression and classification that define a prior directly over functions rather than over finite-dimensional parameters. GPs provide elegant uncertainty estimates and work well with small datasets but scale poorly with data size. Sparse GP approximations and connections between GPs and infinitely wide neural networks (the neural network Gaussian process) bridge the gap between these two paradigms.
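A minimal scikit-learn sketch of GP regression; the kernel choice and toy training data are assumptions for illustration. The predictive standard deviation widens away from the observations, which is the epistemic uncertainty a plain network would not report.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# A handful of noisy observations of an unknown function.
rng = np.random.default_rng(0)
X_train = rng.uniform(-4, 4, size=(15, 1))
y_train = np.sin(X_train).ravel() + 0.1 * rng.normal(size=15)

# Prior over functions: RBF kernel (smoothness) plus a learned noise term.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)

# Posterior mean and standard deviation over a wider test range than the data.
X_test = np.linspace(-8, 8, 200).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
print(std[:3], std[100:103])   # large std far from the data, small std near it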

"A neural network that doesn't know what it doesn't know is not intelligent — it's dangerous. Bayesian deep learning is our best path toward models that are honest about their limitations." — Yarin Gal, on the importance of uncertainty in deep learning

Current Frontiers

Scalable Bayesian inference for foundation models (large language models and vision transformers) remains a grand challenge. Last-layer Bayesian methods and linearized Laplace approximations offer tractable compromises. Bayesian methods for neural architecture search, continual learning, and meta-learning are active research areas. And the intersection of Bayesian inference with causal machine learning promises models that not only predict but explain.
