Matthew D. Hoffman is an American computer scientist and machine learning researcher at Google DeepMind whose work on scalable inference algorithms has had a transformative impact on both Bayesian statistics and deep learning. His development of stochastic variational inference with David Blei and colleagues enabled variational Bayesian methods to scale to massive datasets, while the ADAM optimizer he co-developed with Diederik Kingma became the default optimization algorithm for training deep neural networks worldwide.
Life and Career
Hoffman was born in the United States and studied computer science and machine learning. He earned his Ph.D. from Princeton University, where his research focused on scalable algorithms for Bayesian inference. With David Blei and Francis Bach he published "Online Learning for Latent Dirichlet Allocation," which introduced online variational Bayes for topic models, and he went on to co-develop the general framework of stochastic variational inference, extending online variational methods to arbitrary conjugate-exponential models. With Andrew Gelman he co-invented the No-U-Turn Sampler (NUTS), the adaptive HMC algorithm that powers Stan. He also co-authored the ADAM optimizer paper with Diederik Kingma, which became one of the most cited papers in machine learning.
Stochastic Variational Inference
Classical variational inference processes the entire dataset at each optimization step, computing expectations over all observations to update the variational parameters. For datasets with millions of observations, this is prohibitively expensive. Hoffman, Blei, Wang, and Paisley showed that natural gradient stochastic optimization could be applied to the variational objective, using random mini-batches of data to form noisy but unbiased gradient estimates.
At each iteration t, the algorithm:

1. Samples a mini-batch of observations uniformly from the dataset.
2. Computes the local variational parameters for the mini-batch, holding the global parameters fixed.
3. Forms, from the mini-batch, a noisy estimate λ̃ₜ of the optimal global parameters; the difference λ̃ₜ − λₜ₋₁ is the noisy natural gradient.
4. Updates the global parameters: λₜ = (1 − ρₜ) λₜ₋₁ + ρₜ λ̃ₜ.

The learning-rate schedule is ρₜ = (t + τ)^{−κ} with κ ∈ (0.5, 1] and τ ≥ 0.
The key theoretical insight is that the natural gradient of the variational objective in exponential-family models takes a particularly simple form that can be estimated from mini-batches. The Robbins-Monro conditions on the learning rate schedule guarantee convergence, while the use of natural gradients (rather than ordinary gradients) accounts for the information geometry of the variational distribution, leading to faster convergence in practice.
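The following is a minimal sketch of these updates on a toy conjugate model, xᵢ ~ N(μ, 1) with prior μ ~ N(0, 1), chosen purely for illustration; the model, dataset size, and hyperparameter values are assumptions, not taken from Hoffman et al.'s experiments. Because the model is conjugate, the stochastic updates can be written directly in terms of natural parameters and compared against the exact posterior.

```python
import numpy as np

# Sketch of stochastic variational inference (SVI) on a toy conjugate model:
# x_i ~ N(mu, 1) with prior mu ~ N(0, 1).  Model and hyperparameters are
# illustrative assumptions, not Hoffman et al.'s actual experiments.

rng = np.random.default_rng(0)
N, batch_size = 100_000, 100
x = rng.normal(loc=2.0, scale=1.0, size=N)           # synthetic data

# Natural parameters (eta1, eta2) of a Gaussian q(mu) = N(m, s^2):
#   eta1 = m / s^2,  eta2 = -1 / (2 s^2)
prior = np.array([0.0, -0.5])                         # N(0, 1) prior
lam = prior.copy()                                    # initialize q at the prior

tau, kappa = 1.0, 0.7                                 # learning-rate schedule
for t in range(1, 2001):
    batch = rng.choice(x, size=batch_size, replace=False)
    # Noisy estimate of the optimal global parameters: prior plus the
    # mini-batch sufficient statistics rescaled to the full dataset size.
    lam_tilde = prior + (N / batch_size) * np.array([batch.sum(), -0.5 * batch_size])
    rho = (t + tau) ** (-kappa)                       # step size rho_t
    lam = (1.0 - rho) * lam + rho * lam_tilde         # natural-gradient step

# Recover mean/variance of q(mu) and compare with the exact posterior.
s2 = -1.0 / (2.0 * lam[1])
m = lam[0] * s2
exact_s2 = 1.0 / (N + 1.0)
exact_m = x.sum() * exact_s2
print(f"SVI:   mean={m:.4f}  var={s2:.6f}")
print(f"Exact: mean={exact_m:.4f}  var={exact_s2:.6f}")
```

The same schedule and convex-combination update carry over to models with local latent variables such as LDA, where step 2 becomes a per-document inner optimization.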
Hoffman also co-invented the No-U-Turn Sampler with Andrew Gelman. NUTS eliminates the need to hand-tune the trajectory length in HMC by automatically determining when the simulated trajectory begins to double back on itself. This adaptation was crucial for making HMC practical as a default algorithm, since the optimal trajectory length varies across problems and even across different regions of the same posterior. NUTS is the default sampler in Stan.
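The core of the stopping rule is easy to state: the trajectory has gone far enough once its endpoint starts moving back toward where it began. The sketch below illustrates only that U-turn check on a single forward trajectory under a standard-normal target; the full algorithm also grows the trajectory by recursive doubling in both directions and samples from it in a way that preserves detailed balance, none of which is shown here.

```python
import numpy as np

# Illustration of the U-turn criterion NUTS uses to decide when a simulated
# Hamiltonian trajectory has started to double back.  This shows only the
# stopping rule on one forward trajectory for a standard-normal target; it is
# not the full recursive-doubling NUTS algorithm.

def grad_log_p(theta):
    return -theta                      # gradient of log N(0, I)

def leapfrog(theta, r, eps):
    r = r + 0.5 * eps * grad_log_p(theta)
    theta = theta + eps * r
    r = r + 0.5 * eps * grad_log_p(theta)
    return theta, r

rng = np.random.default_rng(1)
theta_start = rng.normal(size=2)       # starting position
theta = theta_start.copy()
r = rng.normal(size=2)                 # freshly resampled momentum
eps, steps = 0.1, 0

while True:
    theta, r = leapfrog(theta, r, eps)
    steps += 1
    # U-turn check: stop once the displacement from the start and the current
    # momentum point in opposing directions, i.e. the trajectory doubles back.
    if np.dot(theta - theta_start, r) < 0 or steps > 1000:
        break

print(f"trajectory doubled back after {steps} leapfrog steps")
```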
The ADAM Optimizer
While not strictly a Bayesian contribution, the ADAM (Adaptive Moment Estimation) optimizer that Hoffman co-developed with Diederik Kingma has been essential to modern machine learning. ADAM combines momentum (a running estimate of the gradient's first moment) with adaptive per-parameter learning rates (derived from an estimate of the gradient's second moment) to provide robust optimization for deep neural networks. The algorithm has become the default optimizer for training deep learning models and is used to train virtually every major neural network architecture, from convolutional networks to transformers.
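The update rule itself is short. Below is a compact NumPy sketch of the ADAM update applied to a toy quadratic objective; the hyperparameter defaults (α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e−8) are the commonly used ones, and the objective is an illustrative stand-in for a real training loss.

```python
import numpy as np

# Compact sketch of the ADAM update rule on a toy quadratic objective.
# Hyperparameter defaults follow common practice; the objective is an
# illustrative stand-in for a neural-network loss.

def adam(grad_fn, theta, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=5000):
    m = np.zeros_like(theta)           # first-moment (momentum) estimate
    v = np.zeros_like(theta)           # second-moment (uncentered variance) estimate
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy objective: f(theta) = ||theta - target||^2, gradient 2 (theta - target).
target = np.array([3.0, -1.0])
theta = adam(lambda th: 2.0 * (th - target), np.zeros(2))
print(theta)                           # converges toward [3.0, -1.0]
```

The bias-correction terms matter mainly early in training, when the moment estimates are still dominated by their zero initialization.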
Legacy
Hoffman's contributions span both sides of the Bayesian-deep learning divide. Stochastic variational inference made Bayesian methods scalable to big data, while ADAM made deep learning optimization reliable and practical. The NUTS sampler made Hamiltonian Monte Carlo accessible to non-experts through Stan. Together, these contributions have shaped the computational infrastructure of modern machine learning and statistics.
"Scalable inference is not just about faster computation. It is about making principled Bayesian methods applicable to the problems that matter most, which increasingly involve very large datasets." — Matthew D. Hoffman