Bayesian Statistics

Bayesian Experimental Design

Bayesian experimental design selects experiments to maximize the expected information gain about unknown parameters, using the prior predictive distribution to evaluate how much each possible experiment is expected to reduce posterior uncertainty.

d* = argmax_d E_{p(y|d)} [ H[p(θ)] − H[p(θ|y, d)] ]

Every experiment costs something — time, money, material, or a patient's willingness to participate. Bayesian experimental design provides a principled framework for choosing which experiment to run, how many observations to collect, and where in the design space to place them, so as to extract the most information per unit of cost. The key idea is simple but powerful: before collecting data, use the prior predictive distribution to simulate the outcomes each candidate design could produce, compute how much each outcome would reduce uncertainty about the parameters, and choose the design that maximizes this expected reduction.

The approach was pioneered by Dennis Lindley in 1956 and has since grown into a rich field with connections to information theory, optimal control, and active learning in machine learning.

The Expected Information Gain

The central criterion in Bayesian experimental design is the expected information gain (EIG), also called the mutual information between the data y and the parameters θ under a candidate design d.

Expected Information Gain

U(d) = E_{p(y|d)} [ D_KL( p(θ|y,d) ‖ p(θ) ) ]
     = E_{p(y|d)} [ H[p(θ)] − H[p(θ|y,d)] ]
     = I(θ; y | d)

Where:
  D_KL          →  Kullback-Leibler divergence
  H[p(θ)]       →  Prior entropy (uncertainty before experiment)
  H[p(θ|y,d)]   →  Posterior entropy (uncertainty after experiment)
  I(θ; y | d)   →  Mutual information between parameters and data

The optimal design maximizes this criterion: d* = argmax_d U(d). Intuitively, the best experiment is the one whose outcomes are most informative about the unknowns on average — the one that produces the largest expected reduction in entropy from prior to posterior.
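
As a concrete check of these definitions, consider a conjugate normal model in which the design is simply the number of observations n. Because the posterior entropy does not depend on the observed data, the EIG has a closed form, U(n) = ½ log(1 + nτ²/σ²). The short Python sketch below (the prior and noise standard deviations are illustrative values, not from the source) evaluates this formula and shows the diminishing returns of additional observations.

import numpy as np

# Conjugate normal model: theta ~ N(0, tau^2), y_i | theta ~ N(theta, sigma^2), i = 1..n.
# The posterior is Gaussian with variance (1/tau^2 + n/sigma^2)^(-1), independent of y,
# so the EIG has the closed form U(n) = 0.5 * log(1 + n * tau^2 / sigma^2).

tau, sigma = 2.0, 1.0   # prior std and observation noise std (illustrative values)

def eig_normal(n, tau=tau, sigma=sigma):
    """Expected information gain (in nats) of n observations under the conjugate normal model."""
    return 0.5 * np.log(1.0 + n * tau**2 / sigma**2)

for n in [1, 2, 5, 10, 50, 100]:
    print(f"n = {n:3d}   U(n) = {eig_normal(n):.3f} nats")
# The gain grows only logarithmically in n: doubling the data does not double the information.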

Alternative Criteria

While the EIG (Lindley's criterion) is the most common, several alternative utility functions are used depending on the goal.

Common Design Criteria

D-optimality:        max_d  det(I_F(d, θ̂))             (maximize determinant of Fisher information)
A-optimality:        min_d  tr(I_F(d, θ̂)⁻¹)            (minimize trace of inverse Fisher information)
Bayesian D-opt:      max_d  E_π[ log det(I_F(d, θ)) ]   (D-optimality averaged over prior)
Decision-theoretic:  max_d  E[ u(a*(y), θ) ]            (maximize expected decision utility)

The classical D-optimal and A-optimal criteria come from frequentist experimental design. Their Bayesian counterparts average these criteria over the prior distribution of θ, acknowledging that the optimal design depends on the unknown parameter value. The decision-theoretic criterion goes further, specifying a terminal decision problem and choosing the experiment that leads to the best expected decision.
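
To make the Bayesian D-optimality criterion concrete, the sketch below uses a toy nonlinear model, a single noisy observation of an exponential decay, in which the locally D-optimal measurement time depends on the unknown decay rate. The model, the gamma prior, and the noise level are illustrative assumptions; the point is only that averaging log det(I_F) over the prior yields a single compromise design.

import numpy as np

rng = np.random.default_rng(0)

# Nonlinear example: one noisy observation of an exponential decay,
#   y = exp(-theta * t) + eps,  eps ~ N(0, sigma^2),
# with measurement time t as the design variable. The (scalar) Fisher information
# is I_F(t, theta) = (t * exp(-theta * t))**2 / sigma**2, so the locally D-optimal
# time t = 1/theta depends on the unknown theta.

sigma = 0.05
theta_prior = rng.gamma(shape=4.0, scale=0.5, size=5000)   # assumed prior on the decay rate
t_grid = np.linspace(0.1, 5.0, 200)                        # candidate measurement times

def log_fisher(t, theta):
    return 2.0 * (np.log(t) - theta * t) - 2.0 * np.log(sigma)

# Bayesian D-optimality: average log det(I_F) (here a scalar) over the prior.
bayes_crit = np.array([log_fisher(t, theta_prior).mean() for t in t_grid])
t_bayes = t_grid[np.argmax(bayes_crit)]

print(f"Bayesian D-optimal time: t = {t_bayes:.2f}")
print(f"Locally optimal times 1/theta for theta = 1, 2, 4: {1/1:.2f}, {1/2:.2f}, {1/4:.2f}")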

Why Not Just Maximize Data?

One might think the answer is always "collect more data." But Bayesian experimental design reveals that where you observe matters as much as how much you observe. In regression, for instance, designs that spread observations across the predictor space can be vastly more informative than designs that cluster them. In dose-response studies, the most informative dose levels are often at the extremes and inflection points — not at the center. Bayesian design identifies these high-information regions automatically.
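
A minimal regression illustration of this point: for a straight-line model with ten observations on [−1, 1], compare a design that clusters the points near the centre with one that pushes them to the ends of the interval (the particular values below are arbitrary illustrative choices).

import numpy as np

# Simple linear regression y = b0 + b1*x + noise with n = 10 observations on [-1, 1].
# Two candidate designs with the same sample size: points clustered near the centre
# versus points placed at the extremes of the interval.

def design_matrix(x):
    return np.column_stack([np.ones_like(x), x])

x_clustered = np.linspace(-0.1, 0.1, 10)        # all points near x = 0
x_spread    = np.repeat([-1.0, 1.0], 5)         # points at the extremes

for name, x in [("clustered", x_clustered), ("spread", x_spread)]:
    X = design_matrix(x)
    info = X.T @ X                              # Fisher information (up to 1/sigma^2)
    print(f"{name:9s}  det(X'X) = {np.linalg.det(info):8.3f}   "
          f"var(b1_hat) ∝ {np.linalg.inv(info)[1, 1]:.3f}")
# The spread design has a far larger information determinant and a far smaller
# slope variance, even though both designs use the same number of observations.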

Computational Methods

The main computational challenge is that the EIG involves a double expectation — an outer expectation over possible data y and an inner expectation over the posterior p(θ|y,d). For most models, neither integral has a closed form.

Nested Monte Carlo

The simplest approach draws samples θ⁽ʲ⁾ from the prior, simulates data y⁽ʲ⁾ from the likelihood, and estimates the posterior entropy for each simulated dataset using importance sampling or MCMC. This is computationally expensive — each outer sample requires a full posterior computation — but it is broadly applicable.
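
A minimal nested Monte Carlo sketch is shown below. Instead of estimating the posterior entropy directly, it uses the equivalent form U(d) = E[ log p(y|θ,d) − log p(y|d) ], estimating the marginal likelihood p(y|d) for each simulated dataset with an inner Monte Carlo average; the exponential-decay model, the gamma prior, and the sample sizes are illustrative assumptions.

import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(1)

# Nested Monte Carlo estimate of the EIG, written in the equivalent form
#   U(d) = E_{p(theta, y | d)}[ log p(y | theta, d) - log p(y | d) ],
# with the marginal likelihood p(y | d) estimated by an inner Monte Carlo average.
# Toy model (assumed for illustration): a single observation of an exponential decay
# at time d,  y | theta ~ N(exp(-theta * d), sigma^2),  theta ~ Gamma(4, scale = 0.5).

sigma = 0.05

def sample_prior(size):
    return rng.gamma(shape=4.0, scale=0.5, size=size)

def log_lik(y, theta, d):
    return norm.logpdf(y, loc=np.exp(-theta * d), scale=sigma)

def eig_nmc(d, n_outer=500, n_inner=500):
    theta_out = sample_prior(n_outer)                       # theta^(j) ~ prior
    y = rng.normal(np.exp(-theta_out * d), sigma)           # y^(j) ~ likelihood
    outer = log_lik(y, theta_out, d)                        # log p(y^(j) | theta^(j), d)
    theta_in = sample_prior(n_inner)                        # fresh prior samples
    inner = logsumexp(log_lik(y[:, None], theta_in[None, :], d), axis=1) - np.log(n_inner)
    return np.mean(outer - inner)                           # Monte Carlo estimate of U(d)

for d in [0.1, 0.5, 1.0, 2.0, 4.0]:
    print(f"d = {d:3.1f}   estimated EIG ≈ {eig_nmc(d):.2f} nats")

The cost scales as the product of the outer and inner sample sizes, which is why each evaluation of U(d) is expensive.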

Variational Methods

Recent work uses variational approximations to construct tractable bounds on the mutual information, yielding EIG estimates that can be optimized with gradient-based methods. The variational approach of Foster et al. (2019) uses amortized inference networks to approximate the posterior, dramatically reducing cost.
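
The sketch below shows the idea in its simplest form: a Barber-Agakov-style lower bound, E[log q(θ|y)] + H[p(θ)], for a linear-Gaussian model with a hand-rolled Gaussian variational posterior q(θ|y) = N(a + b·y, s²). It is a deliberately stripped-down stand-in for the amortized, gradient-trained networks used in practice (a derivative-free optimizer is used here purely for brevity), but because the model is conjugate the bound can be checked against the exact EIG.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Variational lower bound on the EIG (a simplified, non-amortized sketch, not the
# full method of Foster et al.):
#   U(d) >= E_{p(theta, y | d)}[ log q(theta | y) ] + H[p(theta)]   for any q.
# Model (assumed for illustration): theta ~ N(0, tau^2), y | theta ~ N(d * theta, sigma^2),
# with q(theta | y) = N(a + b * y, s^2) and free parameters (a, b, log s). The bound is
# tight at its maximum here, so it can be compared with the exact EIG.

tau, sigma, d = 2.0, 1.0, 0.7
n = 20000

theta = rng.normal(0.0, tau, size=n)              # theta ~ prior
y = rng.normal(d * theta, sigma)                  # y ~ likelihood

prior_entropy = 0.5 * np.log(2 * np.pi * np.e * tau**2)

def neg_bound(params):
    a, b, log_s = params
    s2 = np.exp(2 * log_s)
    log_q = -0.5 * np.log(2 * np.pi * s2) - (theta - a - b * y) ** 2 / (2 * s2)
    return -(log_q.mean() + prior_entropy)

res = minimize(neg_bound, x0=np.zeros(3), method="Nelder-Mead")
exact = 0.5 * np.log(1 + d**2 * tau**2 / sigma**2)
print(f"variational lower bound ≈ {-res.fun:.3f} nats   exact EIG = {exact:.3f} nats")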

Sequential Design

In many settings, experiments are performed sequentially — each observation informs the choice of the next. Bayesian adaptive design interleaves data collection with inference: after each observation, the posterior is updated, and the next design point is chosen to maximize the expected information gain from the updated state of knowledge. The full sequential problem can be formulated as a partially observable Markov decision process (POMDP), connecting experimental design to reinforcement learning.
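
The greedy (myopic) version of this loop is easy to sketch. The toy example below estimates the ED50 of a logistic dose-response curve with a grid posterior: at each step the candidate dose maximizing the expected information gain under the current posterior is chosen, a binary outcome is simulated, and the posterior is updated. The slope, prior, grid, and "true" ED50 are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

# Greedy sequential (myopic) Bayesian adaptive design for a logistic dose-response model
# (an illustrative toy, not a specific published design):
#   P(y = 1 | theta, d) = 1 / (1 + exp(-k * (d - theta))),
# where theta is the unknown ED50 and the design d is the dose. The posterior over theta
# is kept on a grid, and each new dose maximizes I(theta; y | d) = H(y|d) - H(y|theta,d)
# under the current posterior.

k = 3.0                                              # known slope (assumption)
theta_true = 1.3                                     # "true" ED50, used only to simulate data
theta_grid = np.linspace(-2, 4, 601)
post = np.ones_like(theta_grid) / len(theta_grid)    # uniform prior on the grid
doses = np.linspace(-2, 4, 121)                      # candidate doses

def prob(theta, d):
    return 1.0 / (1.0 + np.exp(-k * (d - theta)))

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

for step in range(10):
    p_grid = prob(theta_grid[:, None], doses[None, :])     # shape (grid points, doses)
    p_marg = post @ p_grid                                  # predictive P(y = 1 | d)
    eig = binary_entropy(p_marg) - post @ binary_entropy(p_grid)
    d_next = doses[np.argmax(eig)]                          # most informative dose
    y = rng.random() < prob(theta_true, d_next)             # run the "experiment"
    like = prob(theta_grid, d_next) if y else 1 - prob(theta_grid, d_next)
    post = post * like
    post /= post.sum()                                      # Bayes update on the grid
    mean = post @ theta_grid
    print(f"step {step + 1:2d}: dose = {d_next:5.2f}, outcome = {int(y)}, posterior mean ED50 = {mean:.2f}")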

"The purpose of the experiment is to gain information about the parameter, and the best experiment is the one that is expected to provide the most." — Dennis V. Lindley, "On a Measure of the Information Provided by an Experiment" (1956)

Applications

Clinical Trials

Bayesian adaptive clinical trials use accumulating data to modify the trial as it proceeds — adjusting dose levels, dropping ineffective treatment arms, or reallocating patients to promising treatments. The I-SPY 2 breast cancer trial and RECOVERY COVID-19 trial both employed Bayesian adaptive designs, accelerating the identification of effective treatments.

Bayesian Optimization

Bayesian optimization — used to tune hyperparameters in machine learning, optimize expensive simulations, and design new materials — is a form of sequential Bayesian experimental design. A Gaussian process surrogate model provides a posterior over the objective function, and an acquisition function (such as expected improvement, the knowledge gradient, or entropy search, which directly targets information about the location of the optimum) selects the next evaluation point.
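
A minimal sketch of this loop, with a hand-rolled Gaussian process (fixed RBF kernel hyperparameters) and the expected improvement acquisition on a toy one-dimensional objective, is given below; all functions and parameter values are illustrative rather than taken from any particular library.

import numpy as np
from scipy.stats import norm

# Minimal Bayesian optimization sketch on a 1-D toy objective: a Gaussian process
# surrogate with a fixed RBF kernel, and the expected improvement acquisition
# function chooses each new evaluation point (all values are illustrative).

def objective(x):
    return -np.sin(3 * x) - x**2 + 0.7 * x        # toy function to maximize on [-1, 2]

def rbf(a, b, length=0.3, var=1.0):
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    Kss = rbf(x_test, x_test)
    mu = Ks.T @ np.linalg.solve(K, y_train)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mu, np.sqrt(np.clip(np.diag(cov), 1e-12, None))

def expected_improvement(mu, sd, best):
    z = (mu - best) / sd
    return (mu - best) * norm.cdf(z) + sd * norm.pdf(z)

x_grid = np.linspace(-1, 2, 300)
x_train = np.array([-0.5, 1.5])                   # two initial evaluations
y_train = objective(x_train)

for it in range(8):
    mu, sd = gp_posterior(x_train, y_train, x_grid)
    ei = expected_improvement(mu, sd, y_train.max())
    x_next = x_grid[np.argmax(ei)]                # next evaluation point
    x_train = np.append(x_train, x_next)
    y_train = np.append(y_train, objective(x_next))

print(f"best point found: x = {x_train[np.argmax(y_train)]:.3f}, f = {y_train.max():.3f}")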

A/B Testing

Bayesian A/B testing uses sequential design to determine when enough evidence has accumulated to declare a winner, avoiding the fixed-sample-size requirement of frequentist testing and reducing the expected number of observations allocated to inferior variants.
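
A minimal sketch of such a test, assuming Bernoulli conversions with Beta priors and a simple posterior-probability stopping rule (the conversion rates, batch size, and threshold are illustrative), follows.

import numpy as np

rng = np.random.default_rng(5)

# Sequential Bayesian A/B test sketch with Beta-Bernoulli models (illustrative values).
# After each batch of observations, compute P(variant B beats variant A | data) by
# Monte Carlo from the two Beta posteriors, and stop once that probability is
# decisive in either direction.

p_a, p_b = 0.10, 0.12         # true (unknown) conversion rates used to simulate data
alpha = np.array([1.0, 1.0])  # Beta(1, 1) priors: alpha counts successes + 1
beta = np.array([1.0, 1.0])   # beta counts failures + 1
batch, threshold = 200, 0.95

for i in range(1, 101):
    conv = rng.binomial(batch, [p_a, p_b])        # one batch per variant
    alpha += conv
    beta += batch - conv
    draws_a = rng.beta(alpha[0], beta[0], 10000)
    draws_b = rng.beta(alpha[1], beta[1], 10000)
    prob_b_better = (draws_b > draws_a).mean()
    if prob_b_better > threshold or prob_b_better < 1 - threshold:
        winner = "B" if prob_b_better > threshold else "A"
        print(f"stopped after {i * batch} observations per variant; "
              f"P(B > A) = {prob_b_better:.3f}, declare {winner}")
        break
else:
    print("no decision within the observation budget")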

Connections to Information Theory

The EIG is precisely the mutual information between θ and y, placing Bayesian experimental design squarely within Shannon's information theory. The channel capacity of a communication channel — the maximum mutual information between input and output — is the information-theoretic analogue of the optimal design. This connection has been exploited in compressed sensing, active learning, and sensor placement problems.

Mutual Information Decomposition

I(θ; y | d) = H(y | d) − H(y | θ, d)

Interpretation:

I(θ; y | d) = (total uncertainty in data) − (irreducible noise in data)

This decomposition shows that the most informative experiment is one whose outcome is highly uncertain a priori (high marginal entropy) but tightly determined once the parameter is known (low conditional entropy). In other words, informative experiments are those whose results depend strongly on the unknown parameter.
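
The decomposition can be checked numerically. The sketch below uses a binary-outcome logistic model with a standard normal prior (an illustrative choice, not from the source): at a dose matched to the prior, the marginal outcome is close to fifty-fifty while the outcome given θ is comparatively predictable, so the information gain is largest there.

import numpy as np

# Numerical illustration of I(theta; y | d) = H(y | d) - H(y | theta, d) for the
# binary-outcome logistic model P(y = 1 | theta, d) = 1 / (1 + exp(-k (d - theta)))
# with a N(0, 1) prior on theta (illustrative values; entropies in nats).

k = 3.0
theta = np.random.default_rng(6).normal(0.0, 1.0, 100000)   # prior samples

def h(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

for d in [0.0, 1.0, 3.0]:
    p = 1.0 / (1.0 + np.exp(-k * (d - theta)))   # P(y = 1 | theta, d) for each prior draw
    marginal = h(p.mean())                       # H(y | d): uncertainty before theta is known
    conditional = h(p).mean()                    # H(y | theta, d): average irreducible noise
    print(f"d = {d:3.1f}   H(y|d) = {marginal:.3f}   H(y|theta,d) = {conditional:.3f}   "
          f"EIG = {marginal - conditional:.3f}")
# The dose d = 0 (matched to the prior mean of theta) has a near fifty-fifty marginal
# outcome but relatively predictable outcomes given theta, so it carries the most information.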

Optimal Design Is Prior-Dependent

Unlike classical optimal design, which depends on a point estimate of θ, Bayesian optimal design depends on the full prior distribution. This means the optimal experiment changes as knowledge accumulates — a natural motivation for sequential and adaptive designs. It also means that prior elicitation is not merely a philosophical nicety but has direct practical consequences for experimental efficiency.

Related Topics