Bayesian Statistics

Chain Rule (Probability)

The chain rule of probability — also called the general product rule — decomposes any joint probability into a product of successive conditionals, forming the algebraic backbone of Bayesian networks and hierarchical models.

P(A₁, A₂, …, Aₙ) = P(A₁) · P(A₂|A₁) · P(A₃|A₁,A₂) · … · P(Aₙ|A₁,…,Aₙ₋₁)

Every joint distribution, no matter how complex, can be written as a product of conditional probabilities. The chain rule makes this factorization precise. Given events A₁ through Aₙ, their joint probability equals the probability of the first event, times the probability of the second given the first, times the probability of the third given the first two, and so on. No approximation is involved — the identity is exact, following directly from the definition of conditional probability applied repeatedly.

This seemingly simple algebraic fact has far-reaching consequences. It is the principle that allows Bayesian networks to represent high-dimensional distributions compactly, the mechanism behind autoregressive language models, and the foundation on which sequential Bayesian updating rests.

Chain Rule — General Form
P(A₁, A₂, …, Aₙ) = ∏ᵢ₌₁ⁿ P(Aᵢ | A₁, …, Aᵢ₋₁)

Expanded
P(A₁, A₂, …, Aₙ) = P(A₁) · P(A₂|A₁) · P(A₃|A₁,A₂) · … · P(Aₙ|A₁,…,Aₙ₋₁)
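
As a quick numerical check, here is a minimal Python sketch (the joint distribution is randomly generated, purely for illustration) that factors a three-variable joint with the chain rule and confirms the product of conditionals reproduces every joint probability:

import itertools
import random

# Illustrative sketch: build a random joint distribution over three binary
# variables and confirm that the chain-rule product of conditionals
# reproduces every joint probability.
random.seed(0)

outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}

def marginal(prefix):
    # P(A1 = prefix[0], ..., Ak = prefix[k-1]), summed over the remaining variables.
    return sum(p for o, p in joint.items() if o[:len(prefix)] == prefix)

for (a, b, c), p in joint.items():
    # Chain rule: P(A, B, C) = P(A) * P(B | A) * P(C | A, B)
    product = (marginal((a,))
               * marginal((a, b)) / marginal((a,))
               * joint[(a, b, c)] / marginal((a, b)))
    assert abs(product - p) < 1e-12
print("Chain-rule factorization matches the joint for every outcome.")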

Derivation from Conditional Probability

The chain rule follows by repeatedly applying the definition of conditional probability. Recall that for any two events A and B with P(A) > 0:

Definition of Conditional Probability
P(A, B) = P(B | A) · P(A)

For three events, apply the definition twice. First factor the joint into a pair:

Two-Step Derivation
P(A, B, C) = P(C | A, B) · P(A, B)
           = P(C | A, B) · P(B | A) · P(A)

The pattern extends to any number of variables by induction. At each step, one variable is peeled off the joint and turned into a conditional. The process terminates when a single marginal probability remains. Because each step is an exact application of the definition of conditional probability, the final product is an identity — it holds for all valid probability distributions.

The Two-Event Case and Bayes' Theorem

The simplest instance of the chain rule is the product rule for two events:

Product Rule
P(A, B) = P(B | A) · P(A) = P(A | B) · P(B)

Setting these two factorizations equal and solving for P(A | B) yields Bayes' Theorem directly. In this sense the chain rule is logically prior to Bayes' Theorem — Bayes is what you get when you equate two different chain-rule decompositions of the same joint probability.
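
To make the step concrete, the following sketch uses assumed numbers (not taken from the text) to compute both factorizations of the joint and solve for P(A | B):

# Assumed numbers, purely for illustration (not taken from the text above).
p_a = 0.3              # P(A)
p_b_given_a = 0.8      # P(B | A)
p_b_given_not_a = 0.1  # P(B | not A)

# The two chain-rule factorizations of the joint must agree:
#   P(A, B) = P(B | A) * P(A) = P(A | B) * P(B)
p_ab = p_b_given_a * p_a
p_b = p_ab + p_b_given_not_a * (1 - p_a)   # law of total probability

# Solving the equality for P(A | B) is exactly Bayes' theorem.
p_a_given_b = p_ab / p_b
print(round(p_a_given_b, 4))   # P(B|A) P(A) / P(B) = 0.24 / 0.31 ≈ 0.7742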

Connection to Bayesian Networks

A Bayesian network encodes a joint distribution over many variables as a directed acyclic graph. The chain rule guarantees that any joint distribution can be written as a product of conditionals. The network's structure then introduces conditional independence assumptions that simplify these factors.

Full Chain Rule (4 variables)
P(A, B, C, D) = P(A) · P(B|A) · P(C|A,B) · P(D|A,B,C)

With Conditional Independence (Bayesian Network)
P(A, B, C, D) = P(A) · P(B|A) · P(C|A) · P(D|B,C)

In the full chain rule, each successive conditional depends on all previous variables. A Bayesian network exploits the graph structure to drop irrelevant parents — for example, C might depend only on A (not B), and D might depend only on B and C (not A). These simplifications are not approximations; they are assertions about the structure of the problem. When they hold, the number of parameters needed to specify the joint distribution drops dramatically.
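
One rough way to see the savings is to count free parameters. The sketch below assumes four binary variables and compares the full chain-rule factorization with the network factorization shown above:

# Sketch: count free parameters for four binary variables (an assumed toy setup).
# A binary variable with k binary parents needs 2**k free parameters,
# one per parent configuration.
def n_params(parents_per_variable):
    return sum(2 ** k for k in parents_per_variable)

# Full chain rule for (A, B, C, D): each variable conditions on all predecessors.
full_chain = n_params([0, 1, 2, 3])   # 1 + 2 + 4 + 8 = 15
# Network above: B depends on A, C depends on A, D depends on B and C.
bayes_net = n_params([0, 1, 1, 2])    # 1 + 2 + 2 + 4 = 9

print(full_chain, bayes_net)   # 15 vs. 9; the gap widens rapidly as variables are added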

This is why the chain rule is sometimes called the "engine" of Bayesian networks. Without it there would be no principled way to decompose a joint distribution into local, modular factors that can be estimated and reasoned about independently.

Role in Sequential Bayesian Updating

When data arrive sequentially — observations x₁, x₂, …, xₙ — the chain rule provides the mathematical justification for updating beliefs one observation at a time:

Sequential Likelihood Decomposition
P(x₁, x₂, …, xₙ | θ) = P(x₁|θ) · P(x₂|x₁,θ) · … · P(xₙ|x₁,…,xₙ₋₁,θ)

If the observations are conditionally independent given the parameter θ (the standard i.i.d. assumption), every conditioning set reduces to just θ, and the likelihood simplifies to a product of terms of the same form, P(xᵢ|θ). But even when observations are dependent — as in time series, spatial data, or sequential experiments — the chain rule shows how to build the full likelihood from one-step-ahead conditionals.

This decomposition is the basis of online learning algorithms, particle filters, and Kalman filters. At each time step the agent absorbs one new observation, updates its posterior, and carries that posterior forward as the prior for the next step. The mathematical legitimacy of this "update and propagate" strategy rests entirely on the chain rule.
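
Here is a minimal sketch of the pattern, assuming i.i.d. Bernoulli observations and a Beta prior (both invented for illustration); processing the data one point at a time gives the same posterior as a single batch update:

# Minimal sketch: sequential (online) updating of a Beta prior on a Bernoulli
# parameter, one observation at a time. Data and prior are assumed for illustration.
observations = [1, 0, 1, 1, 0, 1, 1, 1]   # assumed coin-flip data
alpha, beta = 1.0, 1.0                    # Beta(1, 1) prior, i.e. uniform

for x in observations:
    # Today's posterior becomes tomorrow's prior; the chain rule justifies this.
    alpha += x
    beta += 1 - x

# The result is identical to a single batch update on all the data at once,
# because the likelihood factorizes: P(x_1..x_n | theta) = prod_i P(x_i | theta).
print(alpha, beta)   # Beta(7, 3): 6 successes and 2 failures added to the prior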

Autoregressive Models and Language

Modern autoregressive language models generate text by sampling one token at a time, each conditioned on all previous tokens. This is a direct application of the chain rule:

Autoregressive Factorization
P(w₁, w₂, …, wₙ) = P(w₁) · P(w₂|w₁) · P(w₃|w₁,w₂) · … · P(wₙ|w₁,…,wₙ₋₁)

No independence assumptions are made — each token can depend on the entire preceding context. The model learns each conditional factor P(wₜ | w₁,…,wₜ₋₁) from data using a neural network. Because each learned factor is a normalized distribution over the next token, their product automatically defines a valid joint distribution over sequences, regardless of how accurate the individual factors are.
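
As a toy illustration (not how production language models work), the sketch below scores a short sentence with a hand-specified bigram model, a special case of the autoregressive factorization in which each token conditions only on its predecessor:

import math

# Toy bigram model: each token depends only on the previous token.
# All probabilities below are made up for illustration.
p_first = {"the": 0.6, "a": 0.4}
p_next = {
    ("the", "cat"): 0.5, ("the", "dog"): 0.5,
    ("a", "cat"): 0.5, ("a", "dog"): 0.5,
    ("cat", "sleeps"): 1.0, ("dog", "sleeps"): 1.0,
}

def log_prob(tokens):
    # Chain rule: log P(w_1..w_n) = log P(w_1) + sum_t log P(w_t | w_{t-1})
    lp = math.log(p_first[tokens[0]])
    for prev, cur in zip(tokens, tokens[1:]):
        lp += math.log(p_next[(prev, cur)])
    return lp

print(math.exp(log_prob(["the", "cat", "sleeps"])))   # 0.6 * 0.5 * 1.0 = 0.3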

Continuous Variables

The chain rule extends naturally to continuous random variables by replacing probabilities with probability densities:

Continuous Chain Rule
f(x₁, x₂, …, xₙ) = f(x₁) · f(x₂|x₁) · f(x₃|x₁,x₂) · … · f(xₙ|x₁,…,xₙ₋₁)

This form is essential in multivariate statistics and probabilistic modeling. Hierarchical Bayesian models, for instance, express complex joint densities over parameters and hyperparameters as cascading conditionals — hyperpriors feeding into priors feeding into likelihoods — each factor specified independently. The chain rule guarantees their product defines a coherent joint distribution.
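
For a concrete continuous case, the following sketch (an assumed bivariate normal example, using SciPy) checks that the joint density equals the marginal density of the first coordinate times the conditional density of the second:

import numpy as np
from scipy.stats import norm, multivariate_normal

# Assumed example: factor a standard bivariate normal density as f(x1) * f(x2 | x1).
# With correlation rho, the conditional X2 | X1 = x1 has mean rho * x1
# and variance 1 - rho**2.
rho = 0.7
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

x1, x2 = 0.5, -1.2
f_joint = joint.pdf([x1, x2])
f_x1 = norm.pdf(x1)                                            # marginal of X1
f_x2_given_x1 = norm.pdf(x2, loc=rho * x1, scale=np.sqrt(1 - rho**2))

print(np.isclose(f_joint, f_x1 * f_x2_given_x1))               # True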

A Worked Example: Medical Screening

A patient undergoes two diagnostic tests for a condition. The chain rule structures the joint probability of the disease status and both test results:

Joint Probability via Chain Rule
P(Disease, Test₁⁺, Test₂⁺) = P(Disease) · P(Test₁⁺ | Disease) · P(Test₂⁺ | Disease, Test₁⁺)

With Conditional Independence of Tests
P(Disease, Test₁⁺, Test₂⁺) = P(Disease) · P(Test₁⁺ | Disease) · P(Test₂⁺ | Disease)

Numerical Example
P(D) = 0.01    P(T₁⁺|D) = 0.95    P(T₂⁺|D) = 0.90
P(D, T₁⁺, T₂⁺) = 0.01 × 0.95 × 0.90 = 0.00855

The conditional independence assumption — that the two tests provide independent information given the true disease status — allows the third factor to drop its dependence on Test₁. This is precisely the kind of structural simplification that Bayesian networks formalize. Without the chain rule to begin with, there would be no principled starting point from which to introduce such simplifications.
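
The sketch below reproduces the calculation and goes one step further, combining the chain rule with Bayes' theorem to obtain the posterior probability of disease given two positive tests. The false-positive rates are assumed purely for illustration; they are not given above:

# Numbers from the example above; the false-positive rates are assumed here
# purely for illustration (they are not given in the text).
p_d = 0.01
p_t1_given_d, p_t2_given_d = 0.95, 0.90
p_t1_given_not_d, p_t2_given_not_d = 0.05, 0.08   # assumed false-positive rates

# Chain rule with conditional independence of the tests given disease status.
joint_pos = p_d * p_t1_given_d * p_t2_given_d                 # disease present
joint_neg = (1 - p_d) * p_t1_given_not_d * p_t2_given_not_d   # disease absent

# Bayes' theorem then gives the posterior probability of disease given two positives.
posterior = joint_pos / (joint_pos + joint_neg)
print(joint_pos, round(posterior, 3))   # 0.00855, roughly 0.683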

Ordering Does Not Matter

The chain rule can be applied in any order. For three events, P(A,B,C) can be factored as P(A)·P(B|A)·P(C|A,B) or as P(C)·P(B|C)·P(A|B,C) or any of the other four orderings. All six products are equal — they must be, since they all equal the same joint probability. In practice, some orderings lead to simpler conditionals than others, and choosing a good ordering is a key design decision in probabilistic modeling.
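
The sketch below makes this invariance explicit, factoring an assumed three-variable joint in all six orderings and confirming that each product equals the same joint probability:

import itertools

# Assumed joint distribution over three binary variables (probabilities sum to 1).
joint = {}
vals = [0.02, 0.08, 0.10, 0.05, 0.20, 0.15, 0.25, 0.15]
for outcome, p in zip(itertools.product([0, 1], repeat=3), vals):
    joint[outcome] = p

def prob(assignment):
    # Marginal probability of a partial assignment {variable index: value}.
    return sum(p for o, p in joint.items()
               if all(o[i] == v for i, v in assignment.items()))

target = (1, 0, 1)   # arbitrary outcome to factor
for order in itertools.permutations(range(3)):
    product, seen = 1.0, {}
    for i in order:
        product *= prob({**seen, i: target[i]}) / prob(seen)   # P(next | seen)
        seen[i] = target[i]
    assert abs(product - joint[target]) < 1e-12
print("All six orderings reproduce P(A=1, B=0, C=1) =", joint[target])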

Information-Theoretic Perspective

Taking logarithms of both sides of the chain rule turns the product into a sum: the log of the joint probability equals the sum of the log conditional probabilities. Taking expectations of that sum gives the corresponding identity for entropy, in which the joint entropy decomposes into a sum of conditional entropies:

Entropy Chain Rule
H(A₁, A₂, …, Aₙ) = H(A₁) + H(A₂|A₁) + H(A₃|A₁,A₂) + … + H(Aₙ|A₁,…,Aₙ₋₁)

This identity is the foundation of information theory's connection to probability. It tells us that the total uncertainty in a collection of variables equals the uncertainty in the first, plus the residual uncertainty in the second after knowing the first, and so on. Mutual information, KL divergence, and all the other information-theoretic quantities used in Bayesian model comparison ultimately trace back to this decomposition.
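
The two-variable case is easy to verify directly. The following sketch, using an assumed joint distribution over two binary variables, checks that H(A, B) = H(A) + H(B | A):

import math

# Assumed small joint distribution over two binary variables.
joint = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}

def H(dist):
    # Shannon entropy in bits of a dict of probabilities.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

p_a = {a: joint[(a, 0)] + joint[(a, 1)] for a in (0, 1)}
# H(B | A) = sum_a P(a) * H(B | A = a)
h_b_given_a = sum(p_a[a] * H({b: joint[(a, b)] / p_a[a] for b in (0, 1)})
                  for a in (0, 1))

print(round(H(joint), 4), round(H(p_a) + h_b_given_a, 4))   # the two values match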

"Probability theory is nothing but common sense reduced to calculation." — Pierre-Simon Laplace, Théorie analytique des probabilités (1812)

Example: Predicting Equipment Failure in a Factory

A manufacturing plant monitors two sequential warning indicators on its assembly line: a temperature sensor (T) and a vibration sensor (V). Historical data show that 5% of production runs develop a critical fault (F). When a fault is developing, the temperature sensor triggers 80% of the time. When the temperature sensor has triggered and a fault is developing, the vibration sensor also triggers 90% of the time.

Setting Up the Chain

We want P(T triggers, V triggers, Fault) — the joint probability that both alarms fire and a real fault is present. The chain rule decomposes this neatly:

Chain Rule Decomposition
P(F, T, V) = P(F) · P(T | F) · P(V | F, T)

Plugging in the Numbers

Calculation
P(F) = 0.05
P(T | F) = 0.80
P(V | F, T) = 0.90

P(F, T, V) = 0.05 × 0.80 × 0.90 = 0.036

So in 3.6% of production runs, both sensors fire and a genuine fault is present. The chain rule let us build this joint probability from three simpler conditional relationships, each of which could be estimated independently from maintenance logs. Without the chain rule, we would need to observe all three events together in enough cases to estimate the joint probability directly — a much harder task when faults are rare.
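
In practice each factor would be estimated from logged production runs. The sketch below simulates such logs and rebuilds the joint from the estimated conditionals; the trigger rates for fault-free runs are assumed only to make the simulation runnable, since they are not specified above:

import random

# Sketch: estimate the three chain-rule factors from simulated maintenance logs
# and rebuild the joint. The no-fault trigger rates (0.10 and 0.15) are assumed
# purely to make the simulation runnable; the text does not specify them.
random.seed(1)

def simulate_run():
    f = random.random() < 0.05                 # P(F) = 0.05
    t = random.random() < (0.80 if f else 0.10)          # P(T | F); assumed P(T | not F)
    v = random.random() < (0.90 if (f and t) else 0.15)  # P(V | F, T); assumed otherwise
    return f, t, v

logs = [simulate_run() for _ in range(200_000)]

p_f = sum(f for f, t, v in logs) / len(logs)
p_t_given_f = sum(t for f, t, v in logs if f) / sum(f for f, t, v in logs)
ft = [(f, t, v) for f, t, v in logs if f and t]
p_v_given_ft = sum(v for f, t, v in ft) / len(ft)

# Chain rule: P(F, T, V) = P(F) * P(T | F) * P(V | F, T), close to 0.036
print(round(p_f * p_t_given_f * p_v_given_ft, 4))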

Why This Matters

The chain rule is the workhorse behind probabilistic graphical models used in predictive maintenance systems. Each sensor reading becomes a node in a Bayesian network, and the chain rule provides the mathematical foundation for computing the probability of any combination of readings and underlying states. Real factories monitor hundreds of sensors — the chain rule is what makes the combinatorial explosion manageable by decomposing the joint distribution into a product of tractable conditionals.
