Bayesian Statistics

Lewandowski-Kurowicka-Joe Distribution

The Lewandowski-Kurowicka-Joe (LKJ) distribution is a probability distribution over correlation matrices parameterized by a single shape parameter η, widely used as a prior in Bayesian multivariate models.

p(C | η) ∝ det(C)^(η − 1), where C is a correlation matrix and η > 0

Specifying prior distributions over correlation matrices is one of the more challenging tasks in multivariate Bayesian modeling. A correlation matrix must be symmetric, positive definite, and have ones on the diagonal — constraints that make it impossible to simply place independent priors on each entry. The LKJ distribution, introduced by Daniel Lewandowski, Dorota Kurowicka, and Harry Joe in 2009, provides an elegant solution: a single-parameter family of distributions over the space of valid correlation matrices, with the parameter η controlling how much mass is concentrated near the identity matrix.

The LKJ distribution has become the default prior for correlation matrices in modern probabilistic programming frameworks, particularly Stan, where it is implemented as a first-class distribution. Its popularity stems from its simplicity, interpretability, and computational tractability.

LKJ Density

p(C | η)  ∝  det(C)^(η − 1)

where
C       →  a d × d positive-definite correlation matrix (diagonal entries = 1)
η       →  shape parameter (η > 0)
det(C)  →  determinant of the correlation matrix
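As a quick illustration, the Python sketch below (assuming only NumPy; the function name is illustrative) evaluates the unnormalized log-density (η − 1) · log det(C) for a given correlation matrix. The normalizing constant, which depends only on d and η, is omitted.

import numpy as np

def lkj_logpdf_unnormalized(C, eta):
    # Unnormalized LKJ log-density: (eta - 1) * log det(C).
    # The Cholesky factorization doubles as a positive-definiteness check.
    L = np.linalg.cholesky(C)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return (eta - 1.0) * log_det

C = np.array([[1.0, 0.3],
              [0.3, 1.0]])
print(lkj_logpdf_unnormalized(C, eta=2.0))   # (2 - 1) * log(1 - 0.3**2) ≈ -0.094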

The Shape Parameter η

The single parameter η governs the concentration of the distribution. When η = 1, the LKJ distribution is uniform over the space of valid correlation matrices — all correlation structures are equally likely a priori. This is a natural non-informative choice. When η > 1, the distribution concentrates mass near the identity matrix, favoring weak correlations. The larger η is, the stronger the prior preference for near-independence. When η < 1, the distribution favors correlation matrices with extreme (near ±1) entries — strong correlations are a priori more likely.

In practice, η = 1 (uniform) or η = 2 (mild preference for weak correlations) are the most common choices. In the bivariate case, η = 1 makes the marginal distribution on the off-diagonal correlation exactly uniform on (−1, 1), while η = 2 gives a density proportional to (1 − ρ²) that gently pulls the correlation away from ±1, making it a convenient weakly informative prior.

Marginal Distribution on a Single Correlation (d = 2)

For a 2 × 2 correlation matrix C = [[1, ρ], [ρ, 1]]:

p(ρ | η)  ∝  (1 − ρ²)^(η − 1)

η = 1:    uniform on (−1, 1)
η = 2:    p(ρ) ∝ (1 − ρ²), a concave parabola centered at 0
η = 10:   strongly peaked at ρ = 0
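The normalized form of this marginal is a Beta(η, η) distribution rescaled from (0, 1) to (−1, 1). The short sketch below (assuming NumPy and SciPy; the function name is illustrative) evaluates it for the three values of η listed above.

import numpy as np
from scipy.stats import beta

def lkj_marginal_pdf_d2(rho, eta):
    # d = 2 marginal of the LKJ prior: rho = 2x - 1 with x ~ Beta(eta, eta),
    # so p(rho | eta) = Beta pdf at (rho + 1)/2, times the Jacobian 1/2.
    return beta.pdf((rho + 1.0) / 2.0, eta, eta) / 2.0

rhos = np.array([0.0, 0.5, 0.9])
for eta in (1.0, 2.0, 10.0):
    print(eta, np.round(lkj_marginal_pdf_d2(rhos, eta), 3))
# eta = 1 is flat at 0.5; eta = 10 puts almost all prior mass near rho = 0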

Why Not Inverse-Wishart?

The Inverse-Wishart distribution is the conjugate prior for a covariance matrix in the Normal model, but it couples the correlations and the variances in ways that are often undesirable. The LKJ distribution separates the prior on correlations from the prior on standard deviations, allowing each to be specified independently. In Stan's parameterization, a covariance matrix Σ is decomposed as Σ = diag(σ) · C · diag(σ), where σ is a vector of standard deviations (each given its own prior) and C is the correlation matrix (given an LKJ prior). This decomposition provides far greater flexibility and interpretability than the Inverse-Wishart.
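The decomposition itself is a one-liner. The sketch below (NumPy only; the variable names are illustrative) builds a covariance matrix from a standard-deviation vector and a correlation matrix, mirroring the diag(σ) · C · diag(σ) parameterization described above.

import numpy as np

sigma = np.array([0.5, 2.0, 1.3])            # standard deviations, each with its own prior
C = np.array([[1.0,  0.4,  0.1],
              [0.4,  1.0, -0.3],
              [0.1, -0.3,  1.0]])            # correlation matrix, e.g. given an LKJ prior
Sigma = np.diag(sigma) @ C @ np.diag(sigma)  # covariance matrix: diag(sigma) * C * diag(sigma)
print(np.sqrt(np.diag(Sigma)))               # recovers sigma: [0.5, 2.0, 1.3]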

The Vine Copula Construction

The LKJ distribution is defined through a specific construction based on vine copulas and partial correlations. A d × d correlation matrix has d(d−1)/2 free parameters. The vine construction parameterizes the matrix through a sequence of partial correlations, each of which is marginally Beta-distributed under the LKJ prior. The determinant det(C) can be expressed as a product of powers of (1 − ρ²ᵢⱼ|...) over the partial correlations, and the density det(C)^(η−1) factors accordingly.
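For d = 3 the factorization can be checked directly: det(C) equals (1 − ρ₁₂²)(1 − ρ₁₃²)(1 − ρ₂₃|₁²). A small numerical check, assuming NumPy:

import numpy as np

rho12, rho13, rho23 = 0.5, 0.3, 0.4
C = np.array([[1.0, rho12, rho13],
              [rho12, 1.0, rho23],
              [rho13, rho23, 1.0]])
# Partial correlation of variables 2 and 3 given variable 1
rho23_1 = (rho23 - rho12 * rho13) / np.sqrt((1 - rho12**2) * (1 - rho13**2))
lhs = np.linalg.det(C)
rhs = (1 - rho12**2) * (1 - rho13**2) * (1 - rho23_1**2)
print(lhs, rhs)   # both are 0.62: the determinant factors over the partial correlations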

This construction has two major advantages. First, it guarantees that random draws from the prior are always valid correlation matrices — symmetric and positive definite by construction. Second, it enables efficient sampling via independent draws from Beta distributions, followed by a deterministic transformation to the correlation matrix. The Stan implementation uses this approach.

Vine Construction (Cholesky Factor)

C  =  LLᵀ, where L is a lower-triangular Cholesky factor

The Cholesky factor L is constructed row by row from partial correlations (ρ₂₃|₁ denotes the partial correlation of variables 2 and 3 given variable 1):
L₂₁  =  ρ₁₂
L₃₁  =  ρ₁₃,    L₃₂  =  ρ₂₃|₁ · √(1 − ρ₁₃²)
The diagonal entries L₁₁ = 1, L₂₂ = √(1 − L₂₁²), L₃₃ = √(1 − L₃₁² − L₃₂²) give each row of L unit norm, so C = LLᵀ automatically has ones on its diagonal.
(and so on for higher dimensions)
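The following Python sketch (NumPy only; the function and parameter names are illustrative) implements this construction: partial correlations are drawn from rescaled Beta distributions whose shape depends on the tree level, then packed into the Cholesky factor row by row. The Beta shapes follow the C-vine recipe of the 2009 paper; treat the exact indexing as an assumption to check against a reference implementation.

import numpy as np

def sample_lkj_cholesky(d, eta, rng):
    # Draw the lower-triangular Cholesky factor L of C ~ LKJ(eta).
    # Column j (0-indexed) corresponds to tree level j + 1 of the C-vine;
    # its partial correlations are 2 * Beta(a, a) - 1 with a = eta + (d - 2 - j) / 2.
    L = np.zeros((d, d))
    L[0, 0] = 1.0
    for i in range(1, d):
        remaining = 1.0                        # squared length still available in row i
        for j in range(i):
            a = eta + (d - 2 - j) / 2.0
            z = 2.0 * rng.beta(a, a) - 1.0     # partial correlation in (-1, 1)
            L[i, j] = z * np.sqrt(remaining)
            remaining -= L[i, j] ** 2
        L[i, i] = np.sqrt(remaining)           # unit-norm row => ones on the diagonal of C
    return L

rng = np.random.default_rng(0)
L = sample_lkj_cholesky(4, eta=2.0, rng=rng)
C = L @ L.T                                    # a valid 4 x 4 correlation matrix by construction
print(np.round(C, 2))

Because every draw is assembled from a unit-norm Cholesky factor, symmetry, the unit diagonal, and positive definiteness never need to be checked after the fact.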

Applications in Bayesian Modeling

Multivariate Regression

In multivariate regression with correlated responses, the LKJ prior on the residual correlation matrix provides a principled and flexible specification. The analyst can express a preference for weak or strong residual correlations without coupling this choice to beliefs about the response variances.

Random Effects Models

In mixed-effects models with multiple random slopes and intercepts, the random effects covariance matrix requires a prior. The LKJ decomposition allows the analyst to place separate priors on the random effect standard deviations and their correlations, leading to better-behaved posteriors and more interpretable prior specifications.
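As an illustration of the separate-priors idea in the simplest random-intercept/random-slope case (d = 2), the sketch below simulates group-level effects from the prior. It assumes NumPy, uses an illustrative Exponential(1) prior on the standard deviations, and relies on the fact that for d = 2 an LKJ(η) correlation is simply 2 · Beta(η, η) − 1.

import numpy as np

rng = np.random.default_rng(1)
eta, n_groups = 2.0, 8

sd = rng.exponential(scale=1.0, size=2)        # hypothetical priors on the two sds
rho = 2.0 * rng.beta(eta, eta) - 1.0           # LKJ(eta) correlation, d = 2
C = np.array([[1.0, rho], [rho, 1.0]])
Sigma = np.diag(sd) @ C @ np.diag(sd)          # covariance of (intercept, slope) effects
group_effects = rng.multivariate_normal(np.zeros(2), Sigma, size=n_groups)
print(np.round(group_effects, 2))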

Factor Analysis and Structural Equation Models

Bayesian factor analysis and structural equation models involve latent variable correlations. The LKJ prior provides a natural, weakly informative choice that avoids the excessive shrinkage toward zero correlation that can result from certain Inverse-Wishart parameterizations.

"The LKJ distribution offers a conceptually clean and computationally efficient prior for correlation matrices, finally separating our beliefs about correlations from our beliefs about variances." — Daniel Lewandowski, Dorota Kurowicka, and Harry Joe, Generating Random Correlation Matrices Based on Vines and Extended Onion Method (2009)

Historical Context and Alternatives

~1970s

The Inverse-Wishart distribution becomes the standard conjugate prior for covariance matrices, despite known limitations in coupling correlation and variance components.

2000

Barnard, McCulloch, and Meng propose separating correlations from variances and placing independent priors on each, establishing the conceptual framework the LKJ distribution would later fill.

2009

Lewandowski, Kurowicka, and Joe publish their paper introducing the LKJ distribution and efficient sampling algorithms based on vine copulas and the extended onion method.

2015

Stan implements the LKJ distribution as lkj_corr and lkj_corr_cholesky, making it accessible to applied practitioners and establishing it as the de facto standard prior for correlation matrices.

Practical Considerations

In high dimensions, the space of correlation matrices is vast and complex. A d × d correlation matrix has d(d−1)/2 free parameters, growing quadratically in d. Even with η = 1 (uniform), the typical correlation matrix drawn from the LKJ distribution has all off-diagonal entries near zero when d is large — a geometric consequence of concentration of measure in high dimensions. This means that in high-dimensional settings, the LKJ prior with η = 1 is effectively an informative prior favoring near-independence, not a "flat" prior in any practical sense. Practitioners should be aware of this concentration phenomenon and consider whether it aligns with their actual prior beliefs.
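The effect is easy to quantify using the known marginal of a single off-diagonal entry under LKJ(η) in dimension d: a Beta(α, α) variable rescaled to (−1, 1) with α = η + (d − 2)/2, a standard result from the 2009 paper. The sketch below (NumPy only; the helper name is illustrative) prints the implied prior standard deviation of one correlation as d grows.

import numpy as np

def lkj_marginal_sd(d, eta):
    # Marginal of one off-diagonal entry: r = 2x - 1, x ~ Beta(a, a), a = eta + (d - 2)/2,
    # so Var(r) = 4 * Var(x) = 1 / (2a + 1).
    a = eta + (d - 2) / 2.0
    return np.sqrt(1.0 / (2.0 * a + 1.0))

for d in (2, 5, 20, 100):
    print(d, round(lkj_marginal_sd(d, eta=1.0), 3))
# d = 2: 0.577 (truly uniform on (-1, 1)); d = 100: ~0.10 -- correlations pile up near zero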

Example: Regularizing a Correlation with LKJ

From 20 paired observations, the sample correlation is r = 0.92. Is this genuine strong correlation, or noise? The LKJ prior with different η values provides regularization.

LKJ density at r = 0.92

For a 2 × 2 matrix: f(r | η) ∝ (1 − r²)^(η − 1)

η = 1 (uniform): f ∝ (1 − 0.846)^0 = 1.000 (all correlations equally likely)
η = 2: f ∝ (1 − 0.846)^1 = 0.154
η = 5: f ∝ (1 − 0.846)^4 = 0.00056
η = 10: f ∝ (1 − 0.846)^9 ≈ 0.00000005

Approximate regularized correlation (treating the prior as contributing roughly 2(η − 1) pseudo-observations of zero correlation, so r* ≈ r · n/(n + 2(η − 1))):
  η = 1: r* ≈ 0.92 × 20/(20+0) = 0.920 (no shrinkage)
  η = 2: r* ≈ 0.92 × 20/(20+2) = 0.836
  η = 5: r* ≈ 0.92 × 20/(20+8) = 0.657
  η = 10: r* ≈ 0.92 × 20/(20+18) = 0.484

With η = 1 (uniform over correlation matrices), the LKJ prior is agnostic and the sample correlation stands. As η increases, the prior concentrates mass near r = 0 (the identity matrix), penalizing extreme correlations. At η = 10, the prior is so skeptical of strong correlations that the observed r = 0.92 is shrunk to roughly 0.48. For typical Bayesian regression with Stan, η = 2 is a common default — it mildly discourages correlations near ±1 while remaining broadly permissive, yielding a regularized estimate of about 0.84.
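The numbers above can be reproduced in a few lines of plain Python (no libraries needed). The shrinkage column uses the same pseudo-observation heuristic as the table, not an exact posterior computation.

r, n = 0.92, 20
for eta in (1, 2, 5, 10):
    density = (1 - r**2) ** (eta - 1)        # unnormalized LKJ density at r (d = 2)
    r_star = r * n / (n + 2 * (eta - 1))     # pseudo-observation shrinkage heuristic
    print(f"eta={eta:2d}  density={density:.2e}  r*={r_star:.3f}")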
