Bayesian Statistics

Genomics & Bioinformatics

Bayesian methods underpin the analysis of genomic data at nearly every level — from calling individual variants in DNA sequences to reconstructing the evolutionary tree of life and identifying differentially expressed genes.

P(Genotype | Reads) ∝ P(Reads | Genotype) · P(Genotype)

The genomics revolution generates data at a scale and noise level that demands probabilistic reasoning. A single whole-genome sequencing run produces billions of short reads, each subject to sequencing errors, alignment ambiguities, and coverage variation. Bayesian inference provides the natural framework for integrating noisy observations with prior knowledge about genome structure, mutation rates, and population genetics to make principled calls about genotypes, variants, and biological function.

Variant Calling

The task of determining whether a given position in a genome differs from the reference sequence is fundamentally a Bayesian inference problem. Tools like GATK (Genome Analysis Toolkit) compute the posterior probability of each possible genotype given the observed sequencing reads, base quality scores, and prior expectations about how often variants occur. The prior can encode allele frequencies from population resources such as gnomAD, while the likelihood combines the evidence from every read that overlaps the position.

Bayesian Variant Calling

P(G = g | D) = P(D | G = g) · P(G = g) / Σ_g' [ P(D | G = g') · P(G = g') ]

where G is the genotype, D is the read data, and the sum runs over all possible genotypes.
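The formula above can be made concrete with a minimal sketch for a single biallelic site in a diploid sample. Everything here is an illustrative assumption rather than how GATK or any other caller is implemented: the function name `genotype_posteriors`, the Hardy-Weinberg prior built from a hypothetical `alt_freq`, the treatment of reads as independent, and the simple error model in which a miscalled base is equally likely to be any of the three other bases.

```python
import math

def genotype_posteriors(bases, quals, ref, alt, alt_freq=0.001):
    """Posterior over genotypes 0, 1, 2 (number of alt alleles) at one site.

    bases: observed base calls from the reads covering the site
    quals: Phred base quality for each call
    alt_freq: assumed population frequency of the alt allele (hypothetical value)
    """
    # Prior P(G = g): Hardy-Weinberg proportions for the assumed allele frequency.
    q = alt_freq
    prior = {0: (1 - q) ** 2, 1: 2 * q * (1 - q), 2: q ** 2}

    log_post = {}
    for g, p_g in prior.items():
        alt_fraction = g / 2  # chance a read comes from an alt-carrying chromosome
        log_lik = 0.0
        for base, qual in zip(bases, quals):
            err = 10 ** (-qual / 10)  # Phred score -> error probability
            # P(observed base | true allele): 1 - err on a match, err / 3 otherwise.
            p_if_ref = 1 - err if base == ref else err / 3
            p_if_alt = 1 - err if base == alt else err / 3
            # Likelihood P(D | G = g) is a product over reads (sum of logs).
            log_lik += math.log((1 - alt_fraction) * p_if_ref
                                + alt_fraction * p_if_alt)
        log_post[g] = log_lik + math.log(p_g)

    # Normalise over all genotypes (the denominator), working in log space.
    top = max(log_post.values())
    unnorm = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}

# Hypothetical pileup: eight reads at a site with reference base A, four showing G.
post = genotype_posteriors("AAAAGGGG", [30] * 8, ref="A", alt="G")
best = max(post, key=post.get)  # genotype with the highest posterior
```

With an even split of high-quality reads, the heterozygous genotype dominates the posterior despite its small prior, because both homozygous genotypes would require four independent sequencing errors.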

Structural variant detection, copy number analysis, and somatic mutation calling in cancer all employ similar Bayesian frameworks, often with hierarchical models that share information across samples or genomic regions.

Phylogenetics and Molecular Evolution

Bayesian phylogenetics, implemented in software like MrBayes and BEAST, has become one of the most widely used approaches for inferring evolutionary relationships from molecular sequence data. The method places priors on tree topology, branch lengths, and substitution model parameters, then uses MCMC to sample from the posterior distribution over trees. This naturally provides uncertainty quantification: the posterior probability of a clade, estimated as the fraction of sampled trees that contain it, takes the place of the bootstrap support values used with maximum likelihood methods.
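As a rough illustration of how clade support falls out of an MCMC run, the sketch below counts how often each clade appears in a set of sampled trees. It assumes the sampler has already been run and burn-in discarded, and it represents trees as nested tuples of taxon names, a toy format chosen for this example. The helper names `clades_of` and `clade_posteriors` are hypothetical and do not correspond to functions in MrBayes or BEAST.

```python
from collections import Counter

def clades_of(tree):
    """Return (leaf set, set of clades) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):  # a leaf is just a taxon name
        return frozenset([tree]), set()
    leaves, clades = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades_of(child)
        leaves |= child_leaves
        clades |= child_clades
    clades.add(leaves)  # every internal node defines one clade
    return leaves, clades

def clade_posteriors(sampled_trees):
    """Estimate each clade's posterior as its frequency among the sampled trees."""
    counts = Counter()
    for tree in sampled_trees:
        _, clades = clades_of(tree)
        counts.update(clades)
    n = len(sampled_trees)
    return {clade: count / n for clade, count in counts.items()}

# Four hypothetical posterior samples over taxa A, B, C, D.
samples = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
]
support = clade_posteriors(samples)
ab_support = support[frozenset({"A", "B"})]  # 3 of 4 samples contain (A, B): 0.75
```

Real analyses draw thousands of samples and summarise them on a consensus or maximum clade credibility tree, but the underlying estimate is this same frequency.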

The BEAST Revolution

BEAST (Bayesian Evolutionary Analysis Sampling Trees), developed by Alexei Drummond and Andrew Rambaut, unified phylogenetic inference with molecular clock dating and demographic reconstruction in a single Bayesian framework. It has been cited over 30,000 times and was instrumental in tracking the evolution and spread of HIV, Ebola, Zika, and SARS-CoV-2.

Gene Expression and Differential Analysis

RNA-seq analysis of gene expression benefits enormously from Bayesian shrinkage. With thousands of genes and often only a few biological replicates per condition, naive gene-by-gene testing yields unreliable variance estimates. Hierarchical models, as implemented in tools like DESeq2 and limma-voom, share information about variance across genes, shrinking extreme per-gene estimates toward a prior that is itself estimated from the data as a whole. This empirical Bayes approach improves the ranking of differentially expressed genes and makes downstream false discovery rate control more reliable.
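The shrinkage idea can be sketched with the moderated variance used in limma-style models, where each gene's sample variance is averaged with a prior variance, weighted by their degrees of freedom. In the sketch below the prior parameters `prior_df` and `prior_var` are simply supplied as hypothetical values; in practice they are estimated from all genes jointly, and DESeq2 applies the analogous idea to negative binomial dispersions rather than to variances. The function name `moderated_variances` is an assumption of this example.

```python
from statistics import variance

def moderated_variances(expr_by_gene, prior_df, prior_var):
    """Shrink each gene's sample variance toward a shared prior variance.

    Weighted average: (prior_df * prior_var + d * s2) / (prior_df + d),
    where s2 is the gene's sample variance on d degrees of freedom.
    """
    shrunk = {}
    for gene, values in expr_by_gene.items():
        d = len(values) - 1   # residual degrees of freedom for one group
        s2 = variance(values)  # unbiased per-gene sample variance
        shrunk[gene] = (prior_df * prior_var + d * s2) / (prior_df + d)
    return shrunk

# Hypothetical log-expression values, three replicates per gene.
expr = {
    "noisy_gene": [1.0, 2.0, 3.0],   # sample variance 1.0
    "quiet_gene": [5.0, 5.1, 4.9],   # sample variance about 0.01
}
shrunk = moderated_variances(expr, prior_df=4.0, prior_var=0.5)
```

Both estimates move toward the prior value of 0.5: the noisy gene's variance is pulled down and the quiet gene's is pulled up, which keeps genes with accidentally tiny variances from dominating the ranking.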

Genome-Wide Association Studies

GWAS tests millions of genetic variants for association with a trait. Because nearby variants are correlated through linkage disequilibrium, a significant association rarely pinpoints the causal variant on its own. Bayesian fine-mapping methods like FINEMAP and SuSiE compute posterior inclusion probabilities for each variant, estimating the probability that each SNP is a causal variant rather than merely correlated with one. Bayesian sparse regression addresses the multiple testing burden through priors that expect most variants to have no effect, rather than through after-the-fact correction of p-values, and naturally produces credible sets of likely causal variants.
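A simplified version of posterior inclusion probabilities and credible sets can be sketched under the assumption that a region contains exactly one causal variant and that every variant is equally likely a priori. This is deliberately much simpler than FINEMAP or SuSiE, which allow several causal variants and model linkage disequilibrium explicitly. The sketch uses Wakefield's approximate Bayes factor computed from per-variant effect estimates and standard errors; the function names and the effect-size prior `prior_sd=0.2` are hypothetical choices for illustration.

```python
import math

def log_abf(beta, se, prior_sd=0.2):
    """Log approximate Bayes factor (association vs. null) for one variant."""
    v, w = se ** 2, prior_sd ** 2  # sampling variance and prior effect variance
    z = beta / se
    return 0.5 * math.log(v / (v + w)) + 0.5 * z * z * w / (v + w)

def pips_single_causal(betas, ses, prior_sd=0.2):
    """Posterior inclusion probabilities assuming exactly one causal variant."""
    log_bfs = [log_abf(b, s, prior_sd) for b, s in zip(betas, ses)]
    top = max(log_bfs)  # subtract the maximum for numerical stability
    weights = [math.exp(lb - top) for lb in log_bfs]
    total = sum(weights)
    return [wt / total for wt in weights]

def credible_set(pips, coverage=0.95):
    """Smallest set of variant indices whose PIPs sum to at least `coverage`."""
    order = sorted(range(len(pips)), key=lambda i: pips[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += pips[i]
        if cumulative >= coverage:
            break
    return chosen

# Hypothetical summary statistics for three variants in one region.
betas = [0.30, 0.28, 0.05]
ses = [0.05, 0.05, 0.05]
pips = pips_single_causal(betas, ses)
cs95 = credible_set(pips)  # indices of the variants in the 95% credible set
```

Here the two strongly associated variants share nearly all of the posterior mass, so the credible set contains both: the data cannot separate them, and the output says so directly.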

"In genomics, the number of hypotheses vastly exceeds the number of observations. Bayesian methods impose the structure and regularization needed to extract signal from this inherently underdetermined problem." — Matthew Stephens, on Bayesian approaches to genomic inference

Current Frontiers

Single-cell genomics generates sparse, high-dimensional data ideally suited to Bayesian factor models and topic models. Bayesian neural networks are being applied to protein structure prediction and variant effect prediction. And Bayesian integration of multi-omics data — combining genomics, transcriptomics, proteomics, and metabolomics — promises a more complete picture of biological systems through principled uncertainty propagation.
