What Can Epigenomic Foundation Models Reveal About Biology?
Part 2: Modeling DNA Methylation
Our previous post focused on epigenetic foundation models that used DNA sequence (and DNA sequence only!) as input, learning effective nucleotide representations to predict molecular outputs such as RNA expression, open chromatin, and splicing. This post focuses on models that augment the DNA sequence with DNA methylation – one of the many epigenetic layers – to improve overall modeling of cell type and state. Importantly, the models described in this blog differ from previous models in that their encoded representations incorporate the contextual epigenetic clues critical for predicting methylation status at unseen CpG sites and across cell types.
Function-to-Function Models
A quirk of history bifurcated the world of epigenetics into two camps – DNA methylation and chromatin – and, as such, the landscape of foundation models is no different. Indeed, the sequence-to-function models considered in our last blog included AlphaGenome and GENA-LM, which operate at base-pair (or token) resolution. As such, these models could have incorporated methylation state as an output, but neither model included methylation data as an output head, likely due to scope rather than feasibility. This post focuses on a set of foundation models that aim to recapitulate or predict methylation: scMeFormer, scDNAm-GPT, MethylGPT, and CpGPT.
Unlike sequence-to-function models, function-to-function models often take multiple modalities as input whose dimensions may not align; therefore, one must carefully design the network architecture to best leverage these modalities. The models in this overview take different approaches: learning additive representations (enc(DNA) + enc(DNAm)), disjoint representations ((enc(DNA); enc(DNAm))), and combining modalities up front (enc(DNA, DNAm)). Comparing these approaches remains impossible as architectures, datasets, and benchmarks differ across models; instead, we can only summarize how each model performs on its own.
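As a concrete illustration, here is a minimal numpy sketch of the three fusion strategies; the dimensions, "encoders," and joint projection are illustrative stand-ins, not taken from any of the models:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8        # latent dimension (illustrative)
n_cpg = 4    # CpG sites in the window

# Hypothetical per-site encodings of each modality (stand-ins for learned encoders)
enc_dna = rng.normal(size=(n_cpg, d))    # enc(DNA)
enc_meth = rng.normal(size=(n_cpg, d))   # enc(DNAm)

# i) Additive: both encoders must share one latent space
additive = enc_dna + enc_meth                              # shape (n_cpg, d)

# ii) Disjoint: concatenation, doubling the width seen downstream
disjoint = np.concatenate([enc_dna, enc_meth], axis=-1)    # shape (n_cpg, 2*d)

# iii) Up-front: fuse the modalities before further encoding,
#      sketched as a learned joint projection of the concatenated features
w_joint = rng.normal(size=(2 * d, d))
combined = np.concatenate([enc_dna, enc_meth], axis=-1) @ w_joint  # (n_cpg, d)

print(additive.shape, disjoint.shape, combined.shape)
```

The shapes make the tradeoff visible: addition keeps the downstream width at `d`, while concatenation forces every subsequent layer to handle `2*d` features.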
A single-modality output design is a prominent feature of these models compared to their predecessors. Whereas Borzoi or AlphaGenome attempt to predict a broad spectrum of tissue- and cell-specific epigenetic states from DNA sequences, the models described in this post focus only on DNA methylation states. From an engineering standpoint, there is little practical difference between fine-tuning these models or AlphaGenome to a new modality (i.e., given sequence + methylation, predict expression); in both cases, a new output network must be added and fine-tuned. However, one would expect the methylation models to be slightly "over-adapted" to methylation prediction and to perform better when fine-tuning the whole network rather than freezing the weights and training only the new output head.
The potential for partial DNA methylation constitutes the fundamental difference between methylation-aware data and sequence-only data. Across a population of cells, a particular locus may be fully methylated, fully unmethylated, or anything in between, with the average methylation level called the "beta value." DNA sequences do not share this methylation "spectrum", except in the context of somatic mosaicism (usually in cancer samples); however, at the single-cell level, methylation state effectively becomes equivalent to a nucleotide: in humans, only 0, 1, or 2 copies can be methylated, just as only 0, 1, or 2 copies of the genome may harbor a polymorphic allele. It is therefore mildly surprising that none of these models – especially the single-cell ones – investigated expanding the alphabet to include a methylation character (attempted and dismissed?). These models all learn representations of epigenetic state and sequence context, and all utilize a stack of attention-like layers; however, the architectural approaches remain as diverse as the benchmarks they use.
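For intuition, a hypothetical single-cell encoding along these lines might simply treat a methylated cytosine as a fifth letter. This is purely illustrative; none of the reviewed models take this route:

```python
# A hypothetical "methylation-aware alphabet" for single-cell data: since a
# CpG in one cell is either methylated or not, a methylated C could simply
# become a fifth symbol ("M") before tokenization.
def to_methylation_alphabet(seq: str, methylated_positions: set) -> str:
    """Replace C at methylated positions with the extra symbol M."""
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i in methylated_positions:
            out.append("M")
        else:
            out.append(base)
    return "".join(out)

# Single-cell read with a methylated CpG at position 2
print(to_methylation_alphabet("TACGATCGAT", {2}))  # "TAMGATCGAT"
```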
Overview
All methylation state models start with some variant of masked methylation prediction for pre-training, followed by a round of fine-tuning for a specific task. The approaches bear similarities to expression models such as scGPT, GET, scBERT, and cellFM, but strongly leverage CpG genomic ordering. As with expression models, fine-tuning tasks range from donor-level predictions (species, tissue, age) to cell-level predictions (cell type). One could ask questions more directly related to functional engineering, such as the impact of mutations at CpGs, or the predicted impact of targeted demethylation (dCas9-TET1) at nearby CpGs (which may appear in follow-up posts).
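A toy sketch of the shared masked-methylation pre-training objective; the mask rate, sentinel value, and stand-in "model" are illustrative placeholders, not any model's documented settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy beta-value profile for one sample (values in [0, 1])
betas = rng.uniform(size=20)

# Masked methylation prediction: hide a fraction of sites and reconstruct
# them from the visible context (the pre-training task shared by all four
# models; 15% is an illustrative mask rate)
mask_rate = 0.15
mask = rng.uniform(size=betas.shape) < mask_rate
inputs = np.where(mask, -1.0, betas)   # -1 as an out-of-range "masked" sentinel

# A trivial stand-in "model": predict the mean of the visible sites
pred = np.full_like(betas, inputs[~mask].mean())

# The loss is evaluated only at the masked positions
mae = np.abs(pred[mask] - betas[mask]).mean() if mask.any() else 0.0
print(f"masked sites: {mask.sum()}, MAE: {mae:.3f}")
```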
Each of the foundation models highlights a different aspect. CpGPT highlights robust performance on imputation and DNA methylation-age prediction, and extension of the chain-of-thought to imputation tasks. MethylGPT focuses on the effectiveness of methylation for predicting disease and treatment response. For single-cell models, scDNAmGPT highlights the MAMBA architecture's training speed, pseudotime inference, and clustering performance, whereas scMeFormer highlights non-degradation in clustering performance with downsampling, heritability enrichment, and imputation-driven DMR inference.
Inputs
| Model | Input Type | Input Window | Raw Encoding | Seq. Encoding | Latent Resolution |
|---|---|---|---|---|---|
| CpGPT | CpG window + flanking DNA sequence; global position; bulk | 10K CpGs; 2K flanking regions (each) | DNA: As input to pre-trained sequence model; methylation: Beta | DNA: Output from pre-trained model (DNABert2/HyenaDNA); Methylation: MLP(beta) | Same as pre-trained DNA model (768 for DNABERT2) |
| MethylGPT | CpG window; bulk | 50K CpGs | CpG identifier, methylation beta value | Identifier: Encoder; Methylation: MLP(beta) | 64 |
| scDNAmGPT | Multiple CpG; DNA id; single cell | Up to 20M CpGs | Each CpG: 3 bp flanking DNA plus binary methylation | Encoder | 128 |
| scMeFormer | Multiple CpG; flanking DNA; single cell | 2 KB, 100 CpGs | DNA: one-hot, CpG: pseudobulk beta values | 3x CNN [DNA]; 3x CNN [Methyl] | 256 for DNA, 256 for Methyl |
These models use two major strategies for representing the input data: i) include the DNA sequence within a contextual window [CpGPT, scMeFormer]; or ii) encode CpG identifiers themselves [MethylGPT, scDNAmGPT]. In fact, scDNAmGPT is a slight hybrid: the CpG "identifier" is 6 bp of flanking DNA sequence itself, for a total of 4^6 = 4096 unique IDs. The question then arises: given a latent vector of dimension 2^k that encodes the DNA sequence (or CpG ID), how do we combine it with a methylation value of dimension 1? A common strategy is to use a small fully-connected network to "inject" a [0,1] floating-point beta value into the same dimension as the DNA/CpG model. Both MethylGPT and CpGPT employ this trick, while scMeFormer uses a convolutional neural network (CNN) instead, and scDNAmGPT incorporates a binary 0/1 methylation value at each site, resulting in a full "token" of the form (0, TATTGT). This latter "sleight of hand" makes scDNAmGPT the only method that employs strictly binary methylation information.
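A minimal sketch of the beta-"injection" trick, assuming a small ReLU MLP; the hidden width and activation are assumptions, and CpGPT's and MethylGPT's exact designs may differ:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16   # latent width of the DNA/CpG encoder (illustrative)

# A tiny fully-connected "injection" network lifting a scalar beta in [0, 1]
# into the encoder's latent dimension
w1, b1 = rng.normal(size=(1, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, d)), np.zeros(d)

def inject_beta(beta: float) -> np.ndarray:
    h = np.maximum(beta * w1[0] + b1, 0.0)   # ReLU hidden layer
    return h @ w2 + b2                        # shape (d,)

cpg_embedding = rng.normal(size=d)            # stand-in encoder output
token = cpg_embedding + inject_beta(0.73)     # additive fusion, shape (d,)
print(token.shape)
```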
While the input windows of these models vary, CpGPT incorporates global positional encoding, which may enable "memorization" of methylation states, whereas MethylGPT operates on a fixed genome-wide set of 49,157 CpGs, with each index uniquely corresponding to a single CpG. ScDNAmGPT takes a related approach, in which multiple word tokens (methylation, flanking sequence) build up a CpG "sentence": (1, ATTACT), (0, TGGCTC), .... MethylGPT also fits into this sentence framework, except that each "word" is a CpG identifier (cgXXXXX) and each "sentence" is 49,157 words long. Together, these models present a fascinating exploration of different methods for encoding methylation.
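The scDNAmGPT-style vocabulary can be made concrete with a little arithmetic; the specific ID assignment below is an illustrative assumption, not the model's documented scheme:

```python
# scDNAmGPT-style tokens pair a binary methylation state with 6 bp of flanking
# sequence, giving 4**6 = 4096 sequence IDs (8192 tokens once the methylation
# bit is included).
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def cpg_token_id(methylated: int, flank: str) -> int:
    assert len(flank) == 6 and methylated in (0, 1)
    seq_id = 0
    for base in flank:                 # interpret the 6-mer as a base-4 number
        seq_id = seq_id * 4 + BASE[base]
    return methylated * 4096 + seq_id  # methylation bit selects the half

sentence = [(1, "ATTACT"), (0, "TGGCTC")]     # a two-CpG "sentence"
print([cpg_token_id(m, f) for m, f in sentence])  # [5063, 3741]
```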
Architecture
| Model | Encoding | Transformer Block | Output Heads | Fine-Tuning Head |
|---|---|---|---|---|
| CpGPT | Additive (DNA + methyl) | 32x Transformer++ (16 head) | 2 (methylation, certainty) | 3-layer MLP |
| MethylGPT | Additive (ID + methyl) | 6x Transformer (4 head) | 1 (methylation) | Linear |
| scDNAmGPT | Convolution (1x Conv1d) | 8x Mamba SSM + 1x cross-attn | 2 (methylation, unmethylation) | 1-layer MLP |
| scMeFormer | Concatenation (after transformer) | 8x MHA (DNA); 8x MHA (Methyl) | 4357 (1 per cell) | 1-layer MLP |
Building joint representations of both DNA methylation and DNA sequence/CpG information requires combining the initially independent encodings of these modalities. CpGPT and MethylGPT add the two representations together (enc(DNA, methyl) = enc(DNA) + enc(methyl)), whereas scMeFormer concatenates them (enc(DNA, methyl) = (enc(DNA), enc(Methyl))). The key tradeoff between these standard approaches: concatenation requires additional model parameters downstream, while addition requires learning a non-interfering pair of encodings that share one latent space.
All models utilize attention to some extent; scDNAmGPT employs only a single true attention layer, opting instead for a state-space model (MAMBA) for most of its stack, with a single final cross-attention to "aggregate" a windowed representation. MethylGPT, as the most "lightweight" model, employs "only" 6 layers of a 4-head attention stack, though the input size makes the attention mechanism costly.
scMeFormer has the most unusual output design (one output head per cell): each cell is treated as its own output, rather than learning a network suitable for "arbitrary" cells (or even "arbitrary" cell orderings; i.e., given depth and cell metadata as inputs).
CpGPT comes closest to reflecting methylation as a population statistic by providing both an estimate and a confidence score. While not explicit, this idea should also appear in scMeFormer in the final representation, as predicting several thousand outputs will reveal apparent differences between highly variable and low-variability CpGs (when controlling for mean methylation).
Training
| Model | Pre-training Data | Pre-Training Scheme | Fine-Tuning Data | Fine-Tuning |
|---|---|---|---|---|
| CpGPT | 2042 studies, 155,151 human samples (CpGCorpus); array data of varying sizes | Masked-site prediction (MAE; weight=10) with auxiliary losses: site variance (weight=1), earth-mover-distance to target (weight=1), embedding normality KLD (weight=1) and scGPT contrastive loss (weight=1) | Mortality; biomarkers of aging; RRBS Atlas; sciMETv3; mammalian data | Age prediction; cell type prediction; species prediction |
| MethylGPT | EWAS Data Hub; ClockBase; 154,063 human samples; array data of varying sizes | Masked-site prediction (MSE); methylation profile prediction | ClockBase age; Generation Scotland Cohort (disease) | Age prediction; disease prediction |
| scDNAmGPT | Human/Mouse scDNAm and bulk DNAm datasets (~900K single cells, ~50K bulk samples); sequencing data | Masked token prediction (beta-weighted cross-entropy) | scDNAm from embryonic cells (human/mouse); CRC tumors | Cell type prediction |
| scMeFormer | snMethyl data only (snmC-seq n=2784, snmC-seq2 n=3072, sn-m3C-seq n=4237, snmCAT-seq n=4357) | Weighted MSE reconstruction loss, higher weight for low-level methylation sites | Mouse embryo snMethyl data (scNMT-seq n=1105) | Weighted MSE reconstruction |
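A sketch of a weighted MSE reconstruction loss in the spirit of scMeFormer's, assuming a simple inverse-beta weighting; the paper's exact weighting scheme is not reproduced here:

```python
import numpy as np

# Up-weight low-methylation sites so the same absolute error costs more there
# (the inverse-of-target weighting with a clipping floor is an illustrative
# assumption, not scMeFormer's documented formula).
def weighted_mse(pred: np.ndarray, target: np.ndarray, floor: float = 0.1) -> float:
    weights = 1.0 / np.clip(target, floor, None)   # low beta -> high weight
    return float(np.mean(weights * (pred - target) ** 2))

target = np.array([0.05, 0.5, 0.95])
pred = np.array([0.15, 0.6, 0.85])
# All three sites are off by 0.1, but the low-methylation site dominates
print(weighted_mse(pred, target))
```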
Despite the explosion of whole-genome bisulfite sequencing (WGBS) data, the bulk models CpGPT and MethylGPT use only array data for training, whereas scDNAmGPT and scMeFormer use only single-cell bisulfite sequencing data, leaving much of the real-world data untouched. While referencing WGBS data, scDNAmGPT treats bulk data from purified cell populations as surrogate single-cell data, presumably by binarizing CpGs stochastically (according to some local correlation structure?). However, no information was provided on generating binarized training data from the bulk datasets.
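The simplest version of such a stochastic binarization would be an independent Bernoulli draw per CpG; since scDNAmGPT does not document its procedure, the following is purely a guess (and it ignores the local correlation structure between neighboring CpGs):

```python
import numpy as np

rng = np.random.default_rng(3)

# One plausible way to turn bulk beta values into surrogate single-cell
# training data: sample a binary methylation call per CpG from a Bernoulli
# distribution whose rate is the bulk beta.
bulk_betas = np.array([0.05, 0.30, 0.50, 0.90, 0.99])
pseudo_cells = rng.uniform(size=(1000, bulk_betas.size)) < bulk_betas  # (cells, CpGs)

# Averaging the binary calls recovers the bulk betas in expectation
print(pseudo_cells.mean(axis=0).round(2))
```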
CpGPT possesses (by far) the most complex loss function, combining reconstruction MAE with Gaussian-distribution-inducing divergence penalties (on the latent space), sample-similarity-clumping-inducing contrastive penalties (on the latent space), and earth-mover-distance distribution-matching penalties (on the predicted space). In all, CpGPT attempts to (i) predict the methylation mean and variance, (ii) use approximately normally distributed latent features, (iii) ensure the predicted value distribution for one CpG across multiple samples matches the empirical observed distribution, and (iv) ensure that latent features for similar samples remain similar to each other. The other, far simpler, loss functions employ weighted mean-squared error (MSE) or mean absolute error (MAE).
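A stripped-down sketch of how such a composite loss might be assembled, using the MAE weight of 10 and unit weights from the table above; the KLD and EMD terms are textbook simplifications, and the contrastive and variance terms are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)

def mae(pred, target):
    return np.abs(pred - target).mean()

def kld_to_standard_normal(z):
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions
    mu, var = z.mean(axis=0), z.var(axis=0)
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

def emd_1d(pred, target):
    # 1-D earth mover's distance between two equal-size empirical distributions
    return np.abs(np.sort(pred) - np.sort(target)).mean()

# Toy tensors standing in for predictions and latents
target = rng.uniform(size=32)          # observed betas for one CpG across samples
pred = np.clip(target + rng.normal(scale=0.05, size=32), 0, 1)
latent = rng.normal(size=(32, 8))      # sample embeddings

# Reconstruction MAE x10, remaining terms x1 (per the table above)
loss = 10 * mae(pred, target) + emd_1d(pred, target) + kld_to_standard_normal(latent)
print(f"total loss: {loss:.3f}")
```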
For fine-tuning, tasks generally involve predicting biological states (i.e., tissue, cell type, disease, age, or species). The code repository for scDNAmGPT includes a fine-tuned gene expression model (scDNAmGPTForPredictscRNAseq) and an example of how to load it, but provides no details about the dataset(s) used for fine-tuning.
MethylGPT and scDNAmGPT take the full CpG context from a single sample as input, providing a single unambiguous embedding for each sample; in contrast, the windowed approaches of CpGPT and scMeFormer yield multiple embeddings (one per window) for individual samples (or cells). scMeFormer does not attempt any sample-level benchmarks and therefore sidesteps this problem; however, CpGPT has benchmarks for age prediction and cancer/normal prediction, and therefore predictions from each CpG and window must be combined in some fashion, with CpGPT adding an additional attention mechanism for extracting sample-level information.
Benchmarks
The lack of diverse benchmark tasks for evaluation remains a significant limitation across almost all biological foundation models, making comparisons particularly difficult, even when each model provides a set of evaluations. CpGPT and MethylGPT lack comparisons with other methods, comparing only to a "mean" baseline for pre-training and to simpler machine learning methods for fine-tuning. ScDNAmGPT mostly compares to ablations of its own architecture (swapping attention layers for MAMBA or eliminating the cross-attention layer); where it does compare to a prior model, CpG Transformer, it performs near-equivalently but at far higher scalability, as CpG Transformer requires a full (n_cell x n_cpg) matrix as input.
In terms of the metrics themselves: CpGPT (Imputation: 3% MAE (seen), 7% MAE (unseen); Age prediction: MAE 2.02) performs well compared to MethylGPT (Imputation: 7%; Age prediction: MAE 4.5), though the comparison involves vastly different numbers of CpGs. CpGPT performs similarly to a regular MLP for tumor classification (AUPRC=0.968 vs 0.972), and MethylGPT highlights an AUC of 0.72 for disease prediction (no baseline comparison).
The scDNAmGPT model benchmarks cell classification (using <CLS> cross-attention versus gold-standard labels), achieving an F1 of 96.1 in brain and 81-91 in other tissues. Using a purely synthetic bulk deconvolution approach, the r to the actual cell-type proportions ranged from 0.79 to 0.93. Using a small number (~600) of cancer cells with both scRNA and scDNAm data, scDNAmGPT fine-tunes the network to predict expression of highly variable genes (e.g., those most likely to be differentially expressed across lesion types), achieving r=0.95, though there are no "hold-out" genes. Finally, scMeFormer considers only imputation-focused tasks as benchmarks, particularly the ability to produce single-cell clusters under increased (downsampled) data missingness, with adjusted Rand indices (clustering scores) of 0.5-0.8 when downsampling to as few as 5% of CpGs.
The Critical Role of Chromatin in the Epigenetic Context
Compared to the sequence-to-function models from our previous blog, the models discussed here have far more limited capabilities, partially due to structural concerns (many models focus on CpGs rather than on sequence plus methylation) but also driven by the decision to focus exclusively on DNA methylation. While these methylation models could be fine-tuned for sequence-function benchmarks (by using a new output head), they would (likely) lack the same degree of predictive capability as Borzoi or AlphaGenome, as their pre-training did not consider multi-dimensional cell states.
Protein binding – histones, insulators, CTCF, remodelers – remains a critical driver of actual expression and regulatory network maintenance. While scDNAmGPT does provide some basic methylation-to-expression prediction (scDNAm_to_expression.ipynb) with a cross-gene r2 of 0.95, there exists no comparison to the simple cluster/lesion baseline. DNA methylation correlates with both decreased and increased expression, indicating that its role in cell-state maintenance and fate determination remains conditional on the activity of other systems.
The dynamics of de novo DNA methylation also suggest a chromatin-first view. DNMT3A contacts DNA via a PWWP domain, targeting H3K36me2 (added by NSD2) and H3K36me3 (added by SETD2); as such, the population-average beta value tracks the combined K36me2 and K36me3 signals very strongly. Additionally, DNMT3A1/H2AK119ub association functions to induce de novo DNA methylation, while TET2 association with H3K4me2 induces DNA demethylation. Due to this crosstalk between chromatin and DNA methylation, both chromatin profiling (beyond methylation) and cellular resolution remain critical for epigenetic foundation models, providing essential context for methylation dynamics.
Future models will need to build on single-cell data to predict not merely population averages of epigenetic and transcriptomic states, but modes that correspond to attractors in the system dynamics, and the dynamics themselves. Importantly, our next suite of technologies combines Paired-Tag technology with methylation profiling, providing exceptionally rich cellular data for exploring state transition dynamics.