Do sequence-to-function models learn cell-type specificity?

Poster presented at the 90th CSHL Symposium: AI in Biology, May 26–31, 2026

Work in progress

This is ongoing work; the benchmark and analysis are being actively refined. I’d love to hear from you if this intersects with your work.

EnhancerDesigner coming soon

EnhancerDesigner (still workshopping the name) will be publicly released soon. If you’re interested and don’t want to wait, get in touch!

Poster

Presented at the 90th CSHL Symposium: AI in Biology, Cold Spring Harbor, May 2026.

CSHL 2026 Poster — AI in Biology


The question

Deep learning models trained to predict chromatin accessibility from DNA sequence are increasingly used to select or design cell-type-specific enhancers. But predicting accessibility and discriminating cell-type specificity are different tasks. A model that accurately reconstructs ATAC-seq signal may still fail to identify which cell type an enhancer actually targets.

We ask: do sequence-to-function models trained on basal ganglia single-cell ATAC-seq data actually learn cell-type specificity, and if so, where in the model does that information live?


The benchmark dataset

We compiled 541 in vivo validated enhancer-AAV sequences targeting the mammalian basal ganglia, drawn from published and preprint sources. Each enhancer is mapped to a harmonised set of ~25 basal ganglia cell types using the HMBA consensus taxonomy.

Evidence quality varies across sources, so the dataset is stratified into tiers (from gold-standard cross-species in vivo validation down to histological proxies), allowing evaluation at different levels of label confidence.
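One way to picture tiered evaluation is as a set of nested subsets, each admitting one more level of evidence. The sketch below is illustrative only: the `Enhancer` record, its field names, the integer tier encoding (1 = strongest evidence), the use of three tiers, and the example entries are all hypothetical and are not taken from the benchmark itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enhancer:
    name: str    # hypothetical identifier
    target: str  # harmonised cell-type label (placeholder values below)
    tier: int    # assumed encoding: 1 = strongest evidence, larger = weaker


def subset_by_tier(records, max_tier):
    """Keep records whose evidence tier is max_tier or stronger."""
    return [r for r in records if r.tier <= max_tier]


# Toy records, invented for illustration only.
records = [
    Enhancer("enh_a", "type_1", 1),
    Enhancer("enh_b", "type_2", 2),
    Enhancer("enh_c", "type_1", 3),
]

# Evaluate on progressively more permissive label sets.
for max_tier in (1, 2, 3):
    kept = subset_by_tier(records, max_tier)
    print(max_tier, [r.name for r in kept])
```

Reporting a metric on each nested subset shows whether a model's apparent performance depends on the lower-confidence labels.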


Models and evaluation

We trained and evaluated two model architectures under four training conditions defined by species (human, macaque, marmoset, and pooled multi-species):

  • DilatedCNN - short-context; predicts per-cell-type accessibility from a single sequence window.
  • Enformer - long-context (~196 kb input); evaluated in a fine-tuned variant (ENCODE-pretrained weights) and a fully retrained variant (brain data only).

Three complementary evaluations:

Task                       | Question                                     | Metric
Prediction accuracy        | Does the model predict ATAC signal?          | Pearson r
Specificity discrimination | Does it rank the right cell type highest?    | AUROC per cell type and averaged
Representation analysis    | Do model embeddings encode cell-type identity? | Logistic regression classifier accuracy
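
As a rough illustration of the first two evaluations above, the sketch below computes Pearson r and a one-vs-rest AUROC per cell type, then averages the AUROCs. It is a minimal, standard-library-only sketch, not the pipeline used for the poster. The function names, the toy scores, the cell-type labels `A` and `B`, and the one-vs-rest setup (enhancers targeting a cell type are positives, all others are negatives, scored by predicted accessibility in that cell type) are assumptions made for illustration. The embedding probe is not covered here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_cell_type_auroc(pred, targets):
    """One-vs-rest AUROC for every cell type that appears as a target."""
    result = {}
    for cell_type in sorted(set(targets)):
        scores = [p[cell_type] for p in pred]
        labels = [int(t == cell_type) for t in targets]
        result[cell_type] = auroc(scores, labels)
    return result


# Toy predictions: one dict of per-cell-type scores per enhancer (invented).
pred = [
    {"A": 0.9, "B": 0.2},
    {"A": 0.7, "B": 0.4},
    {"A": 0.3, "B": 0.8},
    {"A": 0.8, "B": 0.3},  # validated in B, but scored higher in A
]
targets = ["A", "A", "B", "B"]

per_type = per_cell_type_auroc(pred, targets)
macro = sum(per_type.values()) / len(per_type)
print(per_type, macro)
```

On this toy input both per-cell-type AUROCs are 0.75, because the last enhancer is scored higher in the wrong cell type. A model could still correlate well with the measured signal overall while making this kind of ranking error, which is why the two metrics are reported separately.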

Interested in collaborating?

This work is ongoing, and we’re always interested in connecting with others working on synthetic sequence design, sequence-to-function modeling, active learning, and adjacent areas. If any of this resonates with your work or interests, please get in touch.