DNABERT-S

Weights and tokenizer for DNABERT-S (Zhou et al., arXiv 2024), loaded with the shared MosaicBERT implementation from Taykhoom/MosaicBERT-updated.

DNABERT-S is a species-aware DNA embedding model fine-tuned from DNABERT-2 using curriculum contrastive learning. Its embeddings cluster sequences from the same species and separate sequences from different species, which enables species identification, metagenomics binning, and evolutionary analysis.

This repo contains only weights and tokenizer files. The model code is loaded automatically from Taykhoom/MosaicBERT-updated via trust_remote_code=True.

Architecture

Parameter	Value
Layers	12
Attention heads	12
Embedding dimension	768
Intermediate size	3072
Vocabulary size	4096 (BPE, identical to DNABERT-2)
Positional encoding	ALiBi (no hard length limit)
Max sequence length	~10000 nt (practical; ALiBi resizes dynamically)
Parameters	~110M (backbone only, no MLM head)
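
The ALiBi entry in the table is what removes the need for a fixed position-embedding table: each head adds a linear distance penalty to its attention scores, and that penalty can be rebuilt for any sequence length. The sketch below follows the head-slope recipe from the ALiBi paper and uses a symmetric |i - j| distance, as is usual for bidirectional encoders. It is an illustration only: the function names are hypothetical, and it is an assumption that the loaded model code computes its bias exactly this way.

```python
import math

def alibi_slopes(n_heads):
    """Per-head slopes following the ALiBi paper's recipe (a geometric sequence)."""
    def power_of_two_slopes(n):
        start = 2.0 ** (-(2.0 ** -(math.log2(n) - 3)))
        return [start ** (i + 1) for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_two_slopes(n_heads)
    # Non-power-of-two head counts (such as 12): take the slopes for the
    # largest power of two below n_heads, then fill the remainder with
    # every other slope from the next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_two_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra

def alibi_bias(seq_len, n_heads=12):
    """bias[h][i][j] = -slope_h * |i - j|, to be added to attention scores."""
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(n_heads)
    ]

# The same function serves any length; nothing is tied to a maximum position.
short_bias = alibi_bias(seq_len=4)
long_bias = alibi_bias(seq_len=64)
print(len(short_bias), len(short_bias[0]), len(long_bias[0]))
```

Because the bias depends only on token distance, longer inputs cost more memory and compute but need no retraining of position parameters, which is why the length limit in the table is practical rather than hard.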

Tokenization

Uses Byte Pair Encoding (BPE) tokenization via PreTrainedTokenizerFast, with a vocabulary identical to that of DNABERT-2. No k-mer pre-processing is required; raw DNA strings are passed to the tokenizer directly.
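
To make the contrast with the k-mer DNABERT models concrete, the sketch below shows the kind of pre-processing such models typically expect (overlapping k-mers joined by spaces), which this model does not need. `seq_to_kmers` is a hypothetical helper written for illustration, and the stride-1 windowing is an assumption about the k-mer variants rather than something stated in this card.

```python
def seq_to_kmers(sequence, k):
    """Split a DNA string into overlapping k-mers (stride 1), space-separated."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

raw = "ACGTAGCATCGG"

# What a k-mer model would be fed (hypothetical pre-processing step).
kmer_input = seq_to_kmers(raw, k=6)
print(kmer_input)

# What DNABERT-S is fed: the raw string, unchanged. The BPE tokenizer
# chooses the variable-length token boundaries itself.
bpe_input = raw
print(bpe_input)
```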

Pretraining

Objective: Curriculum contrastive learning (same-species pairs with i-Mix)
Initialization: Fine-tuned from zhihan1996/DNABERT-2-117M
Source checkpoint: pytorch_model.bin from zhihan1996/DNABERT-S
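
As a rough illustration of the contrastive objective listed above, the sketch below computes a plain InfoNCE-style loss over a batch of (anchor, positive) embedding pairs, where each pair would come from the same species and the other rows in the batch act as negatives. It omits the curriculum schedule and the i-Mix step, so it should not be read as the paper's exact objective. The function names, toy vectors, and temperature value are all hypothetical.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.05):
    """Mean cross-entropy of picking each anchor's own positive out of the batch."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor to every positive, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, pos)) / temperature for pos in p]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

# Toy 2-d embeddings: row i of `anchors` and row i of `matched` share a species.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [[0.1, 0.9], [0.9, 0.1]]

print(info_nce(anchors, matched))   # small: same-species pairs are already close
print(info_nce(anchors, shuffled))  # large: pairs point in different directions
```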

Parity Verification

Hidden-state representations were verified to be identical (max abs diff = 0.00) to the original implementation at all 13 representation levels (embedding output plus 12 transformer layers). The SDPA path was also verified (max abs diff < 1e-4). Verification was run on GPU with PyTorch 2.7 / CUDA 12.9.
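
A check of this kind can be sketched as follows: collect the hidden states from both implementations for the same input, then take the largest elementwise absolute difference at each of the 13 levels. The helper below is hypothetical and works on plain Python lists so that it stays self-contained; with real tensors, each level would first be flattened (for example with `.flatten().tolist()`). It is not the script used to produce the numbers above.

```python
def max_abs_diff_per_level(reference, candidate):
    """Return max |ref - cand| for each level (one flat list of floats per level)."""
    if len(reference) != len(candidate):
        raise ValueError("both models must expose the same number of levels")
    diffs = []
    for ref_level, cand_level in zip(reference, candidate):
        if len(ref_level) != len(cand_level):
            raise ValueError("level shapes differ")
        diffs.append(max(abs(r - c) for r, c in zip(ref_level, cand_level)))
    return diffs

# Toy stand-ins for 13 levels (embedding output + 12 layers), 4 values each.
reference = [[0.1 * level + 0.01 * i for i in range(4)] for level in range(13)]
exact_port = [list(level) for level in reference]
approx_port = [[v + 5e-5 for v in level] for level in reference]

# An exact port reproduces every value; an approximate path stays within tolerance.
assert max(max_abs_diff_per_level(reference, exact_port)) == 0.0
assert max(max_abs_diff_per_level(reference, approx_port)) < 1e-4
```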

Related Models

See the full DNABERT collection.

Model	Architecture	Notes
DNABERT-3mer	BERT + k-mer	k=3
DNABERT-4mer	BERT + k-mer	k=4
DNABERT-5mer	BERT + k-mer	k=5
DNABERT-6mer	BERT + k-mer	k=6
DNABERT-2	MosaicBERT + BPE + ALiBi	Pre-trained
DNABERT-S	MosaicBERT + BPE + ALiBi	This model

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True)
model.eval()

sequences = ["ACGTAGCATCGGATCTATCTATCGACACTTGG", "ATCGATCGATCGATCG"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

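cls_emb  = out.last_hidden_state[:, 0, :]   # (batch, 768)

# Mean pooling over real tokens only; padded positions are masked out
mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)  # (batch, seq, 1)
mean_emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # (batch, 768)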
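
Once embeddings such as `mean_emb` are available, a common next step for species identification or binning is to compare them by cosine similarity. The sketch below assigns a query embedding to the closest of a few reference centroids. The vectors, species labels, and the `assign_species` helper are all made up for illustration (3 dimensions instead of 768); in practice the rows would come from the model output, for example via `mean_emb.tolist()`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_species(query, centroids):
    """Return (label, similarity) of the centroid most similar to `query`."""
    scored = ((label, cosine(query, centroid)) for label, centroid in centroids.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical per-species centroids, e.g. the mean embedding of known reads.
centroids = {
    "species_A": [0.9, 0.1, 0.0],
    "species_B": [0.0, 0.2, 0.95],
}
query = [0.8, 0.2, 0.1]

label, score = assign_species(query, centroids)
print(label, round(score, 3))
```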

Attention implementation

# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2
model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True,
                                   attn_implementation="flash_attention_2",
                                   torch_dtype=torch.bfloat16)
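
Flash Attention 2 needs both a CUDA device and the flash-attn package, so a loader may want to fall back to SDPA when either is missing. A minimal sketch of that choice follows; `pick_attn_implementation` is a hypothetical helper, not part of transformers or of the model code.

```python
import importlib.util

def pick_attn_implementation(cuda_available):
    """Prefer Flash Attention 2 when usable, otherwise fall back to SDPA."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"

print(pick_attn_implementation(cuda_available=False))

# Intended use (not run here, since it downloads weights):
#   attn = pick_attn_implementation(torch.cuda.is_available())
#   dtype = torch.bfloat16 if attn == "flash_attention_2" else torch.float32
#   model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True,
#                                     attn_implementation=attn, torch_dtype=dtype)
```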

Implementation Notes

The original DNABERT-S codebase uses a Triton-based flash attention implementation (flash_attn_triton.py). This HF port uses Taykhoom/MosaicBERT-updated, which replaces that implementation with the standard flash-attn package and adds attn_implementation="sdpa" support. Neither the flash-attn backend nor SDPA support was part of the original codebase.

Citation

@misc{zhou2024_dnaberts,
  title   = {{DNABERT}-S: Learning Species-Aware {DNA} Embedding with Genome Foundation Models},
  author  = {Zhou, Zhihan and Wu, Weimin and Ho, Harrison and Wang, Jiayi and
             Shi, Lizhen and Davuluri, Ramana V and Wang, Zhong and Liu, Han},
  year    = {2024},
  eprint  = {2402.08777},
  archivePrefix = {arXiv},
  primaryClass  = {q-bio.GN}
}

Credits

Original DNABERT-S model and code by Zhou et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

Apache 2.0, following the original repository.

