TropicBERT: Tropical Fruit Genomics Foundation Models
1. Model Overview
TropicBERT is the first genomic foundation model series designed specifically for tropical fruit crop genome sequences. We pre-trained BERT models with the masked language modeling (MLM) objective on 13 species combinations and release 13 pre-trained variants: ten single-species models, two five-species models, and one ten-species model.
The models were built with TropicBERT-LLMs_One_stop_tutorial, a reproducible pipeline that transfers the "pre-training and fine-tuning" paradigm from NLP to genomics and lowers the barrier to developing large language models for plant genomes.
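To make the MLM objective concrete, here is a minimal, dependency-free sketch of how input tokens can be masked for pre-training. It is an illustration only, not the tutorial's training code: the overlapping k-mer tokenization, the value `k=6`, the 15% masking rate (the value used in the original BERT recipe), and the helper names `kmer_tokenize` and `mask_tokens` are all assumptions, and the sketch always substitutes `[MASK]` rather than using BERT's 80/10/10 replacement scheme. In practice, a collator such as `DataCollatorForLanguageModeling` from `transformers` usually handles this step.

```python
import random

MASK_TOKEN = "[MASK]"  # conventional BERT mask token; assumed, not confirmed for TropicBERT


def kmer_tokenize(seq, k=6):
    """Split a DNA string into overlapping k-mers with stride 1 (hypothetical tokenizer)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Mask each token independently with probability mask_prob.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere, so the loss is computed only where
    the model has to reconstruct a hidden token.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = kmer_tokenize("ATGCGTACGTTAGCCTA", k=6)
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
for original, shown, target in zip(tokens, masked, labels):
    print(original, shown, target)
```

During pre-training, the model sees `masked` as input and is trained to predict the entries of `labels` that are not `None`.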
Related Resources:
- GitHub Full Tutorial: TropicBERT-LLMs_One_stop_tutorial
  - Includes data preprocessing, pre-training, and fine-tuning scripts
  - Provides sample datasets in FASTA/CSV format
  - The original genomic data is sourced from the tropicalfruit-omics database
- Published Paper: [ Paper ]
  - More detailed model description and training details
2. Model Variants
We provide 13 model versions trained on different species combinations to meet various research needs:
| Model Name | Coverage | Species Count | Included Species (Scientific Name) |
|---|---|---|---|
| base | Comprehensive | 10 | Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava, Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata |
| tp035-039 | Group A | 5 | Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava |
| tp040-044 | Group B | 5 | Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata |
| tp035 | Single Species | 1 | Melastoma candidum |
| tp036 | Single Species | 1 | Punica granatum |
| tp037 | Single Species | 1 | Syzygium samarangense |
| tp038 | Single Species | 1 | Plinia cauliflora |
| tp039 | Single Species | 1 | Psidium guajava |
| tp040 | Single Species | 1 | Eugenia stipitata |
| tp041 | Single Species | 1 | Eugenia brasiliensis |
| tp042 | Single Species | 1 | Eugenia candolleana |
| tp043 | Single Species | 1 | Eugenia Observa |
| tp044 | Single Species | 1 | Eugenia aggregata |
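When choosing among the variants, it can help to look up which models were trained on a given species. The sketch below transcribes the table into a dictionary and lists the matching variants from most specific to most general. The names `SINGLE_SPECIES`, `VARIANTS`, and `variants_covering` are hypothetical helpers for illustration; how each variant name maps to a repository or subfolder on the Hub is not specified here, so the sketch does not attempt to load anything.

```python
# Species per single-species variant, transcribed from the table above.
SINGLE_SPECIES = {
    "tp035": "Melastoma candidum",
    "tp036": "Punica granatum",
    "tp037": "Syzygium samarangense",
    "tp038": "Plinia cauliflora",
    "tp039": "Psidium guajava",
    "tp040": "Eugenia stipitata",
    "tp041": "Eugenia brasiliensis",
    "tp042": "Eugenia candolleana",
    "tp043": "Eugenia Observa",
    "tp044": "Eugenia aggregata",
}

GROUP_A = [f"tp{n:03d}" for n in range(35, 40)]  # tp035 .. tp039
GROUP_B = [f"tp{n:03d}" for n in range(40, 45)]  # tp040 .. tp044

# Variant name -> list of species it was trained on.
VARIANTS = {name: [species] for name, species in SINGLE_SPECIES.items()}
VARIANTS["tp035-039"] = [SINGLE_SPECIES[name] for name in GROUP_A]
VARIANTS["tp040-044"] = [SINGLE_SPECIES[name] for name in GROUP_B]
VARIANTS["base"] = list(SINGLE_SPECIES.values())


def variants_covering(species):
    """Return variant names whose training set includes species, smallest set first."""
    hits = [name for name, members in VARIANTS.items() if species in members]
    return sorted(hits, key=lambda name: (len(VARIANTS[name]), name))


print(variants_covering("Psidium guajava"))
```

For a species outside the table, the function returns an empty list; in that case the ten-species `base` model is probably the most reasonable starting point, though that is a judgment call rather than a documented recommendation.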
3. Quick Usage
The following code demonstrates how to load the model and extract DNA sequence embeddings.
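```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and the pre-trained encoder from the Hugging Face Hub
model_name = "yang0104/TropicBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # disable dropout so embeddings are deterministic

# Example DNA sequence; replace it with your own
seq = "ATGCGTACGTTAGCCTA..."
inputs = tokenizer(seq, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)

# Per-token embeddings with shape (batch_size, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```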
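Genomic sequences are often much longer than a BERT-style encoder can process in one pass, so a common workaround is to split the sequence into overlapping windows and embed each window separately. The following is a minimal sketch of that idea, not part of the TropicBERT pipeline: the helper name `sliding_windows` and the default `window` and `stride` values are assumptions, and the window size is measured in nucleotides, so the number of tokens per window depends on the tokenizer.

```python
def sliding_windows(seq, window=512, stride=256):
    """Yield (start, subsequence) pairs that together cover the whole sequence.

    Windows overlap by window - stride characters. The final window is
    aligned to the end of the sequence so no trailing bases are dropped.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("require window > 0 and 0 < stride <= window")
    if len(seq) <= window:
        # Short sequences fit in a single window.
        yield 0, seq
        return
    start = 0
    while True:
        end = start + window
        if end >= len(seq):
            # Last window: shift it back so it ends exactly at the sequence end.
            yield len(seq) - window, seq[-window:]
            return
        yield start, seq[start:end]
        start += stride


example = "ACGT" * 5  # 20 nucleotides
for start, chunk in sliding_windows(example, window=8, stride=4):
    print(start, chunk)
```

Each chunk can then be passed through the tokenizer and model as in the snippet above, and the per-window embeddings combined (for example, by averaging) as the downstream task requires.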
4. Contact & Citation
If you encounter any issues or wish to discuss collaboration, please contact us via:
- Submit Issue: GitHub Issues
- Community Discussion: HuggingFace Discussions
- Email: [email protected], [email protected]
Citation: If you use TropicBERT in your research, please cite the following paper:
paper