# Sparse Autoencoders for Low-N Protein Function Prediction and Design
This repository contains the models used in the paper "Sparse Autoencoders Improve Low-N Protein Function Prediction and Design" by Darin Tsui, Kunal Talreja, and Amirali Aghazadeh. The paper is available at [arXiv:2508.18567](https://arxiv.org/abs/2508.18567).
```
models/
├── esm2_t33_650M_UR50D/       # Base ESM2 protein language model
├── SPG1_STRSG_Wu_2016/        # Protein-specific models for SPG1 (Wu)
├── SPG1_STRSG_Olson_2014/     # Protein-specific models for SPG1 (Olson)
├── GRB2_HUMAN_Faure_2021/     # Protein-specific models for GRB2
├── GFP_AEQVI_Sarkisyan_2016/  # Protein-specific models for GFP
├── F7YBW8_MESOW_Ding_2023/    # Protein-specific models for F7YBW8
└── DLG4_HUMAN_Faure_2021/     # Protein-specific models for DLG4
```
Each protein directory contains two types of models:
**Fine-tuned ESM2 adapter (`esm_ft/`)**
- `adapter_model.safetensors`: Fine-tuning adapter weights
- `adapter_config.json`: Adapter configuration
- `README.md`: Model card with detailed information
- `tokenizer_config.json`
- `special_tokens_map.json`

**Sparse autoencoder (`sae/`)**
- `checkpoints/`: Training checkpoints with different loss values
  - `esm2_plm{PLM_DIM}_l{SAE_LAYER}_sae{SAE_DIM}_k{SAE_K}_auxk{SAE_AUXK}_-step={step}-avg_mse_loss={loss}.ckpt`
  - `last.ckpt`

To load a fine-tuned model, apply the PEFT adapter on top of the base ESM2 model:

```python
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Choose one of the protein-specific directories listed above
DMS_ASSAY = "SPG1_STRSG_Wu_2016"

# Load the base ESM2 model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("models/esm2_t33_650M_UR50D")
base_model = AutoModelForMaskedLM.from_pretrained("models/esm2_t33_650M_UR50D")

# Apply the fine-tuned adapter for the chosen assay
ft_model = PeftModel.from_pretrained(base_model, f"models/{DMS_ASSAY}/esm_ft")
```
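The SAEs operate on residue-level activations from a specific ESM2 layer (the `l{SAE_LAYER}` field in the checkpoint names). As a minimal sketch of pulling those activations from the fine-tuned model loaded above; the layer index and the sequence below are placeholder assumptions, and the real layer should be read from the checkpoint filename:

```python
import torch

SAE_LAYER = 24  # assumption: illustrative layer index only; use the value from the checkpoint name

sequence = "MKTVRQERLK"  # placeholder sequence for illustration
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = ft_model(**inputs, output_hidden_states=True)

# hidden_states[0] holds the embedding output, so layer l sits at index l;
# shape is (batch, sequence length incl. special tokens, 1280) for ESM2-650M
acts = outputs.hidden_states[SAE_LAYER]
```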
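The SAE model class itself ships with the paper's codebase rather than this repository, but the `.ckpt` files appear to follow the standard PyTorch Lightning layout (`last.ckpt` plus step/loss-stamped checkpoints), so they can be located and inspected with ordinary tooling. A sketch under that assumption, using one of the assay directories listed above:

```python
import glob
import torch

# List every SAE checkpoint saved for one assay
for path in glob.glob("models/SPG1_STRSG_Wu_2016/sae/checkpoints/*.ckpt"):
    print(path)

# Lightning checkpoints are dictionaries; the SAE weights live under "state_dict"
ckpt = torch.load(
    "models/SPG1_STRSG_Wu_2016/sae/checkpoints/last.ckpt",
    map_location="cpu",
    weights_only=False,  # the checkpoint carries non-tensor training metadata
)
for name, tensor in ckpt["state_dict"].items():
    print(name, tuple(tensor.shape))
```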
Base model: [facebook/esm2_t33_650M_UR50D](https://huggingface.co/facebook/esm2_t33_650M_UR50D)