The Embedded Alphabet (TEA)
This repository contains the code accompanying our pre-print (link coming soon).
Installation
python -m pip install git+https://github.com/PickyBinders/tea.git
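After installation you can verify the entry point by printing the help text of the conversion command described in the next section:

tea_convert -h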
Sequence Conversion with TEA
The tea_convert command takes protein sequences from a FASTA file and generates a new tea-FASTA file. It supports confidence-based output, in which low-confidence positions are written in lowercase, and has options for saving logits and per-residue entropy. If --save_avg_entropy is set, each FASTA identifier will contain the average entropy of its sequence in the format <key>|H=<avg_entropy>. An example invocation follows the options list below.
usage: tea_convert [-h] -f FASTA_FILE -o OUTPUT_FILE [-l] [-H] [-r] [-c] [-t ENTROPY_THRESHOLD]

options:
  -h, --help            show this help message and exit
  -f FASTA_FILE, --fasta_file FASTA_FILE
                        Input FASTA file containing protein amino acid sequences
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output FASTA file for generated tea sequences
  -l, --save_logits     Save per-residue logits to .pt file
  -H, --save_avg_entropy
                        Save average entropy values in FASTA identifiers
  -r, --save_residue_entropy
                        Save per-residue entropy values to .pt file
  -c, --lowercase_entropy
                        Save residues with entropy > threshold in lowercase
  -t ENTROPY_THRESHOLD, --entropy_threshold ENTROPY_THRESHOLD
                        Entropy threshold for lowercase conversion
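For example, the following invocation (file names and the threshold value are illustrative) annotates identifiers with average entropy and lowercases residues whose entropy exceeds the threshold:

tea_convert -f proteins.fasta -o proteins_tea.fasta -H -c -t 1.5

With -H set, an output identifier then takes the form >P12345|H=0.42 (made-up values, shown only to illustrate the format).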
Using the Hugging Face model
from tea.model import Tea
from transformers import AutoTokenizer, AutoModel
from transformers import BitsAndBytesConfig
import torch
import re

# Load the TEA decoder and keep everything on a single device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tea = Tea.from_pretrained("PickyBinders/tea").to(device)
tea.eval()

# ESM2 provides the per-residue embeddings that TEA converts into tea sequences.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
# Quantize ESM2 to 4 bit on GPU to reduce memory usage (bitsandbytes requires CUDA).
bnb_config = BitsAndBytesConfig(load_in_4bit=True) if torch.cuda.is_available() else None
esm2 = AutoModel.from_pretrained(
    "facebook/esm2_t33_650M_UR50D",
    torch_dtype="auto",
    quantization_config=bnb_config,
    add_pooling_layer=False,
)
if bnb_config is None:
    # 4-bit models are placed on the GPU during loading and do not support .to(),
    # so only the full-precision (CPU) model is moved explicitly.
    esm2 = esm2.to(device)
esm2.eval()

sequence_examples = ["PRTEINO", "SEQWENCE"]
# Map rare/ambiguous amino acids to X and insert whitespace between residues.
sequence_examples = [" ".join(list(re.sub(r"[UZOBJ]", "X", sequence))) for sequence in sequence_examples]

# Tokenize, pad to the longest sequence in the batch, and move tensors to the device.
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids["input_ids"]).to(device)
attention_mask = torch.tensor(ids["attention_mask"]).to(device)

# Compute ESM2 embeddings and decode them into tea sequences.
with torch.no_grad():
    x = esm2(
        input_ids=input_ids, attention_mask=attention_mask
    ).last_hidden_state.to(device)
results = tea.to_sequences(
    embeddings=x,
    input_ids=input_ids,
    return_avg_entropy=True,
    return_logits=False,
    return_residue_entropy=False,
)
print(results)
Using tea sequences with MMseqs2
The matcha.out substitution matrix is included with the tea package. You can get its path programmatically:
from tea import get_matrix_path
matcha_path = get_matrix_path()
print(f"Matrix path: {matcha_path}")
Then use it with MMseqs2, substituting the printed path for /path/to/matcha.out:
mmseqs easy-search tea_query.fasta tea_target.fasta results.m8 tmp/ \
--comp-bias-corr 0 \
--mask 0 \
--gap-open 18 \
--gap-extend 3 \
--sub-mat /path/to/matcha.out \
--seed-sub-mat /path/to/matcha.out \
--exact-kmer-matching 1
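The same search can also be scripted from Python. This is a minimal sketch, assuming mmseqs is on your PATH and that tea_query.fasta and tea_target.fasta were produced with tea_convert; it simply passes the packaged matrix path into the command shown above:

import subprocess
from tea import get_matrix_path

# Path to the matcha.out substitution matrix shipped with the tea package.
matcha_path = str(get_matrix_path())

# Run the MMseqs2 search from above, pointing both substitution-matrix
# options at the packaged matrix.
subprocess.run(
    [
        "mmseqs", "easy-search",
        "tea_query.fasta", "tea_target.fasta", "results.m8", "tmp/",
        "--comp-bias-corr", "0",
        "--mask", "0",
        "--gap-open", "18",
        "--gap-extend", "3",
        "--sub-mat", matcha_path,
        "--seed-sub-mat", matcha_path,
        "--exact-kmer-matching", "1",
    ],
    check=True,
)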