Convert_tokens_to_ids produces <unk>

AfonsoSousa · October 21, 2022, 10:45am

Hi. I am trying to tokenize single words with a Roberta BPE Sub-word tokenizer. I was expecting to have some words with multiple ids, but when that is supposed to be the case, the method convert_tokens_to_ids just returns <unk>. However, __call__ from tokenizer produces the multiple ids. To reproduce the problem, run:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
token_id = tokenizer.convert_tokens_to_ids("exam")
print(f"{token_id} => {tokenizer.decode([token_id])}")
token_ids = tokenizer("exam").input_ids[1:3]
print(f"{token_ids} => {tokenizer.decode(token_ids)}")

Is there a way to make convert_tokens_to_ids have the same behaviour as tokenizer(token).input_ids[1:3]? THanks in advance for any help you can provide.

lianghsun · October 25, 2022, 8:01pm

I think you misunderstand tokenizer.convert_tokens_to_ids(). Please note this function is to map token to id, however exam is not a token, it is a word instead. You can check by the following code:

tokenizer.convert_ids_to_token([3463, 424])
> ['ex', 'am'] # exam is tokenized to two token!

So, it’s obivously that there is no token exam in vocab, but [UNK] instead.

Topic		Replies	Views
Difference between tokenizer and convert_tokens_to_ids 🤗Tokenizers	0	345	May 12, 2024
Word_to_tokens() and word_ids() ---- microsoft/deberta-v2/v3 🤗Tokenizers	2	502	July 14, 2022
Tokenizer mapping the same token to multiple token_ids 🤗Tokenizers	4	774	April 22, 2024
Decode token IDs into a list (not a single string) 🤗Tokenizers	4	4765	March 11, 2025
Word_ids not working with deberta_v2 🤗Tokenizers	1	1347	August 12, 2022

Convert_tokens_to_ids produces <unk>

Related topics