Hi. I am trying to tokenize single words with a RoBERTa BPE sub-word tokenizer. I expected some words to map to multiple ids, but when that should be the case, convert_tokens_to_ids just returns the <unk> id instead. Calling the tokenizer itself (__call__), however, does produce the multiple ids. To reproduce the problem, run:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
token_id = tokenizer.convert_tokens_to_ids("exam")
print(f"{token_id} => {tokenizer.decode([token_id])}")
token_ids = tokenizer("exam").input_ids[1:3]
print(f"{token_ids} => {tokenizer.decode(token_ids)}")
Is there a way to make convert_tokens_to_ids behave the same as tokenizer(word).input_ids[1:-1] (i.e. the ids without the special tokens)? Thanks in advance for any help you can provide.
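For reference, here is what I have tried so far. My understanding (which may be wrong) is that convert_tokens_to_ids expects strings that are already vocabulary tokens, so going through tokenize() first seems to give the per-word ids ("exam" is just an example word):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# tokenize() splits the word into sub-word tokens that are in the vocab,
# so convert_tokens_to_ids no longer falls back to <unk>
tokens = tokenizer.tokenize("exam")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, ids)

# This matches __call__ once the special tokens (<s>, </s>) are stripped
assert ids == tokenizer("exam").input_ids[1:-1]
```

Is this the intended way to do it, or is there a more direct method?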