class transformers.CpmTokenizer

`( vocab_file, do_lower_case = False, remove_space = True, keep_accents = False, bos_token = '<s>', eos_token = '</s>', unk_token = '<unk>', sep_token = '<sep>', pad_token = '<pad>', cls_token = '<cls>', mask_token = '<mask>', additional_special_tokens = ['<eop>', '<eod>'], sp_model_kwargs: typing.Optional[dict[str, typing.Any]] = None, **kwargs )`

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L35

Runs pre-tokenization with the Jieba-RS segmentation tool. It is used in CPM models.

build_inputs_with_special_tokens

`transformers.CpmTokenizer.build_inputs_with_special_tokens( token_ids_0: list, token_ids_1: typing.Optional[list[int]] = None )`

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L241

- **token_ids_0** (`list[int]`) -- List of IDs to which the special tokens will be added.
- **token_ids_1** (`list[int]`, *optional*) -- Optional second list of IDs for sequence pairs.

Returns `list[int]` -- List of [input IDs](../glossary#input-ids) with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An XLNet sequence has the following format:

- single sequence: `X <sep> <cls>`
- pair of sequences: `A <sep> B <sep> <cls>`
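The concatenation can be sketched in plain Python, without loading a tokenizer. This is a minimal illustration, assuming the XLNet convention in which a separator follows each sequence and one classification token closes the input. The function name `build_inputs_sketch` and the ID values `SEP_ID` and `CLS_ID` are hypothetical placeholders, not values read from a CPM vocabulary.

```python
from typing import Optional

# Placeholder IDs for illustration only. With a real tokenizer, the actual
# values are available as `tokenizer.sep_token_id` and `tokenizer.cls_token_id`.
SEP_ID = 4
CLS_ID = 3


def build_inputs_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """Append a separator after each sequence and one classifier ID at the end."""
    if token_ids_1 is None:
        return token_ids_0 + [SEP_ID, CLS_ID]
    return token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID, CLS_ID]


print(build_inputs_sketch([10, 11, 12]))          # [10, 11, 12, 4, 3]
print(build_inputs_sketch([10, 11], [20, 21, 22]))  # [10, 11, 4, 20, 21, 22, 4, 3]
```

Note that the special tokens go at the end of the input rather than the beginning, unlike BERT-style layouts.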

convert_tokens_to_string

`transformers.CpmTokenizer.convert_tokens_to_string( tokens )`

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L235

Converts a sequence of tokens (strings for sub-words) into a single string.

create_token_type_ids_from_sequences

`transformers.CpmTokenizer.create_token_type_ids_from_sequences( token_ids_0: list, token_ids_1: typing.Optional[list[int]] = None )`

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L296

- **token_ids_0** (`list[int]`) -- List of IDs.
- **token_ids_1** (`list[int]`, *optional*) -- Optional second list of IDs for sequence pairs.

Returns `list[int]` -- List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).

Create a mask from the two sequences passed to be used in a sequence-pair classification task. An XLNet sequence pair mask has the following format:

```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
```

If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
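A rough sketch of how such a mask could be assembled is shown below. The description above mentions only 0s and 1s. XLNet-style tokenizers additionally give the closing classification token its own segment ID, which this sketch models as `CLS_SEGMENT_ID = 2`. Treat that value, and the function name `token_type_ids_sketch`, as assumptions to check against the linked source.

```python
from typing import Optional

# Assumption: segment ID assigned to the closing classification position,
# following XLNet-style tokenizers. Not stated in the description above.
CLS_SEGMENT_ID = 2


def token_type_ids_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """Return 0s for the first sequence and 1s for the second, separators included."""
    first = [0] * (len(token_ids_0) + 1)  # first sequence plus its separator
    if token_ids_1 is None:
        return first + [CLS_SEGMENT_ID]
    second = [1] * (len(token_ids_1) + 1)  # second sequence plus its separator
    return first + second + [CLS_SEGMENT_ID]


print(token_type_ids_sketch([10, 11], [20, 21, 22]))  # [0, 0, 0, 1, 1, 1, 1, 2]
```

The result has one entry per position of the input built with special tokens, so the two lists can be passed to a model side by side.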

get_special_tokens_mask

`transformers.CpmTokenizer.get_special_tokens_mask( token_ids_0: list, token_ids_1: typing.Optional[list[int]] = None, already_has_special_tokens: bool = False )`

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L267

- **token_ids_0** (`list[int]`) -- List of IDs.
- **token_ids_1** (`list[int]`, *optional*) -- Optional second list of IDs for sequence pairs.
- **already_has_special_tokens** (`bool`, *optional*, defaults to `False`) -- Whether or not the token list is already formatted with special tokens for the model.

Returns `list[int]` -- A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer `prepare_for_model` method.
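For the default case (`already_has_special_tokens=False`), the mask marks the positions where special tokens would be inserted. The sketch below assumes the XLNet-style layout, with a separator after each sequence and a classification token at the end. The name `special_tokens_mask_sketch` is hypothetical, and the `already_has_special_tokens=True` branch is not covered.

```python
from typing import Optional


def special_tokens_mask_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """Return 1 at positions reserved for special tokens and 0 elsewhere."""
    if token_ids_1 is None:
        # sequence tokens, then separator and classifier
        return [0] * len(token_ids_0) + [1, 1]
    # first sequence, separator, second sequence, separator and classifier
    return [0] * len(token_ids_0) + [1] + [0] * len(token_ids_1) + [1, 1]


print(special_tokens_mask_sketch([10, 11, 12]))          # [0, 0, 0, 1, 1]
print(special_tokens_mask_sketch([10, 11], [20, 21, 22]))  # [0, 0, 1, 0, 0, 0, 1, 1]
```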

class transformers.CpmTokenizerFast

`( vocab_file = None, tokenizer_file = None, do_lower_case = False, remove_space = True, keep_accents = False, bos_token = '<s>', eos_token = '</s>', unk_token = '<unk>', sep_token = '<sep>', pad_token = '<pad>', cls_token = '<cls>', mask_token = '<mask>', additional_special_tokens = ['<eop>', '<eod>'], **kwargs )`

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm_fast.py#L30

Runs pre-tokenization with the Jieba-RS segmentation tool. It is used in CPM models.

build_inputs_with_special_tokens

`transformers.CpmTokenizerFast.build_inputs_with_special_tokens( token_ids_0: list, token_ids_1: typing.Optional[list[int]] = None )`

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm_fast.py#L148

- **token_ids_0** (`list[int]`) -- List of IDs to which the special tokens will be added.
- **token_ids_1** (`list[int]`, *optional*) -- Optional second list of IDs for sequence pairs.

Returns `list[int]` -- List of [input IDs](../glossary#input-ids) with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An XLNet sequence has the following format:

- single sequence: `X <sep> <cls>`
- pair of sequences: `A <sep> B <sep> <cls>`

create_token_type_ids_from_sequences

`transformers.CpmTokenizerFast.create_token_type_ids_from_sequences( token_ids_0: list, token_ids_1: typing.Optional[list[int]] = None )`

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm_fast.py#L174

- **token_ids_0** (`list[int]`) -- List of IDs.
- **token_ids_1** (`list[int]`, *optional*) -- Optional second list of IDs for sequence pairs.

Returns `list[int]` -- List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).

Create a mask from the two sequences passed to be used in a sequence-pair classification task. An XLNet sequence pair mask has the following format:

```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
```

If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).