# CPM

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview

The CPM model was proposed in [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://huggingface.co/papers/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin,
Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen,
Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.

The abstract from the paper is the following:

*Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3,
with 175 billion parameters and 570GB training data, drew a lot of attention due to the capacity of few-shot (even
zero-shot) learning. However, applying GPT-3 to address Chinese NLP tasks is still challenging, as the training corpus
of GPT-3 is primarily English, and the parameters are not publicly available. In this technical report, we release the
Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best
of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained
language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation,
cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many
NLP tasks in the settings of few-shot (even zero-shot) learning.*

This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found
[here](https://github.com/TsinghuaAI/CPM-Generate).

<Tip>

CPM's architecture is the same as GPT-2, except for the tokenization method. Refer to the [GPT-2 documentation](gpt2) for
API reference information.

</Tip>

## CpmTokenizer[[transformers.CpmTokenizer]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.CpmTokenizer</name><anchor>transformers.CpmTokenizer</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L35</source><parameters>[{"name": "vocab_file", "val": ""}, {"name": "do_lower_case", "val": " = False"}, {"name": "remove_space", "val": " = True"}, {"name": "keep_accents", "val": " = False"}, {"name": "bos_token", "val": " = '<s>'"}, {"name": "eos_token", "val": " = '</s>'"}, {"name": "unk_token", "val": " = '<unk>'"}, {"name": "sep_token", "val": " = '<sep>'"}, {"name": "pad_token", "val": " = '<pad>'"}, {"name": "cls_token", "val": " = '<cls>'"}, {"name": "mask_token", "val": " = '<mask>'"}, {"name": "additional_special_tokens", "val": " = ['<eop>', '<eod>']"}, {"name": "sp_model_kwargs", "val": ": typing.Optional[dict[str, typing.Any]] = None"}, {"name": "**kwargs", "val": ""}]</parameters></docstring>
Runs pre-tokenization with the Jieba-RS segmentation tool. It is used in CPM models.


<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>build_inputs_with_special_tokens</name><anchor>transformers.CpmTokenizer.build_inputs_with_special_tokens</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L241</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs to which the special tokens will be added.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of [input IDs](../glossary#input-ids) with the appropriate special tokens.</retdesc></docstring>

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
adding special tokens. An XLNet sequence has the following format:

- single sequence: `X <sep> <cls>`
- pair of sequences: `A <sep> B <sep> <cls>`
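
The layout above can be sketched in plain Python. This is a minimal illustration and not the library's code: the function name `build_inputs_sketch` and the IDs chosen for `<sep>` and `<cls>` are hypothetical placeholders, since the real IDs come from the loaded vocabulary.

```python
from typing import Optional

# Hypothetical placeholder IDs; the real values come from the tokenizer's vocabulary.
SEP_ID = 4
CLS_ID = 3


def build_inputs_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """Append special tokens in the `X <sep> <cls>` / `A <sep> B <sep> <cls>` layout."""
    if token_ids_1 is None:
        # Single sequence: X <sep> <cls>
        return token_ids_0 + [SEP_ID, CLS_ID]
    # Pair of sequences: A <sep> B <sep> <cls>
    return token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID, CLS_ID]


single = build_inputs_sketch([10, 11, 12])
pair = build_inputs_sketch([10, 11], [20, 21, 22])
print(single)  # [10, 11, 12, 4, 3]
print(pair)    # [10, 11, 4, 20, 21, 22, 4, 3]
```

Note that, unlike BERT-style layouts, the `<cls>` token sits at the end of the sequence rather than at the start.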








</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>convert_tokens_to_string</name><anchor>transformers.CpmTokenizer.convert_tokens_to_string</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L235</source><parameters>[{"name": "tokens", "val": ""}]</parameters></docstring>
Converts a sequence of tokens (strings for sub-words) into a single string.

</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>create_token_type_ids_from_sequences</name><anchor>transformers.CpmTokenizer.create_token_type_ids_from_sequences</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L296</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).</retdesc></docstring>

Create a mask from the two sequences passed to be used in a sequence-pair classification task.
<ExampleCodeBlock anchor="transformers.CpmTokenizer.create_token_type_ids_from_sequences.example">

An XLNet sequence pair mask has the following format:

```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
```

</ExampleCodeBlock>

If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
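
As a rough sketch of how such a mask could be assembled, the snippet below assigns `0` to the first sequence and its `<sep>`, and `1` to the second sequence and its `<sep>`. The function name `token_type_ids_sketch` is hypothetical. The segment ID given to the trailing `<cls>` position (`2` here) is an assumption based on XLNet-style tokenizers and is not shown in the diagram above; check the linked source before relying on it.

```python
from typing import Optional

# Assumption: XLNet-style tokenizers give the trailing <cls> its own segment ID.
CLS_SEGMENT_ID = 2


def token_type_ids_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """Build segment IDs for the `A <sep> B <sep> <cls>` layout."""
    # First sequence plus its <sep> belongs to segment 0.
    first = [0] * (len(token_ids_0) + 1)
    if token_ids_1 is None:
        return first + [CLS_SEGMENT_ID]
    # Second sequence plus its <sep> belongs to segment 1.
    second = [1] * (len(token_ids_1) + 1)
    return first + second + [CLS_SEGMENT_ID]


pair_types = token_type_ids_sketch([10, 11], [20, 21, 22])
single_types = token_type_ids_sketch([10, 11, 12])
print(pair_types)    # [0, 0, 0, 1, 1, 1, 1, 2]
print(single_types)  # [0, 0, 0, 0, 2]
```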








</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>get_special_tokens_mask</name><anchor>transformers.CpmTokenizer.get_special_tokens_mask</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L267</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}, {"name": "already_has_special_tokens", "val": ": bool = False"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.
- **already_has_special_tokens** (`bool`, *optional*, defaults to `False`) --
  Whether or not the token list is already formatted with special tokens for the model.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.</retdesc></docstring>

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.
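
For the case where no special tokens have been added yet, the mask can be derived from the `X <sep> <cls>` and `A <sep> B <sep> <cls>` layouts. The following is a minimal sketch under that assumption; `special_tokens_mask_sketch` is a hypothetical name, and the `already_has_special_tokens=True` branch is deliberately not modeled.

```python
from typing import Optional


def special_tokens_mask_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """Return 1 at positions where a special token would be inserted, 0 elsewhere."""
    if token_ids_1 is None:
        # X <sep> <cls>
        return [0] * len(token_ids_0) + [1, 1]
    # A <sep> B <sep> <cls>
    return [0] * len(token_ids_0) + [1] + [0] * len(token_ids_1) + [1, 1]


single_mask = special_tokens_mask_sketch([10, 11, 12])
pair_mask = special_tokens_mask_sketch([10, 11], [20, 21, 22])
print(single_mask)  # [0, 0, 0, 1, 1]
print(pair_mask)    # [0, 0, 1, 0, 0, 0, 1, 1]
```

The returned list has the same length as the sequence after special tokens are added, so it can be used to filter them out position by position.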








</div></div>

## CpmTokenizerFast[[transformers.CpmTokenizerFast]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.CpmTokenizerFast</name><anchor>transformers.CpmTokenizerFast</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm_fast.py#L30</source><parameters>[{"name": "vocab_file", "val": " = None"}, {"name": "tokenizer_file", "val": " = None"}, {"name": "do_lower_case", "val": " = False"}, {"name": "remove_space", "val": " = True"}, {"name": "keep_accents", "val": " = False"}, {"name": "bos_token", "val": " = '<s>'"}, {"name": "eos_token", "val": " = '</s>'"}, {"name": "unk_token", "val": " = '<unk>'"}, {"name": "sep_token", "val": " = '<sep>'"}, {"name": "pad_token", "val": " = '<pad>'"}, {"name": "cls_token", "val": " = '<cls>'"}, {"name": "mask_token", "val": " = '<mask>'"}, {"name": "additional_special_tokens", "val": " = ['<eop>', '<eod>']"}, {"name": "**kwargs", "val": ""}]</parameters></docstring>
Runs pre-tokenization with the Jieba-RS segmentation tool. It is used in CPM models.


<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>build_inputs_with_special_tokens</name><anchor>transformers.CpmTokenizerFast.build_inputs_with_special_tokens</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm_fast.py#L148</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs to which the special tokens will be added.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of [input IDs](../glossary#input-ids) with the appropriate special tokens.</retdesc></docstring>

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
adding special tokens. An XLNet sequence has the following format:

- single sequence: `X <sep> <cls>`
- pair of sequences: `A <sep> B <sep> <cls>`








</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>create_token_type_ids_from_sequences</name><anchor>transformers.CpmTokenizerFast.create_token_type_ids_from_sequences</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm_fast.py#L174</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).</retdesc></docstring>

Create a mask from the two sequences passed to be used in a sequence-pair classification task.
<ExampleCodeBlock anchor="transformers.CpmTokenizerFast.create_token_type_ids_from_sequences.example">

An XLNet sequence pair mask has the following format:

```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
```

</ExampleCodeBlock>

If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).








</div></div>
