# BERTweet [[bertweet]]

## 개요 [[overview]]

BERTweet 모델은 Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen에 의해 [BERTweet: A pre-trained language model for English Tweets](https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf) 에서 제안되었습니다.

해당 논문의 초록 :

*영어 트윗을 위한 최초의 공개 대규모 사전 학습된 언어 모델인 BERTweet을 소개합니다. 
BERTweet은 BERT-base(Devlin et al., 2019)와 동일한 아키텍처를 가지고 있으며, RoBERTa 사전 학습 절차(Liu et al., 2019)를 사용하여 학습되었습니다. 
실험 결과, BERTweet은 강력한 기준 모델인 RoBERTa-base 및 XLM-R-base(Conneau et al., 2020)의 성능을 능가하여 세 가지 트윗 NLP 작업(품사 태깅, 개체명 인식, 텍스트 분류)에서 이전 최신 모델보다 더 나은 성능을 보여주었습니다.*

이 모델은 [dqnguyen](https://huggingface.co/dqnguyen) 께서 기여하셨습니다. 원본 코드는 [여기](https://github.com/VinAIResearch/BERTweet).에서 확인할 수 있습니다.


## 사용 예시 [[usage-example]]

```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

>>> # 트랜스포머 버전 4.x 이상 :
>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

>>> # 트랜스포머 버전 3.x 이상:
>>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

>>> # 입력된 트윗은 이미 정규화되었습니다!
>>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

>>> input_ids = torch.tensor([tokenizer.encode(line)])

>>> with torch.no_grad():
...     features = bertweet(input_ids)  # Models outputs are now tuples

>>> # With TensorFlow 2.0+:
>>> # from transformers import TFAutoModel
>>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
```

<Tip> 

이 구현은 토큰화 방법을 제외하고는 BERT와 동일합니다. API 참조 정보는 [BERT 문서](bert) 를 참조하세요.

</Tip>

## Bertweet 토큰화(BertweetTokenizer) [[transformers.BertweetTokenizer]][[transformers.BertweetTokenizer]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.BertweetTokenizer</name><anchor>transformers.BertweetTokenizer</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/bertweet/tokenization_bertweet.py#L54</source><parameters>[{"name": "vocab_file", "val": ""}, {"name": "merges_file", "val": ""}, {"name": "normalization", "val": " = False"}, {"name": "bos_token", "val": " = '<s>'"}, {"name": "eos_token", "val": " = '</s>'"}, {"name": "sep_token", "val": " = '</s>'"}, {"name": "cls_token", "val": " = '<s>'"}, {"name": "unk_token", "val": " = '<unk>'"}, {"name": "pad_token", "val": " = '<pad>'"}, {"name": "mask_token", "val": " = '<mask>'"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **vocab_file** (`str`) --
  Path to the vocabulary file.
- **merges_file** (`str`) --
  Path to the merges file.
- **normalization** (`bool`, *optional*, defaults to `False`) --
  Whether or not to apply a normalization preprocess.
- **bos_token** (`str`, *optional*, defaults to `"<s>"`) --
  The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.

  <Tip>

  When building a sequence using special tokens, this is not the token that is used for the beginning of
  sequence. The token used is the `cls_token`.

  </Tip>

- **eos_token** (`str`, *optional*, defaults to `"</s>"`) --
  The end of sequence token.

  <Tip>

  When building a sequence using special tokens, this is not the token that is used for the end of sequence.
  The token used is the `sep_token`.

  </Tip>

- **sep_token** (`str`, *optional*, defaults to `"</s>"`) --
  The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
  sequence classification or for a text and a question for question answering. It is also used as the last
  token of a sequence built with special tokens.
- **cls_token** (`str`, *optional*, defaults to `"<s>"`) --
  The classifier token which is used when doing sequence classification (classification of the whole sequence
  instead of per-token classification). It is the first token of the sequence when built with special tokens.
- **unk_token** (`str`, *optional*, defaults to `"<unk>"`) --
  The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
  token instead.
- **pad_token** (`str`, *optional*, defaults to `"<pad>"`) --
  The token used for padding, for example when batching sequences of different lengths.
- **mask_token** (`str`, *optional*, defaults to `"<mask>"`) --
  The token used for masking values. This is the token used when training this model with masked language
  modeling. This is the token which the model will try to predict.</paramsdesc><paramgroups>0</paramgroups></docstring>

Constructs a BERTweet tokenizer, using Byte-Pair-Encoding.

This tokenizer inherits from [PreTrainedTokenizer](/docs/transformers/v4.57.0/ko/main_classes/tokenizer#transformers.PreTrainedTokenizer) which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.





<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>add_from_file</name><anchor>transformers.BertweetTokenizer.add_from_file</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/bertweet/tokenization_bertweet.py#L402</source><parameters>[{"name": "f", "val": ""}]</parameters></docstring>

Loads a pre-existing dictionary from a text file and adds its symbols to this instance.


</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>build_inputs_with_special_tokens</name><anchor>transformers.BertweetTokenizer.build_inputs_with_special_tokens</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/bertweet/tokenization_bertweet.py#L167</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs to which the special tokens will be added.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of [input IDs](../glossary#input-ids) with the appropriate special tokens.</retdesc></docstring>

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. A BERTweet sequence has the following format:

- single sequence: `<s> X </s>`
- pair of sequences: `<s> A </s></s> B </s>`








</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>convert_tokens_to_string</name><anchor>transformers.BertweetTokenizer.convert_tokens_to_string</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/bertweet/tokenization_bertweet.py#L368</source><parameters>[{"name": "tokens", "val": ""}]</parameters></docstring>
Converts a sequence of tokens (string) in a single string.

</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>create_token_type_ids_from_sequences</name><anchor>transformers.BertweetTokenizer.create_token_type_ids_from_sequences</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/bertweet/tokenization_bertweet.py#L221</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of zeros.</retdesc></docstring>

Create a mask from the two sequences passed to be used in a sequence-pair classification task. BERTweet does
not make use of token type ids, therefore a list of zeros is returned.








</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>get_special_tokens_mask</name><anchor>transformers.BertweetTokenizer.get_special_tokens_mask</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/bertweet/tokenization_bertweet.py#L193</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}, {"name": "already_has_special_tokens", "val": ": bool = False"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.
- **already_has_special_tokens** (`bool`, *optional*, defaults to `False`) --
  Whether or not the token list is already formatted with special tokens for the model.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.</retdesc></docstring>

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.








</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>normalizeToken</name><anchor>transformers.BertweetTokenizer.normalizeToken</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/bertweet/tokenization_bertweet.py#L341</source><parameters>[{"name": "token", "val": ""}]</parameters></docstring>

Normalize tokens in a Tweet


</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>normalizeTweet</name><anchor>transformers.BertweetTokenizer.normalizeTweet</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/bertweet/tokenization_bertweet.py#L307</source><parameters>[{"name": "tweet", "val": ""}]</parameters></docstring>

Normalize a raw Tweet


</div></div>

<EditOnGithub source="https://github.com/huggingface/transformers/blob/main/docs/source/ko/model_doc/bertweet.md" />