---
library_name: transformers
license: other
license_name: rigoberta-nc
license_link: https://huggingface.co/IIC/RigoBERTa-2.0/blob/main/LICENSE
language:
- es
pipeline_tag: fill-mask
---

# RigoBERTa 2.0
|
| |  |
| |
|
**RigoBERTa 2.0** is a state-of-the-art encoder language model for Spanish, developed through language-adaptive pretraining. It significantly outperforms every previous Spanish encoder model, offering robust language understanding.
|
## Model Details
|
### Model Description
|
**RigoBERTa 2.0** was built by further pretraining the general-purpose FacebookAI/xlm-roberta-large model on a meticulously curated Spanish corpus. The pretraining leverages masked language modeling (MLM) to adapt the model's linguistic knowledge to Spanish.
|
- **Developed by:** IIC
- **Model type:** Encoder
- **Language(s) (NLP):** Spanish
- **License:** rigoberta-nc (permissive non-commercial)
- **Finetuned from model:** FacebookAI/xlm-roberta-large
|
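As a rough illustration of this language-adaptive recipe, the sketch below continues MLM pretraining of the base model with the `transformers` Trainer. This is not the actual RigoBERTa 2.0 training setup: the corpus file, batch size, and output directory are placeholders.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the general-purpose multilingual base model
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

# Placeholder: any curated Spanish corpus, one document per line
corpus = load_dataset("text", data_files={"train": "spanish_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks tokens for the MLM objective (15% is the usual default)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rigoberta-mlm", per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```
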
## Intended Use & Limitations
|
### Intended Use
|
**RigoBERTa 2.0** is designed for:
|
- General text understanding in Spanish (see the usage sketch after this list).
- Applications in NLP tasks such as text classification, named entity recognition, and related downstream tasks.
- Research and development purposes, including benchmarking and further model adaptation.
|
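For quick experimentation, the model can be queried through the standard `transformers` fill-mask pipeline, matching the `pipeline_tag` above. A minimal sketch; the example sentence is illustrative:

```python
from transformers import pipeline

# Load RigoBERTa 2.0 through the fill-mask pipeline
fill_mask = pipeline("fill-mask", model="IIC/RigoBERTa-2.0")

# The XLM-RoBERTa tokenizer uses <mask> as its mask token
for prediction in fill_mask("Madrid es la <mask> de España."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Each returned prediction also includes the completed `sequence` and the raw `token` id.
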
Note that the license is **non-commercial**. For commercial use, please contact us.
|
### Limitations & Caveats
|
- **Data Biases:** Although the pretraining dataset was highly curated, it may still contain biases arising from source selection and the inherent limitations of public data.
- **Operational Cost:** Although encoder models carry lower computational costs than generative LLMs, deployment in resource-constrained settings should still be evaluated carefully.
|
## Training Details
|
### Training Procedure
|
#### Preprocessing
|
- **Tokenizer:** Uses the tokenizer from FacebookAI/xlm-roberta-large to ensure consistency with the base model.
- **Handling Long Sequences:** Sequences exceeding 512 tokens are segmented with a stride of 128 tokens; shorter sequences are padded as necessary (see the sketch after this list).
- **OOV Handling:** Out-of-vocabulary words are managed through subword tokenization, ensuring robust handling of any kind of text.
|
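The segmentation scheme above maps directly onto the tokenizer's built-in overflow handling. A minimal sketch, where the input text is a placeholder:

```python
from transformers import AutoTokenizer

# The tokenizer is shared with the base model
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")

long_text = "..."  # placeholder: any Spanish document longer than 512 tokens

# Split into overlapping 512-token windows with a 128-token stride,
# padding the final window up to the maximum length
encoding = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,
    return_overflowing_tokens=True,
    padding="max_length",
)
print(len(encoding["input_ids"]))  # number of 512-token segments
```
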
## Evaluation
|
RigoBERTa 2.0 was evaluated on several Spanish NLP tasks. The evaluation metrics indicate that the model outperforms previous multilingual models and general-purpose Spanish language models.
|
**Key Results:**
|
- Achieves top performance on most of the tested datasets.
|
Breakdown of the results:
|
| |  |
| | [García Subies et al.](https://academic.oup.com/jamia/article-abstract/31/9/2137/7630016) |
| |
|
| |  |
| |
|
| |  |
| |
|
## Citation
|
If you use RigoBERTa 2.0 in your research, please cite it as follows:
|
**BibTeX:**
|
```bibtex
@misc{rigoberta2,
  author    = {Instituto de Ingeniería del Conocimiento},
  title     = {RigoBERTa-2.0},
  year      = 2025,
  url       = {https://huggingface.co/IIC/RigoBERTa-2.0},
  doi       = {10.57967/hf/7048},
  publisher = {Hugging Face}
}
```
|