---
library_name: transformers
license: other
license_name: rigoberta-nc
license_link: https://huggingface.co/IIC/RigoBERTa-2.0/blob/main/LICENSE
language:
- es
pipeline_tag: fill-mask
---

# RigoBERTa 2.0

![Logo](./data/rigoberta.jpg)

**RigoBERTa 2.0** is a state-of-the-art encoder language model for Spanish, developed through language-adaptive pretraining. It significantly outperforms all previous Spanish encoder models, offering robust language understanding.

## Model Details

### Model Description

**RigoBERTa 2.0** was built by further pretraining the general-purpose FacebookAI/xlm-roberta-large on a meticulously curated Spanish corpus. The pretraining leverages masked language modeling (MLM) to adapt the model's linguistic knowledge to Spanish.

- **Developed by:** IIC
- **Model type:** Encoder
- **Language(s) (NLP):** Spanish
- **License:** rigoberta-nc (permissive, non-commercial)
- **Finetuned from model:** FacebookAI/xlm-roberta-large

## Intended Use & Limitations

### Intended Use

**RigoBERTa 2.0** is designed for:

- General text understanding in Spanish.
- NLP tasks such as text classification, named entity recognition, and related downstream tasks.
- Research and development purposes, including benchmarking and further model adaptation.

Minimal usage sketches (masked-token prediction and a fine-tuning stub) are provided in the How to Get Started section at the end of this card.

Note that the license is **non-commercial**. For commercial use, please contact us.

### Limitations & Caveats

- **Data Biases:** Although the dataset was highly curated, it may contain biases due to source selection and the inherent limitations of public data.
- **Operational Cost:** Although encoder-based models have relatively lower computational costs than generative LLMs, deployment in resource-constrained settings should still be evaluated carefully.

## Training Details

### Training Procedure

#### Preprocessing

- **Tokenizer:** Uses the tokenizer from FacebookAI/xlm-roberta-large to ensure consistency with the base model.
- **Handling Long Sequences:** Sequences exceeding 512 tokens are segmented with a stride of 128 tokens; shorter sequences are padded as necessary (see the windowing sketch in How to Get Started).
- **OOV Handling:** Out-of-vocabulary words are managed through subword tokenization, maintaining robust handling of any kind of text.

## Evaluation

RigoBERTa 2.0 was evaluated on several Spanish NLP tasks. The evaluation metrics indicate that the model outperforms previous multilingual models and general Spanish language models.

**Key Results:**

- Achieves top performance on most of the tested datasets.

Breakdown of the results:

![Clinical Bench](./data/becnh1.jpg)

[García Subies et al.](https://academic.oup.com/jamia/article-abstract/31/9/2137/7630016)

![Bench2](./data/becnh2.jpg)

![Bench3](./data/becnh3.jpg)

## Citation

If you use RigoBERTa 2.0 in your research, please cite it as follows:

**BibTeX:**

```bibtex
@misc{rigoberta2,
  author = {Instituto de Ingeniería del Conocimiento},
  title = {RigoBERTa-2.0},
  year = 2025,
  url = {https://huggingface.co/IIC/RigoBERTa-2.0},
  doi = {10.57967/hf/7048},
  publisher = {Hugging Face}
}
```
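## How to Get Started

Since the card's pipeline tag is `fill-mask`, the model can be queried directly for masked-token prediction. Below is a minimal sketch; the example sentence is illustrative, and the `<mask>` token is inherited from the XLM-RoBERTa tokenizer.

```python
from transformers import pipeline

# Load the fill-mask pipeline with RigoBERTa 2.0 from the Hugging Face Hub.
fill_mask = pipeline("fill-mask", model="IIC/RigoBERTa-2.0")

# XLM-RoBERTa-style tokenizers use "<mask>" as the mask token.
for prediction in fill_mask("La capital de España es <mask>."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```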
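The preprocessing described above segments long inputs into overlapping 512-token windows with a 128-token stride. The exact pretraining pipeline is not published here, but the same windowing scheme can be reproduced with the standard tokenizer options, as in this sketch (the repeated Spanish sentence is a stand-in for a long document):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IIC/RigoBERTa-2.0")

# Stand-in for any Spanish document longer than 512 tokens.
long_text = (
    "El Instituto de Ingeniería del Conocimiento desarrolla "
    "modelos de lenguaje en español. "
) * 200

# Overlapping 512-token windows with a 128-token stride; the final,
# shorter window is padded, mirroring the preprocessing described above.
encoded = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,
    return_overflowing_tokens=True,
    padding="max_length",
)

print(f"Windows produced: {len(encoded['input_ids'])}")
```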
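For the downstream tasks named under Intended Use (e.g., text classification), a task head can be attached to the encoder and fine-tuned. The sketch below is hypothetical: the two-label head is an assumption for illustration, and its outputs are meaningless until the head is trained.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical two-label classification head on top of the encoder;
# the head weights are freshly initialized and must be fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(
    "IIC/RigoBERTa-2.0", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("IIC/RigoBERTa-2.0")

inputs = tokenizer("Este modelo entiende español.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]); scores are random before fine-tuning
```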