---
library_name: transformers
license: other
license_name: rigoberta-nc
license_link: https://huggingface.co/IIC/RigoBERTa-2.0/blob/main/LICENSE
language:
- es
pipeline_tag: fill-mask
---

# RigoBERTa 2.0
|
| |  |
| |
|
**RigoBERTa 2.0** is a state-of-the-art encoder language model for Spanish, developed through language-adaptive pretraining. It significantly outperforms every previous Spanish encoder model, offering robust language understanding.
|
## Model Details
|
### Model Description
|
**RigoBERTa 2.0** was built by further pretraining the general-purpose FacebookAI/xlm-roberta-large model on a meticulously curated Spanish corpus. The pretraining leverages masked language modeling (MLM) to adapt the model's linguistic knowledge to Spanish.
|
- **Developed by:** IIC
- **Model type:** Encoder
- **Language(s) (NLP):** Spanish
- **License:** rigoberta-nc (permissive non-commercial)
- **Finetuned from model:** FacebookAI/xlm-roberta-large
|
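As a rough illustration of this language-adaptive recipe, the sketch below continues MLM pretraining of the base model with the `transformers` Trainer. This is not the actual RigoBERTa 2.0 training setup: the corpus file, batch size, and output directory are placeholders.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the general-purpose multilingual base model
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

# Placeholder: any curated Spanish corpus, one document per line
corpus = load_dataset("text", data_files={"train": "spanish_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks tokens for the MLM objective (15% is the usual default)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rigoberta-mlm", per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```
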
## Intended Use & Limitations
|
### Intended Use
|
**RigoBERTa 2.0** is designed for:
|
- General text understanding in Spanish (see the usage sketch after this list).
- Applications in NLP tasks such as text classification, named entity recognition, and related downstream tasks.
- Research and development purposes, including benchmarking and further model adaptation.
|
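For quick experimentation, the model can be queried through the standard `transformers` fill-mask pipeline, matching the `pipeline_tag` above. A minimal sketch; the example sentence is illustrative:

```python
from transformers import pipeline

# Load RigoBERTa 2.0 through the fill-mask pipeline
fill_mask = pipeline("fill-mask", model="IIC/RigoBERTa-2.0")

# The XLM-RoBERTa tokenizer uses <mask> as its mask token
for prediction in fill_mask("Madrid es la <mask> de España."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Each returned prediction also includes the completed `sequence` and the raw `token` id.
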
Note that the license is **non-commercial**. For commercial use, please contact us.
|
### Limitations & Caveats
|
- **Data Biases:** Although the pretraining dataset was highly curated, it may still contain biases arising from source selection and the inherent limitations of public data.
- **Operational Cost:** Although encoder models carry lower computational costs than generative LLMs, deployment in resource-constrained settings should still be evaluated carefully.
|
## Training Details
|
### Training Procedure
|
#### Preprocessing
|
- **Tokenizer:** Uses the tokenizer from FacebookAI/xlm-roberta-large to ensure consistency with the base model.
- **Handling Long Sequences:** Sequences exceeding 512 tokens are segmented with a stride of 128 tokens; shorter sequences are padded as necessary (see the sketch after this list).
- **OOV Handling:** Out-of-vocabulary words are managed through subword tokenization, ensuring robust handling of any kind of text.
|
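The segmentation scheme above maps directly onto the tokenizer's built-in overflow handling. A minimal sketch, where the input text is a placeholder:

```python
from transformers import AutoTokenizer

# The tokenizer is shared with the base model
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")

long_text = "..."  # placeholder: any Spanish document longer than 512 tokens

# Split into overlapping 512-token windows with a 128-token stride,
# padding the final window up to the maximum length
encoding = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,
    return_overflowing_tokens=True,
    padding="max_length",
)
print(len(encoding["input_ids"]))  # number of 512-token segments
```
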
## Evaluation
|
RigoBERTa 2.0 was evaluated on several Spanish NLP tasks. The evaluation metrics indicate that the model outperforms previous multilingual models and general-purpose Spanish language models.
|
**Key Results:**
|
- Achieves top performance on most of the tested datasets.
|
Breakdown of the results:
|
| |  |
| | [García Subies et al.](https://academic.oup.com/jamia/article-abstract/31/9/2137/7630016) |
| |
|
| |  |
| |
|
| |  |
| |
|
## Citation
|
If you use RigoBERTa 2.0 in your research, please cite it as follows:
|
**BibTeX:**
|
```bibtex
@misc{rigoberta2,
  author    = {Instituto de Ingeniería del Conocimiento},
  title     = {RigoBERTa-2.0},
  year      = 2025,
  url       = {https://huggingface.co/IIC/RigoBERTa-2.0},
  doi       = {10.57967/hf/7048},
  publisher = {Hugging Face}
}
```
|