KokborokBERT
KokborokBERT is the first publicly released masked language model for the Kokborok language. It is built via domain-adaptive fine-tuning of XLM-RoBERTa-base on a curated Kokborok corpus.
Training Performance
The model was trained for 13 epochs on an NVIDIA A40.
| Metric | Baseline (XLM-R) | KokborokBERT |
|---|---|---|
| Masked Loss | 5.9831 | 1.7752 |
| Perplexity | 396.69 | 5.90 |
| Improvement | - | 67.2x lower perplexity |
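The loss and perplexity rows are linked: perplexity is the exponential of the mean masked-token cross-entropy loss. The short check below reproduces the table's figures from the reported losses, assuming the losses are in natural-log units (the card does not say so explicitly). Small differences in the last digit come from the losses being rounded to four decimals.

```python
import math

# Masked-LM losses as reported in the table (natural-log units assumed).
baseline_loss = 5.9831
kokborokbert_loss = 1.7752

# Perplexity is exp(mean cross-entropy loss).
baseline_ppl = math.exp(baseline_loss)
kokborokbert_ppl = math.exp(kokborokbert_loss)

# Ratio of baseline perplexity to fine-tuned perplexity.
ratio = baseline_ppl / kokborokbert_ppl

print(f"Baseline perplexity:     {baseline_ppl:.2f}")
print(f"KokborokBERT perplexity: {kokborokbert_ppl:.2f}")
print(f"Ratio:                   {ratio:.1f}x")
```

The "67.2x" figure is therefore a ratio of perplexities, not of losses; the loss itself drops by a factor of roughly 3.4.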
Usage
You can use this model directly with a pipeline for masked language modeling:
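from transformers import pipeline

# Build a fill-mask pipeline; the checkpoint is downloaded from the Hub on first use.
mask_filler = pipeline("fill-mask", model="MWirelabs/kokborokbert")

# XLM-RoBERTa tokenizers use "<mask>" as the mask token.
test_text = "O kothar-no nwng jeni-hai-pha-no <mask> khlai-man-nai."
results = mask_filler(test_text)

# Each result is a dict holding the predicted token and its probability.
for res in results:
    print(f"Score: {res['score']:.4f} | Prediction: {res['token_str']}")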
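The pipeline hides the tokenizer and model calls. As a rough sketch of what happens underneath, the same prediction can be made by loading the checkpoint with `AutoTokenizer` and `AutoModelForMaskedLM` and reading the logits at the mask position. The helper names `mask_word` and `top_predictions` are illustrative and not part of the model card or the library; the default of five predictions is likewise an arbitrary choice.

```python
MODEL_ID = "MWirelabs/kokborokbert"


def mask_word(sentence: str, word_index: int, mask_token: str = "<mask>") -> str:
    """Replace the whitespace-separated word at word_index with the mask token."""
    words = sentence.split()
    if not 0 <= word_index < len(words):
        raise IndexError("word_index is outside the sentence")
    words[word_index] = mask_token
    return " ".join(words)


def top_predictions(masked_text: str, top_k: int = 5):
    """Return (token, probability) pairs for the single mask in masked_text."""
    # Imported lazily so mask_word works without these packages installed.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(masked_text, return_tensors="pt")

    # Locate the mask token in the tokenized input.
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_positions = is_mask.nonzero(as_tuple=True)[0]
    if mask_positions.numel() != 1:
        raise ValueError("expected exactly one mask token in the input")

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    # Turn the logits at the mask position into a probability distribution.
    probs = logits[0, mask_positions[0]].softmax(dim=-1)
    top = torch.topk(probs, k=top_k)
    return [
        (tokenizer.decode([token_id]).strip(), prob)
        for prob, token_id in zip(top.values.tolist(), top.indices.tolist())
    ]
```

Calling `top_predictions("O kothar-no nwng jeni-hai-pha-no <mask> khlai-man-nai.")` downloads the checkpoint on first use and should rank the same candidates as the pipeline. `mask_word` is only a convenience for masking a chosen word in an existing sentence.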
License
This model is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Limitations and Biases
- Domain Specificity: The model was trained on a specific corpus of ~391k tokens. Performance may vary when applied to dialects or specialized domains (medical, legal) not heavily represented in the training data.
- Base Model Inheritances: As a fine-tuned version of xlm-roberta-base, this model may inherit biases present in the original multilingual pre-training data.
- Task Limitation: This is an encoder-only Masked Language Model. It is designed for tasks like NER, classification, and similarity, but is not intended for text generation (NLG).
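For the encoder-style uses named in the last point, a common starting place is to take the encoder's hidden states as sentence features. The sketch below mean-pools the last hidden layer over non-padding tokens and compares two sentences by cosine similarity. Mean pooling is a widely used heuristic chosen here for illustration, not something the model card prescribes, and the function names `embed` and `cosine_similarity` are hypothetical. Embeddings from a masked language model that has not been fine-tuned for similarity may separate sentences only weakly.

```python
import math

MODEL_ID = "MWirelabs/kokborokbert"


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def embed(sentences):
    """Return one mean-pooled embedding (list of floats) per input sentence."""
    # Imported lazily so cosine_similarity works without these packages installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # AutoModel loads only the encoder; the masked-LM head is not used.
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Average over real tokens only, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return (summed / counts).tolist()
```

With two Kokborok sentences `s1` and `s2`, `vectors = embed([s1, s2])` followed by `cosine_similarity(vectors[0], vectors[1])` gives a score between -1 and 1. For NER or classification, the usual route is to fine-tune the checkpoint with a task head on labeled data.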
Citation
If you use this model in your research, please cite it as follows:
@misc{kokborokbert2026,
author = {MWire Labs},
title = {KokborokBERT: Domain-Adaptive Fine-Tuning of XLM-RoBERTa for Kokborok},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/MWirelabs/kokborokbert}}
}