KokborokBERT

KokborokBERT is the first publicly released masked language model for the Kokborok language. It is built via domain-adaptive fine-tuning of XLM-RoBERTa-base on a curated Kokborok corpus.

Training Performance

The model was trained for 13 epochs on an NVIDIA A40.

Metric	Baseline (XLM-R)	KokborokBERT
Masked Loss	5.9831	1.7752
Perplexity	396.69	5.90
Improvement	-	67.2x lower perplexity
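
The two rows of the table are linked: perplexity is the exponential of the mean masked-token loss. The sketch below reproduces that arithmetic from the reported losses. It assumes the loss is a mean cross-entropy measured in nats (natural logarithm), which the model card does not state explicitly but which matches the reported figures; the variable names are illustrative.

```python
import math

# Masked losses copied from the table above.
# Assumption: mean cross-entropy in nats (natural log).
baseline_loss = 5.9831    # XLM-R baseline
finetuned_loss = 1.7752   # KokborokBERT

# Perplexity is exp(mean cross-entropy).
baseline_ppl = math.exp(baseline_loss)
finetuned_ppl = math.exp(finetuned_loss)

# The "Improvement" row is the ratio of the two perplexities.
ratio = baseline_ppl / finetuned_ppl

print(f"Baseline perplexity:     {baseline_ppl:.2f}")
print(f"KokborokBERT perplexity: {finetuned_ppl:.2f}")
print(f"Perplexity ratio:        {ratio:.1f}x")
```

Because the ratio compares perplexities rather than error rates, it is best read as a reduction in perplexity on the evaluation text, not as a 67x drop in prediction errors.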

Usage

You can use this model directly with a pipeline for masked language modeling:

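from transformers import pipeline

# Load a fill-mask pipeline backed by the KokborokBERT checkpoint.
mask_filler = pipeline("fill-mask", model="MWirelabs/kokborokbert")

# The XLM-RoBERTa tokenizer uses <mask> as its mask token.
test_text = "O kothar-no nwng jeni-hai-pha-no <mask> khlai-man-nai."
results = mask_filler(test_text)

# Each result carries a candidate token and its probability score.
for res in results:
    print(f"Score: {res['score']:.4f} | Prediction: {res['token_str']}")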

License

This model is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Limitations and Biases

Domain Specificity: The model was trained on a curated corpus of ~391k tokens. Performance may degrade on dialects or specialized domains (medical, legal) that are not well represented in the training data.
Base Model Inheritance: As a fine-tuned version of xlm-roberta-base, this model may inherit biases present in the original multilingual pre-training data.
Task Limitation: This is an encoder-only masked language model. It is suited to tasks such as NER, classification, and similarity, typically after task-specific fine-tuning, and is not intended for text generation (NLG).
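
As an illustration of the encoder-style use mentioned above, the following minimal sketch derives sentence vectors by mean-pooling the final hidden states and compares them with cosine similarity. It is not the authors' method: the helper names (mean_pool, cosine_similarity, embed) are hypothetical, mean pooling is one common choice among several, and embeddings from a model that has not been fine-tuned for similarity may be of limited quality. The sketch assumes torch and transformers are installed and that the checkpoint loads through AutoModel.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed(texts, model_name="MWirelabs/kokborokbert"):
    """Hypothetical helper: one mean-pooled vector per input text."""
    # Imported lazily so the pure-Python helpers above work without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # encoder without the MLM head
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # shape: (batch, seq_len, hidden)

    # Padding positions have mask 0 and are skipped by mean_pool.
    return [
        mean_pool(vectors, mask)
        for vectors, mask in zip(hidden.tolist(), batch["attention_mask"].tolist())
    ]
```

Calling embed on two sentences and passing the resulting vectors to cosine_similarity yields a rough similarity score. For NER or classification, the usual route is instead to fine-tune the encoder with a task-specific head.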

Citation

If you use this model in your research, please cite it as follows:

@misc{kokborokbert2026,
  author       = {MWire Labs},
  title        = {KokborokBERT: Domain-Adaptive Fine-Tuning of XLM-RoBERTa for Kokborok},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MWirelabs/kokborokbert}}
}

Model size: 0.3B params (Safetensors, tensor type F32)
