# NagaNLP NER (XLM-RoBERTa)
NagaNLP-NER is a Named Entity Recognition (NER) model for Nagamese (Naga Pidgin). It is fine-tuned from XLM-RoBERTa to identify Persons (PER), Locations (LOC), Organizations (ORG), and Miscellaneous (MISC) entities.
This model is part of the NagaNLP project, which aims to provide foundational NLP resources for the low-resource languages of Nagaland.
## Model Details
### Training Data
The model was fine-tuned on a manually annotated corpus containing 214 sentences (approx. 4,800 tokens).
- Source: NagaNLP Conversational Corpus subset.
- Tags: CoNLL-2003 format (PER, LOC, ORG, MISC).
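For reference, CoNLL-2003-style data pairs each token with a BIO tag, one token per line. The Nagamese sentence below is invented for illustration and is not taken from the corpus:

```
Kohima    B-LOC
te        O
moi       O
jabo      O
.         O
```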
### Intended Use
This model is intended for:
- Extracting entities from Nagamese text.
- Benchmarking multilingual models (like XLM-R) on extremely low-resource creole languages.
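For the benchmarking use case, one common approach is to score predicted tag sequences against gold annotations with the seqeval library. The sketch below uses made-up gold and predicted sequences purely to show the scoring call:

```python
from seqeval.metrics import classification_report

# Hypothetical gold and predicted BIO tag sequences, one list per sentence.
# Replace these with real annotations and model outputs.
y_true = [["B-LOC", "O", "O", "O", "B-PER", "I-PER"]]
y_pred = [["B-LOC", "O", "O", "O", "B-PER", "O"]]

# Entity-level precision, recall, and F1 per tag type
print(classification_report(y_true, y_pred))
```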
## How to Get Started
You can use this model with the Hugging Face `pipeline`:
```python
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="agnivamaiti/naganlp-ner",
    aggregation_strategy="simple",
)

text = "Etu retreating monsoon normally October mahina start hoi."
results = ner_pipeline(text)

for entity in results:
    print(entity)
```
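If you need raw logits or want to avoid the pipeline abstraction, the checkpoint should also load with the standard token-classification classes. This is a minimal sketch; note that the printed tokens are XLM-R sub-word pieces, including special tokens:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "agnivamaiti/naganlp-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Etu retreating monsoon normally October mahina start hoi."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring label for each sub-word piece
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])
```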
## Limitations
- Data Scarcity: Trained on a very small dataset (214 sentences). It serves as a baseline proof-of-concept and may struggle with vocabulary not seen during training.
- Generalization: May perform poorly on dialects significantly different from the training corpus (Kohima/Dimapur standard).
## Citation
If you use this model, please cite the associated NagaNLP research paper:
Citation details to be added upon publication.