# NagaNLP NER (XLM-RoBERTa)
NagaNLP-NER is a Named Entity Recognition (NER) model for Nagamese (Naga Pidgin). It is fine-tuned from XLM-RoBERTa to identify Persons (PER), Locations (LOC), Organizations (ORG), and Miscellaneous (MISC) entities.
This model is part of the NagaNLP project, which aims to provide foundational NLP resources for the low-resource languages of Nagaland.
## Model Details
### Training Data
The model was fine-tuned on a manually annotated corpus containing 214 sentences (approx. 4,800 tokens).
- Source: NagaNLP Conversational Corpus subset.
- Tags: CoNLL-2003 format (PER, LOC, ORG, MISC).
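For reference, CoNLL-2003-style data pairs each token with a BIO tag, one token per line. The Nagamese sentence below is invented for illustration and is not taken from the corpus:

```
Kohima    B-LOC
te        O
moi       O
jabo      O
.         O
```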
### Intended Use
This model is intended for:
- Extracting entities from Nagamese text.
- Benchmarking multilingual models (like XLM-R) on extremely low-resource creole languages.
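For the benchmarking use case, one common approach is to score predicted tag sequences against gold annotations with the seqeval library. The sketch below uses made-up gold and predicted sequences purely to show the scoring call:

```python
from seqeval.metrics import classification_report

# Hypothetical gold and predicted BIO tag sequences, one list per sentence.
# Replace these with real annotations and model outputs.
y_true = [["B-LOC", "O", "O", "O", "B-PER", "I-PER"]]
y_pred = [["B-LOC", "O", "O", "O", "B-PER", "O"]]

# Entity-level precision, recall, and F1 per tag type
print(classification_report(y_true, y_pred))
```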
## How to Get Started
You can use this model with the Hugging Face `pipeline`:
```python
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="agnivamaiti/naganlp-ner",
    aggregation_strategy="simple",
)

text = "Etu retreating monsoon normally October mahina start hoi."
results = ner_pipeline(text)

for entity in results:
    print(entity)
```
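If you need raw logits or want to avoid the pipeline abstraction, the checkpoint should also load with the standard token-classification classes. This is a minimal sketch; note that the printed tokens are XLM-R sub-word pieces, including special tokens:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "agnivamaiti/naganlp-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Etu retreating monsoon normally October mahina start hoi."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring label for each sub-word piece
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])
```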
## Limitations
- Data Scarcity: Trained on a very small dataset (214 sentences). It serves as a baseline proof-of-concept and may struggle with vocabulary not seen during training.
- Generalization: May perform poorly on dialects significantly different from the training corpus (Kohima/Dimapur standard).
## Citation
If you use this model, please cite the associated NagaNLP research paper:
Citation details to be added upon publication.