---
language:
- da # Danish
- sv # Swedish
- no # Norwegian
- en # English
- de # German
license: mit
base_model: jhu-clsp/mmBERT-base
tags:
- token-classification
- named-entity-recognition
- ner
- nordic-languages
- multilingual
- danish
- swedish
- norwegian
- english
- german
metrics:
- f1
- precision
- recall
widget:
- text: "Barack Obama visited Stockholm and met Stefan Löfven."
  example_title: "English Example"
- text: "Angela Merkel var Tysklands förbundskansler."
  example_title: "Swedish Example"
- text: "Kristian Thulesen Dahl er dansk politiker."
  example_title: "Danish Example"
- text: "Erna Solberg var statsminister i Norge."
  example_title: "Norwegian Example"
---

# Scandi NER Model 🏔️

A multilingual Named Entity Recognition model trained on multiple Scandinavian-language datasets plus English and German. The model identifies **Person (PER)**, **Organization (ORG)**, and **Location (LOC)** entities.

## Model Description

This model is based on `jhu-clsp/mmBERT-base` and has been fine-tuned for token classification on a combined dataset of Scandinavian, English, and German NER corpora. It supports:

- 🇩🇰 **Danish** - Multiple high-quality datasets including DaNE
- 🇸🇪 **Swedish** - SUC 3.0, Swedish NER corpus, and more
- 🇳🇴 **Norwegian** - NorNE (Bokmål and Nynorsk)
- 🇬🇧 **English** - CoNLL-2003 and additional datasets
- 🇩🇪 **German** - WikiANN and MultiCoNER

## Performance

The model achieves the following scores on the held-out test set:

| Metric | Score |
|--------|-------|
| **F1 Score** | 0.8149 |
| **Precision** | 0.8092 |
| **Recall** | 0.8207 |

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/nordic-ner-model")
model = AutoModelForTokenClassification.from_pretrained("your-username/nordic-ner-model")

# Create NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example usage (Swedish: "Barack Obama visited Stockholm and met Stefan Löfven.")
text = "Barack Obama besökte Stockholm och träffade Stefan Löfven."
entities = ner_pipeline(text)
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.3f})")
```

## Supported Entity Types

The model predicts the following entity types using BIO tagging (illustrated in the sketch after this list):

- **PER** (Person): Names of people
- **ORG** (Organization): Companies, institutions, organizations
- **LOC** (Location): Geographic locations, places
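Under BIO tagging, the first token of an entity is labeled `B-*`, continuation tokens are labeled `I-*`, and everything else is `O`. As a minimal sketch (assuming `model` and `tokenizer` are already loaded as in the Quick Start above), the raw token-level tags can be inspected before any aggregation:

```python
import torch

# Sketch: print the raw per-subword BIO label for each token,
# assuming `model` and `tokenizer` from the Quick Start.
text = "Erna Solberg var statsminister i Norge."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

pred_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, pred_ids):
    label = model.config.id2label[pred_id.item()]
    if label != "O":  # skip non-entity tokens for brevity
        print(f"{token} -> {label}")  # e.g. "Erna -> B-PER", "Solberg -> I-PER"
```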
## Training Data

The model was trained on a combination of the following datasets:

- **eriktks/conll2003**: 20,682 examples
- **NbAiLab/norne_bokmaal-7**: 20,044 examples
- **NbAiLab/norne_nynorsk-7**: 17,575 examples
- **KBLab/sucx3_ner_original_lower**: 71,915 examples
- **alexandrainst/dane**: 5,508 examples
- **klintan/swedish_ner_corpus**: 9,330 examples
- **ljos/norwegian_ner_nynorsk**: 17,575 examples
- **ljos/norwegian_ner_bokmaal**: 20,044 examples
- **chcaa/dansk-ner**: 14,651 examples
- **unimelb-nlp/wikiann_da**: 40,000 examples
- **unimelb-nlp/wikiann_sv**: 40,000 examples
- **unimelb-nlp/wikiann_no**: 40,000 examples
- **unimelb-nlp/wikiann_en**: 40,000 examples
- **unimelb-nlp/wikiann_de**: 40,000 examples
- **MultiCoNER/multiconer_v2_Swedish (SV)**: 248,409 examples
- **MultiCoNER/multiconer_v2_English (EN)**: 267,629 examples
- **MultiCoNER/multiconer_v2_German (DE)**: 30,442 examples

## Dataset Statistics

- Total examples: 943,804
- Average sequence length: 11.8 tokens
- Languages: en, no, sv, da, unknown, de

Label distribution:

| Label | Count | Share |
|-------|-------|-------|
| O | 9,270,690 | 82.9% |
| B-PER | 427,033 | 3.8% |
| I-PER | 475,887 | 4.3% |
| B-LOC | 341,771 | 3.1% |
| I-LOC | 144,536 | 1.3% |
| B-ORG | 241,558 | 2.2% |
| I-ORG | 276,449 | 2.5% |

## Training Details

### Training Hyperparameters

- Base model: `jhu-clsp/mmBERT-base`
- Training epochs: 30
- Batch size: 16
- Learning rate: 2e-05
- Warmup steps: 5000
- Weight decay: 0.01

### Training Infrastructure

- Mixed precision: False
- Gradient accumulation: 1
- Early stopping: Enabled with patience=3

## Usage Examples

### Basic NER Tagging

```python
text = "Olof Palme var Sveriges statsminister."
entities = ner_pipeline(text)
# Output: [{'entity_group': 'PER', 'word': 'Olof Palme', 'start': 0, 'end': 10, 'score': 0.999}]
```

### Batch Processing

```python
texts = [
    "Microsoft was founded by Bill Gates.",
    "Angela Merkel var förbundskansler i Tyskland.",
    "Universitetet i Oslo ligger i Norge."
]

for text in texts:
    entities = ner_pipeline(text)
    print(f"Text: {text}")
    for entity in entities:
        print(f"  {entity['word']} -> {entity['entity_group']}")
```

## Limitations and Considerations

- **Domain**: Primarily trained on news and Wikipedia text; performance may vary on other domains.
- **Subword handling**: The model uses subword tokenization; make sure predictions are aggregated back to word level (see the sketch below).
- **Language mixing**: While the model is multilingual, performance is best when languages are not mixed within a sentence.
- **Entity coverage**: Limited to PER, ORG, and LOC; the model does not detect MISC, DATE, or other entity types.
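Because of the subword tokenization noted above, entities spanning several subwords must be merged back into whole words. A brief sketch comparing the built-in `aggregation_strategy` options of the `transformers` token-classification pipeline, assuming `model` and `tokenizer` are loaded as in the Quick Start:

```python
from transformers import pipeline

# Sketch: compare aggregation strategies for merging subword
# predictions into word-level entities. "simple" groups adjacent
# tokens with the same entity; "first"/"max"/"average" resolve
# label disagreements between subwords of one word differently.
for strategy in ("simple", "first", "max", "average"):
    ner = pipeline(
        "ner",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy=strategy,
    )
    entities = ner("Universitetet i Oslo ligger i Norge.")
    print(strategy, [(e["word"], e["entity_group"]) for e in entities])
```

For most applications, `"simple"` (used in the Quick Start) is a reasonable default; the word-boundary-aware strategies mainly matter when subwords of the same word receive conflicting labels.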