---
language:
- da # Danish
- sv # Swedish
- no # Norwegian
- en # English
- de # German
license: mit
base_model: jhu-clsp/mmBERT-base
tags:
- token-classification
- named-entity-recognition
- ner
- nordic-languages
- multilingual
- danish
- swedish
- norwegian
- english
- german
metrics:
- f1
- precision
- recall
widget:
- text: "Barack Obama visited Stockholm and met Stefan Löfven."
  example_title: "English Example"
- text: "Angela Merkel var Tysklands förbundskansler."
  example_title: "Swedish Example"
- text: "Kristian Thulesen Dahl er dansk politiker."
  example_title: "Danish Example"
- text: "Erna Solberg var statsminister i Norge."
  example_title: "Norwegian Example"
---

# Scandi NER Model 🏔️

A multilingual Named Entity Recognition model trained on multiple Scandinavian-language datasets plus English and German. The model identifies **Person (PER)**, **Organization (ORG)**, and **Location (LOC)** entities.

## Model Description

This model is based on `jhu-clsp/mmBERT-base` and has been fine-tuned for token classification on a combined dataset of Scandinavian, English, and German NER corpora. It supports:

- 🇩🇰 **Danish** - Multiple high-quality datasets including DaNE
- 🇸🇪 **Swedish** - SUC 3.0, Swedish NER corpus, and more
- 🇳🇴 **Norwegian** - NorNE (Bokmål and Nynorsk)
- 🇬🇧 **English** - CoNLL-2003 and additional datasets
- 🇩🇪 **German** - WikiANN and MultiCoNER

## Performance

The model achieves the following scores on the held-out test set:

| Metric | Score |
|--------|-------|
| **F1 Score** | 0.8149 |
| **Precision** | 0.8092 |
| **Recall** | 0.8207 |

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/nordic-ner-model")
model = AutoModelForTokenClassification.from_pretrained("your-username/nordic-ner-model")

# Create NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example usage (Swedish: "Barack Obama visited Stockholm and met Stefan Löfven.")
text = "Barack Obama besökte Stockholm och träffade Stefan Löfven."
entities = ner_pipeline(text)
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.3f})")
```

## Supported Entity Types

The model predicts the following entity types using BIO tagging (illustrated in the sketch after this list):

- **PER** (Person): Names of people
- **ORG** (Organization): Companies, institutions, organizations
- **LOC** (Location): Geographic locations, places
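Under BIO tagging, the first token of an entity is labeled `B-*`, continuation tokens are labeled `I-*`, and everything else is `O`. As a minimal sketch (assuming `model` and `tokenizer` are already loaded as in the Quick Start above), the raw token-level tags can be inspected before any aggregation:

```python
import torch

# Sketch: print the raw per-subword BIO label for each token,
# assuming `model` and `tokenizer` from the Quick Start.
text = "Erna Solberg var statsminister i Norge."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

pred_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, pred_ids):
    label = model.config.id2label[pred_id.item()]
    if label != "O":  # skip non-entity tokens for brevity
        print(f"{token} -> {label}")  # e.g. "Erna -> B-PER", "Solberg -> I-PER"
```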
## Training Data

The model was trained on a combination of the following datasets:

- **eriktks/conll2003**: 20,682 examples
- **NbAiLab/norne_bokmaal-7**: 20,044 examples
- **NbAiLab/norne_nynorsk-7**: 17,575 examples
- **KBLab/sucx3_ner_original_lower**: 71,915 examples
- **alexandrainst/dane**: 5,508 examples
- **klintan/swedish_ner_corpus**: 9,330 examples
- **ljos/norwegian_ner_nynorsk**: 17,575 examples
- **ljos/norwegian_ner_bokmaal**: 20,044 examples
- **chcaa/dansk-ner**: 14,651 examples
- **unimelb-nlp/wikiann_da**: 40,000 examples
- **unimelb-nlp/wikiann_sv**: 40,000 examples
- **unimelb-nlp/wikiann_no**: 40,000 examples
- **unimelb-nlp/wikiann_en**: 40,000 examples
- **unimelb-nlp/wikiann_de**: 40,000 examples
- **MultiCoNER/multiconer_v2_Swedish (SV)**: 248,409 examples
- **MultiCoNER/multiconer_v2_English (EN)**: 267,629 examples
- **MultiCoNER/multiconer_v2_German (DE)**: 30,442 examples

## Dataset Statistics

- Total examples: 943,804
- Average sequence length: 11.8 tokens
- Languages: en, no, sv, da, unknown, de

Label distribution:

| Label | Count | Share |
|-------|-------|-------|
| O | 9,270,690 | 82.9% |
| B-PER | 427,033 | 3.8% |
| I-PER | 475,887 | 4.3% |
| B-LOC | 341,771 | 3.1% |
| I-LOC | 144,536 | 1.3% |
| B-ORG | 241,558 | 2.2% |
| I-ORG | 276,449 | 2.5% |

## Training Details

### Training Hyperparameters

- Base model: `jhu-clsp/mmBERT-base`
- Training epochs: 30
- Batch size: 16
- Learning rate: 2e-05
- Warmup steps: 5000
- Weight decay: 0.01

### Training Infrastructure

- Mixed precision: False
- Gradient accumulation: 1
- Early stopping: Enabled with patience=3

## Usage Examples

### Basic NER Tagging

```python
text = "Olof Palme var Sveriges statsminister."
entities = ner_pipeline(text)
# Output: [{'entity_group': 'PER', 'word': 'Olof Palme', 'start': 0, 'end': 10, 'score': 0.999}]
```

### Batch Processing

```python
texts = [
    "Microsoft was founded by Bill Gates.",
    "Angela Merkel var förbundskansler i Tyskland.",
    "Universitetet i Oslo ligger i Norge."
]

for text in texts:
    entities = ner_pipeline(text)
    print(f"Text: {text}")
    for entity in entities:
        print(f"  {entity['word']} -> {entity['entity_group']}")
```

## Limitations and Considerations

- **Domain**: Primarily trained on news and Wikipedia text; performance may vary on other domains.
- **Subword handling**: The model uses subword tokenization; make sure predictions are aggregated back to word level (see the sketch below).
- **Language mixing**: While the model is multilingual, performance is best when languages are not mixed within a sentence.
- **Entity coverage**: Limited to PER, ORG, and LOC; the model does not detect MISC, DATE, or other entity types.
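Because of the subword tokenization noted above, entities spanning several subwords must be merged back into whole words. A brief sketch comparing the built-in `aggregation_strategy` options of the `transformers` token-classification pipeline, assuming `model` and `tokenizer` are loaded as in the Quick Start:

```python
from transformers import pipeline

# Sketch: compare aggregation strategies for merging subword
# predictions into word-level entities. "simple" groups adjacent
# tokens with the same entity; "first"/"max"/"average" resolve
# label disagreements between subwords of one word differently.
for strategy in ("simple", "first", "max", "average"):
    ner = pipeline(
        "ner",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy=strategy,
    )
    entities = ner("Universitetet i Oslo ligger i Norge.")
    print(strategy, [(e["word"], e["entity_group"]) for e in entities])
```

For most applications, `"simple"` (used in the Quick Start) is a reasonable default; the word-boundary-aware strategies mainly matter when subwords of the same word receive conflicting labels.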