---
license: apache-2.0
datasets:
- eriktks/conll2003
language:
- en
base_model:
- stefan-it/ettin-encoder-400m-tokenizer-fix
tags:
- ner
---

# ✨ Ettin 400M for NER

This repository hosts an Ettin 400M model that was fine-tuned on the CoNLL-2003 NER dataset with the awesome Flair library.

Please note the following caveats:

* ⚠️ To work around a tokenizer problem in ModernBERT/Ettin, this model was fine-tuned on a [forked and modified](https://huggingface.co/stefan-it/ettin-encoder-400m-tokenizer-fix) Ettin 400M model.
* ⚠️ At the moment, don't expect "uber" BERT-like performance; more experiments are needed. I am pretty sure that RoPE is causing this.

## 📝 Implementation

The model was trained using my [ModernBERT experiments](https://github.com/stefan-it/modern-bert-ner) repository. A hedged sketch of such a fine-tuning run is shown at the end of this card.

## 📊 Performance

A very basic hyper-parameter search was performed over five different seeds; the table below reports the micro F1-score on the CoNLL-2003 development set for each run, together with the average:

| Configuration          | Run 1 | Run 2 | Run 3     | Run 4 | Run 5 | Avg.         |
|------------------------|-------|-------|-----------|-------|-------|--------------|
| `bs16-e10-cs0-lr4e-05` | 96    | 96.17 | **96.31** | 96.19 | 96.2  | 96.17 ± 0.1  |
| `bs16-e10-cs0-lr3e-05` | 96.25 | 96.23 | 96.12     | 96.3  | 95.81 | 96.14 ± 0.18 |
| `bs16-e10-cs0-lr2e-05` | 96.09 | 96.24 | 95.88     | 96.1  | 96.12 | 96.09 ± 0.12 |
| `bs16-e10-cs0-lr5e-05` | 95.98 | 95.93 | 96.11     | 96.1  | 96    | 96.02 ± 0.07 |
| `bs16-e10-cs0-lr1e-05` | 95.77 | 95.8  | 96.14     | 96.01 | 95.84 | 95.91 ± 0.14 |

The score of the currently uploaded model is marked in bold.

## 📣 Usage

The following code can be used to test the model and recognize named entities in a given sentence:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Load the model
tagger = SequenceTagger.load("stefan-it/flair-ettin-400m-ner-conll03")

# Define an example sentence
sentence = Sentence("George Washington went to Washington very fast.")

# Predict named entities
tagger.predict(sentence)

# Print the recognized named entities
print("The following named entities are found:")

for entity in sentence.get_spans('ner'):
    print(entity)
```

This outputs:

```text
Span[0:2]: "George Washington" → PER (1.0000)
Span[4:5]: "Washington" → LOC (1.0000)
```
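
## 🔬 Fine-tuning sketch

For reference, here is a minimal sketch of how a comparable fine-tuning run could be set up with Flair, using the hyper-parameters of the best configuration above (batch size 16, 10 epochs, no document context, learning rate 4e-05). This is an illustrative assumption rather than the exact training script — the real scripts live in the [ModernBERT experiments](https://github.com/stefan-it/modern-bert-ner) repository — and details such as `use_context`, the subtoken pooling strategy, the tagging head settings, and the output path are guesses.

```python
from flair.datasets import CONLL_03
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Load the CoNLL-2003 corpus and build the NER label dictionary
corpus = CONLL_03()
label_type = "ner"
label_dictionary = corpus.make_label_dictionary(label_type=label_type)

# Use the tokenizer-fixed Ettin 400M encoder as fine-tunable embeddings
# (use_context=False is an assumption, mirroring the "cs0" part of the configuration name)
embeddings = TransformerWordEmbeddings(
    model="stefan-it/ettin-encoder-400m-tokenizer-fix",
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=False,
)

# Plain linear tagging head on top of the transformer (no CRF, no RNN) — an assumption
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dictionary,
    tag_type=label_type,
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# Fine-tune with the hyper-parameters from the bs16-e10-cs0-lr4e-05 configuration
trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/ettin-400m-conll03",  # hypothetical output path
    learning_rate=4e-5,
    mini_batch_size=16,
    max_epochs=10,
)
```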