marefa-nlp/marefa-ner

@@ -1,17 +1,18 @@
 ---
 language: ar
 datasets:
 - Marefa-NER
 ---
 # Tebyan تبيـان
 ## Marefa Arabic Named Entity Recognition Model
 ## نموذج المعرفة لتصنيف أجزاء النص
 ---------
-**Version**: 1.2
-**Last Update:** 22-05-2021
 ## Model description
@@ -38,152 +39,131 @@ Install the following Python packages
 > If you are using `Google Colab`, please restart your runtime after installing the packages.
-[**OPTIONAL**]
-Using an Arabic segmentation tool gave better results in many scenarios. If you want to use `FarasaPy` to segment the texts, please ensure that you have `openjdk-11` installed on your machine, then install the package via:
-```bash
-# install openjdk-11-jdk
-$ apt-get install -y build-essential
-$ apt-get install -y openjdk-11-jdk
-# install FarasaPy
-$ pip3 install farasapy==0.0.13
-```
-*Do not forget to set `USE_FARASAPY` to `True` in the following code.*
- You can also set `USE_SENTENCE_TOKENIZER` to `True` to get better results for long texts.
 -----------
 ```python
 # ==== Set configurations
-# do you want to use FarasaPy Segmentation tool ?
-USE_FARASAPY = False # set to True to use it
-# do you want to split text into sentences [better for long texts] ?
-USE_SENTENCE_TOKENIZER = False # set to True to use it
-# ==== Import required modules
-import logging
-import re
 import nltk
 nltk.download('punkt')
-from nltk.tokenize import word_tokenize, sent_tokenize
-from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
-# disable INFO Logs
-transformers_logger = logging.getLogger("transformers")
-transformers_logger.setLevel(logging.WARNING)
-def _extract_ner(sent: str, ner: pipeline) -> list:
-    grouped_ents = []
-    current_ent = {}
-    results = ner(sent)
-    for ent in results:
-        if len(current_ent) == 0:
-            current_ent = ent
-            continue
-        if current_ent["end"] == ent["start"] and current_ent["entity_group"] == ent["entity_group"]:
-            current_ent["word"] = current_ent["word"]+ent["word"]
-        else:
-            grouped_ents.append(current_ent)
-            current_ent = ent
-    if len(grouped_ents) > 0 and grouped_ents[-1] != ent:
-        grouped_ents.append(current_ent)
-    elif len(grouped_ents) == 0 and len(current_ent) > 0:
-        grouped_ents.append(current_ent)
-    return [ g for g in grouped_ents if len(g["word"].strip()) ]
-if USE_FARASAPY:
-	from farasa.segmenter import FarasaSegmenter
-	segmenter = FarasaSegmenter()
-	def _segment_text(text: str, segmenter: FarasaSegmenter) -> str:
-	    segmented = segmenter.segment(text)
-	    f_segments = { w.replace("+",""): w.replace("و+","و ").replace("+","") for w in segmented.split(" ") if w.strip() != "" and w.startswith("و+") }
-	    for s,t in f_segments.items():
-	        text = text.replace(s, t)
-	    return text
-	_ = _segment_text("نص تجريبي للتأكد من عمل الأداة", segmenter)
-custom_labels = ["O", "B-job", "I-job", "B-nationality", "B-person", "I-person", "B-location",
-                 "B-time", "I-time", "B-event", "I-event", "B-organization", "I-organization",
-                 "I-location", "I-nationality", "B-product", "I-product", "B-artwork", "I-artwork"]
-# ==== Import/Download the NER Model
-m_name = "marefa-nlp/marefa-ner"
-tokenizer = AutoTokenizer.from_pretrained(m_name)
-model = AutoModelForTokenClassification.from_pretrained(m_name)
-ar_ner = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True, aggregation_strategy="simple")
-# ==== Model Inference
 samples = [
     "تلقى تعليمه في الكتاب ثم انضم الى الأزهر عام 1873م. تعلم على يد السيد جمال الدين الأفغاني والشيخ محمد عبده",
     "بعد عودته إلى القاهرة، التحق نجيب الريحاني فرقة جورج أبيض، الذي كان قد ضمَّ - قُبيل ذلك - فرقته إلى فرقة سلامة حجازي . و منها ذاع صيته",
     "امبارح اتفرجت على مباراة مانشستر يونايتد مع ريال مدريد في غياب الدون كرستيانو رونالدو",
-    "Government extends flight ban from India and Pakistan until June 21"
 ]
 # [optional]
 samples = [ " ".join(word_tokenize(sample.strip())) for sample in samples if sample.strip() != "" ]
 for sample in samples:
-    ents = []
-    if USE_FARASAPY:
-        sample = _segment_text(sample, segmenter)
-    if USE_SENTENCE_TOKENIZER:
-        for sent in sent_tokenize(sample):
-            ents += _extract_ner(sent, ar_ner)
-    else:
-        ents = _extract_ner(sample, ar_ner)
-    # print the results
-    print("(", sample, ")")
     for ent in ents:
-        print("\t", ent["word"], "=>", ent["entity_group"])
-    print("=========\n")
 ```
 Output
 ```
-( تلقى تعليمه في الكتاب ثم انضم الى الأزهر عام 1873م . تعلم على يد السيد جمال الدين الأفغاني والشيخ محمد عبده )
-	 الأزهر => organization
-	 عام 1873م => time
-	 جمال الدين الأفغاني => person
-	 محمد عبده => person
-=========
-( بعد عودته إلى القاهرة، التحق نجيب الريحاني فرقة جورج أبيض، الذي كان قد ضمَّ - قُبيل ذلك - فرقته إلى فرقة سلامة حجازي . و منها ذاع صيته )
-	 القاهرة => location
-	 نجيب الريحاني => person
-	 فرقة جورج أبيض => organization
-	 فرقة سلامة حجازي => organization
-=========
-( امبارح اتفرجت على مباراة مانشستر يونايتد مع ريال مدريد في غياب الدون كرستيانو رونالدو )
-	 مانشستر يونايتد => organization
-	 ريال مدريد => organization
-	 كرستيانو رونالدو => person
-=========
-( Government extends flight ban from India and Pakistan until June 21 )
-	 India => location
-	 Pakistan => location
-	 June 21 => time
-=========
 ```
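The `USE_SENTENCE_TOKENIZER` option in the earlier version of the snippet tags long texts sentence by sentence and concatenates the results. Below is a minimal sketch of that idea, not the model card's own code. It assumes a hypothetical `extract` callable that takes one sentence and returns a list of entity dicts, and it uses a naive regex splitter (`split_sentences`, also hypothetical) as a stand-in for NLTK's `sent_tokenize` so that it runs without extra downloads.

```python
import re
from typing import Callable, Dict, List

Entity = Dict[str, str]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!', '?' or the Arabic question mark
    when followed by whitespace. A stand-in for nltk's sent_tokenize."""
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [p for p in parts if p]


def extract_per_sentence(text: str,
                         extract: Callable[[str], List[Entity]]) -> List[Entity]:
    """Run the (hypothetical) extractor on each sentence and concatenate."""
    ents: List[Entity] = []
    for sent in split_sentences(text):
        ents.extend(extract(sent))
    return ents


# Toy extractor for illustration only: a dictionary lookup, not the NER model.
KNOWN = {"الأزهر": "organization", "القاهرة": "location"}


def toy_extract(sent: str) -> List[Entity]:
    return [{"token": w, "label": KNOWN[w]} for w in sent.split() if w in KNOWN]


text = "انضم الى الأزهر . عاد إلى القاهرة ."
sentences = split_sentences(text)
entities = extract_per_sentence(text, toy_extract)
for ent in entities:
    print("\t", ent["token"], "==>", ent["label"])
```

With the real model, `extract` would wrap the tagging function from the snippet above. Splitting first keeps each input short, which matters when the tokenizer truncates long sequences.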
 ## Fine-Tuning

 ---
 language: ar
 datasets:
 - Marefa-NER
+widget:
+- text: "في استاد القاهرة، بدأ حفل افتتاح بطولة كأس الأمم الأفريقية بحضور رئيس الجمهورية و رئيس الاتحاد الدولي لكرة القدم"
 ---
 # Tebyan تبيـان
 ## Marefa Arabic Named Entity Recognition Model
 ## نموذج المعرفة لتصنيف أجزاء النص
 ---------
+**Version**: 1.3
+**Last Update:** 3-12-2021
 ## Model description
 > If you are using `Google Colab`, please restart your runtime after installing the packages.
 -----------
 ```python
 # ==== Set configurations
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+import torch
+import numpy as np
 import nltk
 nltk.download('punkt')
+from nltk.tokenize import word_tokenize
+custom_labels = ["O", "B-job", "I-job", "B-nationality", "B-person", "I-person", "B-location","B-time", "I-time", "B-event", "I-event", "B-organization", "I-organization", "I-location", "I-nationality", "B-product", "I-product", "B-artwork", "I-artwork"]
+def _extract_ner(text: str, model: AutoModelForTokenClassification,
+                 tokenizer: AutoTokenizer, start_token: str="▁"):
+    tokenized_sentence = tokenizer([text], padding=True, truncation=True, return_tensors="pt")
+    tokenized_sentences = tokenized_sentence['input_ids'].numpy()
+    with torch.no_grad():
+        output = model(**tokenized_sentence)
+    logits = output[0].numpy()  # token-classification logits: (batch, seq_len, num_labels)
+    label_indices = np.argmax(logits[0], axis=1)
+    tokens = tokenizer.convert_ids_to_tokens(tokenized_sentences[0])
+    special_tags = set(tokenizer.special_tokens_map.values())
+    grouped_tokens = []
+    for token, label_idx in zip(tokens, label_indices):
+        if token not in special_tags:
+            if not token.startswith(start_token) and len(token.replace(start_token,"").strip()) > 0:
+                grouped_tokens[-1]["token"] += token
+            else:
+                grouped_tokens.append({"token": token, "label": custom_labels[label_idx]})
+    # extract entities
+    ents = []
+    prev_label = "O"
+    for token in grouped_tokens:
+        label = token["label"].replace("I-","").replace("B-","")
+        if token["label"] != "O":
+            if label != prev_label:
+                ents.append({"token": [token["token"]], "label": label})
+            else:
+                ents[-1]["token"].append(token["token"])
+        prev_label = label
+    # group tokens
+    ents = [{"token": "".join(rec["token"]).replace(start_token," ").strip(), "label": rec["label"]}  for rec in ents ]
+    return ents
+model_cp = "marefa-nlp/marefa-ner"
+tokenizer = AutoTokenizer.from_pretrained(model_cp)
+model = AutoModelForTokenClassification.from_pretrained(model_cp, num_labels=len(custom_labels))
 samples = [
     "تلقى تعليمه في الكتاب ثم انضم الى الأزهر عام 1873م. تعلم على يد السيد جمال الدين الأفغاني والشيخ محمد عبده",
     "بعد عودته إلى القاهرة، التحق نجيب الريحاني فرقة جورج أبيض، الذي كان قد ضمَّ - قُبيل ذلك - فرقته إلى فرقة سلامة حجازي . و منها ذاع صيته",
+    "في استاد القاهرة، قام حفل افتتاح بطولة كأس الأمم الأفريقية بحضور رئيس الجمهورية و رئيس الاتحاد الدولي لكرة القدم",
+    "من فضلك أرسل هذا البريد الى صديقي جلال الدين في تمام الساعة الخامسة صباحا في يوم الثلاثاء القادم",
     "امبارح اتفرجت على مباراة مانشستر يونايتد مع ريال مدريد في غياب الدون كرستيانو رونالدو",
+    "لا تنسى تصحيني الساعة سبعة, و ضيف في الجدول اني احضر مباراة نادي النصر غدا",
 ]
 # [optional]
 samples = [ " ".join(word_tokenize(sample.strip())) for sample in samples if sample.strip() != "" ]
 for sample in samples:
+    ents = _extract_ner(text=sample, model=model, tokenizer=tokenizer, start_token="▁")
+    print(sample)
     for ent in ents:
+        print("\t",ent["token"],"==>",ent["label"])
+    print("========\n")
 ```
 Output
 ```
+تلقى تعليمه في الكتاب ثم انضم الى الأزهر عام 1873م . تعلم على يد السيد جمال الدين الأفغاني والشيخ محمد عبده
+	 الأزهر ==> organization
+	 عام 1873م ==> time
+	 السيد جمال الدين الأفغاني ==> person
+	 محمد عبده ==> person
+========
+بعد عودته إلى القاهرة، التحق نجيب الريحاني فرقة جورج أبيض، الذي كان قد ضمَّ - قُبيل ذلك - فرقته إلى فرقة سلامة حجازي . و منها ذاع صيته
+	 القاهرة، ==> location
+	 نجيب الريحاني ==> person
+	 فرقة جورج أبيض، ==> organization
+	 فرقة سلامة حجازي ==> organization
+========
+في استاد القاهرة، قام حفل افتتاح بطولة كأس الأمم الأفريقية بحضور رئيس الجمهورية و رئيس الاتحاد الدولي لكرة القدم
+	 استاد القاهرة، ==> location
+	 بطولة كأس الأمم الأفريقية ==> event
+	 رئيس الجمهورية ==> job
+	 رئيس ==> job
+	 الاتحاد الدولي لكرة القدم ==> organization
+========
+من فضلك أرسل هذا البريد الى صديقي جلال الدين في تمام الساعة الخامسة صباحا في يوم الثلاثاء القادم
+	 جلال الدين ==> person
+	 الساعة الخامسة صباحا ==> time
+	 يوم الثلاثاء القادم ==> time
+========
+امبارح اتفرجت على مباراة مانشستر يونايتد مع ريال مدريد في غياب الدون كرستيانو رونالدو
+	 مانشستر يونايتد ==> organization
+	 ريال مدريد ==> organization
+	 كرستيانو رونالدو ==> person
+========
+لا تنسى تصحيني الساعة سبعة , و ضيف في الجدول اني احضر مباراة نادي النصر غدا
+	 الساعة سبعة ==> time
+	 نادي النصر ==> organization
+	 غدا ==> time
+========
 ```
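The post-processing inside `_extract_ner` can be tried without downloading the model. The following is a minimal sketch, assuming hand-written SentencePiece-style pieces and labels in place of real tokenizer and model output. The names `group_subwords`, `merge_entities`, and `pieces` are hypothetical and introduced only for illustration. It simplifies the original condition slightly: any piece that does not start with `▁` is glued to the previous word.

```python
from typing import Dict, List, Tuple

START = "▁"  # SentencePiece word-start marker, as used in the snippet above


def group_subwords(pairs: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Merge subword pieces into words; a word keeps its first piece's label."""
    words: List[Dict[str, str]] = []
    for token, label in pairs:
        if token.startswith(START) or not words:
            words.append({"token": token, "label": label})
        else:
            words[-1]["token"] += token
    return words


def merge_entities(words: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Join consecutive words that share an entity type (B-/I- prefix removed)."""
    ents: List[Dict] = []
    prev = "O"
    for w in words:
        label = w["label"].replace("B-", "").replace("I-", "")
        if w["label"] != "O":
            if label != prev:
                ents.append({"token": [w["token"]], "label": label})
            else:
                ents[-1]["token"].append(w["token"])
        prev = label
    return [
        {"token": "".join(e["token"]).replace(START, " ").strip(), "label": e["label"]}
        for e in ents
    ]


# Hand-written pieces standing in for tokenizer + model output (assumption).
pieces = [
    ("▁نجيب", "B-person"),
    ("▁الري", "I-person"),
    ("حاني", "I-person"),
    ("▁في", "O"),
    ("▁القاهرة", "B-location"),
]
entities = merge_entities(group_subwords(pieces))
for ent in entities:
    print("\t", ent["token"], "==>", ent["label"])
```

Because the merge step compares only the entity type, two adjacent entities of the same type are joined into one span even when the second starts with a `B-` tag.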
 ## Fine-Tuning