october-finetuning-more-variables-sweep-20251012-211034-t15

Slur reclamation binary classifier
Task: distinguish LGBTQ+ reclamation from non-reclamation uses of harmful words in social media text.

Trial timestamp (UTC): 2025-10-12 21:10:34

Data case: en-es-it

Configuration (trial hyperparameters)

Model: Alibaba-NLP/gte-multilingual-base

Hyperparameter   Value
LANGUAGES        en-es-it
LR               3e-05
EPOCHS           3
MAX_LENGTH       256
USE_BIO          False
USE_LANG_TOKEN   False
GATED_BIO        False
FOCAL_LOSS       True
FOCAL_GAMMA      2.5
USE_SAMPLER      True
R_DROP           True
R_KL_ALPHA       0.5
TEXT_NORMALIZE   True
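
The training code is not part of this card, so the following is only a minimal sketch of how FOCAL_LOSS, FOCAL_GAMMA, R_DROP and R_KL_ALPHA could fit together for a single example. The function names, the averaging of the two passes, and the symmetric-KL form are assumptions for illustration, not the author's implementation.

```python
import math

def focal_loss(probs, target, gamma=2.5):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def kl(p, q):
    """KL(p || q) for two discrete distributions with non-zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def r_drop_loss(probs_a, probs_b, target, gamma=2.5, kl_alpha=0.5):
    """Mean focal loss over two dropout passes plus a weighted symmetric KL term."""
    task = 0.5 * (focal_loss(probs_a, target, gamma) + focal_loss(probs_b, target, gamma))
    consistency = 0.5 * (kl(probs_a, probs_b) + kl(probs_b, probs_a))
    return task + kl_alpha * consistency

# Two hypothetical forward passes over the same input (class 1 = recl).
pass_a = [0.30, 0.70]
pass_b = [0.40, 0.60]
loss = r_drop_loss(pass_a, pass_b, target=1)
print(f"loss = {loss:.4f}")
```

With gamma = 0 the focal term reduces to plain cross-entropy, and when both passes agree exactly the KL term vanishes, so the loss falls back to the focal loss alone.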

Dev set results (summary)

Metric                         Value
f1_macro_dev_0.5               0.7215676547190872
f1_weighted_dev_0.5            0.847474799649862
accuracy_dev_0.5               0.8329621380846325
f1_macro_dev_best_global       0.7215676547190872
f1_weighted_dev_best_global    0.847474799649862
accuracy_dev_best_global       0.8329621380846325
f1_macro_dev_best_by_lang      0.7215676547190872
f1_weighted_dev_best_by_lang   0.847474799649862
accuracy_dev_best_by_lang      0.8329621380846325
default_threshold              0.5
best_threshold_global          0.5
thresholds_by_lang             {"en": 0.5, "it": 0.5, "es": 0.5}

Thresholds

  • Default: 0.5
  • Best global: 0.5
  • Best by language: { "en": 0.5, "it": 0.5, "es": 0.5 }
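
The card does not say how the best thresholds were searched. A minimal sketch, assuming a fixed grid and macro F1 on Dev as the selection criterion (both assumptions, as are the helper names and the toy data), could look like this:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for labels 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def best_threshold(y_true, probs, grid=None, default=0.5):
    """Return (threshold, macro F1); the default is kept unless a grid value is strictly better."""
    grid = grid or [round(0.05 * i, 2) for i in range(1, 20)]
    best_t = default
    best_f1 = macro_f1(y_true, [int(p >= default) for p in probs])
    for t in grid:
        f1 = macro_f1(y_true, [int(p >= t) for p in probs])
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Hypothetical Dev labels and class-1 probabilities.
y_true = [0, 0, 0, 1, 1, 0]
probs = [0.10, 0.20, 0.55, 0.60, 0.90, 0.30]
t, f1 = best_threshold(y_true, probs)
print(t, f1)
```

Per-language thresholds would follow from calling the same function on each language's subset. Because the default is replaced only on a strict improvement, a search like this returns 0.5 whenever no other grid value scores higher, which is consistent with all thresholds in this trial being 0.5.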

Detailed evaluation

Classification report @ 0.5

              precision    recall  f1-score   support

 no-recl (0)     0.9454    0.8545    0.8977       385
    recl (1)     0.4455    0.7031    0.5455        64

    accuracy                         0.8330       449
   macro avg     0.6955    0.7788    0.7216       449
weighted avg     0.8742    0.8330    0.8475       449
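
As a sanity check, the confusion-matrix counts can be recovered from the rounded report above, and the summary metrics recomputed from them. This is a small worked example; the variable names are illustrative.

```python
# Support and recall give the correctly classified count per class;
# the remainder of each class is misclassified.
support = {"no-recl": 385, "recl": 64}
recall = {"no-recl": 0.8545, "recl": 0.7031}

tn = round(support["no-recl"] * recall["no-recl"])  # no-recl predicted as no-recl
tp = round(support["recl"] * recall["recl"])        # recl predicted as recl
fp = support["no-recl"] - tn                        # no-recl predicted as recl
fn = support["recl"] - tp                           # recl predicted as no-recl

total = tn + fp + fn + tp
f1_no_recl = 2 * tn / (2 * tn + fn + fp)
f1_recl = 2 * tp / (2 * tp + fp + fn)
accuracy = (tn + tp) / total
macro_f1 = (f1_no_recl + f1_recl) / 2
weighted_f1 = (f1_no_recl * support["no-recl"] + f1_recl * support["recl"]) / total

print(tn, fp, fn, tp)  # 329 56 19 45
print(accuracy, macro_f1, weighted_f1)
```

The recomputed values agree with the summary table: 56 of the 101 texts predicted as recl are false positives, which explains the low recl precision (45/101) despite a recall of 45/64.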

Classification report @ best global threshold (t=0.50)

              precision    recall  f1-score   support

 no-recl (0)     0.9454    0.8545    0.8977       385
    recl (1)     0.4455    0.7031    0.5455        64

    accuracy                         0.8330       449
   macro avg     0.6955    0.7788    0.7216       449
weighted avg     0.8742    0.8330    0.8475       449

Classification report @ best per-language thresholds

              precision    recall  f1-score   support

 no-recl (0)     0.9454    0.8545    0.8977       385
    recl (1)     0.4455    0.7031    0.5455        64

    accuracy                         0.8330       449
   macro avg     0.6955    0.7788    0.7216       449
weighted avg     0.8742    0.8330    0.8475       449

Per-language metrics (at best-by-lang)

lang    n     acc  f1_macro  f1_weighted  prec_macro  rec_macro  prec_weighted  rec_weighted
en    154  0.8182    0.5371       0.8369      0.5338     0.5516         0.8588        0.8182
it    163  0.8712    0.8147       0.8781      0.7889     0.8587         0.8941        0.8712
es    132  0.8030    0.7128       0.8250      0.6892     0.8018         0.8762        0.8030
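
The per-language rows can be tied back to the pooled Dev metrics with a short check. Accuracy pools exactly (correct counts add up), whereas macro F1 does not: the size-weighted mean of the per-language macro F1 values is about 0.69, below the pooled 0.7216. The names below are illustrative.

```python
# (n, accuracy, macro F1) per language, copied from the table above.
per_lang = {
    "en": (154, 0.8182, 0.5371),
    "it": (163, 0.8712, 0.8147),
    "es": (132, 0.8030, 0.7128),
}

total_n = sum(n for n, _, _ in per_lang.values())
correct = {lg: round(n * acc) for lg, (n, acc, _) in per_lang.items()}
pooled_acc = sum(correct.values()) / total_n

# Size-weighted mean of per-language macro F1; NOT the same as pooled macro F1.
mean_lang_f1 = sum(n * f1 for n, _, f1 in per_lang.values()) / total_n

print(total_n, correct, round(pooled_acc, 4), round(mean_lang_f1, 4))
```

English is the weak spot: its accuracy is close to the other languages, but its macro F1 of 0.5371 indicates that the recl class is rarely recovered there.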

Data

  • Train/Dev: private multilingual splits; Dev is ~15% of the data, stratified by (lang, label).
  • Source: merged EN/IT/ES data with bios retained (the bios are ignored when the model does not use them).
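
The splits themselves are private, but a split stratified by (lang, label) can be sketched as follows. The row format, the helper name, the seed and the toy data are all hypothetical; only the ~15% Dev fraction and the stratification key come from the card.

```python
import random
from collections import defaultdict

def stratified_split(rows, dev_frac=0.15, seed=0):
    """Split rows into train/dev, drawing dev_frac from each (lang, label) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["lang"], row["label"])].append(row)
    rng = random.Random(seed)
    train, dev = [], []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        # Keep at least one dev row per group unless the group has a single row.
        n_dev = max(1, round(len(members) * dev_frac)) if len(members) > 1 else 0
        dev.extend(members[:n_dev])
        train.extend(members[n_dev:])
    return train, dev

# Hypothetical toy data: 20 rows for each (lang, label) pair.
rows = [{"text": f"{lg}-{lb}-{i}", "lang": lg, "label": lb}
        for lg in ("en", "es", "it") for lb in (0, 1) for i in range(20)]
train, dev = stratified_split(rows)
print(len(train), len(dev))
```

Stratifying on the pair rather than on the label alone keeps the minority recl class represented in every language's Dev subset, which the per-language metrics depend on.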

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import numpy as np
import torch

repo = "SimoneAstarita/october-finetuning-more-variables-sweep-20251012-211034-t15"
# If loading fails because the base architecture needs custom code,
# pass trust_remote_code=True to the from_pretrained calls.
tok = AutoTokenizer.from_pretrained(repo)
cfg = AutoConfig.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()  # make sure dropout is disabled for inference

texts = ["example text ..."]
langs = ["en"]  # one language code per text; only used in "by_lang" mode

mode = "best_global"  # or "0.5", "by_lang"

enc = tok(texts, truncation=True, padding=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
# Probability of class 1 (recl) for each text
probs = torch.softmax(logits, dim=-1)[:, 1].cpu().numpy()

if mode == "0.5":
    th = 0.5
    preds = (probs >= th).astype(int)
elif mode == "best_global":
    th = getattr(cfg, "best_threshold_global", 0.5)
    preds = (probs >= th).astype(int)
elif mode == "by_lang":
    th_by_lang = getattr(cfg, "thresholds_by_lang", {})
    langs_arr = np.array(langs)
    preds = np.zeros_like(probs, dtype=int)
    for lg in np.unique(langs_arr):
        # Fall back to the global threshold for languages without their own
        t = th_by_lang.get(lg, getattr(cfg, "best_threshold_global", 0.5))
        mask = langs_arr == lg
        preds[mask] = (probs[mask] >= t).astype(int)
else:
    raise ValueError(f"unknown mode: {mode!r}")
print(list(zip(texts, preds, probs)))

Additional files

  • reports.json: all metrics (macro/weighted/accuracy) for @0.5, @best_global, and @best_by_lang.
  • config.json: stores thresholds: default_threshold, best_threshold_global, thresholds_by_lang.
  • postprocessing.json: duplicate threshold info for external tools.
