---
library_name: transformers
license: mit
datasets:
- data-is-better-together/fineweb-c
- HuggingFaceFW/fineweb-edu-llama3-annotations
- tartuNLP/fineweb-c-combined-resample
base_model:
- jhu-clsp/mmBERT-small
language:
- ekk
- eng
- lvs
- kor
- kin
- ita
- hin
- gom
- glg
- fra
- fas
- eus
- deu
- dan
- cat
- bho
- bak
- arz
- ary
- arb
- yor
- vie
- ukr
- tur
- tir
- tel
- tam
- swe
- spa
- slk
- rus
- por
- pbt
- nld
- nds
- vls
- srp
- sco
- lat
- ind
- gmh
- bul
- bre
- bar
- apc
- aeb
- ron
- mar
- kas
- crh
- zsm
- yue
- uzn
- udm
- tha
- tat
- som
- sin
- pol
- pcm
- npi
- nan
- mal
- lug
- lit
- lez
- asm
- fil
- cmn
- ces
- ben
- ast
- ars
- jpn
- guj
- gsw
- fin
- tok
- quz
- pfl
- pdc
- nob
- lij
- hun
- hsb
- goh
- fao
pipeline_tag: text-classification
---

# Multilingual Educational Content Classifier

This model rates the educational value of a document. It was trained on full documents, with inputs of up to 8192 tokens each. Training used the train split of [tartuNLP/fineweb-c-combined-resample](https://huggingface.co/datasets/tartuNLP/fineweb-c-combined-resample),
which is a mix and resample of [HuggingFaceFW/fineweb-edu-llama3-annotations](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations) and
[data-is-better-together/fineweb-c](https://huggingface.co/datasets/data-is-better-together/fineweb-c).

## Labels

```
{0: '❗ Problematic Content ❗', 1: 'None', 2: 'Minimal', 3: 'Basic', 4: 'Good', 5: 'Excellent'}
```
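
As a minimal sketch of how this mapping might be applied, the snippet below turns a vector of six raw scores into a label name and a probability. It assumes the model emits one logit per class in index order 0 to 5, which the card does not state explicitly; `ID2LABEL` and `label_from_logits` are hypothetical names introduced for illustration. Note that label 1 is the string `'None'` (no educational value), not a missing value.

```python
import math

# Label mapping copied from the card above.
ID2LABEL = {
    0: "❗ Problematic Content ❗",
    1: "None",
    2: "Minimal",
    3: "Basic",
    4: "Good",
    5: "Excellent",
}


def label_from_logits(logits):
    """Return (label, probability) for the highest-scoring class.

    Assumes `logits` holds one raw score per class, ordered by label index.
    """
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")

    # Numerically stable softmax: subtract the maximum before exponentiating.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Index of the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]
```

For example, `label_from_logits([0.1, 0.2, 0.0, 0.3, 2.5, 0.4])` selects index 4 and returns the label `'Good'` together with its softmax probability.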

## Classification Report

Evaluated on the development set of [tartuNLP/fineweb-c-combined-resample](https://huggingface.co/datasets/tartuNLP/fineweb-c-combined-resample),
which is organized so that each language appears at least once. Rows 0 to 5 correspond to the label indices above.

```
              precision    recall  f1-score   support

           0       0.89      0.78      0.83       602
           1       0.65      0.88      0.75       916
           2       0.41      0.29      0.34       345
           3       0.40      0.30      0.34       179
           4       0.53      0.15      0.23       127
           5       0.55      0.39      0.45        44

    accuracy                           0.66      2213
   macro avg       0.57      0.46      0.49      2213
weighted avg       0.65      0.66      0.64      2213
```

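## Confusion Matrix

Rows are true labels and columns are predicted labels, both in label order 0 to 5. Each row sums to the support of that class in the classification report.
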
```
[[471 114  10   6   0   1]
 [ 33 806  59  13   5   0]
 [ 10 204 101  28   2   0]
 [  7  72  37  53   8   2]
 [  7  35  27  28  19  11]
 [  2   7  10   6   2  17]]
```