---
library_name: transformers
license: mit
datasets:
- data-is-better-together/fineweb-c
- HuggingFaceFW/fineweb-edu-llama3-annotations
- tartuNLP/fineweb-c-combined-resample
base_model:
- jhu-clsp/mmBERT-small
language:
- ekk
- eng
- lvs
- kor
- kin
- ita
- hin
- gom
- glg
- fra
- fas
- eus
- deu
- dan
- cat
- bho
- bak
- arz
- ary
- arb
- yor
- vie
- ukr
- tur
- tir
- tel
- tam
- swe
- spa
- slk
- rus
- por
- pbt
- nld
- nds
- vls
- srp
- sco
- lat
- ind
- gmh
- bul
- bre
- bar
- apc
- aeb
- ron
- mar
- kas
- crh
- zsm
- yue
- uzn
- udm
- tha
- tat
- som
- sin
- pol
- pcm
- npi
- nan
- mal
- lug
- lit
- lez
- asm
- fil
- cmn
- ces
- ben
- ast
- ars
- jpn
- guj
- gsw
- fin
- tok
- quz
- pfl
- pdc
- nob
- lij
- hun
- hsb
- goh
- fao
pipeline_tag: text-classification
---
# Multilingual Educational Content Classifier
This model was trained on full documents of up to 8192 tokens each, using the train split of [tartuNLP/fineweb-c-combined-resample](https://huggingface.co/datasets/tartuNLP/fineweb-c-combined-resample).
That dataset is a mix and resample of [HuggingFaceFW/fineweb-edu-llama3-annotations](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations) and
[data-is-better-together/fineweb-c](https://huggingface.co/datasets/data-is-better-together/fineweb-c).
## Labels
```
{0: '❗ Problematic Content ❗', 1: 'None', 2: 'Minimal', 3: 'Basic', 4: 'Good', 5: 'Excellent'}
```
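A minimal usage sketch follows, assuming the checkpoint exposes a standard six-way sequence-classification head whose class indices match the mapping above. The `model_id` argument is a placeholder for this model's Hub repository id, and the helper names `label_from_logits` and `classify` are illustrative, not part of any released API.

```python
# Label mapping copied from the model card.
ID2LABEL = {
    0: "❗ Problematic Content ❗",
    1: "None",
    2: "Minimal",
    3: "Basic",
    4: "Good",
    5: "Excellent",
}


def label_from_logits(logits):
    """Return the label name for the highest-scoring class index."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return ID2LABEL[best]


def classify(texts, model_id):
    """Classify a list of texts with a sequence-classification checkpoint.

    `model_id` is a placeholder: pass this model's Hub repository id.
    Imports are local so the mapping above works without transformers.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    # 8192 matches the maximum document length stated for training.
    batch = tokenizer(
        texts,
        truncation=True,
        max_length=8192,
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**batch).logits
    return [label_from_logits(row.tolist()) for row in logits]
```

Taking the argmax over logits is an assumption about how predictions were produced for the report; if the checkpoint's own `id2label` configuration differs from the mapping above, prefer the configuration.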
## Classification Report
The model was evaluated on the development set of [tartuNLP/fineweb-c-combined-resample](https://huggingface.co/datasets/tartuNLP/fineweb-c-combined-resample),
organized so that each language appears at least once.
```
              precision    recall  f1-score   support

           0       0.89      0.78      0.83       602
           1       0.65      0.88      0.75       916
           2       0.41      0.29      0.34       345
           3       0.40      0.30      0.34       179
           4       0.53      0.15      0.23       127
           5       0.55      0.39      0.45        44

    accuracy                           0.66      2213
   macro avg       0.57      0.46      0.49      2213
weighted avg       0.65      0.66      0.64      2213
```
## Confusion Matrix
```
[[471 114  10   6   0   1]
 [ 33 806  59  13   5   0]
 [ 10 204 101  28   2   0]
 [  7  72  37  53   8   2]
 [  7  35  27  28  19  11]
 [  2   7  10   6   2  17]]
```
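The per-class figures in the classification report can be recomputed from this matrix. The sketch below assumes rows are true labels and columns are predicted labels, which is consistent with the row sums matching the reported support values; the function names are illustrative.

```python
# Confusion matrix copied from the model card
# (assumed layout: rows are true labels, columns are predicted labels).
CONFUSION = [
    [471, 114, 10, 6, 0, 1],
    [33, 806, 59, 13, 5, 0],
    [10, 204, 101, 28, 2, 0],
    [7, 72, 37, 53, 8, 2],
    [7, 35, 27, 28, 19, 11],
    [2, 7, 10, 6, 2, 17],
]


def per_class_metrics(cm):
    """Return (precision, recall, f1, support) for each class."""
    n = len(cm)
    rows = []
    for k in range(n):
        true_positive = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        support = sum(cm[k])                         # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1, support))
    return rows


def accuracy(cm):
    """Fraction of all examples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


if __name__ == "__main__":
    for label, (p, r, f1, support) in enumerate(per_class_metrics(CONFUSION)):
        print(f"{label}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")
    print(f"accuracy  {accuracy(CONFUSION):.2f}")
```

Read this way, the largest off-diagonal counts sit in column 1: 204 of the 345 documents labelled 2 and 72 of the 179 labelled 3 are predicted as class 1, which accounts for the low recall on those classes.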