---
library_name: transformers
license: cc-by-4.0
datasets:
- badrex/swahili-speech-400hr
language:
- sw
metrics:
- wer
- cer
base_model:
- facebook/w2v-bert-2.0
pipeline_tag: automatic-speech-recognition
---


<div align="center" style="line-height: 1;">
  <h1>Automatic Speech Recognition for Swahili</h1>
  <a href="https://huggingface.co/datasets/badrex/swahili-speech-400hr" target="_blank" style="margin: 2px;">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-ffc107?color=ffca28&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://huggingface.co/spaces/badrex/Swahili-ASR" target="_blank" style="margin: 2px;">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Space-ffc107?color=c62828&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://creativecommons.org/licenses/by/4.0/deed.en" style="margin: 2px;">
    <img alt="License" src="https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

### Model Description 🥥

This model is a fine-tuned version of Wav2Vec2-BERT 2.0 for Swahili automatic speech recognition (ASR). It was trained on 400+ hours of high-quality, human-transcribed speech covering the Health, Government, Finance, Education, and Agriculture domains. On in-domain data, its word error rate (WER) is below 8.8%.

- **Developed by:** Badr al-Absi
- **Model type:** Speech Recognition (ASR)
- **Language:** Swahili (sw)
- **License:** CC-BY-4.0
- **Finetuned from:** facebook/w2v-bert-2.0


### Examples 🚀
|  | Audio | Human Transcription |  ASR Transcription | 
|----------|--------|----------------|----------------|
| 1 | <audio controls src="https://huggingface.co/badrex/w2v-bert-2.0-swahili-asr/resolve/main/examples/example_1.wav"></audio> | Katika soko kuna duka la matunda na mboga likiwa na bidhaa zilizopangwa vizuri. Vitu mbalimbali kama karoti, mahindi, nyanya na viazi vimepangwa kwenye meza. | katika soko kuna duka la matunda na mboga likiwa na bidhaa zilizopangwa vizuri vitu mbalimbali kama karoti mahindi nyanya na viazi vimepangwa kwenye meza |
| 2 | <audio controls src="https://huggingface.co/badrex/w2v-bert-2.0-swahili-asr/resolve/main/examples/example_2.wav"></audio> | Maji ni uhai.Mifereji ya kuwezesha wananchi kuchota maji ya kunywa na pia ya matumizi nyumbani.Ni vyema kuhifadhi maji kwa njia inayofaa. | maji ni uhai mifereji ya kuwezesha wananchi kuchota maji ya kunywa na pia ya matumizi nyumbani ni vyema kuhifadhi maji kwa njia inayofaa |
| 3 | <audio controls src="https://huggingface.co/badrex/w2v-bert-2.0-swahili-asr/resolve/main/examples/example_3.wav"></audio>  | Barabara yenye shughuli nyingi ambapo watu wengi wanasubiri au wanapanda mabasi mawili ya abiria. Mabasi yote mawili yana muundo wa matatu yakionyesha sanaa na rangi mbali mbali na yametayarishwa kubeba abiria. | barabara yenye shughuli nyingi ambapo watu wengi wanasubiri au wanapanda mabasi mawili ya abiria mabasi yote mawili yana muundo wa matatu yakionyesha sanaa na rangi mbalimbali na yametayarishwa kubeba abiria | 
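
In the examples above, the human transcriptions keep casing and punctuation while the ASR output is lowercase and unpunctuated, so the two need to be normalized before they are scored. The sketch below shows one way to do that and to compute WER as a word-level edit distance. The normalization rules and the function names `normalize` and `word_error_rate` are assumptions made for illustration; they are not necessarily the procedure used to obtain the WER reported for this model.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    text = text.lower()
    # Keep letters, digits, apostrophes, and whitespace; blank out the rest.
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the previous reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Fragment of example 3: the reference writes "mbali mbali", the ASR output "mbalimbali".
reference = "Sanaa na rangi mbali mbali."
hypothesis = "sanaa na rangi mbalimbali"
print(word_error_rate(reference, hypothesis))  # 0.4 (one substitution + one deletion over 5 words)
```

A spelling variant such as "mbali mbali" versus "mbalimbali" therefore counts as two word errors even though the audio was transcribed faithfully, which is worth keeping in mind when reading WER figures.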



### Direct Use ℹ️

The model can be used directly for automatic speech recognition of Swahili audio as follows:

```python
from transformers import Wav2Vec2BertProcessor, Wav2Vec2BertForCTC
import torch
import torchaudio

# load model and processor
processor = Wav2Vec2BertProcessor.from_pretrained("badrex/w2v-bert-2.0-swahili-asr")
model = Wav2Vec2BertForCTC.from_pretrained("badrex/w2v-bert-2.0-swahili-asr")

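# load audio; torchaudio returns a (channels, samples) tensor
audio_input, sample_rate = torchaudio.load("path/to/audio.wav")

# downmix to mono and resample to 16 kHz, the rate the feature extractor expects
audio_input = audio_input.mean(dim=0)
if sample_rate != 16000:
    audio_input = torchaudio.functional.resample(audio_input, sample_rate, 16000)
    sample_rate = 16000

# preprocess
inputs = processor(audio_input.numpy(), sampling_rate=sample_rate, return_tensors="pt")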

# inference
with torch.no_grad():
    logits = model(**inputs).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```


### Downstream Use

This model can be used as a foundation for:
- building voice assistants for Swahili speakers
- transcription services for Swahili content
- accessibility tools for Swahili-speaking communities
- research in low-resource speech recognition

### Out-of-Scope Use

- transcribing languages other than Swahili
- real-time applications without proper latency testing
- high-stakes applications without domain-specific validation
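
For the latency point, one common quantity to measure is the real-time factor: processing time divided by audio duration, where values well below 1 leave headroom for streaming. The following is a minimal sketch of such a measurement. The helper `real_time_factor` and the stand-in `fake_transcribe` are hypothetical names introduced here; in practice the callable would wrap an actual call to the model on a clip of known length, on the target hardware.

```python
import time
from typing import Callable


def real_time_factor(transcribe: Callable[[], object],
                     audio_seconds: float,
                     runs: int = 5) -> float:
    """Middle wall-clock time over several runs, divided by the audio duration."""
    if audio_seconds <= 0 or runs < 1:
        raise ValueError("audio_seconds must be positive and runs at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe()
        timings.append(time.perf_counter() - start)
    timings.sort()
    # Upper-middle element; equals the median when `runs` is odd.
    middle = timings[len(timings) // 2]
    return middle / audio_seconds


# Stand-in for a real transcription call on a 10-second clip (hypothetical).
def fake_transcribe() -> str:
    time.sleep(0.01)
    return "maji ni uhai"


rtf = real_time_factor(fake_transcribe, audio_seconds=10.0)
print(f"real-time factor: {rtf:.4f}")
```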

### Bias, Risks, and Limitations

- **Domain bias:** primarily trained on formal speech from specific domains (Health, Government, Finance, Education, Agriculture)
- **Accent variation:** may not perform well on dialects or accents not represented in training data
- **Audio quality:** performance may degrade on noisy or low-quality audio
- **Technical terms:** may struggle with specialized vocabulary outside training domains


### Training Data

- **Size:** 400+ hours of transcribed Swahili speech
- **Domains:** Health, Government, Finance, Education, Agriculture
- **Source:** Digital Umuganda (funded by the Gates Foundation)
- **License:** CC-BY-4.0




### Model Architecture

- **Base model:** Wav2Vec2-BERT 2.0
- **Architecture:** Conformer encoder (transformer blocks with convolution modules) operating on log-mel spectrogram features
- **Parameters:** ~600M (inherited from base model)
- **Objective:** connectionist temporal classification (CTC)
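
With a CTC objective, the model emits one label per audio frame, and greedy decoding turns that sequence into text by collapsing consecutive repeats and then dropping the blank symbol. The sketch below illustrates the rule on a toy example. The vocabulary, the blank index, and the function name `ctc_greedy_decode` are hypothetical; the model's own tokenizer applies its real vocabulary when the processor decodes the predicted ids.

```python
def ctc_greedy_decode(frame_ids: list[int],
                      id_to_token: dict[int, str],
                      blank_id: int = 0) -> str:
    """Collapse repeated frame predictions, then drop blank tokens."""
    tokens = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens)


# Toy vocabulary and per-frame argmax ids (both made up for illustration).
vocab = {0: "<blank>", 1: "m", 2: "a", 3: "j", 4: "i"}
frames = [1, 1, 0, 2, 2, 2, 0, 3, 0, 4, 4]
print(ctc_greedy_decode(frames, vocab))  # maji
```

The blank symbol is what lets CTC produce genuine double letters: the frames `[2, 0, 2]` decode to "aa", whereas `[2, 2]` collapses to a single "a".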

### Funding
The development of this model was supported by [CLEAR Global](https://clearglobal.org/) and the [Gates Foundation](https://www.gatesfoundation.org/).


### Citation

```bibtex
@misc{w2v_bert_swahili_asr,
  author = {Badr M. Abdullah},
  title = {Adapting Wav2Vec2-BERT 2.0 for Swahili ASR},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/badrex/w2v-bert-2.0-swahili-asr}
}

```

### Model Card Contact

For questions or issues, please open a discussion in the community section of the Hugging Face model repository.