Automatic Speech Recognition for Swahili

Model Description 🥥

This model is a fine-tuned version of Wav2Vec2-BERT 2.0 for Swahili automatic speech recognition (ASR). It was trained on 400+ hours of high-quality, human-transcribed speech covering the Health, Government, Finance, Education, and Agriculture domains. The model is robust on in-domain speech, with a word error rate (WER) below 8.8%.

Developed by: Badr al-Absi
Model type: Speech Recognition (ASR)
Language: Swahili (sw)
License: CC-BY-4.0
Finetuned from: facebook/w2v-bert-2.0

Examples 🚀

	Human Transcription	ASR Transcription
1	Katika soko kuna duka la matunda na mboga likiwa na bidhaa zilizopangwa vizuri. Vitu mbalimbali kama karoti, mahindi, nyanya na viazi vimepangwa kwenye meza.	katika soko kuna duka la matunda na mboga likiwa na bidhaa zilizopangwa vizuri vitu mbalimbali kama karoti mahindi nyanya na viazi vimepangwa kwenye meza
2	Maji ni uhai.Mifereji ya kuwezesha wananchi kuchota maji ya kunywa na pia ya matumizi nyumbani.Ni vyema kuhifadhi maji kwa njia inayofaa.	maji ni uhai mifereji ya kuwezesha wananchi kuchota maji ya kunywa na pia ya matumizi nyumbani ni vyema kuhifadhi maji kwa njia inayofaa
3	Barabara yenye shughuli nyingi ambapo watu wengi wanasubiri au wanapanda mabasi mawili ya abiria. Mabasi yote mawili yana muundo wa matatu yakionyesha sanaa na rangi mbali mbali na yametayarishwa kubeba abiria.	barabara yenye shughuli nyingi ambapo watu wengi wanasubiri au wanapanda mabasi mawili ya abiria mabasi yote mawili yana muundo wa matatu yakionyesha sanaa na rangi mbalimbali na yametayarishwa kubeba abiria
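
The examples above differ from the human transcriptions mainly in casing and punctuation, so a WER comparison usually normalizes both sides first. The following is a minimal sketch of such a comparison, not the evaluation script used for this model: the normalization rules and the names `normalize` and `word_error_rate` are assumptions made for illustration.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    # Replacing (rather than deleting) punctuation keeps "uhai.Mifereji" as two words.
    # Apostrophes are kept because they occur inside Swahili words (e.g. ng'ombe).
    cleaned = re.sub(r"[^\w\s']", " ", text.lower())
    return cleaned.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Example 2 from the table above: identical after normalization
reference = ("Maji ni uhai.Mifereji ya kuwezesha wananchi kuchota maji ya kunywa "
             "na pia ya matumizi nyumbani.Ni vyema kuhifadhi maji kwa njia inayofaa.")
hypothesis = ("maji ni uhai mifereji ya kuwezesha wananchi kuchota maji ya kunywa "
              "na pia ya matumizi nyumbani ni vyema kuhifadhi maji kwa njia inayofaa")
print(word_error_rate(reference, hypothesis))  # 0.0
```

Under this scheme, a spelling variant such as "mbali mbali" versus "mbalimbali" (example 3) still counts as errors, so reported WER depends on the normalization chosen.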

Direct Use ℹ️

The model can be used directly for automatic speech recognition of Swahili audio as follows:

from transformers import Wav2Vec2BertProcessor, Wav2Vec2BertForCTC
import torch
import torchaudio

# load model and processor
processor = Wav2Vec2BertProcessor.from_pretrained("badrex/w2v-bert-2.0-swahili-asr")
model = Wav2Vec2BertForCTC.from_pretrained("badrex/w2v-bert-2.0-swahili-asr")

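# load audio; torchaudio returns a (channels, samples) tensor
audio_input, sample_rate = torchaudio.load("path/to/audio.wav")

# downmix to mono and resample to the 16 kHz rate the processor expects
audio_input = audio_input.mean(dim=0)
if sample_rate != 16000:
    audio_input = torchaudio.functional.resample(audio_input, sample_rate, 16000)
    sample_rate = 16000

# preprocess
inputs = processor(audio_input.numpy(), sampling_rate=sample_rate, return_tensors="pt")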

# inference
with torch.no_grad():
    logits = model(**inputs).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)

Downstream Use

This model can be used as a foundation for:

building voice assistants for Swahili speakers
transcription services for Swahili content
accessibility tools for Swahili-speaking communities
research in low-resource speech recognition

Out-of-Scope Use

This model is not intended for:

transcribing languages other than Swahili
real-time applications without proper latency testing
high-stakes applications without domain-specific validation
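
One common way to approach latency testing is the real-time factor (RTF): processing time divided by audio duration, where values below 1 mean faster than real time. The sketch below is illustrative only. The helper `real_time_factor` and the `transcribe` callable are hypothetical names, and a stand-in function is timed here instead of the actual model.

```python
import statistics
import time


def real_time_factor(transcribe, audio, audio_seconds: float, runs: int = 5) -> float:
    """Median wall-clock time of transcribe(audio) divided by the audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings) / audio_seconds


# Stand-in for a real transcription call (hypothetical): sleeps for 10 ms
def fake_transcribe(audio):
    time.sleep(0.01)
    return "maji ni uhai"


rtf = real_time_factor(fake_transcribe, audio=None, audio_seconds=10.0)
print(f"RTF: {rtf:.4f}")
```

In practice `transcribe` would wrap the preprocessing, forward pass, and decoding steps, and the measurement would be repeated on the target hardware with audio of realistic length.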

Bias, Risks, and Limitations

Domain bias: primarily trained on formal speech from specific domains (Health, Government, Finance, Education, Agriculture)
Accent variation: may not perform well on dialects or accents not represented in training data
Audio quality: performance may degrade on noisy or low-quality audio
Technical terms: may struggle with specialized vocabulary outside training domains

Training Data

Size: 400+ hours of transcribed Swahili speech
Domains: Health, Government, Finance, Education, Agriculture
Source: Digital Umuganda (funded by the Gates Foundation)
License: CC-BY-4.0

Model Architecture

Base model: Wav2Vec2-BERT 2.0
Architecture: Conformer-based encoder (transformer blocks with convolution modules) operating on log-mel filterbank features
Parameters: ~600M (inherited from base model)
Objective: connectionist temporal classification (CTC)
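
With a CTC objective, the model emits one label per audio frame, and the transcription is obtained by collapsing repeated labels and dropping the blank symbol. The following is a minimal sketch of that greedy decoding step in plain Python. The toy vocabulary, the blank ID of 0, and the `|` word delimiter are assumptions chosen for illustration and are not taken from this model's tokenizer.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame labels, drop blanks, and join tokens into text."""
    tokens = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when the label changes and is not the blank
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary (not the model's real one)
vocab = {0: "<pad>", 1: "|", 2: "m", 3: "a", 4: "j", 5: "i"}

# Per-frame argmax labels, as torch.argmax(logits, dim=-1) would produce
frames = [2, 2, 0, 3, 3, 0, 4, 4, 4, 5, 0, 1, 1, 0, 0]
print(ctc_greedy_decode(frames, vocab))  # maji
```

The blank between two identical labels is what allows doubled letters: `[3, 0, 3]` decodes to "aa", while `[3, 3]` decodes to "a".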

Funding

The development of this model was supported by CLEAR Global and the Gates Foundation.

Citation

@misc{w2v_bert_swahili_asr,
  author = {Badr M. Abdullah},
  title = {Adapting Wav2Vec2-BERT 2.0 for Swahili ASR},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/badrex/w2v-bert-2.0-swahili-asr}
}

Model Card Contact

For questions or issues, please open a discussion in the community section of the Hugging Face model repository.
