🏆 Catalan TTS Arena: Benchmarking TTS Models
The Catalan TTS Leaderboard evaluates and ranks text-to-speech (TTS) models for Catalan.
The leaderboard currently focuses on Catalan TTS, and will be expanded to multilingual evaluation in later versions.
| Rank | Model | UTMOS ⬆️ | WER ⬇️ | PESQ ⬆️ | SpeechBERT ⬆️ | Logf0 ⬆️ |
|---|---|---|---|---|---|---|
| 1 | Matxa-TTS-multiaccent | 3.61 | 0.179 | 3.539 | 0.7837 | 0.3831 |
Metrics
Models in the leaderboard are evaluated using several key metrics:
- UTMOS (UTokyo-SaruLab Mean Opinion Score),
- WER (Word Error Rate),
- PESQ (Perceptual Evaluation of Speech Quality).

Together, these metrics assess both the accuracy and the quality of each model's output.
UTMOS (UTokyo-SaruLab Mean Opinion Score)[Paper]
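UTMOS is an automatic MOS prediction system: it estimates the mean opinion score that human listeners would assign to the naturalness of a speech sample. A higher UTMOS indicates better quality of the generated voice.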
WER (Word Error Rate)
WER is a common metric for evaluating speech recognition systems. For TTS, the synthesized audio is transcribed by a speech recognition model, and WER counts the word-level errors in that transcript relative to the reference (correct) transcript, normalized by the length of the reference. A lower WER value indicates higher accuracy. Example:
| Reference | the | cat | sat | on | the | mat |
|---|---|---|---|---|---|---|
| Prediction | the | cat | sit | on | the | |
| Label | ✅ | ✅ | S | ✅ | ✅ | D |
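The WER is calculated as follows, where S, I, and D are the numbers of substituted, inserted, and deleted words, and N is the number of words in the reference:
WER = (S + I + D) / N = (1 + 0 + 1) / 6 ≈ 0.333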
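
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual evaluation code: the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization (casing, punctuation), which a real pipeline would likely apply first.

```python
def word_error_rate(reference: str, prediction: str):
    """Return (S, I, D, N, WER) for a reference/prediction pair.

    Hypothetical helper for illustration: words are split on whitespace
    and aligned with a standard edit-distance dynamic program.
    """
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = (cost, S, I, D) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, 0, i)  # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, j, 0)  # only insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            c, s, ins, d = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (c, s, ins, d)          # correct word
            else:
                diagonal = (c + 1, s + 1, ins, d)  # substitution
            c, s, ins, d = dp[i - 1][j]
            deletion = (c + 1, s, ins, d + 1)
            c, s, ins, d = dp[i][j - 1]
            insertion = (c + 1, s, ins + 1, d)
            # Keep the cheapest alignment; ties prefer the diagonal move.
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])

    _, s, ins, d = dp[len(ref)][len(hyp)]
    n = len(ref)
    return s, ins, d, n, (s + ins + d) / n


s, i, d, n, wer = word_error_rate("the cat sat on the mat", "the cat sit on the")
print(f"S={s} I={i} D={d} N={n} WER={wer:.3f}")
```

On the example from the table, this finds one substitution ("sat" → "sit") and one deletion ("mat") over six reference words.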
We transcribe the generated audio and calculate the WER using the STT_Ca_Citrinet_512 model. [Link]
PESQ (Perceptual Evaluation of Speech Quality)[Paper]
PESQ is a perceptual metric that scores speech quality the way a human listener would perceive it, by comparing the generated audio with a reference recording. A higher PESQ indicates better voice quality.
Benchmark Datasets
Model performance is evaluated using our test datasets. These datasets cover a variety of domains and acoustic conditions, ensuring a robust evaluation.
Last updated on 06/12/2024