Upload 2 files

Files changed (3)

.gitattributes +1 -0
README.md +28 -16
asset/dingding.png +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+asset/dingding.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -10,22 +10,15 @@
 ## Highlight🔥
-**CosyVoice 2.0** has been released! Compared to version 1.0, the new version offers more accurate, more stable, faster, and better speech generation capabilities.
-### Multilingual
-- **Supported Language**: Chinese, English, Japanese, Korean, Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
-- **Crosslingual & Mixlingual**：Support zero-shot voice cloning for cross-lingual and code-switching scenarios.
-### Ultra-Low Latency
-- **Bidirectional Streaming Support**: CosyVoice 2.0 integrates offline and streaming modeling technologies.
-- **Rapid First Packet Synthesis**: Achieves latency as low as 150ms while maintaining high-quality audio output.
-### High Accuracy
-- **Improved Pronunciation**: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
-- **Benchmark Achievements**: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation set.
-### Strong Stability
-- **Consistency in Timbre**: Ensures reliable voice consistency for zero-shot and cross-language speech synthesis.
-- **Cross-language Synthesis**: Marked improvements compared to version 1.0.
-### Natural Experience
-- **Enhanced Prosody and Sound Quality**: Improved alignment of synthesized audio, raising MOS evaluation scores from 5.4 to 5.53.
-- **Emotional and Dialectal Flexibility**: Now supports more granular emotional controls and accent adjustments.
+**CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLMs), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
+### Key Features
+- **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents, and supports both multilingual and cross-lingual zero-shot voice cloning.
+- **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
+- **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, providing more controllability and making it suitable for production use.
+- **Text Normalization**: Supports reading of numbers, special symbols, and various text formats without a traditional frontend module.
+- **Bi-Streaming**: Supports both text-in streaming and audio-out streaming, achieving latency as low as 150ms while maintaining high-quality audio output.
+- **Instruct Support**: Supports various instructions such as languages, dialects, emotions, speed, volume, etc.
 ## Roadmap
@@ -66,6 +59,25 @@
     - [x] WeTextProcessing support when ttsfrd is not available
     - [x] Fastapi server and client
+## Evaluation
+| Model | CER (%) ↓ (test-zh) | WER (%) ↓ (test-en) | CER (%) ↓ (test-hard) |
+|-----|------------------|------------------|------------------|
+| Human | 1.26 | 2.14 | - |
+| F5-TTS | 1.53 | 2.00 | 8.67 |
+| SparkTTS | 1.20 | 1.98 | - |
+| Seed-TTS | 1.12 | 2.25 | 7.59 |
+| CosyVoice2 | 1.45 | 2.57 | 6.83 |
+| FireRedTTS-2 | 1.14 | 1.95 | - |
+| IndexTTS2 | 1.01 | 1.52 | 7.12 |
+| VibeVoice | 1.16 | 3.04 | - |
+| HiggsAudio | 1.79 | 2.44 | - |
+| MiniMax-Speech | 0.83 | 1.65 | - |
+| VoxPCM | 0.93 | 1.85 | 8.87 |
+| GLM-TTS | 1.03 | - | - |
+| GLM-TTS_RL | 0.89 | - | - |
+| CosyVoice3 | 1.21 | 2.24 | 6.71 |
+| CosyVoice3_RL | 0.81 | 1.68 | 5.44 |
 ## Install
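
The Evaluation table added to README.md reports character error rate (CER) and word error rate (WER), both as percentages where lower is better. The diff does not show how the numbers were produced (which ASR model transcribed the synthesized audio, or what text normalization was applied), so the following is only a minimal sketch of the generic definition: edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance`, `cer`, and `wer` are illustrative and not taken from the CosyVoice repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent: edit distance over reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: tokens are individual characters."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(round(wer("the cat sat", "the cat sit"), 2))
    # One substituted character out of four reference characters.
    print(cer("abcd", "abxd"))
```

CER is the usual choice for Chinese test sets such as test-zh and test-hard because the text has no whitespace word boundaries, while WER is used for test-en.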

asset/dingding.png CHANGED

Git LFS Details

SHA256: 7f04815e2e676d31b089af6fa270135f3214f2193d5e0ad98b491d007d48f1c6
Pointer size: 131 Bytes
Size of remote file: 123 kB
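
Because `asset/dingding.png` is now tracked through the LFS filter added in `.gitattributes`, the repository stores a small text pointer instead of the image itself. As a rough sketch, a pointer in the Git LFS v1 format can be rebuilt from the SHA256 above. The exact byte size of the remote file is not given here (only "123 kB"), so `PLACEHOLDER_SIZE` below is a hypothetical value, and `lfs_pointer` is an illustrative helper rather than part of any Git LFS tooling.

```python
def lfs_pointer(oid_sha256: str, size_bytes: int) -> str:
    """Build the text of a Git LFS v1 pointer file.

    The pointer has a version line followed by the remaining keys in
    alphabetical order (`oid`, then `size`), each ending with a newline.
    """
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid_sha256}\n"
        f"size {size_bytes}\n"
    )


# SHA256 as reported in the LFS details for asset/dingding.png.
OID = "7f04815e2e676d31b089af6fa270135f3214f2193d5e0ad98b491d007d48f1c6"

# Hypothetical size: the page only reports "123 kB", not the exact byte count.
PLACEHOLDER_SIZE = 123_000

pointer = lfs_pointer(OID, PLACEHOLDER_SIZE)

if __name__ == "__main__":
    print(pointer, end="")
    print(len(pointer.encode("utf-8")))
```

With any six-digit size, the pointer comes to 131 bytes (43 for the version line, 76 for the oid line, 12 for the size line), which matches the reported pointer size.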