Text-to-Speech
ONNX
Safetensors
aluminumbox committed
Commit b657dca (verified) · 1 Parent(s): f22b9d4

Upload 2 files

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +28 -16
  3. asset/dingding.png +0 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ asset/dingding.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -10,22 +10,15 @@
 
  ## Highlight🔥
 
- **CosyVoice 2.0** has been released! Compared to version 1.0, the new version offers more accurate, more stable, faster, and better speech generation capabilities.
- ### Multilingual
- - **Supported Language**: Chinese, English, Japanese, Korean, Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
- - **Crosslingual & Mixlingual**: Supports zero-shot voice cloning for cross-lingual and code-switching scenarios.
- ### Ultra-Low Latency
- - **Bidirectional Streaming Support**: CosyVoice 2.0 integrates offline and streaming modeling technologies.
- - **Rapid First Packet Synthesis**: Achieves latency as low as 150ms while maintaining high-quality audio output.
- ### High Accuracy
- - **Improved Pronunciation**: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
- - **Benchmark Achievements**: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation set.
- ### Strong Stability
- - **Consistency in Timbre**: Ensures reliable voice consistency for zero-shot and cross-language speech synthesis.
- - **Cross-language Synthesis**: Marked improvements compared to version 1.0.
- ### Natural Experience
- - **Enhanced Prosody and Sound Quality**: Improved alignment of synthesized audio, raising MOS evaluation scores from 5.4 to 5.53.
- - **Emotional and Dialectal Flexibility**: Now supports more granular emotional controls and accent adjustments.
+ **CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLMs), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
+ ### Key Features
+ - **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents, and supports both multilingual and cross-lingual zero-shot voice cloning.
+ - **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
+ - **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, providing more controllability and making it suitable for production use.
+ - **Text Normalization**: Supports reading of numbers, special symbols, and various text formats without a traditional frontend module.
+ - **Bi-Streaming**: Supports both text-in streaming and audio-out streaming, and achieves latency as low as 150ms while maintaining high-quality audio output.
+ - **Instruct Support**: Supports various instructions such as languages, dialects, emotions, speed, volume, etc.
+
 
  ## Roadmap
 
@@ -66,6 +59,25 @@
  - [x] WeTextProcessing support when ttsfrd is not available
  - [x] Fastapi server and client
 
+ ## Evaluation
+ | Model | CER (%) ↓ (test-zh) | WER (%) ↓ (test-en) | CER (%) ↓ (test-hard) |
+ |-----|------------------|------------------|------------------|
+ | Human | 1.26 | 2.14 | - |
+ | F5-TTS | 1.53 | 2.00 | 8.67 |
+ | SparkTTS | 1.20 | 1.98 | - |
+ | Seed-TTS | 1.12 | 2.25 | 7.59 |
+ | CosyVoice2 | 1.45 | 2.57 | 6.83 |
+ | FireRedTTS-2 | 1.14 | 1.95 | - |
+ | IndexTTS2 | 1.01 | 1.52 | 7.12 |
+ | VibeVoice | 1.16 | 3.04 | - |
+ | HiggsAudio | 1.79 | 2.44 | - |
+ | MiniMax-Speech | 0.83 | 1.65 | - |
+ | VoxPCM | 0.93 | 1.85 | 8.87 |
+ | GLM-TTS | 1.03 | - | - |
+ | GLM-TTS_RL | 0.89 | - | - |
+ | CosyVoice3 | 1.21 | 2.24 | 6.71 |
+ | CosyVoice3_RL | 0.81 | 1.68 | 5.44 |
+
 
  ## Install
 
asset/dingding.png CHANGED

Git LFS Details

  • SHA256: 7f04815e2e676d31b089af6fa270135f3214f2193d5e0ad98b491d007d48f1c6
  • Pointer size: 131 Bytes
  • Size of remote file: 123 kB
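
Note on the Evaluation table added to README.md: the columns report character error rate (CER) and word error rate (WER) in percent, where lower is better. The commit does not describe the evaluation pipeline, so the following is only a minimal sketch of how such rates are commonly computed: the edit distance between a reference text and a transcript of the synthesized audio, divided by the reference length. The function names `edit_distance`, `error_rate`, `cer`, and `wer` are hypothetical, and the normalization (stripping spaces, lowercasing) is an assumption, not the authors' procedure.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference token
                            curr[j - 1] + 1,      # insert a hypothesis token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference, hypothesis):
    """Character error rate; spaces are dropped (assumed normalization)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


def wer(reference, hypothesis):
    """Word error rate on lowercased, whitespace-split words (assumed normalization)."""
    return error_rate(reference.lower().split(), hypothesis.lower().split())
```

Under this sketch, a four-character reference with one wrong character in the transcript gives a CER of 25%. The figures in the table may depend on the transcription model and text normalization used, which this example does not reproduce.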