Text-to-Speech
ONNX
Safetensors
aluminumbox committed · verified · Commit 7067950 · 1 Parent(s): 653e0bd

Upload 2 files

Files changed (2):
  1. README.md +237 -3
  2. cosyvoice3.yaml +223 -0
README.md CHANGED
@@ -1,3 +1,237 @@
- ---
- license: apache-2.0
- ---
+ [![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)](https://github.com/Akshay090/svg-banners)
+
+ ## 👉🏻 CosyVoice 👈🏻
+
+ **CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/abs/2505.17589); [Modelscope](https://www.modelscope.cn/studios/FunAudioLLM/Fun-CosyVoice3-0.5B); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)
+
+ **CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/spaces/FunAudioLLM/CosyVoice2-0.5B)
+
+ **CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice-300M)
+
+ ## Highlight🔥
+
+ **CosyVoice 2.0** has been released! Compared to version 1.0, the new version offers more accurate, more stable, and faster speech generation with better quality.
+ ### Multilingual
+ - **Supported Languages**: Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
+ - **Crosslingual & Mixlingual**: Supports zero-shot voice cloning for cross-lingual and code-switching scenarios.
+ ### Ultra-Low Latency
+ - **Bidirectional Streaming Support**: CosyVoice 2.0 integrates offline and streaming modeling technologies.
+ - **Rapid First Packet Synthesis**: Achieves latency as low as 150ms while maintaining high-quality audio output.
+ ### High Accuracy
+ - **Improved Pronunciation**: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
+ - **Benchmark Achievements**: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation set.
+ ### Strong Stability
+ - **Consistency in Timbre**: Ensures reliable voice consistency in zero-shot and cross-lingual speech synthesis.
+ - **Cross-lingual Synthesis**: Marked improvements compared to version 1.0.
+ ### Natural Experience
+ - **Enhanced Prosody and Sound Quality**: Improved alignment of synthesized audio, raising MOS evaluation scores from 5.4 to 5.53.
+ - **Emotional and Dialectal Flexibility**: Now supports more granular emotional controls and accent adjustments.
+
+ ## Roadmap
+
+ - [x] 2025/12
+
+     - [x] Release CosyVoice3-0.5B base model and its training/inference scripts
+     - [x] Release CosyVoice3-0.5B ModelScope Gradio space
+
+ - [x] 2025/08
+
+     - [x] Thanks to a contribution from NVIDIA's Yuekai Zhang, add Triton TensorRT-LLM runtime support and CosyVoice2 GRPO training support
+
+ - [x] 2025/07
+
+     - [x] Release the CosyVoice 3.0 eval set
+
+ - [x] 2025/05
+
+     - [x] Add CosyVoice2-0.5B vllm support
+
+ - [x] 2024/12
+
+     - [x] 25Hz CosyVoice2-0.5B released
+
+ - [x] 2024/09
+
+     - [x] 25Hz CosyVoice-300M base model
+     - [x] 25Hz CosyVoice-300M voice conversion function
+
+ - [x] 2024/08
+
+     - [x] Repetition Aware Sampling (RAS) inference for LLM stability
+     - [x] Streaming inference mode support, including KV cache and SDPA for RTF optimization
+
+ - [x] 2024/07
+
+     - [x] Flow matching training support
+     - [x] WeTextProcessing support when ttsfrd is not available
+     - [x] FastAPI server and client
+
+ ## Install
+
+ ### Clone and install
+
+ - Clone the repo
+ ``` sh
+ git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
+ # If the submodule clone fails due to network problems, run the following command until it succeeds
+ cd CosyVoice
+ git submodule update --init --recursive
+ ```
+
+ - Install Conda: see https://docs.conda.io/en/latest/miniconda.html
+ - Create a Conda env:
+
+ ``` sh
+ conda create -n cosyvoice -y python=3.10
+ conda activate cosyvoice
+ pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
+
+ # If you encounter sox compatibility issues
+ # ubuntu
+ sudo apt-get install sox libsox-dev
+ # centos
+ sudo yum install sox sox-devel
+ ```
+
+ ### Model download
+
+ We strongly recommend downloading our pretrained `Fun-CosyVoice3-0.5B`, `CosyVoice2-0.5B`, `CosyVoice-300M`, `CosyVoice-300M-SFT`, and `CosyVoice-300M-Instruct` models and the `CosyVoice-ttsfrd` resource.
+
+ ``` python
+ # Download models via the ModelScope SDK
+ from modelscope import snapshot_download
+ snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
+ snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
+ snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
+ snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
+ snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
+ snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
+ ```
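+
+ If you prefer downloading from Hugging Face instead of ModelScope, a minimal sketch with `huggingface_hub` follows (this repo confirms `FunAudioLLM/Fun-CosyVoice3-0.5B` exists on Hugging Face; any other mirror id would be an assumption to verify first):
+
+ ``` python
+ # Hugging Face download (sketch; verify each repo id before relying on it)
+ from huggingface_hub import snapshot_download
+ snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
+ ```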
+
+ Optionally, you can unzip the `ttsfrd` resource and install the `ttsfrd` package for better text normalization performance.
+
+ Note that this step is not necessary; if the `ttsfrd` package is not installed, we use wetext by default.
+
+ ``` sh
+ cd pretrained_models/CosyVoice-ttsfrd/
+ unzip resource.zip -d .
+ pip install ttsfrd_dependency-0.1-py3-none-any.whl
+ pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
+ ```
+
+ ### Basic Usage
+
+ We strongly recommend using `CosyVoice3-0.5B` for better performance.
+ Follow the code in `example.py` for detailed usage of each model.
+ ```sh
+ python example.py
+ ```
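+
+ For orientation, here is a minimal zero-shot voice-cloning sketch in the style of the CosyVoice2 Python API (the prompt transcript and asset path are placeholders, and CosyVoice3 class names may differ; treat `example.py` as authoritative):
+
+ ``` python
+ import torchaudio
+ from cosyvoice.cli.cosyvoice import CosyVoice2
+ from cosyvoice.utils.file_utils import load_wav
+
+ cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)
+ # 16 kHz recording of the voice to clone (placeholder path)
+ prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
+ for i, out in enumerate(cosyvoice.inference_zero_shot(
+         'Hello, this is a zero-shot voice cloning test.',  # text to synthesize
+         'This is the transcript of the prompt audio.',     # transcript of the prompt recording
+         prompt_speech_16k, stream=False)):
+     torchaudio.save(f'zero_shot_{i}.wav', out['tts_speech'], cosyvoice.sample_rate)
+ ```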
+
+ #### CosyVoice2 vllm Usage
+ If you want to use vllm for inference, please install `vllm==v0.9.0`; older vllm versions do not support CosyVoice2 inference.
+
+ Note that `vllm==v0.9.0` pins many specific requirements, for example `torch==2.7.0`. You can create a fresh env, so that if your hardware does not support vllm your existing env is not corrupted.
+
+ ``` sh
+ conda create -n cosyvoice_vllm --clone cosyvoice
+ conda activate cosyvoice_vllm
+ pip install vllm==v0.9.0 transformers==4.51.3 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
+ python vllm_example.py
+ ```
+
+ #### Start web demo
+
+ You can use our web demo page to get familiar with CosyVoice quickly.
+
+ Please see the demo website for details.
+
+ ``` sh
+ # use iic/CosyVoice-300M-SFT for sft inference, or iic/CosyVoice-300M-Instruct for instruct inference
+ python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
+ ```
+
+ #### Advanced Usage
+
+ For advanced users, we provide training and inference scripts in `examples/libritts/cosyvoice/run.sh`.
+
+ #### Build for deployment
+
+ Optionally, if you want to deploy CosyVoice as a service, you can run the following steps.
+
+ ``` sh
+ cd runtime/python
+ docker build -t cosyvoice:v1.0 .
+ # change iic/CosyVoice-300M to iic/CosyVoice-300M-Instruct if you want to use instruct inference
+ # for grpc usage
+ docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && python3 server.py --port 50000 --max_conc 4 --model_dir iic/CosyVoice-300M && sleep infinity"
+ cd grpc && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
+ # for fastapi usage
+ docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && python3 server.py --port 50000 --model_dir iic/CosyVoice-300M && sleep infinity"
+ cd fastapi && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
+ ```
+
+ #### Using NVIDIA TensorRT-LLM for deployment
+
+ Using TensorRT-LLM to accelerate the CosyVoice2 LLM can give roughly 4x acceleration compared with the huggingface transformers implementation.
+ To get started quickly:
+
+ ``` sh
+ cd runtime/triton_trtllm
+ docker compose up -d
+ ```
+ For more details, see [here](https://github.com/FunAudioLLM/CosyVoice/tree/main/runtime/triton_trtllm).
+
+ ## Discussion & Communication
+
+ You can discuss directly on [Github Issues](https://github.com/FunAudioLLM/CosyVoice/issues).
+
+ You can also scan the QR code to join our official DingTalk chat group.
+
+ <img src="./asset/dingding.png" width="250px">
+
+ ## Acknowledgements
+
+ 1. We borrowed a lot of code from [FunASR](https://github.com/modelscope/FunASR).
+ 2. We borrowed a lot of code from [FunCodec](https://github.com/modelscope/FunCodec).
+ 3. We borrowed a lot of code from [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).
+ 4. We borrowed a lot of code from [AcademiCodec](https://github.com/yangdongchao/AcademiCodec).
+ 5. We borrowed a lot of code from [WeNet](https://github.com/wenet-e2e/wenet).
+
+ ## Citations
+
+ ``` bibtex
+ @article{du2024cosyvoice,
+   title={CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
+   author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
+   journal={arXiv preprint arXiv:2407.05407},
+   year={2024}
+ }
+
+ @article{du2024cosyvoice2,
+   title={CosyVoice 2: Scalable streaming speech synthesis with large language models},
+   author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
+   journal={arXiv preprint arXiv:2412.10117},
+   year={2024}
+ }
+
+ @article{du2025cosyvoice,
+   title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
+   author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
+   journal={arXiv preprint arXiv:2505.17589},
+   year={2025}
+ }
+
+ @inproceedings{lyu2025build,
+   title={Build LLM-Based Zero-Shot Streaming TTS System with CosyVoice},
+   author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
+   booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+   pages={1--2},
+   year={2025},
+   organization={IEEE}
+ }
+ ```
+
+ ## Disclaimer
+ The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.

cosyvoice3.yaml ADDED
@@ -0,0 +1,223 @@
+ # set random seeds so that you can reproduce your results.
+ __set_seed1: !apply:random.seed [1986]
+ __set_seed2: !apply:numpy.random.seed [1986]
+ __set_seed3: !apply:torch.manual_seed [1986]
+ __set_seed4: !apply:torch.cuda.manual_seed_all [1986]
+
+ # fixed params
+ sample_rate: 24000
+ llm_input_size: 896
+ llm_output_size: 896
+ spk_embed_dim: 192
+ qwen_pretrain_path: ''
+ token_frame_rate: 25
+ token_mel_ratio: 2
+
+ # stream related params
+ chunk_size: 25 # streaming inference chunk size, in tokens
+ num_decoding_left_chunks: -1 # streaming inference flow decoder left chunk size, <0 means use all left chunks
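+ # editor's note on the arithmetic: at token_frame_rate 25 and token_mel_ratio 2, one
+ # 25-token chunk covers 1.0 s of audio (25 * 2 = 50 mel frames * hop 480 / 24000 Hz)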
+
+ # model params
+ # for all classes/functions included in this repo, we use !<name> or !<new> for initialization, so that users can find every corresponding class/function from this single yaml.
+ # for system/third_party classes/functions, we do not require this.
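+ # this file follows hyperpyyaml syntax; a minimal loading sketch (the override value is a
+ # placeholder, mirroring how cosyvoice configs are typically loaded):
+ #   from hyperpyyaml import load_hyperpyyaml
+ #   with open('cosyvoice3.yaml') as f:
+ #       configs = load_hyperpyyaml(f, overrides={'qwen_pretrain_path': '<path-to-qwen-weights>'})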
+ llm: !new:cosyvoice.llm.llm.CosyVoice3LM
+     llm_input_size: !ref <llm_input_size>
+     llm_output_size: !ref <llm_output_size>
+     speech_token_size: 6561
+     length_normalized_loss: True
+     lsm_weight: 0
+     mix_ratio: [5, 15]
+     llm: !new:cosyvoice.llm.llm.Qwen2Encoder
+         pretrain_path: !ref <qwen_pretrain_path>
+     sampling: !name:cosyvoice.utils.common.ras_sampling
+         top_p: 0.8
+         top_k: 25
+         win_size: 10
+         tau_r: 0.1
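+         # repetition aware sampling (RAS): within a sliding window of win_size recent tokens,
+         # fall back to random sampling when the repetition ratio exceeds tau_r (a paraphrase of
+         # the CosyVoice 2 paper; see cosyvoice.utils.common.ras_sampling for the exact behavior)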
+
+ flow: !new:cosyvoice.flow.flow.CausalMaskedDiffWithDiT
+     input_size: 80
+     output_size: 80
+     spk_embed_dim: !ref <spk_embed_dim>
+     output_type: 'mel'
+     vocab_size: 6561
+     input_frame_rate: !ref <token_frame_rate>
+     only_mask_loss: True
+     token_mel_ratio: !ref <token_mel_ratio>
+     pre_lookahead_len: 3
+     pre_lookahead_layer: !new:cosyvoice.transformer.upsample_encoder.PreLookaheadLayer
+         in_channels: 80
+         channels: 1024
+         pre_lookahead_len: 3
+     decoder: !new:cosyvoice.flow.flow_matching.CausalConditionalCFM
+         in_channels: 240
+         n_spks: 1
+         spk_emb_dim: 80
+         cfm_params: !new:omegaconf.DictConfig
+             content:
+                 sigma_min: 1e-06
+                 solver: 'euler'
+                 t_scheduler: 'cosine'
+                 training_cfg_rate: 0.2
+                 inference_cfg_rate: 0.7
+                 reg_loss_type: 'l1'
+         estimator: !new:cosyvoice.flow.DiT.dit.DiT
+             dim: 1024
+             depth: 22
+             heads: 16
+             dim_head: 64
+             ff_mult: 2
+             mel_dim: 80
+             mu_dim: 80
+             spk_dim: 80
+             out_channels: 80
+             static_chunk_size: !ref <chunk_size> * <token_mel_ratio>
+             num_decoding_left_chunks: !ref <num_decoding_left_chunks>
+
+ hift: !new:cosyvoice.hifigan.generator.CausalHiFTGenerator
+     in_channels: 80
+     base_channels: 512
+     nb_harmonics: 8
+     sampling_rate: !ref <sample_rate>
+     nsf_alpha: 0.1
+     nsf_sigma: 0.003
+     nsf_voiced_threshold: 10
+     upsample_rates: [8, 5, 3]
+     upsample_kernel_sizes: [16, 11, 7]
+     istft_params:
+         n_fft: 16
+         hop_len: 4
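+         # total upsampling: 8 * 5 * 3 (upsample_rates) * 4 (istft hop_len) = 480 samples per
+         # input mel frame, matching the 480-sample hop of the 24 kHz mel features below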
+     resblock_kernel_sizes: [3, 7, 11]
+     resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
+     source_resblock_kernel_sizes: [7, 7, 11]
+     source_resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
+     lrelu_slope: 0.1
+     audio_limit: 0.99
+     conv_pre_look_right: 4
+     f0_predictor: !new:cosyvoice.hifigan.f0_predictor.CausalConvRNNF0Predictor
+         num_class: 1
+         in_channels: 80
+         cond_channels: 512
+
+ # gan related module
+ mel_spec_transform1: !name:matcha.utils.audio.mel_spectrogram
+     n_fft: 1920
+     num_mels: 80
+     sampling_rate: !ref <sample_rate>
+     hop_size: 480
+     win_size: 1920
+     fmin: 0
+     fmax: null
+     center: False
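+     # hop_size 480 at sampling_rate 24000 gives 50 mel frames/s,
+     # i.e. token_frame_rate (25) * token_mel_ratio (2)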
+ hifigan: !new:cosyvoice.hifigan.hifigan.HiFiGan
+     generator: !ref <hift>
+     discriminator: !new:cosyvoice.hifigan.discriminator.MultipleDiscriminator
+         mpd: !new:matcha.hifigan.models.MultiPeriodDiscriminator
+         mrd: !new:cosyvoice.hifigan.discriminator.MultiResSpecDiscriminator
+     mel_spec_transform: [
+         !ref <mel_spec_transform1>
+     ]
+
+ # processor functions
+ parquet_opener: !name:cosyvoice.dataset.processor.parquet_opener
+ get_tokenizer: !name:cosyvoice.tokenizer.tokenizer.get_qwen_tokenizer
+     token_path: !ref <qwen_pretrain_path>
+     skip_special_tokens: True
+     version: cosyvoice3
+ allowed_special: 'all'
+ tokenize: !name:cosyvoice.dataset.processor.tokenize
+     get_tokenizer: !ref <get_tokenizer>
+     allowed_special: !ref <allowed_special>
+ filter: !name:cosyvoice.dataset.processor.filter
+     max_length: 40960
+     min_length: 100
+     token_max_length: 200
+     token_min_length: 1
+ resample: !name:cosyvoice.dataset.processor.resample
+     resample_rate: !ref <sample_rate>
+ truncate: !name:cosyvoice.dataset.processor.truncate
+     truncate_length: 24480 # must be a multiple of hop_size (24480 = 51 * 480)
+ feat_extractor: !name:matcha.utils.audio.mel_spectrogram
+     n_fft: 1920
+     num_mels: 80
+     sampling_rate: !ref <sample_rate>
+     hop_size: 480
+     win_size: 1920
+     fmin: 0
+     fmax: null
+     center: False
+ compute_fbank: !name:cosyvoice.dataset.processor.compute_fbank
+     feat_extractor: !ref <feat_extractor>
+ compute_f0: !name:cosyvoice.dataset.processor.compute_f0
+     sample_rate: !ref <sample_rate>
+     hop_size: 480
+ parse_embedding: !name:cosyvoice.dataset.processor.parse_embedding
+     normalize: True
+ shuffle: !name:cosyvoice.dataset.processor.shuffle
+     shuffle_size: 1000
+ sort: !name:cosyvoice.dataset.processor.sort
+     sort_size: 500 # sort_size should be less than shuffle_size
+ batch: !name:cosyvoice.dataset.processor.batch
+     batch_type: 'dynamic'
+     max_frames_in_batch: 2000
+ padding: !name:cosyvoice.dataset.processor.padding
+     use_spk_embedding: False # change to True during sft
+
+
+ # dataset processor pipeline
+ data_pipeline: [
+     !ref <parquet_opener>,
+     !ref <tokenize>,
+     !ref <filter>,
+     !ref <resample>,
+     !ref <compute_fbank>,
+     !ref <parse_embedding>,
+     !ref <shuffle>,
+     !ref <sort>,
+     !ref <batch>,
+     !ref <padding>,
+ ]
+ data_pipeline_gan: [
+     !ref <parquet_opener>,
+     !ref <tokenize>,
+     !ref <filter>,
+     !ref <resample>,
+     !ref <truncate>,
+     !ref <compute_fbank>,
+     !ref <compute_f0>,
+     !ref <parse_embedding>,
+     !ref <shuffle>,
+     !ref <sort>,
+     !ref <batch>,
+     !ref <padding>,
+ ]
+
+ # llm flow train conf
+ train_conf:
+     optim: adam
+     optim_conf:
+         lr: 1e-5 # 1e-5 is the sft setting
+     scheduler: constantlr # constantlr is the sft setting
+     scheduler_conf:
+         warmup_steps: 2500
+     max_epoch: 200
+     grad_clip: 5
+     accum_grad: 2
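+     # rough effective batch per optimizer step (editor's note, derived from the values above):
+     # max_frames_in_batch (2000) * accum_grad (2) = up to 4000 feature frames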
+     log_interval: 100
+     save_per_step: -1
+
+ # gan train conf
+ train_conf_gan:
+     optim: adam
+     optim_conf:
+         lr: 0.0002 # use small lr for gan training
+     scheduler: constantlr
+     optim_d: adam
+     optim_conf_d:
+         lr: 0.0002 # use small lr for gan training
+     scheduler_d: constantlr
+     max_epoch: 200
+     grad_clip: 5
+     accum_grad: 1 # in gan training, accum_grad must be 1
+     log_interval: 100
+     save_per_step: -1