Building on HF

12 5 30

Yuriy Perezhohin PRO

yuriyvnv

https://scholar.google.com/citations?user=I5uzFtwAAAAJ&hl=en

AI & ML interests

Automatic Speech Recognition, Embeddings, Code Generation, Synthetic Data Generation and Filtering

Recent Activity

updated a dataset about 18 hours ago

yuriyvnv/synthetic_transcript_pt

posted an update 8 days ago

🔊 Four Qwen3-ASR (0.6B and 1.7B) Fine-Tunes for Portuguese and Dutch. Both the 1.7B and 0.6B variants of Alibaba's Qwen3-ASR, fine-tuned for European Portuguese and Dutch and bundled in a single collection. 🔗 Collection: https://huggingface.co/collections/yuriyvnv/qwen-asr-for-portuguese-and-dutch-17b-and-06b Headline numbers — Common Voice 22 test, with the zero-shot baseline. 🇵🇹 Qwen3-ASR-1.7B-PT — 12.91% → 8.50% WER (-34%) 🇵🇹 Qwen3-ASR-0.6B-PT — 18.26% → 11.85% WER (-35%) 🇳🇱 Qwen3-ASR-1.7B-NL — 6.68% → 5.28% WER (-21%) 🇳🇱 Qwen3-ASR-0.6B-NL — 12.46% → 8.31% WER (-33%) The 0.6B variants are the more interesting half of the release. They give up only a few WER points compared to the 1.7B at a third of the parameters — relevant for edge hardware, CPU inference, or anywhere keeping inference cost down. The Dutch 0.6B in particular lands at 8.3% WER on CV22, competitive with much larger systems. The Dutch 1.7B started from a strong 6.7% zero-shot, so the absolute gain is smaller — Qwen already handles Dutch well, and the fine-tune mostly sharpens it on Common Voice's casing and punctuation conventions. Training stuck close to Qwen's official SFT recipe (lr 2e-5, linear schedule, 2% warmup, bf16, gradient checkpointing on a single H100). The data is the differentiator: Common Voice 22 train + validation augmented with synthetic OpenAI-TTS speech, filtered by the WAVe multimodal embedding model that scores clips at the word level and drops the ones that don't align well with their transcripts. 📦 Full pipeline — synthetic data generation, WAVe filtering, training scripts, evaluation protocol — is open-source: github.com/yuriyvnv/TTS-Augmented-ASR @hf-audio . #asr #speech #parakeet #nvidia #nemo #multilingual #fine-tuning #commonvoice

updated a model 8 days ago

yuriyvnv/Qwen3-ASR-1.7B-NL

View all activity

Organizations

posted an update 8 days ago

Post

1340

🔊 Four Qwen3-ASR (0.6B and 1.7B) Fine-Tunes for Portuguese and Dutch.

Both the 1.7B and 0.6B variants of Alibaba's Qwen3-ASR, fine-tuned for European Portuguese and Dutch and bundled in a single collection.

🔗 Collection: https://huggingface.co/collections/yuriyvnv/qwen-asr-for-portuguese-and-dutch-17b-and-06b

Headline numbers — Common Voice 22 test, with the zero-shot baseline.
🇵🇹 Qwen3-ASR-1.7B-PT — 12.91% → 8.50% WER (-34%)
🇵🇹 Qwen3-ASR-0.6B-PT — 18.26% → 11.85% WER (-35%)
🇳🇱 Qwen3-ASR-1.7B-NL — 6.68% → 5.28% WER (-21%)
🇳🇱 Qwen3-ASR-0.6B-NL — 12.46% → 8.31% WER (-33%)

The 0.6B variants are the more interesting half of the release. They give up only a few WER points compared to the 1.7B at a third of the parameters — relevant for edge hardware, CPU inference, or anywhere keeping inference cost down. The Dutch 0.6B in particular lands at 8.3% WER on CV22, competitive with much larger systems.

The Dutch 1.7B started from a strong 6.7% zero-shot, so the absolute gain is smaller — Qwen already handles Dutch well, and the fine-tune mostly sharpens it on Common Voice's casing and punctuation conventions.

Training stuck close to Qwen's official SFT recipe (lr 2e-5, linear schedule, 2% warmup, bf16, gradient checkpointing on a single H100). The data is the differentiator: Common Voice 22 train + validation augmented with synthetic OpenAI-TTS speech, filtered by the WAVe multimodal embedding model that scores clips at the word level and drops the ones that don't align well with their transcripts.

📦 Full pipeline — synthetic data generation, WAVe filtering, training scripts, evaluation protocol — is open-source:
github.com/yuriyvnv/TTS-Augmented-ASR
@hf-audio .
#asr #speech #parakeet #nvidia #nemo #multilingual #fine-tuning #commonvoice

replied to their post 18 days ago

Thanks! Just pushed the repo public: github.com/yuriyvnv/TTS-Augmented-ASR

This is the codebase behind a paper I wrote on Estonian and Slovenian, so you'll find the full pipeline there: not just the Parakeet fine-tuning scripts, but also the synthetic data generation (LLM text diversification + OpenAI TTS synthesis) that powers the augmentation. Everything was trained on a single NVIDIA H100.

One thing worth knowing for African languages:

Parakeet v3 is only pretrained on 25 languages, so you'd be doing cross-lingual transfer from scratch. The base won't recognize the language zero-shot, but fine-tuning still works — just expect a much rougher starting point than what you saw in my models.
Always evaluate zero-shot first. I had one language (Polish) where fine-tuning actually made things worse due to domain mismatch, or the learning rate was too low (still analyzing why this happened).
Standard recipe worked across everything I tried: AdamW, lr=5e-5, cosine annealing, 10% warmup, bf16, batch 32-64, early stopping on val_wer. The larger the batch size, especially for parakeet models, the better the gradient flow during training, since the model is compact.
Happy to help if you hit anything weird.

posted an update 19 days ago

Post

632

🎙️Parakeet-TDT Fine Tuning: 4 New ASR Models

Four fine-tuned versions of NVIDIA's Parakeet-TDT-0.6B-v3 for Dutch, Portuguese, Estonian, and Slovenian — among the first community fine-tunes of this architecture for the aforementioned languages

📊 Results on Common Voice 17 test sets:

🇸🇮 Slovenian: 50.49% → 11.56% WER (-77%)
🇵🇹 Portuguese: 15.86% → 10.71% WER (-32%)
🇪🇪 Estonian: 27.15% → 21.03% WER (-23%)
🇳🇱 Dutch: 5.99% → 5.33% WER (-11%)

All models output cased text with punctuation.

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained(
    "yuriyvnv/parakeet-tdt-0.6b-dutch"
)
output = model.transcribe(["audio.wav"])
print(output[0].text)

🔗 Models:
🇳🇱 yuriyvnv/parakeet-tdt-0.6b-dutch
🇵🇹 yuriyvnv/parakeet-tdt-0.6b-portuguese
🇪🇪 yuriyvnv/parakeet-tdt-0.6b-estonian
🇸🇮 yuriyvnv/parakeet-tdt-0.6b-slovenian

🏗️ Training: Common Voice 17 + synthetic speech (OpenAI TTS), filtered with WAVe (yuriyvnv/WAVe-1B-Multimodal-PT) for quality. AdamW + cosine annealing, bf16-mixed precision, early stopping on val WER. Timestamps and long-form audio supported.

@hf-audio @NVIDIADev

#asr #speech #parakeet #nvidia #nemo #multilingual #fine-tuning #commonvoice

3 replies

posted an update 3 months ago

Post

490

🎯 WAVe-1B-Multimodal-NL: Word-Level Speech Quality Assessment for Dutch

Following the release of the Portuguese model, we're releasing the Dutch variant of WAVe — a 1B multimodal embedding model that assesses synthetic speech quality at the word level, thereby improving the quality of synthetically augmented datasets for training ASR models.

Trained on CommonVoice 16.1 Dutch with 5 corruption strategies, this model catches mispronunciations, timing errors, and prosody issues in synthetic data that sentence-level embeddings miss entirely.
Resources

- Dutch model: yuriyvnv/WAVe-1B-Multimodal-NL
- Portuguese model: yuriyvnv/WAVe-1B-Multimodal-PT
- Code: https://github.com/yuriyvnv/WAVe

This model builds on CommonVoice Dutch data — thanks to @mozilla and the CommonVoice community for making multilingual speech data accessible.

Would be great to hear from the Dutch NLP community — @BramVanroy @GroNLP — especially if you're working on Dutch ASR or TTS pipelines where quality filtering could help. Also tagging @hf-audio as this sits at the intersection of speech processing and data curation.

replied to their post 3 months ago

🔥 Hello Everyone, given the community's increased interest in the WAVe for the Portuguese Language, the team has retrained the model for over 100 epochs to further extend learning. The results are much better than those from the previous version with 30 epochs.
Key improvements:

Metric	30 ep	100 ep	Change
Loss	0.49	0.22	-56%
Alignment Gap	0.079	0.118	+49%
Corrupt Similarity	0.31	0.23	-25%

The biggest win is the alignment gap nearly doubling -- the model is now much better at catching word-level errors like mispronunciations and timing artifacts. Corrupt pairs get
penalized harder (0.23 vs 0.31), so the filtering threshold becomes more reliable.

Same repo, same API, drop-in replacement:

model = AutoModel.from_pretrained("yuriyvnv/WAVe-1B-Multimodal-PT", trust_remote_code=True)

Updated README of the model card includes side-by-side training curves for both versions, check it out.

replied to their post 3 months ago

Hello everyone, yesterday there were minor problems that prevented the usage of the Embedding model. Mainly because of the Processor Class.
Posting here that the team has already solved the bugs.
If there is any problem with your usage, first delete the cached model (.cache folder in Hugging Face), redownload it, and if the issue persists, post a thread on the model page.

posted an update 3 months ago

Post

2277

🎯 WAVe: 1B Multimodal Embedding Model for Word-Level Speech Quality

Multimodal embeddings for speech + transcript that verify quality at the word level, not just sentence level. Catches mispronunciations, timing errors, and prosody issues that sentence-level filters miss.

📊 Impact on Portuguese ASR:
• 34% reduction in training steps
• 50% better cross-domain generalization
• 30% less synthetic data needed
• Word-aligned attention finds errors other methods miss

🏗️ Architecture:
• Text: XLM-RoBERTa (278M params)
• Audio: Wav2Vec2-BERT 2.0 (581M params)
• Word Alignment: Multi-head attention + GLU (14M params)
• Total: 1B parameters

from transformers import AutoModel, AutoProcessor

  processor = AutoProcessor.from_pretrained(
      "yuriyvnv/WAVe-1B-Multimodal-PT",
      trust_remote_code=True
  )
  model = AutoModel.from_pretrained(
      "yuriyvnv/WAVe-1B-Multimodal-PT",
      trust_remote_code=True
  )

# Assess speech-transcript alignment

inputs = processor(text="Olá, como está?", audio=audio_array, sampling_rate=16000, return_tensors="pt")
  quality = model(**inputs).quality_score.item()

Perfect for filtering synthetic speech datasets before ASR training.

Model: yuriyvnv/WAVe-1B-Multimodal-PT
Code to create WAVe : https://github.com/yuriyvnv/WAVe
#multimodal #speech #embeddings #asr
#syntheticdata #qualityassessment

2 replies

Yuriy Perezhohin PRO

AI & ML interests

Recent Activity

Organizations

yuriyvnv's activity