# VibeVoice 7B - 4-bit Quantized (bitsandbytes NF4)
This is a 4-bit quantized version of VibeVoice 7B using bitsandbytes NF4 quantization.
## Model Details
| Property | Value |
|---|---|
| Base Model | vibevoice/VibeVoice-7B |
| Quantization | bitsandbytes NF4 (4-bit) |
| VRAM Usage | ~6.2 GB |
| Model Size | ~6.2 GB on disk |
| Sample Rate | 24 kHz |
## VRAM Comparison
| Mode | VRAM | Reduction |
|---|---|---|
| Full bfloat16 | ~17 GB | baseline |
| ao-int8 | ~9.4 GB | 45% |
| bnb-4bit | ~6.2 GB | 64% |
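The figures above can be sanity-checked with PyTorch's built-in memory counters; a minimal sketch, to be wrapped around the loading and generation code from the Usage section below:

```python
import torch

# Reset the peak counter, then load the model and run generate() as in Usage.
torch.cuda.reset_peak_memory_stats()

# ... load model and generate here ...

# max_memory_allocated() reports the peak tensor allocation in bytes.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated: {peak_gb:.1f} GB")
```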
## Quick Start
### Installation
```bash
pip install transformers bitsandbytes torch torchaudio
pip install git+https://github.com/vibevoice-community/VibeVoice.git
```
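Before downloading the ~6.2 GB checkpoint, it can be worth confirming that the CUDA stack and bitsandbytes import cleanly; a quick check:

```python
import torch
import bitsandbytes  # noqa: F401 - fails here if the bitsandbytes install is broken

# bnb-4bit inference needs a CUDA GPU (see Limitations below).
assert torch.cuda.is_available(), "A CUDA-capable GPU is required for 4-bit inference"
print(torch.cuda.get_device_name(0))
```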
### Usage
```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load quantized model
model_id = "marksverdhai/vibevoice-7b-bnb-4bit"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_id,
    device_map={"": 0},  # Load on GPU 0
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
processor = VibeVoiceProcessor.from_pretrained(model_id)
model.eval()
model.set_ddpm_inference_steps(num_steps=10)

# Generate speech
text = "Speaker 1: Hello! This is VibeVoice, a state-of-the-art text-to-speech model."
inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=None,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        generation_config={"do_sample": False},
        verbose=False,
        is_prefill=False,
    )

# Get audio
audio = outputs.speech_outputs[0].squeeze().cpu()
sample_rate = 24000

# Save to file
import torchaudio
torchaudio.save("output.wav", audio.unsqueeze(0), sample_rate)
```
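The `Speaker N:` prefix suggests multi-speaker dialogue scripts written as one speaker turn per line. The sketch below assumes that convention (verify the exact script format against the upstream VibeVoice examples) and reuses `model` and `processor` from above; additional generation arguments can be passed as in the basic example:

```python
import torch
import torchaudio

# Two-speaker script, one "Speaker N:" turn per line (format assumed from the
# single-speaker example above; check the upstream repo for the exact convention).
script = (
    "Speaker 1: Welcome back to the show.\n"
    "Speaker 2: Thanks for having me. It's great to be here."
)

inputs = processor(text=[script], padding=True, return_tensors="pt", return_attention_mask=True)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=1.3, tokenizer=processor.tokenizer)

torchaudio.save("dialogue.wav", outputs.speech_outputs[0].squeeze().cpu().unsqueeze(0), 24000)
```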
## Voice Cloning
```python
# With voice reference
inputs = processor(
    text=["Speaker 1: Hello, I can clone any voice!"],
    voice_samples=[["path/to/reference.wav"]],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        is_prefill=True,  # Enable voice cloning
    )
```
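Extracting and saving the audio works exactly as in the basic example. For cloning several speakers at once, a plausible extension is one reference file per speaker in the inner `voice_samples` list, but this is an assumption to verify against the upstream repository:

```python
# Assumption: one reference per speaker, in speaker order; file names are placeholders.
inputs = processor(
    text=["Speaker 1: Hello there.\nSpeaker 2: Hi, nice to meet you."],
    voice_samples=[["speaker1_ref.wav", "speaker2_ref.wav"]],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
# Then move tensors to CUDA and call model.generate(..., is_prefill=True) as above.
```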
## Quality Verification
This model was tested with Whisper transcription to verify output quality:
| Test Sentence | WER |
|---|---|
| "Hello, this is a test." | 0% |
| "The quick brown fox jumps over the lazy dog." | 0% |
| "Good morning, how are you today?" | 0% |
| "Machine learning is transforming technology." | 0% |
| "Please remember to save your work frequently." | 0% |
All test sentences achieved 0% Word Error Rate, matching the full-precision model quality.
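The check can be reproduced with any ASR model plus a WER metric; here is a minimal sketch using the transformers Whisper pipeline and jiwer (the Whisper size and file name are illustrative, not necessarily what was used for the table above):

```python
from jiwer import wer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

reference = "Hello, this is a test."
hypothesis = asr("output.wav")["text"]

# Normalise lightly before scoring; jiwer does not strip punctuation by default.
print(f"WER: {wer(reference.lower(), hypothesis.lower()):.2%}")
```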
## Quantization Details
This model uses bitsandbytes NF4 quantization:
- NF4 (NormalFloat4): Optimized 4-bit data type for neural network weights
- Double Quantization: Nested quantization for additional memory savings
- Compute dtype: bfloat16 for computations
The quantization is applied to the Qwen2 LLM backbone, while the following components are kept in full precision (a quick inspection sketch follows the list):
- Audio tokenizers (semantic and acoustic)
- Diffusion head
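A quick way to confirm that split, assuming bitsandbytes replaces the backbone's linear layers with `Linear4bit` modules (standard bitsandbytes behaviour, but worth verifying on the loaded model):

```python
import bitsandbytes as bnb
import torch

quantized, full_precision = [], []
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        quantized.append(name)
    elif isinstance(module, torch.nn.Linear):
        full_precision.append(name)

print(f"{len(quantized)} Linear4bit layers (expected: LLM backbone)")
print(f"{len(full_precision)} full-precision Linear layers (expected: audio tokenizers, diffusion head)")
```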
## Limitations
- Requires CUDA GPU with bitsandbytes support
- Slightly slower inference than full precision (~1.3x)
- Longer model load time (~65 s vs ~24 s); see the timing sketch below
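Load time is easy to check with a wall-clock timer around `from_pretrained`, reusing `model_id` and `bnb_config` from the Usage section; numbers will vary with disk speed and GPU:

```python
import time

start = time.perf_counter()
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_id,
    device_map={"": 0},
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
print(f"Model load time: {time.perf_counter() - start:.1f}s")
```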
## Citation
```bibtex
@misc{vibevoice2024,
  title={VibeVoice: Emotion-Aware Text-to-Speech},
  author={VibeVoice Team},
  year={2024},
  url={https://github.com/vibevoice-community/VibeVoice}
}
```
## License
MIT