VibeVoice 7B - 4-bit Quantized (bitsandbytes NF4)

This is a 4-bit quantized version of VibeVoice 7B using bitsandbytes NF4 quantization.

Model Details

| Property | Value |
|---|---|
| Base Model | vibevoice/VibeVoice-7B |
| Quantization | bitsandbytes NF4 (4-bit) |
| VRAM Usage | ~6.2 GB |
| Model Size | ~6.2 GB on disk |
| Sample Rate | 24 kHz |

VRAM Comparison

| Mode | VRAM | Reduction |
|---|---|---|
| Full bfloat16 | ~17 GB | baseline |
| ao-int8 | ~9.4 GB | 45% |
| bnb-4bit | ~6.2 GB | 64% |
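
The numbers above can be roughly reproduced with PyTorch's peak-allocation counter around a short generation run. This is a sketch rather than the harness used for the table; exact values vary with driver, CUDA version, and sequence length.

import torch

torch.cuda.reset_peak_memory_stats()

# ... load the model and run a short generation here (see Quick Start below) ...

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak allocated VRAM: {peak_gib:.1f} GiB")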

Quick Start

Installation

pip install transformers bitsandbytes torch torchaudio
pip install git+https://github.com/vibevoice-community/VibeVoice.git
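
A quick sanity check that the environment can actually load 4-bit bitsandbytes weights (bitsandbytes needs a CUDA GPU; any recent release with NF4 support is fine):

import torch
import bitsandbytes

print(torch.cuda.is_available())   # must be True; bitsandbytes 4-bit requires CUDA
print(bitsandbytes.__version__)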

Usage

import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load quantized model
model_id = "marksverdhai/vibevoice-7b-bnb-4bit"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_id,
    device_map={"": 0},  # Load on GPU 0
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
processor = VibeVoiceProcessor.from_pretrained(model_id)

model.eval()
model.set_ddpm_inference_steps(num_steps=10)

# Generate speech
text = "Speaker 1: Hello! This is VibeVoice, a state-of-the-art text-to-speech model."

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=None,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        generation_config={"do_sample": False},
        verbose=False,
        is_prefill=False,
    )

# Get audio
audio = outputs.speech_outputs[0].squeeze().cpu()
sample_rate = 24000

# Save to file
import torchaudio
torchaudio.save("output.wav", audio.unsqueeze(0), sample_rate)
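
The Speaker 1: prefix above follows the upstream script format. A multi-speaker dialogue can be written as one script with one turn per line; this is a sketch assuming the newline-separated convention used in the upstream VibeVoice examples:

# Multi-speaker script: one "Speaker N:" turn per line.
script = (
    "Speaker 1: Welcome back to the show.\n"
    "Speaker 2: Thanks! Today we are talking about 4-bit quantization."
)

inputs = processor(
    text=[script],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
# Generation then proceeds exactly as in the single-speaker example above.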

Voice Cloning

# With voice reference
inputs = processor(
    text=["Speaker 1: Hello, I can clone any voice!"],
    voice_samples=[["path/to/reference.wav"]],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        is_prefill=True,  # Enable voice cloning
    )

Quality Verification

This model was tested with Whisper transcription to verify output quality:

| Test Sentence | WER |
|---|---|
| "Hello, this is a test." | 0% |
| "The quick brown fox jumps over the lazy dog." | 0% |
| "Good morning, how are you today?" | 0% |
| "Machine learning is transforming technology." | 0% |
| "Please remember to save your work frequently." | 0% |

All test sentences achieved a 0% Word Error Rate (WER), matching the output quality of the full-precision model.
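
The verification harness itself is not bundled with this repository. Below is a minimal sketch of the same round trip: transcribe a generated WAV with a Whisper checkpoint and score it with jiwer (the whisper-small checkpoint and the jiwer package are assumptions, not dependencies of this model):

# Round-trip check: TTS output -> Whisper transcription -> WER (sketch).
from transformers import pipeline
from jiwer import wer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

reference = "Hello, this is a test."
hypothesis = asr("output.wav")["text"]  # "output.wav" generated as in Quick Start

# Normalization is kept minimal here; punctuation differences can inflate WER.
print(f"WER: {wer(reference.lower(), hypothesis.lower()):.2%}")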

Quantization Details

This model uses bitsandbytes NF4 quantization:

  • NF4 (NormalFloat4): Optimized 4-bit data type for neural network weights
  • Double Quantization: Nested quantization for additional memory savings
  • Compute dtype: bfloat16 for computations

Quantization is applied only to the Qwen2 LLM backbone, preserving full precision for the following components (a sketch of how modules can be excluded follows the list):

  • Audio tokenizers (semantic and acoustic)
  • Diffusion head
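
One way to express this split with the standard BitsAndBytesConfig is the llm_int8_skip_modules argument, which recent transformers versions also honor for 4-bit loading and which keeps the named submodules in their original dtype. The module names below are illustrative placeholders, not the exact names in this checkpoint:

import torch
from transformers import BitsAndBytesConfig

# Sketch: keep selected submodules out of 4-bit quantization.
# The names are placeholders for the audio tokenizers and diffusion head;
# check model.named_modules() for the real module names.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_skip_modules=["acoustic_tokenizer", "semantic_tokenizer", "prediction_head"],
)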

Limitations

  • Requires a CUDA GPU with bitsandbytes support
  • Inference is slightly slower than full precision (roughly 1.3× slower)
  • Model load time is longer (~65 s vs. ~24 s for the full-precision model)

Citation

@misc{vibevoice2024,
  title={VibeVoice: Emotion-Aware Text-to-Speech},
  author={VibeVoice Team},
  year={2024},
  url={https://github.com/vibevoice-community/VibeVoice}
}

License

MIT
