---
license: mit
base_model:
  - ByteDance-Seed/Seed-OSS-36B-Instruct
datasets:
  - mit-han-lab/pile-val-backup
pipeline_tag: text-generation
tags:
  - nvfp4
  - vllm
  - llmcompressor
  - text-generation-inference
---

# Seed-OSS-36B-Instruct Quantized with NVFP4

This repo contains Seed-OSS-36B-Instruct quantized to NVFP4, suitable for maximum performance on NVIDIA Blackwell hardware (RTX 5070, 5080, 5090, RTX Pro 6000, B200, B300, ...).

It can only be run on architectures with hardware FP4 support (Blackwell or later).
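
A quick way to check whether a GPU qualifies is to inspect its CUDA compute capability; Blackwell parts report major version 10 (data center) or 12 (GeForce / RTX Pro), both of which include NVFP4 hardware units. A small sketch (the threshold is an assumption based on NVIDIA's published compute capabilities):

```python
import torch

# Blackwell data-center GPUs report compute capability 10.x, consumer and
# workstation Blackwell GPUs report 12.x; earlier architectures (< 10) lack FP4 units.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("Hardware FP4 (NVFP4) supported:", major >= 10)
```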

Original model: [ByteDance-Seed/Seed-OSS-36B-Instruct](https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Instruct)

This model requires ~21.1 GB of VRAM; however, the maximum context size of 512k tokens requires 128 GB of VRAM. Make sure to set an appropriate context size with `--max-model-len` in vLLM, and/or quantize the KV cache, and/or use multiple GPUs with, for example, tensor parallelism.
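
The same constraints can also be set from vLLM's Python API. A minimal offline-inference sketch for a single GPU (the context length and KV-cache dtype below are illustrative choices, not recommendations from the original model card):

```python
from vllm import LLM, SamplingParams

# Illustrative single-GPU setup: cap the context well below the 512k maximum
# and optionally quantize the KV cache to FP8 to reduce VRAM usage.
llm = LLM(
    model="mratsim/Seed-OSS-36B-Instruct-NVFP4",
    max_model_len=65536,          # equivalent to --max-model-len 65536
    kv_cache_dtype="fp8",         # equivalent to --kv-cache-dtype fp8
    gpu_memory_utilization=0.85,
)

outputs = llm.generate(
    ["Give me a short introduction to NVFP4 quantization."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```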

## 📥 Usage & Running Instructions

The model was tested with vLLM on 2x RTX Pro 6000; here is a launch script suitable for such a configuration.

```bash
export MODEL="mratsim/Seed-OSS-36B-Instruct-NVFP4"
vllm serve "${MODEL}" \
  --served-model-name seed-oss-36b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85
```
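
Once the server is up, it exposes an OpenAI-compatible API. A small sketch of querying it with the `openai` Python client (the prompt and port are just examples; `seed-oss-36b` matches the `--served-model-name` above):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="seed-oss-36b",  # must match --served-model-name
    messages=[{"role": "user", "content": "Summarize what NVFP4 quantization is."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```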

## 🔬 Quantization method

The llmcompressor library was used with the following recipe:

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
```
and calibrated on 512 samples of 4096-token sequence length from mit-han-lab/pile-val-backup.
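
For reference, a minimal sketch of how such a recipe can be applied with `llmcompressor`'s `oneshot` API (output directory, shuffling seed, and preprocessing details are illustrative assumptions, not the exact script used for this repo):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "ByteDance-Seed/Seed-OSS-36B-Instruct"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 4096

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: 512 samples from the Pile validation backup, truncated to 4096 tokens.
ds = load_dataset("mit-han-lab/pile-val-backup", split="validation")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(
    lambda sample: tokenizer(
        sample["text"],
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    ),
    remove_columns=ds.column_names,
)

# NVFP4 quantization of all Linear layers except the output head,
# matching the recipe above.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("Seed-OSS-36B-Instruct-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Seed-OSS-36B-Instruct-NVFP4")
```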