---
license: mit
base_model:
  - ByteDance-Seed/Seed-OSS-36B-Instruct
datasets:
  - mit-han-lab/pile-val-backup
pipeline_tag: text-generation
tags:
  - nvfp4
  - vllm
  - llmcompressor
  - text-generation-inference
---

# Seed-OSS-36B-Instruct Quantized with NVFP4

This repo contains Seed-OSS-36B-Instruct quantized to NVFP4, suitable for maximum performance on NVIDIA Blackwell hardware (RTX 5070, 5080, 5090, RTX Pro 6000, B200, B300, ...).

It can only be run on architectures with hardware FP4 support (Blackwell or later).
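
A quick way to check whether a GPU qualifies is to inspect its CUDA compute capability; Blackwell parts report major version 10 (data center) or 12 (GeForce / RTX Pro), both of which include NVFP4 hardware units. A small sketch (the threshold is an assumption based on NVIDIA's published compute capabilities):

```python
import torch

# Blackwell data-center GPUs report compute capability 10.x, consumer and
# workstation Blackwell GPUs report 12.x; earlier architectures (< 10) lack FP4 units.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("Hardware FP4 (NVFP4) supported:", major >= 10)
```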

Original model: [ByteDance-Seed/Seed-OSS-36B-Instruct](https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Instruct)

This model requires ~21.1 GB of VRAM; however, the maximum context size of 512k tokens requires 128 GB of VRAM. Make sure to set an appropriate context size with `--max-model-len` in vLLM, and/or quantize the KV cache, and/or use multiple GPUs with, for example, tensor parallelism.
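
The same constraints can also be set from vLLM's Python API. A minimal offline-inference sketch for a single GPU (the context length and KV-cache dtype below are illustrative choices, not recommendations from the original model card):

```python
from vllm import LLM, SamplingParams

# Illustrative single-GPU setup: cap the context well below the 512k maximum
# and optionally quantize the KV cache to FP8 to reduce VRAM usage.
llm = LLM(
    model="mratsim/Seed-OSS-36B-Instruct-NVFP4",
    max_model_len=65536,          # equivalent to --max-model-len 65536
    kv_cache_dtype="fp8",         # equivalent to --kv-cache-dtype fp8
    gpu_memory_utilization=0.85,
)

outputs = llm.generate(
    ["Give me a short introduction to NVFP4 quantization."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```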

## 📥 Usage & Running Instructions

The model was tested with vLLM on 2x RTX Pro 6000; here is a launch script suitable for such a configuration.

```bash
export MODEL="mratsim/Seed-OSS-36B-Instruct-NVFP4"
vllm serve "${MODEL}" \
  --served-model-name seed-oss-36b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85
```
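
Once the server is up, it exposes an OpenAI-compatible API. A small sketch of querying it with the `openai` Python client (the prompt and port are just examples; `seed-oss-36b` matches the `--served-model-name` above):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="seed-oss-36b",  # must match --served-model-name
    messages=[{"role": "user", "content": "Summarize what NVFP4 quantization is."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```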

## 🔬 Quantization method

The llmcompressor library was used with the following recipe:

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
```
and calibrated on 512 samples of 4096-token sequence length from mit-han-lab/pile-val-backup.
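
For reference, a minimal sketch of how such a recipe can be applied with `llmcompressor`'s `oneshot` API (output directory, shuffling seed, and preprocessing details are illustrative assumptions, not the exact script used for this repo):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "ByteDance-Seed/Seed-OSS-36B-Instruct"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 4096

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: 512 samples from the Pile validation backup, truncated to 4096 tokens.
ds = load_dataset("mit-han-lab/pile-val-backup", split="validation")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(
    lambda sample: tokenizer(
        sample["text"],
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    ),
    remove_columns=ds.column_names,
)

# NVFP4 quantization of all Linear layers except the output head,
# matching the recipe above.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("Seed-OSS-36B-Instruct-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Seed-OSS-36B-Instruct-NVFP4")
```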