gemma4-prometheus-gptq-4bit

GPTQ 4-bit quantized version of groxaxo/gemma4-prometheus-merged (Prometheus-steered google/gemma-4-31B-it).
Quantized with gptqmodel v5.8.0. Size on disk drops by 69.3%: roughly 58 GiB (BF16) → 17.9 GiB.

Related repositories

Repo                                 Description
groxaxo/gemma4-prometheus-merged     Full BF16 source model
groxaxo/gemma4-prometheus-workflow   Reproducible scripts, config, and checkpoint journal
groxaxo/gemma4-prometheus-fixes      All local patches applied to make this work
google/gemma-4-31B-it                Original base model

Quantization details

Parameter     Value
Bits          4
Group size    128
Format        GPTQ
Symmetric     Yes
desc_act      No
Size (disk)   17.91 GiB (5 shards)
Reduction     69.3% vs BF16 merged
Tool          gptqmodel 5.8.0
Calibration   16 samples (8 benign + 8 adversarial)
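
To make the bits, group size, and symmetric settings concrete, here is a minimal sketch of group-wise symmetric 4-bit quantization in plain Python. It uses simple round-to-nearest and signed codes, not the GPTQ algorithm itself (GPTQ additionally uses the calibration samples to compensate rounding error, and its packed storage layout differs). All function names are illustrative and not part of gptqmodel.

```python
def quantize_group_sym(weights, bits=4):
    """Round-to-nearest symmetric quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1           # 7 for 4-bit
    qmin = -(2 ** (bits - 1))            # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0   # one scale per group
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups and quantize each independently."""
    return [
        quantize_group_sym(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


# Synthetic weights in [-1, 1]; real weights would come from the model.
row = [((i * 37) % 101 - 50) / 50.0 for i in range(256)]
groups = quantize_row(row)
restored = [w for codes, scale in groups for w in dequantize_group(codes, scale)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), max_err)
```

With a group size of 128, a 256-element row yields two groups, each with its own scale. The worst-case round-to-nearest error per weight is half a quantization step.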

How to run

Requirements

  • 1–2 × GPU with ≥ 20 GiB total VRAM (single 24 GB GPU works)
  • gptqmodel >= 5.8.0 with the Gemma4 patches applied (see "Patches required" below)

Install

pip install "gptqmodel>=5.8.0"

Note: The standard gptqmodel package does not include Gemma4 support. Apply the patches from groxaxo/gemma4-prometheus-fixes before loading this model. The patches add Gemma4QModel and fix the alternating rotary-embedding shape mismatch.

Inference

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

from gptqmodel import GPTQModel
from transformers import AutoTokenizer
import torch

model_id = "groxaxo/gemma4-prometheus-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GPTQModel.load(
    model_id,
    device_map="auto",          # single GPU or multi-GPU pipeline
    # max_memory={0: "22GiB"},  # uncomment to set per-GPU budget
)
model.eval()

messages = [{"role": "user", "content": "Explain gradient descent."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,       # suppress chain-of-thought tokens
)
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=512,
        do_sample=False,
        temperature=None,
        top_p=None,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True))

Two-GPU pipeline parallel (2 × 24 GiB)

model = GPTQModel.load(
    model_id,
    device_map="balanced",
    max_memory={0: "22GiB", 1: "22GiB"},
)
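
The max_memory argument above is a plain dictionary keyed by GPU index, in the style used by Hugging Face Accelerate. As a small convenience sketch, the hypothetical helper below (build_max_memory is not part of gptqmodel) builds that dictionary for any number of identical GPUs, with an optional CPU offload budget.

```python
def build_max_memory(num_gpus, budget_gib, cpu_gib=None):
    """Build a per-device memory budget dict, e.g. {0: "22GiB", 1: "22GiB"}."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    max_memory = {i: f"{budget_gib}GiB" for i in range(num_gpus)}
    if cpu_gib is not None:
        # Optional budget for layers offloaded to CPU RAM.
        max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory


print(build_max_memory(2, 22))  # {0: '22GiB', 1: '22GiB'}
```

Leaving a couple of GiB of headroom per GPU, as the 22GiB budget on a 24 GB card does, keeps room for activations and the KV cache.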

Evaluation results

All tests run on 2 × RTX 3090 (24 GB each) in pipeline-parallel mode. True tensor-parallelism (TP=2) requires vLLM, which does not yet support the gemma4 architecture natively.

Coherence test (5/5 passed ✅)

Prompt                                         Response excerpt
Explain how neural networks learn from data.   "…a neural network learns by trial and error. It makes a guess, finds out how wrong that guess was, and then adjusts its internal settings…"
Supervised vs unsupervised learning?           "…In supervised learning, the data is 'labeled' (it has an answer key)…In unsupervised learning, the data is 'unlabeled'…"
Gradient descent?                              "…Gradient Descent is an optimization algorithm used to minimize a function…the 'engine' used to train models by minimizing the Cost Function…"
Transformers in NLP?                           "…a Transformer is a deep learning architecture designed to process sequential data…focusing on the most important parts of the input, regardless of how far apart they are…"
What is quantization?                          "…quantization is the process of reducing the precision of the numbers used to represent a neural network's weights and activations…"

Context length (2 × RTX 3090, GPTQ-4bit, no flash-attn)

KV Cache         Max Tokens   Bottleneck
FP16             6 144        Attention compute O(n²)
FP8 (software)   6 144        Same: attention matrix dominates

Without flash-attn, the bottleneck is the attention matrix (O(n²) per layer), not KV cache storage. FP8 KV cache does not help here.
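
A rough back-of-the-envelope sketch shows why the score matrix dominates. It assumes eager attention materializes a full [heads, n, n] FP16 score tensor for one layer at batch size 1, and uses the 32 query heads listed under Architecture notes. The function name is illustrative, and real peak memory also includes softmax temporaries and masks.

```python
def attention_scores_gib(seq_len, num_heads=32, bytes_per_elem=2, batch=1):
    """Memory of one layer's attention score tensor, in GiB."""
    return batch * num_heads * seq_len * seq_len * bytes_per_elem / 2**30


for n in (2048, 6144, 16384):
    print(n, attention_scores_gib(n))
# 2048 0.25
# 6144 2.25
# 16384 16.0
```

The cost grows with the square of the sequence length, so tripling the context from 2 048 to 6 144 tokens multiplies this buffer by nine. The KV cache, in contrast, grows only linearly.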

With flash-attn installed (estimated):

KV Cache   Estimated Max Tokens
FP16       ~113 000
FP8        ~226 000

Recommendation: pip install flash-attn --no-build-isolation to unlock much longer contexts. The model supports up to 262 144 tokens.
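
The factor of two between the FP16 and FP8 estimates follows from the per-token KV cache size. The sketch below is illustrative only: it counts just the global-attention layers (4 KV heads, head_dim=512), since sliding-window layers are bounded by their 1 024-token window. The number of global layers (10) and the 8 GiB cache budget are assumptions chosen for the example, not measured values, so the absolute token counts will not match the table.

```python
def kv_bytes_per_token(num_layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache added per token: K and V for every counted layer."""
    return 2 * num_layers * kv_heads * head_dim * bytes_per_elem


def max_tokens(budget_gib, bytes_per_token):
    """Tokens that fit in a cache budget given in GiB."""
    return (budget_gib * 2**30) // bytes_per_token


GLOBAL_LAYERS = 10   # assumption, not taken from the model config
BUDGET_GIB = 8       # hypothetical free VRAM left for the KV cache

fp16_bytes = kv_bytes_per_token(GLOBAL_LAYERS, kv_heads=4, head_dim=512, bytes_per_elem=2)
fp8_bytes = kv_bytes_per_token(GLOBAL_LAYERS, kv_heads=4, head_dim=512, bytes_per_elem=1)

print(fp16_bytes, max_tokens(BUDGET_GIB, fp16_bytes))
print(fp8_bytes, max_tokens(BUDGET_GIB, fp8_bytes))
```

Halving the bytes per element halves the per-token footprint, so the same budget holds about twice as many tokens.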

Perplexity (WikiText-2)

Model                         PPL      Notes
Merged (BnB-8bit reference)   1782.3   Chat model on raw text; PPL magnitude expected to be high
GPTQ-4bit (this model)        1815.8   +1.9% vs reference

ΔPPL = +1.9% is the meaningful signal. The absolute values are high because instruction-tuned models trained on chat data have poor raw-text likelihood.
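
For reference, the relative change can be recomputed from the two table values. The sketch below also shows the usual definition of perplexity as the exponential of the mean per-token negative log-likelihood. The helper names are illustrative, and the evaluation script that produced the table is not reproduced here.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_delta_pct(ppl_ref, ppl_test):
    """Relative perplexity change of the test model vs the reference, in %."""
    return 100.0 * (ppl_test - ppl_ref) / ppl_ref


print(round(relative_delta_pct(1782.3, 1815.8), 1))  # 1.9
```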

KL divergence (this model vs merged reference)

Metric             Value
Direction          KL(merged_bnb8 ‖ gptq_4bit)
Mean KL            4.77 nats
Std KL             3.65 nats
Prompts            8 ML-domain questions
Vocab comparison   Top-1000 tokens

KL ~4.77 nats is a typical result for 4-bit GPTQ on a 30B-class model. Part of the divergence is attributable to quantization noise in the bnb-8bit reference itself, so the true KL against the unquantized BF16 model would likely be somewhat lower.
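
The exact evaluation script is not shown here, so the following is only a sketch of one way to compute a KL divergence restricted to the top-k tokens. It assumes the top-k set is chosen from the reference model's logits and that both distributions are renormalized over that set. The function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def topk_kl(ref_logits, test_logits, k):
    """KL(ref || test) in nats over the reference model's top-k tokens."""
    top = sorted(range(len(ref_logits)), key=lambda i: ref_logits[i], reverse=True)[:k]
    log_p = log_softmax([ref_logits[i] for i in top])    # reference, renormalized
    log_q = log_softmax([test_logits[i] for i in top])   # quantized, renormalized
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy example: p = [0.5, 0.5], q = [0.75, 0.25]  ->  KL = 0.5 * ln(4/3)
kl = topk_kl([0.0, 0.0], [math.log(3.0), 0.0], k=2)
print(round(kl, 4))  # 0.1438
```

Averaging this quantity over every generated position of every prompt would give a mean and standard deviation comparable in form to the table above.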


Architecture notes (Gemma4 quirks)

Feature             Detail
Text layers         60, alternating sliding-window / full attention
Sliding attention   window=1024, 16 KV heads, head_dim=256
Global attention    4 KV heads, head_dim=512
GQA                 32 query heads
Max position        262 144 tokens
VLM wrapper         Vision tower present; text-only inference supported

Why layer_modules_strict = False: sliding-window attention layers omit v_proj, so a strict module check would fail. The flag allows partial matches.

Rotary embedding fix: Gemma4 alternates sliding_attention (head_dim=256) and full_attention (head_dim=512). gptqmodel cached the first layer's position_embeddings and replayed them for all layers, causing a shape mismatch at the first global-attention layer (layer 5). The fix regenerates position_embeddings per layer using the correct layer_type.
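
The mismatch can be illustrated with a small standalone sketch. This is not the gptqmodel patch itself: the function, the six-layer pattern (five sliding layers followed by one global layer, matching the first failure at layer 5), and the rotary base of 10000 are assumptions made for the example.

```python
import math

# head_dim per layer type, as listed in the architecture table.
HEAD_DIM = {"sliding_attention": 256, "full_attention": 512}


def position_embeddings(layer_type, positions, base=10000.0):
    """Return (cos, sin) tables whose width equals the layer's head_dim."""
    head_dim = HEAD_DIM[layer_type]
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    cos, sin = [], []
    for p in positions:
        angles = [p * f for f in inv_freq] * 2   # duplicate halves to full width
        cos.append([math.cos(a) for a in angles])
        sin.append([math.sin(a) for a in angles])
    return cos, sin


# Hypothetical layer pattern for the first six layers.
layer_types = ["sliding_attention"] * 5 + ["full_attention"]
positions = range(4)

# Buggy behaviour: reuse the first layer's embeddings for every layer.
first_cos, _ = position_embeddings(layer_types[0], positions)
stale_widths = [len(first_cos[0]) for _ in layer_types]

# Fixed behaviour: regenerate per layer from that layer's type.
fresh_widths = [len(position_embeddings(t, positions)[0][0]) for t in layer_types]

mismatched = [i for i, t in enumerate(layer_types) if stale_widths[i] != HEAD_DIM[t]]
print(mismatched)  # [5]
```

The stale tables are 256 wide everywhere, so the first layer expecting 512 is where the shapes stop lining up.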


Patches required

See groxaxo/gemma4-prometheus-fixes for full patch diffs and instructions.

Required patches to gptqmodel:

  1. gptqmodel/models/definitions/gemma4.py: new Gemma4QModel class
  2. gptqmodel/models/auto.py: "gemma4" -> Gemma4QModel mapping
  3. gptqmodel/looper/module_looper.py: free-memory device scheduling + per-layer rotary fix

Quantization config

{
  "bits": 4,
  "group_size": 128,
  "format": "gptq",
  "desc_act": false,
  "sym": true,
  "quant_method": "gptq"
}

Citation / acknowledgements
