gemma4-prometheus-gptq-4bit

GPTQ 4-bit quantized version of groxaxo/gemma4-prometheus-merged (Prometheus-steered google/gemma-4-31B-it).
Quantized with gptqmodel v5.8.0. 69.3% size reduction: 58 GiB → 17.9 GiB.

Related repositories

Repo	Description
groxaxo/gemma4-prometheus-merged	Full BF16 source model
groxaxo/gemma4-prometheus-workflow	Reproducible scripts, config, and checkpoint journal
groxaxo/gemma4-prometheus-fixes	All local patches applied to make this work
google/gemma-4-31B-it	Original base model

Quantization details

Parameter	Value
Bits	4
Group size	128
Format	GPTQ
Symmetric	Yes
desc_act	No
Size (disk)	17.91 GiB (5 shards)
Reduction	69.3% vs BF16 merged
Tool	gptqmodel 5.8.0
Calibration	16 samples (8 benign + 8 adversarial)
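
As a rough cross-check of the table above, the storage cost per quantized weight can be estimated from the bit width and the group size. The sketch below assumes the common GPTQ layout of one FP16 scale and one packed 4-bit zero point per group; the function name and the overhead figures are illustrative assumptions, not values read from this checkpoint.

```python
def effective_bits_per_weight(bits: int = 4, group_size: int = 128,
                              scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Average bits stored per quantized weight, including per-group metadata.

    scale_bits and zero_bits are assumptions about the packed format.
    """
    return bits + (scale_bits + zero_bits) / group_size


bpw = effective_bits_per_weight()
print(f"{bpw:.5f} bits per quantized weight")  # 4.15625 bits per quantized weight
```

The on-disk size is larger than this figure alone would suggest, since any tensors left unquantized are stored at their original precision.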

How to run

Requirements

1–2 × GPU with ≥ 20 GiB total VRAM (single 24 GB GPU works)
gptqmodel >= 5.8.0 with the Gemma4 patches applied (see "Patches required" below)

Install

pip install "gptqmodel>=5.8.0"

Note: The standard gptqmodel package does not include Gemma4 support. Apply the patches from groxaxo/gemma4-prometheus-fixes before loading this model. The patches add Gemma4QModel and fix the alternating rotary-embedding shape mismatch.

Inference

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

from gptqmodel import GPTQModel
from transformers import AutoTokenizer
import torch

model_id = "groxaxo/gemma4-prometheus-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GPTQModel.load(
    model_id,
    device_map="auto",          # single GPU or multi-GPU pipeline
    # max_memory={0: "22GiB"},  # uncomment to set per-GPU budget
)
model.eval()

messages = [{"role": "user", "content": "Explain gradient descent."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,       # suppress chain-of-thought tokens
)
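# The chat template is assumed to emit BOS already (as earlier Gemma templates do),
# so special tokens are not added a second time here.
ids = tokenizer(text, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)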

with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=512,
        do_sample=False,
        temperature=None,
        top_p=None,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True))

Two-GPU pipeline parallel (2 × 24 GiB)

model = GPTQModel.load(
    model_id,
    device_map="balanced",
    max_memory={0: "22GiB", 1: "22GiB"},
)

Evaluation results

All tests were run on 2 × RTX 3090 (24 GB each) in pipeline-parallel mode, with the layers split across the two GPUs. True tensor parallelism (TP=2) requires vLLM, which does not yet support the gemma4 architecture natively.

Coherence test (5/5 passed ✅)

Prompt	Response excerpt
Explain how neural networks learn from data.	"…a neural network learns by trial and error. It makes a guess, finds out how wrong that guess was, and then adjusts its internal settings…"
Supervised vs unsupervised learning?	"…In supervised learning, the data is 'labeled' (it has an answer key)…In unsupervised learning, the data is 'unlabeled'…"
Gradient descent?	"…Gradient Descent is an optimization algorithm used to minimize a function…the 'engine' used to train models by minimizing the Cost Function…"
Transformers in NLP?	"…a Transformer is a deep learning architecture designed to process sequential data…focusing on the most important parts of the input, regardless of how far apart they are…"
What is quantization?	"…quantization is the process of reducing the precision of the numbers used to represent a neural network's weights and activations…"

Context length (2 × RTX 3090, GPTQ-4bit, no flash-attn)

KV Cache	Max Tokens	Bottleneck
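FP16	6 144	Attention score matrix, O(n²) memory
FP8 (software)	6 144	Same: attention matrix dominates

Without flash-attn, the limit is set by the memory needed to materialize the attention score matrix (O(n²) per layer in the sequence length n), not by KV cache storage. Switching the KV cache to FP8 therefore does not raise the limit.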
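
To see why the score matrix rather than the KV cache sets the limit, the following back-of-the-envelope sketch compares the two for one full-attention layer at 6 144 tokens. It assumes that eager attention materializes an FP16 score tensor of shape (query heads, n, n) for a batch of one, and it takes the head counts from the Architecture notes; the function names are illustrative.

```python
GIB = 1024 ** 3


def attn_scores_bytes(n_tokens: int, n_query_heads: int = 32,
                      bytes_per_elem: int = 2) -> int:
    """Bytes for one layer's materialized attention score tensor (batch size 1)."""
    return n_query_heads * n_tokens * n_tokens * bytes_per_elem


def kv_cache_bytes(n_tokens: int, n_kv_heads: int = 4, head_dim: int = 512,
                   bytes_per_elem: int = 2) -> int:
    """Bytes for one full-attention layer's K and V cache (batch size 1)."""
    return 2 * n_tokens * n_kv_heads * head_dim * bytes_per_elem


n = 6144
print(f"scores:  {attn_scores_bytes(n) / GIB:.2f} GiB")                    # 2.25 GiB
print(f"KV FP16: {kv_cache_bytes(n) / GIB:.3f} GiB")                       # 0.047 GiB
print(f"KV FP8:  {kv_cache_bytes(n, bytes_per_elem=1) / GIB:.3f} GiB")     # 0.023 GiB
```

Under these assumptions, halving the KV cache element size saves tens of MiB per layer, while the score tensor costs gigabytes and grows quadratically with n.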

With flash-attn installed (estimated):

KV Cache	Estimated Max Tokens
FP16	~113 000
FP8	~226 000

Recommendation: install flash-attn (pip install flash-attn --no-build-isolation) to unlock much longer contexts. The model's maximum position is 262 144 tokens.

Perplexity (WikiText-2)

Model	PPL	Notes
Merged (BnB-8bit reference)	1782.3	Chat model on raw text — PPL magnitude expected to be high
GPTQ-4bit (this model)	1815.8	+1.9% vs reference

ΔPPL = +1.9% is the meaningful signal. The absolute values are high because instruction-tuned models trained on chat data have poor raw-text likelihood.
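
The relative change quoted above follows directly from the two table values. Because perplexity is the exponential of the mean per-token negative log-likelihood, the same gap can also be expressed as a difference in nats per token. The variable names below are illustrative.

```python
import math

ppl_ref = 1782.3   # merged model, BnB-8bit reference
ppl_gptq = 1815.8  # this model

# Relative perplexity change in percent.
rel_change_pct = (ppl_gptq / ppl_ref - 1.0) * 100.0

# PPL = exp(mean NLL), so the log ratio is the increase in mean NLL.
delta_nll = math.log(ppl_gptq) - math.log(ppl_ref)

print(f"relative PPL change: {rel_change_pct:+.1f}%")        # +1.9%
print(f"mean NLL increase:   {delta_nll:.4f} nats/token")    # 0.0186 nats/token
```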

KL divergence (this model vs merged reference)

Metric	Value
Direction	KL(merged_bnb8 ‖ gptq_4bit)
Mean KL	4.77 nats
Std KL	3.65 nats
Prompts	8 ML-domain questions
Vocab comparison	Top-1000 tokens

KL ~4.77 nats is a typical result for 4-bit GPTQ on a 30B-class model. Part of the divergence is attributable to quantization noise in the bnb-8bit reference itself; KL measured against the BF16 source model would likely be slightly lower.
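
For readers who want to compute a comparable number, here is a minimal sketch of a top-k restricted KL computation for a single next-token position. How the original evaluation restricted the vocabulary is not stated; this sketch assumes the top-k tokens are chosen by the reference model and both distributions are renormalized over that subset. It uses plain Python lists in place of real logits, and all function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def topk_kl(ref_logits, quant_logits, k=1000):
    """KL(ref || quant) in nats over the reference model's top-k tokens.

    Assumption: both distributions are renormalized over the selected subset.
    """
    top = sorted(range(len(ref_logits)),
                 key=lambda i: ref_logits[i], reverse=True)[:k]
    log_p = log_softmax([ref_logits[i] for i in top])
    log_q = log_softmax([quant_logits[i] for i in top])
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits for a 4-token vocabulary, restricted to the top 3.
ref = [2.0, 1.0, 0.5, -1.0]
quant = [1.8, 1.2, 0.4, -3.0]
print(topk_kl(ref, quant, k=3))
```

A per-prompt score would then average this value over the evaluated positions, and the reported mean and standard deviation would be taken across prompts.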

Architecture notes (Gemma4 quirks)

Feature	Detail
Text layers	60, alternating sliding-window / full attention
Sliding attention	window=1024, 16 KV heads, head_dim=256
Global attention	4 KV heads, head_dim=512
GQA	32 query heads
Max position	262 144 tokens
VLM wrapper	Vision tower present; text-only inference supported

Why layer_modules_strict = False: sliding-window attention layers omit v_proj, so a strict module check would fail. The flag allows partial matches.

Rotary embedding fix: Gemma4 alternates sliding_attention (head_dim=256) and full_attention (head_dim=512). gptqmodel cached the first layer's position_embeddings and replayed them for all layers, causing a shape mismatch at the first global-attention layer (layer 5). The fix regenerates position_embeddings per layer using the correct layer_type.
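
The shape mismatch and the fix can be illustrated without loading the model. The sketch below is a toy stand-in, not the patched gptqmodel code: rotary_shape is a hypothetical helper that returns only the shape of the cos/sin tables, and the layer list assumes five sliding-attention layers followed by one full-attention layer, consistent with layer 5 being the first global layer.

```python
HEAD_DIM = {"sliding_attention": 256, "full_attention": 512}


def rotary_shape(layer_type: str, seq_len: int) -> tuple:
    """Hypothetical stand-in: shape of the cos/sin tables for one layer type."""
    return (seq_len, HEAD_DIM[layer_type])


def cached_first_layer(layer_types, seq_len):
    """Buggy behavior: reuse layer 0's position embeddings for every layer."""
    first = rotary_shape(layer_types[0], seq_len)
    return [first for _ in layer_types]


def per_layer(layer_types, seq_len):
    """Fixed behavior: look up position embeddings by each layer's own type."""
    return [rotary_shape(t, seq_len) for t in layer_types]


# Assumed layout: layers 0-4 sliding, layer 5 full attention.
layer_types = ["sliding_attention"] * 5 + ["full_attention"]
expected = [HEAD_DIM[t] for t in layer_types]

buggy = [shape[1] for shape in cached_first_layer(layer_types, 8)]
fixed = [shape[1] for shape in per_layer(layer_types, 8)]

first_bad = next(i for i, (b, e) in enumerate(zip(buggy, expected)) if b != e)
print("first mismatching layer with cached embeddings:", first_bad)  # 5
print("per-layer lookup matches:", fixed == expected)                # True
```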

Patches required

See groxaxo/gemma4-prometheus-fixes for full patch diffs and instructions.

Required patches to gptqmodel:

gptqmodel/models/definitions/gemma4.py — new Gemma4QModel class
gptqmodel/models/auto.py — "gemma4" -> Gemma4QModel mapping
gptqmodel/looper/module_looper.py — free-memory device scheduling + per-layer rotary fix

Quantization config

{
  "bits": 4,
  "group_size": 128,
  "format": "gptq",
  "desc_act": false,
  "sym": true,
  "quant_method": "gptq"
}

Citation / acknowledgements

Base model: google/gemma-4-31B-it
Steering: Prometheus (local)
Quantization: gptqmodel v5.8.0
