# EAGLE3 Draft Model for GLM-4.7-Flash

An EAGLE3 draft model that accelerates inference for zai-org/GLM-4.7-Flash (30B MoE, ~3B active) through speculative decoding.

1.66x mean speedup at B=1 across 4 benchmarks on a single H200.


## Results

Verified 2026-04-12 on 1x NVIDIA H200 144GB, TP=1, FlashInfer, temp=0, max_tokens=512.

### B=1 (single request)

| Dataset | Baseline tok/s | Eagle3 tok/s | Speedup | Accept Rate | Accept Length |
|---|---|---|---|---|---|
| HumanEval (75) | 130.2 | 231.8 | 1.78x | 57.1% | 3.42 |
| Terminal-Bench (112) | 128.0 | 220.2 | 1.72x | 62.9% | 3.77 |
| MT-Bench (154) | 129.2 | 207.1 | 1.60x | 47.9% | 2.88 |
| SWEBench-Verified (75) | 127.4 | 194.4 | 1.53x | 51.7% | 3.10 |
| **Mean** | 128.7 | 213.4 | **1.66x** | 54.9% | 3.29 |

### B=32 (32 concurrent requests)

| Dataset | Baseline tok/s | Eagle3 tok/s | Speedup |
|---|---|---|---|
| SWEBench-Verified | 1,415.3 | 1,830.4 | 1.29x |
| HumanEval | 1,595.8 | 1,851.5 | 1.16x |
| MT-Bench | 1,489.9 | 1,627.9 | 1.09x |
| Terminal-Bench | 1,479.4 | 1,614.0 | 1.09x |
| **Mean** | 1,495.1 | 1,731.0 | **1.16x** |

Protocol: B=1 runs 5 warmup + 20 measured requests sequentially; B=32 runs 15 warmup + 60 measured requests at 32-way concurrency. All metrics are read from the server-side Prometheus endpoint.
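As a sanity check, the derived columns above are plain arithmetic over the measured tok/s values; a minimal sketch re-deriving the B=1 speedup and mean rows:

```python
# Re-derive the B=1 table's Speedup column and Mean row from the tok/s columns.
# Values are copied from the results table above; speedup = eagle3 / baseline,
# and the Mean row is the unweighted arithmetic mean of the four dataset rows.
b1 = {
    "HumanEval":         (130.2, 231.8),
    "Terminal-Bench":    (128.0, 220.2),
    "MT-Bench":          (129.2, 207.1),
    "SWEBench-Verified": (127.4, 194.4),
}

speedups = {name: eagle / base for name, (base, eagle) in b1.items()}
mean_base = sum(base for base, _ in b1.values()) / len(b1)
mean_eagle = sum(eagle for _, eagle in b1.values()) / len(b1)

print({name: round(s, 2) for name, s in speedups.items()})   # matches Speedup column
print(round(mean_base, 1), round(mean_eagle, 1))             # matches Mean row tok/s
print(round(mean_eagle / mean_base, 2))                      # matches mean speedup
```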


## Architecture

| Parameter | Value |
|---|---|
| Type | LlamaForCausalLMEagle3 |
| Hidden Size | 2048 |
| Heads / KV Heads | 16 / 4 (GQA) |
| Head Dimension | 128 |
| Intermediate Size | 8192 |
| Layers | 1 |
| Vocab Size | 154,880 (draft: 32,000) |
| Size | 278 MB |
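The draft head scores only a 32,000-token subset of the target's 154,880-token vocabulary, so the checkpoint must carry a table translating draft ids back to target ids. A toy sketch of that lookup — the name `d2t`, the direct-index convention, and the toy mapping are illustrative assumptions, not this checkpoint's actual buffers:

```python
# Illustrative sketch: mapping draft-vocab ids back to target-vocab ids.
# The draft head predicts over a reduced 32,000-entry vocabulary; a lookup
# table (here called d2t, an assumed name) translates its argmax ids into
# the target model's 154,880-entry vocabulary before verification.
DRAFT_VOCAB = 32_000
TARGET_VOCAB = 154_880

# Toy mapping for demonstration only; real tables are learned at training time.
d2t = [i * (TARGET_VOCAB // DRAFT_VOCAB) for i in range(DRAFT_VOCAB)]

def draft_to_target(draft_ids):
    """Translate draft-token ids emitted by the draft head into target ids."""
    return [d2t[i] for i in draft_ids]

print(draft_to_target([0, 1, 31_999]))
```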

## Training

Trained on 54K samples (45% ShareGPT, 35% UltraChat, 20% PerfectBlend) for 3 epochs with LR=1e-4, max_length=1024, batch_size=1, via SpecForge with `--target-model-backend sglang`.

Best training accuracy (acc_0): 79.2%. Note that training accuracy does not predict the inference accept rate — there is a 30–60 percentage-point gap between the two.
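For intuition on how acceptance translates into accept length: under a simplified chain-draft model (i.i.d. per-token acceptance probability p over k chained draft tokens — a simplification of the topk=4 tree drafting actually used here), each verify step emits the accepted draft prefix plus one token from the target, so the expected tokens per step is 1 + p + p² + … + pᵏ:

```python
# Simplified chain-speculation model (not the tree drafting used above):
# with i.i.d. per-token acceptance probability p and k chained draft tokens,
# a verify step yields the accepted prefix plus one target-model token, so
# the expected tokens per step is the geometric partial sum 1 + p + ... + p^k.
def expected_tokens_per_step(p: float, k: int) -> float:
    return sum(p ** i for i in range(k + 1))

# Illustrative numbers only; the accept rates reported above come from
# tree-based drafting and are not directly comparable.
print(round(expected_tokens_per_step(0.55, 3), 2))
```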


## Usage

Benchmarked with our SGLang fork (`tails-mpt/sglang`, commit `63291f7f51`). Upstream SGLang may produce different speedups due to differences in scheduling overhead.

```shell
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-Flash-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 1 --trust-remote-code --port 30000 \
  --enable-metrics --mem-fraction-static 0.65
```
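Once the server is up, speculative decoding is transparent to clients, which talk to the usual OpenAI-compatible endpoint. A hypothetical request against the port above (prompt and parameters are illustrative):

```shell
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.7-Flash",
    "messages": [{"role": "user", "content": "Write a Python function that checks if a number is prime."}],
    "max_tokens": 512,
    "temperature": 0
  }'
```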

Pinned dependencies: `sgl-kernel 0.3.18.post2`, `flashinfer 0.6.6`, `torch 2.9.1+cu126`.

Verify accept rate > 0% after startup to confirm the draft model loaded correctly.
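A quick way to do that is to scrape the Prometheus endpoint exposed by `--enable-metrics`. Below is a sketch that parses a canned sample payload; the metric name `sglang:spec_accept_length` and its label are assumptions — check the actual names in your server's `/metrics` output:

```python
import re

# Parse a Prometheus text-format payload for a speculative-decoding gauge.
# The metric name and label below are assumptions for illustration; real
# names vary across SGLang versions, so inspect your own /metrics scrape.
SAMPLE_SCRAPE = """\
# HELP sglang:spec_accept_length Average accepted tokens per verify step
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{model_name="zai-org/GLM-4.7-Flash"} 3.42
"""

def read_gauge(payload: str, name: str) -> float:
    """Return the value of a gauge, with or without a label block."""
    match = re.search(rf"^{re.escape(name)}(?:{{[^}}]*}})?\s+([0-9.eE+-]+)$",
                      payload, re.MULTILINE)
    if match is None:
        raise KeyError(name)
    return float(match.group(1))

accept_length = read_gauge(SAMPLE_SCRAPE, "sglang:spec_accept_length")
# Accept length of 1.0 would mean drafts are never accepted (draft not loaded).
assert accept_length > 1.0
print(accept_length)
```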


## Citation

```bibtex
@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and others},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}
```