Z-Image-Turbo Student Adapter

Text-encoder distillation for VRAM-efficient Z-Image inference.

Official base model: Tongyi-MAI/Z-Image-Turbo


Overview

This project demonstrates that Z-Image-Turbo's VRAM usage can be reduced by replacing its original Qwen3-4B text encoder with a distilled Qwen3-1.7B student plus a lightweight adapter. The student and adapter are trained via hidden-state matching against the original 4B encoder.

No other optimizations are applied: the DiT transformer and VAE are unchanged. The VRAM savings come entirely from the smaller text encoder.

Results

(Side-by-side sample images: Original vs. Qwen3-1.7B student)

Architecture

Original:  Prompt → Qwen3-4B (36L, 2560d) → hidden_states[-2] → DiT
Student:   Prompt → Qwen3-1.7B (28L, 2048d) → hidden_states[-2] → Adapter → DiT

The adapter uses prompt-dependent cross-attention queries (no static learned queries), converting student hidden states to teacher-equivalent conditioning vectors before they reach the DiT.
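
The exact adapter internals are not published on this card, but a minimal sketch consistent with the Training Details below (2 cross-attention blocks, dim=1024, 8 heads, ff_mult=4, mapping 2048-d student states to 2560-d teacher states) might look like this; the module layout itself is an assumption:

import torch
import torch.nn as nn

class XAttnBlock(nn.Module):
    # One cross-attention block. Queries are projections of the student
    # hidden states themselves (prompt-dependent), not learned constants.
    def __init__(self, dim=1024, heads=8, ff_mult=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_ff = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, dim * ff_mult),
            nn.GELU(),
            nn.Linear(dim * ff_mult, dim),
        )

    def forward(self, x, context):
        attn_out, _ = self.attn(self.norm_q(x), context, context)
        x = x + attn_out
        return x + self.ff(self.norm_ff(x))

class StudentAdapter(nn.Module):
    # Maps 2048-d student hidden states to 2560-d teacher-equivalent states.
    def __init__(self, d_student=2048, d_teacher=2560, dim=1024, n_blocks=2):
        super().__init__()
        self.proj_in = nn.Linear(d_student, dim)
        self.blocks = nn.ModuleList(XAttnBlock(dim) for _ in range(n_blocks))
        self.proj_out = nn.Linear(dim, d_teacher)

    def forward(self, student_hidden):           # (B, T, 2048)
        x = self.proj_in(student_hidden)         # (B, T, 1024)
        for blk in self.blocks:
            x = blk(x, x)                        # queries derived from the prompt
        return self.proj_out(x)                  # (B, T, 2560)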

The student receives the same chat-template-formatted prompts as the teacher, with a curriculum that anneals from the teacher format to raw prompts so the encoder is deployment-ready.
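
As an illustration only (the actual schedule is not documented here), the annealing could be as simple as a step-dependent probability of dropping the chat template:

import random

def training_prompt(prompt, step, total_steps, format_chat):
    # format_chat: callable that applies the teacher's chat template.
    # Hypothetical linear schedule: raw prompts become more likely over time.
    p_raw = step / total_steps
    return prompt if random.random() < p_raw else format_chat(prompt)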

Benchmarks

Measured on a T4 (22 GB VRAM) with torch.bfloat16, guidance_scale=0.0, 9 inference steps, 1024×1024.

Metric           Original (4B)   Student (1.7B)   Savings
Weight VRAM      20.70 GB        16.30 GB         4.40 GB (21%)
Peak VRAM        21.35 GB        16.76 GB         4.59 GB (22%)
Generation time  3.9 s           3.5 s            —

The student+adapter brings peak VRAM from 21.4 GB down to 16.8 GB, fitting comfortably on a 22 GB T4 where the original barely fits. The DiT transformer and VAE are unchanged (12 GB total); all savings come from the text encoder.
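
For reference, peak usage can be checked with PyTorch's CUDA memory counters (a minimal sketch; pipe is loaded as in the Quick Start below, and allocator-level numbers may differ slightly from what nvidia-smi reports):

import torch

torch.cuda.reset_peak_memory_stats()
_ = pipe(
    'a serene mountain lake at sunrise, oil painting',
    num_inference_steps=9,
    guidance_scale=0.0,
).images[0]
print(f'Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB')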

Quick Start

from huggingface_hub import snapshot_download
from diffusers import ZImagePipeline
from transformers import AutoModel
from pathlib import Path
import torch

# Download the repo locally
repo_dir = Path('./zimage-student-adapter')
snapshot_download(
    'SearchingMan/Z-Image-Turbo-student-adapter',
    local_dir=str(repo_dir),
    local_dir_use_symlinks=False,
)

# Two-stage loading (required: diffusers does not forward trust_remote_code
# to component loaders)
text_encoder = AutoModel.from_pretrained(
    str(repo_dir / 'text_encoder'),
    trust_remote_code=True,
    dtype=torch.bfloat16,
)
pipe = ZImagePipeline.from_pretrained(
    str(repo_dir),
    dtype=torch.bfloat16,
    text_encoder=text_encoder,
    trust_remote_code=True,
).to('cuda')

# Generate
image = pipe(
    prompt='a serene mountain lake at sunrise, oil painting',
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator(device='cpu').manual_seed(42),
).images[0]
image.save('output.png')

Important: always use a CPU generator (torch.Generator(device='cpu')); CUDA generators cause device-mismatch errors with diffusers' mixed scheduler placement.

Limitations

  • No end-to-end quality guarantees. The student is trained to match hidden states, not final images. Visual quality may differ from the original Z-Image-Turbo.
  • VRAM savings are from the text encoder only. The DiT (12 GB) and VAE (0.5 GB) are unchanged. With guidance_scale=0 and 9 steps the pipeline peaks at ~17 GB, fitting a 22 GB T4/L4.
  • Chat template required. The text encoder expects the same apply_chat_template(enable_thinking=True, add_generation_prompt=True) format used during training (see the sketch after this list).
  • Single-prompt only. Batch generation shares the same text encoder forward, but DiT processes samples as a list (not batched tensor), so throughput is per-sample.
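
For illustration, the chat-template requirement above corresponds to a formatting step along these lines; this is only relevant if you encode prompts outside the pipeline, and the base-model tokenizer ID is an assumption:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('Qwen/Qwen3-1.7B')  # student's base tokenizer
formatted = tok.apply_chat_template(
    [{'role': 'user', 'content': 'a serene mountain lake at sunrise'}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)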

Training Details

  • Student: Qwen/Qwen3-1.7B (28 layers, hidden_size=2048)
  • Teacher: Tongyi-MAI/Z-Image-Turbo text encoder (Qwen3-4B, 36 layers, hidden_size=2560)
  • Adapter: 2 XAttn blocks, dim=1024, 8 heads, ff_mult=4 (~39M params)
  • Tokenizers: Same Qwen2Tokenizer for both student and teacher (same tokenizer family)
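
Putting the pieces together, a single hidden-state-matching step could look like the following; the plain MSE objective and the frozen-teacher setup are assumptions, since the actual training loss is not specified here:

import torch
import torch.nn.functional as F

def distill_step(prompts, tokenizer, teacher, student, adapter, optimizer):
    # Both models share the same tokenizer (see Tokenizers above).
    batch = tokenizer(prompts, return_tensors='pt', padding=True).to('cuda')
    with torch.no_grad():  # teacher (Qwen3-4B encoder) is frozen
        target = teacher(**batch, output_hidden_states=True).hidden_states[-2]
    student_h = student(**batch, output_hidden_states=True).hidden_states[-2]
    pred = adapter(student_h)            # (B, T, 2560), teacher-equivalent
    loss = F.mse_loss(pred, target)      # assumed matching objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()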

License

Same as the base model: Tongyi-MAI/Z-Image-Turbo
