This model was contributed to Hugging Face Transformers on 2026-06-12.

MiniMax-M3-VL

Overview

MiniMax-M3-VL is the vision-language member of the MiniMax-M3 family. It pairs a CLIP-style vision tower (Conv3d patch embedding with 3D rotary position embeddings) with the MiniMax-M3 text backbone, a mixed dense/sparse Mixture-of-Experts decoder that uses SwiGLU-OAI gated experts and a lightning indexer for block-sparse attention.

Architecture

Block-sparse attention (Lightning Indexer)

Every layer uses grouped-query attention (GQA, num_key_value_heads = 4) with per-head QK-norm and partial RoPE applied to the first rotary_dim dimensions of each head. config.layer_types[i] then selects either "full_attention" (dense causal) or "minimax_m3_sparse", in which a MiniMaxM3VLIndexer decides, per query, which blocks of keys the main attention may see.
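The GQA head grouping and partial RoPE described above can be sketched in plain Python. This is a minimal illustration, not the model's implementation: the helper names, the 32 query heads, the rotate-half pairing, and the base of 10000 are all assumptions made for the example; only num_key_value_heads = 4 and the "first rotary_dim dimensions" rule come from the text.

```python
import math


def kv_head_for(query_head, num_attention_heads, num_key_value_heads=4):
    """Map a query head to the key/value head it shares under GQA."""
    group_size = num_attention_heads // num_key_value_heads
    return query_head // group_size


def apply_partial_rope(vec, position, rotary_dim, base=10000.0):
    """Rotate only the first `rotary_dim` entries of `vec`; pass the rest through.

    Assumes the rotate-half pairing (entry i is paired with entry i + rotary_dim // 2).
    """
    half = rotary_dim // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dim)
        cos, sin = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * cos - b * sin
        out[i + half] = b * cos + a * sin
    return out


# Hypothetical head: 6 dimensions, of which the first 4 are rotary.
head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rotated = apply_partial_rope(head, position=3, rotary_dim=4)
```

The trailing dimensions (here the last two) carry no positional signal, and each rotation preserves the norm of the rotated slice.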

The indexer scores every key, then max-pools those per-key scores into blocks of index_block_size keys, so selection happens at the granularity of a block of keys. Per query, it keeps the top index_topk_blocks key blocks plus the always-on index_local_blocks local-window blocks (under block-level causality), then broadcasts the per-block 0/-inf choice back onto every key in the block. The result is a [B, 1, S_q, S_k] additive bias that is summed onto the causal mask. In theory, attention then only needs to be computed over the selected key blocks, but Transformers does not yet include kernels that exploit this sparsity efficiently; support is being added to kernels.
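The selection steps above can be sketched for a single batch element and head in plain Python. This is a simplified illustration under stated assumptions, not the MiniMaxM3VLIndexer code: the function name is hypothetical, queries and keys are assumed to share positions (prefill, S_q == S_k), the top-k is taken over all causally visible blocks and unioned with the local window, and ties go to earlier blocks. The resulting bias would still be added to the ordinary causal mask, which hides future keys inside the query's own block.

```python
NEG_INF = float("-inf")


def block_sparse_bias(scores, index_block_size, index_topk_blocks, index_local_blocks):
    """Return bias[q][k]: 0.0 where key k stays visible to query q, -inf otherwise.

    scores[q][k] is the indexer score of key k for query q.
    """
    num_keys = len(scores[0])
    bias = []
    for q, row in enumerate(scores):
        q_block = q // index_block_size
        # Block-level causality: only blocks up to the query's own block are candidates.
        visible = list(range(q_block + 1))
        # Max-pool the per-key scores inside each candidate block.
        pooled = {
            b: max(row[b * index_block_size:(b + 1) * index_block_size])
            for b in visible
        }
        # Keep the top-k blocks by pooled score.
        top = sorted(visible, key=lambda b: pooled[b], reverse=True)[:index_topk_blocks]
        # The most recent blocks (local window) are always kept.
        local = visible[-index_local_blocks:] if index_local_blocks > 0 else []
        keep = set(top) | set(local)
        # Broadcast the per-block decision back onto every key in the block.
        bias.append(
            [0.0 if (k // index_block_size) in keep else NEG_INF for k in range(num_keys)]
        )
    return bias


# Toy example: 8 positions, blocks of 2 keys, top-1 block plus 1 local block.
scores = [[0.0] * 8 for _ in range(8)]
scores[7] = [0.1, 0.9, 0.2, 0.3, 0.0, 0.1, 0.5, 0.4]
bias = block_sparse_bias(scores, index_block_size=2, index_topk_blocks=1, index_local_blocks=1)
```

For the last query, block 0 wins the top-1 slot (pooled score 0.9) and block 3 is kept as the local window, so only keys 0, 1, 6 and 7 remain visible.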

Vision tower

The vision tower is a MiniMaxM3VLVisionModel: a Conv3d patch embedding over flattened [N_patches, C·T·P·P] input, followed by a stack of CLIP-style encoder layers that carry a 3D rotary position embedding (time / height / width bands). A MiniMaxM3VLPatchMerger then groups spatial_merge_size² patches into the channel dimension, and the 2-layer GELU MiniMaxM3VLMultiModalProjector maps the merged vision features into the text hidden size.
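The shape bookkeeping implied by this pipeline can be worked through with a small helper. This is a sketch for intuition only: the function name and every default value (3 channels, temporal patch size 2, patch size 14, spatial_merge_size 2, vision hidden size 1024) are hypothetical placeholders, not values read from the checkpoint's config.

```python
def vision_token_shapes(
    grid_t,
    grid_h,
    grid_w,
    *,
    in_channels=3,            # hypothetical
    temporal_patch_size=2,    # hypothetical
    patch_size=14,            # hypothetical
    spatial_merge_size=2,     # hypothetical
    vision_hidden_size=1024,  # hypothetical
):
    """Return (rows, columns) of the patch input and of the merged features."""
    if grid_h % spatial_merge_size or grid_w % spatial_merge_size:
        raise ValueError("grid height and width must be divisible by spatial_merge_size")
    merge_area = spatial_merge_size ** 2
    n_patches = grid_t * grid_h * grid_w
    # Flattened patch vector fed to the Conv3d embedding: C * T * P * P.
    flat_patch_dim = in_channels * temporal_patch_size * patch_size * patch_size
    # The merger folds merge_area neighbouring patches into the channel dimension.
    merged_tokens = n_patches // merge_area
    merged_dim = vision_hidden_size * merge_area
    return {
        "patch_input": (n_patches, flat_patch_dim),
        "merged_features": (merged_tokens, merged_dim),
    }


# A single temporal slice with a 32 x 32 patch grid.
shapes = vision_token_shapes(grid_t=1, grid_h=32, grid_w=32)
```

With these placeholder values, 1024 patches of width 1176 become 256 merged tokens of width 4096, which the projector would then map to the text hidden size.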

Usage examples

The example below runs the model on a real image loaded with load_image().

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.image_utils import load_image


model = AutoModelForImageTextToText.from_pretrained(
    "MiniMaxAI/MiniMax-M3-preview", dtype=torch.bfloat16, device_map="auto",
)
processor = AutoProcessor.from_pretrained("MiniMaxAI/MiniMax-M3-preview")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image briefly."},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=[image], text=text, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
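For decoder-only generation, the sequences returned by generate() start with the prompt tokens, so decoding them in full echoes the prompt before the answer. A minimal sketch of trimming the prompt, shown on plain lists so it runs without a model; strip_prompt is a hypothetical helper, and with tensors the equivalent is the slice generated_ids[:, inputs["input_ids"].shape[1]:].

```python
def strip_prompt(sequences, prompt_length):
    """Drop the first `prompt_length` token ids from every generated sequence."""
    return [list(seq[prompt_length:]) for seq in sequences]


# Toy ids: a 3-token prompt followed by 2 newly generated tokens.
generated = [[101, 7, 8, 42, 43]]
new_tokens = strip_prompt(generated, prompt_length=3)
```

Passing only the trimmed ids to processor.batch_decode() would print the model's answer without the prompt text.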

Apple example

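This example asks the model to count and describe the apples in an image, again loading the image with load_image(). Note that the URL in the snippet is the same bee.jpg used in the first example; point it at an image of apples for the question to make sense.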

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.image_utils import load_image


model = AutoModelForImageTextToText.from_pretrained(
    "MiniMaxAI/MiniMax-M3-preview", dtype=torch.bfloat16, device_map="auto",
)
processor = AutoProcessor.from_pretrained("MiniMaxAI/MiniMax-M3-preview")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How many apples are in this image, and what color are they?"},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=[image], text=text, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

Fastest inference configuration

Context	SDPA decode (tok/s)	MSA decode (tok/s)	MSA decode gain	SDPA prefill (ms)	MSA prefill (ms)	MSA prefill speedup
2K	27.8	31.0	+12%	303	257	1.18×
4K	23.4	30.5	+30%	684	460	1.49×
8K	17.8	29.6	+66%	1906	976	1.95×
16K	12.0	27.6	+130%	6110	2344	2.61×
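The two derived columns in the table above follow directly from the measured ones: the decode gain is the relative increase in tokens per second, and the prefill speedup is the ratio of latencies. A short sketch that recomputes them from the table's numbers (the variable names are illustrative):

```python
# (context, SDPA decode tok/s, MSA decode tok/s, SDPA prefill ms, MSA prefill ms)
measurements = [
    ("2K", 27.8, 31.0, 303, 257),
    ("4K", 23.4, 30.5, 684, 460),
    ("8K", 17.8, 29.6, 1906, 976),
    ("16K", 12.0, 27.6, 6110, 2344),
]

decode_gain_pct = []
prefill_speedup = []
for ctx, sdpa_tps, msa_tps, sdpa_ms, msa_ms in measurements:
    # Higher throughput is better: gain = MSA / SDPA - 1, as a percentage.
    decode_gain_pct.append(round((msa_tps / sdpa_tps - 1) * 100))
    # Lower latency is better: speedup = SDPA time / MSA time.
    prefill_speedup.append(round(sdpa_ms / msa_ms, 2))

for (ctx, *_), gain, speedup in zip(measurements, decode_gain_pct, prefill_speedup):
    print(f"{ctx}: decode +{gain}%, prefill {speedup}x")
```

Both advantages grow with context length, which is consistent with the sparse kernel skipping a larger share of key blocks as the sequence gets longer.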

The checkpoint ships in native MXFP8. For decode throughput, the fastest validated configuration is bf16 (dequantized at load) + the MSA block-sparse attention kernel + tensor & expert parallelism + a reduce-overhead cudagraph compile — roughly 31 tok/s decode on 8×B200 at a 2048-token prefill.

Keeping the weights in native FP8 is a memory-footprint option only — it is never faster on this setup. The FP8 Triton experts/linear kernels lower as opaque inductor fallback kernels that cudagraph cannot capture on the hot expert path, so native-FP8 decode measured ~4.2 tok/s (≈7× slower than the bf16 path) even under torch.compile(fullgraph=True). Use FP8 only when the bf16 weights do not fit.

config (sdpa baseline, TP+EP, 2048-token prefill, 8×B200)	decode
bf16 dequantize-at-load + MSA + compile/cudagraph	~31 tok/s
bf16 dequantize-at-load + sdpa + compile/cudagraph	~28 tok/s
native FP8 + compile/cudagraph	~4 tok/s (memory-only, not for speed)

Dequantizing to bf16 only fits with even sharding across GPUs (TP/EP), not with device_map="auto" (pipeline placement OOMs at load). Launch one process per GPU with torchrun:

torchrun --nproc_per_node=8 fastest_m3_vl.py

# fastest_m3_vl.py
import os, sys
import torch
import torch.distributed as dist
from transformers import (
    AutoModelForImageTextToText,
    AutoTokenizer,
    CompileConfig,
    FineGrainedFP8Config,
)
from transformers.distributed import DistributedConfig

# The indexer feeds SDPA an additive float mask; the cuDNN SDP backend segfaults on it (B200).
torch.backends.cuda.enable_cudnn_sdp(False)

model = AutoModelForImageTextToText.from_pretrained(
    "MiniMaxAI/MiniMax-M3-preview",
    dtype=torch.bfloat16,
    # Dequantize the native MXFP8 weights to bf16 at load (the speed win); needs even TP/EP sharding.
    quantization_config=FineGrainedFP8Config(dequantize=True),
    tp_plan="auto",
    distributed_config=DistributedConfig(enable_expert_parallel=True),
    attn_implementation="kernels-staging/msa@v0",  # MSA block-sparse attention kernel
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M3-preview")
messages = [{"role": "user", "content": "Summarize the history of computing."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(f"cuda:{os.environ.get('LOCAL_RANK', '0')}")

generated_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    # Static cache + reduce-overhead cudagraph capture is what pushes decode to ~31 tok/s.
    cache_implementation="static",
    compile_config=CompileConfig(mode="reduce-overhead", fullgraph=True),
)
if int(os.environ.get("RANK", "0")) == 0:
    print(tokenizer.decode(generated_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

# cudagraph-captured NCCL collectives deadlock the NCCL/CUDA destructors at teardown; the output is
# already produced, so hard-exit to skip the hanging cleanup.
if dist.is_initialized():
    sys.stdout.flush()
    os._exit(0)

MiniMaxM3VLConfig

class transformers.MiniMaxM3VLConfig

MiniMaxM3VLTextConfig

class transformers.MiniMaxM3VLTextConfig

MiniMaxM3VLVisionConfig

class transformers.MiniMaxM3VLVisionConfig

MiniMaxM3VLProcessor

class transformers.MiniMaxM3VLProcessor

post_process_image_text_to_text

MiniMaxM3VLImageProcessor

class transformers.MiniMaxM3VLImageProcessor

preprocess

MiniMaxM3VLVideoProcessor

class transformers.MiniMaxM3VLVideoProcessor

MiniMaxM3VLVisionModel

class transformers.MiniMaxM3VLVisionModel

forward

MiniMaxM3VLTextModel

class transformers.MiniMaxM3VLTextModel

forward

MiniMaxM3VLModel

class transformers.MiniMaxM3VLModel

forward

MiniMaxM3VLForCausalLM

class transformers.MiniMaxM3VLForCausalLM

forward

MiniMaxM3SparseForConditionalGeneration

class transformers.MiniMaxM3SparseForConditionalGeneration

forward