VMamba: Visual State Space Model
Paper: arXiv 2401.10166
A novel attention-free architecture for image generation, super-resolution, artifact removal, and artistic style transfer.
Image → PixelUnshuffle stem → MobileConv stages → Parallel 2D Mamba blocks
→ Multi-scale latent (z_base H/16, z_detail H/8, z_style global)
→ Light parallel decoder with FiLM style modulation → Reconstructed image
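The PixelUnshuffle stem can be sketched in a few lines of PyTorch. This is an illustrative module, not the repository's implementation; the class name and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class UnshuffleStem(nn.Module):
    """Hypothetical sketch of the encoder stem: PixelUnshuffle packs each
    2x2 neighbourhood into channels (lossless), then a conv mixes them."""
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(2)               # (C,H,W) -> (4C,H/2,W/2)
        self.proj = nn.Conv2d(in_ch * 4, out_ch, 3, padding=1)

    def forward(self, x):
        return self.proj(self.unshuffle(x))

x = torch.randn(1, 3, 64, 64)
print(UnshuffleStem()(x).shape)  # torch.Size([1, 32, 32, 32])
```

Because the unshuffle step is a pure rearrangement, no information is discarded before the first convolution, which is what the table below means by "lossless initial features".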
| Component | Choice | Why |
|---|---|---|
| Backbone | MobileConv + Parallel 2D Mamba | Fast, efficient, attention-free |
| Downsampling | PixelUnshuffle → stride-2 conv | Lossless initial features |
| Upsampling | PixelShuffle (sub-pixel) | Mobile-friendly, no checkerboard |
| Latent | Multi-scale (base/detail/style) | Controllable, prevents collapse |
| Style control | FiLM conditioning | Lightweight, multiplicative |
| Global context | 4-dir cross-scan SSM | O(n) complexity, no attention |
| Local context | Depthwise separable conv + SE | Standard mobile building block |
| Training | Progressive resolution + KL warmup | Stable convergence |
| Loss | L1 + VGG + PatchGAN + edge + KL | Comprehensive quality |
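The 4-direction cross-scan is what lets a 1D state space model see a 2D feature map. A minimal sketch of the scan itself (the SSM that consumes each sequence is omitted; the function name is illustrative):

```python
import torch

def cross_scan(x):
    """Flatten a (B, C, H, W) feature map into 4 scan orders:
    row-major, column-major, and their reverses -> (B, 4, C, H*W)."""
    row = x.flatten(2)                        # row-major scan
    col = x.transpose(2, 3).flatten(2)        # column-major scan
    return torch.stack([row, col, row.flip(-1), col.flip(-1)], dim=1)

x = torch.randn(2, 8, 4, 4)
print(cross_scan(x).shape)  # torch.Size([2, 4, 8, 16])
```

Each of the four sequences is processed by the SSM in O(n) and the outputs are merged back, so every position receives context from all four directions without any attention map.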
The latent splits into z_base (structure), z_detail (texture), and z_style (global style).

| Config | Encoder | Decoder | Total | Target |
|---|---|---|---|---|
| pmavae_tiny | 0.56M | 1.91M | 2.47M | Testing |
| pmavae_small | 2.00M | 4.27M | 6.27M | Free Colab T4 |
| pmavae_base | 5.18M | 9.83M | 15.01M | Colab Pro / better GPU |
- `z_base`: H/16 × W/16 × 24-32 → structure, composition, objects
- `z_detail`: H/8 × W/8 × 6-8 → texture, brush strokes, edges
- `z_style`: 1 × 1 × 96-128 → global style vector
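For a concrete input resolution, the latent shapes follow directly from those strides. A quick sketch (channel counts here use the upper end of the listed ranges, as an assumption):

```python
def latent_shapes(H, W, base_ch=32, detail_ch=8, style_dim=128):
    """Per-latent (C, H, W) shapes implied by the H/16, H/8, and global strides."""
    return {
        "z_base":   (base_ch, H // 16, W // 16),
        "z_detail": (detail_ch, H // 8, W // 8),
        "z_style":  (style_dim, 1, 1),
    }

print(latent_shapes(512, 512))
# {'z_base': (32, 32, 32), 'z_detail': (8, 64, 64), 'z_style': (128, 1, 1)}
```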
This separation enables:

- Swapping z_style between images (style transfer)
- Resampling z_detail while keeping z_base (texture variation)
- Editing z_detail while preserving structure

Loss = L1 + 0.5 × VGG_perceptual + 0.1 × edge_sobel + β × KL_free_bits + λ × PatchGAN
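The VGG and PatchGAN terms require a pretrained network and a discriminator, so the sketch below covers only the reconstruction-side terms: L1, the Sobel edge loss, and the free-bits KL. Weights follow the formula above; the free-bits threshold is an assumption.

```python
import torch
import torch.nn.functional as F

def sobel_edges(x):
    """Sobel gradients of the grayscale image: (B,3,H,W) -> (B,2,H,W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    k = torch.stack([kx, kx.t()]).unsqueeze(1)      # (2,1,3,3) x/y kernels
    return F.conv2d(x.mean(1, keepdim=True), k, padding=1)

def kl_free_bits(mu, logvar, free_bits=0.1):
    """Gaussian KL with a per-element floor so easy dims can't collapse to 0."""
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar)
    return torch.clamp(kl, min=free_bits).mean()

def recon_loss(pred, target):
    """L1 + 0.1 * Sobel edge term, per the loss formula above."""
    return (F.l1_loss(pred, target)
            + 0.1 * F.l1_loss(sobel_edges(pred), sobel_edges(target)))
```

The free-bits clamp means a unit whose KL already sits below the floor contributes a constant, so the optimizer has no incentive to shrink it further, which is what "prevents collapse" refers to in the component table.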
Phase 1: 256×256 → Learn structure
Phase 2: 384×384 → Refine texture
Phase 3: 512×512 → Full detail
Phase 4: FHD tiled → High-resolution fine-tuning
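The schedule above, plus the KL warmup from the training row of the component table, can be sketched with two small helpers. Step counts and warmup length are illustrative assumptions, not values from the repo:

```python
PHASES = [256, 384, 512]  # phase 4 fine-tunes on tiled FHD crops instead

def phase_resolution(step, steps_per_phase=10_000):
    """Training resolution for the current step under the progressive schedule."""
    return PHASES[min(step // steps_per_phase, len(PHASES) - 1)]

def kl_beta(step, warmup_steps=5_000, beta_max=1.0):
    """Linear KL warmup: ramp the KL weight from 0 to beta_max."""
    return beta_max * min(1.0, step / warmup_steps)

print(phase_resolution(25_000), kl_beta(2_500))  # 512 0.5
```

Ramping β from zero lets the model fit reconstructions first and only gradually pay the KL cost, which is why the table credits this combination with "stable convergence".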
The decoder is designed for mobile deployment.
- `model.py` → Full PMA-VAE architecture (encoder, decoder, SSM blocks)
- `losses.py` → Loss functions (VGG perceptual, PatchGAN, KL free bits, edge loss)
- `train.py` → Training script with progressive resolution, checkpoint management
- `PMA_VAE_Colab_Training.ipynb` → Complete Colab notebook

License: MIT