VMamba: Visual State Space Model
Paper: arXiv 2401.10166
A novel attention-free architecture for image generation, super-resolution, artifact removal, and artistic style transfer.
Image → PixelUnshuffle stem → MobileConv stages → Parallel 2D Mamba blocks
→ Multi-scale latent (z_base H/16, z_detail H/8, z_style global)
→ Light parallel decoder with FiLM style modulation → Reconstructed image
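The PixelUnshuffle stem can be sketched in a few lines of PyTorch. This is an illustrative module, not the repository's implementation; the class name and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class UnshuffleStem(nn.Module):
    """Hypothetical sketch of the encoder stem: PixelUnshuffle packs each
    2x2 neighbourhood into channels (lossless), then a conv mixes them."""
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(2)               # (C,H,W) -> (4C,H/2,W/2)
        self.proj = nn.Conv2d(in_ch * 4, out_ch, 3, padding=1)

    def forward(self, x):
        return self.proj(self.unshuffle(x))

x = torch.randn(1, 3, 64, 64)
print(UnshuffleStem()(x).shape)  # torch.Size([1, 32, 32, 32])
```

Because the unshuffle step is a pure rearrangement, no information is discarded before the first convolution, which is what the table below means by "lossless initial features".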
| Component | Choice | Why |
|---|---|---|
| Backbone | MobileConv + Parallel 2D Mamba | Fast, efficient, attention-free |
| Downsampling | PixelUnshuffle → stride-2 conv | Lossless initial features |
| Upsampling | PixelShuffle (sub-pixel) | Mobile-friendly, no checkerboard |
| Latent | Multi-scale (base/detail/style) | Controllable, prevents collapse |
| Style control | FiLM conditioning | Lightweight, multiplicative |
| Global context | 4-dir cross-scan SSM | O(n) complexity, no attention |
| Local context | Depthwise separable conv + SE | Standard mobile building block |
| Training | Progressive resolution + KL warmup | Stable convergence |
| Loss | L1 + VGG + PatchGAN + edge + KL | Comprehensive quality |
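The 4-direction cross-scan is what lets a 1D state space model see a 2D feature map. A minimal sketch of the scan itself (the SSM that consumes each sequence is omitted; the function name is illustrative):

```python
import torch

def cross_scan(x):
    """Flatten a (B, C, H, W) feature map into 4 scan orders:
    row-major, column-major, and their reverses -> (B, 4, C, H*W)."""
    row = x.flatten(2)                        # row-major scan
    col = x.transpose(2, 3).flatten(2)        # column-major scan
    return torch.stack([row, col, row.flip(-1), col.flip(-1)], dim=1)

x = torch.randn(2, 8, 4, 4)
print(cross_scan(x).shape)  # torch.Size([2, 4, 8, 16])
```

Each of the four sequences is processed by the SSM in O(n) and the outputs are merged back, so every position receives context from all four directions without any attention map.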
The latent splits into z_base (structure), z_detail (texture), and z_style (global style).

| Config | Encoder | Decoder | Total | Target |
|---|---|---|---|---|
| pmavae_tiny | 0.56M | 1.91M | 2.47M | Testing |
| pmavae_small | 2.00M | 4.27M | 6.27M | Free Colab T4 |
| pmavae_base | 5.18M | 9.83M | 15.01M | Colab Pro / better GPU |
- `z_base`: H/16 × W/16 × 24-32 → structure, composition, objects
- `z_detail`: H/8 × W/8 × 6-8 → texture, brush strokes, edges
- `z_style`: 1 × 1 × 96-128 → global style vector
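For a concrete input resolution, the latent shapes follow directly from those strides. A quick sketch (channel counts here use the upper end of the listed ranges, as an assumption):

```python
def latent_shapes(H, W, base_ch=32, detail_ch=8, style_dim=128):
    """Per-latent (C, H, W) shapes implied by the H/16, H/8, and global strides."""
    return {
        "z_base":   (base_ch, H // 16, W // 16),
        "z_detail": (detail_ch, H // 8, W // 8),
        "z_style":  (style_dim, 1, 1),
    }

print(latent_shapes(512, 512))
# {'z_base': (32, 32, 32), 'z_detail': (8, 64, 64), 'z_style': (128, 1, 1)}
```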
This separation enables:

- Swapping z_style between images (style transfer)
- Resampling z_detail while keeping z_base (texture variation)
- Editing z_detail while preserving structure

Loss = L1 + 0.5 × VGG_perceptual + 0.1 × edge_sobel + β × KL_free_bits + λ × PatchGAN
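The VGG and PatchGAN terms require a pretrained network and a discriminator, so the sketch below covers only the reconstruction-side terms: L1, the Sobel edge loss, and the free-bits KL. Weights follow the formula above; the free-bits threshold is an assumption.

```python
import torch
import torch.nn.functional as F

def sobel_edges(x):
    """Sobel gradients of the grayscale image: (B,3,H,W) -> (B,2,H,W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    k = torch.stack([kx, kx.t()]).unsqueeze(1)      # (2,1,3,3) x/y kernels
    return F.conv2d(x.mean(1, keepdim=True), k, padding=1)

def kl_free_bits(mu, logvar, free_bits=0.1):
    """Gaussian KL with a per-element floor so easy dims can't collapse to 0."""
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar)
    return torch.clamp(kl, min=free_bits).mean()

def recon_loss(pred, target):
    """L1 + 0.1 * Sobel edge term, per the loss formula above."""
    return (F.l1_loss(pred, target)
            + 0.1 * F.l1_loss(sobel_edges(pred), sobel_edges(target)))
```

The free-bits clamp means a unit whose KL already sits below the floor contributes a constant, so the optimizer has no incentive to shrink it further, which is what "prevents collapse" refers to in the component table.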
Phase 1: 256×256 → Learn structure
Phase 2: 384×384 → Refine texture
Phase 3: 512×512 → Full detail
Phase 4: FHD tiled → High-resolution fine-tuning
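The schedule above, plus the KL warmup from the training row of the component table, can be sketched with two small helpers. Step counts and warmup length are illustrative assumptions, not values from the repo:

```python
PHASES = [256, 384, 512]  # phase 4 fine-tunes on tiled FHD crops instead

def phase_resolution(step, steps_per_phase=10_000):
    """Training resolution for the current step under the progressive schedule."""
    return PHASES[min(step // steps_per_phase, len(PHASES) - 1)]

def kl_beta(step, warmup_steps=5_000, beta_max=1.0):
    """Linear KL warmup: ramp the KL weight from 0 to beta_max."""
    return beta_max * min(1.0, step / warmup_steps)

print(phase_resolution(25_000), kl_beta(2_500))  # 512 0.5
```

Ramping β from zero lets the model fit reconstructions first and only gradually pay the KL cost, which is why the table credits this combination with "stable convergence".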
The decoder is designed for mobile deployment.
- `model.py` → Full PMA-VAE architecture (encoder, decoder, SSM blocks)
- `losses.py` → Loss functions (VGG perceptual, PatchGAN, KL free bits, edge loss)
- `train.py` → Training script with progressive resolution, checkpoint management
- `PMA_VAE_Colab_Training.ipynb` → Complete Colab notebook

License: MIT