SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
Abstract
SVG-T2I, a scaled-up SVG framework, enables high-quality text-to-image synthesis directly in the Visual Foundation Model feature domain, achieving competitive performance on text-to-image benchmarks.
Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.
Community
Hi everyone!
We're excited to introduce SVG-T2I, an experimental research project aimed at providing the community with a representation-based text-to-image generation framework for further exploration and study.
All code and model weights are fully open-sourced. If you find this work interesting or useful, we'd greatly appreciate your support with an Upvote on Hugging Face and a Star on GitHub.
Links:
- Hugging Face Paper: https://huggingface.co/papers/2512.11749
- Code: https://github.com/KlingTeam/SVG-T2I
- Model Weights: https://huggingface.co/KlingTeam/SVG-T2I
- arXiv: https://arxiv.org/abs/2512.11749
SVG-T2I is a pure VFM-based text-to-image generation framework that performs diffusion modeling directly in the representation space, completely removing the need for traditional VAEs.
The primary goal of this work is to validate the scalability and effectiveness of representation-based generation at scale, while also providing the community with a fully open and end-to-end solution, including training code, inference and evaluation pipelines, and pre-trained checkpoints.
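For readers who want a concrete picture of what "diffusion modeling directly in the representation space" means, below is a minimal, self-contained PyTorch sketch. It is not the SVG-T2I code or API: the toy VFM encoder, the `FeatureDenoiser` module, and all dimensions are hypothetical placeholders, and a real system also needs a feature-to-pixel decoder (omitted here) to render images from the generated VFM features.

```python
# Conceptual sketch only (assumed names/shapes, not the official SVG-T2I code):
# train a denoiser directly on frozen VFM features instead of VAE latents.
import torch
import torch.nn as nn

feat_dim, text_dim = 768, 512  # illustrative sizes

# Stand-in for a frozen Visual Foundation Model (e.g. a ViT-style patch encoder):
# images -> (B, feat_dim, num_patches). A real VFM backbone would go here.
toy_vfm = nn.Sequential(nn.Conv2d(3, feat_dim, kernel_size=14, stride=14), nn.Flatten(2))
for p in toy_vfm.parameters():
    p.requires_grad_(False)

class FeatureDenoiser(nn.Module):
    """Toy transformer that predicts a velocity field over VFM feature tokens,
    conditioned on a pooled text embedding and a timestep (a real model would be
    much larger and use cross-attention for text conditioning)."""
    def __init__(self):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, feat_dim)
        self.time_proj = nn.Linear(1, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, feat_dim)

    def forward(self, noisy_feats, t, text_emb):
        cond = self.text_proj(text_emb) + self.time_proj(t[:, None])  # (B, feat_dim)
        return self.head(self.backbone(noisy_feats + cond[:, None, :]))

denoiser = FeatureDenoiser()

# One rectified-flow-style training step, entirely in feature space (no VAE).
images = torch.randn(2, 3, 224, 224)        # dummy image batch
text_emb = torch.randn(2, text_dim)         # dummy text-encoder output
with torch.no_grad():
    x1 = toy_vfm(images).transpose(1, 2)    # clean VFM features: (B, tokens, feat_dim)
x0 = torch.randn_like(x1)                   # Gaussian prior sample
t = torch.rand(x1.size(0))                  # timestep in [0, 1] per sample
xt = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1
pred = denoiser(xt, t, text_emb)
loss = ((pred - (x1 - x0)) ** 2).mean()     # regress the velocity x1 - x0
loss.backward()
print(f"toy feature-space diffusion loss: {loss.item():.4f}")
```

At inference time, one would integrate the learned velocity field from noise toward clean features and then map the generated VFM features back to pixels with a separately trained decoder, which is the role the VAE decoder usually plays in a conventional latent diffusion pipeline.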
We hope this project can serve as a useful foundation for future research on representation-based generation and related directions.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Latent Diffusion Model without Variational Autoencoder (2025)
- TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models (2025)
- One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation (2025)
- UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios (2025)
- TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows (2025)
- MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency (2025)
- Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models (2025)