SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
Abstract
SVG-T2I, a scaled-up SVG framework, enables high-quality text-to-image synthesis directly in the Visual Foundation Model feature domain, achieving competitive performance on text-to-image benchmarks.
Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.
Community
Hi everyone!
We're excited to introduce SVG-T2I, an experimental research project aimed at providing the community with a representation-based text-to-image generation framework for further exploration and study.
All code and model weights are fully open-sourced. If you find this work interesting or useful, we'd greatly appreciate your support with an Upvote on Hugging Face and a Star on GitHub.
Links:
- Hugging Face Paper: https://huggingface.co/papers/2512.11749
- Code: https://github.com/KlingTeam/SVG-T2I
- Model Weights: https://huggingface.co/KlingTeam/SVG-T2I
- arXiv: https://arxiv.org/abs/2512.11749
SVG-T2I is a pure VFM-based text-to-image generation framework that performs diffusion modeling directly in the representation space, completely removing the need for traditional VAEs.
The primary goal of this work is to validate the scalability and effectiveness of representation-based generation at scale, while also providing the community with a fully open and end-to-end solution, including training code, inference and evaluation pipelines, and pre-trained checkpoints.
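For readers who want a concrete picture of what "diffusion modeling directly in the representation space" means, below is a minimal, self-contained PyTorch sketch. It is not the SVG-T2I code or API: the toy VFM encoder, the `FeatureDenoiser` module, and all dimensions are hypothetical placeholders, and a real system also needs a feature-to-pixel decoder (omitted here) to render images from the generated VFM features.

```python
# Conceptual sketch only (assumed names/shapes, not the official SVG-T2I code):
# train a denoiser directly on frozen VFM features instead of VAE latents.
import torch
import torch.nn as nn

feat_dim, text_dim = 768, 512  # illustrative sizes

# Stand-in for a frozen Visual Foundation Model (e.g. a ViT-style patch encoder):
# images -> (B, feat_dim, num_patches). A real VFM backbone would go here.
toy_vfm = nn.Sequential(nn.Conv2d(3, feat_dim, kernel_size=14, stride=14), nn.Flatten(2))
for p in toy_vfm.parameters():
    p.requires_grad_(False)

class FeatureDenoiser(nn.Module):
    """Toy transformer that predicts a velocity field over VFM feature tokens,
    conditioned on a pooled text embedding and a timestep (a real model would be
    much larger and use cross-attention for text conditioning)."""
    def __init__(self):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, feat_dim)
        self.time_proj = nn.Linear(1, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, feat_dim)

    def forward(self, noisy_feats, t, text_emb):
        cond = self.text_proj(text_emb) + self.time_proj(t[:, None])  # (B, feat_dim)
        return self.head(self.backbone(noisy_feats + cond[:, None, :]))

denoiser = FeatureDenoiser()

# One rectified-flow-style training step, entirely in feature space (no VAE).
images = torch.randn(2, 3, 224, 224)        # dummy image batch
text_emb = torch.randn(2, text_dim)         # dummy text-encoder output
with torch.no_grad():
    x1 = toy_vfm(images).transpose(1, 2)    # clean VFM features: (B, tokens, feat_dim)
x0 = torch.randn_like(x1)                   # Gaussian prior sample
t = torch.rand(x1.size(0))                  # timestep in [0, 1] per sample
xt = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1
pred = denoiser(xt, t, text_emb)
loss = ((pred - (x1 - x0)) ** 2).mean()     # regress the velocity x1 - x0
loss.backward()
print(f"toy feature-space diffusion loss: {loss.item():.4f}")
```

At inference time, one would integrate the learned velocity field from noise toward clean features and then map the generated VFM features back to pixels with a separately trained decoder, which is the role the VAE decoder usually plays in a conventional latent diffusion pipeline.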
We hope this project can serve as a useful foundation for future research on representation-based generation and related directions.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Latent Diffusion Model without Variational Autoencoder (2025)
- TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models (2025)
- One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation (2025)
- UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios (2025)
- TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows (2025)
- MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency (2025)
- Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models (2025)