Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Abstract
A commercial-scale virtual try-on system achieves high success rates, photorealistic results, and real-time performance through integrated system design and multi-stage training.
Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at industrial scale on the Taobao App, serving millions of users with tens of millions of requests.
Community
the multi-image diffusion approach in MMDiT, coordinating up to 6 references while keeping identity and background stable, stands out in this space. one thing i'm curious about is how you resolve conflicting cues from multiple references when textures and lighting disagree across garments. would love to see a clean ablation showing how fidelity and artifact rates scale as you vary the number of references from 1 to 6. btw the arxivlens breakdown helped me parse the method details.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories (2026)
- Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off (2026)
- GEditBench v2: A Human-Aligned Benchmark for General Image Editing (2026)
- PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On (2026)
- A$^2$-Edit: Precise Reference-Guided Image Editing of Arbitrary Objects and Ambiguous Masks (2026)
- VTEdit-Bench: A Comprehensive Benchmark for Multi-Reference Image Editing Models in Virtual Try-On (2026)
- Kling-MotionControl Technical Report (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Hi Authors,
Thanks for your impressive work and for sharing the detailed pipeline! I have a question regarding the training strategy depicted in the overall framework diagram.
In the "Training Stage" pipeline, the model first undergoes "Pre-training for general editing" before transitioning to "High-Quality Vertical Domain Data SFT". As we know, the SFT stage for virtual try-on typically relies on high-quality triplets (i.e., <Garment Image, Person in Source Clothing, Person in Target Garment>). However, the exact formulation of the pre-training stage is not entirely clear to me.
Could you please clarify the following details regarding the pre-training stage?
1. Task Formulation: Was this pre-training stage formulated purely as a Text-to-Image (T2I) generation task, a generic Image Inpainting/Editing task, or a Reference-guided Image Synthesis task?
2. Dataset Domain: Did the pre-training utilize general domain data (e.g., general objects from COCO/OpenImages) to learn a universal prior for image fusion, or did it already utilize vertical domain data (e.g., large-scale fashion/human datasets, possibly just garment-person pairs without the source clothing)?
3. Input Modality: If it was trained as an editing task, how were the inputs constructed? Did you use mask-based self-supervised learning (like randomly masking a region and conditioning on a reference image) or other data generation pipelines (like traceless erasing) to bridge the gap toward the triplet inputs required in SFT?
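To make the mask-based option in question 3 concrete, here is a minimal sketch of how one self-supervised sample might be built from a single image. It is only an illustration of the question, not the authors' pipeline: the helper name `make_masked_sample`, the rectangular mask, and the `min_frac`/`max_frac` parameters are all assumptions.

```python
import numpy as np


def make_masked_sample(image, rng, min_frac=0.3, max_frac=0.7):
    """Build (masked_input, reference, mask, target) from one image.

    Hypothetical helper: a random rectangle is blanked out of `image`,
    the cut-out crop serves as the reference, and a model would be
    trained to restore the original image from the two.
    """
    h, w = image.shape[:2]
    # Sample the rectangle size as a fraction of each side.
    mh = int(h * rng.uniform(min_frac, max_frac))
    mw = int(w * rng.uniform(min_frac, max_frac))
    # Sample the top-left corner so the rectangle stays inside the image.
    top = int(rng.integers(0, h - mh + 1))
    left = int(rng.integers(0, w - mw + 1))

    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + mh, left:left + mw] = True

    reference = image[top:top + mh, left:left + mw].copy()
    masked_input = image.copy()
    masked_input[mask] = 0  # blank out the region to be regenerated
    return masked_input, reference, mask, image


# Toy usage on random pixels (values start at 1 so the zeroed region is unambiguous).
rng = np.random.default_rng(0)
image = rng.integers(1, 256, size=(64, 48, 3), dtype=np.uint8)
masked_input, reference, mask, target = make_masked_sample(image, rng)
```

In this construction the reference is pixel-aligned with the target, which is exactly the gap the question raises: a real garment image differs from the worn garment in pose, lighting, and deformation, so a pipeline would presumably need to augment or replace the reference to approach the SFT triplets.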
Looking forward to your reply. Thank you in advance!
