task_categories:
- text-to-image
Unify-Agent
This repository contains the official resources for Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis.
Intro
We introduce Unify-Agent, an end-to-end unified multimodal agent for world-grounded image synthesis. Unlike conventional text-to-image models that rely only on frozen parametric knowledge, Unify-Agent can actively reason, search, and integrate external world knowledge at inference time, enabling more faithful generation of real people, cultural symbols, rare IPs, historical scenes, scientific concepts, and other long-tail entities.
Unify-Agent unifies four core capabilities within a single model:
- THINK: understand the prompt and identify missing knowledge
- RESEARCH: retrieve relevant textual and visual evidence
- RECAPTION: convert retrieved evidence into grounded generation guidance
- GENERATE: synthesize the final image
To train this agent, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis.
We further introduce FactIP, a new benchmark for factual and knowledge-intensive image generation, covering 12 categories of culturally significant and long-tail concepts that explicitly require external knowledge grounding.
As an early exploration of agent-based modeling for image generation, Unify-Agent highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world visual synthesis.
FactIP Benchmark
Our FactIP benchmark is designed to evaluate search-grounded and knowledge-intensive image generation in real-world settings.
FactIP contains three major groups (Character, Scene, and Object) and 12 fine-grained subcategories, covering diverse factual generation scenarios such as celebrities, animated characters, landmarks, cultural relics, food, toys, and mythology.
The full benchmark contains 2,462 prompts, and we also provide a mini test subset with category proportions aligned to the full benchmark.
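Since the benchmark files are not yet released, the following is only a minimal sketch of how a subset with category proportions aligned to a larger prompt set could be drawn. The record layout (`category` and `prompt` keys), the function name `stratified_subset`, and the toy data are all hypothetical and are not taken from the FactIP release; the actual mini subset may have been built differently.

```python
import random
from collections import Counter, defaultdict


def stratified_subset(prompts, subset_size, seed=0):
    """Draw roughly `subset_size` prompts while keeping each category's share.

    `prompts` is a list of dicts with a hypothetical "category" key.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for item in prompts:
        by_category[item["category"]].append(item)

    total = len(prompts)
    subset = []
    for category in sorted(by_category):
        items = by_category[category]
        # Proportional allocation, with at least one prompt per category.
        quota = max(1, round(subset_size * len(items) / total))
        subset.extend(rng.sample(items, min(quota, len(items))))
    return subset


# Toy data only; these are not FactIP prompts or FactIP category sizes.
toy_prompts = (
    [{"category": "celebrity", "prompt": f"celebrity prompt {i}"} for i in range(60)]
    + [{"category": "landmark", "prompt": f"landmark prompt {i}"} for i in range(30)]
    + [{"category": "food", "prompt": f"food prompt {i}"} for i in range(10)]
)

mini = stratified_subset(toy_prompts, subset_size=10)
mini_counts = Counter(item["category"] for item in mini)
print(mini_counts)
```

Because each quota is rounded independently, the subset size can drift slightly from the requested size when the proportions do not divide evenly.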
Performance
Unify-Agent substantially improves factual visual synthesis over its base unified model and strong open-source baselines across FactIP, WiSE, KiTTEN, and T2I-FactualBench.
Our method produces images that better preserve:
- subject identity
- fine-grained visual attributes
- prompt-specific details
- real-world factual grounding
while maintaining strong visual quality and broad stylistic versatility.
Pipeline
Given an input prompt, Unify-Agent first performs prompt understanding and cognitive gap detection to identify missing but visually critical attributes. It then acquires complementary evidence through both textual evidence search and visual evidence search.
Based on the collected evidence, the model grounds the generation process with:
- identity-preserving constraints for character-specific visual traits
- scene-compositional constraints for pose, environment, clothing, and mood
These grounded constraints are then integrated into an evidence-grounded recaptioning module, which produces a detailed caption for the downstream image generator.
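The control flow described above can be sketched as follows. This is an illustrative outline only, not the released implementation: in Unify-Agent these stages run inside a single model, whereas here each stage is a stubbed callable. All names (`Evidence`, `GroundedPlan`, `run_pipeline`, and the stub functions) are hypothetical, and mapping image evidence to identity constraints and text evidence to scene constraints is a simplifying assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    source: str   # "text" or "image"
    content: str  # a text snippet or a reference to an image


@dataclass
class GroundedPlan:
    prompt: str
    gaps: List[str]
    evidence: List[Evidence] = field(default_factory=list)
    identity_constraints: List[str] = field(default_factory=list)
    scene_constraints: List[str] = field(default_factory=list)
    caption: str = ""


def run_pipeline(prompt, detect_gaps, search_text, search_images, recaption, generate):
    # THINK: find missing but visually critical attributes.
    plan = GroundedPlan(prompt=prompt, gaps=detect_gaps(prompt))

    # RESEARCH: collect textual and visual evidence for each gap.
    for gap in plan.gaps:
        plan.evidence += [Evidence("text", t) for t in search_text(gap)]
        plan.evidence += [Evidence("image", i) for i in search_images(gap)]

    # Toy heuristic (an assumption): images inform identity, text informs scene.
    plan.identity_constraints = [e.content for e in plan.evidence if e.source == "image"]
    plan.scene_constraints = [e.content for e in plan.evidence if e.source == "text"]

    # RECAPTION: fold the constraints into one detailed caption.
    plan.caption = recaption(plan)

    # GENERATE: hand the caption to the image generator.
    return plan, generate(plan.caption)


# Stub stages so the sketch runs without any model or search backend.
def detect_gaps(prompt):
    return ["appearance of the Example Tower"]


def search_text(query):
    return [f"text note about {query}"]


def search_images(query):
    return [f"reference image for {query}"]


def recaption(plan):
    return (
        f"{plan.prompt} | identity: {'; '.join(plan.identity_constraints)}"
        f" | scene: {'; '.join(plan.scene_constraints)}"
    )


def generate(caption):
    return f"<image generated from: {caption}>"


plan, image = run_pipeline(
    "The Example Tower at sunset",
    detect_gaps, search_text, search_images, recaption, generate,
)
print(plan.caption)
```

Passing the stages in as callables keeps the sketch independent of any particular search backend or generator.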
Release Status
The repository is now available, and the code, benchmark, and checkpoints are being prepared for full release.
Please stay tuned for upcoming updates.
Citation
If you find this work helpful, please consider citing:
@article{chen2026unify,
title={Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis},
author={Chen, Shuang and Shou, Quanxin and Chen, Hangting and Zhou, Yucheng and Feng, Kaituo and Hu, Wenbo and Zhang, Yi-Fan and Lin, Yunlong and Huang, Wenxuan and Song, Mingyang and others},
journal={arXiv preprint arXiv:2603.29620},
year={2026}
}