---
task_categories:
- text-to-image
---

Unify-Agent

This repository contains the official resources for Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis.

👀 Intro

We introduce Unify-Agent, an end-to-end unified multimodal agent for world-grounded image synthesis. Unlike conventional text-to-image models that rely only on frozen parametric knowledge, Unify-Agent actively reasons, searches, and integrates external world knowledge at inference time. This enables more faithful generation of real people, cultural symbols, rare intellectual properties (IPs), historical scenes, scientific concepts, and other long-tail entities.

Unify-Agent unifies four core capabilities within a single model:

THINK: understand the prompt and identify missing knowledge
RESEARCH: retrieve relevant textual and visual evidence
RECAPTION: convert retrieved evidence into grounded generation guidance
GENERATE: synthesize the final image
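
The four stages can be read as a sequential loop in which each stage consumes the previous stage's output. The sketch below is illustrative only: `Trajectory`, `run_agent`, and the stub stage functions are hypothetical names, not the released API, and the stubs return placeholder strings where the actual model would reason, search, and synthesize an image.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical record of one agent run (not the released data format)."""
    prompt: str
    gaps: List[str] = field(default_factory=list)      # THINK output
    evidence: List[str] = field(default_factory=list)  # RESEARCH output
    caption: str = ""                                  # RECAPTION output
    image: str = ""                                    # GENERATE output (placeholder)


def run_agent(
    prompt: str,
    think: Callable[[str], List[str]],
    research: Callable[[List[str]], List[str]],
    recaption: Callable[[str, List[str]], str],
    generate: Callable[[str], str],
) -> Trajectory:
    """Run THINK, RESEARCH, RECAPTION, GENERATE in order, keeping each result."""
    traj = Trajectory(prompt=prompt)
    traj.gaps = think(prompt)
    traj.evidence = research(traj.gaps)
    traj.caption = recaption(prompt, traj.evidence)
    traj.image = generate(traj.caption)
    return traj


# Stub stages standing in for the model; every value below is a placeholder.
def stub_think(prompt: str) -> List[str]:
    return [f"appearance of: {prompt}"]


def stub_research(gaps: List[str]) -> List[str]:
    return [f"evidence for '{gap}'" for gap in gaps]


def stub_recaption(prompt: str, evidence: List[str]) -> str:
    return prompt + " | grounded on: " + "; ".join(evidence)


def stub_generate(caption: str) -> str:
    return f"<image for: {caption}>"


traj = run_agent("a rare mascot", stub_think, stub_research,
                 stub_recaption, stub_generate)
print(traj.caption)
```

Keeping the intermediate outputs on one record mirrors the idea of an agent trajectory, although the actual trajectory schema is not specified here.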

To train this agent, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis.

We further introduce FactIP, a new benchmark for factual and knowledge-intensive image generation, covering 12 categories of culturally significant and long-tail concepts that explicitly require external knowledge grounding.

As an early exploration of agent-based modeling for image generation, Unify-Agent highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world visual synthesis.

🔍 FactIP Benchmark

Our FactIP benchmark is designed to evaluate search-grounded and knowledge-intensive image generation in real-world settings.

FactIP contains three major groups — Character, Scene, and Object — and 12 fine-grained subcategories, covering diverse factual generation scenarios such as celebrities, animated characters, landmarks, cultural relics, food, toys, and mythology.

The full benchmark contains 2,462 prompts, and we also provide a mini test subset with category proportions aligned to the full benchmark.
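
One common way to build a subset whose category proportions match a full set is stratified sampling. The sketch below shows that idea on toy data; it is an assumption about how such a subset could be drawn, not the procedure used for FactIP, and `proportional_subset`, the sampling fraction, and the toy prompts are all hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Item = Tuple[str, str]  # (prompt, category)


def proportional_subset(items: List[Item], fraction: float, seed: int = 0) -> List[Item]:
    """Sample the same fraction from every category (at least one item each)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_category: Dict[str, List[Item]] = defaultdict(list)
    for prompt, category in items:
        by_category[category].append((prompt, category))

    subset: List[Item] = []
    for category in sorted(by_category):  # sorted for a deterministic order
        group = by_category[category]
        k = max(1, round(len(group) * fraction))
        subset.extend(rng.sample(group, k))
    return subset


# Toy data only: the category names echo the README, the counts are made up.
toy = (
    [(f"celebrity prompt {i}", "celebrity") for i in range(60)]
    + [(f"landmark prompt {i}", "landmark") for i in range(30)]
    + [(f"food prompt {i}", "food") for i in range(10)]
)

mini = proportional_subset(toy, fraction=0.1)
counts = Counter(category for _, category in mini)
print(counts)
```

With a 60/30/10 split in the toy set, the sampled subset keeps the same 6/3/1 ratio.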

🏆 Performance

Unify-Agent substantially improves factual visual synthesis over its base unified model and strong open-source baselines across FactIP, WiSE, KiTTEN, and T2I-FactualBench.

Our method produces images that better preserve subject identity, fine-grained visual attributes, prompt-specific details, and real-world factual grounding, while maintaining strong visual quality and broad stylistic versatility.

🧠 Pipeline

Given an input prompt, Unify-Agent first performs prompt understanding and cognitive gap detection to identify missing but visually critical attributes. It then acquires complementary evidence through both textual evidence search and visual evidence search.

Based on the collected evidence, the model grounds the generation process with:

identity-preserving constraints for character-specific visual traits
scene-compositional constraints for pose, environment, clothing, and mood

These grounded constraints are then passed to an evidence-grounded recaptioning module, which produces a detailed caption for the downstream image generator.
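
To make the two constraint types concrete, the sketch below merges them into a single caption string with a fixed template. This is a deliberately simple stand-in: in Unify-Agent the recaptioning is performed by the model, and `GroundedConstraints`, `compose_caption`, and the example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GroundedConstraints:
    """Hypothetical container for the two constraint types described above."""
    identity: List[str] = field(default_factory=list)  # character-specific visual traits
    scene: List[str] = field(default_factory=list)     # pose, environment, clothing, mood


def compose_caption(prompt: str, constraints: GroundedConstraints) -> str:
    """Append non-empty constraint groups to the prompt using a fixed template."""
    parts = [prompt.strip().rstrip(".")]
    if constraints.identity:
        parts.append("Subject: " + ", ".join(constraints.identity))
    if constraints.scene:
        parts.append("Scene: " + ", ".join(constraints.scene))
    return ". ".join(parts) + "."


# Placeholder values; real constraints would come from retrieved evidence.
constraints = GroundedConstraints(
    identity=["trait A", "trait B"],
    scene=["standing pose", "outdoor setting"],
)
caption = compose_caption("A historical figure giving a speech.", constraints)
print(caption)
```

Keeping identity and scene constraints in separate fields makes it easy to see which part of the final caption came from which kind of evidence.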

📦 Release Status

The repository is now public. The code, benchmark, and checkpoints are being prepared for full release.

Please stay tuned for upcoming updates.

Citation

If you find this work helpful, please consider citing:

@article{chen2026unify,
  title={Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis},
  author={Chen, Shuang and Shou, Quanxin and Chen, Hangting and Zhou, Yucheng and Feng, Kaituo and Hu, Wenbo and Zhang, Yi-Fan and Lin, Yunlong and Huang, Wenxuan and Song, Mingyang and others},
  journal={arXiv preprint arXiv:2603.29620},
  year={2026}
}