Vision and language
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance (arXiv:2404.04125)
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching (arXiv:2404.03653)
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models (arXiv:2404.02747)
3D Congealing: 3D-Aware Image Alignment in the Wild (arXiv:2404.02125)
BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion (arXiv:2404.04544)
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback (arXiv:2404.07987)
BRAVE: Broadening the visual encoding of vision-language models (arXiv:2404.07204)
RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion (arXiv:2404.07199)
Learning to Route Among Specialized Experts for Zero-Shot Generalization (arXiv:2402.05859)
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset (arXiv:2403.00587)
ReGround: Improving Textual and Spatial Grounding at No Cost (arXiv:2403.13589)
FlexCap: Generating Rich, Localized, and Flexible Captions in Images (arXiv:2403.12026)
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (arXiv:2403.03206)
Editable Image Elements for Controllable Synthesis (arXiv:2404.16029)
Move Anything with Layered Scene Diffusion (arXiv:2404.07178)
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling (arXiv:2405.21048)