LSVOS 2025 Challenge Report: Recent Advances in Complex Video Object Segmentation Paper • 2510.11063 • Published Oct 13, 2025 • 1
RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything Paper • 2401.10228 • Published Jan 18, 2024
RecTok: Reconstruction Distillation along Rectified Flow Paper • 2512.13421 • Published Dec 15, 2025 • 5
EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing Paper • 2512.11715 • Published Dec 12, 2025
WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World Paper • 2512.10958 • Published Dec 11, 2025 • 1
Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future Paper • 2512.16760 • Published Dec 18, 2025 • 15
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation Paper • 2412.03255 • Published Dec 4, 2024
Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models Paper • 2602.01842 • Published Feb 2 • 3
RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation Paper • 2312.07526 • Published Apr 8, 2024
Watch, Remember, Reason: Human-View Video Understanding with MLLMs Paper • 2606.07433 • Published 10 days ago • 21
MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft Paper • 2605.30931 • Published 17 days ago • 11
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models Paper • 2412.12932 • Published Dec 17, 2024 • 2
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining Paper • 2412.10342 • Published Dec 13, 2024
Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark Paper • 2502.04976 • Published Feb 7, 2025
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology Paper • 2503.14911 • Published Mar 19, 2025 • 3