See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
Paper • 2605.18018 • Published • 33
We advance the development of AGI and foster open-source collaboration towards a smarter future.
ESPO: Early-Stopping Proximal Policy Optimization
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation