Generative Frame Sampler for Long Video Understanding Paper • 2503.09146 • Published Mar 12, 2025 • 1
Kimi-VL-A3B Collection Moonshot's efficient MoE VLMs, exceptional on agent, long-context, and thinking • 7 items • Updated Oct 30, 2025 • 78
Article Kimi-VL-A3B-Thinking-2506: A Quick Navigation Jun 21, 2025 • 74
ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks Paper • 2503.06885 • Published Mar 10, 2025 • 4
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? Paper • 2505.23359 • Published May 29, 2025 • 38
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning Paper • 2504.08837 • Published Apr 10, 2025 • 43
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper • 2504.10479 • Published Apr 14, 2025 • 306
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published Feb 20, 2025 • 157
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Paper • 2411.13281 • Published Nov 20, 2024 • 20
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Paper • 2501.12948 • Published Jan 22, 2025 • 434
Kimi k1.5: Scaling Reinforcement Learning with LLMs Paper • 2501.12599 • Published Jan 22, 2025 • 126
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published Dec 6, 2024 • 46
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper • 2412.00927 • Published Dec 1, 2024 • 29
Data Engineering for Scaling Language Models to 128K Context Paper • 2402.10171 • Published Feb 15, 2024 • 25
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark Paper • 2410.03051 • Published Oct 4, 2024 • 6