MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs Paper • 2411.15296 • Published Nov 22, 2024 • 21
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Paper • 2501.13826 • Published Jan 23, 2025 • 23
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training Paper • 2509.23661 • Published Sep 28, 2025 • 47
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published Dec 6, 2024 • 46
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models Paper • 2411.14982 • Published Nov 22, 2024 • 19
Large Language Models are Visual Reasoning Coordinators Paper • 2310.15166 • Published Oct 23, 2023 • 2
OtterHD: A High-Resolution Multi-modality Model Paper • 2311.04219 • Published Nov 7, 2023 • 34
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures Paper • 2410.13754 • Published Oct 17, 2024 • 75
Octopus: Embodied Vision-Language Programmer from Environmental Feedback Paper • 2310.08588 • Published Oct 12, 2023 • 38
MIMIC-IT: Multi-Modal In-Context Instruction Tuning Paper • 2306.05425 • Published Jun 8, 2023 • 11
MMBench: Is Your Multi-modal Model an All-around Player? Paper • 2307.06281 • Published Jul 12, 2023 • 5
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models Paper • 2407.12772 • Published Jul 17, 2024 • 35
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models Paper • 2407.07895 • Published Jul 10, 2024 • 42