Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding • Paper 2508.13804 • Published Aug 19, 2025
Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information • Paper 2508.11252 • Published Aug 15, 2025
RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation • Paper 2508.13968 • Published Aug 19, 2025