BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding Paper • 2606.31315 • Published 3 days ago • 68
TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents Paper • 2606.28480 • Published 7 days ago • 44
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks Paper • 2606.29537 • Published 5 days ago • 18
The Verification Horizon: No Silver Bullet for Coding Agent Rewards Paper • 2606.26300 • Published 9 days ago • 46
Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence Paper • 2606.15932 • Published 17 days ago • 38
Autodata: An agentic data scientist to create high quality synthetic data Paper • 2606.25996 • Published 9 days ago • 18
Qwen-AgentWorld: Language World Models for General Agents Paper • 2606.24597 • Published 10 days ago • 144
EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions Paper • 2606.23654 • Published 11 days ago • 79
PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 12 days ago • 96
CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents Paper • 2606.22883 • Published 11 days ago • 37
VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models Paper • 2606.16140 • Published 18 days ago • 121
BRDFusion: Physics Meets Generation for Urban Scene Inverse Rendering Paper • 2606.17049 • Published 18 days ago • 27
Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models Paper • 2606.16281 • Published 18 days ago • 34
CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks? Paper • 2606.15300 • Published 20 days ago • 13
FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents Paper • 2606.12087 • Published 23 days ago • 77
DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch Paper • 2606.10728 • Published 24 days ago • 34
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields Paper • 2606.11042 • Published 24 days ago • 22