SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks
Abstract
SciOrch is a framework that uses a lightweight orchestrator model to coordinate multiple frontier LLMs for scientific reasoning, achieving superior performance through MCTS-based training and GRPO-style optimization while reducing API costs.
Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an MCTS-based approach that produces diverse orchestration trajectories, extracts per-node single-turn samples, and optimizes the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE with less than half the API cost of typical multi-agent methods.
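To make the "GRPO-style" step concrete: methods in this family usually score each sampled response relative to the other responses in its group, rather than against a learned value function. The sketch below shows only that group-relative normalization. It assumes binary correctness rewards and assumes that the candidate actions sampled at one search-tree node form a group; the abstract specifies neither, and the function name is hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so
    above-average samples get positive weight and below-average samples
    get negative weight.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: four candidate orchestrator actions sampled at one
# node, rewarded 1.0 if the trajectory ended in a correct answer, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason diverse trajectories matter for this kind of training.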
Community
We're excited to share our latest work, SciOrch: Learning to Orchestrate Expert LLMs for Frontier Multimodal Scientific Reasoning 🧬
Scientific reasoning often requires reading complex figures, combining knowledge from different fields, and solving problems step by step. Different LLMs are good at different parts of this process, so instead of relying on just one model, we ask: can a small model learn to coordinate multiple expert LLMs?
To answer this, we propose SciOrch 🎼, an 8B vision-language model that learns to break down scientific questions, call the right expert models, and combine their answers.
Since calling commercial models can be costly, we train SciOrch with an efficient MCTS-based pipeline 🌳.
Our results show that SciOrch outperforms strong single-model and multi-agent baselines, while reducing API cost. We hope this is a step toward more efficient and collaborative AI systems for scientific reasoning.
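The break-down, call, and combine loop described in this post can be sketched roughly as follows. This is an illustrative toy, not the SciOrch implementation: in the real system the 8B orchestrator makes the decomposition, routing, and synthesis decisions, whereas here they are hand-written stubs. The expert names, the flat per-call costs, and every function name are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Expert:
    name: str
    cost_per_call: float           # hypothetical flat dollar cost per call
    answer: Callable[[str], str]   # stand-in for a commercial model API call

def orchestrate(
    question: str,
    decompose: Callable[[str], List[str]],
    route: Callable[[str], str],
    synthesize: Callable[[str, List[str]], str],
    experts: Dict[str, Expert],
) -> Tuple[str, float]:
    """Split a question, send each sub-problem to one expert, merge the results."""
    total_cost = 0.0
    partials: List[str] = []
    for sub in decompose(question):
        expert = experts[route(sub)]        # pick one expert per sub-problem
        partials.append(expert.answer(sub))
        total_cost += expert.cost_per_call  # track API spend alongside answers
    return synthesize(question, partials), total_cost

# Stub experts that simply echo the sub-problem they were given.
experts = {
    "vision_expert": Expert("vision_expert", 0.5, lambda s: f"[vision_expert] {s}"),
    "text_expert": Expert("text_expert", 0.25, lambda s: f"[text_expert] {s}"),
}

final_answer, cost = orchestrate(
    "read the figure and compute the rate",
    decompose=lambda q: q.split(" and "),
    route=lambda s: "vision_expert" if "figure" in s else "text_expert",
    synthesize=lambda q, parts: " | ".join(parts),
    experts=experts,
)
```

Returning the accumulated cost next to the answer reflects the post's emphasis on API cost: an orchestration policy can only be compared on both accuracy and spend if spend is recorded per trajectory.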
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation (2026)
- ATLAS: Agentic Test-time Learning-to-Allocate Scaling (2026)
- Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles (2026)
- Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning (2026)
- ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward (2026)
- FastContext: Training Efficient Repository Explorer for Coding Agents (2026)
- World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning (2026)
Neat paper. The idea of using a lightweight 8B model to orchestrate larger frontier models makes a lot of sense, especially since the complementarity between different models is often ignored in single-model benchmarks. It's refreshing to see a focus on keeping API costs down while actually hitting better performance than the individual models.
How did the MCTS-based training handle the trade-off between exploring diverse orchestration paths and keeping the API budget reasonable during the training phase?
I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/70184a66-a094-4fed-824b-6545a3f0703c