arXiv:2509.26184

Auto-ARGUE: LLM-Based Report Generation Evaluation

Published on Sep 30, 2025

Abstract

Generation of long-form, citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, ones tailored to report generation (RG) are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for RG evaluation. We present an analysis of Auto-ARGUE on the RG pilot task from the TREC 2024 NeuCLIR track, showing good system-level correlations with human judgments. We further release a web app for visualization of Auto-ARGUE outputs.
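To make the evaluation setup concrete, the sketch below illustrates the general shape of ARGUE-style report evaluation and the system-level meta-evaluation the abstract describes: each report sentence is checked against its cited passage, per-system scores are aggregated, and the automatic scores are correlated with human scores. This is a minimal illustration under stated assumptions, not the paper's implementation: the `llm_supports` stub (naive token overlap in place of an actual LLM judgment), the `report_score` aggregation, and all data are hypothetical; only scipy's `kendalltau` is a real library call.

```python
# Hypothetical sketch of ARGUE-style evaluation; the real Auto-ARGUE
# prompts an LLM and follows the ARGUE framework described in the paper.
from scipy.stats import kendalltau

def llm_supports(sentence: str, cited_passage: str) -> bool:
    """Stand-in for an LLM judgment: does the cited passage support the
    sentence? Here approximated by naive token overlap."""
    s, p = set(sentence.lower().split()), set(cited_passage.lower().split())
    return len(s & p) / max(len(s), 1) > 0.5

def report_score(sentences_with_citations: list[tuple[str, str]]) -> float:
    """Fraction of report sentences judged supported by their citations."""
    if not sentences_with_citations:
        return 0.0
    supported = sum(llm_supports(s, c) for s, c in sentences_with_citations)
    return supported / len(sentences_with_citations)

# Illustrative per-system data: each system produced one cited report.
systems = {
    "sysA": [("the treaty was signed in 1990",
              "the treaty was signed in 1990 by both parties")],
    "sysB": [("exports doubled last year",
              "tourism revenue declined sharply")],
}
auto_scores = [report_score(report) for report in systems.values()]
human_scores = [0.9, 0.2]  # made-up human judgments for the same systems

# System-level correlation: does the automatic metric rank systems the
# way human assessors do?
tau, p_value = kendalltau(auto_scores, human_scores)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.2f})")
```

System-level correlation, as opposed to per-report correlation, asks only whether the metric orders systems the same way humans do; that is the kind of validation the abstract reports against human judgments from the TREC 2024 NeuCLIR RG pilot.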
