AssistantBench

AssistantBench evaluates the ability of web agents to assist with realistic and time-consuming tasks. For more information, see our paper or the official website. To download AssistantBench, press here.

AssistantBench Leaderboard

Model Name	Accuracy	Answer rate	Precision	EM	Accuracy (easy)	Accuracy (medium)	Accuracy (hard)	Base Model	Organization
Magentic-1	28.30	94.5	28.8	10.5	67.8	48.5	15.5	gpt-4o, o1-preview	MSR AI Frontiers

Model Name	Accuracy	Answer rate	Precision	EM	Accuracy (easy)	Accuracy (medium)	Accuracy (hard)	Base Model	Organization
tt_api_1	28.30	94.5	30	11	67.8	48.5	15.5	GPT-4o
check_1	28.30	98.3	28.8	11	67.8	48.5	15.5	GPT-4o
Agent Context Protocols	28.30	98.3	28.8	11	67.8	48.5	15.5	GPT-4o
tt_api	27.70	92.8	29.8	10.5	67.8	46.5	15.5	GPT-4o
Magentic-o1	27.70	95.6	29	13.3	73.4	47.1	14.8	gpt-4o, o1-preview	MSR AI Frontiers
Magentic-UI (fix none answers)	27.60	83.4	33.1	9.4	60.7	43.8	17.3	o4-mini	MSR AI Frontiers
tt_api_2	27.50	91.2	30.2	12.7	75.3	45.1	15.4	GPT-4o
Magentic-Ui	26.80	100	26.8	9.9	63.3	40.1	17.6	o4-mini	MSR AI Frontiers
SPA→CB (Claude)	26.40	81.8	32.2	13.8	81	44.6	13.3	Claude-3.5-Sonnet	TAU NLP
baseline_3	25.90	98.9	26.2	9.9	70.2	34.1	18.5	GPT-4o
Magentic-1	25.30	100	25.3	11	69.9	35.6	16.9	GPT-4o	MSR AI Frontiers
SPA→CB	25.20	91.7	27.5	9.9	80.7	42.7	12.4	GPT4-T	TAU NLP
test_submission_3	24.80	93.4	26.5	9.4	81.6	38.9	13.5	GPT-4o
sf_med	24.80	95.2	26.1	10.3	90	27.6	18.4	GPT-4o
tt_api_3	24.70	95.6	25.8	9.4	82.9	33.7	15.8	GPT-4o
Swarm	24.20	95	25.5	9.4	82.9	32	15.9	GPT-4o
tt2	24.00	96.1	25	9.9	41.8	40.7	14.6	GPT-4o
sf_wiki	23.50	96.7	24.4	9.4	63.7	36.8	14	GPT-4o
test_submission_2	23.40	93.9	24.9	9.4	62.2	29.3	17.5	GPT-4o
SeeAct→CB	23.40	89.5	26.1	9.4	82	47.7	7.1	GPT4-T	TAU NLP
sf_wiki_1	23.30	96.1	24.2	8.8	67.9	35.8	13.8	GPT-4o
bs1	23.20	99.4	23.3	10.5	60.8	38.3	12.9	GPT-4o
o3-mini	23.20	99.4	23.3	10.5	60.8	38.3	12.9	tt_o3
RALM-Inst→CB (Claude)	22.50	79.6	28.3	8.3	65.2	40.1	10.7	Claude-3.5-Sonnet	TAU NLP
tt_o1	22.50	92.8	24.2	9.9	51.6	34.6	14.4	o1
SeeAct→CB (Claude)	22.30	76.2	29.3	7.7	87.4	46	5.8	Claude-3.5-Sonnet	TAU NLP
CB-1S	22.20	89.5	24.8	8.3	82.1	49.7	4.2	GPT4-T	TAU NLP
CB-1S (Claude)	21.90	76.2	28.8	6.6	86.6	47.9	4.4	Claude-3.5-Sonnet	TAU NLP
RALM-1S→CB (Claude)	21.60	82.3	26.3	6.6	63.3	40	9.5	Claude-3.5-Sonnet	TAU NLP
test_submission_5	21.30	95	22.4	8.8	61.8	33.4	12.4	GPT-4o
tt1	21.20	93.9	22.5	9.9	61.4	38.7	9.6	GPT-4o
test_1	21.20	93.9	22.5	9.9	61.4	38.7	9.6	GPT-4o
baseline_2	21.20	100	21.2	7.2	69.5	30.4	13	GPT-4o
tt3	21.00	96.7	21.8	7.2	51.3	29.1	14.8	GPT-4o
test_submission_1	20.90	83.4	25	8.3	56.7	24.9	16.1	GPT-4o
baseline	20.60	100	20.6	7.2	68.9	30.8	12	GPT-4o
sf-2	19.90	94.5	21.1	7.2	60.3	32.2	10.9	GPT-4o
sf	19.60	94.5	20.7	7.2	60.3	31	10.9	GPT-4o
test_single_panel	19.50	92.3	21.2	6.1	57.4	33.7	9.8	GPT-4o
RALM-1S→CB	19.50	92.8	21	6.1	81.3	35	7.3	GPT4-T	TAU NLP
test_submission_4	19.00	89.5	21.2	6.6	62.2	26	12.3	GPT-4o
RALM-Inst→CB	18.70	93.9	19.9	6.6	57.8	34.7	8	GPT4-T	TAU NLP
tt_o3	18.00	87.3	20.6	9.9	46.3	28.2	10.9	o3-mini
CB-Inst (Claude)	17.70	69.1	25.6	6.1	82.2	37.4	3.2	Claude-3.5-Sonnet	TAU NLP
CB-Inst	16.50	53.6	30.7	6.1	51.2	40.2	2.3	GPT4-T	TAU NLP
Infogent	14.50	71.3	20.4	5.5	63.9	19.3	8.4	GPT4O
InfoGent-GPT4O	14.50	71.3	20.4	5.5	63.9	19.3	8.4	GPT4O
ACP	14.20	49.2	28.8	5.5	33.9	24.3	7.8	GPT-4o
SPA (Claude)	12.90	34.3	37.7	8.8	11.1	18.4	10.4	Claude-3.5-Sonnet	TAU NLP
RALM-Inst	11.80	60.2	19.5	5.5	50.2	17	6.2	GPT4-T	TAU NLP
RALM-Inst (Claude)	11.50	43.1	26.7	5	50	11.2	8.7	Claude-3.5-Sonnet	TAU NLP
GPT-4o	11.30	100	11.3	3.3	36	24.7	3	GPT-4o
GPT-4o test	11.30	100	11.3	3.3	36	24.7	3	GPT-4o
SPA	11.10	35.9	30.9	5.5	29.5	12.3	9.1	GPT4-T	TAU NLP
RALM-1S (Claude)	11.00	42.5	25.9	3.3	35.5	15.5	6.9	Claude-3.5-Sonnet	TAU NLP
RALM-1S	10.70	48.1	22.4	3.9	80	10.5	5.5	GPT4-T	TAU NLP
AgentLab GenericAgent	4.80	47.5	10.1	1.1	10.7	9.8	1.9	GPT-4o	ServiceNow
SeeAct	4.10	15.5	26.3	2.2	28.9	2.5	2.9	GPT4-T	TAU NLP
SeeAct (Claude)	2.20	13.8	15.8	1.7	11.1	1.8	1.7	Claude-3.5-Sonnet	TAU NLP

Making a New Submission

To make a new submission, upload a predictions file. Our scoring function can be found here. We support JSONL files with the following format:

{"id": "task_id_1", "answer": "Answer 1 from your model"}
{"id": "task_id_2", "answer": "Answer 2 from your model"}

The submission form has the following fields:

Model Name
Base Model
URL to Model Information
Organization
Contact Email (will be stored privately & used if there is an issue with your submission)
File

We would like to thank the GAIA team for sharing the source code for their leaderboard, which we used as a template, and HuggingFace for hosting the leaderboard.