ICLR 2026

WebDS: An End-to-End Benchmark for Web-based Data Science

1Stanford University 2Pinetree Research 3UC Berkeley 4Singapore University of Technology and Design 5University of Southern California
*Equal contribution (first authors). †Equal contribution (second authors).

Abstract

Many real-world data science tasks involve complex web-based interactions: finding appropriate data on the internet, synthesizing multimodal data from different locations, and producing summarized analyses. Existing web benchmarks often focus on simplistic interactions and rarely require diverse tool-using capabilities. Conversely, traditional data science benchmarks typically concentrate on static, highly structured datasets and do not assess end-to-end workflows that encompass data acquisition, cleaning, analysis, and insight generation. In response, we introduce WebDS, the first end-to-end web-based data science benchmark. It comprises 870 web-based data science tasks across 29 diverse websites, from structured government data portals to unstructured news media, challenging agents to perform complex, multi-step, tool-based operations across heterogeneous data formats, better reflecting the realities of modern data analytics. Evaluations of current SOTA LLM agents reveal significant performance gaps on these tasks. For instance, Browser Use, which accomplishes 80% of tasks on WebVoyager, completes only 15% of tasks in WebDS; our analysis attributes this to new failure modes that agents display on WebDS's tasks, such as poor information grounding, repetitive behavior, and shortcut-taking. By contrast, humans achieve around 90% accuracy, highlighting a substantial gap between current agents and human performance. By providing a more robust and realistic testing ground, WebDS sets the stage for significant advances in the development of practically useful LLM-based data science agents.

Example Task

WebDS tasks begin with autonomous web browsing for relevant data, followed by analysis and/or visualization, and culminate in a well-reasoned, context-aware output. The example below illustrates a multi-site task where an agent must navigate a university data portal, cross-reference national demographic sources, and produce a report.

Example WebDS task flow showing an agent navigating a university data portal and cross-referencing external sources to produce a report.
“Analyze the total enrollment numbers by racial/ethnic category for undergraduates (both degree- and non-degree-seeking) as of October 19, 2022. Cross-reference these numbers with national demographic trends and discuss the potential impact on the university's diversity initiatives. Write a report for the university's strategic planning committee on these trends and recommendations.”

Dataset Statistics

WebDS consists of 870 human-written tasks spanning 29 data-rich websites grouped into 10 high-stakes domains. Every task is manually labeled with one or more of seven attributes and categorized into easy, medium, or hard difficulty based on its structural and content properties.
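The labeling scheme above (multi-label attributes plus a single difficulty level per task) can be sketched as a simple record type. This is an illustrative sketch only; the field names, example IDs, and example tasks are hypothetical, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a WebDS task record. Field names and example
# values are illustrative assumptions, not the released data format.
@dataclass
class Task:
    task_id: str
    prompt: str
    websites: list[str]   # sites the agent must visit
    attributes: set[str]  # multi-label: one task can carry several attributes
    difficulty: str       # "easy" | "medium" | "hard"

tasks = [
    Task("demo-001",
         "Analyze enrollment by racial/ethnic category and cross-reference "
         "national demographic trends.",
         ["university portal", "national statistics site"],
         {"multihop", "structured", "multisite"},
         "hard"),
    Task("demo-002",
         "Look up a single economic indicator.",
         ["FRED"],
         {"structured"},
         "easy"),
]

# Because attributes are multi-label, the same task is counted under
# every category it carries when tallying tasks per attribute.
multihop = [t.task_id for t in tasks if "multihop" in t.attributes]
print(multihop)  # ['demo-001']
```

A `set` for attributes makes the multi-label counting in the bar chart below straightforward: summing membership tests per attribute yields overlapping counts that can exceed the total number of tasks.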

870
Tasks
29
Websites
10
Domains
7
Task Attributes
2
Tracks (Live + Dockerized)
90%
Human Success Rate
Treemap of the distribution of tasks across the 10 domains and 29 websites. Domains were selected as high-stakes based on subject interviews with journalists, data scientists, and domain experts.
Bar chart of task counts per attribute. Attributes are multi-label, so a single task can span several categories (e.g., multihop + structured + tool-use).
Task difficulty distribution: 247 easy, 275 medium, 348 hard. Agents perform on average 2.5× better on easy tasks than on medium or hard ones.

Domains & Websites

Economics / Markets: BEA, FRED, St Louis FED, Trading Economics, DataUSA
Demographics: WorldPop, Worldometer
Music: Tunebat, MusicBrainz, RIAA, Richard Powers
Tourism, Trade & Airlines: UNWTO, IATA
Higher Education: MIT, UChicago
Scientific Research: arXiv
Sports: Understat
Government & Public Policy: CFPB, Our World in Data, Reddit
Energy & Climate: Climate.gov, NOAA
Health: CDC Mental Health, CDC COVID, CDC Obesity, NIH
E-Commerce: Shopping, Stocknear

Benchmark Comparison

WebDS is the only benchmark that spans all eight task dimensions—multihop reasoning, structured and unstructured data, web navigation, QA, multi-site integration, actions, and tool use.

Dataset: dimensions covered (of the eight above)
SQuAD: 1
WikiTableQ: 2
HotpotQA: 2
WebVoyager: 3
WebArena: 6
GAIA: 5
WebWalker: 5
AssistantBench: 6
WebDS (ours): all 8

Leaderboard

Success Rate (SR %) and LLM-judged score (1–5) of 10 agents on WebDS-live. Success is a binary SUCCESS/UNSUCCESS judgment; the score is a 1–5 integer assigned by an LLM judge that evaluates the full trajectory. Human performance under the same protocol is ~90%.

# Agent Overall Easy Medium Hard

If only a model name is specified, that model is run on the base WebArena agent framework; the BrowserUse and AgentOccam rows use their native frameworks with the adaptations described in the paper.
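The leaderboard metrics above reduce to two aggregates over per-task judge outputs: the fraction of SUCCESS verdicts and the mean 1–5 score. A minimal sketch, assuming judgments arrive as (verdict, score) pairs; the tuple representation and function name are assumptions, not the paper's evaluation code:

```python
# Illustrative aggregation for the WebDS-live protocol described above.
# The SUCCESS/UNSUCCESS labels and 1-5 integer score follow the page's
# description; the data layout here is a hypothetical simplification.
def summarize(judgments: list[tuple[str, int]]) -> tuple[float, float]:
    """Return (success rate in %, mean judge score) over task trajectories."""
    n = len(judgments)
    successes = sum(1 for verdict, _ in judgments if verdict == "SUCCESS")
    success_rate = 100.0 * successes / n
    mean_score = sum(score for _, score in judgments) / n
    return success_rate, mean_score

# Four hypothetical trajectories: two successes, two failures.
judgments = [("SUCCESS", 4), ("UNSUCCESS", 2), ("SUCCESS", 5), ("UNSUCCESS", 1)]
sr, mean_score = summarize(judgments)
print(f"SR = {sr:.1f}%, mean judge score = {mean_score:.2f}")
# SR = 50.0%, mean judge score = 3.00
```

Keeping the binary verdict separate from the graded score lets a trajectory earn partial credit (a high score) even when it fails the strict success criterion.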

Citation

@inproceedings{hsu2026webds,
  title     = {WebDS: An End-to-End Benchmark for Web-based Data Science},
  author    = {Hsu, Ethan and Yam, Hong Meng and Bouissou, Ines and Murali John, Aaron
               and Thota, Raj and Koe, Josh and Putta, Vivek Sarath and Dharesan, G K
               and Spangher, Alexander and Murty, Shikhar and Huang, Tenghao
               and Manning, Christopher D.},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2508.01222}
}