Data science agents must inspect datasets, design workflows, run code, and return verifiable answers, not just autocomplete Pandas code. DSGym, introduced by researchers at Stanford University, Together AI, Duke University, and Harvard University, is a framework that evaluates and trains such agents across over 1,000 data science challenges with expertly curated ground truth and a consistent post-training pipeline.

Why do existing benchmarks fall short?
The research team first examines existing benchmarks that claim to test data-aware agents and finds that models retain much of their accuracy even when the data files are hidden: on average 40.5 percent on QRData, 86.8 percent on DAEval, and 44.4 percent on DiscoveryBench. Many queries can be solved using only priors and pattern matching on text rather than actual data analysis, and the benchmarks also contain annotation errors and inconsistent numerical tolerances.
Task, Agent and Environment
DSGym standardizes assessment around three objects: task, agent, and environment. Tasks are either data analysis or data prediction. Data analysis tasks provide one or more files with natural language questions that must be answered through code. Data prediction tasks provide train and test splits with an explicit metric and require the agent to build a modeling pipeline and output predictions for the test split.
Each task is packaged into a task object that contains data files, query prompts, scoring functions, and metadata. Agents interact through a CodeAct-style loop: at each turn, the agent writes a logic block that describes its plan, a code block that runs inside the environment, and a reply block when it is ready to commit an answer. The environment is implemented as a manager-and-worker cluster of Docker containers, where each worker mounts the data as a read-only volume, exposes a writable workspace, and ships with domain-specific Python libraries.
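To make the packaging concrete, here is a minimal sketch of what such a task object could look like in Python. The `DSGymTask` class, its field names, and the tolerance-based `numeric_exact_match` scorer are illustrative assumptions, not the actual DSGym schema.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical task object; the real DSGym schema may use different fields.
@dataclass
class DSGymTask:
    task_id: str
    task_type: str                   # "analysis" or "prediction"
    data_files: list[str]            # mounted read-only inside the worker container
    query: str                       # natural language question or competition brief
    scorer: Callable[[str], float]   # maps the agent's final reply to a score
    metadata: dict = field(default_factory=dict)

def numeric_exact_match(prediction: str, target: float, rel_tol: float = 1e-2) -> float:
    """Score a numeric answer against ground truth with a relative tolerance."""
    try:
        value = float(prediction.strip().rstrip("%"))
    except ValueError:
        return 0.0
    return 1.0 if abs(value - target) <= rel_tol * max(abs(target), 1e-9) else 0.0

# Example of an analysis task instance (values are placeholders).
task = DSGymTask(
    task_id="qrdata-0042",
    task_type="analysis",
    data_files=["data/survey.csv"],
    query="What is the mean age of respondents who completed the survey?",
    scorer=lambda reply: numeric_exact_match(reply, target=34.7),
    metadata={"source": "QRData", "answer_type": "numeric"},
)
```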
DSGym-Task, DSBio, and DSPredict
On top of this runtime, the DSGym-Task suite collects and refines existing datasets and adds new ones. The research team cleaned up QRData, DAEval, DABStep, MLEBench Lite, and others by removing unscorable items and applying a shortcut filter that drops queries easily solved by multiple models without data access.
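A minimal sketch of how such a shortcut filter could work is shown below. The `shortcut_filter` function, the model list, the `max_blind_solvers` threshold, and the `answer_without_data` helper are hypothetical, not the team's actual implementation.

```python
from typing import Callable

def shortcut_filter(
    tasks: list[dict],
    answer_without_data: Callable[[str, str], str],  # (model, query) -> answer, no data shown
    is_correct: Callable[[dict, str], bool],          # checks an answer against the task's ground truth
    models: tuple[str, ...] = ("model_a", "model_b", "model_c"),
    max_blind_solvers: int = 1,
) -> list[dict]:
    """Drop queries that several models answer correctly without ever seeing the data."""
    kept = []
    for task in tasks:
        blind_solvers = sum(
            is_correct(task, answer_without_data(model, task["query"]))
            for model in models
        )
        # Keep the task only if at most `max_blind_solvers` models solve it blind,
        # i.e. the query genuinely requires looking at the data.
        if blind_solvers <= max_blind_solvers:
            kept.append(task)
    return kept
```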
To cover scientific discovery, they introduce DSBio, a suite of 90 bioinformatics tasks derived from peer-reviewed papers and open source datasets. The tasks cover single-cell analysis, spatial and multi-omics analysis, and human genetics, with deterministic numerical or categorical answers supported by expert reference notebooks.
DSPredict targets modeling on real Kaggle competitions. A crawler collects recent contests that accept CSV submissions and meet size and clarity rules. After preprocessing, the suite is divided into DSPredict Easy, with 38 playground-style and introductory competitions, and DSPredict Hard, with 54 high-complexity challenges. In total, DSGym-Task includes 972 data analysis tasks and 114 prediction tasks.
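The sketch below illustrates how a DSPredict-style prediction task might be scored, assuming a CSV submission keyed by an `id` column and an RMSE metric. The column names, file layout, and metric choice are placeholder assumptions, since each competition defines its own.

```python
import numpy as np
import pandas as pd

def score_submission(submission_path: str, answers_path: str) -> float | None:
    """Validate a CSV submission against hidden answers, then compute the metric."""
    submission = pd.read_csv(submission_path)
    answers = pd.read_csv(answers_path)

    # Validity check: every test id must appear exactly once with a non-null prediction.
    valid = (
        set(submission["id"]) == set(answers["id"])
        and len(submission) == len(answers)
        and not submission["prediction"].isna().any()
    )
    if not valid:
        return None  # counts against the valid-submission rate

    # Apply the competition metric (RMSE used here purely as an example).
    merged = answers.merge(submission, on="id")
    return float(np.sqrt(np.mean((merged["target"] - merged["prediction"]) ** 2)))
```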
What current agents can and cannot do
The evaluation includes closed-source models such as GPT-5.1, GPT-5, and GPT-4o, open-weight models such as Qwen3-Coder-480B, Qwen3-235B-Instruct, and GPT-OSS-120B, and small models such as Qwen2.5-7B-Instruct and Qwen3-4B-Instruct. All run with the same CodeAct agent, temperature 0, and external tools disabled.
On common analysis benchmarks, such as QRData Verified, DAEval Verified, and DABStep's easy split, the top models reach between 60 and 90 percent exact-match accuracy. On DABStep Hard, accuracy drops for every model, showing that multi-step quantitative reasoning over financial tables is still brittle.
DSBio highlights a more serious weakness. Kimi-K2-Instruct achieves the best overall accuracy of 43.33 percent. For all models, 85 to 96 percent of the failures observed on DSBio are domain grounding errors, including misuse of specialized libraries and incorrect biological interpretations, rather than basic coding mistakes.
On MLEBench Lite and DSPredict Easy, most frontier models achieve valid submission rates above 80 percent. On DSPredict Hard, valid submissions rarely exceed 70 percent and medal rates on the Kaggle leaderboard are close to 0 percent. This pattern supports the research team's observation of a simplicity bias, where agents stop after a baseline solution rather than exploring more competitive models and hyperparameters.
DSGym as a data factory and training site
The same environment can also synthesize training data. Starting from a subset of QRData and DABStep, the research team asks agents to explore the datasets, propose questions, solve them with code, and record trajectories, yielding 3,700 synthetic questions. A judge model filters these down to a set of 2,000 high-quality query-plus-trajectory pairs called DSGym-SFT, and fine-tuning the 4B Qwen3-based model on DSGym-SFT produces an agent that reaches performance competitive with GPT-4o on standardized analysis benchmarks, despite having far fewer parameters.
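A rough sketch of this filtering and packaging step, assuming a simple trajectory layout, a scalar judge score, and a chat-message SFT format, might look like the following; none of these details are taken from the paper.

```python
import json

def build_sft_dataset(trajectories: list[dict], judge_scores: list[float],
                      threshold: float = 0.8, cap: int = 2000) -> list[dict]:
    """Keep judge-approved trajectories and convert them into chat-style SFT examples."""
    examples = []
    for traj, score in zip(trajectories, judge_scores):
        if score < threshold:
            continue  # judge model rejects low-quality query/trajectory pairs
        messages = [
            {"role": "system", "content": "You are a data science agent."},
            {"role": "user", "content": traj["query"]},
        ]
        for turn in traj["turns"]:
            # Each CodeAct turn stores the agent's plan plus code and the execution output.
            messages.append({"role": "assistant", "content": turn["agent"]})
            messages.append({"role": "user", "content": turn["execution_output"]})
        messages.append({"role": "assistant", "content": traj["final_answer"]})
        examples.append({"messages": messages})
    return examples[:cap]

def write_jsonl(examples: list[dict], path: str) -> None:
    """Write the examples to JSONL so standard SFT pipelines can consume them."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```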


Key takeaways
- DSGym provides an integrated task, agent and environment framework with containerized execution and a CodeAct style loop to evaluate data science agents on real code based workflows instead of static signals.
- The benchmark suite, DSGym-Task, consolidates and cleans prior datasets and adds DSBio and DSPredict, reaching 972 data analysis tasks and 114 prediction tasks in domains such as finance, bioinformatics, and earth sciences.
- Shortcut analysis on existing benchmarks shows that removing data access results in only a modest reduction in accuracy in many cases, confirming that prior evaluations often measure pattern matching on text rather than actual data analysis.
- Frontier models achieve strong performance on cleaned general analysis tasks and easy prediction tasks, but they perform poorly on DSBio and DSPredict Hard, where most errors come from domain grounding issues and conservative, under-tuned modeling pipelines.
- The DSGym-SFT dataset, constructed from 2,000 filtered synthetic trajectories, enables the 4B Qwen3-based agent to reach GPT-4o-level accuracy on multiple analysis benchmarks, demonstrating that performance-based supervision on structured tasks is an effective way to improve data science agents.

Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.
