StepFun introduces Step-DeepResearch, a 32B-parameter end-to-end deep research agent that aims to turn web search into a real research workflow with long-horizon reasoning, tool use, and structured reporting. The model is built on Qwen2.5-32B-base and is trained to act as a single agent that plans, searches for sources, verifies evidence, and writes cited reports while keeping inference costs low.
From discovery to deep research
Most existing web agents are built for multi-hop question-answering benchmarks: they try to match ground-truth answers to short queries. This is closer to targeted retrieval than to actual research. Real research work is different. It involves latent intent identification, long-horizon decision making, multi-turn tool use, structured reasoning, and cross-source validation under uncertainty.
Step-DeepResearch reframes deep research as sequential decision making over a compact set of core capabilities. The research team defines four of them: planning and task decomposition, in-depth information seeking, reflection and verification, and professional report generation. Instead of orchestrating multiple external agents, the system internalizes this loop into a single model that decides the next action at every step.
Data synthesis around core capabilities
To teach these atomic capabilities, the research team built a separate data pipeline for each skill. For planning, they start with high-quality technical reports, survey papers, and financial analysis documents. They reverse-engineer realistic research plans and task trees from the title, abstract, and section structure, then generate trajectories that follow these plans. This exposes the model not only to short query templates but also to longer-horizon project structures.
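To make the planning pipeline concrete, here is a minimal sketch of reverse-engineering a task tree from a finished document's metadata. This is an illustration under assumptions, not StepFun's actual pipeline: the `llm.complete` interface, the prompt wording, and the JSON schema are all hypothetical.

```python
# Hypothetical sketch of the planning-data pipeline: given a published
# report's metadata, ask a teacher LLM to reconstruct the research plan
# that could have produced it. Interface and prompt are assumptions.
import json
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    goal: str
    subtasks: list["TaskNode"] = field(default_factory=list)

PLAN_PROMPT = """Given this report's title, abstract, and section outline,
reverse-engineer the research plan as a JSON task tree:
title: {title}
abstract: {abstract}
outline: {outline}"""

def reverse_engineer_plan(llm, title: str, abstract: str, outline: list[str]) -> TaskNode:
    """Turn a finished document back into the plan that could have produced it."""
    raw = llm.complete(PLAN_PROMPT.format(title=title, abstract=abstract,
                                          outline="\n".join(outline)))
    tree = json.loads(raw)  # expected shape: {"goal": ..., "subtasks": [...]}
    def build(node: dict) -> TaskNode:
        return TaskNode(node["goal"], [build(s) for s in node.get("subtasks", [])])
    return build(tree)
```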
For in-depth information seeking, they construct graph-based queries over knowledge graphs such as Wikidata5M and CN-DBpedia. They sample subgraphs, expand them using search, and synthesize queries that require multi-hop reasoning across entities and documents. A separate pipeline uses a wiki-style hyperlink index to force cross-document retrieval and evidence collation. Easy queries that a strong model can already solve with a simple ReAct-style strategy are filtered out, so training focuses on the harder search problems.
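The multi-hop query synthesis can be illustrated with a small sketch: sample a relation path from a knowledge graph, hide the intermediate entities, and ask for the tail entity. This is a simplified stand-in for the paper's subgraph sampling and search-based expansion; the `networkx` graph and its `relation` edge attribute are assumptions.

```python
# Illustrative sketch (not the paper's exact pipeline): random-walk a relation
# path through a knowledge graph, then phrase it as a question whose answer
# requires chaining the relations hop by hop.
import random
import networkx as nx

def sample_multihop_query(kg: nx.DiGraph, hops: int = 3, seed: int | None = None):
    rng = random.Random(seed)
    node = rng.choice(list(kg.nodes))
    path, relations = [node], []
    for _ in range(hops):
        nbrs = list(kg.successors(node))
        if not nbrs:
            break
        nxt = rng.choice(nbrs)
        relations.append(kg.edges[node, nxt].get("relation", "related to"))
        path.append(nxt)
        node = nxt
    # Only the head entity is revealed; intermediate entities must be found by search.
    question = (f"Starting from '{path[0]}', follow: " + " -> ".join(relations)
                + ". What entity do you reach?")
    return question, path[-1]  # (query, gold answer)
```

In the same spirit as the paper's filtering step, queries a strong baseline agent already answers would be dropped before training.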
Reflection and verification data is generated through self-correction loops and multi-agent teacher traces. Teacher agents extract claims, plan an investigation, verify facts, re-plan when discrepancies emerge, and then write a report. The resulting trajectories are cleaned and used as supervision for a single student agent. Report generation is trained in two stages: mid-training for domain style and depth using query-report pairs, then supervised fine-tuning with strict formatting and plan-consistency constraints.
Progressive training on Qwen2.5-32B-base
The training pipeline has three stages: agentic mid-training, supervised fine-tuning, and reinforcement learning. In mid-training Phase-1, the team injects atomic capabilities without tools, using a context length of up to 32k tokens. The data includes active reading, synthetic reasoning traces, summaries, and reflections. The research team reports consistent gains on SimpleQA, TriviaQA, and FRAMES as training scales up to approximately 150B tokens, with the largest gains on FRAMES, which emphasizes structured reasoning.
In Phase-2, the context is expanded to 128k tokens and explicit tool calls are introduced. The model learns tasks such as URL-grounded question answering, deep web search, long-document summarization, and long-dialogue reasoning. This phase aligns the model with real research scenarios where search, browsing, and analysis must be interleaved in a single trajectory.
During supervised fine-tuning, the four core capabilities are combined into complete deep information-seeking and deep research trajectories. Data cleaning keeps trajectories that are correct and efficient in steps and tool calls. The pipeline injects controlled tool errors that are then corrected, to improve robustness, and enforces citation formats so that reports stay grounded in retrieved sources.
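A rough sketch of what controlled tool-error injection might look like, assuming trajectories are stored as role-tagged steps; the error catalogue and the scripted recovery turn are illustrative, not the paper's exact scheme.

```python
# Replace one tool observation with a failure, then insert the retry that
# recovers from it, so the model learns error handling from supervision.
import copy
import random

TOOL_ERRORS = [
    '{"error": "HTTP 429: rate limited, retry later"}',
    '{"error": "timeout after 30s"}',
    '{"error": "malformed tool arguments"}',
]

def inject_tool_error(trajectory: list[dict], rng: random.Random) -> list[dict]:
    traj = copy.deepcopy(trajectory)
    obs_idx = [i for i, s in enumerate(traj) if s["role"] == "tool"]
    if not obs_idx:
        return traj
    i = rng.choice(obs_idx)
    original = traj[i]["content"]
    traj[i] = {"role": "tool", "content": rng.choice(TOOL_ERRORS)}
    traj.insert(i + 1, {"role": "assistant",
                        "content": "The tool call failed; retrying with the same arguments."})
    traj.insert(i + 2, {"role": "tool", "content": original})  # successful retry
    return traj
```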
Reinforcement learning then adapts the agent to the real tool environment. The research team creates tasks and checklists through reverse synthesis and trains a checklist-style rubric judge that scores reports along fine-grained dimensions. The reward design converts ternary rubric labels into asymmetric binary rewards that capture both positive goals and violations. The policy is trained with PPO and a learned critic, using generalized advantage estimation with near-zero discounting so that long trajectories are not implicitly penalized.
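The ternary-to-asymmetric-binary reward mapping could look roughly like the following sketch. The label set and the weights are assumptions; the paper only states that ternary rubric labels become asymmetric binary rewards covering both positive goals and violations.

```python
# Hedged sketch of the reward mapping: ternary rubric labels collapse into two
# binary channels, one for meeting positive criteria and one for violating
# constraints, weighted asymmetrically. Labels and weights are assumed.
from enum import Enum

class Label(Enum):
    SATISFIED = 1   # rubric item clearly met
    UNMET = 0       # item not addressed
    VIOLATED = -1   # item actively contradicted

def rubric_reward(labels: list[Label],
                  pos_weight: float = 1.0,
                  neg_weight: float = 2.0) -> float:
    """Violations are penalized more heavily than unmet items, so the policy
    prefers omitting a claim over fabricating one."""
    hits = sum(l is Label.SATISFIED for l in labels)
    violations = sum(l is Label.VIOLATED for l in labels)
    n = max(len(labels), 1)
    return pos_weight * hits / n - neg_weight * violations / n
```

Combined with near-zero discounting in GAE, this kind of shaping rewards thorough, verified reports instead of pushing the agent toward short trajectories.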
Single-agent ReAct architecture and search stack
At inference time, Step-DeepResearch runs as a single ReAct-style agent that alternates thinking, tool calls, and observations until it decides to output a report. The tool set includes batch web search, a task manager, shell commands, and file operations. Execution runs in a sandbox with terminal persistence via tmux. A perception-oriented browser reduces redundant page captures by using perceptual hash distance. Tools support multimodal input for document parsing, audio transcription, and image analysis.
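A minimal single-agent ReAct loop, for intuition only: the `llm.chat` interface, the JSON action format, and the tool registry below are assumptions standing in for Step-DeepResearch's actual runtime.

```python
# Minimal ReAct loop: the model alternates thinking and tool calls until it
# decides to emit a final report. Tool functions are looked up in a registry.
import json

TOOLS = {}  # e.g. {"web_search": fn, "task_manager": fn, "shell": fn, "file_ops": fn}

def react_loop(llm, task: str, max_steps: int = 50) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = llm.chat(messages)           # model thinks, then emits an action
        messages.append({"role": "assistant", "content": reply})
        action = json.loads(reply)           # {"tool": ..., "args": ...} or {"report": ...}
        if "report" in action:               # the agent decides it is done
            return action["report"]
        observation = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": str(observation)})
    return "max steps reached without a final report"
```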
Information acquisition uses two complementary resources. The StepFun team says its search API is built on over 20M high-quality papers and 600 premium indices. The team also describes a curated authority-indexing strategy that isolates over 600 trusted domains, including government, academic, and institutional sites. Retrieval operates at the article level and uses authority-aware ranking so that higher-trust domains are prioritized when relevance is comparable.
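Authority-aware ranking can be sketched as a re-ranking pass where relevance dominates and a domain-trust prior breaks ties; the trust table and weight below are invented placeholders for the curated 600+ domain index.

```python
# Illustrative re-ranking: relevance score first, domain trust as a boost that
# separates hits of comparable relevance. Trust values here are made up.
from urllib.parse import urlparse

DOMAIN_TRUST = {"nature.com": 1.0, "arxiv.org": 0.9, "example.gov": 0.95}

def authority_rerank(results: list[dict], trust_weight: float = 0.2) -> list[dict]:
    """results: [{"url": str, "relevance": float in [0, 1]}, ...]"""
    def score(r: dict) -> float:
        domain = urlparse(r["url"]).netloc.removeprefix("www.")
        trust = DOMAIN_TRUST.get(domain, 0.3)  # default prior for unknown domains
        return r["relevance"] + trust_weight * trust
    return sorted(results, key=score, reverse=True)
```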
File tools support patch-based editing, so the agent can update only the modified sections of a report. A summary-aware storage scheme writes full tool outputs to local files and injects only compact summaries into the context. This acts as external memory and prevents context overflow on long projects.
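A compact sketch of summary-aware storage, assuming a `summarizer` callable: the full tool output is persisted to disk as external memory, while only a short summary plus the file path enters the model's context.

```python
# Persist the full tool output; return only a pointer plus a compact summary.
# The agent can re-open the file later via its file tools if it needs details.
import hashlib
from pathlib import Path

WORKDIR = Path("agent_memory")

def store_and_summarize(tool_output: str, summarizer, max_chars: int = 500) -> str:
    WORKDIR.mkdir(exist_ok=True)
    path = WORKDIR / f"{hashlib.sha1(tool_output.encode()).hexdigest()[:12]}.txt"
    path.write_text(tool_output)                   # full output kept on disk
    summary = summarizer(tool_output)[:max_chars]  # only the gist enters context
    return f"[saved to {path}] {summary}"
```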
Assessment, cost and access
To measure deep research behavior, the team introduces ADR-Bench, a Chinese benchmark with 110 open-ended tasks across 9 domains. 70 tasks cover general domains such as education, science and engineering, and social life, and are evaluated by experts through side-by-side comparisons. The remaining 40 tasks in finance and law are scored with explicit rubrics that enforce atomicity and verifiability constraints.
On Scale AI's Research Rubrics, Step-DeepResearch reaches 61.42 percent rubric compliance, which is comparable to OpenAI Deep Research and Gemini Deep Research and clearly ahead of many open and proprietary baselines. On ADR-Bench, expert-based Elo ratings show that the 32B model outperforms larger open models such as MiniMax-M2, GLM-4.6, and DeepSeek-V3.2, and is competitive with systems such as Kimi-Researcher and MiniMax-Agent-Pro.
Key takeaways
- Single agent, core-capability design: Step-DeepResearch is a 32B-parameter single agent built on Qwen2.5-32B-base. It internalizes four atomic capabilities, planning, deep information seeking, reflection and verification, and professional report generation, instead of relying on multiple external agents.
- Targeted data synthesis for each skill: The research team builds separate data pipelines for planning, deep information seeking, reflection, and report writing, using reverse-engineered plans from real reports, graph-based queries over Wikidata5M and CN-DBpedia, multi-agent teacher traces, and strict report-formatting data.
- Three-stage training with long context and RL: Training combines mid-training, supervised fine-tuning, and reinforcement learning. Mid-training scales to roughly 150B tokens at 32k and then 128k context, SFT builds full deep research trajectories, and PPO-based RL optimizes reports against a fine-grained checklist rubric judge.
- ReAct architecture with curated search and external memory: At inference time the model runs a ReAct loop that calls tools for batch web search, task management, shell, and file operations, uses a search API built on over 20M papers and 600 premium indices with 600+ trusted domains, and relies on patch-based editing and summary-aware storage as external memory.
- Competitive quality at low cost: On Scale AI's Research Rubrics the model reaches 61.42 percent rubric compliance and is competitive with OpenAI Deep Research and Gemini Deep Research; on ADR-Bench it achieves a 67.1 percent win-or-tie rate against strong baselines.
Check out the paper and repo.
