Meta and Harvard researchers introduce the Confucius Code Agent (CCA): a software engineering agent that can operate on massive codebases

How far can a mid-sized language model go if the real innovation moves from the backbone to the agent scaffolding and tool stack? Researchers from Meta and Harvard have released the Confucius Code Agent, an open source AI software engineer built on the Confucius SDK and designed for industrial-scale software repositories and long-running sessions. The system targets real GitHub projects, complex test toolchains at evaluation time, and reproducible results on benchmarks such as SWE Bench Pro and SWE Bench Verified, while exposing the full scaffolding to developers.

Confucius SDK, scaffolding around models

Confucius SDK is an agent development platform that treats scaffolding as the primary design problem, rather than a thin wrapper around the language model. It is organized around 3 axes: Agent Experience, User Experience, and Developer Experience.

Agent Experience: controls what the model sees, including the context layout, working memory, and tool results.
User Experience: focuses on readable traces, code diffs, and safeguards for human engineers.
Developer Experience: focuses on observability, configuration, and debugging of the agent itself.

The SDK introduces 3 core mechanisms: a unified orchestrator with hierarchical working memory, a persistent note taking system, and a modular extension interface for tools. A meta agent then automates the synthesis and refinement of agent configurations through a build, test, improve loop. The Confucius Code Agent is a concrete instance of this scaffolding for software engineering.

Hierarchical working memory for long horizon coding

Real software tasks on SWE Bench Pro often require reasoning over dozens of files and many interaction steps. The orchestrator in the Confucius SDK maintains a hierarchical working memory, which partitions a trajectory into scopes, summarizes earlier steps, and keeps compressed context for later turns.

This helps keep prompts within model context limits while preserving important artifacts such as patches, error logs, and design decisions. The key point is that effective tool-using coding agents need an explicit memory architecture, not just a sliding window of previous messages.
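To make the idea of scopes and summaries concrete, here is a minimal Python sketch. It is not the Confucius SDK implementation: the `Scope` and `WorkingMemory` classes, the `pin` flag, and the `naive_summarize` stand-in (which would be a model call in practice) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scope:
    """One segment of the trajectory, for example 'reproduce the bug'."""
    name: str
    messages: list = field(default_factory=list)
    summary: Optional[str] = None  # filled in once the scope is closed


class WorkingMemory:
    """Hypothetical hierarchical memory: closed scopes survive only as summaries."""

    def __init__(self, summarize):
        self.summarize = summarize  # callable: list of messages -> summary string
        self.scopes = []
        self.pinned = []  # artifacts kept verbatim, such as patches or error logs

    def open_scope(self, name):
        # Opening a new scope closes the previous one and compresses it.
        if self.scopes and self.scopes[-1].summary is None:
            last = self.scopes[-1]
            last.summary = self.summarize(last.messages)
        self.scopes.append(Scope(name))

    def add(self, message, pin=False):
        self.scopes[-1].messages.append(message)
        if pin:
            self.pinned.append(message)

    def render_context(self):
        lines = []
        for scope in self.scopes:
            if scope.summary is not None:
                lines.append(f"[{scope.name}] summary: {scope.summary}")
            else:
                lines.append(f"[{scope.name}]")
                lines.extend(scope.messages)
        if self.pinned:
            lines.append("[pinned artifacts]")
            lines.extend(self.pinned)
        return "\n".join(lines)


def naive_summarize(messages):
    # Stand-in for a model-written summary: step count plus the last message.
    last = messages[-1] if messages else "none"
    return f"{len(messages)} steps, last: {last}"


memory = WorkingMemory(naive_summarize)
memory.open_scope("reproduce")
memory.add("ran pytest tests/test_io.py")
memory.add("AssertionError in test_read_empty", pin=True)
memory.open_scope("fix")
memory.add("edited io/reader.py to handle empty input")
context = memory.render_context()
print(context)
```

In this sketch the raw messages of the closed "reproduce" scope no longer appear in the rendered context, while the pinned error message is kept verbatim. That is the difference from a sliding window, which would drop old messages regardless of their importance.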

Persistent note taking for cross session learning

The second mechanism is a note taking system that uses a dedicated agent to write structured Markdown notes from the execution trace. These notes capture task specific strategies, repository conventions, and common failure modes, and they are stored as long-term memory that can be reused across sessions.

The research team ran the Confucius Code Agent twice on 151 SWE Bench Pro instances with Claude 4.5 Sonnet. In the first run the agent solves the tasks from scratch and generates notes. In the second run the agent reads these notes before starting. In this setting, average turns drop from 64 to 61, token usage drops from about 104k to 93k, and Resolve@1 improves from 53.0 to 54.4. This shows that the notes are not just logs; they act as effective cross session memory.
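The two-run setup can be sketched in a few lines of Python. This is an illustration under assumptions, not the SDK's note format: the one-file-per-repository layout, the function names `write_note`, `load_notes`, and `build_prompt`, and the repository name `example-repo` are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_note(notes_dir, repo, title, bullets):
    """Append one structured Markdown section to the repository's notes file."""
    path = notes_dir / f"{repo}.md"
    section = f"## {title}\n" + "".join(f"- {b}\n" for b in bullets) + "\n"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(section)
    return path


def load_notes(notes_dir, repo):
    """Return stored notes for a repository, or an empty string on the first run."""
    path = notes_dir / f"{repo}.md"
    return path.read_text(encoding="utf-8") if path.exists() else ""


def build_prompt(task, notes):
    # Notes from earlier sessions are prepended so the agent can reuse them.
    header = f"# Notes from earlier sessions\n{notes}" if notes else ""
    return f"{header}# Task\n{task}\n"


with tempfile.TemporaryDirectory() as tmp:
    notes_dir = Path(tmp)
    # Run 1: no notes exist yet, so the prompt contains only the task.
    first_prompt = build_prompt("Fix flaky test", load_notes(notes_dir, "example-repo"))
    # A note-taking agent would distill this from the trace; here it is hard-coded.
    write_note(notes_dir, "example-repo", "Failure modes", ["Tests need the DB fixture"])
    # Run 2: the same task now starts with the stored notes in context.
    second_prompt = build_prompt("Fix flaky test", load_notes(notes_dir, "example-repo"))

print(second_prompt)
```

Storing notes as plain Markdown keeps them readable for human engineers as well as for the agent, which matches the structured Markdown format described above.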

Modular extensions and tool usage sophistication

The Confucius SDK implements tools as extensions, for example file editing, command execution, test runners, and code search. Each extension can maintain its own state and prompt wiring.

The research team studies the effect of tool usage sophistication with an ablation on a 100 example subset of SWE Bench Pro. With Claude 4 Sonnet, moving from a configuration without advanced context features to one with advanced context raises Resolve@1 from 42.0 to 48.6. With Claude 4.5 Sonnet, a simple tool use configuration reaches 44.0, while rich tool handling reaches 51.6, with 51.0 for an intermediate version. These numbers indicate that how the agent selects and sequences tools matters almost as much as the choice of backbone model.
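A rough Python sketch of the extension idea follows. The interface shown here (a `name`, a `prompt_fragment` method, and a `run` method) is an assumption made for illustration and is not taken from the Confucius SDK; `CodeSearch`, `FileView`, and `Agent` are hypothetical classes.

```python
class CodeSearch:
    """Hypothetical search extension over an in-memory file map."""
    name = "code_search"

    def __init__(self, files):
        self.files = files
        self.calls = 0  # extension-local state

    def prompt_fragment(self):
        return "code_search(pattern): list files whose text contains pattern"

    def run(self, argument):
        self.calls += 1
        hits = sorted(path for path, text in self.files.items() if argument in text)
        return "\n".join(hits) if hits else "no matches"


class FileView:
    """Hypothetical extension that returns the contents of one file."""
    name = "file_view"

    def __init__(self, files):
        self.files = files

    def prompt_fragment(self):
        return "file_view(path): show the contents of one file"

    def run(self, argument):
        return self.files.get(argument, f"no such file: {argument}")


class Agent:
    """Collects extensions and wires their prompt fragments into one system prompt."""

    def __init__(self, extensions):
        self.extensions = {ext.name: ext for ext in extensions}

    def system_prompt(self):
        tools = "\n".join(f"- {ext.prompt_fragment()}" for ext in self.extensions.values())
        return "Available tools:\n" + tools

    def call(self, tool, argument):
        if tool not in self.extensions:
            return f"unknown tool: {tool}"
        return self.extensions[tool].run(argument)


files = {"io/reader.py": "def read(path): ...", "io/writer.py": "def write(path, data): ..."}
search = CodeSearch(files)
agent = Agent([search, FileView(files)])
result = agent.call("code_search", "def read")
print(agent.system_prompt())
print(result)
```

Because each extension carries its own prompt text and state, adding or removing a tool changes the agent's system prompt without touching the orchestrator, which is the kind of modularity the paragraph above describes.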

Meta agent for automated agent design

On top of these mechanisms, the Confucius SDK includes a meta agent that takes a natural language specification of an agent and iteratively proposes configurations, prompts, and extension sets. It then runs the candidate agent on tasks, inspects traces and metrics, and edits the configuration in a build, test, improve loop.

The Confucius Code Agent that the research team evaluates was produced with the help of this meta agent, rather than being purely hand tuned. This approach turns part of the agent engineering process itself into an LLM guided optimization problem.
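The build, test, improve loop can be sketched as a simple search over configurations. Everything below is a toy under stated assumptions: `AgentConfig`, `meta_agent_loop`, `propose`, and `evaluate` are hypothetical names, the proposal step is a fixed rule where the real meta agent would use a language model, and the evaluation is a stand-in for running the candidate agent on benchmark tasks.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    prompt: str
    extensions: tuple


def meta_agent_loop(initial, propose, evaluate, rounds):
    """Keep the best-scoring configuration seen across propose/evaluate rounds."""
    best, best_score = initial, evaluate(initial)
    history = [(initial, best_score)]
    for _ in range(rounds):
        candidate = propose(best, history)    # build
        score = evaluate(candidate)           # test
        history.append((candidate, score))
        if score > best_score:                # improve
            best, best_score = candidate, score
    return best, best_score, history


TOOL_POOL = ["file_edit", "command_exec", "test_runner", "code_search"]
TOY_TASKS = ["file_edit", "test_runner", "code_search", "file_edit"]


def propose(best, history):
    # Stand-in for an LLM proposal: add the first tool not yet tried on this base.
    tried = {config.extensions for config, _ in history}
    for tool in TOOL_POOL:
        new_extensions = best.extensions + (tool,)
        if tool not in best.extensions and new_extensions not in tried:
            return replace(best, extensions=new_extensions)
    return best


def evaluate(config):
    # Toy metric: fraction of tasks whose required tool is available.
    return sum(task in config.extensions for task in TOY_TASKS) / len(TOY_TASKS)


initial = AgentConfig(prompt="You are a software engineering agent.", extensions=())
best, best_score, history = meta_agent_loop(initial, propose, evaluate, rounds=5)
print(best.extensions, best_score)
```

In this toy run the loop discards `command_exec`, which never improves the score, and keeps the three tools the tasks need. The real system inspects traces and metrics rather than a single number, but the accept-if-better structure is the same basic idea.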

Results on SWE Bench Pro and SWE Bench Verified

The main evaluation uses SWE Bench Pro, which contains 731 GitHub issues that require modifying real repositories until the tests pass. All compared systems share the same repositories, tool environment, and evaluation harness, so differences come from the scaffolding and the models.

On SWE Bench Pro, the reported Resolve@1 scores are:

Claude 4 Sonnet with SWE Agent, 42.7
Claude 4 Sonnet with Confucius Code Agent, 45.5
Claude 4.5 Sonnet with SWE Agent, 43.6
Claude 4.5 Sonnet with Live SWE Agent, 45.8
Claude 4.5 Sonnet with Confucius Code Agent, 52.7
Claude 4.5 Opus with Anthropic system card scaffold, 52.0
Claude 4.5 Opus with Confucius Code Agent, 54.3

These results show that strong scaffolding with a mid-tier model, Claude 4.5 Sonnet with the Confucius Code Agent at 52.7, can outperform a stronger model with weaker scaffolding, Claude 4.5 Opus at 52.0.

On SWE Bench Verified, the Confucius Code Agent with Claude 4 Sonnet reaches a Resolve@1 of 74.6, compared with 66.6 for SWE Agent and 72.8 for OpenHands. mini SWE agent with Claude 4.5 Sonnet reaches 70.6, which is still below the Confucius Code Agent with Claude 4 Sonnet.

The research team also reports performance by number of edited files. For tasks that edit 1 to 2 files, the Confucius Code Agent reaches 57.8 Resolve@1; for 3 to 4 files it reaches 49.2; for 5 to 6 files, 44.1; for 7 to 10 files, 52.6; and for more than 10 files, 44.4. This indicates reasonably stable behavior on multi file changes in large codebases.
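To see how much of each score is attributable to scaffolding at a fixed backbone, the short Python snippet below computes the gap between the Confucius Code Agent and the best other scaffold for each model. The numbers are the Resolve@1 scores listed above; the `scaffold_gain` helper is a hypothetical name used only for this illustration.

```python
scores = {
    "Claude 4 Sonnet": {"SWE Agent": 42.7, "Confucius Code Agent": 45.5},
    "Claude 4.5 Sonnet": {
        "SWE Agent": 43.6,
        "Live SWE Agent": 45.8,
        "Confucius Code Agent": 52.7,
    },
    "Claude 4.5 Opus": {
        "Anthropic system card scaffold": 52.0,
        "Confucius Code Agent": 54.3,
    },
}


def scaffold_gain(by_scaffold, target="Confucius Code Agent"):
    """Return the strongest other scaffold and the target's lead over it, in points."""
    baselines = {name: s for name, s in by_scaffold.items() if name != target}
    best_name = max(baselines, key=baselines.get)
    return best_name, round(by_scaffold[target] - baselines[best_name], 1)


gains = {model: scaffold_gain(by_scaffold) for model, by_scaffold in scores.items()}
for model, (baseline, gain) in gains.items():
    print(f"{model}: +{gain} points over {baseline}")
```

These are differences between single reported scores, so they say nothing about run-to-run variance; they only restate the reported table in terms of scaffold gaps.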

Key takeaways

Scaffolding can outweigh model size: The Confucius Code Agent shows that with strong scaffolding, Claude 4.5 Sonnet reaches 52.7 Resolve@1 on SWE Bench Pro, surpassing Claude 4.5 Opus with a weaker scaffold at 52.0.
Hierarchical working memory is essential for long horizon coding: The Confucius SDK orchestrator uses hierarchical working memory and context compression to manage long trajectories on large repositories, rather than relying on a simple rolling history.
Persistent notes act as effective cross session memory: On 151 SWE Bench Pro tasks with Claude 4.5 Sonnet, reusing structured notes reduces average turns from 64 to 61 and token usage from about 104k to 93k, and raises Resolve@1 from 53.0 to 54.4.
Tool configuration materially affects success rates: On the 100 task SWE Bench Pro subset, moving from simple to rich tool handling with Claude 4.5 Sonnet increases Resolve@1 from 44.0 to 51.6, indicating that tool selection and sequencing strategies are a key performance lever, not just an implementation detail.
Meta agent automates agent design and tuning: A meta agent iteratively proposes prompts, tool sets, and configurations, then evaluates and edits them in a build, test, improve loop, and the evaluated Confucius Code Agent was produced with this process rather than by manual tuning alone.


Asif Razzaq is the CEO of Marktechpost Media Inc. As an entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of Marktechpost, an Artificial Intelligence media platform known for in-depth coverage of machine learning and deep learning news that is both technically sound and accessible to a wide audience. The platform has over 2 million monthly views.
