Meet ‘AutoAgent’: the open-source library that lets an AI engineer customize his or her own agent harness overnight

by ai-intensify

There is one kind of grind every AI engineer knows well: the prompt-tuning loop. You write a system prompt, run your agent against the benchmark, read the failure traces, tweak the prompt, add a tool, run again. Repeat a few dozen times and you can move the needle. It is grunt work done in Python files. Now a new open-source library called AutoAgent, created by Kevin Gu of thirdlayer.inc, offers a provocative alternative – don't do the job yourself. Let an AI do it.

AutoAgent is an open-source library for autonomously improving an agent in any domain. In a 24-hour run, it reached #1 on SpreadsheetBench with a score of 96.5% and achieved the top GPT-5 score on TerminalBench with 55.1%.

https://x.com/kevingu/status/2039843234760073341

What exactly is AutoAgent?

AutoAgent has been described as 'like AutoResearch, but for agent engineering'. The idea: give a meta-agent a target task and let it work on one agent autonomously through the night. It modifies the system prompts, tools, agent configuration, and orchestration; runs benchmarks; checks the scores; commits or discards the changes; and repeats.

To understand the analogy: Andrej Karpathy's AutoResearch does the same for ML training – it loops through propose-train-evaluate cycles, keeping only changes that improve validation loss. AutoAgent ports that ratchet loop from ML training to agent engineering. Instead of optimizing a model's weights or training hyperparameters, it optimizes the harness – the system prompts, tool definitions, routing logic, and orchestration strategies that determine how an agent behaves on a task.
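The keep-if-better ratchet loop can be sketched in a few lines of Python. Here `propose` and `evaluate` are stand-ins for the meta-agent's edit step and the benchmark run – the names and shape are illustrative, not AutoAgent's actual API:

```python
def ratchet_loop(propose, evaluate, initial, iterations=10):
    """Keep a proposed change only if it improves the benchmark score."""
    best = initial
    best_score = evaluate(best)
    for _ in range(iterations):
        candidate = propose(best)       # meta-agent edits the harness
        score = evaluate(candidate)     # run the benchmark, read the score
        if score > best_score:          # commit the improvement...
            best, best_score = candidate, score
        # ...otherwise discard it and retry from the best-known harness
    return best, best_score

# Toy demo: the "harness" is a number, the score is closeness to 100.
best, score = ratchet_loop(
    propose=lambda h: h + 1,
    evaluate=lambda h: -abs(100 - h),
    initial=0,
    iterations=150,
)
# best == 100, score == 0: proposals past 100 scored worse and were discarded
```

The key property is monotonicity: the harness can only get better on the metric, never worse, which is what makes an unattended overnight run safe.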

A harness, in this context, is the scaffolding around the LLM: the system prompts it receives, the tools it can call, how it routes between sub-agents, and how tasks are formatted as input. Most agent engineers build this scaffolding by hand. AutoAgent automates iteration on the scaffolding itself.

Architecture: two agents, one file, one instruction

The GitHub repo's structure is intentionally simple. A single file, agent.py, contains the entire harness under test – configuration, tool definitions, agent registry, routing/orchestration, and the Harbor adapter boundary. The adapter section is clearly marked as fixed; the rest is the primary editing surface for the meta-agent. program.md contains the instructions for the meta-agent (what kind of agent to build), and it is the only file a human edits.
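To make the "single editing surface" idea concrete, here is a hypothetical sketch of how such a single-file harness might be laid out – the section names, registry shape, and fixed-adapter marker are assumptions for illustration, not the repo's actual code:

```python
# Hypothetical single-file harness layout (illustrative, not AutoAgent's code).

SYSTEM_PROMPT = "You are a careful spreadsheet agent. Verify before writing."

TOOLS = {  # tool name -> callable; the meta-agent may add or rewrite entries
    "read_file": lambda path: open(path).read(),
}

AGENTS = {  # agent registry: name -> (system prompt, allowed tool names)
    "main": (SYSTEM_PROMPT, ["read_file"]),
}

def route(task: str) -> str:
    """Orchestration: pick which registered agent handles the task."""
    return "main"

# --- Harbor adapter (FIXED: the meta-agent must not edit below this line) ---
def run(task: str) -> str:
    name = route(task)
    prompt, tool_names = AGENTS[name]
    # A real harness would call the LLM here; this stub just echoes the routing.
    return f"[{name}] {prompt[:20]}... handling: {task}"
```

Everything above the marker is fair game for automated rewriting; the adapter below it stays stable so the benchmark runner always knows how to invoke the agent.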

Think of it as a separation of concerns between human and machine. The human sets the direction in program.md. The meta-agent (a separate, higher-level AI) then reads that instruction, inspects agent.py, runs benchmarks, diagnoses what failed, rewrites the relevant parts of agent.py, and repeats. The human never touches agent.py directly.

An important piece of infrastructure that keeps the loop consistent across iterations is results.tsv – an experiment log automatically created and maintained by the meta-agent. It tracks each experiment, giving the meta-agent a history to learn from when deciding what to try next. The full project structure also includes a Dockerfile.base, an optional .agent/ directory for reusable agent workspace artifacts such as hints and skills, a tasks/ folder for the benchmark payload (added per benchmark branch), and a jobs/ directory for Harbor job output.

The metric is the aggregate score produced by the benchmark's test suites. The meta-agent hill-climbs on this score. Each experiment yields a numerical score: keep the change if it is better, discard it if not – the same ratchet loop as AutoResearch.

Task Format and Harbor Integration

Benchmarks are expressed as tasks in the Harbor format. Each task resides under tasks/my-task/ and includes a task.toml for configuration such as timeouts and metadata, an instruction.md that is the prompt sent to the agent, a tests/ directory with a test.sh entry point that writes a numeric score to /logs/reward.txt, and a test.py for verification via deterministic checks or LLM-as-judge. An environment/Dockerfile defines the task's container, and a files/ directory holds reference files installed into the container. The test verifiers write a score between 0.0 and 1.0 to the log; the meta-agent takes it from there.

The LLM-as-judge pattern here is worth noting: instead of only checking answers deterministically (like unit tests), the test suite can use another LLM to evaluate whether the agent's output is 'good enough'. This is common in agentic benchmarks where correctness cannot be reduced to string matching.
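The judge pattern reduces to: build a grading prompt, ask a model, parse a number. In this sketch, `judge` is any callable standing in for a real model API call (the prompt wording and clamping are assumptions, not AutoAgent's implementation):

```python
def judge_score(task_instruction: str, agent_output: str, judge) -> float:
    """LLM-as-judge sketch: `judge` maps a grading prompt to a model reply;
    we parse a 0.0-1.0 score from the reply and clamp it into range."""
    prompt = (
        "Grade the answer against the task on a 0.0-1.0 scale. "
        "Reply with only the number.\n"
        f"Task: {task_instruction}\nAnswer: {agent_output}"
    )
    reply = judge(prompt)
    try:
        return min(1.0, max(0.0, float(reply.strip())))
    except ValueError:
        return 0.0  # unparseable verdicts count as failure
```

In a real Harbor task, `judge` would wrap an API call to a grading model, and the returned score would be written to /logs/reward.txt by the test entry point.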

Key takeaways

  • Autonomous harness engineering works – AutoAgent shows that a meta-agent can replace the human prompt-tuning loop, rewriting agent.py overnight without any human touching the harness files directly.
  • Benchmark results validate the approach – In a 24-hour run, AutoAgent achieved #1 on SpreadsheetBench (96.5%) and the top GPT-5 score on TerminalBench (55.1%), beating entries hand-engineered by humans.
  • ‘Model empathy’ may be a real phenomenon – A Claude meta-agent tuning a Claude task agent appears to diagnose failures more accurately than it does a GPT-based agent, suggesting that same-family model pairing may matter when designing your AutoAgent loop.
  • The human’s job shifts from engineer to director – you don’t write or edit agent.py; you write program.md, a simple Markdown directive that steers the meta-agent. The distinction reflects a broader shift in agentic engineering, from writing code to setting goals.
  • It’s plug-and-play with any benchmark – Because tasks follow Harbor’s open format and agents run in Docker containers, AutoAgent is domain-agnostic. Any scorable task – spreadsheets, terminal commands, or your own custom domain – can become a target for autonomous self-optimization.

Check out the GitHub repo.


