Our previous article framed the Model Context Protocol (MCP) as a toolbox that provides AI agents with tools, and agent skills as content that teaches agents how to complete tasks. This is different from pre- or post-training, which determines a model's general behavior and expertise. Agent skills do not "train" agents. They soft-fork agent behavior at runtime, telling the model how to perform specific tasks it may need to complete.
The term soft fork comes from open-source development. A soft fork is a backward-compatible change that does not require upgrading every layer of the stack. Applied to AI, this means that skills modify agent behavior through context injection at runtime rather than by changing model weights or refactoring the AI system. The underlying models and systems remain unchanged.
The architecture maps cleanly onto traditional computing. Models are CPUs: they provide the raw intelligence and computational capability. Agent harnesses like Anthropic's Claude Code are operating systems: they manage resources, handle permissions, and coordinate processes. Skills are applications: they run on top of the OS, specializing the system for specific tasks without modifying the underlying hardware or kernel.
You do not recompile the Linux kernel to run a new application. You do not redesign the CPU to use a different text editor. You install a new application on top, drawing on the CPU's capability as exposed and organized by the OS. Agent skills work the same way. They layer expertise on top of the agent harness, using the capabilities the model provides, without updating the model or changing the harness.
This distinction matters because it changes the economics of AI expertise. Fine-tuning demands significant investment in talent, compute, and data, plus ongoing maintenance every time the base model is updated. Skills require only Markdown files and resource bundles.
How do soft forks work?
Skills achieve this through three mechanisms: the skill package format, progressive disclosure, and execution context modification.
A skill package is a folder. At a minimum, it includes a SKILL.md file with frontmatter metadata and instructions. The frontmatter announces the skill's name, description, allowed-tools, and version. Then comes the actual expertise: context, problem-solving approaches, success criteria, and patterns to follow.
Frontmatter lives at the top of the Markdown file, and agents choose skills based on their description. The folder may also contain reference documents, templates, resources, configurations, and executable scripts. It holds everything an agent needs to perform expert-level work on a specific task, packaged as a versioned artifact that you can review, approve, and deploy as a .zip or .skill bundle.
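As a concrete sketch, a minimal SKILL.md might look like the following; the skill name, description, field values, and the referenced scripts/render.py helper are invented for illustration:

```markdown
---
name: pdf-report
description: Generate polished PDF reports from structured data
allowed-tools: Read, Write, Bash
version: 1.0.0
---

# PDF report generation

When asked to produce a report:
1. Read the source data with the Read tool.
2. Render it using the bundled scripts/render.py template.
3. Write the output PDF and summarize what was produced.
```

The frontmatter block between the `---` markers is what gets announced to the agent; everything below it is the expertise that loads on invocation.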

Anthropic's skill-creator skill, for example, ships SKILL.md, LICENSE.txt, Python scripts, and reference files. Because the skill package format is just folders and files, you can use all the tooling we have built to manage code: tracking changes in Git, rolling back bugs, maintaining audit trails, and applying the best practices of the software development lifecycle. The same format is also used to define sub-agents and agent teams, meaning a single packaging abstraction covers individual expertise, delegated workflows, and multi-agent coordination alike.
Progressive disclosure keeps skills lightweight. Only the SKILL.md frontmatter is loaded into the agent's context at session start, respecting the token economics of a limited context window. That metadata includes name, description, model, license, version, and, most importantly, allowed-tools. The full skill content is loaded only when the agent determines the skill is relevant and decides to invoke it. This is similar to how operating systems manage memory: applications are loaded into RAM when launched, not all at once. You can have dozens of skills available without overwhelming the model's context window, and the behavioral modification is present only when needed, never permanently.
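A rough sketch of the mechanism, with invented names and data structures rather than Claude Code's actual implementation:

```python
# Hypothetical sketch of progressive disclosure: only frontmatter metadata
# enters the context at session start; the full skill body loads on demand.

def parse_frontmatter(text: str) -> dict:
    """Extract key: value pairs from the block between '---' markers."""
    meta, in_block = {}, False
    for line in text.strip().splitlines():
        if line.strip() == "---":
            if in_block:
                break  # closing marker: stop before the skill body
            in_block = True
            continue
        if in_block and ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

# Invented skill store: name -> full SKILL.md text.
SKILLS = {
    "pdf-report": (
        "---\nname: pdf-report\ndescription: Generate PDF reports\n"
        "allowed-tools: Read, Write\n---\nFull instructions here..."
    ),
}

# Session start: load only lightweight metadata for every installed skill.
catalog = {name: parse_frontmatter(body) for name, body in SKILLS.items()}

def invoke(skill_name: str) -> str:
    """Only when the agent decides a skill is relevant is the body loaded."""
    return SKILLS[skill_name]
```

The catalog stays small no matter how many skills are installed; the context cost of a dormant skill is a few lines of metadata.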

Execution context modification controls what a skill can do. When an agent invokes a skill, the permission system changes scope to match the skill's definition, specifically the model and allowed-tools declared in its frontmatter. The scope reverts after execution completes. A skill may use a different model and a different set of tools than the surrounding session. The permission environment is sandboxed, so skills get only scoped access, never arbitrary system control. This ensures that behavioral modification operates within limits.
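The scoping behavior can be sketched as follows; the Session class and skill_scope helper are hypothetical illustrations, not real agent internals:

```python
# Illustrative sketch: a skill's tool scope narrows to the tools declared
# in its frontmatter, then the session's original scope is restored.
from contextlib import contextmanager

class Session:
    def __init__(self, tools):
        self.allowed_tools = set(tools)

    @contextmanager
    def skill_scope(self, skill_meta):
        saved = self.allowed_tools
        # The skill only gets the intersection of what it declares
        # and what the session itself is permitted to use.
        self.allowed_tools = saved & set(skill_meta["allowed-tools"])
        try:
            yield self
        finally:
            self.allowed_tools = saved  # scope reverts after execution

session = Session(["Read", "Write", "Bash", "WebSearch"])
with session.skill_scope({"allowed-tools": ["Read", "Write"]}) as scoped:
    assert scoped.allowed_tools == {"Read", "Write"}
assert session.allowed_tools == {"Read", "Write", "Bash", "WebSearch"}
```

The intersection is the important design choice in this sketch: a skill can narrow permissions but never grant itself tools the session does not already have.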
This is what differentiates skills from earlier approaches. OpenAI's Custom GPTs and Google's Gemini Gems are useful but opaque, non-transferable, and impossible to audit. Skills are readable because they are Markdown. They are auditable because you can apply version control. They are composable because skills can stack. And they are governable because you can build approval workflows and rollback capability. You can read a SKILL.md to understand why an agent behaves a certain way.
What the data shows
Building skills with coding agents is easy. The hard part is knowing whether they work. Traditional software testing does not apply: you cannot write a unit test asserting that expert behavior occurred. The output may be correct while the reasoning was shallow, or the reasoning may be sound while the output contains formatting errors.
SkillsBench is a benchmark and framework designed to address this. It uses a paired evaluation design in which identical tasks are assessed with and without skill augmentation. The benchmark comprises 85 tasks, stratified by domain and difficulty level. By comparing the same agent on the same task, with the only variable being the presence of a skill, SkillsBench separates the effect of the skill from model capability and task difficulty. Performance is measured as normalized gain: the fraction of the possible improvement that the skill actually achieves.
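Normalized gain can be computed as follows; this is the standard normalized-gain formula implied by the description above, and SkillsBench's exact implementation may differ:

```python
# Normalized gain: the fraction of the available headroom over the
# no-skill baseline that the skill actually captures.

def normalized_gain(baseline: float, with_skill: float) -> float:
    """baseline and with_skill are pass rates in [0, 1]."""
    if baseline >= 1.0:
        return 0.0  # no headroom left to improve
    return (with_skill - baseline) / (1.0 - baseline)

# Example: a skill that lifts the pass rate from 40% to 70%
# captures half of the available headroom.
print(round(normalized_gain(0.40, 0.70), 3))  # → 0.5
```

Normalizing by headroom matters because a 10-point lift from a 20% baseline is far less impressive than the same lift from an 80% baseline.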
SkillsBench's findings challenge the assumption that skills universally improve performance.
Skills improve average performance by 13.2 percentage points, but 24 of the 85 tasks regressed. Manufacturing tasks gained 32 points; software engineering tasks lost 5. The aggregate number hides the variation that domain-level evaluation reveals. This is why soft forks require evaluation infrastructure. Unlike hard forks, where you commit fully, soft forks let you measure before deploying widely. Organizations should break down evaluation by domain and function, testing not only for improvements but also for regressions: something that improves document processing may degrade code generation.
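A domain-level regression check might look like the following sketch; the domain names echo the examples above, but every number is invented for illustration:

```python
# Hypothetical regression check over paired runs (same task, with and
# without the skill), aggregated per domain. Data values are invented.
from collections import defaultdict

results = [
    # (domain, baseline_pass_rate, with_skill_pass_rate)
    ("manufacturing", 0.30, 0.62),
    ("manufacturing", 0.25, 0.60),
    ("software-eng", 0.55, 0.50),
    ("software-eng", 0.60, 0.54),
]

# Accumulate [baseline_sum, with_skill_sum, task_count] per domain.
by_domain = defaultdict(lambda: [0.0, 0.0, 0])
for domain, base, skilled in results:
    agg = by_domain[domain]
    agg[0] += base
    agg[1] += skilled
    agg[2] += 1

for domain, (base_sum, skill_sum, n) in sorted(by_domain.items()):
    delta = (skill_sum - base_sum) / n  # mean per-task change
    verdict = "REGRESSION" if delta < 0 else "improvement"
    print(f"{domain}: {delta:+.2f} ({verdict})")
```

The point of the per-domain breakdown is that a healthy aggregate delta can coexist with a domain that the skill actively harms.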
Compact skills perform about four times better than comprehensive ones. Focused skills with targeted guidance showed a +18.9 percentage-point improvement; comprehensive skills covering every edge case showed +5.7. Using two to three skills per task is optimal, with four or more showing diminishing returns. The temptation when building skills is to include everything: every warning, every exception, every piece of relevant context. Resist it. Let the model's intelligence do the work. Small, targeted behavior changes outperform extensive rewrites. Skill builders should start with the minimum viable guidance and add detail only when evaluation reveals specific gaps.
Models cannot reliably generate effective skills on their own. SkillsBench tested a "bring your own skill" condition in which agents were prompted to generate their own procedural knowledge before attempting tasks. Performance remained at baseline. Effective skills require human-curated domain expertise. AI can help with packaging and formatting, but the insight must come from people who actually hold the expertise. Human-curated insight is the bottleneck in building effective skills, not packaging or deployment.

Skills can partially substitute for model scale. Claude Haiku, a small model, achieved a 25.2% pass rate with well-designed skills, slightly higher than Claude Opus, the flagship model, without skills at 23.6%. Packaged expertise compensates for raw model scale on procedural tasks. This has cost implications: smaller models with skills can outperform larger models without them at a fraction of the inference cost. Soft forks democratize capability. You do not need the biggest model if you have the right expertise.

Open questions
Many challenges remain unresolved. What happens when multiple skills conflict during a session? How should organizations govern a skills portfolio when every team deploys its own skills onto shared agents? How quickly does encoded expertise go stale, and what refresh cadence keeps a skill effective without creating a maintenance burden? Skills inherit the biases of their authors' expertise, so how do you audit for that? And as the industry matures, how should evaluation infrastructure like SkillsBench scale to keep pace with increasingly complex skill-augmented systems?
These are not reasons to avoid skills. They are reasons to invest in evaluation infrastructure and governance practices alongside skill development. The ability to measure performance must evolve with the technology.
The agent skill advantage
Fine-tuning a model for a single use case is no longer the only path to expertise. Fine-tuning demands significant investment in talent, compute, and data, and it creates a permanent fork that must be re-evaluated, and potentially retrained, every time the base model is updated. Fine-tuning remains the right tool for improving a foundation model's broad capabilities, but encoding a narrow workflow is exactly the kind of expertise that skills can now deliver at a fraction of the cost.
Skills are not maintenance free. Just as applications sometimes break when operating systems are updated, skills need to be re-evaluated when the underlying agent harness or model changes. But the recovery path is easier: update the skill package, re-run the evaluation harness, and redeploy, rather than retraining from a new checkpoint.
Mainframes gave way to client-server. Monoliths gave way to microservices. Large monolithic models are now giving way to agents augmented by targeted expertise artifacts. Models provide the intelligence, agent harnesses provide the runtime, skills provide the expertise, and evaluation tells you whether it all works together.
