In this tutorial, we use OpenAI Swarm to build an advanced yet practical multi-agent system that runs in Colab. We demonstrate how to organize specialized agents, such as a triage agent, an SRE agent, a communications agent, and a critic, to collaboratively handle a realistic production incident scenario. By structuring agent handoffs, integrating lightweight tools for knowledge retrieval and decision ranking, and keeping the implementation clean and modular, we show how Swarm lets us design controllable agentic workflows without bulky frameworks or complex infrastructure.
!pip -q install -U openai
!pip -q install -U "git+https://github.com/openai/swarm.git"
import os

def load_openai_key():
    try:
        from google.colab import userdata
        key = userdata.get("OPENAI_API_KEY")
    except Exception:
        key = None
    if not key:
        import getpass
        key = getpass.getpass("Enter OPENAI_API_KEY (hidden): ").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY not provided")
    return key

os.environ["OPENAI_API_KEY"] = load_openai_key()
We set up the environment and safely load the OpenAI API key so that the notebook can run securely in Google Colab. We ensure that the key is obtained from Colab Secrets when available, falling back to a hidden getpass prompt otherwise. This keeps authentication simple and reusable across sessions.
import json
import re
from typing import List, Dict
from swarm import Swarm, Agent
client = Swarm()
We import the core Python utilities and initialize the Swarm client that orchestrates all agent interactions. This snippet establishes the runtime backbone that allows agents to communicate, delegate tasks, and execute tool calls, and it serves as the entry point for the multi-agent workflow.
KB_DOCS = [
    {
        "id": "kb-incident-001",
        "title": "API Latency Incident Playbook",
        "text": "If p95 latency spikes, validate deploys, dependencies, and error rates. Rollback, cache, rate-limit, scale. Compare p50 vs p99 and inspect upstream timeouts."
    },
    {
        "id": "kb-risk-001",
        "title": "Risk Communication Guidelines",
        "text": "Updates must include impact, scope, mitigation, owner, and next update. Avoid blame and separate internal vs external messaging."
    },
    {
        "id": "kb-ops-001",
        "title": "On-call Handoff Template",
        "text": "Include summary, timeline, current status, mitigations, open questions, next actions, and owners."
    },
]
def _normalize(s: str) -> List[str]:
    return re.sub(r"[^a-z0-9\s]", " ", s.lower()).split()

def search_kb(query: str, top_k: int = 3) -> str:
    q = set(_normalize(query))
    scored = []
    for d in KB_DOCS:
        score = len(q.intersection(set(_normalize(d["title"] + " " + d["text"]))))
        scored.append((score, d))
    scored.sort(key=lambda x: x[0], reverse=True)
    docs = [d for s, d in scored[:top_k] if s > 0] or [scored[0][1]]
    return json.dumps(docs, indent=2)
We define a lightweight internal knowledge base and implement a retrieval function to surface relevant context during agent reasoning. Using simple token-overlap matching, we let agents ground their responses in predefined operational documents, demonstrating how Swarm can be augmented with domain-specific memory without external dependencies.
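Because `search_kb` is a pure function, its token-overlap scoring can be sanity-checked outside of any agent run. The sketch below reproduces the same definitions self-contained (with an abbreviated copy of the knowledge base) and queries it directly:

```python
import json
import re
from typing import List

# Abbreviated copy of the knowledge base defined above
KB_DOCS = [
    {"id": "kb-incident-001", "title": "API Latency Incident Playbook",
     "text": "If p95 latency spikes, validate deploys, dependencies, and error rates."},
    {"id": "kb-risk-001", "title": "Risk Communication Guidelines",
     "text": "Updates must include impact, scope, mitigation, owner, and next update."},
]

def _normalize(s: str) -> List[str]:
    # Lowercase, strip punctuation, split into tokens
    return re.sub(r"[^a-z0-9\s]", " ", s.lower()).split()

def search_kb(query: str, top_k: int = 3) -> str:
    q = set(_normalize(query))
    scored = []
    for d in KB_DOCS:
        # Score = number of query tokens that appear in title + text
        score = len(q.intersection(set(_normalize(d["title"] + " " + d["text"]))))
        scored.append((score, d))
    scored.sort(key=lambda x: x[0], reverse=True)
    # Keep positive-score hits; fall back to the best doc if nothing matches
    docs = [d for s, d in scored[:top_k] if s > 0] or [scored[0][1]]
    return json.dumps(docs, indent=2)

hits = json.loads(search_kb("p95 latency issue"))
print(hits[0]["id"])  # the latency playbook ranks first
```

Note that the matching is exact-token only ("deploy" does not match "deploys"), which is a deliberate simplicity trade-off for a dependency-free demo.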
def estimate_mitigation_impact(options_json: str) -> str:
    try:
        options = json.loads(options_json)
    except Exception as e:
        return json.dumps({"error": str(e)})
    ranking = []
    for o in options:
        conf = float(o.get("confidence", 0.5))
        risk = o.get("risk", "medium")
        penalty = {"low": 0.1, "medium": 0.25, "high": 0.45}.get(risk, 0.25)
        ranking.append({
            "option": o.get("option"),
            "confidence": conf,
            "risk": risk,
            "score": round(conf - penalty, 3)
        })
    ranking.sort(key=lambda x: x["score"], reverse=True)
    return json.dumps(ranking, indent=2)
We present a structured tool that evaluates and ranks mitigation strategies based on confidence and risk. This lets agents move beyond free-form reasoning and make semi-quantitative decisions, showing how tools can enforce consistency and decision discipline in agent outputs.
def handoff_to_sre():
    return sre_agent

def handoff_to_comms():
    return comms_agent

def handoff_to_handoff_writer():
    return handoff_writer_agent

def handoff_to_critic():
    return critic_agent
We define explicit handoff functions that enable one agent to transfer control to another. This snippet demonstrates how we model delegation and specialization within the swarm, making agent-to-agent routing transparent and easy to extend.
triage_agent = Agent(
    name="Triage",
    model="gpt-4o-mini",
    instructions="""
    Decide which agent should handle the request.
    Use SRE for incident response.
    Use Comms for customer or executive messaging.
    Use HandoffWriter for on-call notes.
    Use Critic for review or improvement.
    """,
    functions=[search_kb, handoff_to_sre, handoff_to_comms, handoff_to_handoff_writer, handoff_to_critic]
)

sre_agent = Agent(
    name="SRE",
    model="gpt-4o-mini",
    instructions="""
    Produce a structured incident response with triage steps,
    ranked mitigations, ranked hypotheses, and a 30-minute plan.
    """,
    functions=[search_kb, estimate_mitigation_impact]
)

comms_agent = Agent(
    name="Comms",
    model="gpt-4o-mini",
    instructions="""
    Produce an external customer update and an internal technical update.
    """,
    functions=[search_kb]
)

handoff_writer_agent = Agent(
    name="HandoffWriter",
    model="gpt-4o-mini",
    instructions="""
    Produce a clean on-call handoff document with standard headings.
    """,
    functions=[search_kb]
)

critic_agent = Agent(
    name="Critic",
    model="gpt-4o-mini",
    instructions="""
    Critique the previous answer, then produce a refined final version and a checklist.
    """
)
We configure several specialized agents, each with a clearly limited set of responsibilities and instructions. By separating triage, incident response, communications, handoff writing, and critique, we demonstrate a clean division of labor.
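The handoff convention itself is worth a closer look: a handoff is just a tool function whose return value is an agent, which Swarm then makes the active agent for subsequent turns. The stand-in sketch below (using a plain dataclass with hypothetical stub names instead of `swarm.Agent`, so it runs without the API) illustrates that convention in isolation:

```python
from dataclasses import dataclass

@dataclass
class StubAgent:
    """Stand-in for swarm.Agent, for illustration only."""
    name: str

stub_sre_agent = StubAgent(name="SRE")

def handoff_to_stub_sre():
    # Returning an agent object from a tool call is the signal
    # that control should transfer to that agent.
    return stub_sre_agent

next_agent = handoff_to_stub_sre()
print(next_agent.name)  # SRE
```

Because routing is expressed as ordinary functions, adding a new specialist is as simple as defining one more agent and one more handoff function.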
def run_pipeline(user_request: str):
    messages = [{"role": "user", "content": user_request}]
    r1 = client.run(agent=triage_agent, messages=messages, max_turns=8)
    messages2 = r1.messages + [{"role": "user", "content": "Review and improve the last answer"}]
    r2 = client.run(agent=critic_agent, messages=messages2, max_turns=4)
    return r2.messages[-1]["content"]
request = """
Production p95 latency jumped from 250ms to 2.5s after a deploy.
Errors slightly increased, DB CPU stable, upstream timeouts rising.
Provide a 30-minute action plan and a customer update.
"""
print(run_pipeline(request))
We assemble a full orchestration pipeline that executes triage, expert reasoning, and critic refinement in sequence. This snippet shows how we run an end-to-end workflow with a single function call, connecting all agents and tools into a coherent, production-style agentic system.
In conclusion, we have established a clear pattern for designing agent-oriented systems with OpenAI Swarm that emphasizes clarity, separation of responsibilities, and iterative refinement. We showed how to route tasks intelligently, enrich agent logic with local tools, and improve output quality through a critic loop, all while maintaining a simple, Colab-friendly setup. This approach lets us scale from experiments to real operational use cases, making Swarm a powerful foundation for building reliable, production-grade agentic AI workflows.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of Marktechpost, an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understood by a wide audience. The platform boasts over 2 million monthly views, reflecting its popularity among readers.