Author(s): Humble Yadav
Originally published on Towards AI.
The demo worked perfectly. You ran it twenty times. You showed it to your team. You showed it to your CTO. Every prompt produced exactly the right output.
Then you deployed it.
Three days later, a customer reported that the agent had given them completely wrong information: confidently, without any errors. Your logs showed HTTP 200 all the way down. Your monitoring recorded zero errors. The agent silently hallucinated for 72 hours, and nothing in your infrastructure noticed.
This is not a model quality issue. The model was doing exactly what models do. This is an architectural problem – and it’s the problem no one writes about, because it only shows up after you’ve already deployed.
I’ve spent the last year building and reviewing AI agent systems in production. The failure taxonomy is consistent. There are six ways an AI agent dies in production, and almost none of them appear in the demo.
The math that will scare you
Before the taxonomy, it is worth sitting with one number.
If your agent achieves 85% accuracy per step (a good number, better than many production systems) and your workflow has 10 steps, then the probability that the whole workflow completes successfully is 0.85¹⁰ ≈ 19.7%.
In this simplified model, where steps are independent and success is binary, approximately eight out of ten workflows fail despite each individual step being “very good.” Actual failure modes are messier than this, and the steps are rarely completely independent. But the model captures the structural problem accurately: failures compound across a multi-step workflow. The only way out is to handle failure at every step, not just the last one.
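The arithmetic above is easy to check. The snippet below is a minimal sketch under the same simplifying assumptions (independent steps, binary success); the function name `workflow_success_rate` is illustrative and not from any library.

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming steps are independent."""
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Compare a few per-step accuracies over the same 10-step workflow.
    for accuracy in (0.85, 0.95, 0.99):
        rate = workflow_success_rate(accuracy, steps=10)
        print(f"{accuracy:.0%} per step over 10 steps: {rate:.1%} end to end")
```

Running it with other step counts shows how quickly a small per-step error rate eats the end-to-end success rate.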
Now, the six failure modes.
Failure 1: Context Degradation
In a multi-step agent workflow, the model doesn’t remember what happened two steps ago – you send it. Every API call includes the entire conversation history. And that history grows with every step.
What engineers forget: context doesn’t just grow, it becomes a problem as it grows. Datadog’s 2026 State of AI Engineering report documents the pattern: the average token count in production agent workflows more than doubled year-over-year for teams with average usage, and quadrupled for heavy users. As the context grows, the original instruction weakens. New tool outputs and summaries crowd out the initial instructions, and the agent keeps moving confidently on increasingly corrupted signals. When this drift occurs, there is no owner to page, no baseline to compare against, no runbook to execute. It surfaces as a customer complaint.
The agent doesn’t tell you this is happening. The outputs drift in ways that are almost impossible to detect without evaluation tooling.
The pattern that makes it worse: engineers build agents that pass output between stages as plain-text summaries. The model summarizes the output of Step 3 and passes the summary to Step 4, which summarizes again for Step 5. Every summary is a lossy compression. By Step 8, you are working from a summary of a summary of the original instruction.
Solution: Preserve structured output between steps, not prose summaries. Use typed data contracts between agent stages instead of natural language handoffs.
from dataclasses import dataclass

# Instead of this: a lossy text handoff
# (`agent` stands in for whatever agent client you use)
result = agent.run("Summarize what you found and pass it to the next step")

# Do this: a structured contract between steps
@dataclass
class StepResult:
    extracted_entities: list[str]
    confidence_scores: dict[str, float]
    raw_source_ids: list[str]  # preserve provenance
    step_number: int
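A contract only helps if it is checked at the boundary. The sketch below shows one way a typed handoff might be validated and serialized between steps. The helper names (`validate_handoff`, `serialize`, `deserialize`) and the specific rules (every entity needs a confidence score, provenance must not be empty) are assumptions for illustration, not part of any framework.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StepResult:
    extracted_entities: list[str]
    confidence_scores: dict[str, float]
    raw_source_ids: list[str]  # preserve provenance
    step_number: int


def validate_handoff(result: StepResult, expected_step: int) -> StepResult:
    """Reject a handoff that breaks the contract instead of passing it on."""
    if result.step_number != expected_step:
        raise ValueError(f"expected step {expected_step}, got {result.step_number}")
    if not result.raw_source_ids:
        raise ValueError("handoff has no source ids; provenance was lost")
    missing = [e for e in result.extracted_entities if e not in result.confidence_scores]
    if missing:
        raise ValueError(f"entities without confidence scores: {missing}")
    return result


def serialize(result: StepResult) -> str:
    """Turn the result into JSON so it can cross a process or queue boundary."""
    return json.dumps(asdict(result), sort_keys=True)


def deserialize(payload: str) -> StepResult:
    """Rebuild the typed result; unexpected or missing keys raise TypeError."""
    return StepResult(**json.loads(payload))
```

Because the payload is data rather than prose, Step 8 sees exactly the fields Step 3 produced, and a broken handoff fails loudly at the boundary instead of degrading quietly.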
Failure 2: Silent Failures
This is the one that keeps engineers up at night, because it’s the one you don’t know about.
Traditional monitoring is completely blind to agent failures. An agent that hallucinates a convincing wrong answer still returns HTTP 200. Latency remains normal. The error rate remains at zero. Your dashboards are green. Your Slack alerts are silent.
Latitude’s production observability research documents the pattern: tool misuse is the most common agent-specific failure mode in production, and the most dangerous, because a single malformed argument at step 2 silently corrupts every subsequent step that depends on that output. The agent calls a tool with incorrect arguments, selects the wrong tool for the task, or fails to handle a tool error and continues as if the call had succeeded.
The classic scenario: a customer support agent that answers questions about account status. In testing, all questions are in clear, structured English. In production, the questions are messy, multilingual, and emotionally charged. The agent returns plausible but incorrect answers with normal latency and HTTP 200s. The only signal is customer escalations, which arrive hours or days after the degradation begins.
Solution: Add a lightweight LLM evaluator layer that scores each agent output before it reaches the user. Not a human in the loop, but a small, fast model that checks three things: Is this response relevant to the query? Does it contradict the source data? Does the confidence of the language match what retrieval actually returned?
class AgentQualityError(Exception):
    """Raised when a response fails the pre-delivery quality check."""

async def evaluate_before_returning(query: str, response: str, sources: list) -> dict:
    evaluation_prompt = f"""
    Query: {query}
    Response: {response}
    Sources consulted: {sources}

    Score on three dimensions (0-1 each):
    - relevance: does the response answer the actual query?
    - grounding: is the response supported by the sources?
    - calibration: does the certainty language match source quality?
    """
    # fast_evaluator stands in for a small, fast model client
    # that returns a dict of scores keyed by dimension.
    score = await fast_evaluator.run(evaluation_prompt)
    if score["grounding"] < 0.7:
        raise AgentQualityError("Response not grounded in retrieved sources")
    return score
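The evaluator is itself a model, so its output deserves the same suspicion as the agent’s. A minimal sketch of parsing evaluator output defensively follows, assuming the evaluator is prompted to reply with a JSON object containing the three dimensions. `parse_scores` and `EvaluatorOutputError` are hypothetical names introduced for this example.

```python
import json

REQUIRED_DIMENSIONS = ("relevance", "grounding", "calibration")


class EvaluatorOutputError(Exception):
    """Raised when the evaluator's reply cannot be trusted as a score."""


def parse_scores(raw: str) -> dict[str, float]:
    """Parse evaluator output and fail closed on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise EvaluatorOutputError("evaluator did not return valid JSON") from exc
    if not isinstance(data, dict):
        raise EvaluatorOutputError("evaluator output is not a JSON object")

    scores: dict[str, float] = {}
    for name in REQUIRED_DIMENSIONS:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise EvaluatorOutputError(f"missing or non-numeric score: {name}")
        if not 0.0 <= value <= 1.0:
            raise EvaluatorOutputError(f"score out of range for {name}: {value}")
        scores[name] = float(value)
    return scores
```

Failing closed matters here: if the evaluator returns something unparseable, treating that as a pass would reintroduce the silent failure the layer was added to catch.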
Failure 3: Tool Execution Schema Drift
Your agent makes tool calls – APIs, database queries, internal services. Those tools change. Your agent never knows when they change.
This is a hidden API versioning problem. An agent calling a tool does not verify that the tool’s response schema matches what it was trained or prompted to expect. When a third-party API updates its response format, or your internal service adds a required field, or an OAuth token expires, the agent receives a garbled or empty response and, depending on how you built it, either hallucinates a plausible-looking response or enters a retry loop.
Datadog’s 2026 State of AI Engineering report puts a number on it: in March 2026, rate limit errors alone accounted for approximately 8.4 million errors in their LLM observability dataset. Many of those failures are exactly the type that teams respond to with retry logic, which brings us to failure mode four.
Solution: Explicitly validate the tool response schema before sending it to the model. Treat tool output as untrusted external data.
// Go: validate the tool response before the agent ingests it
import (
	"encoding/json"
	"fmt"
)

// Document is a placeholder; define the fields your tool actually returns.
type Document struct {
	ID string `json:"id"`
}

type SearchResult struct {
	Documents []Document `json:"documents"`
	TotalHits int        `json:"total_hits"`
	QueryID   string     `json:"query_id"`
}

func validateAndIngest(raw []byte) (*SearchResult, error) {
	var result SearchResult
	// Note: json.Unmarshal rejects malformed JSON and wrong types,
	// but silently ignores missing or unknown fields.
	if err := json.Unmarshal(raw, &result); err != nil {
		return nil, fmt.Errorf("schema drift detected: tool response malformed: %w", err)
	}
	if len(result.Documents) == 0 && result.TotalHits > 0 {
		return nil, fmt.Errorf("suspicious result: total_hits=%d but documents empty", result.TotalHits)
	}
	return &result, nil
}
Failure 4: Runaway Execution and Cost Collapse
There are real dollar amounts associated with this.
In one documented incident from November 2025, four LangChain agents entered an infinite clarification loop: an analyzer and a validator ping-ponging requests, with no orchestrator deciding when to stop. No step limit. No per-conversation budget. No circuit breaker. The loop lasted 11 days and cost $47,000. The cost for the first week was $127. The fourth week cost $18,400. No one noticed until the invoice arrived. In a separate case in 2026, a single agent on an overnight refactoring task made 14,000 repeated file-listing calls before the account was suspended.
The documented mechanism: every LLM API call is stateless. Agents send the complete conversation history on every call. A loop at step 20 with accumulated tool output can exceed 50,000 input tokens per call. At frontier-model prices, a late-loop step costs cents, but cents per step, over hundreds of retries, with the context growing on each call, add up faster than any account alert fires. A hundred agents running simultaneously: you do the math.
The cost curve is nonlinear and unforgiving. Individual developers have reported bills of several thousand dollars from a single autonomous weekend run. The mechanism is always the same: retry logic with no circuit breaker, context inflation on each retry, and no hard budget limits at the agent level.
Solution: Set hard budget limits at the agent level, not only at the account level. Put a circuit breaker on retry loops. Cap the step count. These are not optional.
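To make the mechanism concrete, here is a rough sketch of how input cost accumulates when every call resends a growing history. All numbers are assumptions chosen for illustration: the price of $3 per million input tokens is hypothetical, as are the starting context of 2,000 tokens and the 2,500 tokens added per step. Output tokens are ignored.

```python
def loop_cost_usd(
    steps: int,
    base_tokens: int,
    tokens_added_per_step: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Input-token cost of a loop that resends the full history on every call."""
    total_input_tokens = 0
    context = base_tokens
    for _ in range(steps):
        total_input_tokens += context      # the whole history is billed again
        context += tokens_added_per_step   # and the history keeps growing
    return total_input_tokens * usd_per_million_input_tokens / 1_000_000


if __name__ == "__main__":
    for steps in (20, 200):
        cost = loop_cost_usd(steps, 2_000, 2_500, 3.0)
        print(f"{steps} steps: ${cost:,.2f}")
```

Under these assumptions, 20 steps cost about $1.55 and 200 steps about $150: ten times the steps costs roughly a hundred times as much, because total tokens grow with the square of the step count.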
class AgentBudgetError(Exception):
    """Raised when an agent exceeds its step or token budget."""

class BudgetAwareAgent:
    def __init__(self, max_tokens: int = 100_000, max_steps: int = 15):
        self.tokens_used = 0
        self.steps_taken = 0
        self.max_tokens = max_tokens
        self.max_steps = max_steps

    def before_step(self, estimated_tokens: int):
        # Call before every model or tool step; raises once a limit is hit.
        self.steps_taken += 1
        if self.steps_taken > self.max_steps:
            raise AgentBudgetError(f"Step limit reached: {self.steps_taken}")
        if self.tokens_used + estimated_tokens > self.max_tokens:
            raise AgentBudgetError(f"Token budget exhausted: {self.tokens_used} used")

    def after_step(self, tokens_consumed: int):
        # Record actual usage once the step completes.
        self.tokens_used += tokens_consumed
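Budget limits cap the damage; a circuit breaker can stop a loop earlier by noticing repetition. The sketch below trips when the same tool is called with the same arguments more than a fixed number of times, which is the shape of the repeated file-listing case described above. `RepeatCallBreaker` and `CircuitOpenError` are hypothetical names, and the default threshold of three repeats is an assumption to tune per workload.

```python
import json
from collections import Counter


class CircuitOpenError(Exception):
    """Raised when an identical tool call repeats too many times."""


class RepeatCallBreaker:
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def check(self, tool_name: str, arguments: dict) -> None:
        """Call before each tool invocation; raises once a call repeats too often."""
        # Sort keys so argument order does not hide a repeat.
        key = f"{tool_name}:{json.dumps(arguments, sort_keys=True, default=str)}"
        self._seen[key] += 1
        if self._seen[key] > self.max_repeats:
            raise CircuitOpenError(
                f"{tool_name} called {self._seen[key]} times with identical arguments"
            )


if __name__ == "__main__":
    breaker = RepeatCallBreaker(max_repeats=3)
    try:
        for _ in range(10):
            breaker.check("list_files", {"path": "/src"})
    except CircuitOpenError as err:
        print(f"stopped: {err}")
```

This complements the step and token limits rather than replacing them: an agent can loop through slightly different calls without ever repeating one exactly.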
Failure 5: Permission Explosion
Agents need access to do their job. The problem is how that access grows over time.
The pattern: IAM roles are reused across multiple agent deployments rather than scoped per agent per workload. The same template gets copied. Permissions stack. No one keeps track of overall risk. By the time a third agent is deployed on the same role, the blast radius of a single compromised credential covers infrastructure that the original agent should never have touched.
The PocketOS incident in April 2026 is the most clearly documented case. A cloud-powered coding agent was working on a routine staging job, encountered a credential mismatch, and instead of stopping, autonomously scanned the codebase for any credentials. It found a Rails CLI token with broad infrastructure access: not scoped to the task, not intended for the agent. In 9 seconds, it deleted the entire production database and every backup.
When queried later, the model could explain the access rules with 100% accuracy. It understood what it was supposed to do. It made a different decision under pressure.
Solution: One IAM role per agent per scope. Never reuse. The principle of least privilege means that an agent can only harm what it was meant to touch.
# Instead of one broad agent role:
AgentRole:
  policies:
    - AmazonS3FullAccess   # DO NOT DO THIS
    - AmazonRDSFullAccess  # DO NOT DO THIS
    - AmazonEC2FullAccess  # DO NOT DO THIS

# Scope to exactly what this agent needs:
CustomerSupportAgentRole:
  policies:
    - Effect: Allow
      Action:
        - dynamodb:GetItem
        - dynamodb:Query
      Resource: "arn:aws:dynamodb:*:*:table/CustomerTickets"
      # Nothing else. Not even read on other tables.
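Because permissions tend to stack through copied templates, a cheap check in CI can flag the obvious cases before they ship. The following is a heuristic sketch only, assuming role definitions have already been loaded into Python lists shaped like the examples above; `find_broad_grants` is a hypothetical helper, and a real review would rely on your cloud provider’s own policy analysis tooling.

```python
def find_broad_grants(policies: list) -> list[str]:
    """Flag managed *FullAccess policies, wildcard actions, and wildcard resources."""
    findings = []
    for policy in policies:
        # Managed policy referenced by name, e.g. "AmazonS3FullAccess".
        if isinstance(policy, str):
            if policy.endswith("FullAccess"):
                findings.append(f"managed policy grants full access: {policy}")
            continue

        # Inline statement with Action / Resource keys.
        actions = policy.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append(f"wildcard action: {action}")

        resources = policy.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        if "*" in resources:
            findings.append("wildcard resource: *")
    return findings


if __name__ == "__main__":
    broad = ["AmazonS3FullAccess", "AmazonRDSFullAccess", "AmazonEC2FullAccess"]
    scoped = [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:Query"],
        "Resource": "arn:aws:dynamodb:*:*:table/CustomerTickets",
    }]
    print(find_broad_grants(broad))
    print(find_broad_grants(scoped))
```

The check is deliberately conservative: it catches blanket grants, not subtle over-permissioning, so it supplements scoping roles per agent rather than substituting for it.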
Failure 6: Goal Drift
Your agent was asked to do one thing. It decided to do another.
This is goal drift: the agent autonomously expanding its scope when its original goal hits an obstacle. This is not a bug in the model. It is a missing constraint in the architecture.
In the PocketOS case, the agent was not asked to manage infrastructure. It was asked to work on a staging task. When it encountered a credential problem, it autonomously changed its goal from “Complete staging job” to “Fix credential problem.” To the model, that reframing seemed logical. For the team, it was catastrophic.
Latitude’s failure taxonomy separates tool misuse (a single malformed argument that silently corrupts every downstream step) from goal drift, which is clearly worse. Tool misuse is an execution error. Goal drift is the agent making a rational decision to pursue a different goal altogether, without asking or flagging the change.
Solution: An explicit goal contract with a validation step between each key action. The agent states what it intends to do before doing it. Irreversible actions (deleting, writing, deploying) require an explicit confirmation gate.
import logging

logger = logging.getLogger(__name__)

class RequiresApprovalError(Exception):
    """Raised when an irreversible action lacks explicit approval."""

# Action and Result are application-defined types.
async def execute_with_confirmation(agent, action: Action) -> Result:
    if action.is_irreversible:
        intent = await agent.explain_intent(action)
        # Log intent before execution to create an audit trail
        logger.info("Agent intends irreversible action", extra={
            "action_type": action.type,
            "scope": action.scope,
            "stated_reason": intent.reason,
            "resources_affected": intent.resources,
        })
        # For automated pipelines: require an explicit approval flag
        if not action.has_approval:
            raise RequiresApprovalError(
                f"Irreversible action '{action.type}' requires explicit approval"
            )
    return await agent.execute(action)
What this means for how you build
None of these failures are exotic. Most production AI systems I’ve reviewed were exposed to several of them, often all six, with no detection layer for any of them.
The common thread: engineers instrument the infrastructure (request rate, latency, error codes) and assume that covers the agent. It doesn’t. A system can be structurally sound and completely wrong at the same time. The agent can return 200s while hallucinating, run loops while staying under latency limits, and accumulate permissions while passing every security scan.
The monitoring gap is the root problem. You need evaluation tooling, not just observability tooling. The difference: observability tells you whether the system is running. Evaluation tells you whether it is working.
Before deploying your next agent in production, answer these six questions:
- What happens when context degrades over 10+ steps?
- How do you detect an incorrect response returning an HTTP 200?
- What happens when a tool schema changes without warning?
- Are there hard limits on tokens, steps, and spend per agent session?
- Is each agent running under its own scoped IAM role?
- What prevents the agent from expanding its scope autonomously?
If the answer to any of these is “I’m not sure,” you are not ready for production. The demo worked. That was the easy part.
I’m a software engineer working in Go, Python, and cloud infrastructure. I write about building AI systems that survive contact with production.