Infrastructure as intention – O’Reilly


There’s an open secret in the world of DevOps: no one trusts a CMDB. The Configuration Management Database (CMDB) is considered the “source of truth” – the central map of every server, service, and application in your enterprise. In theory, it is the foundation of security audits, cost analysis, and incident response. In practice, it is a work of fiction. As soon as you populate a CMDB, it starts to rot. Engineers deploy a new microservice but forget to register it. An autoscaling group spins up 20 new nodes, but the database only records the original three.

We call it configuration drift. And for decades, our industry’s solution has been to throw more scripts at the problem. We write huge, brittle ETL (extract-transform-load) pipelines that attempt to scrape the world into a relational database. It never works. “The world” – especially the modern cloud native world – moves too fast.

We realized that we couldn’t solve this problem by writing better scripts. We had to change the basic structure of how data was synced. We stopped trying to boil the ocean and fix the entire enterprise at once. Instead, we focused on an extremely difficult environment: Kubernetes. If we could build an autonomous agent able to reason about the complex, short-lived state of a Kubernetes cluster, we could prove a pattern that works everywhere. This article explains how we used the new open-source Codex CLI and the Model Context Protocol (MCP) to create that agent. In the process, we moved from passive code generation to active infrastructure operation, turning the “stale CMDB” problem from a data-entry task into a logic puzzle.

Transformation: From Code Generation to Infrastructure Operation with Codex CLI and MCP

The reason most CMDB initiatives fail is ambition. They try to track every switch port, virtual machine, and SaaS license simultaneously. The result is a data swamp – too much noise, not enough signal. We took a different approach and drew a small circle around a specific domain: Kubernetes workloads. Kubernetes is the perfect testing ground for AI agents because it is high-velocity and declarative. Things change constantly: pods die, deployments roll over, services change selectors. A static script struggles to distinguish between a CrashLoopBackOff (a temporary error condition) and a purposeful scale-down. We hypothesized that a large language model (LLM), acting as an operator, could understand this nuance. It wouldn’t just copy the data; it would interpret it.

The Codex CLI turned this vision into a tangible architecture by enabling the shift from “code generation” to “infrastructure operation.” Instead of treating the LLM as a junior programmer who writes scripts for humans to review and run, Codex empowers the model to execute the code itself. We provide it with tools – executable tasks that act as its hands and eyes – via the Model Context Protocol. MCP defines a clear interface between AI models and the outside world, allowing us to expose high-level capabilities like cmdb_stage_transaction without teaching the model the complex internal API of our CMDB. The model learns to use the tool, not the underlying API.

The agent architecture

Our system, which we call k8s-agent, consists of three distinct layers. It is not a single script running from top to bottom; it is a cognitive architecture.

Cognitive layer (Codex + contextual instructions): This is the Codex CLI running with a specific system prompt. We do not fine-tune model weights. Infrastructure moves too quickly for fine-tuning: a model trained on Kubernetes v1.25 will hallucinate by v1.30. Instead, we use context engineering – the art of designing the environment in which the AI operates. This includes tool design (creating atomic, deterministic functions), prompt architecture (structuring the system prompt), and information architecture (deciding what information to hide or expose). We feed the model a persistent reference file (AGENTS.md) that defines its personality: “You are a meticulous infrastructure auditor. Your goal is to ensure that the CMDB accurately reflects the state of the Kubernetes cluster. You must prioritize safety: do not delete records unless you have positive confirmation that they are orphans.”
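To make the idea concrete, a minimal AGENTS.md in this spirit might look like the sketch below. This is our illustration assembled from the rules quoted in the article, not the team’s actual file:

```
# AGENTS.md – k8s-agent reference instructions

You are a meticulous infrastructure auditor. Your goal is to ensure that
the CMDB accurately reflects the state of the Kubernetes cluster.

Rules:
- Track workloads (Deployments, StatefulSets), never Pods.
- Prioritize safety: do not delete a record unless you have positive
  confirmation (e.g. a delete event in the event history) that it is
  an orphan.
- Write all changes through the staging tools; never assume a staged
  diff has been committed.
```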

Tool layer: Using MCP, we expose deterministic Python functions to the agent.

  • Sensors: k8s_list_workloads, cmdb_query_service, k8s_get_deployment_spec
  • Actuators: cmdb_stage_create, cmdb_stage_update, cmdb_stage_delete

Note that we track workloads (Deployments, StatefulSets), not pods. Pods are short-lived; tracking them in the CMDB is an antipattern that creates noise. The agent understands this distinction – a semantic rule that is difficult to implement in a rigid script.
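The sensor side of that distinction can live in the tool itself, so the agent never sees ephemeral objects at all. Here is a minimal sketch of such a sensor; the function name follows the article, but the stubbed `RESOURCES` list stands in for a real call to the Kubernetes API:

```python
# Sketch of a deterministic "sensor" tool. In a real deployment this would
# query the Kubernetes API (e.g. via the official Python client); here a
# stubbed resource list keeps the example self-contained.

RESOURCES = [
    {"kind": "Deployment", "name": "payment-processor-v1", "namespace": "prod"},
    {"kind": "StatefulSet", "name": "orders-db", "namespace": "prod"},
    {"kind": "Pod", "name": "payment-processor-v1-abc12", "namespace": "prod"},
]

TRACKED_KINDS = {"Deployment", "StatefulSet"}  # workloads, never Pods


def k8s_list_workloads(namespace: str) -> list:
    """Return long-lived workloads only; Pods are ephemeral noise."""
    return [r for r in RESOURCES
            if r["namespace"] == namespace and r["kind"] in TRACKED_KINDS]


print([w["name"] for w in k8s_list_workloads("prod")])
```

Because the filtering is deterministic and lives in the tool, the model cannot accidentally “decide” to sync a pod.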

State layer (safety net): LLMs are probabilistic; infrastructure must be deterministic. We bridge this gap with a staging pattern. The agent never writes directly to the production database. It writes to a staged diff. This allows a human (or policy engine) to review proposed changes before committing them.
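The staging pattern is easy to sketch. In this toy version (illustrative names; real storage would be a database), the actuator only appends a diff, and nothing touches the CMDB until a separate, reviewed commit step runs:

```python
from datetime import datetime, timezone

# Toy CMDB and staging area standing in for real database tables.
CMDB = {"payment-processor-v1": {"kind": "Deployment", "namespace": "prod"}}
STAGED = []


def cmdb_stage_delete(record_id: str, comment: str) -> dict:
    """Actuator: stage a deletion as a diff instead of mutating the CMDB."""
    diff = {
        "op": "delete",
        "id": record_id,
        "comment": comment,
        "staged_at": datetime.now(timezone.utc).isoformat(),
    }
    STAGED.append(diff)
    return diff


def commit_diff(diff: dict) -> None:
    """Applied only after a human (or policy engine) approves the diff."""
    if diff["op"] == "delete":
        CMDB.pop(diff["id"], None)


diff = cmdb_stage_delete("payment-processor-v1",
                         "Confirmed orphan: delete event found in history")
assert "payment-processor-v1" in CMDB   # staging alone changes nothing
commit_diff(diff)                       # the reviewed commit applies it
```

The key property is that the agent’s probabilistic output can only ever produce a reviewable artifact, never an irreversible write.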

The OODA loop in action

How is this different from a standard sync script? A script follows a linear path: Connect → Fetch → Write. If any step fails or returns unexpected data, the script crashes or corrupts the data. Our agent follows the Observe-Orient-Decide-Act (OODA) loop, popularized by military strategists. Unlike a linear script that executes blindly, the OODA loop forces the agent to stop and synthesize information before taking action. This cycle allows it to handle incomplete data, verify assumptions, and adapt to changing conditions – essential traits for operating in a distributed system.

Let’s walk through a real scenario we encountered during our pilot – a ghost deployment – to explore the benefits of the OODA loop. A developer had removed a deployment named payment-processor-v1 from the cluster but had forgotten to delete the record from the CMDB. A standard script would pull a list of deployments, see that payment-processor-v1 is missing, and immediately issue a DELETE to the database. The risk is obvious: What if the API server timed out? What if there is a bug in the script’s pagination logic? The script blindly destroys data based on an absence of evidence.

The agent’s approach is fundamentally different. First, it Observes: calling k8s_list_workloads and cmdb_query_service and noting the discrepancy. Second, it Orients: checking its reference instructions (“verify orphans before deletion”) and deciding to call k8s_get_event_history. Third, it Decides: seeing the “delete” event in the log, it concludes that the resource is genuinely missing and that a deletion actually occurred. Finally, it Acts: calling cmdb_stage_delete with a comment confirming the deletion. The agent didn’t just sync data; it investigated. It handled the ambiguity that usually plagues automation.
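The decisive step – requiring positive confirmation rather than acting on absence of evidence – can be expressed as a small predicate. This is our sketch of the logic the agent follows, with hypothetical data shapes for the workload listing and event history:

```python
def should_stage_delete(record_name, cluster_workloads, event_history):
    """Decide whether a CMDB record is a confirmed orphan.

    Observe: is the record missing from the live cluster listing?
    Orient/Decide: treat absence as deletion only if an explicit delete
    event corroborates it. Absence of evidence alone is not enough: the
    listing could be incomplete due to a timeout or pagination bug.
    """
    live_names = {w["name"] for w in cluster_workloads}
    if record_name in live_names:
        return False  # not missing at all
    return any(e["reason"] == "delete" and e["object"] == record_name
               for e in event_history)


workloads = [{"name": "orders-db"}]
events = [{"reason": "delete", "object": "payment-processor-v1"}]

# Missing AND confirmed by an event -> safe to stage a deletion.
assert should_stage_delete("payment-processor-v1", workloads, events)
# Missing but unconfirmed -> do nothing; the listing may be wrong.
assert not should_stage_delete("payment-processor-v1", workloads, [])
```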

Solving the “Semantic Gap”

This specific Kubernetes use case highlights a widespread problem in IT operations: the “semantic gap.” The data in our infrastructure (JSON, YAML, logs) is full of implicit meaning. A label env:production changes the criticality of a resource. A status of CrashLoopBackOff means “broken,” but Completed means “finished successfully.” Traditional scripts require us to hardcode every permutation of this logic, resulting in thousands of lines of unmaintainable if/else statements. With the Codex CLI, we replace those thousands of lines of code with a few sentences of English in the system prompt: “Ignore jobs that completed successfully. Sync failed jobs so we can track instability.” The LLM bridges the semantic gap: it understands what “instability” means in the context of a job’s status. We describe our intent, and the agent handles the implementation.
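For contrast, here is roughly what that single prompt sentence replaces when hardcoded. This is a deliberate simplification of a Kubernetes Job’s status counters (a real Job status also carries conditions and timestamps), and it is exactly the kind of rule that multiplies into an if/else jungle:

```python
def should_sync_job(status: dict) -> bool:
    """Hardcoded version of the prompt rule:
    ignore jobs that completed successfully; sync everything else
    (failed or not-yet-successful jobs) so instability is tracked.
    """
    succeeded = status.get("succeeded", 0)
    failed = status.get("failed", 0)
    completed_cleanly = succeeded > 0 and failed == 0
    return not completed_cleanly


assert should_sync_job({"succeeded": 1, "failed": 0}) is False  # ignore
assert should_sync_job({"succeeded": 0, "failed": 2}) is True   # sync
```

One rule is manageable; hundreds of such rules, each with edge cases, are not – which is the argument for expressing intent in the prompt instead.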

Scaling beyond Kubernetes

We started with Kubernetes because it is the “hard mode” of configuration management. In a production environment with thousands of workloads, things are constantly changing. A standard script looks at a snapshot and often gets it wrong. An agent, however, can scale its reasoning with the complexity. It can run its OODA loop multiple times on the same problem – checking logs, verifying dependencies, and validating rules before ever making a change. This ability to add reasoning steps allows it to handle the scale and uncertainty that break traditional automation.

But the pattern we established – agentic OODA loops through MCP – is universal. Once we proved that the model worked for workloads and services, we realized we could scale it. For legacy infrastructure, we can give the agent tools to SSH into a Linux VM. For SaaS management, we can give it access to Salesforce or GitHub APIs. For cloud governance, we can ask it to audit AWS security groups. The beauty of this architecture is that the “brain” (the Codex CLI) remains the same. To support new environments, we don’t rewrite the engine; we simply hand it a new set of tools. The shift to an agentic model does force us to face new trade-offs, however. The most immediate is cost versus context. We learned the hard way that you shouldn’t feed the raw YAML of a Kubernetes deployment to the model – it consumes too many tokens and clutters the context with irrelevant details. Instead, you create a tool that returns a digest – a simplified JSON object containing only the fields that matter. This is context adaptation, and it is the key to running agents cost-effectively.
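A digest tool along those lines is straightforward to sketch. The field selection below is our illustration of the idea, not the team’s actual schema; it keeps identity, labels, scale, and images, and drops everything else a raw Deployment manifest carries:

```python
def digest_deployment(raw: dict) -> dict:
    """Reduce a full Deployment manifest to the handful of fields the
    agent actually reasons about, saving tokens and removing noise."""
    meta = raw.get("metadata", {})
    spec = raw.get("spec", {})
    containers = (spec.get("template", {}).get("spec", {})
                      .get("containers", []))
    return {
        "name": meta.get("name"),
        "namespace": meta.get("namespace"),
        "labels": meta.get("labels", {}),
        "replicas": spec.get("replicas"),
        "images": [c.get("image") for c in containers],
    }


# A trimmed-down example manifest; real ones carry far more detail
# (probes, resource limits, volumes, annotations, managed fields...).
raw = {
    "metadata": {"name": "payment-processor-v1", "namespace": "prod",
                 "labels": {"env": "production"}},
    "spec": {"replicas": 3,
             "template": {"spec": {"containers": [
                 {"name": "app", "image": "payments:1.4.2",
                  "resources": {}, "env": []}]}}},
}
print(digest_deployment(raw))
```

The agent only ever sees the digest, so its context stays small no matter how verbose the underlying manifests become.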

Conclusion: Human in the Cockpit

There are fears that AI will replace the DevOps engineer. Our experience with the Codex CLI suggests the opposite. This technology does not remove the human; it elevates them, promoting the engineer from “script writer” to “mission commander.” The stale CMDB was never really a data problem; it was a labor problem. It was too much work for humans to track manually, and too complex for simple scripts to automate. By introducing an agent that can reason, we finally have a system able to keep pace with the cloud.

We started with a small Kubernetes cluster. But the destination is infrastructure that is self-documenting, self-healing, and fundamentally accessible. The era of brittle sync scripts is over. The era of infrastructure as intention has begun.
