Cloudflare has released Agent SDK v0.5.0 to address the limitations of stateless serverless functions in AI development. In standard serverless architectures, session context must be recreated for each LLM call, which increases latency and token consumption. Agent SDK v0.5.0 provides a vertically integrated execution layer where computation, state, and inference co-exist at the network edge.
The SDK allows developers to create agents that maintain state over long periods, moving beyond simple request-response cycles. This is achieved through two primary technologies: Durable Objects, which provide persistent state and identity, and Infire, a custom-built Rust inference engine designed to make efficient use of edge resources. For developers, this architecture removes the need to manage external database connections or WebSocket servers for state synchronization.
State management through Durable Objects
The Agent SDK relies on Durable Objects (DOs) to provide persistent identity and memory for each agent instance. In the traditional serverless model, functions have no memory of past events unless they query an external database such as RDS or DynamoDB, which often adds 50ms to 200ms of latency.
A Durable Object is a stateful micro-server that runs on Cloudflare’s network with its own private storage. When an agent is instantiated through the Agent SDK, it is assigned a static ID, and all subsequent requests for that user are routed to the same physical instance, allowing the agent to keep its state in memory. Each agent includes an embedded SQLite database with a 1GB storage limit per instance, enabling zero-latency reads and writes for conversation history and task logs.
Durable Objects are single-threaded, which simplifies concurrency management. This design ensures that only one event is processed at a time for a given agent instance, eliminating race conditions. If an agent receives multiple inputs simultaneously, they are queued and processed atomically, so state remains consistent during complex operations.
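The serialization guarantee can be modeled in a few lines of TypeScript. This is an illustrative sketch of the queuing behavior described above, not the SDK's internal code:

```typescript
// Minimal model of Durable Object-style input serialization: events for one
// agent instance run one at a time, in arrival order, so state updates never
// interleave even when inputs arrive simultaneously.
class SerializedAgent {
  private tail: Promise<void> = Promise.resolve();
  private counter = 0; // in-memory state, safe because access is serialized

  // Queue a handler; it runs only after all previously queued handlers finish.
  enqueue<T>(handler: () => Promise<T>): Promise<T> {
    const result = this.tail.then(handler);
    // Keep the chain alive even if a handler rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }

  increment(): Promise<number> {
    return this.enqueue(async () => {
      const current = this.counter; // read
      await new Promise((r) => setTimeout(r, 1)); // simulated async work
      this.counter = current + 1; // write: no other event ran in between
      return this.counter;
    });
  }
}

async function main() {
  const agent = new SerializedAgent();
  // Ten "simultaneous" inputs; serialization makes the final result exactly 10.
  const results = await Promise.all(
    Array.from({ length: 10 }, () => agent.increment())
  );
  console.log(results[9]); // 10
}

main();
```

Without the queue, the concurrent handlers would all read the same initial value and overwrite each other's writes; the single-threaded model makes that interleaving impossible.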
Infire: optimizing inference with Rust
For the inference layer, Cloudflare developed Infire, an LLM inference engine written in Rust that replaces Python-based stacks such as vLLM. Python engines often face performance bottlenecks due to the global interpreter lock (GIL) and garbage-collection pauses. Infire is designed to maximize GPU utilization on H100 hardware by minimizing CPU overhead.
The engine uses granular CUDA graphs and just-in-time (JIT) compilation. Instead of launching GPU kernels sequentially, Infire compiles a dedicated CUDA graph for each supported batch size, allowing the driver to execute an entire step as a single monolithic structure and reducing CPU overhead by 82%. Benchmarks show that Infire is 7% faster than vLLM 0.10.0 on unloaded machines, using only 25% of a CPU compared to vLLM’s >140%.
| Metric | vLLM 0.10.0 (Python) | Infire (Rust) | Improvement |
| --- | --- | --- | --- |
| Throughput | Baseline | 7% faster | +7% |
| CPU overhead | >140% CPU usage | 25% CPU usage | -82% |
| Startup latency | High (cold start) | <4 seconds (Llama 3 8B) | Significant |
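The per-batch-size graph idea can be sketched simply: precompile a graph for each supported size, then pad an incoming batch up to the nearest precompiled size so one graph launch replaces many kernel launches. The bucket sizes and function names below are hypothetical; Infire's real implementation is Rust and CUDA:

```typescript
// Illustrative sketch of "one CUDA graph per batch size" bucketing.
const PRECOMPILED_SIZES = [1, 2, 4, 8, 16, 32, 64]; // assumed bucket sizes

// Pick the smallest precompiled graph that can hold the batch.
function selectGraphSize(batchSize: number): number {
  for (const size of PRECOMPILED_SIZES) {
    if (size >= batchSize) return size;
  }
  throw new Error(`batch of ${batchSize} exceeds largest precompiled graph`);
}

// Pad the batch to the chosen graph size so the precompiled graph's fixed
// shapes match; padded slots do throwaway work but the CPU launches once.
function padBatch<T>(batch: T[], filler: T): { batch: T[]; graphSize: number } {
  const graphSize = selectGraphSize(batch.length);
  const padded = batch.concat(Array(graphSize - batch.length).fill(filler));
  return { batch: padded, graphSize };
}

const { graphSize } = padBatch(["req-a", "req-b", "req-c"], "<pad>");
console.log(graphSize); // 4: three requests execute under the size-4 graph
```

The trade-off is a little wasted GPU work on padded slots in exchange for eliminating per-kernel CPU launch overhead, which is where the reported 82% CPU reduction comes from.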
Infire also uses paged KV caching, which splits cache memory into non-contiguous blocks to prevent fragmentation. This enables continuous batching, where the engine admits new requests into the batch while earlier generations are still completing, without degrading performance. This architecture allows Cloudflare to maintain a predictable 99.99% hot request rate.
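A paged KV cache can be modeled as a block allocator with a free list. The sketch below illustrates the mechanism under assumed parameters (block size, class names); Infire's real allocator manages GPU memory in Rust:

```typescript
// Minimal sketch of paged KV-cache allocation: the cache is split into
// fixed-size blocks handed out from a free list, so a sequence's KV memory
// need not be contiguous, and finished sequences return their blocks for
// reuse, which is what makes continuous batching practical.
const BLOCK_SIZE = 16; // tokens per block (assumed)

class PagedKVCache {
  private freeBlocks: number[];
  private tables = new Map<string, number[]>(); // sequence id -> block ids

  constructor(totalBlocks: number) {
    this.freeBlocks = Array.from({ length: totalBlocks }, (_, i) => i);
  }

  // Ensure `seq` has enough blocks to hold `numTokens` tokens.
  reserve(seq: string, numTokens: number): number[] {
    const needed = Math.ceil(numTokens / BLOCK_SIZE);
    const table = this.tables.get(seq) ?? [];
    while (table.length < needed) {
      const block = this.freeBlocks.pop();
      if (block === undefined) throw new Error("cache full");
      table.push(block); // blocks need not be adjacent in memory
    }
    this.tables.set(seq, table);
    return table;
  }

  // A finished sequence returns its blocks so new requests can join the batch.
  release(seq: string): void {
    this.freeBlocks.push(...(this.tables.get(seq) ?? []));
    this.tables.delete(seq);
  }

  freeCount(): number {
    return this.freeBlocks.length;
  }
}

const cache = new PagedKVCache(8);
cache.reserve("a", 40); // 40 tokens -> 3 blocks
cache.reserve("b", 20); // 20 tokens -> 2 blocks
cache.release("a");     // "a" finishes; its 3 blocks are immediately reusable
console.log(cache.freeCount()); // 6
```

Because blocks are recycled the moment a generation finishes, a new request can slot into the running batch without waiting for a contiguous region to free up.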
Code Mode and token efficiency
Standard AI agents typically use ‘tool calling’, where the LLM outputs a JSON object to trigger a function. This requires a round trip between the LLM and the execution environment for each tool used. Cloudflare’s ‘Code Mode’ changes this by asking the LLM to write a TypeScript program that orchestrates multiple tools at once.
This code executes in a secure V8 isolate sandbox. For complex tasks, such as searching 10 different files, Code Mode provides an 87.5% reduction in token usage. Because intermediate results remain within the sandbox and are not sent back to the LLM at every step, the process is both faster and more cost-effective.
Code Mode also improves security through ‘secure bindings’. The sandbox has no internet access; it can only interact with Model Context Protocol (MCP) servers through specific bindings on the environment object. These bindings hide sensitive API keys from the LLM, preventing the model from accidentally leaking credentials into its generated code.
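The pattern can be sketched as follows. The binding names (`search`, `summarize`) and the shape of `env` are hypothetical illustrations, not the SDK's actual binding API:

```typescript
// Sketch of the Code Mode idea: instead of one LLM round trip per tool call,
// the model emits a small program that orchestrates tools through an `env`
// of bindings. The bindings close over credentials, so the generated code
// (and the model) never sees them.
type Env = {
  search: (query: string) => Promise<string[]>;
  summarize: (docs: string[]) => Promise<string>;
};

// Host side: build bindings that hide the API key from generated code.
function makeEnv(apiKey: string): Env {
  return {
    search: async (query) => {
      // A real binding would call an MCP server using `apiKey`; stubbed here.
      void apiKey;
      return [`result for "${query}" #1`, `result for "${query}" #2`];
    },
    summarize: async (docs) => `summary of ${docs.length} documents`,
  };
}

// What a model-generated Code Mode program might look like: several tool
// calls in one script, intermediate results kept local, one final answer.
async function generatedProgram(env: Env): Promise<string> {
  const queries = ["billing bug", "login bug"];
  const hits: string[] = [];
  for (const q of queries) {
    hits.push(...(await env.search(q))); // intermediate data stays in the sandbox
  }
  return env.summarize(hits); // only this final result returns to the LLM
}

generatedProgram(makeEnv("secret-key-never-shown-to-model")).then(console.log);
// -> summary of 4 documents
```

Four tool invocations cost zero extra LLM round trips here; under classic tool calling, each would have required serializing results back into the model's context.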
February 2026: v0.5.0 release
The Agent SDK has reached version 0.5.0. The release introduces several utilities for production-ready agents:
- `this.retry()`: A new method for retrying asynchronous operations with exponential backoff and jitter.
- Protocol suppression: Developers can now suppress JSON text frames on a per-connection basis using the `shouldSendProtocolMessages` hook. This is useful for IoT or MQTT clients that cannot process JSON data.
- AI chat: The `@cloudflare/ai-chat` package reached version 0.1.0, adding message persistence to SQLite and a “row size guard” that performs automatic compression when messages approach the 2MB SQLite row limit.
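The backoff-with-jitter pattern that `this.retry()` is described as using can be sketched in plain TypeScript. The option names below are assumptions for illustration, not the SDK's actual signature:

```typescript
// Self-contained sketch of retry with exponential backoff and full jitter.
type RetryOptions = {
  attempts: number; // total tries, including the first
  baseMs: number;   // initial backoff
  maxMs: number;    // backoff cap
};

async function retry<T>(
  fn: () => Promise<T>,
  { attempts, baseMs, maxMs }: RetryOptions
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i === attempts - 1) break;
      // Exponential backoff with full jitter: sleep in [0, min(max, base*2^i)).
      const cap = Math.min(maxMs, baseMs * 2 ** i);
      await new Promise((r) => setTimeout(r, Math.random() * cap));
    }
  }
  throw lastError;
}

// Usage: an operation that fails twice, then succeeds on the third try.
async function demo() {
  let calls = 0;
  const value = await retry(
    async () => {
      calls++;
      if (calls < 3) throw new Error("flaky upstream");
      return "ok";
    },
    { attempts: 5, baseMs: 100, maxMs: 2000 }
  );
  console.log(value, calls); // ok 3
}

demo();
```

Jitter matters because many agents retrying a failed upstream at identical intervals would stampede it; randomizing the delay spreads the load.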
| Feature | Description |
| --- | --- |
| `this.retry()` | Automatic retries for external API calls. |
| Data parts | Attaching typed JSON blobs to chat messages. |
| Tool approvals | Persistent approval state that survives hibernation. |
| Synchronous getters | `getQueue()` and `getSchedule()` no longer return promises. |
Key takeaways
- Stateful persistence at the edge: Unlike traditional stateless serverless functions, the Agent SDK uses Durable Objects to give agents persistent identity and memory. Each agent maintains its state in an embedded SQLite database with 1GB of storage, enabling zero-latency data access without external database calls.
- High-efficiency Rust inference: Cloudflare’s Infire inference engine, written in Rust, optimizes GPU utilization using granular CUDA graphs to reduce CPU overhead by up to 82%. Benchmarks show it is 7% faster than the Python-based vLLM 0.10.0, and its paged KV caching maintains a 99.99% hot request rate while significantly reducing cold-start latency.
- Token optimization via Code Mode: ‘Code Mode’ allows agents to write and execute TypeScript programs in a secure V8 isolate instead of issuing many individual tool calls. This deterministic approach reduces token consumption by up to 87.5% for complex tasks and keeps intermediate data within a sandbox, improving both speed and security.
- Universal tool integration: The platform fully supports the Model Context Protocol (MCP), a standard that acts as a universal translator for AI tools. Cloudflare has deployed 13 official MCP servers that let agents securely manage infrastructure components such as DNS, R2 storage, and Workers KV through natural-language commands.
- Production-ready utilities (v0.5.0): The February 2026 release introduces critical reliability features, including the `this.retry()` utility for asynchronous operations with exponential backoff and jitter. It also adds protocol suppression, which allows agents to communicate with binary-only IoT devices and lightweight embedded systems that cannot process standard JSON text frames.
