Organizations are increasingly deploying custom large language models (LLMs) on Amazon SageMaker AI real-time endpoints using their preferred serving framework, such as SGLang, vLLM, or TorchServe, to gain more control over their deployments, optimize costs, and align with compliance requirements. However, this flexibility presents a significant technical challenge: response format inconsistency with Strands Agents. Although these custom serving frameworks typically return responses in OpenAI-compatible formats for broader ecosystem support, Strands Agents expects model responses to follow the Bedrock Message API format.
The challenge matters because Message API support is not guaranteed for models hosted on SageMaker AI real-time endpoints. Although Amazon Bedrock has supported OpenAI message formats since December 2025, the flexibility of SageMaker AI lets customers host a wide variety of foundation models, some of which require nonstandard request and response formats that do not conform to either API. This creates a gap between the serving framework's output structure and what Strands expects, preventing seamless integration even though both systems are individually functional. The solution is to implement a custom model parser that extends SageMakerAIModel and translates the model server's response format into what Strands expects, so organizations can use their preferred serving framework without sacrificing compatibility with the Strands Agents SDK.
This post demonstrates how to create a custom model parser for Strands Agents when working with LLMs hosted on SageMaker AI that do not natively support the Bedrock Message API format. We walk through deploying Llama 3.1 with SGLang on SageMaker using awslabs/ml-container-maker, then implement a custom parser to integrate it with Strands Agents.
Strands custom parsers
Strands agents expect model responses in a specific format aligned with the Bedrock Message API. When you deploy models using custom serving frameworks like SGLang, vLLM, or TorchServe, they typically return responses in their own formats, which are often OpenAI-compatible for broader ecosystem support. Without a custom parser, you will encounter errors like the following:
TypeError: 'NoneType' object is not subscriptable
This happens because the default SageMakerAIModel class in Strands Agents attempts to parse responses assuming a specific structure that your custom endpoint does not provide. In this post and its companion code base, we explain how to extend the SageMakerAIModel class with custom parsing logic that translates your model server's response format into what Strands expects.
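To make the mismatch concrete, the following minimal sketch contrasts the two response shapes (abbreviated, with illustrative field values) and reproduces the failure mode:

```python
# Abbreviated OpenAI-style chat completion, as returned by serving
# frameworks such as SGLang or vLLM (values are illustrative).
openai_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello!"},
         "finish_reason": "stop"}
    ]
}

# Abbreviated Bedrock Message API-style shape that Strands-oriented
# parsing code expects (also illustrative).
bedrock_response = {
    "output": {"message": {"role": "assistant",
                           "content": [{"text": "Hello!"}]}}
}

def extract_text_bedrock(response: dict) -> str:
    # Parsing logic written for the Bedrock-style shape...
    return response["output"]["message"]["content"][0]["text"]

def extract_text_openai(response: dict) -> str:
    # ...versus the shape an OpenAI-compatible endpoint actually returns.
    return response["choices"][0]["message"]["content"]

# Bedrock-style parsing applied to an OpenAI-style response fails:
# .get() on the missing "output" key yields None, and subscripting
# None produces the TypeError shown above.
missing = openai_response.get("output")  # None
```

The two helper functions return the same text; only the path to it differs, which is exactly the gap a custom parser closes.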
Implementation Overview
Our implementation has three layers:
- Model deployment layer: Llama 3.1 served on SageMaker by SGLang, returning OpenAI-compatible responses
- Parser layer: a custom LlamaModelProvider class that extends SageMakerAIModel to handle Llama 3.1's response format
- Agent layer: a Strands agent that uses the custom provider for conversational AI, parsing the model's responses appropriately
We start with awslabs/ml-container-maker, an AWS Labs open-source Yeoman generator that automates the creation of SageMaker BYOC (bring your own container) deployment projects. It generates the artifacts required to build the LLM serving container, including Dockerfiles, CodeBuild configuration, and deployment scripts.
Install ml-container-maker
The first step is to create a serving container for our model. We use the open-source generator to build the container and produce deployment scripts for it. To install awslabs/ml-container-maker and its dependencies, including npm and Yeoman, review the project README and wiki to get started.
Generate deployment project
Once the generator is installed and linked, the yo command runs installed generators; yo ml-container-creator runs the generator we need for this exercise.
The generator creates a complete project structure.
Build and deploy
Projects generated by awslabs/ml-container-maker include templated build and deployment scripts. The ./deploy/submit_build.sh and ./deploy/deploy.sh scripts build the image, push it to Amazon Elastic Container Registry (Amazon ECR), and deploy it to an Amazon SageMaker AI real-time endpoint.
Deployment process:
- CodeBuild builds the Docker image with SGLang and Llama 3.1
- The image is pushed to Amazon ECR
- SageMaker creates a real-time endpoint
- SGLang downloads the model from Hugging Face and loads it into GPU memory
- The endpoint reaches InService status (about 10-15 minutes)
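The wait for InService status can be scripted. The following is a minimal sketch in which the describe callable is injected; in practice you would pass a lambda wrapping boto3's describe_endpoint, as the docstring notes (that boto3 wiring is an assumption and is not shown here):

```python
import time

def wait_for_in_service(describe, poll_seconds=30, timeout_seconds=1800):
    """Poll an endpoint-describe callable until the endpoint is InService.

    `describe` is any zero-argument callable returning a dict with an
    'EndpointStatus' key -- e.g. a lambda wrapping
    boto3.client('sagemaker').describe_endpoint(EndpointName=...).
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = describe()["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("endpoint creation failed")
        time.sleep(poll_seconds)
    raise TimeoutError("endpoint did not reach InService in time")
```

Injecting the describe function keeps the polling logic testable without AWS credentials.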
We can test the endpoint using ./test/test_endpoint.sh, or with direct invocation:
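Direct invocation can be sketched in Python with boto3's sagemaker-runtime client. The payload fields below follow the OpenAI chat completions schema that SGLang serves; the model name and default values are illustrative assumptions:

```python
import json

def build_chat_request(prompt, model="llama-3.1", max_tokens=256):
    # OpenAI-style chat completions payload, as expected by SGLang.
    # The model name is illustrative; use whatever your container serves.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def invoke_endpoint(endpoint_name, prompt):
    # boto3 is imported lazily so the payload helper above stays
    # usable without boto3 installed or AWS credentials configured.
    import boto3
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(build_chat_request(prompt)),
    )
    return json.loads(response["Body"].read())
```

Calling invoke_endpoint against the deployed endpoint returns the OpenAI-compatible response structure discussed next.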
Understanding Response Format
Llama 3.1 returns OpenAI-compatible responses, while Strands expects model responses to follow the Bedrock Message API format. Until late last year, this was a standard compatibility mismatch; starting in December 2025, Amazon Bedrock supports OpenAI message formats.
However, Message API support is not guaranteed for models hosted on SageMaker AI real-time endpoints. SageMaker AI lets customers host a variety of foundation models on managed GPU-accelerated infrastructure, some of which may require nonstandard request/response formats. The default SageMakerAIModel expects the Bedrock Message API format and attempts to access fields that are not present in the standard OpenAI message format, leading to the TypeError failure shown earlier.
Implementing a Custom Model Parser
Custom model parsers are a feature of the Strands Agents SDK that gives customers strong adaptability and flexibility when building agents powered by LLMs hosted on SageMaker AI. Here, we describe how to create a custom provider that extends SageMakerAIModel.
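The sketch below is self-contained: a stub base class stands in for strands' SageMakerAIModel so the parsing logic can be shown and run on its own, and the method names on the subclass are illustrative rather than the SDK's exact contract. In a real project, you would subclass the SDK class and invoke this parsing from your stream() override:

```python
import json

class SageMakerAIModel:
    """Stand-in for the Strands SDK's SageMakerAIModel, so this sketch
    runs on its own; subclass the real SDK class in practice."""

class LlamaModelProvider(SageMakerAIModel):
    """Translates OpenAI-style SGLang responses into the Bedrock
    Message API-style structure that Strands expects."""

    def parse_response(self, raw_body: bytes) -> dict:
        # Non-streaming case: re-shape choices[0].message into a
        # Bedrock-style output.message structure.
        payload = json.loads(raw_body)
        choice = payload["choices"][0]
        return {
            "output": {
                "message": {
                    "role": choice["message"]["role"],
                    "content": [{"text": choice["message"]["content"]}],
                }
            },
            "stopReason": choice.get("finish_reason", "stop"),
        }

    def parse_stream_chunk(self, chunk: bytes):
        # Streaming case: SGLang emits OpenAI-style "data: {...}"
        # server-sent events; extract the incremental text delta,
        # or None for non-data lines and the [DONE] sentinel.
        line = chunk.decode("utf-8").strip()
        if not line.startswith("data:"):
            return None
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return None
        delta = json.loads(data)["choices"][0].get("delta", {})
        return delta.get("content")
```

In the real provider, the stream() override would call parse_stream_chunk on each event from the endpoint and yield the corresponding Strands content events; the companion repository contains the full implementation.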
Overriding the stream() method of SageMakerAIModel allows the agent to parse responses according to the requirements of the underlying model. While most models support OpenAI's messaging API protocol, this capability enables power users to leverage highly specialized LLMs on SageMaker AI to power agent workloads using the Strands Agents SDK. Once the custom response-parsing logic is in place, the Strands Agents SDK makes it straightforward to initialize an agent with the custom model provider.
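A sketch of that wiring follows. It assumes the strands-agents package provides `from strands import Agent` with a `model` parameter; the import is guarded so the helper degrades gracefully where the SDK is not installed, and the model provider argument would be an instance of your SageMakerAIModel subclass:

```python
# Guarded import: the sketch stays importable without strands-agents.
try:
    from strands import Agent
except ImportError:  # strands-agents not installed
    Agent = None

def make_agent(model_provider, system_prompt="You are a helpful assistant."):
    """Initialize a Strands agent with a custom model provider.

    `model_provider` is an instance of your SageMakerAIModel subclass,
    e.g. the LlamaModelProvider from the companion repository.
    """
    if Agent is None:
        raise RuntimeError("install strands-agents to use this helper")
    return Agent(model=model_provider, system_prompt=system_prompt)
```

From this point the agent is used like any other Strands agent; the custom provider handles the response translation transparently.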
The complete implementation for this custom parser, including a Jupyter notebook and ml-container-creator deployment project with detailed explanations, is available in the companion GitHub repository.
Conclusion
Building custom model parsers for Strands Agents lets you take advantage of diverse LLM deployments on SageMaker, regardless of their response formats. By extending SageMakerAIModel and implementing the stream() method, you can integrate custom-hosted models while maintaining Strands' clean agent interface.
Key takeaways:
- awslabs/ml-container-maker simplifies SageMaker BYOC deployment with production-ready infrastructure code
- Custom parsers bridge the gap between model server response formats and Strands expectations
- The stream() method is a key integration point for custom providers.
About the authors
Dan Ferguson is a Senior Solutions Architect at AWS based in New York, United States. As a machine learning services specialist, Dan helps customers integrate ML workflows efficiently, effectively, and sustainably.
