Building a custom model provider for Strands agents with LLM hosted on a SageMaker AI endpoint


Organizations are increasingly deploying custom large language models (LLMs) on Amazon SageMaker AI real-time endpoints using their preferred serving framework – such as SGLang, vLLM, or TorchServe – to gain more control over their deployments, optimize costs, and align with compliance requirements. However, this flexibility presents a significant technical challenge: response format inconsistency with Strands Agents. Although these custom serving frameworks typically return responses in OpenAI-compatible formats for broader ecosystem support, Strands agents expect model responses aligned with the Bedrock Message API format.

The challenge is particularly important because support for the Message API is not guaranteed for models hosted on SageMaker AI real-time endpoints. While Amazon Bedrock has supported OpenAI-compatible message formats since December 2025, the flexibility of SageMaker AI allows customers to host different foundation models – some requiring nonstandard request and response formats that do not conform to standard APIs. This creates a gap between the serving framework's output structure and what Strands expects, which prevents seamless integration even though both systems are individually functional. The solution is to implement a custom model parser that extends SageMakerAIModel and translates the model server's response format into what Strands expects, enabling organizations to use their preferred serving framework without sacrificing compatibility with the Strands Agents SDK.

This post demonstrates how to create a custom model parser for Strands agents when working with LLMs hosted on SageMaker that do not natively support the Bedrock Message API format. We deploy Llama 3.1 with SGLang on SageMaker using awslabs/ml-container-creator, then implement a custom parser to integrate it with Strands agents.

Strands custom parsers

Strands agents expect model responses in a specific format aligned with the Bedrock Message API. When you deploy models using custom serving frameworks like SGLang, vLLM, or TorchServe, they typically return responses in their own formats – which are often OpenAI-compatible for broader environment support. Without a custom parser, you will encounter errors like the following:

TypeError: 'NoneType' object is not subscriptable

This happens because the default SageMakerAIModel class in Strands Agents attempts to parse responses assuming a specific structure that your custom endpoint does not provide. In this post and its companion code base, we explain how to extend the SageMakerAIModel class with custom parsing logic that translates your model server's response format into what Strands expects.
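To make the failure mode concrete, here is a minimal, hypothetical sketch of what goes wrong: a parser written for the Bedrock Message API looks up keys that an OpenAI-style payload does not have, gets None back, and then indexes into it. The payload shape and field names below are illustrative.

```python
# Hypothetical OpenAI-style payload returned by a custom serving framework
openai_response = {
    "choices": [{"message": {"role": "assistant", "content": "Hi there!"}}]
}

# A Bedrock Message API parser expects the reply under "output" -> "message",
# so .get() returns None for an OpenAI-style payload ...
output = openai_response.get("output")

try:
    # ... and indexing into None raises the error shown above
    text = output["message"]["content"][0]["text"]
except TypeError as exc:
    print(exc)  # 'NoneType' object is not subscriptable
```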

Implementation Overview

Our implementation has three layers:

  1. Model deployment layer: Llama 3.1 served on SageMaker by SGLang, returning OpenAI-compliant responses
  2. Parser layer: a custom LlamaModelProvider class that extends SageMakerAIModel to handle Llama 3.1's response format
  3. Agent layer: a Strands agent that uses the custom provider for conversational AI, appropriately parsing the model's responses

We start with awslabs/ml-container-creator, an AWS Labs open-source Yeoman generator that automates the creation of SageMaker BYOC (bring your own container) deployment projects. It generates the artifacts required to build the LLM serving container, including Dockerfiles, CodeBuild configuration, and deployment scripts.

Install ml-container-creator

The first step is to create a serving container for our model. We use an open-source project to create the container and generate deployment scripts for it. The following commands show how to install awslabs/ml-container-creator and its dependencies, including npm and Yeoman. For more information, review the project README and wiki to get started.

# Install Yeoman globally
npm install -g yo

# Clone and install ml-container-creator
git clone https://github.com/awslabs/ml-container-creator
cd ml-container-creator
npm install && npm link

# Verify installation
yo --generators # Should show ml-container-creator

Generate deployment project

Once the generator is installed and linked, the yo command runs it: yo ml-container-creator launches the generator we need for this exercise.

# Run the generator
yo ml-container-creator

# Configuration options:
# - Framework: transformers
# - Model Server: sglang
# - Model: meta-llama/Llama-3.1-8B-Instruct
# - Deploy Target: codebuild
# - Instance Type: ml.g6.12xlarge (GPU)
# - Region: us-east-1

The generator creates a complete project structure:

/
├── Dockerfile # Container with SGLang and dependencies
├── buildspec.yml # CodeBuild configuration
├── code/
│ └── serve # SGLang server startup script
├── deploy/
│ ├── submit_build.sh # Triggers CodeBuild
│ └── deploy.sh # Deploys to SageMaker
└── test/
└── test_endpoint.sh # Endpoint testing script

Build and deploy

Projects generated by awslabs/ml-container-creator include templated build and deployment scripts. The ./deploy/submit_build.sh and ./deploy/deploy.sh scripts are used to build the image, push it to the Amazon Elastic Container Registry (Amazon ECR), and deploy it to an Amazon SageMaker AI real-time endpoint.

cd llama-31-deployment

# Build container with CodeBuild (no local Docker required)
./deploy/submit_build.sh

# Deploy to SageMaker
./deploy/deploy.sh arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole

Deploy process:

  1. CodeBuild builds the Docker image with SGLang and Llama 3.1
  2. The image is pushed to Amazon ECR
  3. SageMaker creates a real-time endpoint
  4. SGLang downloads the model from Hugging Face and loads it into GPU memory
  5. The endpoint reaches InService status (about 10-15 minutes)

We can test the endpoint using ./test/test_endpoint.sh, or with a direct invocation:

import boto3
import json

runtime_client = boto3.client('sagemaker-runtime', region_name="us-east-1")

payload = {
  "messages": [
    {"role": "user", "content": "Hello, how are you?"}
  ],
  "max_tokens": 100,
  "temperature": 0.7
}

response = runtime_client.invoke_endpoint(
  EndpointName="llama-31-deployment-endpoint",
  ContentType="application/json",
  Body=json.dumps(payload)
)

result = json.loads(response['Body'].read().decode('utf-8'))
print(result['choices'][0]['message']['content'])

Understanding Response Format

Llama 3.1 returns OpenAI-compliant responses, while Strands expects model responses to follow the Bedrock Message API format. Until late last year this was a standard compatibility mismatch; starting in December 2025, Amazon Bedrock supports OpenAI-compatible message formats. An OpenAI-style response looks like the following:

{
  "id": "cmpl-abc123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": ({
    "index": 0,
    "message": {"role": "assistant", "content": "I'm doing well, thank you for asking!"},
    "finish_reason": "stop"
  }),
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 12,
    "total_tokens": 35
  }
}

However, support for the Message API is not guaranteed for models hosted on SageMaker AI real-time endpoints. SageMaker AI allows customers to host a variety of foundation models on managed GPU-accelerated infrastructure, some of which may require nonstandard request/response formats. For example, the default SageMakerAIModel uses the Bedrock Message API format and attempts to access fields that are not present in the standard OpenAI message format, leading to the TypeError failure shown earlier.
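For non-streaming responses, the translation the parser must perform can be sketched as a single mapping function. The target field names below (output.message, stopReason, usage) follow the Bedrock Message/Converse shape; the function name and stop-reason mapping are illustrative, not the SDK's internal code.

```python
def openai_to_bedrock(resp: dict) -> dict:
    """Map an OpenAI chat.completion response to a Bedrock Message-style dict."""
    choice = resp["choices"][0]
    # OpenAI finish_reason values renamed to Bedrock stopReason values
    stop_map = {"stop": "end_turn", "length": "max_tokens"}
    return {
        "output": {
            "message": {
                "role": choice["message"]["role"],
                "content": [{"text": choice["message"]["content"]}],
            }
        },
        "stopReason": stop_map.get(choice["finish_reason"], choice["finish_reason"]),
        "usage": {
            "inputTokens": resp["usage"]["prompt_tokens"],
            "outputTokens": resp["usage"]["completion_tokens"],
            "totalTokens": resp["usage"]["total_tokens"],
        },
    }
```

Applied to the sample response above, this yields a stopReason of end_turn and places the assistant text under output.message.content.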

Implementing a Custom Model Parser

Custom model parsers are a feature of the Strands Agents SDK that provides strong adaptability and flexibility for customers building agents powered by LLMs hosted on SageMaker AI. Here, we describe how to create a custom provider that extends SageMakerAIModel:

def stream(self, messages: List[Dict[str, Any]], tool_specs: list, system_prompt: Optional[str], **kwargs):
  # Build payload messages
  payload_messages = []
  if system_prompt:
    payload_messages.append({"role": "system", "content": system_prompt})

  # Extract message content from Strands format
  for msg in messages:
    payload_messages.append({"role": "user", "content": msg['content'][0]['text']})

  # Build complete payload with streaming enabled
  payload = {
    "messages": payload_messages,
    "max_tokens": kwargs.get('max_tokens', self.max_tokens),
    "temperature": kwargs.get('temperature', self.temperature),
    "top_p": kwargs.get('top_p', self.top_p),
    "stream": True
  }

  try:
    # Invoke SageMaker endpoint with streaming
    response = self.runtime_client.invoke_endpoint_with_response_stream(
      EndpointName=self.endpoint_name,
      ContentType="application/json",
      Accept="application/json",
      Body=json.dumps(payload)
    )

    # Process streaming response
    accumulated_content = ""
    for event in response['Body']:
      chunk = event['PayloadPart']['Bytes'].decode('utf-8')
      if not chunk.strip():
        continue

      # Parse SSE format: "data: {json}\n"
      for line in chunk.split('\n'):
        if line.startswith('data: '):
          try:
            json_str = line.replace('data: ', '').strip()
            if not json_str:
              continue

            chunk_data = json.loads(json_str)
            if 'choices' in chunk_data and chunk_data['choices']:
              delta = chunk_data['choices'][0].get('delta', {})

              # Yield content delta in Strands format
              if 'content' in delta:
                content_chunk = delta['content']
                accumulated_content += content_chunk
                yield {
                  "type": "contentBlockDelta",
                  "delta": {"text": content_chunk},
                  "contentBlockIndex": 0
                }

              # Check for completion
              finish_reason = chunk_data['choices'][0].get('finish_reason')
              if finish_reason:
                yield {
                  "type": "messageStop",
                  "stopReason": finish_reason
                }

              # Yield usage metadata
              if 'usage' in chunk_data:
                yield {
                  "type": "metadata",
                  "usage": chunk_data['usage']
                }

          except json.JSONDecodeError:
            continue

  except Exception as e:
    yield {
      "type": "error",
      "error": {
        "message": f"Endpoint invocation failed: {str(e)}",
        "type": "EndpointInvocationError"
      }
    }
The stream method overrides the default behavior of SageMakerAIModel and allows the agent to parse responses based on the requirements of the underlying model. While most models support OpenAI's messaging API protocol, this capability enables power users to leverage highly specialized LLMs on SageMaker AI for agent workloads using the Strands Agents SDK. Once the custom model response logic is in place, the Strands Agents SDK makes it easy to initialize agents with custom model providers:

from strands.agent import Agent

# Initialize custom provider
provider = LlamaModelProvider(
  endpoint_name="llama-31-deployment-endpoint",
  region_name="us-east-1",
  max_tokens=1000,
  temperature=0.7
)

# Create agent with custom provider
agent = Agent(
  name="llama-assistant",
  model=provider,
  system_prompt=(
    "You are a helpful AI assistant powered by Llama 3.1, "
    "deployed on Amazon SageMaker. You provide clear, accurate, "
    "and friendly responses to user questions."
  )
)

# Test the agent
response = agent("What are the key benefits of deploying LLMs on SageMaker?")
print(response.content)

The complete implementation for this custom parser, including a Jupyter notebook and ml-container-creator deployment project with detailed explanations, is available in the companion GitHub repository.

Conclusion

Building custom model parsers for Strands agents helps users take advantage of diverse LLM deployments on SageMaker, regardless of their response formats. By extending SageMakerAIModel and implementing the stream() method, you can integrate custom-hosted models while maintaining Strands' clean agent interface.

Key takeaways:

  1. awslabs/ml-container-creator simplifies SageMaker BYOC deployment with production-ready infrastructure code
  2. Custom parsers bridge the gap between model server response formats and Strands expectations
  3. The stream() method is a key integration point for custom providers.

About the authors

Dan Ferguson is a Senior Solutions Architect at AWS based in New York, United States. As a Machine Learning Services Specialist, Dan helps customers integrate ML workflows efficiently, effectively, and sustainably.
