Deploy Mistral AI’s Voxtral on Amazon SageMaker AI


Configure your model in code/serving.properties:

  1. To deploy voxtral-mini, use the following code:
    option.model_id=mistralai/Voxtral-Mini-3B-2507
    option.tensor_parallel_degree=1

  2. To deploy voxtral-small, use the following code:
    option.model_id=mistralai/Voxtral-Small-24B-2507
    option.tensor_parallel_degree=4

  3. Open and run Voxtral-vLLM-BYOC-SageMaker.ipynb to deploy your endpoint and test it with text, audio, and function calling capabilities.

Docker container configuration

The full Dockerfile is included in the GitHub repo. The following code snippet highlights the key parts:

# Custom vLLM Container for Voxtral Model Deployment on SageMaker
FROM --platform=linux/amd64 vllm/vllm-openai:latest
# Set environment variables for SageMaker
ENV MODEL_CACHE_DIR=/opt/ml/model
ENV TRANSFORMERS_CACHE=/tmp/transformers_cache
ENV HF_HOME=/tmp/hf_home
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
# Install audio processing dependencies
RUN pip install --no-cache-dir \
    "mistral_common>=1.8.1" \
    "librosa>=0.10.2" \
    "soundfile>=0.12.1" \
    "pydub>=0.25.1"

This Dockerfile creates a custom container that extends the official vLLM server with Voxtral-specific capabilities by adding the necessary audio processing libraries (mistral_common for tokenization; librosa, soundfile, and pydub for audio handling) while configuring the appropriate SageMaker environment variables for model loading and caching. This approach separates the infrastructure from the business logic by keeping the container generic and letting SageMaker dynamically inject model-specific code (model.py and serving.properties) from Amazon S3 at runtime, enabling flexible deployment of different Voxtral variants without rebuilding containers.

Model configuration

The complete model configuration is in the serving.properties file located in the code folder. The following code snippet highlights the key configurations:

# Model configuration
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
option.dtype=bfloat16
# Voxtral-specific settings (as per official documentation)
option.tokenizer_mode=mistral
option.config_format=mistral
option.load_format=mistral
option.trust_remote_code=true
# Audio processing (Voxtral specifications)
option.limit_mm_per_prompt=audio:8
option.mm_processor_kwargs={"audio_sampling_rate": 16000, "audio_max_length": 1800.0}
# Performance optimizations (vLLM v0.10.0+ features)
option.enable_chunked_prefill=true
option.enable_prefix_caching=true
option.use_v2_block_manager=true

This configuration file provides Voxtral-specific customizations that follow Mistral's official recommendations for vLLM server deployment, setting appropriate tokenization modes and audio processing parameters (supporting up to eight audio files per prompt with 30-minute transcription capability), and using the latest vLLM v0.10.0+ performance features such as chunked prefill and prefix caching. The modular design supports seamless switching between Voxtral-Mini and Voxtral-Small by simply changing the model_id and tensor_parallel_degree parameters, while maintaining optimal memory usage and enabling advanced caching mechanisms for better inference performance.
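The handler reads these settings at startup. The following is a minimal sketch of a parser for this key=value format, a simplified stand-in for the repo's load_serving_properties helper (the default path is an assumption based on the standard SageMaker model directory layout; the actual implementation may differ):

```python
import os

def load_serving_properties(path="/opt/ml/model/code/serving.properties"):
    """Parse key=value pairs from serving.properties into a dict.

    Lines starting with '#' are comments; values keep everything after the
    first '=' so JSON-valued options like mm_processor_kwargs survive intact.
    """
    config = {}
    if not os.path.exists(path):
        return config
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config
```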

Custom inference handler

The full custom inference code is in the model.py file located in the code folder. The following code snippet highlights the key functions:

# FastAPI app for SageMaker compatibility
app = FastAPI(title="Voxtral vLLM Inference Server", version="1.1.0")
model_engine = None

# vLLM server initialization for Voxtral
def start_vllm_server():
    """Start vLLM server with Voxtral-specific configuration"""
    config = load_serving_properties()

    cmd = [
        "vllm", "serve", config.get("option.model_id"),
        "--tokenizer-mode", "mistral",
        "--config-format", "mistral",
        "--tensor-parallel-size", config.get("option.tensor_parallel_degree"),
        "--host", "127.0.0.1",
        "--port", "8000",
    ]

    vllm_server_process = subprocess.Popen(cmd, env=vllm_env)
    server_ready = wait_for_server()
    return server_ready

@app.post("/invocations")
async def invoke_model(request: Request):
    """Handle chat, transcription, and function calling"""
    request_data = await request.json()

    # Transcription requests
    if "transcription" in request_data:
        audio_source = request_data["transcription"]["audio"]
        return transcribe_audio(audio_source)

    # Chat requests with multimodal support
    messages = format_messages_for_openai(request_data["messages"])
    tools = request_data.get("tools")

    # Generate via vLLM OpenAI client
    response = openai_client.chat.completions.create(
        model=model_config["model_id"],
        messages=messages,
        tools=tools if supports_function_calling() else None,
    )
    return response

This custom inference handler creates a FastAPI-based server that integrates directly with the vLLM server for optimal Voxtral performance. The handler processes multimodal content, including Base64-encoded audio and audio URLs, dynamically loads model configuration from the serving.properties file, and supports advanced features like function calling for Voxtral-Small deployments.
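For illustration, the URL-versus-Base64 branching the handler performs might look like the following sketch (the helper name resolve_audio_source and its return shape are hypothetical, not taken from the repo):

```python
import base64
import binascii

def resolve_audio_source(audio: str) -> dict:
    """Classify an audio source as a URL or inline Base64 data.

    Returns {"kind": "url", ...} for http(s) sources; otherwise attempts
    Base64 decoding and returns the raw bytes for the audio libraries.
    """
    if audio.startswith(("http://", "https://")):
        return {"kind": "url", "location": audio}
    try:
        data = base64.b64decode(audio, validate=True)
        return {"kind": "bytes", "data": data}
    except (binascii.Error, ValueError) as exc:
        raise ValueError("audio must be a URL or Base64-encoded data") from exc
```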

SageMaker deployment code

The Voxtral-vLLM-BYOC-SageMaker.ipynb notebook included in the voxtral-vllm-byoc folder organizes the entire deployment process for both Voxtral models:

import boto3
import sagemaker
from sagemaker.model import Model
# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "your-s3-bucket"
# Upload model artifacts to S3
byoc_config_uri = sagemaker_session.upload_data(
path="./code",
bucket=bucket,
key_prefix="voxtral-vllm-byoc/code"
)
# Configure custom container image
account_id = boto3.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/voxtral-vllm-byoc:latest"
# Create SageMaker model
voxtral_model = Model(
	image_uri=image_uri,
	model_data={
		"S3DataSource": {
		"S3Uri": f"{byoc_config_uri}/",
		"S3DataType": "S3Prefix",
		"CompressionType": "None"
		}
	},
	role=role,
	env={
		'MODEL_CACHE_DIR': '/opt/ml/model',
		'TRANSFORMERS_CACHE': '/tmp/transformers_cache',
		'SAGEMAKER_BIND_TO_PORT': '8080'
		}
	)
# Deploy to endpoint
predictor = voxtral_model.deploy(
	initial_instance_count=1,
	instance_type="ml.g6.12xlarge", # For Voxtral-Small
	container_startup_health_check_timeout=1200,
	wait=True
	)
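Once the endpoint is in service, it can also be invoked outside the notebook with the low-level sagemaker-runtime client instead of the predictor object. A sketch under that assumption (the helper names build_body and invoke_voxtral are illustrative, not from the repo):

```python
import json

def build_body(payload: dict) -> bytes:
    """Serialize a request payload for the container's /invocations route."""
    return json.dumps(payload).encode("utf-8")

def invoke_voxtral(endpoint_name: str, payload: dict) -> dict:
    """Invoke the deployed endpoint via the low-level runtime client."""
    import boto3  # imported lazily so the helper can be defined offline
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_body(payload),
    )
    return json.loads(response["Body"].read())
```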

Model use cases

The Voxtral models support various text and speech-to-text use cases, and the Voxtral-Small model additionally supports tool use with voice input. Refer to the GitHub repository for the complete code. In this section, we provide code snippets for the use cases supported by the models.

Text only

The following code shows a basic text-based interaction with the model. The user sends a text query and receives a structured response:

payload = {
    "messages": [
        {
            "role": "user",
            "content": "Hello! Can you tell me about the advantages of using vLLM for model inference?"
        }
    ],
    "max_tokens": 200,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)

Transcription only

The following example focuses on speech-to-text transcription by setting the temperature to 0 for deterministic output. The model processes an audio file URL or a Base64-encoded audio file, then returns the transcribed text without additional interpretation:

payload = {
	"transcription": {
		"audio": "https://audiocdn.frenchtoday.com/file/ft-public-files/audiobook-samples/AMPFE/AMP%20FE%20Ch%2002%20Story%20Slower.mp3",
		"language": "fr",
		"temperature": 0.0
		}
	}
response = predictor.predict(payload)
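To pass Base64-encoded file contents instead of a URL in the audio field, the payload can be prepared from a local file along these lines (the helper names and the local file path are illustrative, not from the repo):

```python
import base64

def encode_audio_file(path: str) -> str:
    """Read a local audio file and return its Base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_transcription_payload(path: str, language: str = "fr") -> dict:
    """Build a transcription request carrying inline Base64 audio."""
    return {
        "transcription": {
            "audio": encode_audio_file(path),
            "language": language,
            "temperature": 0.0,
        }
    }
```

The resulting dictionary can then be sent with predictor.predict exactly as in the URL-based example above.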

Understanding text and audio

The following code combines text instructions and audio input for multimodal processing. The model can follow specific text commands when analyzing an audio file provided in the same inference pass, enabling more complex interactions such as guided transcription or audio analysis tasks:

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Can you summarise this audio file"
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3"
                }
            ]
        }
    ],
    "max_tokens": 300,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
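Because the multimodal payload shape is fixed, a small helper can assemble it; a sketch (the helper name build_multimodal_message is hypothetical, not from the repo):

```python
def build_multimodal_message(text: str, audio_url: str, role: str = "user") -> dict:
    """Build a chat message mixing a text instruction with an audio source."""
    return {
        "role": role,
        "content": [
            {"type": "text", "text": text},
            {"type": "audio", "path": audio_url},
        ],
    }

# Assemble the same text-plus-audio request as above
payload = {
    "messages": [
        build_multimodal_message(
            "Can you summarise this audio file",
            "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
        )
    ],
    "max_tokens": 300,
    "temperature": 0.2,
}
```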

Tool use

The following code demonstrates function calling capabilities, where the model can interpret voice commands and execute predefined tools. The example shows weather queries via voice input, with the model automatically calling the appropriate function and returning structured results:

# Define weather tool configuration
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a specific location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use."
                }
            },
            "required": ["location", "format"]
        }
    }
}

# Mock weather function
def mock_weather(location, format="celsius"):
    """Always returns sunny weather at 25°C/77°F"""
    temp = 77 if format.lower() == "fahrenheit" else 25
    unit = "°F" if format.lower() == "fahrenheit" else "°C"
    return f"It's sunny in {location} with {temp}{unit}"

# Test payload with audio
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/patrickvonplaten/audio_samples/resolve/main/fn_calling.wav"
                }
            ]
        }
    ],
    "temperature": 0.2,
    "top_p": 0.95,
    "tools": [WEATHER_TOOL]
}
response = predictor.predict(payload)
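After the model responds, the caller is responsible for executing any tool calls it requested (for example, dispatching get_current_weather to mock_weather) and optionally sending the results back for a final answer. A hedged sketch of that dispatch step, assuming an OpenAI-style tool_calls list in the response (the helper name dispatch_tool_calls is illustrative):

```python
import json

def dispatch_tool_calls(tool_calls, registry):
    """Execute each requested tool call against a registry of local functions.

    `tool_calls` is the OpenAI-style list of entries shaped like
    {"function": {"name": ..., "arguments": "<json string>"}}.
    """
    results = []
    for call in tool_calls:
        fn_name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])
        fn = registry.get(fn_name)
        if fn is None:
            results.append(f"Unknown tool: {fn_name}")
        else:
            results.append(fn(**args))
    return results
```

With a response in hand, the call might look like dispatch_tool_calls(response["choices"][0]["message"]["tool_calls"], {"get_current_weather": mock_weather}), with the exact response shape depending on how the handler forwards the vLLM output.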

Strands Agents integration

The following example shows how to integrate Voxtral with the Strands Agents framework to create intelligent agents capable of using multiple tools. The agent can automatically select and execute appropriate tools (such as calculators, file operations, or shell commands from the Strands prebuilt tools) based on the user's queries, enabling complex multi-step workflows through natural language interactions:

# SageMaker integration with Strands agents
from strands import Agent
from strands.models.sagemaker import SageMakerAIModel
from strands_tools import calculator, current_time, file_read, shell

model = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": endpoint_name,
        "region_name": "us-west-2",
    },
    payload_config={
        "max_tokens": 1000,
        "temperature": 0.7,
        "stream": False,
    }
)
agent = Agent(model=model, tools=[calculator, current_time, file_read, shell])
response = agent("What is the square root of 12?")

Clean up

When you finish experimenting with this example, delete the SageMaker endpoints created in the notebook to avoid unnecessary costs:

# Delete SageMaker endpoint
print(f"Deleting endpoint: {endpoint_name}")
predictor.delete_endpoint(delete_endpoint_config=True)
print("Endpoint deleted successfully")
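If the predictor object is no longer in scope (for example, in a fresh notebook session), the same cleanup can be done with the low-level SageMaker client. A sketch, assuming the endpoint configuration shares the endpoint's name (the usual SageMaker SDK behavior when deploying as above; the helper name is illustrative):

```python
def delete_endpoint_resources(endpoint_name: str, region_name: str = "us-west-2"):
    """Delete a SageMaker endpoint and its endpoint configuration."""
    import boto3  # imported lazily so the helper can be defined offline
    sm = boto3.client("sagemaker", region_name=region_name)
    sm.delete_endpoint(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=endpoint_name)
```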

Conclusion

In this post, we showed how to self-host Mistral's open source Voxtral models on SageMaker using a BYOC approach. We created a production-ready system that uses the latest vLLM framework and official Voxtral optimizations for both the Mini and Small model variants. The solution supports the full spectrum of Voxtral capabilities, including text-only conversations, audio transcription, sophisticated multimodal understanding, and function calling directly from voice input. With this flexible architecture, you can switch between Voxtral-Mini and Voxtral-Small through simple configuration updates without rebuilding containers.

Take your multimodal AI applications to the next level by trying out the complete code in the GitHub repository to host the Voxtral models on SageMaker and start building your own voice-enabled applications. Explore the full potential of Voxtral by visiting the official Mistral website for detailed capabilities, performance benchmarks, and technical specifications. Finally, explore the Strands Agents framework for building agentic applications that execute complex workflows.


About the authors

Ying Hou, PhD, is a Senior Specialist Solutions Architect for GenAI at AWS, where she collaborates with model providers to bring the latest and most intelligent AI models to the AWS platform. With deep expertise in generative AI, ASR, computer vision, NLP, and time-series forecasting models, she works closely with customers to design and build cutting-edge ML and GenAI applications.
