- To deploy voxtral-mini, use the following code:
- To deploy voxtral-small, use the following code:
- Open and run Voxtral-vLLM-BYOC-SageMaker.ipynb to deploy your endpoint and test it with text, audio, and function calling capabilities.
docker container configuration
GitHub repo Full Dockerfile is included. The following code snippet highlights the key parts:
This Dockerfile creates a special container that extends the official VLLM server with Voxtral-specific capabilities by adding the necessary audio processing libraries (mistral_common For tokenization, librosa/soundfile/pydub for audio handling) while configuring the appropriate SageMaker environment variables for model loading and caching. This approach separates the infrastructure from the business logic by keeping the container generic and allowing SageMaker to dynamically inject model-specific code (model.py And serving.properties) from Amazon S3 at runtime, enabling flexible deployment of different Voxtral variants without the need to rebuild containers.
model configuration
The complete model configuration is in serving.properties The file is located in the code folder. The following code snippet highlights key configurations:
This configuration file provides Voxtral-specific customizations that follow Mistral’s official recommendations for VLLM server deployment, setting appropriate tokenization modes, audio processing parameters (supporting up to eight audio files per prompt with 30-minute transcription capability), and using the latest VLLM v0.10.0+ performance features such as chunked prefill and prefix caching. The modular design supports seamless switching between Voxtral-Mini and Voxtral-Small by simply changing model_id And tensor_parallel_degree parameters, while maintaining optimal memory usage and enabling advanced caching mechanisms for better estimation performance.
custom estimate handler
The full custom estimation code is in the models.py file located in the code folder. The following code snippet highlights the key functions:
This custom inference handler creates a FastAPI-based server that integrates directly with the VLLM server for optimal voxel performance. Handler processes multimodal content, including Base64-encoded audio and audio URLs, dynamically loads model configuration serving.properties Supports advanced features like file, and function calling for voxtral-small deployments.
SageMaker deployment code
Voxtral-vLLM-BYOC-SageMaker.ipynb notebook included Voxtral-vllm-byoc The folder organizes the entire deployment process for both Voxtral models:
model use cases
The Voxtral model supports various text and speech-to-text use cases, and the Voxtral-Small model supports tool use with voice input. refer to GitHub repository For complete code. In this section, we provide code snippets for various use cases supported by the model.
text only
The following code shows a basic text-based interaction with the model. The user sends a text query and receives a structured response:
transcription only
The following example focuses on speech-to-text transcription by setting the temperature to 0 for deterministic output. The model processes an audio file URL or converting an audio file to Base64 code, then returns written text without additional interpretation:
Understanding text and audio
The following code combines both text instructions and audio input for multimodal processing. The model can follow specific text commands when analyzing an audio file provided in an inference pass, enabling more complex interactions such as guided transcription or audio analysis tasks:
use of equipment
The following code demonstrates function calling capabilities, where the model can interpret voice commands and execute predefined tools. The example demonstrates weather queries via voice input, with the model automatically calling the appropriate functions and returning structured results:
Strands Agent Integration
The following example shows how to integrate Voxtral with strands structure Creating intelligent agents capable of using multiple devices. The agent can automatically select and execute appropriate tools (such as calculators, file operations, or shell commands from Strands prebuilt tools) based on the user’s queries, enabling complex multi-step workflows through natural language interactions:
cleanliness
When you finish experimenting with this example, delete the SageMaker endpoints created in the notebook to avoid unnecessary costs:
conclusion
In this post, we showed how to successfully self-host Mistral’s open source Voxtral model on SageMaker using a BYOC approach. We have created a production-ready system that utilizes the latest VLLM framework and official Voxtral optimizations for both mini and small model variants. The solution supports the full spectrum of Voxtral capabilities, including text-only conversations, audio transcription, sophisticated multimodal understanding, and function calling directly from voice input. With this flexible architecture, you can switch between Voxtral-Mini and Voxtral-Small models through simple configuration updates without the need to rebuild containers.
Take your multimodal AI applications to the next level by trying out complete code GitHub repository To host the Voxtral model on SageMaker and start building your own voice-enabled applications. Explore the full potential of Voxtral by visiting Mistral official website Exploring detailed capabilities, performance benchmarks and technical specifications. Finally, explore strands agent Framework for seamlessly building agentic applications that execute complex workflows.
About the authors
Ying Hou, PhD, is a Senior Specialist Solutions Architect for GenAI at AWS, where she collaborates with model providers to incorporate the latest and most intelligent AI models on the AWS platform. With deep expertise in General AI, ASR, computer vision, NLP, and time-series forecasting models, she works closely with clients to design and build cutting-edge ML and GenAI applications.