Scaling MLflow for Enterprise AI: What’s New in SageMaker AI with MLflow

Today we’re announcing Amazon SageMaker AI with MLflow, which now includes a serverless capability that dynamically manages the provisioning, scaling, and operation of infrastructure for artificial intelligence and machine learning (AI/ML) development tasks. It scales resources up during intensive use and down to zero when idle, reducing operational overhead. It also introduces enterprise-scale features, including cross-account sharing, automatic version upgrades, and seamless access management, with integration into SageMaker AI capabilities such as model optimization and pipelines. With no administrator configuration required and at no additional cost, data scientists can immediately begin tracking experiments, capturing traces, and evaluating model performance without infrastructure delays, making it straightforward to scale MLflow workloads across your organization while maintaining security and governance.

In this post, we explore how these new capabilities help you run large MLflow workloads, from generative AI agents to large language model (LLM) experiments, with improved performance, automation, and security using SageMaker AI with MLflow.

Enterprise-scale features in SageMaker AI with MLflow

The new MLflow serverless capability in SageMaker AI provides enterprise-grade management with automatic scaling, default provisioning, seamless version upgrades, simplified AWS Identity and Access Management (IAM) authorization, resource sharing through AWS Resource Access Manager (AWS RAM), and integration with both Amazon SageMaker Pipelines and model optimization. The term MLflow apps replaces the previous MLflow tracking server; the new terminology reflects a simplified, application-focused approach. You can access the new MLflow Apps page in Amazon SageMaker Studio, as shown in the following screenshot.

When you create a SageMaker Studio domain, a default MLflow app is automatically provisioned, streamlining setup. It is enterprise ready out of the box, requiring no additional provisioning or configuration. The MLflow app scales rapidly with your usage, reducing the need for manual capacity planning. Your training, tracking, and experimentation workloads automatically get the resources they need, simplifying operations while maintaining performance.
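To see how quickly you can start, the following minimal sketch connects to a managed MLflow app and logs a first run. It assumes the mlflow and sagemaker-mlflow packages are installed and that your role allows MLflow app access; the app ARN is a placeholder, and the exact ARN format for MLflow apps is an assumption to verify against the documentation.

```python
# Minimal sketch: connect to the default MLflow app and log a first run.
# Assumes `pip install mlflow sagemaker-mlflow` and IAM permission to call
# the MLflow app. The ARN below is a placeholder, not a real resource.
import mlflow

mlflow.set_tracking_uri(
    "arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/default"  # placeholder ARN
)
mlflow.set_experiment("my-first-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)   # example hyperparameter
    mlflow.log_metric("val_accuracy", 0.93)   # example metric
```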

Administrators can define a maintenance window during MLflow app creation, during which the MLflow app version is upgraded in place. This helps standardize, secure, and continuously update MLflow apps, reducing manual maintenance overhead. MLflow version 3.4 is supported with this launch and, as shown in the following screenshot, extends MLflow to ML, generative AI application, and agent workloads.
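As an illustration of scheduling upgrades, the following hedged boto3 sketch creates a managed MLflow resource with a weekly maintenance window. It uses the pre-existing CreateMlflowTrackingServer operation; the MLflow apps launch may expose a renamed operation or different parameters, so treat the call and values below as illustrative only.

```python
# Hedged sketch: provision a managed MLflow resource with a weekly maintenance
# window. CreateMlflowTrackingServer is the pre-apps boto3 operation; the new
# MLflow apps API may differ. Bucket, role, and names are placeholders.
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

sm.create_mlflow_tracking_server(
    TrackingServerName="team-mlflow",
    ArtifactStoreUri="s3://my-bucket/mlflow-artifacts",          # placeholder bucket
    RoleArn="arn:aws:iam::111122223333:role/MlflowAccessRole",   # placeholder role
    MlflowVersion="3.4",                     # version supported at this launch
    WeeklyMaintenanceWindowStart="SUN:01:00",  # upgrades applied in this window (UTC)
)
```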

Simplified Identity Management with MLflow Apps

We’ve simplified access control and IAM permissions for ML teams with the new MLflow app. A single, well-scoped permission, sagemaker:CallMlflowAppApi, now covers common MLflow operations, from creating and finding experiments to updating trace information, making access controls much simpler to implement.

By enabling simplified IAM permission boundaries, users and platform administrators can standardize IAM roles across teams, individuals, and projects, facilitating consistent and auditable access to MLflow experiments and metadata. For complete IAM permission and policy configuration, see Set IAM permissions for MLflow apps.
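As an illustration, the following sketch attaches the simplified permission with boto3. The policy name, account ID, and app ARN are placeholders; scope the Resource element to your own MLflow app.

```python
# Illustrative sketch: create an IAM policy granting the simplified MLflow
# app permission. All identifiers below are placeholders.
import json
import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:CallMlflowAppApi",  # covers common MLflow operations
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/default",
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="MlflowAppUserPolicy",               # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)
```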

Cross-account sharing of MLflow apps using AWS RAM

Administrators want to centrally manage their MLflow infrastructure while granting access to different AWS accounts. MLflow apps support cross-account sharing for collaborative enterprise AI development. Using AWS RAM, AI platform administrators can seamlessly share MLflow apps with data scientists in consumer AWS accounts, as shown in the following figure.

Figure: Cross-account sharing of an MLflow app through AWS RAM

Platform administrators can maintain a centralized, governed SageMaker domain that provisions and manages MLflow apps, and data scientists in separate consumer accounts can securely launch and interact with MLflow apps. With new simplified IAM permissions, enterprises can launch and manage MLflow apps from a centralized administrative AWS account. Using the shared MLflow app, a downstream data scientist consumer can log their MLflow experiments and generative AI workloads while maintaining governance, audit, and compliance from a single platform administrator control plane. To learn more about cross-account sharing, see Getting Started with AWS RAM.
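The following hedged boto3 sketch shows what sharing an MLflow app from the central platform account might look like with AWS RAM. The app ARN and consumer account ID are placeholders, and whether MLflow apps are shared directly by ARN is an assumption to verify against the AWS RAM documentation.

```python
# Hedged sketch: share an MLflow app with a consumer account via AWS RAM.
# ARNs and account IDs are placeholders; run this from the platform account.
import boto3

ram = boto3.client("ram", region_name="us-east-1")

response = ram.create_resource_share(
    name="shared-mlflow-app",
    resourceArns=[
        "arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/default"  # placeholder
    ],
    principals=["444455556666"],     # consumer (data scientist) account ID
    allowExternalPrincipals=False,   # restrict sharing to your AWS Organization
)
print(response["resourceShare"]["resourceShareArn"])
```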

SageMaker Pipelines and MLflow integration

SageMaker Pipelines now integrates with MLflow. SageMaker Pipelines is a serverless workflow orchestration service purpose-built for MLOps and LLMOps automation. You can seamlessly create, execute, and monitor repeatable end-to-end ML workflows with the intuitive drag-and-drop UI or the Python SDK. When a pipeline runs, a default MLflow app is created if one does not already exist, you can define an MLflow experiment name, and metrics, parameters, and artifacts are logged to the MLflow app as defined in your SageMaker pipeline code. The following screenshot shows an example ML pipeline using MLflow.
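As a sketch of the pattern, the following pipeline defines one Python @step (from the SageMaker Python SDK) that logs parameters and metrics to MLflow. The experiment name, tracking ARN, and execution role are placeholders, and per the post the default MLflow app is created automatically if one does not exist.

```python
# Hedged sketch: a SageMaker pipeline step that logs to MLflow.
# The tracking ARN, experiment name, and role ARN are placeholders.
import mlflow
from sagemaker.workflow.function_step import step
from sagemaker.workflow.pipeline import Pipeline

@step(instance_type="ml.m5.large")
def train():
    # Point the step at the MLflow app and record a run.
    mlflow.set_tracking_uri(
        "arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/default"  # placeholder
    )
    mlflow.set_experiment("pipeline-experiment")
    with mlflow.start_run():
        mlflow.log_param("epochs", 10)
        mlflow.log_metric("train_loss", 0.42)

pipeline = Pipeline(name="mlflow-demo-pipeline", steps=[train()])
pipeline.upsert(role_arn="arn:aws:iam::111122223333:role/SageMakerExecutionRole")
pipeline.start()
```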

SageMaker model optimization and MLflow integration

By default, SageMaker model optimization integrates with MLflow, providing automatic linking between model optimization jobs and MLflow experiments. When you run a model optimization fine-tuning task, the default MLflow app is used, an experiment is selected, and metrics, parameters, and artifacts are automatically logged for you. On the SageMaker model optimization task page, you can view metrics obtained from MLflow and drill down into additional metrics in the MLflow UI, as shown in the following screenshot.
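Because the metrics land in a regular MLflow experiment, you can also query them programmatically. The following sketch uses the standard mlflow.search_runs API; the experiment and metric names are placeholders for whatever your optimization task actually logged.

```python
# Hedged sketch: retrieve metrics that an optimization job logged to MLflow.
# Experiment and metric names below are placeholders.
import mlflow

mlflow.set_tracking_uri(
    "arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/default"  # placeholder ARN
)

# search_runs returns a pandas DataFrame with one row per run.
runs = mlflow.search_runs(experiment_names=["model-optimization-experiment"])
print(runs[["run_id", "metrics.train_loss"]].head())
```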


Conclusion

These features make the new MLflow apps in SageMaker AI ready for enterprise-scale ML and generative AI workloads with minimal administrative burden. You can get started with the examples in the GitHub samples repository and the AWS workshop.

MLflow apps are generally available in all AWS Regions where SageMaker Studio is available, except the China and AWS GovCloud (US) Regions. We invite you to explore the new capabilities and experience the increased efficiency and control they bring to your ML projects. Get started by visiting the SageMaker AI with MLflow product detail page and accelerate generative AI development with MLflow managed on Amazon SageMaker AI. Send your feedback through AWS re:Post for SageMaker or through your usual AWS Support contacts.


About the authors

Sandeep Ravish is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey to scale generative AI applications such as model training, agents, and generative AI use cases. He also focuses on go-to-market strategies that help AWS build and position products to solve industry challenges in the generative AI space. You can connect with Sandeep on LinkedIn to learn about generative AI solutions.

Rahul Ishwar is a Senior Product Manager at AWS, leading managed MLflow and partner AI apps within the Amazon SageMaker AIOps team. With over 20 years of experience spanning startups to enterprise technology, he draws on his entrepreneurial background and MBA from Chicago Booth to build scalable ML platforms that simplify AI adoption for organizations around the world. Connect with Rahul on LinkedIn to learn more about his work in ML platforms and enterprise AI solutions.

Jessica Liao is a Senior UX Designer at AWS who leads design for MLflow, model governance, and inference within Amazon SageMaker AI, shaping the way data scientists evaluate, operate, and deploy models. She brings expertise in tackling complex problems and driving human-centered innovation from her experience designing DNA life science systems, which she now applies to making machine learning tools more accessible and intuitive through cross-functional collaboration.
