Scaling data annotation using vision-language models to empower physical AI systems


Severe labor shortages are hampering growth in the manufacturing, logistics, construction, and agriculture sectors. The problem is especially acute in construction: approximately 500,000 positions remain unfilled in the United States, and 40% of the current workforce is headed for retirement within a decade. These workforce limitations result in project delays, increased costs, and stalled development plans. To overcome these barriers, organizations are developing autonomous systems that can perform tasks that fill capacity gaps, expand operational capabilities, and provide the added benefit of around-the-clock productivity.

Building autonomous systems requires large, annotated datasets to train AI models, and effective training determines whether these systems deliver business value. The bottleneck is the high cost of data preparation. In particular, labeling video data – identifying equipment, actions, and environmental context – is essential to making the data useful for model training. This bottleneck can hinder model deployment, slowing the delivery of AI-powered products and services to customers. For companies managing millions of hours of video, manual data preparation and annotation becomes impractical. Vision-language models (VLMs) help address this by interpreting images and video, answering natural language questions, and producing descriptions at a speed and scale that manual processes cannot match, providing a cost-effective alternative.
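To make this concrete, the following is a minimal sketch of calling a VLM to annotate a single video frame through the Amazon Bedrock Converse API. The model ID, file name, and prompt are illustrative placeholders, not the configuration Bedrock Robotics used.

```python
# Minimal sketch: annotate one video frame with a VLM via Amazon Bedrock.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def annotate_frame(frame_path: str) -> str:
    """Ask a vision-language model to describe one video frame."""
    with open(frame_path, "rb") as f:
        frame_bytes = f.read()
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": frame_bytes}}},
                {"text": "Describe the equipment, tool attachment, and "
                         "site conditions visible in this frame."},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

print(annotate_frame("excavator_frame_001.jpg"))  # hypothetical file
```

Run over sampled frames, a loop like this turns raw footage into searchable text descriptions without human annotators in the loop.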

In this post, we examine how Bedrock Robotics addressed this challenge. Through the AWS Physical AI Fellowship, the startup partnered with the AWS Generative AI Innovation Center to improve data preparation for autonomous construction equipment, applying vision-language models to analyze construction video footage, extract operational details, and generate large-scale labeled training datasets.

Bedrock Robotics: A case study in accelerating autonomous construction

Founded in 2024, Bedrock Robotics develops autonomous systems for construction equipment. The company’s product, Bedrock Operator, is a retrofit solution that combines hardware with AI models to enable excavators and other machinery to operate with minimal human intervention. These systems can perform tasks such as digging, grading, and material handling with centimeter-level accuracy. Training these models requires capturing massive amounts of video footage of equipment, tasks, and surrounding environments – a highly resource-intensive process that limits scalability.

VLMs provide a solution by analyzing this image and video data and generating text descriptions. This makes them suitable for annotation tasks, which is important for teaching models how to associate visual patterns with human language. Bedrock Robotics used this technology to streamline data preparation for training the AI models that enable autonomous operation of its equipment. Through careful model selection and prompt engineering, the company improved tool identification accuracy from 34% to 70%, transforming a manual, time-intensive process into an automated, scalable data pipeline. This success accelerated the deployment of autonomous equipment.

This approach provides an exemplary framework for organizations facing similar data challenges and demonstrates how strategic investments in foundation models (FMs) can deliver measurable operational results and competitive advantage. Foundation models are models trained on massive amounts of data using self-supervised learning techniques, learning general representations that can be adapted to many downstream tasks. VLMs leverage these large-scale pretraining techniques to combine visual and textual modalities, enabling them to understand, analyze, and generate content across both images and language.

In the following sections, we look at the process Bedrock Robotics used to annotate millions of hours of video footage and accelerate innovation using a VLM-based solution.

From unstructured video data to strategic asset using VLM

Enabling autonomous construction equipment requires extracting useful information from millions of hours of unstructured operational footage. Specifically, Bedrock Robotics needs to identify tool attachments, tasks, and workplace conditions across varied scenarios. The following images are example video frames from this dataset.

Construction equipment operates with multiple tool attachments, each of which requires accurate classification to train reliable AI models. Working with the Innovation Center, Bedrock Robotics focused its efforts on a few critical equipment categories: lifting hooks for material handling, hammers for concrete demolition, grading beams for surface leveling, and trenching buckets for narrow excavation.

These labels allow Bedrock Robotics to select relevant video segments and assemble training datasets that represent a variety of equipment configurations and operating conditions.
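As a sketch of what this selection step might look like, the following groups annotated segments by tool label so they can be assembled into per-category training sets. The record fields (video_id, start_s, end_s, tool) are assumptions made for illustration; the post does not describe the team’s actual schema.

```python
# Hedged sketch: assemble per-tool training sets from VLM-generated labels.
from collections import defaultdict

TARGET_TOOLS = {"lifting_hook", "hammer", "grading_beam", "trenching_bucket"}

def select_segments(annotations: list[dict]) -> dict[str, list[dict]]:
    """Group annotated video segments by tool label, keeping only target tools."""
    datasets = defaultdict(list)
    for seg in annotations:
        if seg["tool"] in TARGET_TOOLS:
            datasets[seg["tool"]].append(seg)
    return dict(datasets)

# Invented example records for illustration
annotations = [
    {"video_id": "a12", "start_s": 0, "end_s": 30, "tool": "trenching_bucket"},
    {"video_id": "a12", "start_s": 30, "end_s": 55, "tool": "digging_bucket"},
    {"video_id": "b07", "start_s": 5, "end_s": 40, "tool": "grading_beam"},
]
print(select_segments(annotations))
```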

Accelerating AI deployment through strategic model optimization

Off-the-shelf VLMs (VLMs without prompt optimization) struggle with construction video data because they are trained on web images, not operator footage from an excavator cabin. They can’t handle unusual camera angles, equipment-specific scenes, or poor visibility from dust and weather. They also lack the domain knowledge to differentiate visually similar attachments such as digging buckets and trenching buckets.

Bedrock Robotics and the Innovation Center addressed this through targeted model selection and prompt optimization. The teams evaluated several VLMs – including open source options and FMs available in Amazon Bedrock – then refined the prompts with detailed visual descriptions of each tool, guidance for commonly confused tool pairs, and step-by-step instructions for analyzing video frames.
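The following prompt skeleton illustrates those three refinements – per-tool visual descriptions, disambiguation guidance for confusable pairs, and step-by-step analysis instructions. The wording is a hypothetical reconstruction, not the team’s actual prompt.

```python
# Illustrative classification prompt; the descriptions are assumptions,
# not Bedrock Robotics' production prompt.
CLASSIFICATION_PROMPT = """\
You are labeling excavator tool attachments in construction video frames.

Tool descriptions:
- lifting_hook: a curved steel hook hanging from the stick, with no bucket body.
- hammer: a long cylindrical breaker mounted in place of the bucket.
- grading_beam: a wide, flat blade used to level surfaces.
- trenching_bucket: a narrow bucket, clearly thinner than the machine arm.

Commonly confused pairs:
- digging_bucket vs. trenching_bucket: a trenching bucket is much narrower
  relative to the arm; check the bucket width before deciding.

Instructions:
1. Locate the end of the excavator arm in the frame.
2. Compare the attachment against each description above.
3. If dust or weather obscures the attachment, answer "uncertain".
4. Respond with exactly one label.
"""
```

Structuring the prompt this way gives the model explicit visual criteria to check, rather than relying on whatever equipment knowledge it absorbed from web-scale pretraining.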

These modifications increased classification accuracy from 34% to 70% at a cost of $10 per hour of video processed, measured on a test set of 130 videos. These results demonstrate how prompt engineering optimizes VLMs for specific tasks. For Bedrock Robotics, this optimization delivered faster training cycles, shorter time to deployment, and a cost-effective, scalable annotation pipeline that evolves with operational needs.
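Measuring such an improvement reduces to comparing VLM labels against human ground-truth labels on a held-out test set, as in this minimal sketch (the sample labels are invented):

```python
# Minimal sketch of the evaluation implied above: VLM labels vs. human labels.
def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of videos where the VLM label matches the human label."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

preds = ["hammer", "trenching_bucket", "grading_beam"]
truth = ["hammer", "digging_bucket", "grading_beam"]
print(f"accuracy: {accuracy(preds, truth):.2f}")  # accuracy: 0.67
```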

The way forward: Addressing labor shortages through automation

For Bedrock Robotics, the vision-language solution enabled rapid identification and extraction of critical datasets, surfacing essential insights from large-scale construction video footage. With an overall accuracy of 70%, this cost-effective approach provides a practical foundation for scaling data preparation for model training. It shows how strategic AI investment can overcome workforce barriers and accelerate industry transformation. Organizations that streamline data preparation can accelerate autonomous system deployment, reduce operating costs, and explore new areas for growth in industries hit by labor shortages. With this repeatable framework, manufacturing and industrial automation leaders facing similar challenges can apply these principles to drive competitive differentiation within their own domains.

To learn more, visit Bedrock Robotics or explore physical AI resources on AWS.


About the authors

Laura Kulowski

Laura Kulowski is a Senior Applied Scientist in the AWS Generative AI Innovation Center, where she works to develop physical AI solutions. Before joining Amazon, Laura completed her PhD in Harvard’s Department of Earth and Planetary Sciences, investigating Jupiter’s deep zonal flow and magnetic field using Juno data.

Alla Simoneau

Alla Simoneau is a technology and commercial leader with over 15 years of experience, currently serving as the Emerging Technology Physical AI Lead at Amazon Web Services (AWS), where she drives global innovation at the intersection of AI and real-world applications. With over a decade at Amazon, Alla is a recognized leader in strategy, team building, and operational excellence, adept at translating cutting-edge technologies into real-world transformations for startups and enterprise customers.

Parmida Atighehchian

Parmida Atighehchian is a senior data scientist at the AWS Generative AI Innovation Center. With over 10 years of experience in deep learning and generative AI, Parmida brings deep expertise in AI and customer-centric solutions. Parmida has led and co-authored highly influential scientific papers in domains such as computer vision, interpretability, and video and image generation. With a strong focus on scientific practice, Parmida helps customers design practical generative AI systems as robust and scalable pipelines.

Dan Volk

Dan Volk is a senior data scientist at the AWS Generative AI Innovation Center. He has 10 years of experience in machine learning, deep learning, and time series analysis, and holds a master’s degree in Data Science from UC Berkeley. He is passionate about turning complex business challenges into opportunities by leveraging cutting-edge AI technologies.

Paul Amadeo

Paul Amadeo is an experienced technology leader with over 30 years of experience in artificial intelligence, machine learning, IoT systems, RF design, optics, semiconductor physics, and advanced engineering. As the technical lead for Physical AI in the AWS Generative AI Innovation Center, Paul specializes in translating AI capabilities into tangible physical systems, guiding enterprise customers through complex implementations from concept to production. His diverse background includes designing computer vision systems for edge environments, designing robotic smart card manufacturing technologies that have produced billions of devices globally, and leading cross-functional teams in both commercial and defense sectors. Paul holds an MS in Applied Physics from the University of California, San Diego, a BS in Applied Physics from Caltech, and six patents in optical systems, communications devices, and manufacturing technologies.

Sri Elaprolu

Sri Elaprolu is the Director of the AWS Generative AI Innovation Center, where he leads a global team implementing cutting-edge AI solutions for enterprise and government organizations. During his 13-year tenure at AWS, he has led ML science teams partnering with global enterprises and public sector organizations. Prior to AWS, he spent 14 years at Northrop Grumman in product development and software engineering leadership roles. Sri has a Master’s degree in Engineering Science and an MBA.
