Scaling data annotation using vision-language models to empower physical AI systems


Severe labor shortages are hampering growth in the manufacturing, logistics, construction, and agriculture sectors. The problem is especially acute in construction: approximately 500,000 positions remain unfilled in the United States, and 40% of the current workforce is headed for retirement within a decade. These workforce limitations result in project delays, increased costs, and stalled development plans. To overcome these barriers, organizations are developing autonomous systems that can perform tasks that fill capacity gaps, expand operational capabilities, and provide the added benefit of around-the-clock productivity.

Building autonomous systems requires large, annotated datasets to train AI models, and effective training determines whether these systems deliver business value. The bottleneck is the high cost of data preparation. In particular, labeling video data – identifying equipment, actions, and environmental context – is essential to making the data useful for model training. This bottleneck can hinder model deployment, slowing the delivery of AI-powered products and services to customers. For companies managing millions of hours of video, manual data preparation and annotation becomes impractical. Vision-language models (VLMs) help address this by interpreting images and video, answering natural language questions, and producing descriptions at a speed and scale that manual processes cannot match, providing a cost-effective alternative.
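To make this concrete, the following is a minimal sketch of calling a VLM to annotate a single video frame through the Amazon Bedrock Converse API. The model ID, file name, and prompt are illustrative placeholders, not the configuration Bedrock Robotics used.

```python
# Minimal sketch: annotate one video frame with a VLM via Amazon Bedrock.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def annotate_frame(frame_path: str) -> str:
    """Ask a vision-language model to describe one video frame."""
    with open(frame_path, "rb") as f:
        frame_bytes = f.read()
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": frame_bytes}}},
                {"text": "Describe the equipment, tool attachment, and "
                         "site conditions visible in this frame."},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

print(annotate_frame("excavator_frame_001.jpg"))  # hypothetical file
```

Run over sampled frames, a loop like this turns raw footage into searchable text descriptions without human annotators in the loop.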

In this post, we examine how Bedrock Robotics addressed this challenge. Through the AWS Physical AI Fellowship, the startup partnered with the AWS Generative AI Innovation Center to improve data preparation for autonomous construction equipment, applying vision-language models to analyze construction video footage, extract operational details, and generate large-scale labeled training datasets.

Bedrock Robotics: A case study in accelerating autonomous construction

Founded in 2024, Bedrock Robotics develops autonomous systems for construction equipment. The company’s product, Bedrock Operator, is a retrofit solution that combines hardware with AI models to enable excavators and other machinery to operate with minimal human intervention. These systems can perform tasks such as digging, grading, and material handling with centimeter-level accuracy. Training these models requires capturing massive amounts of video footage of equipment, tasks, and surrounding environments – a highly resource-intensive process that limits scalability.

VLMs provide a solution by analyzing this image and video data and generating text descriptions. This makes them suitable for annotation tasks, which is important for teaching models how to associate visual patterns with human language. Bedrock Robotics used this technology to streamline data preparation for training the AI models that enable autonomous operation of its equipment. Through careful model selection and prompt engineering, the company improved tool identification accuracy from 34% to 70%, transforming a manual, time-intensive process into an automated, scalable data pipeline. This success accelerated the deployment of autonomous equipment.

This approach provides an exemplary framework for organizations facing similar data challenges and demonstrates how strategic investments in foundation models (FMs) can deliver measurable operational results and competitive advantage. Foundation models are models trained on massive amounts of data using self-supervised learning techniques, learning general representations that can be adapted to many downstream tasks. VLMs leverage these large-scale pretraining techniques to combine visual and textual modalities, enabling them to understand, analyze, and generate content across both images and language.

In the following sections, we look at the process Bedrock Robotics used to annotate millions of hours of video footage and accelerate innovation using a VLM-based solution.

From unstructured video data to strategic asset using VLM

Enabling autonomous construction equipment requires extracting useful information from millions of hours of unstructured operational footage. Specifically, Bedrock Robotics needs to identify tool attachments, tasks, and workplace conditions across varied scenarios. The following images are example video frames from this dataset.

Construction equipment operates with multiple tool attachments, each of which requires accurate classification to train reliable AI models. Working with the Innovation Center, Bedrock Robotics focused its efforts on a few critical equipment categories: lifting hooks for material handling, hammers for concrete demolition, grading beams for surface leveling, and trenching buckets for narrow excavation.

These labels allow Bedrock Robotics to select relevant video segments and assemble training datasets that represent a variety of equipment configurations and operating conditions.
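As a sketch of what this selection step might look like, the following groups annotated segments by tool label so they can be assembled into per-category training sets. The record fields (video_id, start_s, end_s, tool) are assumptions made for illustration; the post does not describe the team’s actual schema.

```python
# Hedged sketch: assemble per-tool training sets from VLM-generated labels.
from collections import defaultdict

TARGET_TOOLS = {"lifting_hook", "hammer", "grading_beam", "trenching_bucket"}

def select_segments(annotations: list[dict]) -> dict[str, list[dict]]:
    """Group annotated video segments by tool label, keeping only target tools."""
    datasets = defaultdict(list)
    for seg in annotations:
        if seg["tool"] in TARGET_TOOLS:
            datasets[seg["tool"]].append(seg)
    return dict(datasets)

# Invented example records for illustration
annotations = [
    {"video_id": "a12", "start_s": 0, "end_s": 30, "tool": "trenching_bucket"},
    {"video_id": "a12", "start_s": 30, "end_s": 55, "tool": "digging_bucket"},
    {"video_id": "b07", "start_s": 5, "end_s": 40, "tool": "grading_beam"},
]
print(select_segments(annotations))
```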

Accelerating AI deployment through strategic model optimization

Off-the-shelf VLMs (VLMs without prompt optimization) struggle with construction video data because they are trained on web images, not operator footage from an excavator cabin. They can’t handle unusual camera angles, equipment-specific scenes, or poor visibility from dust and weather. They also lack the domain knowledge to differentiate visually similar attachments such as digging buckets and trenching buckets.

Bedrock Robotics and the Innovation Center addressed this through targeted model selection and prompt optimization. The teams evaluated several VLMs – including open source options and FMs available in Amazon Bedrock – then refined the prompts with detailed visual descriptions of each tool, guidance for commonly confused tool pairs, and step-by-step instructions for analyzing video frames.
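The following prompt skeleton illustrates those three refinements – per-tool visual descriptions, disambiguation guidance for confusable pairs, and step-by-step analysis instructions. The wording is a hypothetical reconstruction, not the team’s actual prompt.

```python
# Illustrative classification prompt; the descriptions are assumptions,
# not Bedrock Robotics' production prompt.
CLASSIFICATION_PROMPT = """\
You are labeling excavator tool attachments in construction video frames.

Tool descriptions:
- lifting_hook: a curved steel hook hanging from the stick, with no bucket body.
- hammer: a long cylindrical breaker mounted in place of the bucket.
- grading_beam: a wide, flat blade used to level surfaces.
- trenching_bucket: a narrow bucket, clearly thinner than the machine arm.

Commonly confused pairs:
- digging_bucket vs. trenching_bucket: a trenching bucket is much narrower
  relative to the arm; check the bucket width before deciding.

Instructions:
1. Locate the end of the excavator arm in the frame.
2. Compare the attachment against each description above.
3. If dust or weather obscures the attachment, answer "uncertain".
4. Respond with exactly one label.
"""
```

Structuring the prompt this way gives the model explicit visual criteria to check, rather than relying on whatever equipment knowledge it absorbed from web-scale pretraining.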

These modifications increased classification accuracy from 34% to 70% at a cost of $10 per hour of video processed, measured on a test set of 130 videos. These results demonstrate how prompt engineering optimizes VLMs for specific tasks. For Bedrock Robotics, this optimization delivered faster training cycles, shorter time to deployment, and a cost-effective, scalable annotation pipeline that evolves with operational needs.
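Measuring such an improvement reduces to comparing VLM labels against human ground-truth labels on a held-out test set, as in this minimal sketch (the sample labels are invented):

```python
# Minimal sketch of the evaluation implied above: VLM labels vs. human labels.
def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of videos where the VLM label matches the human label."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

preds = ["hammer", "trenching_bucket", "grading_beam"]
truth = ["hammer", "digging_bucket", "grading_beam"]
print(f"accuracy: {accuracy(preds, truth):.2f}")  # accuracy: 0.67
```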

The way forward: Addressing labor shortages through automation

For Bedrock Robotics, the vision-language solution enabled rapid identification and extraction of critical datasets, surfacing essential insights from large-scale construction video footage. With an overall accuracy of 70%, this cost-effective approach provides a practical foundation for scaling data preparation for model training. It shows how strategic AI investment can overcome workforce barriers and accelerate industry transformation. Organizations that streamline data preparation can accelerate autonomous system deployment, reduce operating costs, and explore new areas for growth in industries hit by labor shortages. With this repeatable framework, manufacturing and industrial automation leaders facing similar challenges can apply these principles to drive competitive differentiation within their own domains.

To learn more, visit Bedrock Robotics or explore physical AI resources on AWS.


About the authors

Laura Kulowski

Laura Kulowski is a Senior Applied Scientist in the AWS Generative AI Innovation Center, where she works to develop physical AI solutions. Before joining Amazon, Laura completed her PhD in Harvard’s Department of Earth and Planetary Sciences, investigating Jupiter’s deep zonal flow and magnetic field using Juno data.

Alla Simoneau

Alla Simoneau is a technology and commercial leader with over 15 years of experience, currently serving as the Emerging Technology Physical AI Lead at Amazon Web Services (AWS), where she drives global innovation at the intersection of AI and real-world applications. With over a decade at Amazon, Alla is a recognized leader in strategy, team building, and operational excellence, adept at translating cutting-edge technologies into real-world transformations for startups and enterprise customers.

Parmida Atighehchian

Parmida Atighehchian is a senior data scientist at the AWS Generative AI Innovation Center. With over 10 years of experience in deep learning and generative AI, Parmida brings deep expertise in AI and customer-centric solutions. Parmida has led and co-authored highly influential scientific papers in domains such as computer vision, interpretability, and video and image generation. With a strong focus on scientific practice, Parmida helps customers design practical generative AI systems as robust and scalable pipelines.

Dan Volk

Dan Volk is a senior data scientist at the AWS Generative AI Innovation Center. He has 10 years of experience in machine learning, deep learning, and time series analysis, and holds a master’s degree in Data Science from UC Berkeley. He is passionate about turning complex business challenges into opportunities by leveraging cutting-edge AI technologies.

Paul Amadeo

Paul Amadeo is an experienced technology leader with over 30 years of experience in artificial intelligence, machine learning, IoT systems, RF design, optics, semiconductor physics, and advanced engineering. As the technical lead for Physical AI in the AWS Generative AI Innovation Center, Paul specializes in translating AI capabilities into tangible physical systems, guiding enterprise customers through complex implementations from concept to production. His diverse background includes designing computer vision systems for edge environments, designing robotic smart card manufacturing technologies that have produced billions of devices globally, and leading cross-functional teams in both commercial and defense sectors. Paul holds an MS in Applied Physics from the University of California, San Diego, a BS in Applied Physics from Caltech, and six patents in optical systems, communications devices, and manufacturing technologies.

Sri Elaprolu

Sri Elaprolu is the Director of the AWS Generative AI Innovation Center, where he leads a global team implementing cutting-edge AI solutions for enterprise and government organizations. During his 13-year tenure at AWS, he has led ML science teams partnering with global enterprises and public sector organizations. Prior to AWS, he spent 14 years at Northrop Grumman in product development and software engineering leadership roles. Sri has a Master’s degree in Engineering Science and an MBA.
