Unlocking Video Understanding with TwelveLabs Marengo on Amazon Bedrock


Media and entertainment, advertising, education, and enterprise training materials combine visual, audio, and motion elements to tell stories and convey information. This makes video far more complex than text, where individual words have relatively clear meanings, and it creates unique challenges for AI systems that need to understand video content. Video content is multidimensional, including visual elements (scenes, objects, actions), temporal dynamics (motion, transitions), audio components (dialogue, music, sound effects), and text overlays (subtitles, captions). This complexity creates significant business challenges as organizations struggle to search through video archives, locate specific scenes, automatically classify content, and extract insights from their media assets to make effective decisions.

TwelveLabs Marengo addresses this problem with a multi-vector architecture that creates different embeddings for different content modalities. Instead of forcing all the information into a single vector, the model generates specialized representations. This approach preserves the rich, multidimensional nature of video data, enabling more accurate analysis across visual, temporal, and audio dimensions.

Amazon Bedrock has expanded its capabilities to support TwelveLabs Marengo Embed 3.0 models with real-time text and image processing through synchronous inference. With this integration, businesses can implement faster video search functionality using natural language queries, while also supporting interactive product discovery through sophisticated image similarity matching.

In this post, we will show how the TwelveLabs Marengo embedding model, available on Amazon Bedrock, enhances video understanding through multimodal AI. We will build a video semantic search and analytics solution using embeddings from Marengo models with Amazon OpenSearch Serverless as a vector database, for semantic search capabilities that go beyond simple metadata matching to provide intelligent content discovery.

Understanding Video Embedding

Embeddings are dense vector representations that capture the semantic meaning of data in a high-dimensional space. Think of these as numerical fingerprints that encode the essence of the content in a way that machines can understand and compare. For text, an embedding may capture that “king” and “queen” are related concepts, or that “Paris” and “France” have a geographical relationship. For images, embeddings can capture that a Golden Retriever and a Labrador are both dogs, even though they look different. The following heat map shows the semantic similarity scores between these sentence fragments: “Two people are talking,” “A man and a woman are talking,” and “Cats and dogs are cute animals.”
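The similarity scores in a heat map like this are typically cosine similarities between embedding vectors. The following sketch illustrates the idea with tiny hand-made vectors; real Marengo embeddings are 1,024-dimensional and produced by the model, not written by hand.

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" for illustration only
two_people_talking = [0.9, 0.1, 0.2, 0.0]
man_woman_talking  = [0.8, 0.2, 0.3, 0.1]
cats_and_dogs      = [0.1, 0.9, 0.0, 0.7]

print(cosine_similarity(two_people_talking, man_woman_talking))  # high (related sentences)
print(cosine_similarity(two_people_talking, cats_and_dogs))      # low (unrelated sentences)
```

Semantically related sentences land close together in the embedding space and score near 1.0, while unrelated sentences score much lower.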

Video Embedding Challenges

Video presents unique challenges because it is inherently multimodal:

  • Visual information: objects, scenes, people, actions, and visual aesthetics
  • Audio information: speech, music, sound effects, and ambient noise
  • Text information: captions, on-screen text, and transcribed speech

Traditional single-vector approaches compress all this rich information into a single representation, often losing important nuances. This is where TwelveLabs Marengo’s approach is unique in effectively addressing this challenge.

TwelveLabs Marengo: A Multimodal Embedding Model

The Marengo 3.0 model generates several specialized vectors, each of which captures different aspects of the video content. A typical film or TV show combines visual and auditory elements to create an integrated storytelling experience. Marengo’s multi-vector architecture provides significant advantages for understanding this complex video content. Each vector captures a specific modality, thereby avoiding the information loss caused by compressing different data types into a single representation. This enables flexible searches targeting specific content aspects: visual-only, audio-only, or combined queries. Distinct vectors provide superior accuracy in complex multimodal scenarios while maintaining efficient scalability for large enterprise video datasets.

Solution Overview: Marengo Model Capabilities

In the following section, we will demonstrate the power of Marengo’s embedding technique through code samples. The examples show how Marengo processes a variety of content and delivers exceptional search accuracy. The complete code sample can be found in the GitHub repository.

Prerequisites

Before you begin, verify that you have:

  • An AWS account with access to Amazon Bedrock in an AWS Region where the TwelveLabs Marengo Embed 3.0 model is available
  • Model access granted for TwelveLabs Marengo Embed 3.0 in the Amazon Bedrock console
  • An Amazon S3 bucket for the input video and the embedding output
  • Permissions to create and query an Amazon OpenSearch Serverless collection

Sample video

Netflix Open Content is open source content available under the Creative Commons Attribution 4.0 International License. We will use one of the videos, Meridian, to demonstrate the TwelveLabs Marengo model on Amazon Bedrock.

Create a video embedding

Amazon Bedrock uses asynchronous APIs for Marengo video embedding generation. The following Python code snippet shows how to invoke the API on a video stored in an S3 bucket location. Please see the documentation for the full supported functionality.

import boto3

bedrock_client = boto3.client("bedrock-runtime")
model_id = "us.twelvelabs.marengo-embed-3-0-v1:0"
video_s3_uri = ""      # Replace with your S3 URI
aws_account_id = ""    # Replace with the bucket owner's account ID
s3_bucket_name = ""    # Replace with the output S3 bucket name
s3_output_prefix = ""  # Replace with the output prefix

response = bedrock_client.start_async_invoke(
    modelId=model_id,
    modelInput={
        "inputType": "video",
        "video": {
            "mediaSource": {
                "s3Location": {
                    "uri": video_s3_uri,
                    "bucketOwner": aws_account_id
                }
            }
        }
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": f"s3://{s3_bucket_name}/{s3_output_prefix}"
        }
    }
)

The above example produces 280 individual embeddings from the sample video, one for each segment, enabling accurate temporal detection and analysis. The multi-vector output from a video may include the following types of embeddings:

[
 {"embedding": [0.053192138671875, ...], "embeddingOption": "visual", "embeddingScope": "clip", "startSec": 0.0, "endSec": 4.3},
 {"embedding": [0.053192138645645, ...], "embeddingOption": "transcription", "embeddingScope": "clip", "startSec": 3.9, "endSec": 6.5},
 {"embedding": [0.3235554443524, ...], "embeddingOption": "audio", "embeddingScope": "clip", "startSec": 4.9, "endSec": 7.5}
]

  • visual – visual embedding of the video
  • transcription – embedding of the transcribed text
  • audio – embedding of the audio in the video
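Because every segment in the output is tagged with its embeddingOption, the multi-vector output can be split by modality before indexing, for example to build a visual-only or transcription-only index. A small hypothetical helper (the function name is our own):

```python
from collections import defaultdict

def group_by_modality(segments):
    """Group Marengo output segments by their embeddingOption tag."""
    groups = defaultdict(list)
    for seg in segments:
        groups[seg["embeddingOption"]].append(seg)
    return dict(groups)

# Segments shaped like the Marengo output above (embeddings omitted for brevity)
segments = [
    {"embeddingOption": "visual", "startSec": 0.0, "endSec": 4.3},
    {"embeddingOption": "transcription", "startSec": 3.9, "endSec": 6.5},
    {"embeddingOption": "audio", "startSec": 4.9, "endSec": 7.5},
]
by_modality = group_by_modality(segments)
print(sorted(by_modality))  # ['audio', 'transcription', 'visual']
```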

When processing audio or video content, you can control how long each clip segment used for embedding creation should be. By default, video clips are automatically split at natural scene changes (shot boundaries). Audio is divided into as many even segments of at most 10 seconds as possible – for example, a 50-second audio file becomes 5 segments of 10 seconds each, while a 16-second file becomes 2 segments of 8 seconds each. By default, a single Marengo video embedding API call generates visual, audio, and transcription embeddings. You can also change the default settings to output only specific embedding types. Use the following code snippet to generate an embedding for a video with configurable options with the Amazon Bedrock API:

response = bedrock_client.start_async_invoke(
    modelId=model_id,
    modelInput={
        "inputType": "video",
        "video": {
            "mediaSource": {
                # Provide exactly one of base64String or s3Location
                "s3Location": {
                    "uri": "s3://amzn-s3-demo-bucket/video/clip.mp4",
                    "bucketOwner": "123456789012"
                }
            },
            "startSec": 0,
            "endSec": 6,
            "segmentation": {
                # Use exactly one method: "dynamic" or "fixed"
                "method": "dynamic",
                "dynamic": {
                    "minDurationSec": 4
                }
                # For fixed-length segments instead:
                # "method": "fixed",
                # "fixed": {"durationSec": 6}
            },
            "embeddingOption": [  # optional, default=all
                "visual",
                "audio",
                "transcription"
            ],
            "embeddingScope": [  # optional, one or both
                "clip",
                "asset"
            ]
        }
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": f"s3://{s3_bucket_name}/{s3_output_prefix}"
        }
    }
)
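The even-split rule for audio described above is simple arithmetic: divide the duration into the fewest equal segments that are each no longer than 10 seconds. A hypothetical helper showing the calculation (the function is ours, not part of the API):

```python
import math

def even_audio_segments(duration_sec, max_segment_sec=10):
    """Split a duration into the fewest equal segments of at most max_segment_sec."""
    count = math.ceil(duration_sec / max_segment_sec)
    length = duration_sec / count
    return [length] * count

print(even_audio_segments(50))  # [10.0, 10.0, 10.0, 10.0, 10.0]
print(even_audio_segments(16))  # [8.0, 8.0]
```

This reproduces the examples from the text: 50 seconds becomes five 10-second segments, and 16 seconds becomes two 8-second segments.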

Vector Database: Amazon OpenSearch Serverless

In our example, we will use Amazon OpenSearch Serverless as a vector database to store the text, image, audio, and video embeddings generated by the Marengo model. As a vector database, OpenSearch Serverless allows you to quickly find similar content using semantic search without worrying about managing servers or infrastructure. The following code snippet demonstrates how to create an Amazon OpenSearch Serverless collection:

aoss_client = boto3.client("opensearchserverless")

try:
    collection = aoss_client.create_collection(
        name=collection_name, type="VECTORSEARCH"
    )
    collection_id = collection["createCollectionDetail"]["id"]
    collection_arn = collection["createCollectionDetail"]["arn"]
except aoss_client.exceptions.ConflictException:
    # The collection already exists, so look it up instead
    collection = aoss_client.batch_get_collection(
        names=[collection_name]
    )["collectionDetails"][0]
    collection_id = collection["id"]
    collection_arn = collection["arn"]
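A newly created collection starts in CREATING status and must become ACTIVE before you can create an index against it. A minimal polling sketch using the batch_get_collection API (the helper name and interval are our own):

```python
import time

def wait_for_collection_active(aoss_client, collection_name, poll_seconds=10):
    """Poll an OpenSearch Serverless collection until it leaves CREATING status."""
    while True:
        details = aoss_client.batch_get_collection(names=[collection_name])
        status = details["collectionDetails"][0]["status"]  # e.g. CREATING, ACTIVE, FAILED
        if status != "CREATING":
            return status
        time.sleep(poll_seconds)

# Usage: wait_for_collection_active(aoss_client, collection_name)
```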

Once the OpenSearch Serverless collection is created, we will create an index with the required properties, including the vector field:

from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

index_mapping = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "video_id": {"type": "keyword"},
            "segment_id": {"type": "integer"},
            "start_time": {"type": "float"},
            "end_time": {"type": "float"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "cosinesimil"
                }
            },
            "metadata": {"type": "object"}
        }
    }
}
region_name = ""  # Replace with your AWS Region
host = ""         # Replace with your collection endpoint (without https://)
credentials = boto3.Session().get_credentials()
awsauth = AWSV4SignerAuth(credentials, region_name, "aoss")
oss_client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)
response = oss_client.indices.create(index=index_name, body=index_mapping)

Index Marengo Embedding

The following code snippet demonstrates how to index the embedding output from the Marengo model into the OpenSearch index:

documents = []
for i, segment in enumerate(video_embeddings):
    document = {
        "embedding": segment["embedding"],
        "start_time": segment["startSec"],
        "end_time": segment["endSec"],
        "video_id": video_id,
        "segment_id": i,
        "embedding_option": segment.get("embeddingOption", "visual")
    }
    documents.append(document)

# Bulk index documents: alternate action lines and document lines
bulk_data = []
for doc in documents:
    bulk_data.append({"index": {"_index": index_name}})
    bulk_data.append(doc)

# Convert to newline-delimited bulk format
bulk_body = "\n".join(json.dumps(item) for item in bulk_data) + "\n"
response = oss_client.bulk(body=bulk_body, index=index_name)

Cross-modal semantic search

With Marengo’s multi-vector design, you can explore different modalities in ways that are impossible with single-vector models. By creating separate but aligned embeddings for visual, audio, motion, and contextual elements, you can search for videos using the input type of your choice. For example, a text query for “jazz music playing” returns video clips of musicians performing, jazz audio tracks, and concert hall views.

The following examples demonstrate Marengo’s exceptional search capabilities across a variety of modalities:

Text search

Here’s a code snippet that demonstrates cross-modal semantic search capabilities using text:

top_k = 5  # Number of results to return
text_query = "a person smoking in a room"
modelInput = {
    "inputType": "text",
    "text": {
        "inputText": text_query
    }
}
response = bedrock_client.invoke_model(
    modelId="us.twelvelabs.marengo-embed-3-0-v1:0",
    body=json.dumps(modelInput))

result = json.loads(response["body"].read())
query_embedding = result["data"][0]["embedding"]

# Search OpenSearch index
search_body = {
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": top_k
            }
        }
    },
    "size": top_k,
    "_source": ["start_time", "end_time", "video_id", "segment_id"]
}

response = oss_client.search(index=index_name, body=search_body)

print(f"\n✅ Found {len(response['hits']['hits'])} matching segments:")

results = []
for hit in response["hits"]["hits"]:
    result = {
        "score": hit["_score"],
        "video_id": hit["_source"]["video_id"],
        "segment_id": hit["_source"]["segment_id"],
        "start_time": hit["_source"]["start_time"],
        "end_time": hit["_source"]["end_time"]
    }
    results.append(result)
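Because each hit is a short clip segment, adjacent or overlapping hits from the same video can be merged into longer clips for playback. A hypothetical post-processing helper (the function and gap threshold are our own, not part of the solution code):

```python
def merge_segments(results, gap_sec=1.0):
    """Merge overlapping or near-adjacent result segments per video into clips."""
    merged = []
    for r in sorted(results, key=lambda r: (r["video_id"], r["start_time"])):
        last = merged[-1] if merged else None
        if (last and last["video_id"] == r["video_id"]
                and r["start_time"] <= last["end_time"] + gap_sec):
            # Extend the previous clip instead of starting a new one
            last["end_time"] = max(last["end_time"], r["end_time"])
        else:
            merged.append({"video_id": r["video_id"],
                           "start_time": r["start_time"],
                           "end_time": r["end_time"]})
    return merged

hits = [
    {"video_id": "v1", "start_time": 0.0, "end_time": 4.3},
    {"video_id": "v1", "start_time": 4.9, "end_time": 7.5},
    {"video_id": "v1", "start_time": 20.0, "end_time": 24.0},
]
print(merge_segments(hits))  # first two segments merge into one clip (0.0 to 7.5)
```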

The top search results for the text query “a person smoking in a room” produce the following video clips:

Image search

The following code snippet demonstrates cross-modal semantic search capabilities for a given image:

import uuid

s3_image_uri = f"s3://{s3_bucket_name}/{s3_images_path}/{image_path_basename}"
s3_output_prefix = f"{s3_embeddings_path}/{s3_images_path}/{uuid.uuid4()}"

modelInput = {
    "inputType": "image",
    "image": {
        "mediaSource": {
            "s3Location": {
                "uri": s3_image_uri,
                "bucketOwner": aws_account_id
            }
        }
    }
}
response = bedrock_client.invoke_model(
    modelId=model_id,
    body=json.dumps(modelInput),
)

result = json.loads(response["body"].read())

...
query_embedding = result["data"][0]["embedding"]

# Search OpenSearch index
search_body = {
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": top_k
            }
        }
    },
    "size": top_k,
    "_source": ["start_time", "end_time", "video_id", "segment_id"]
}

response = oss_client.search(index=index_name, body=search_body)

print(f"\n✅ Found {len(response['hits']['hits'])} matching segments:")
results = []

for hit in response["hits"]["hits"]:
    result = {
        "score": hit["_score"],
        "video_id": hit["_source"]["video_id"],
        "segment_id": hit["_source"]["segment_id"],
        "start_time": hit["_source"]["start_time"],
        "end_time": hit["_source"]["end_time"]
    }
    results.append(result)

The top search result from the image above produces the following video clip:

In addition to semantic search over videos using text and images, the Marengo model can also search videos using audio embeddings that focus on dialogue and speech. Audio search capabilities help users find videos based on specific speakers, dialogue content, or spoken topics. This creates a comprehensive video search experience that combines text, images, and audio to understand the video.
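An audio query would presumably follow the same request shape as the image query above, with an audio input type. The following builder is a hypothetical sketch mirroring that structure; check the Amazon Bedrock documentation for the authoritative audio request schema.

```python
import json

def build_audio_model_input(s3_uri, bucket_owner):
    """Build a Marengo model input for an audio query.

    Assumption: the field layout mirrors the image request shown earlier;
    this sketch is not verified against the Marengo audio API schema.
    """
    return {
        "inputType": "audio",
        "audio": {
            "mediaSource": {
                "s3Location": {
                    "uri": s3_uri,
                    "bucketOwner": bucket_owner
                }
            }
        }
    }

# body = json.dumps(build_audio_model_input("s3://amzn-s3-demo-bucket/query.wav", "123456789012"))
```

The resulting embedding would then be used in the same kNN search body as the text and image examples.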

Conclusion

The combination of TwelveLabs Marengo and Amazon Bedrock opens up exciting new possibilities for understanding video through its multi-vector, multimodal approach. Throughout this post, we have explored practical examples such as image-to-video search with temporal precision and detailed text-to-video matching. With just a single Bedrock API call, we transformed a video file into 336 searchable segments that answer text, visual, and audio questions. These capabilities create opportunities for natural language content discovery, streamlined media asset management, and other applications that can help organizations better understand and use their video content at scale.

As video continues to dominate digital experiences, models like Marengo provide a solid foundation for building more intelligent video analytics systems. Check out the sample code to learn more about how multimodal video understanding can transform your applications.


About the authors

Wei Teh is a Machine Learning Solutions Architect at AWS. He is passionate about helping clients achieve their business objectives using cutting-edge machine learning solutions. Outside of work, he enjoys outdoor activities such as camping, fishing, and hiking with his family.

Lana Zhang is a Senior Specialist Solutions Architect for Generative AI at AWS within the Worldwide Specialist Organization. She specializes in AI/ML with a focus on use cases such as AI voice assistants and multimodal understanding. She works closely with clients across industries including media and entertainment, gaming, sports, advertising, financial services, and healthcare to help transform their business solutions through AI.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she works on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
