Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio and Large-Scale Multimodal Retrieval

Meta researchers have introduced Perception Encoder Audiovisual (PE-AV), a new family of encoders for joint audio and video understanding. The model learns aligned audio, video and text representations in a single embedding space using large-scale contrastive training on approximately 100M audio video pairs with text captions.

From Perception Encoder to PE-AV

Perception Encoder (PE) is the core vision stack in Meta’s Perception Models project. It is a family of encoders for images, video and audio that reaches the state of the art on multiple vision and audio benchmarks using a unified contrastive pretraining recipe. PE Core outperforms SigLIP2 on image tasks and InternVideo2 on video tasks. PE Lang powers the Perception Language Model for multimodal reasoning. PE Spatial is tuned for dense prediction tasks such as detection and depth estimation.

PE-AV builds on this backbone and extends it to full audio video text alignment. In the Perception Models repository, PE Audio Visual is listed as the branch that embeds audio, video, audio video, and text into a single joint embedding space for cross modal understanding.

https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/

Architecture, separate towers and fusion

The PE-AV architecture is composed of a frame encoder, a video encoder, an audio encoder, an audio video fusion encoder, and a text encoder.

The video path runs the existing PE frame encoder on RGB frames, then applies a temporal video encoder on top of the frame level features.
The audio path uses DAC VAE as a codec to convert raw waveforms into discrete audio tokens at a fixed frame rate, approximately one embedding every 40 milliseconds.
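
To make the 40 millisecond frame rate concrete, the short sketch below estimates how many audio embeddings a clip produces. The function name and the choice to round up a partial final frame are assumptions made for illustration; the source only gives the approximate frame interval.

```python
import math

# Approximate interval between audio embeddings, as stated in the text.
FRAME_MS = 40


def num_audio_frames(duration_s: float, frame_ms: int = FRAME_MS) -> int:
    """Estimate how many audio embeddings cover a clip of the given length.

    Rounding up a partial final frame is an assumption, not a documented
    property of the PE-AV audio tower.
    """
    return math.ceil(duration_s * 1000 / frame_ms)


if __name__ == "__main__":
    print(1000 / FRAME_MS)         # 25.0 embeddings per second
    print(num_audio_frames(10.0))  # 250 embeddings for a 10 second clip
```

At this rate a 10 second clip yields about 250 audio embeddings, which is the sequence the fusion encoder would see on the audio side.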

These towers feed an audio video fusion encoder that learns a shared representation for both streams. The text encoder projects text queries into several specialized embedding spaces. In practice this gives you a single backbone that can be queried in multiple ways. You can retrieve video from text, audio from text, or audio from video, or retrieve text descriptions conditioned on any combination of modalities, without retraining task specific heads.
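
Once every modality lives in one embedding space, cross modal retrieval reduces to ranking candidates by similarity to a query. The following is a minimal sketch using toy vectors and cosine similarity. The embeddings, function names and dimensions are hypothetical, and nothing here calls the actual PE-AV model.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted by cosine similarity, best first."""
    q = l2_normalize(query)
    scored = []
    for idx, cand in enumerate(candidates):
        c = l2_normalize(cand)
        scored.append((sum(a * b for a, b in zip(q, c)), idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


if __name__ == "__main__":
    # Toy stand-ins for a text query embedding and three audio clip embeddings.
    text_query = [1.0, 0.0, 0.2]
    audio_clips = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.3], [-1.0, 0.0, 0.0]]
    print(rank_by_similarity(text_query, audio_clips))  # [1, 0, 2]
```

The same ranking function would serve text to video or video to audio retrieval, since only the source of the embeddings changes.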

Data engine, large-scale synthetic audiovisual captions

The research team proposes a two-stage audiovisual data engine that generates high-quality synthetic captions for unlabeled clips. The pipeline first feeds the outputs of multiple weak audio captioning models, their confidence scores, and separate video captioners into a large language model. This LLM generates three caption types per clip: one for the audio content, one for the visual content, and one for the combined audio visual content. An initial PE-AV model is trained on this synthetic supervision.

In the second stage, this initial PE-AV model is paired with a Perception Language Model decoder. Together they refine the captions to make better use of audiovisual correspondence. The two-stage engine delivers reliable captions for approximately 100M audio video pairs and uses approximately 92M unique clips for stage 1 pretraining and an additional 32M unique clips for stage 2 fine tuning.

Compared to previous work that often focuses on speech or narrow sound domains, this corpus is designed to balance speech, general sound, music, and diverse video domains, which is important for general audiovisual retrieval and comprehension.

Contrastive Objective over Ten Modality Pairs

PE-AV uses a sigmoid based contrastive loss across audio, video, text and fused representations. The research team explains that the model uses eight contrastive loss pairs during pretraining. These cover combinations such as audio text, video text, audio video text and fusion related pairs. During fine tuning, two additional pairs are added, bringing the total to ten loss pairs across the different modalities and caption types.

This objective is similar in form to the contrastive objectives used in recent vision language encoders, but generalized to audio video text tri-modal training. By aligning all these views in one space, a single encoder can support classification, retrieval, and correspondence tasks with simple dot product similarity.
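
As an illustration of what a sigmoid based contrastive loss looks like, here is a minimal pure Python sketch in the style popularized by SigLIP: every pair in a batch is treated as an independent binary decision, positive on the diagonal and negative elsewhere. The temperature and bias defaults are borrowed from SigLIP's published initialization and are assumptions here, as is the equal-weight sum over modality pairs. The source does not state the values or weighting PE-AV uses.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(a_embs, b_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss between two batches of unit-norm embeddings.

    Row i of a_embs is the positive match of row i of b_embs; all other
    combinations are negatives. temperature and bias are assumed values.
    """
    n = len(a_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(x * y for x, y in zip(a_embs[i], b_embs[j]))
            logit = temperature * dot + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / n


def total_loss(modality_pairs):
    """Sum the loss over several (a_embs, b_embs) pairs (equal weights assumed)."""
    return sum(sigmoid_contrastive_loss(a, b) for a, b in modality_pairs)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    text_aligned = [[1.0, 0.0], [0.0, 1.0]]
    text_shuffled = [[0.0, 1.0], [1.0, 0.0]]
    # Correctly aligned pairs should give a lower loss than shuffled ones.
    print(sigmoid_contrastive_loss(audio, text_aligned))
    print(sigmoid_contrastive_loss(audio, text_shuffled))
```

Because each pair is scored independently, the same function can be reused for every modality pair (audio text, video text, and so on) and the results combined, which is what `total_loss` sketches.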

Performance across audio, speech, music and video

On benchmarks, PE-AV targets zero-shot retrieval and classification across multiple domains. It achieves state-of-the-art performance on multiple audio and video benchmarks compared to recent audio text and audio video text models such as CLAP, Audio Flamingo, ImageBind and LanguageBind.

Concrete benefits include:

On AudioCaps, text to audio retrieval improves from 35.4 R@1 to 45.8 R@1.
On VGGSound, clip level classification accuracy improves from 36.0 to 47.1.
For speech retrieval on VCTK style tasks, PE-AV reaches 85.6 accuracy while earlier models are near 0.
On ActivityNet, text to video retrieval improves from 60.4 R@1 to 66.5 R@1.
On Kinetics 400, zero-shot video classification improves from 76.9 to 78.9, outperforming models 2 to 4 times larger.
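
For readers unfamiliar with the R@1 metric quoted above, the sketch below shows how recall at k is typically computed from a query by candidate similarity matrix. It assumes exactly one correct candidate per query, located at the same index, which is a simplification. The similarity values are made up for illustration and are unrelated to the reported benchmark numbers.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose correct candidate appears in the top k.

    similarity[i][j] is the score of query i against candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


if __name__ == "__main__":
    # Toy text-to-audio similarity matrix: 3 queries, 3 candidates.
    sim = [
        [0.9, 0.1, 0.0],  # query 0 ranks its own clip first
        [0.2, 0.3, 0.8],  # query 1 ranks a wrong clip first
        [0.1, 0.2, 0.7],  # query 2 ranks its own clip first
    ]
    print(round(recall_at_k(sim, k=1), 1))  # 66.7
    print(recall_at_k(sim, k=2))            # 100.0
```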

PE-A-Frame, frame level audio text alignment

Along with PE-AV, Meta releases Perception Encoder Audio Frame (PE-A-Frame) for sound event localization. PE-A-Frame is an audio text embedding model that outputs one audio embedding per 40 millisecond frame and one text embedding per query. The model can return temporal spans that mark where each described event occurs in the audio.

PE-A-Frame uses frame level contrastive learning to align audio frames with text. This enables precise localization of events such as specific speakers, instruments, or transient sounds in long audio sequences.
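
One simple way to turn per frame audio text scores into temporal spans is to threshold the scores and merge consecutive frames above the threshold. The sketch below does exactly that with the 40 millisecond frame interval from the text. The thresholding rule, the threshold value and the function name are assumptions for illustration, not a description of how PE-A-Frame itself derives spans.

```python
# Approximate duration covered by one audio embedding, as stated in the text.
FRAME_MS = 40


def frames_to_spans(scores, threshold=0.5, frame_ms=FRAME_MS):
    """Merge runs of frames scoring at or above threshold into (start_s, end_s) spans."""
    spans = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a new event run begins at this frame
        elif score < threshold and start is not None:
            spans.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        # The run extends to the end of the audio.
        spans.append((start * frame_ms / 1000, len(scores) * frame_ms / 1000))
    return spans


if __name__ == "__main__":
    # Hypothetical per frame similarity between audio frames and one text query.
    scores = [0.1, 0.7, 0.9, 0.2, 0.1, 0.8, 0.8]
    print(frames_to_spans(scores))  # [(0.04, 0.12), (0.2, 0.28)]
```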

Role in the Perception Models and SAM Audio ecosystem

PE-AV and PE-A-Frame sit inside the broader Perception Models stack, which combines the PE encoders with the Perception Language Model for multimodal generation and reasoning.

PE-AV is also the core perception engine behind Meta’s new SAM Audio model and its Judge evaluator. SAM Audio uses PE-AV embeddings to connect visual prompts and text prompts to sound sources in complex mixtures and to score the quality of separated audio tracks.

Key Takeaways

PE-AV is a unified encoder for audio, video, and text, trained with contrastive learning on over 100M videos, that embeds audio, video, audio video, and text into a single joint space for cross modal retrieval and understanding.
The architecture uses separate video and audio towers with PE based visual encoding and DAC VAE audio tokenization, followed by an audio visual fusion encoder and specialized text heads associated with different modality pairs.
The two-stage data engine generates synthetic audio, visual and audio visual captions using weak captioners plus an LLM in stage 1 and PE-AV plus the Perception Language Model in stage 2, enabling large-scale multimodal supervision without manual labels.
PE-AV sets a new state of the art on a wide range of audio and video benchmarks through a sigmoid contrastive objective over multiple modality pairs, with average retrieval improving from approximately 45 to 51.6. Six public checkpoints are available, ranging from small 16 frame to large all frame variants.
PE-AV, together with the frame level PE-A-Frame variant, forms the perception backbone for Meta’s SAM Audio system, providing the embeddings used for prompt based audio separation and fine grained sound event localization across speech, music and general sounds.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of Marktechpost, an Artificial Intelligence media platform known for in-depth coverage of Machine Learning and Deep Learning news that is both technically sound and easily understood by a wide audience. The platform has over 2 million monthly views.
