Meta has released SAM Audio, a prompt-driven audio separation model that targets a common editing constraint: separating a sound from a real-world mix without training a custom model per sound class. Meta released three main sizes, sam-audio-small, sam-audio-base, and sam-audio-large. The model is available to download and to try out in the Segment Anything Playground.
Architecture
SAM Audio uses separate encoders for each conditioning signal: an audio encoder for the mixture, a text encoder for the natural-language description, a span encoder for time anchors, and a visual encoder that consumes the visual prompt derived from the video and object masks. The encoded streams are combined into time-aligned features, then processed by a diffusion transformer that applies self-attention to the time-aligned representation and cross-attention to the text features. A DACVAE decoder then reconstructs the waveforms and emits two outputs, the target audio and the residual audio.
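The flow from conditioning signals to the two output waveforms can be pictured with a minimal sketch. This is illustrative only, with toy dimensions and stand-in modules; Meta's actual encoders, diffusion transformer, and DACVAE decoder are far larger and are not reproduced here.

```python
# Schematic of the conditioning flow described above. All module choices,
# dimensions, and the fusion scheme are assumptions for clarity, not
# Meta's implementation.
import torch
import torch.nn as nn

class SamAudioSketch(nn.Module):
    def __init__(self, d=256, n_heads=4):
        super().__init__()
        # Stand-ins for the four conditioning encoders.
        self.audio_enc = nn.Linear(128, d)   # mixture features -> d
        self.text_enc = nn.Linear(64, d)     # text-prompt embedding -> d
        self.span_enc = nn.Linear(2, d)      # (start, end) time anchors -> d
        self.visual_enc = nn.Linear(512, d)  # masked-video features -> d
        # Diffusion-transformer stand-in: self-attention over the
        # time-aligned stream, cross-attention to prompt features.
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        # Decoder stand-in emitting two waveform streams.
        self.decode_target = nn.Linear(d, 1)
        self.decode_residual = nn.Linear(d, 1)

    def forward(self, mix, text, spans, visual):
        x = self.audio_enc(mix) + self.span_enc(spans)   # time-aligned fusion
        cond = torch.cat([self.text_enc(text), self.visual_enc(visual)], dim=1)
        x, _ = self.self_attn(x, x, x)                   # self-attention
        x, _ = self.cross_attn(x, cond, cond)            # cross-attention to prompts
        return self.decode_target(x), self.decode_residual(x)  # target + residual

mix = torch.randn(1, 400, 128)      # [batch, time, feat]
text = torch.randn(1, 8, 64)
spans = torch.randn(1, 400, 2)
visual = torch.randn(1, 16, 512)
target, residual = SamAudioSketch()(mix, text, spans, visual)
print(target.shape, residual.shape)  # torch.Size([1, 400, 1]) each
```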
What does SAM Audio do, and what does ‘segment’ mean here?
SAM Audio takes an input recording that contains multiple overlapping sources, for example speech plus traffic plus music, and isolates the target source based on a single prompt. In the public inference API, the model produces two outputs, result.target and result.residual. The research team describes target as the isolated sound and residual as everything else.
That target-plus-residual interface maps directly onto editor operations. If you want to remove a dog bark from a podcast track, you can treat the bark as the target, then subtract it by keeping only the residual. If you want to extract a guitar part from a concert clip, you keep the target waveform instead. Meta uses exactly these kinds of examples to explain what the model is intended to enable.
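As a concrete illustration of that mapping, here is a hedged sketch of a text-prompted call. SAMAudioProcessor and model.separate appear in the repo's example; the import path, checkpoint id, keyword arguments, and sample rate below are assumptions and may differ from the released API.

```python
# Hedged sketch of a text-prompted separation. Identifiers marked
# "assumed" are illustrative, not confirmed API.
import torchaudio
from sam_audio import SAMAudio, SAMAudioProcessor  # assumed import path

model = SAMAudio.from_pretrained("facebook/sam-audio-large")              # assumed checkpoint id
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")

# Describe the sound to isolate from the mixture.
batch = processor(audios=["podcast_mix.wav"], descriptions=["dog barking"])  # assumed kwargs
result = model.separate(batch)

# Extraction keeps the target; removal keeps the residual.
torchaudio.save("bark_only.wav", result.target[0].cpu(), 48_000)     # assumed sample rate
torchaudio.save("bark_removed.wav", result.residual[0].cpu(), 48_000)
```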
The three prompt types Meta ships
Meta positions SAM Audio as a single integrated model that supports three prompt types, and says that these prompts can be used alone or combined.
- Text prompting: You describe a sound in natural language, for example “dog barking” or “singing”, and the model isolates that sound from the mix. Meta lists the text prompt as one of the main interaction modes, and the open-source repo includes an end-to-end usage example built around SAMAudioProcessor and model.separate.
- Visual prompting: You click on a person or object in a video and ask the model to isolate the audio associated with that visual object. The Meta team describes visual prompting as selecting a sounding object in the video. In the released code path, visual prompting is implemented by passing the video frames plus masks to the processor as masked_videos.
- Span prompting: The Meta team calls span prompting an industry first. You mark the time segments where the target sound occurs, and the model uses those spans to guide the separation. This matters for ambiguous cases, for example when the same instrument appears in multiple passages, or when a sound is present only briefly and you want to anchor the model to that moment. A combined-prompt sketch follows this list.
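To make the three modes concrete, here is a sketch that combines them in one request. masked_videos comes from the released code path described above; the spans keyword, its shape, and the file handling are assumptions.

```python
# Hedged sketch combining text, visual, and span prompts in one call.
from sam_audio import SAMAudio, SAMAudioProcessor  # assumed import path

model = SAMAudio.from_pretrained("facebook/sam-audio-large")              # assumed checkpoint id
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")

batch = processor(
    audios=["concert.wav"],
    descriptions=["electric guitar"],      # text prompt
    masked_videos=["concert_masked.mp4"],  # video frames plus object mask (released code path)
    spans=[[(12.0, 18.5)]],                # assumed kwarg: seconds where the target sound occurs
)
result = model.separate(batch)
guitar, rest = result.target[0], result.residual[0]
```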

Results
The Meta team positions SAM Audio as achieving state-of-the-art performance in diverse, real-world scenarios, and offers it as an integrated alternative to single-purpose audio tools. The team publishes a subjective evaluation table across the categories General, SFX, Speech, Speaker, Music, Instr (Wild), and Instr (Pro), in which the General score reaches 3.62 for sam-audio-small, 3.28 for sam-audio-base, and 3.50 for sam-audio-large, and the Instr (Pro) score reaches 4.49 for sam-audio-large.
Key takeaways
- SAM Audio is a unified audio separation model. It segments sound from complex mixtures using text prompts, visual prompts, and time-span prompts.
- The core API generates two waveforms per request, target for the isolated sound and residual for everything else, which maps neatly onto common editing tasks like removing noise, extracting stems, or preserving ambience.
- Meta released several checkpoints and variants, including sam-audio-small, sam-audio-base, and sam-audio-large, plus tv variants that the repo says perform better for visual prompts. The repo also publishes a per-category subjective evaluation table.
- The release includes tooling beyond separation. Meta provides a sam-audio-judge model that scores separation results against a text description for overall quality, recall, accuracy, and reliability. A hypothetical usage sketch follows this list.
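Here is that sketch for the judge, with everything except the sam-audio-judge name treated as a placeholder: the class, method, and score fields below are hypothetical.

```python
# Hypothetical sketch only: class, method, and field names are placeholders;
# only the sam-audio-judge model name comes from the release notes.
import torchaudio
from sam_audio import SAMAudioJudge  # assumed import path

judge = SAMAudioJudge.from_pretrained("facebook/sam-audio-judge")  # assumed checkpoint id
separated, sr = torchaudio.load("bark_only.wav")   # a previously separated target
mixture, _ = torchaudio.load("podcast_mix.wav")    # the original mix

scores = judge.score(separated=separated, mixture=mixture, description="dog barking")
print(scores)  # per the release notes: overall quality, recall, accuracy, reliability
```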
Check out the technical details and the GitHub page for tutorials, code, and notebooks.

