Our New SAM Audio Model Transforms Audio Editing

ks2048

I recently discovered Audacity includes plug-ins for audio separation that work great (e.g. split into vocals track and instruments track). The model it uses also originated at Facebook (demucs).

tantalor

Is "demucs" a pun on demux (demultiplexer)?

yunwal

This is hilariously bad with music. Like I can type in the most basic thing like "string instruments" which should theoretically be super easy to isolate. You can generally one-shot this using spectral analysis libraries. And it just totally fails.

duped

What, in theory, makes those "super easy" to isolate? Humans are terrible at this to begin with; it takes years to train one of them to do it mildly well. Computers are even worse: blind source separation and the cocktail party problem have been the white whale of audio DSP for decades (and only very recently did tools become passable).
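
To make that concrete, here is a minimal sketch of the naive "spectral" approach, assuming NumPy and two synthetic test tones. The function name `bandpass_fft` and all the numbers are made up for illustration, and this is not how SAM Audio or Demucs works. A hard frequency mask recovers a source cleanly only when the sources sit in disjoint bands. Once they share frequencies, as real instruments do, the mask returns the sum rather than either source.

```python
import numpy as np

def bandpass_fft(signal, sample_rate, low_hz, high_hz):
    """Keep only the FFT bins between low_hz and high_hz (a hard spectral mask)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=len(signal))

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate  # one second of audio

# Easy case: two tones far apart in frequency (hypothetical test signal).
low_tone = np.sin(2 * np.pi * 220 * t)
high_tone = np.sin(2 * np.pi * 1760 * t)
easy_mix = low_tone + high_tone
easy_error = np.max(np.abs(bandpass_fft(easy_mix, sample_rate, 100, 500) - low_tone))

# Hard case: a second source at the same frequency, different phase and level.
overlapping = 0.5 * np.sin(2 * np.pi * 220 * t + 1.0)
hard_mix = low_tone + overlapping
hard_error = np.max(np.abs(bandpass_fft(hard_mix, sample_rate, 100, 500) - low_tone))

print(f"disjoint bands, max error:    {easy_error:.2e}")
print(f"overlapping bands, max error: {hard_error:.2f}")
```

The first error is at floating-point noise level, while the second stays around the amplitude of the overlapping source, which is the part a fixed mask cannot remove.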

teeray

I wonder if this would be nice for hearing aid users for reducing the background restaurant babble that overwhelms the people you want to hear.

yjftsjthsd-h

> Visual prompting: Click on the person or object in the video that’s making a sound to isolate their audio.

How does that work? Correlating sound with movement?

yodon

Think about it conceptually:

Could you watch a music video and say "that's the snare drum, that's the lead singer, keyboard, bass, that's the truck that's making the engine noise, that's the crowd that's cheering, oh and that's a jackhammer in the background"? So can AI.

Could you point out who is lead guitar and who is rhythm guitar? So can AI.

ajcp

Given TikTok's insane creator adoption rate, is Meta developing these models to build out a content creation platform to compete?

mgraczyk

I doubt it. Although it's possible these models will be used for creator tools, I believe the main idea is to use them for data labeling.

At the time the first SAM was created, Meta was already spending over 2B/year on human labelers. Surely that number is higher now, and research like this can dramatically increase data labeling volume.

ac2u

I wonder if the segmentation would work with a video of a ventriloquist and a dummy?

m3kw9

Can I create a continuous “who farted” detector? Would be great at parties.