FFmpeg 8.0 adds Whisper support

292 comments

·August 13, 2025

kmfrk

Whisper is genuinely amazing - with the right nudging. It's the one AI thing that has genuinely turned my life upside-down in an unambiguously good way.

People should check out Subtitle Edit (and throw the dev some money) which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old, like me.

HOWTO:

Drop a video or audio file to the right window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is for cleaning up imperfect transcripts with features like Tools > Fix common errors.

EDIT: Oh, and if you're on the current gen of Nvidia card, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error is about an empty file, output or something like that.

EDIT2: And if you get another error, possibly about whisper.exe, iirc I had to reinstall the Torch libs from a specific index like something along these lines (depending on whether you use pip or uv):

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

    uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

If you get the errors and the above fixes work, please type your error message in a reply with what worked to help those who come after. Or at least the web crawlers for those searching for help.

https://www.nikse.dk/subtitleedit

https://www.nikse.dk/donate

https://github.com/SubtitleEdit/subtitleedit/releases

notatallshaw

> uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

uv has a feature to get the correct version of torch based on your available cuda (and some non-cuda) drivers (though I suggest using a venv not the system Python):

> uv pip install torch torchvision torchaudio --torch-backend=auto

More details: https://docs.astral.sh/uv/guides/integration/pytorch/#automa...

This also means you can safely mix torch requirements with non-torch requirements as it will only pull the torch related things from the torch index and everything else from PyPI.

xrd

I love uv and really feel like I only need to know "uv add" and "uv sync" to be effective using it with python. That's an incredible feat.

But, when I hear about these kinds of extras, it makes me even more excited. Getting cuda and torch to work together is something I have struggled countless times.

The team at Astral should be nominated for a Nobel Peace Prize.

j45

Agreed, making the virtual environment management and so much else disappear lets so much more focus go to python itself.

eigenvalue

They’ve definitely saved me many hours of wasted time between uv and ruff.

danudey

> "uv add"

One life-changing thing I've been using `uv` for:

System python version is 3.12:

    $ python3 --version
    Python 3.12.3

A script that requires a library we don't have, and won't work on our local python:

    $ cat test.py
    #!/usr/bin/env python3

    import sys
    from rich import print

    if sys.version_info < (3, 13):
        print("This script will not work on Python 3.12")
    else:
        print(f"Hello world, this is python {sys.version}")

It fails:

    $ python3 test.py
    Traceback (most recent call last):
    File "/tmp/tmp/test.py", line 10, in <module>
        from rich import print
    ModuleNotFoundError: No module named 'rich'

Tell `uv` what our requirements are

    $ uv add --script=test.py --python '3.13' rich
    Updated `test.py`

`uv` updates the script:

    $ cat test.py
    #!/usr/bin/env python3
    # /// script
    # requires-python = ">=3.13"
    # dependencies = [
    #     "rich",
    # ]
    # ///

    import sys
    from rich import print

    if sys.version_info < (3, 13):
        print("This script will not work on Python 3.12")
    else:
        print(f"Hello world, this is python {sys.version}")

`uv` runs the script, after installing packages and fetching Python 3.13

    $ uv run test.py
    Downloading cpython-3.13.5-linux-x86_64-gnu (download) (33.8MiB)
    Downloading cpython-3.13.5-linux-x86_64-gnu (download)
    Installed 4 packages in 7ms
    Hello world, this is python 3.13.5 (main, Jun 12 2025, 12:40:22) [Clang 20.1.4 ]

And if we run it with Python 3.12, we can see that errors:

    $ uv run --python 3.12 test.py
    warning: The requested interpreter resolved to Python 3.12.3, which is incompatible with the script's Python requirement: `>=3.13`
    Installed 4 packages in 7ms
    This script will not work on Python 3.12

Works for any Python you're likely to want:

    $ uv python list
    cpython-3.14.0b2-linux-x86_64-gnu                 <download available>
    cpython-3.14.0b2+freethreaded-linux-x86_64-gnu    <download available>
    cpython-3.13.5-linux-x86_64-gnu                   /home/dan/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/bin/python3.13
    cpython-3.13.5+freethreaded-linux-x86_64-gnu      <download available>
    cpython-3.12.11-linux-x86_64-gnu                  <download available>
    cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3.12
    cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3 -> python3.12
    cpython-3.11.13-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/bin/python3.11
    cpython-3.10.18-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/bin/python3.10
    cpython-3.9.23-linux-x86_64-gnu                   <download available>
    cpython-3.8.20-linux-x86_64-gnu                   <download available>
    pypy-3.11.11-linux-x86_64-gnu                     <download available>
    pypy-3.10.16-linux-x86_64-gnu                     <download available>
    pypy-3.9.19-linux-x86_64-gnu                      <download available>
    pypy-3.8.16-linux-x86_64-gnu                      <download available>
    graalpy-3.11.0-linux-x86_64-gnu                   <download available>
    graalpy-3.10.0-linux-x86_64-gnu                   <download available>
    graalpy-3.8.5-linux-x86_64-gnu                    <download available>

tossit444

Aegisub is still actively developed (forked), and imo, both software can't really be compared to one another. They can complement each other, but SE is much better for actual transcription. Aegisub still does the heavy lifting for typesetting and the like.

BrunoJo

Subtitle Edit is great if you have the hardware to run it. If you don't have GPUs available or don't want to manage the servers I built a simple to use and affordable API that you can use: https://lemonfox.ai/

pawelduda

Can you give an example why it made your life that much better?

3036e4

I used it like sibling commenter to get subtitles for downloaded videos. My hearing is bad. Whisper seems much better that YouTube's built-in auto-subtitles, so sometimes it is worth the extra trouble for me to download a video just to generate good subtitles and then watch it offline.

I also used whisper.cpp to transcribe all my hoarded podcast episodes. Took days of my poor old CPU working at 100% on all cores (and then a few shorter runs to transcribe new episodes I have downloaded since). Worked as good as I could possibly hope. Of course it gets the spelling of names wrong, but I don't expect anything (or anyone) to do much better. It is great to be able to run ripgrep to find old episodes on some topic and sometimes now I read an episode instead of listen, or listen to it with mpv with subtitles.

peterleiser

You'll probably like Whisper Live and it's browser extensions: https://github.com/collabora/WhisperLive?tab=readme-ov-file#...

Start playing a YouTube video in the browser, select "start capture" in the extension, and it starts writing subtitles in white text on a black background below the video. When you stop capturing you can download the subtitles as a standard .srt file.

kmfrk

Aside from accessibility as mentioned, you can catch up on videos that are hours long. Orders of magnitude faster than watching on 3-4x playback speed. If you catch up through something like Subtitle Edit, you can also click on relevant parts of the transcript and replay it.

But transcribing and passably translating everything goes a long way too. Even if you can hear what's being said, it's still less straining to hear when there's captions for it.

Obviously one important factor to the convenience is how fast your computer is at transcription or translation. I don't use the features in real-time personally currently, although I'd like to if a great UX comes along through other software.

There's also a great podcast app opportunity here I hope someone seizes.

shrx

As a hard of hearing person, I can now download any video from the internet (e.g. youtube) and generate subtitles on the fly, not having to struggle to understand badly recorded or unintelligible speech.

dylan604

IF the dialog is badly recorded or unintelligible speech, how would a transcription process get it correct?

3036e4

I did this as recently as today, for that reason, using ffmpeg and whisper.cpp. But not on the fly. I ran it on a few videos to generate VTT files.

joshvm

I don't know about much better, but I like Whisper's ability to subtitle foreign language content on YouTube that (somehow) doesn't have auto-generated subs. For example some relatively obscure comedy sketches from Germany where I'm not quite fluent enough to go by ear.

10 years ago you'd be searching through random databases to see if someone had synchronized subtitles for the exact copy of the video that you had. Or older lecture videos that don't have transcripts. Many courses had to, in order to comply with federal funding, but not all. And lots of international courses don't have this requirement at all (for example some great introductory CS/maths courses from German + Swiss institutions). Also think about taking this auto generated output and then generating summaries for lecture notes, reading recommendations - this sort of stuff is what LLMs are great at.

You can do some clever things like take the foreign sub, have Whisper also transcribe it and then ask a big model like Gemini to go line by line and check the translation to English. This can include accounting for common transcription errors or idiomatic difference between langauges. I do it in Cursor to keep track of what the model has changed and for easy rollback. It's often good enough to correct mis-heard words that would be garbled through a cheaper model. And you can even query the model to ask about why a particular translation was made and what would be a more natural way to say the same thing. Sometimes it even figures out jokes. It's not a fast or fully automatic process, but the quality can be extremely good if you put some time into reviewing.

Having 90% of this be possible offline/open access is also very impressive. I've not tried newer OSS models like Qwen3 but I imagine it'd do a decent job of the cleanup.

taminka

whisper is great, i wonder why youtube's auto generated subs are still so bad? even the smallest whisper is way better than google's solution? is it licensing issue? harder to deploy at scale?

briansm

I believe youtube still uses 40 mel-scale vectors as feature data, whisper uses 80 (which provides finer spectral detail but is computationally more intensive to process naturally, but modern hardware allows for that)

ec109685

You’d think they’d use the better model for at least videos that have a large view counts (they already do that when deciding compression optimizations).

kanemcgrath

Subtitle edit is great, and their subtitle library libse was exactly what I needed for a project I did.

codedokode

Kdeenlive also supports auto-generating subtitles which need some editing, but it is faster than create them from scratch. Actually I would be happy even with a simple voice detector so that I don't have to set the timings manually.

Morizero

You don't happen to know a whisper solution that combines diarization with live audio transcription, do you?

peterleiser

Check out https://github.com/jhj0517/Whisper-WebUI

I ran it last night using docker and it worked extremely well. You need a HuggingFace read-only API token for the Diarization. I found that the web UI ignored the token, but worked fine when I added it to docker compose as an environment variable.

kmfrk

Proper diarization still remains a white whale for me, unfortunately.

Last I looked into it, the main options required API access to external services, which put me off. I think it was pyannotate.audio[1].

[1]: https://github.com/pyannote/pyannote-audio

peterleiser

I used diarization in https://github.com/jhj0517/Whisper-WebUI last night and once it downloads the model from HuggingFace it runs offline (it claims).

jduckles

WhipserX's diarization is great imo:

    whisperx input.mp3 --language en --diarize --output_format vtt --model large-v2

Works a treat for Zoom interviews. Diarization is sometimes a bit off, but generally its correct.

Morizero

> input.mp3

Thanks but I'm looking for live diarization.

Lio

Once local transcription is in more places hopefully we can persuade content creator not to burn bouncing sub-titles into their videos.

I've seen professionally produced recordings on dry and technical subjects with good sound quality where they've decided to use distracting sub-titles with no way to disable them.

It seems so unnecessary if you're not making novelty videos about cats.

Also local transcription allows for automatic translation and again overlaying subtitles on top of an existing burnt in set is a really poor reading experience.

ambicapter

They do that because it increases “engagement”, not because they care about the user’s experience with the subtitles.

iAMkenough

Also some social media platforms don't offer subtitle functionality, so burned-in is the only way if you want to serve your content to people that require subtitles or refuse to unmute their phones while they watch from their toilet.

jiehong

Those burned in subtitles still aren’t as cool as theme-matched anime subtitles during intro music sequences from fansubs 15 years ago.

Those are still cool IMO

trenchpilgrim

Or how the fansubbers will create masks to translate diegetic text like signage and written notes

whywhywhywhy

Algorithm boosts it that’s why they do it. Even if every device had real time 100% accurate subtitling built in they’d still do it if they video performs better with it.

HPsquared

The other problem with burned-in subtitles is you can't change the language.

LorenDB

The other other problem with burned-in subtitles is that they normally have horrible formatting. Who wants to try to read single words that only flash on-screen while they are being spoken?

rkomorn

True, but (as someone who not infrequently has to rewind content on just about all streaming apps because it decided one particular subtitle only needed to be display for less than 200ms this time around) sometimes burned-in seems like a good idea.

I don't understand why the problem seems so pervasive (I've seen it on Netflix, Viki, and Apple TV, at least) and so transient.

t-3

It's a newer problem IME, so I'd guess it's cause by people using auto-transcription/translation tools to generate subtitles. For eg. Chinese content, I'll see stuff on Viki where the OG Mandarin subs are formatted sanely and the English is piecemeal follow-the-audio style. I can't imagine this happening in any other way than use of a transcription+translation tool without review.

absoflutely

I think this trend is partially driven by the silent auto play that happens on YouTube. Baked in subtitles help draw people into the video.

preisschild

They could also just upload those transcriptions as normal closed-captioning srt subtitles...

jimkleiber

not all social media will show subtitles/captions tho, which is the challenge. YouTube Shorts, TikTok videos, IG reels, FB reels, Whatsapp statuses, and more. I think some allow cc but some don't, and if someone reshares to another platform, it may not be there, so some of us burn them in begrudgingly :-)

dzhiurgis

It's just so annyoing how someone like Netflix offers like 3-4 languages for most of its content when you can basically get it for free via browser extensions (if you watch on browser).

Must be union thing.

dewey

That Netflix who would need to pay more to license more subtitles can't compete with pirated or unlicensed auto-generated subtitles shouldn't really be a surprise.

It's also annoying that you have to pay for Netflix when you can get the same movies for free with less restrictions on a pirate site.

thunderfork

[dead]

londons_explore

Does this have the ability to edit historic words as more info becomes available?

Eg. If I say "I scream", it sounds phonetically identical to "Ice cream".

Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".

Doing this seems necessary to have both low latency and high accuracy, and things like transcription on android do that and you can see the adjusting guesses as you talk.

yvdriess

A good opportunity to point people to the paper with my favorite title of all time:

"How to wreck a nice beach you sing calm incense"

https://dl.acm.org/doi/10.1145/1040830.1040898

abound

For folks like me puzzling over what the correct transcription of the title should be, I think it's "How to recognize speech using common sense"

strken

Thank you! "Calm incense" makes very little sense when said in an accent where calm isn't pronounced like com.

wdaher

This is the correct parsing of it. (I can't take credit for coming up with the title, but I worked on the project.)

codedokode

I only got the "How to recognize" part. Also I think "using" should sound more like "you zinc" than "you sing".

efilife

Thanks. Now I know that I'm not that stupid and this actually makes no sense

fiatjaf

Thank you very much!

fmx

The paper: https://sci-hub.st/https://dl.acm.org/doi/10.1145/1040830.10...

(Agree that the title is awesome, by the way!)

ThinkingGuy

Also relevant: The Two Ronnies - "Four Candles"

https://www.youtube.com/watch?v=gi_6SaqVQSw

xyse53

My favorite is:

"Threesomes, with and without blame"

https://dl.acm.org/doi/10.1145/1570506.1570511

(From a professor I worked with a bit in grad school)

brcmthrowaway

Do AI voice recognition still use markov models for this?

sva_

Whisper uses an encoder-decoder transformer.

Fluorescence

It makes me curious about how human subtitlers or even scriptwriters choose to transcribe intentionally ambiguous speech, puns and narratively important mishearings. It's like you need to subtitle what is heard not what is said.

Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?

It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomena with their abstractions with their claims of a beauty like music... hmm!

0cf8612b2e1e

The quality of subtitles implies that almost no effort is being put into their creation. Watch even a high budget movie/TV show and be aghast at how frequently they diverge.

smallpipe

A good subtitle isn't a perfect copy of what was said.

dylan604

I had similar thoughts when reading Huck Finn. It's not just phonetically spelled, it's much different. Almost like Twain came up with a list of words, and then had a bunch of 2nd graders tell him the spelling of words they had seen. I guess at some point, you just get good at bad spelling?

spauldo

Writing in the vernacular, I believe it's called. I do something like that if I'm texting.

The book "Feersum Endjinn" by Iain M. Banks uses something like this for one of its characters to quite good effect.

ph4evers

Whisper works on 30 second chunks. So yes it can do that and that’s also why it can hallucinate quite a bit.

jeroenhd

The ffmpeg code seems to default to three second chunks (https://ffmpeg.org/ffmpeg-filters.html#whisper-1):

    queue
    
         The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as using the whisper-cli tool), but the transcription latency will be higher, thus not useful to process real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"

londons_explore

so if "I scream" is in one chunk, and "is the best dessert" is in the next, then there is no way to edit the first chunk to correct the mistake? That seems... suboptimal!

I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.

The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.

anonymousiam

Whisper is excellent, but not perfect.

I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."

JohnKemeny

Whisper supports adding a context, and if you're transcribing a phone call, you should probably add "Transcribe this phone call with Gem", in which case it would probably transcribe more correctly.

t-3

That's at least as good as a human, though. Getting to "better-than-human" in that situation would probably require lots of potentially-invasive integration to allow the software to make correct inferences about who the speakers are in order to spell their names correctly, or manually supplying context as another respondent mentioned.

0points

So, yes, and also no.

lgessler

I recommend having a look at 16.3 onward here if you're curious about this: https://web.stanford.edu/~jurafsky/slp3/16.pdf

I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".

ec109685

The I is emphasized more in I scream than ice cream I think.

But it’s great point that you need context to be sure.

shaunpud

I Scream in the Sun https://carmageddon.fandom.com/wiki/I_Scream_in_the_Sun

DiogenesKynikos

This is what your brain does when it processes language.

I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.

mockingloris

A slight segue to this; I was made aware of the phenomena that - The language in which you think in, sets the constraints to which you level of expanse the brain can think and parse information in.

I think in English fortunately and it's an ever evolving language so, expanding as the world does. That is compared to the majority of people where I'm from; English was a second language they had to learn and the people that thought them weren't well equipped with the resources to do a good job.

│

└── Dey well; Be well

cyphar

This is called linguist relativity (nee. The Sapir-Whorf hypothesis) and the strong form you describe has fallen out of favour in modern linguistics.

A surprising number of monolingual people think their own language is the most adaptable and modern language, but this is obviously untrue. All languages evolve to fit the needs of speakers.

Also, the idea that people "think in language X" is heavily disputed. One obvious counterargument is that most people have experienced the feeling of being unable to express what they are thinking into words -- if you truly did think in the language you speak, how could this situation happen? My personal experience is that I do not actively hear any language in my head while unless I actively try to think about it (at least, since I was a teenager).

(This is all ignoring the comments about ESL speakers that I struggle to read as anything but racism. As someone who speaks multiple languages, it astounds me how many people seem to think that struggling to express something in your non-native language means that you're struggling to think and are therefore stupid.)

didacusc

what would it make of this? https://www.youtube.com/watch?v=zyvZUxnIC3k

JohnKemeny

Related, a blog article by the author of the patch:

Run Whisper audio transcriptions with one FFmpeg command

https://medium.com/@vpalmisano/run-whisper-audio-transcripti...

Posted here, with 0 comments: https://news.ycombinator.com/item?id=44869254

eXpl0it3r

Link is broken, full link: https://medium.com/@vpalmisano/run-whisper-audio-transcripti...

NiekvdMaas

Correct URL: https://medium.com/@vpalmisano/run-whisper-audio-transcripti...

sorenjan

I hope this is the start of more ML filters in ffmpeg. They added the sr (super resolution) filter years ago, but it's old and it's difficult to get the weights so you can run it, since they're not included. They have added support for multiple inference libraries like libtorch, but again, it's difficult to even get started. Hopefully they can get behind a consistent ML strategy, ideally with a "models" directory with ready to use models for upscaling, temporal upscaling, noise cancelling, etc. A lot of audio and video filter research use ML now, new codecs will probably also use it soon.

voxadam

Am I correct in understanding that Whisper is a speech recognition AI model originally created by OpenAI?

https://en.wikipedia.org/wiki/Whisper_(speech_recognition_sy...

Maxious

yep, there's a c++ implementation to run it https://github.com/ggml-org/whisper.cpp

oezi

Isn't WhisperX the canonical choice for running Whisper?

0points

While whisper and whisperx is python implementations, the whisper.cpp wins the benchmarks.

sampullman

Maybe for running locally? whisper.cpp is nice because you can embed it pretty easily in apps for various targets like iOS, OSX, Android, wasm, etc.

johnisgood

Yes.

From the documentation:

> It runs automatic speech recognition using the OpenAI's Whisper model.

voxadam

Thanks, I was being tripped up by DDOS protection on code.ffmpeg.org for a minute and couldn't read the patch. The combo of Firefox and the fact that Quantum/Lumen/CenturyLink seems to get off by rotating my dynamic IP for no reason occasionally triggers various DDOS protections schemes.

johnisgood

No problem. :) Yeah, it took me 8 seconds to get through. It seems your issue was worse.

acidburnNSA

Yes, according to the comments in the patch, you are correct.

null

[deleted]

cess11

Kind of, it's a family of audio transcription models.

https://huggingface.co/search/full-text?q=whisper

AlienRobot

I think so, if I remember correctly PotPlayer also supports it for automatic subtitling.

kwar13

yes.

donatj

I know nothing about Whisper, is this usable for automated translation?

I own a couple very old and as far as I'm aware never translated Japanese movies. I don't speak Japanese but I'd love to watch them.

A couple years ago I had been negotiating with a guy on Fiver to translate them. At his usual rate-per-minute of footage it would have cost thousands of dollars but I'd negotiated him down to a couple hundred before he presumably got sick of me and ghosted me.

ethan_smith

Whisper can indeed transcribe Japanese and translate it to English, though quality varies by dialect and audio clarity. You'll need the "large-v3" model for best results, and you can use ffmpeg's new integration with a command like `ffmpeg -i movie.mp4 -af whisper=model=large-v3:task=translate output.srt`.

waltbosz

I wonder how the results of an AI Japanese-audio-to-English-subtitles would compare to a fansub-ed anime. I'm guessing it would be a more literal translation vs. contextual or cultural.

I found an interesting article about trollsubs, which I guess are fansubs made with a contemptuous flare. https://neemblog.home.blog/2020/08/19/the-lost-art-of-fan-ma...

Tangent: I'm one of those people who watch movies with closed captions. Anime is difficult because the subtitle track is often the original Japanese-to-English subtitles and not closed captions, so the text does not match the English audio.

numpad0

I was recently just playing around with Google Cloud ASR as well as smaller Whisper models, and I can say it hasn't gotten to that point: Japanese ASRs/STTs all generate final kanji-kana mixed text, and since kanji:pronunciation is n:n maps, it's non-trivial enough that it currently need hands from human native speakers to fix misheard texts in a lot of cases. LLMs should be theoretically good at this type of tasks, but they're somehow clueless about how Japanese pronunciation works, and they just rubber-stamp inputs as written.

The conversion process from pronunciation to intended text is not deterministic either, so it probably can't be solved by "simply" generating all-pronunciation outputs. Maybe a multimodal LLM as ASR/STT, or a novel dual input as-spoken+estimated-text validation model could be made? I wouldn't know, though. It seemed like a semi-open question.

chazeon

I do japanese transcription + gemini translations. It’s worse than fansub, but its much much better than nothing. First thing that could struggle is actually the vad, then is special names and places, prompting can help but not always. Finally it’s uniformity (or style). I still feel that I can’t control the punctuation well.

neckro23

In my experience it works ok. The "English" model actually knows a lot of languages and will translate directly to English.

You can also transcribe it to Japanese and use a translator to convert to English. This can sometimes help for more semantically complex dialogue.

For example, using faster-whisper-xxl [1]:

Direct translation:

    faster-whisper-xxl.exe --language English --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>

Use Japanese, then translate:

    faster-whisper-xxl.exe --language Japanese --task translate --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>

1. https://github.com/Purfview/whisper-standalone-win

prmoustache

My personnal experience trying to transcribe (not translate) was a complete failure. The thing would invent stuff. It would also be completely lost when more than one language is used.

It also doesn't understand contexts so does a lot of errors you see in automatic translations from videos in youtube for example.

okdood64

It's curious how YouTube's is so bad still given the current state of the art; but it has got a lot better in the last 6 months.

BetterWhisper

Hey, indeed Whisper can do the transcription of Japanese and even the translation (but only to English). For the best results you need to use the largest model which depending on your hardware might be slow or fast.

Another option is to use something like VideoToTextAI which allows you to transcribe it fast and then translate it into 100+ languages which you can then export the subtitle (SRT) file for

trenchpilgrim

Whisper has quite bad issues with hallucination. It will inject sentences that were never said in the audio.

It's decent for classification but poor at transcription.

neckro23

Pre-processing with a vocal extraction model (bs-rofomer or similar) helps a lot with the hallucinations, especially with poor quality sources.

trenchpilgrim

I'm working with fairly "clean" audio (voice only) and still see ridiculous hallucinations.

_def

May I ask which movies? I'm just curious

poglet

Yep, whisper can do that. You can also try whisperx (https://github.com/m-bain/whisperX) for a possibly better experience with aligning of subtitles to spoken words.

hbn

I wonder if Apple's upcoming speech APIs can be added too. Would be cool to have it just work out of the box on Macs, without needing to source a model.

https://developer.apple.com/documentation/speech/speechtrans...

https://developer.apple.com/documentation/speech/speechanaly...

https://www.macstories.net/stories/hands-on-how-apples-new-s...

manca

The only problem with this PR/diff is that it creates just a avfilter wrapper around whisper.cpp library and requires the user to manage the dependencies on their own. This is not helpful for novice users who will first need to:

1. git clone whisper.cpp

2. Make sure they have all dependencies for `that` library

3. Hope the build passes

4. Download the actual model

AND only then be able to use `-af "whisper=model...` filter.

If they try to use the filter without all the prereqs they'll fail and it'll create frustration.

It'd be better to natively create a Whisper avfilter and only require the user to download the model -- I feel like this would streamline the whole process and actually make people use it much more.

slhck

While that would be nicer from an end-user perspective, it's something hard to maintain for FFmpeg itself. Consider the velocity of the whisper-cpp project. I'm sure that – just like with filters such as vmaf, which also require building a dependency and downloading a model – precompiled versions will become available for novice users to directly download. Especially considering whisper-cpp is MIT-licensed.

webinar

I've been using FFmpeg and Whisper to record and transcribe live police scanner audio for my city, and update it in real-time to a live website. It works great, with the expected transcription errors and hallucinations.

Xunjin

Is this website open? Would love to see your work :P

webinar

somerville.votolab.com

mkayokay

Looks like this is a nice case were the LLM thinks that silence is "thanks for watching" which was discussed on here a few days ago.

jaster

All the "Thanks for watching!" gave me a good chuckle.

Remind me of one of my own experiences with one of the Whisper model, where some random noise in the middle of the conversation was translated into "Don't forget to like and subscribe".

Really illustrate where the training data is coming from.

waltbosz

I wanted to do this for my local county council meetings. I think in this context speaker recognition would be important.

instagraham

Does this mean that any software which uses ffmpeg can now add a transcription option? Audacity, Chrome, OBS etc

ks2048

If they want to support it out-of-the box, they'll still have to embed a model file (roughly 500 MB - 3GB, varying size and quality)

einpoklum

Can't you point ffmpeg to a model file using some preferences dialog?

zoobab

Not sure it will be packaged in Debian, with an external binary model god knows how it was produced...

majewsky

It looks like the model file needs to be supplied at invocation time, so the binary blob would not be required for packaging.

zoobab

so 'apt install ffmpeg' won't be enough to have the feature?

SahAssar

You'd have the feature, but you also need to supply the model. The feature seems to just be that ffmpeg has the ability to run the model, it does not include the model.

HN

FFmpeg 8.0 adds Whisper support

FFmpeg 8.0 adds Whisper support