Ask HN: What API or software are people using for transcription?
48 comments · June 9, 2025

What API or software are people using for transcription? If remote API, I'd like it to be fast, cheap, and support summarization. Groq looks good as it apparently supports remote URLs for the audio file, which I would prefer. If local, it would need to work on a base M4 mini. I'm looking at llamafile/whisperfile, as I'd want to be able to either batch it from the CLI or use it as a local API/server.

TachyonicBytes
I use whisperfile[1] directly. The whisper-large-v3 model seems good with non-English transcription, which is my main use-case.
I am also eyeing whisperX[2], because I want to play some more with speaker diarization.
Your use-case seems to be batch transcription, so I'd suggest you go ahead and just use whisperfile. It should work well on an M4 mini, and it also exposes an HTTP API if you just start it without arguments.
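As a rough sketch of that batch-over-HTTP workflow (assuming whisperfile exposes the whisper.cpp-style /inference endpoint on port 8080; check the server's startup output for the actual port and route):

    import requests

    # Local whisperfile server; port and route assumed from whisper.cpp's HTTP server.
    URL = "http://127.0.0.1:8080/inference"

    def transcribe(path: str, language: str = "auto") -> str:
        """Send one audio file to the local server and return plain-text output."""
        with open(path, "rb") as f:
            resp = requests.post(
                URL,
                files={"file": f},
                data={"language": language, "response_format": "text"},
                timeout=600,
            )
        resp.raise_for_status()
        return resp.text

    # batch over a list of clips
    for clip in ["interview1.mp3", "interview2.mp3"]:
        print(transcribe(clip))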
If you want more interactivity, I have been using Vibe[3] as an open-source replacement of SuperWhisper[4], but VoiceInk from a sibling comment seems better.
Aside: so many of the mentioned projects use Whisper at the core that it would be interesting to explicitly mark the projects that don't, so we can have a real fundamental comparison.
[1] https://huggingface.co/Mozilla/whisperfile
[2] https://github.com/m-bain/whisperX
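For the diarization side, the whisperX pipeline looks roughly like this (a sketch following its README; the module layout has shifted between versions, and the pyannote-based diarizer needs a Hugging Face token):

    import whisperx

    device = "cuda"  # "cpu" also works, just slower
    audio = whisperx.load_audio("show.mp3")

    # 1. transcribe with (faster-)whisper
    model = whisperx.load_model("large-v3", device, compute_type="float16")
    result = model.transcribe(audio, batch_size=16)

    # 2. word-level alignment
    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device)

    # 3. diarization via pyannote (needs a Hugging Face token); in older
    #    versions this class lived at whisperx.DiarizationPipeline
    diarize_model = whisperx.diarize.DiarizationPipeline(use_auth_token="hf_...", device=device)
    diarize_segments = diarize_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)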
levocardia
I have used whisperX with success in a variety of languages, but not with diarization. If the goal is to use the transcript for something else, you can often feed the transcript into a text LLM and say "this is an audio transcript and might have some mistakes, please correct them." I played around with transcribing in the original language vs. having Whisper translate it, and it seems to work better to transcribe in the original language, then feed that into an LLM and have that model do the translation. At least for French, Spanish, Italian, and Norwegian. I imagine a text-based LLM could also clean up any diarization weirdness.
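A minimal sketch of that cleanup/translation step, using the OpenAI Python client as a stand-in (the model name and prompt are placeholders; any chat-capable LLM works the same way):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    def clean_and_translate(transcript: str, target_lang: str = "English") -> str:
        """Ask a text LLM to fix ASR mistakes, then translate the corrected text."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "You are editing an automatic audio transcript. "
                            "Fix obvious transcription mistakes without changing the meaning, "
                            f"then translate the corrected text into {target_lang}."},
                {"role": "user", "content": transcript},
            ],
        )
        return resp.choices[0].message.content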
TachyonicBytes
Yes, this is exactly where I am going. The LLM also has an advantage, because you can give it the context of the audio (e.g. "this is an audio transcript from a radio show about etc. etc."). I can foresee this working for a future whisper-like model as well.
There are two ways to parse your first sentence. Are you saying that you used whisperX and it doesn't do well with diarization? I ask because I am curious about alternative ways of doing that.
satvikpendem
DiCoW-v2 seems to work better than whisperX for diarization, by the way.
TachyonicBytes
It seems that both leverage pyannote. I wonder if the whisperX pipeline can be combined with DiCoW-v2.
codeptualize
Whisper large v3 from OpenAI, but we host it ourselves on Modal.com. It's easy, fast, has no rate limits, and is cheap as well.
If you want to run it locally, I'd still go with Whisper; then I'd look at something like whisper.cpp (https://github.com/ggml-org/whisper.cpp). It runs quite well.
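A rough sketch of what the Modal side can look like (GPU type, image contents, and function shape here are assumptions, not our exact setup):

    import modal

    image = (
        modal.Image.debian_slim()
        .apt_install("ffmpeg")
        .pip_install("openai-whisper")
    )
    app = modal.App("whisper-transcribe", image=image)

    @app.function(gpu="A10G", timeout=600)
    def transcribe(audio_bytes: bytes) -> str:
        import tempfile
        import whisper

        # In practice you'd cache the model weights in the image or a volume.
        model = whisper.load_model("large-v3")
        with tempfile.NamedTemporaryFile(suffix=".mp3") as f:
            f.write(audio_bytes)
            f.flush()
            return model.transcribe(f.name)["text"]

    @app.local_entrypoint()
    def main(path: str):
        print(transcribe.remote(open(path, "rb").read()))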
pramodbiligiri
I second this (whisper.cpp). I've had a good experience running whisper.cpp locally. I wrote a Python wrapper for invoking its whisper-cli: https://github.com/pramodbiligiri/annotate-subs/blob/main/ge... (that repo's readme might have more details).
Mind you, this is from a few months back! Not sure if this is still the best approach ¯\_(ツ)_/¯
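The core of such a wrapper is just a subprocess call, something like this (binary name, model path, and flags are assumptions; check whisper-cli --help for your build):

    import subprocess
    from pathlib import Path

    WHISPER_CLI = "whisper-cli"              # whisper.cpp's CLI binary
    MODEL = "models/ggml-large-v3.bin"       # path to a downloaded ggml model

    def transcribe(audio: str, language: str = "auto") -> str:
        out_base = Path(audio).with_suffix("")
        subprocess.run(
            [WHISPER_CLI, "-m", MODEL, "-f", audio, "-l", language,
             "-otxt", "-of", str(out_base)],  # write <audio>.txt next to the input
            check=True,
        )
        return Path(f"{out_base}.txt").read_text()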
Tsarp
I'd love for you to try https://carelesswhisper.app
- Locally running wrapper around whisper.cpp
- I've done a lot of work on noise profiling and stitching the segments, so when you are speaking for anything >2-3 mins, it's actually faster than cloud transcription. (Accuracy is a few WER points off since they are quantized models.)
- You can try it without paying or entering a CC. After that, ~$19 one time. No need to sign up or log in.
- BYOK to use your Groq or Gemini free daily credits to rewrite. Support for thinking models too; it can also plug into any locally running LLM.
- Works on my 1st-gen M1 without breaking a sweat.
onemoresoop
How much do you pay on average for an hour of transcription?
meepmorp
simultaneously related and off topic:
ivm
Just configured VoiceInk yesterday and it's been flawless for all the languages I speak: https://tryvoiceink.com
It runs a small local model and has optional Power Modes that pass the transcript to a remote or local LLM for further enhancements, based on your currently open apps or websites. The app is also open-source, but with a one-time license purchase option (an instabuy for me, of course).
swyx
i use https://voicebraindump.com/ which seems to do something similar (but i happen to know the dev, which is nice for support haha)
jmward01
A combination of engines generally gets the best WER, at additional cost. Hosted Whisper + Gemini 2.5 Flash Lite with custom deconfliction based on what each one does best is a reasonable path. Gemini handles general conversation and silence better than Whisper v3 large, but Whisper v3 large does better on specialty vocab. Of course, both before and after the merge, common transcription errors are fixed with a dictionary-based lookup (that preserves punctuation, etc.). This combo stays multi-lingual and is pretty cheap, but it is complex. There are better single-source transcription vendors out there, but they generally fail to provide multi-lingual support, or timing info, or are ridiculously expensive, or or or... I think the next gen of multi-modal models will make this all moot as they will likely crush transcription. Gemini shows that direction right now. OpenAI does a bad job of it but is in the game. Anthropic is surprisingly not really engaged in this yet (but they did just announce real-time audio, so they gotta be thinking about it).
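The dictionary-based lookup can be as simple as a regex pass that leaves surrounding punctuation alone (a toy sketch; the real dictionary and merge logic are obviously more involved):

    import re

    # Toy version of the dictionary pass: fix known mis-transcriptions while
    # leaving the surrounding punctuation and casing untouched.
    FIXES = {
        "kubernetes": "Kubernetes",   # example entries; the real map is domain-specific
        "post gress": "Postgres",
    }

    def apply_fixes(text: str) -> str:
        for wrong, right in FIXES.items():
            # word-boundary, case-insensitive match so trailing punctuation survives
            text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
        return text

    print(apply_fixes("We deployed it on kubernetes, next to post gress."))
    # -> We deployed it on Kubernetes, next to Postgres.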
satvikpendem
What are people using for realtime transcription and diarization specifically? I'm thinking of something like Zoom's transcript feature, but Zoom itself has the advantage of knowing exactly who is speaking at what time, so they don't need to diarize from raw speech at all.
So far I've seen DiCoW-v2 work pretty well; it's a diarization-finetuned Whisper [0]. Paid options like Speechmatics also work well and are fairly cheap.
illright
A very worthwhile mention is also Stable-TS: https://github.com/jianfch/stable-ts
Out of the box it can transcribe with Whisper or Faster-Whisper, but it can also align audio with an existing human-written transcript, providing timing information without losing accuracy. This last feature was something I really needed, and my attempt at building it myself ended up much worse, so I'm glad I found this.
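For reference, the alignment use-case is only a few lines per the stable-ts docs (a sketch; exact parameter names may vary by version):

    import stable_whisper

    model = stable_whisper.load_model("base")

    transcript = open("transcript.txt").read()      # the human-written text
    result = model.align("episode.mp3", transcript, language="en")

    result.to_srt_vtt("episode.vtt")                # timed subtitles from the alignment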
I self-host it using Modal.com, as do some other commenters.
simonw
I really like the MacWhisper macOS desktop app - https://goodsnooze.gumroad.com/l/macwhisper
It runs Whisper (or the newer Whisper Turbo) really well, and you can either drop MP3/MP4/etc. files into it or paste in a YouTube video/podcast URL to kick off a transcription. It exports to text, VTT subtitles, or a bunch of other formats. I use it several times a week.
droopyEyelids
I was surprised to find my old PC with a GTX 1080 could transcribe/diarize about 10x faster than my M1 Mac. If anyone reading this is looking to transcribe hundreds of hours of audio, do the extra work to get it set up on a desktop with a dedicated graphics card.
ashryan
Thumbs up for Wispr Flow. Their iOS app was just released last week, and is an interesting addition to the product.
I needed to do an inventory of stuff in our house over the weekend, and I used Wispr Flow on iOS to take a very very long and rambly note in their app. Then the transcription text appeared on their Mac app, ready to be pasted into ChatGPT for parsing.
Wispr Flow handles language switches quite well in my experience using it in both English and Japanese.
io84
Cheap hack I use for transcribing in-person customer sessions:
1. record the audio on your phone's audio recorder
2. send the mp3 to yourself in Slack
3. a few minutes later the transcription will appear in Slack
I then feed that to an LLM for summary and actions. Quality has been great for this workflow, all in English.
devoutsalsa
For a web UI, I've used TurboScribe and liked it: http://turboscribe.ai/. Their free tier allows 3 transcriptions per day for audio/video files up to 30 minutes in length. That was nice as many competing services limit their free tier to 10 minutes.
tcdent
OpenAI speech-to-text. Every time I try to use an open source model I am left unimpressed. If I'm going through the hassle of creating a system it might as well work correctly. At this point the closed source world is still miles ahead of the open world; hopefully that changes in the next year or two.
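For reference, the hosted route is only a few lines with the OpenAI Python client (a sketch; "whisper-1" is one available model name, and newer transcription models exist too):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    with open("meeting.mp3", "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    print(transcript.text)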
(Wispr Flow is the best for general dictation/STT on desktop as well.)
sexyman48
I also used to wish open source would catch up with closed source. Then I realized kids and vacations cost money.