Skip to content(if available)orjump to list(if available)

OpenAI Charges by the Minute, So Make the Minutes Shorter

w-m

With transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. One of the people you absolutely have to set your YouTube speed back down to 1x when listening to follow what's going on.

In the idea of making more of an OpenAI minute, don't send it any silence.

E.g.

    ffmpeg -i video-audio.m4a \
      -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
                         stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
                         apad=pad_dur=0.02" \
      -c:a aac -b:a 128k output_minpause.m4a -y
will cut the talk down from 39m31s to 31m34s, by replacing any silence (with a -50dB threshold) longer than 20ms by a 20ms pause. And to keep with the spirit of your post, I measured only that the input file got shorter, I didn't look at all at the quality of the transcription by feeding it the shorter version.

behnamoh

> His natural talking speed is already >=1.5x that of a normal human. One of the people you absolutely have to set your YouTube speed back down to 1x when listening to follow what's going on.

I wonder if there's a way to automatically detect how "fast" a person talks in an audio file. I know it's subjective and different people talk at different paces in an audio, but it'd be cool to kinda know when OP's trick fails (they mention x4 ruined the output; maybe for karpathy that would happen at x2).

janalsncm

> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file

Transcribe it locally using whisper and output tokens/sec?

btown

Even a last-decade transcription model could be used to detect a rough number of syllables per unit time, and the accuracy of that model could be used to guide speed-up and dead-time detection before sending to a more expensive model. As with all things, it's a question of whether the cost savings justify the engineering work.

echelon

> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file.

Stupid heuristic: take a segment of video, transcribe text, count number of words per utterance duration. If you need speaker diarization, handle speaker utterance durations independently. You can further slice, such as syllable count, etc.

nand4011

https://www.science.org/doi/10.1126/sciadv.aaw2594

Apparently human language conveys information at around 39 bits/s. You could use a similar technique as that paper to determine the information rate of a speaker and then correct it to 39 bits/s by changing the speed of the video.

varispeed

It's a shame platforms don't generally support speeds greater than 2x. One of my "superpowers" or a curse is that I cannot stand normal speaking pace. When I watch lectures, I always go for maximum speed and that still is too slow for me. I wish platforms have included 4x but done properly (with minimal artefacts).

cookingrobot

There are fonts designed to be legibly at really small size. I wonder if there are voices that are especially understandable at extreme speeds.

Could use an “auctioneer” voice to playback text at 10x speed.

narratives1

I use a Chrome extension that lets you take any video player (including embedded) to 10x speed. Turn most things to 3-4x. It works on ads too

mrmuagi

All audiobooks are like this for me. I tried it for lectures but if I'm taking handwritten notes, I can't keep up my writing.

I wonder if there is negative side effects of this though, do you notice when interacting with people who speak slower require a greater deal of patience?

dpcx

https://github.com/codebicycle/videospeed has been a wonderful addition for me.

null

[deleted]

lofaszvanitt

Robot in a human body identified :D.

georgemandis

Oooh fun! I had a feeling there was more ffmpeg wizardry I could be leaning into here. I'll have to try this later—thanks for the idea!

w-m

In the meantime I realized that the apad part is nonsensical - it pads the end of the stream, not at each silence-removed cut. I wanted to get angry at o3 for proposing this, but then I had a look at the silenceremove= documentation myself: https://ffmpeg.org/ffmpeg-filters.html#silenceremove

Good god. You couldn't make that any more convoluted and hard-to-grasp if you wanted to. You gotta love ffmpeg!

I now think this might be a good solution:

    ffmpeg -i video-audio.m4a \
           -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
           -c:a aac -b:a 128k output.m4a -y

snickerdoodle12

I love ffmpeg but the documentation is often close to incomprehensible.

pragmatic

No not really? The talk where he babbles about OSes and everyone is somehow impressed?

null

[deleted]

heeton

A point on skimming vs taking the time to read something properly.

I read a transcript + summary of that exact talk. I thought it was fine, but uninteresting, I moved on.

Later I saw it had been put on youtube and I was on the train, so I watched the whole thing at normal speed. I had a huge number of different ideas, thoughts and decisions, sparked by watching the whole thing.

This happens to me in other areas too. Watching a conference talk in person is far more useful to me than watching it online with other distractions. Watching it online is more useful again than reading a summary.

Going for a walk to think about something deeply beats a 10 minute session to "solve" the problem and forget it.

Slower is usually better for thinking.

pluc

Seriously this is bonkers to me. I, like many hackers, hated school because they just threw one-size-fits-all knowledge at you and here we are, paying for the privilege to have that in every facet of our lives.

Reading is a pleasure. Watching a lecture or a talk and feeling the pieces fall into place is great. Having your brain work out the meaning of things is surely something that defines us as a species. We're willingly heading for such stupidity, I don't get it. I don't get how we can all be so blind at what this is going to create.

bisby

> I, like many hackers, hated school because they just threw one-size-fits-all knowledge at you

"This specific knowledge format doesnt work for me, so I'm asking OpenAI to convert this knowledge into a format that is easier for me to digest" is exactly what this is about.

I'm not quite sure what you're upset about? Unless you're referring to "one size fits all knowledge" as simplified topics, so you can tackle things at a surface level? I love having surface level knowledge about a LOT of things. I certainly don't have time to have go deep on every topic out there. But if this is a topic I find I am interested in, the full talk is still available.

Breadth and depth are both important, and well summarized talks are important for breadth, but not helpful at all for depth, and that's ok.

colechristensen

University didn't agree with me mostly because I can't pay attention to the average lecturer. Getting bored in between words or while waiting for them to write means I absorbed very little and had to teach myself nearly everything.

Audiobooks before speed tools were the worst (are they trying to speak extra slow?) But when I can speed things up comprehension is just fine.

hooverd

If you're not listening to summaries of different audiobooks at 2x speed in each ear you're not contentmaxing.

isaacremuant

> We're willingly heading for such stupidity, I don't get it. I don't get how we can all be so blind at what this is going to create.

Your doomerism and superiority doesn't follow from your initial "I like many hackers don't like one size fits all".

This is literally offering you MANY sizes and you have the freedom to choose. Somehow you're pretending pushed down uniformity.

Consume it however you want and come up with actual criticisms next time?

itsoktocry

>Slower is usually better for thinking.

Yeah, I see people talking about listening to podcasts or audiobooks on 2x or 3x.

Sometimes I set mine to 0.8x. I find you get time to absorb and think. Am I an outlier?

georgemandis

For what it's worth, I completely agree with you, for all the reasons you're saying. With talks in particular I think it's seldom about the raw content and ideas presented and more about the ancillary ideas they provoke and inspire, like you're describing.

There is just so much content out there. And context is everything. If the person sharing it had led with some specific ideas or thoughts I might have taken the time to watch and looked for those ideas. But in the context it was received—a quick link with no additional context—I really just wanted the "gist" to know what I was even potentially responding to.

In this case, for me, it was worth it. I can go back and decide if I want to watch it. Your comment has intrigued me so I very well might!

++ to "Slower is usually better for thinking"

mutagen

Not to discount slower speeds for thinking but I wonder if there is also value in dipping into a talk or a subject and then revisiting (re-watching) with the time to ponder on the thoughts a little more deeply.

tass

This is similar to strategies in “how to read a book” (Adler).

By understanding the outline and themes of a book (or lecture, I suppose), it makes it easier to piece together thoughts as you delve deeper into the full content.

conradev

Was it the speed or the additional information vended by the audio and video? If someone is a compelling speaker, the same message will be way more effective in an audiovisual format. The audio has emphasis on certain parts of the content, for example, which is missing from the transcript or summary entirely. Video has gestural and facial cues, also often utilized to make a point.

bongodongobob

You'd love where I work. Everything is needlessly long bloviating power point meetings that could easily be ingested in a 5 minute email.

georgemandis

I was trying to summarize a 40-minute talk with OpenAI’s transcription API, but it was too long. So I sped it up with ffmpeg to fit within the 25-minute cap. It worked quite well (Up to 3x speeds) and was cheaper and faster, so I wrote about it.

Felt like a fun trick worth sharing. There’s a full script and cost breakdown.

bravesoul2

You could have kept quiet and started a cheaper than openai transcription business :)

ilyakaminsky

I've already done that [1]. A fraction of the price, 24-hour limit per file, and speedup tricks like the OP's are welcome. :)

[1] https://speechischeap.com

behnamoh

Sure, but now the world is a better place because he shared something useful!

4b11b4

Pre-processing of the audio still a valid biz, multiple types of pre-processing might be valid

hn8726

Or openai will do it themselves for transcription tasks

timerol

> Is It Accurate?

> I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle-up for this future we're hurtling toward. Boy, howdy.

This is a great bit of work, and the author accurately summarizes my discomfort

simonw

There was a similar trick which worked with Gemini versions prior to Gemini 2.0: they charged a flat rate of 258 tokens for an image, and it turns out you could fit more than 258 tokens of text in an image of text and use that for a discount!

dataviz1000

I built a Chrome extension with one feature that transcribes audio to text in the browser using huggingface/transformers.js running the OpenAI Whisper model with WebGPU. It works perfect! Here is a list of examples of all the things you can do in the browser with webgpu for free. [0]

The last thing in the world I want to do is listen or watch presidential social media posts, but, on the hand, sometimes enormously stupid things are said which move the SP500 up or down $60 in a session. So this feature queries for new posts every minute, does ORC image to text and transcribe video audio to text locally, sends the post with text for analysis, all in the background inside a Chrome extension before notify me of anything economically significant.

[0] https://github.com/huggingface/transformers.js/tree/main/exa...

[1] https://github.com/adam-s/doomberg-terminal

kgc

Impressive

rob

For anybody trying to do this in bulk, instead of using OpenAI's whisper via their API, you can also use Groq [0] which is much cheaper:

[0] https://groq.com/pricing/

Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with whisper-large-v3-turbo. I believe OpenAI comes out to like ~$0.36/hr.

We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube. It uses Groq by default, but I also added support for Replicate and Deepgram as backups because sometimes Groq errors out.

pzo

there is also cloudflare workers ai where you can have whisper-large-v3-turbo for around $0.03 per hour:

https://developers.cloudflare.com/workers-ai/models/whisper-...

georgemandis

Interesting! At $0.02 to $0.04 an hour I don't suspect you've been hunting for optimizations, but I wonder if this "speed up the audio" trick would save you even more.

> We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube

Doesn't YouTube do this for you automatically these days within a day or so?

rob

> Doesn't YouTube do this for you automatically these days within a day or so?

Oh yeah, we do a check first and use youtube-transcript-api if there's an automatic one available:

https://github.com/jdepoix/youtube-transcript-api

The tool usually detects them within like ~5 mins of being uploaded though, so usually none are available yet. Then it'll send the summaries to our internal Slack channel for our editors, in case there's anything interesting to 'follow up on' from the meeting.

Probably would be a good idea to add a delay to it and wait for the automatic ones though :)

jerjerjer

> I wonder if this "speed up the audio" trick would save you even more.

At this point you'll need to at least check how much running ffmpeg costs. Probably less than $0.01 per hour of audio (approximate savings) but still.

ks2048

> Doesn't YouTube do this for you automatically these days within a day or so?

Last time I checked, I think the Google auto-captions were noticeably worse quality than whisper, but maybe that has changed.

colechristensen

If you have a recent macbook you can run the same whisper model locally for free. People are really sleeping on how cheap the compute you own hardware for already is.

rob

I don't. I have a MacBook Pro from 2019 with an Intel chip and 16 GB of memory. Pretty sure when I tried the large whisper model it took like 30 minutes to an hour to do something that took hardly any time via Groq. It's been a while though so maybe my times are off.

colechristensen

Ah, no, Apple silicon Mac required with a decent amount of memory. But this kind of machine has been very common (a mid to high range recent macbook) at all of my employers for a long time.

fragmede

It's been roughly six years since that MacBook was top of the line, so your times are definitely off.

alok-g

>> by jumping straight to the point ...

Love this! I wish more authors follow this approach. So many articles keep going all over the place before 'the point' appears.

If trying, perhaps some 50% of the authors may realize that they don't _have_ a point.

Tepix

Why would you give up your privacy by sending what interests you to OpenAI when whisper doesn't need that much computer in the first place?

With faster-whisper (int8, batch=8) you can transcripe 13 minutes of audio in 51 seconds on CPU.

anigbrowl

I came here to ask the same question. This is a well-solved problem, red queen racing it seems utterly pointless, a symptom of reflexive adversarialism.

55555

This seems like a good place for me to complain about the fact that the automatically generated subtitle files Youtube creates are horribly malformed. Every sentence is repeated twice. In many subtitle files, the subtitle timestamp ranges overlap one another while also repeating every sentence twice in two different ranges. It's absolutely bizarre and has been like this for years or possibly forever. Here's an example - I apologize that it's not in English. I don't know if this issue affects English. https://pastebin.com/raw/LTBps80F

brendanfinan

would this also work for my video consisting of 10,000 PDFs?

https://news.ycombinator.com/item?id=44125598

jasonjmcghee

I can't tell if this is a meme or not.

And if someone had this idea and pitched it to Claude (the model this project was vibe coded with) it would be like "what a great idea!"

karpathy

Omg long post. TLDR from an LLM for anyone interested

Speed your audio up 2–3× with ffmpeg before sending it to OpenAI’s gpt-4o-transcribe: the shorter file uses fewer input-tokens, cuts costs by roughly a third, and processes faster with little quality loss (4× is too fast). A sample yt-dlp → ffmpeg → curl script shows the workflow.

;)

georgemandis

Hahaha. Okay, okay... I will watch it now ;)

(Thanks for your good sense of humor)

karpathy

I like that your post deliberately gets to the point first and then (optionally) expands later, I think it's a good and generally underutilized format. I often advise people to structure their emails in the same way, e.g. first just cutting to the chase with the specific ask, then giving more context optionally below.

It's not my intention to bloat information or delivery but I also don't super know how to follow this format especially in this kind of talk. Because it's not so much about relaying specific information (like your final script here), but more as a collection of prompts back to the audience as things to think about.

My companion tweet to this video on X had a brief TLDR/Summary included where I tried, but I didn't super think it was very reflective of the talk, it was more about topics covered.

Anyway, I am overall a big fan of doing more compute at the "creation time" to compress other people's time during "consumption time" and I think it's the respectful and kind thing to do.

georgemandis

I watched your talk. There are so many more interesting ideas in there that resonated with me that the summary (unsurprisingly) skipped over. I'm glad I watched it!

LLMs as the operating system, the way you interface with vibe-coding (smaller chunks) and the idea that maybe we haven't found the "GUI for AI" yet are all things I've pondered and discussed with people. You articulated them well.

I think some formats, like a talk, don't lend themselves easily to meaningful summaries. It's about giving the audience things to think about, to your point. It's the sum of storytelling that's more than the whole and why we still do it.

My post is, at the end of the day, really more about a neat trick to optimize transcriptions. This particular video might be a great example of why you may not always want to do that :)

Anyway, thanks for the time and thanks for the talk!

bravesoul2

This is the sort of content I want to see in Tweets and LinkedIn posts.

I have been thinking for a while how do you make good use of the short space in those places.

LLM did well here.