
Complete silence is always hallucinated as "ترجمة نانسي قنقر" in Arabic

cyp0633

The same happens with whisper-large-v3 on Chinese transcription: silence is transcribed to something like "please upvote, share and favourite this video". I suspect they trained the model on some random YouTube video without carefully picking really useful data.

ttflee

In Chinese, it always added something like "For study/research purpose only. Please delete after 48 hours." This is what those volunteers added in subtitles of (pirated) movies/shows.

codedokode

Fair. If AI companies are allowed to download pirated content for "learning", why can't ordinary people?

snickerdoodle12

There is so much damning evidence that AI companies have committed absolutely shocking amounts of piracy, yet nothing is being done.

It only highlights how the world really works. If you have money you get to do whatever the fuck you want. If you're just a normal person you get to spend years in jail or worse.

Reminds me of https://www.youtube.com/watch?v=8GptobqPsvg


kgeist

Interesting, in Russian, it often ends with "Subtitles by %some_username%"

cyp0633

That is not the case here - I never encountered this with whisper-large-v3 or similar ASR models. Part of the reason, I guess, is that those subs are burnt into the movie, which makes them hard to extract. Standalone subs need the corresponding video resource to match the audio and text. So nothing is better than YouTube videos which are already aligned.

simsla

At least for English, those "fansubs" aren't typically burnt into the movie*, but ride along in the video container (MP4/MKV) as subtitle streams. They can typically be extracted as SRT files (plain text with sentence level timestamps).

*Although it used to be more common for AVI files in the olden days.
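(For reference, pulling such a subtitle stream out of a container is essentially a one-liner with ffmpeg. A minimal sketch via Python's subprocess; the file name and the stream index 0:s:0 are illustrative, and you'd normally list the streams with ffprobe first.)

    import subprocess

    # Extract the first subtitle stream of an MKV as an SRT file.
    # "0:s:0" = first subtitle stream of input 0; check with `ffprobe movie.mkv` first.
    subprocess.run(
        ["ffmpeg", "-i", "movie.mkv", "-map", "0:s:0", "subs.srt"],
        check=True,
    )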

isoprophlex

Indeed, with another model I would get persistent transcriptions of silent parts into 'Thanks for watching!' or '[MUSIC]'. Pretty dumb that this failure mode wasn't caught in some QA process, and there are now multiple transcription models suffering from the same issue. Having silent parts in your input audio seems like it should be a very common occurrence...

rollcat

When I was taught mathematics, the zero value was always considered the most important edge case. You prove something for N=0 (or N=1), then assume it for N=M and prove it for N=M+1.

It's even more important in audio DSP: processing near-zeroes can end up being extremely CPU intensive, look up denormal/subnormal floats.

KeplerBoy

Denormals are flushed to zero by default on most GPUs by the way.
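(For the curious, a minimal numpy sketch of what "subnormal" means and what a flush-to-zero pass does; the threshold and the helper name are illustrative, not from any particular library.)

    import numpy as np

    # Smallest *normal* float32; anything nonzero with smaller magnitude is subnormal.
    TINY = np.finfo(np.float32).tiny  # ~1.18e-38

    def flush_subnormals(buf: np.ndarray) -> np.ndarray:
        """Return a copy of an audio buffer with subnormal values flushed to zero."""
        out = buf.copy()
        out[np.abs(out) < TINY] = 0.0
        return out

    x = np.array([1e-40, 0.5, -3e-39, 0.0], dtype=np.float32)
    print(flush_subnormals(x))  # [0.  0.5  0.  0. ]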

inglor_cz

Yeah, I studied mathematics (algebra and number theory), and zero is often exactly the point sporting discontinuities or weird asymptotic behavior.

Quite a lot of algorithms use some form of division, and zero is the only number in our typical structures (Z, Q, R, C) that you cannot divide by.

wahnfrieden

whisper MUST be combined with silence detection / VAD
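(A minimal sketch of that idea, assuming the openai-whisper Python package. The RMS threshold is arbitrary, and a proper VAD such as Silero is the better choice, but even a crude energy gate keeps Whisper from transcribing dead air.)

    import numpy as np
    import whisper  # openai-whisper

    model = whisper.load_model("base")

    def transcribe_if_speech(path: str, rms_threshold: float = 0.005) -> str:
        """Skip Whisper entirely when the clip is (near) silent."""
        audio = whisper.load_audio(path)  # 16 kHz mono float32
        rms = float(np.sqrt(np.mean(audio ** 2)))
        if rms < rms_threshold:
            return ""  # treat as silence, emit nothing
        return model.transcribe(audio)["text"]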

pferde

Ah, the good old "you're holding it wrong".

What good is a speech recognition tool that literally hears imaginary voices?

cmiles74

If that's truly the case then they should make it part of the product, IMHO.

DANmode

What's VAD?

xigoi

Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?

madcaptenor

I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

philipwhiuk

> I suspect they trained the model on some random YouTube video without carefully picking really useful data.

They trained the model on every YouTube video they could, and hoped the aggregate was useful data.

indrora

When YouTube began building automatic transcriptions for captions, it regularly flagged any noise or music -- typically industrial noise -- with "[foreign]"

If it couldn't understand it, it was "foreign" for the longest time.

the_af

Hey, Netflix occasionally still puts "[foreign music]" in its English subtitles; it always cracks me up.

stndef

Yeah, I can confirm seeing that a fair bit specifically during non-verbal parts of videos when someone is using a tool.

mmcwilliams

Similar in the English model. Pretty clear they trained on YouTube videos where creators will put that in otherwise silent sections to ensure it shows up for people with CC on.

probably_wrong

The number one hallucination in my transcriptions was "Subtitles by the Amara.org community".

st_goliath

That's interesting. The few times I tried playing with Whisper, I had the impression that YouTube-style videos or random cellphone videos were something it did particularly badly with (compared to movies). My guess at the time was that most of the training material might be subtitles and raw screenplays.

The videos I tried to transcribe were also Mandarin Chinese, using whisper-large-v3. Besides the usual complaints that it would phonetically "mishear" things and generate nonsense, it was still surprisingly good, compared to other software I played around with.

That said, it would often invent names for the speakers and prefix their lines, or randomly switch between simplified and traditional Chinese. For the videos I tested, intermittent silence would often result in repeating the last line several times, or occasionally, it would insert direction cues (in English for some reason). I've never seen credits or anything like that.

In one video I transcribed, somebody had a cold and was sniffling. Whisper decided the person was crying (transcribed as "* crying *", a cough was turned into "* door closing *"). It then transcribed the next line as something quite unfriendly. It didn't do that anymore after I cut the sniffling out (but then the output switched back to traditional Chinese again).


dlcarrier

Classic overfitting

It's the LLM equivalent of thinking that an out-of-office reply is the translation: https://www.theguardian.com/theguardian/2008/nov/01/5

stingraycharles

How is this overfitting, rather than a data quality / classification issue?

bGl2YW5j

If the model were able to generalise, you'd expect it to output something like "[silence]" or "…" in response to silence.

Instead, it reverted to what it had seen before (in the training data), hence the overfit.

stingraycharles

Right, maybe my definition of overfitting was wrong. I always understood it more as optimizing for a specific benchmark / use case, and then it starts failing in other areas.

But the way you phrase it, it's just "the model is not properly able to generalize", i.e. it doesn't understand the concept of silence. That also makes sense.

But couldn't you then argue that any type of mistake / unknown could be explained as "overfitting"? Where do you draw the line?


hsn915

The Arabic text is the translator's self-credit:

"Translated by Nancy Qanfar"

efitz

I know it’s off topic, but it reminded me that translators like to put in Easter eggs, or at least they used to: https://learn.microsoft.com/en-us/archive/blogs/ericfitz/i-a...

wongarsu

And the German is "subtitles of [public broadcaster] for [content network], 2017".

I'm not sure this is really overfitting; the network does exactly what the training data demands. According to the training data, silence at the end transcribes to a copyright notice or subtitle credits.

mort96

Isn't overfitting just when the model picks up on an unintended pattern in the training data? Isn't that precisely what this is?


maxbond

It is a data quality issue which caused the model to overfit.

nottorp

Title should be changed to "OpenAI publishes evidence they trained on pirated movies".

pjc50

Of course. Piracy is legal when you have a bigger pile of money than the studios.

codedokode

Let's not forget that some real pirates (corsairs, for example) were also legal, carrying out sanctioned piracy against the ships of foreign countries.

berkes

How is this evidence of that fact? Honest question.

I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used. But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?

jcranmer

> How is this evidence of that fact?

The contention is that the specific translated text appears largely from illegal translations (i.e., fansubs) and not from authorized translations. And from a legal perspective, that would basically mean there's no way they could legally have appropriated that material.

> But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?

Technically, everything is copyrighted. But your question is really about permission. Some of the known corpora used for AI training include known pirated materials (e.g., libgen), but it's not known whether the AI companies are filtering those materials out of training. There's a large clutch of cases ongoing right now about whether or not AI training is fair use, and the ones that have resolved at this point have done so on technical grounds rather than answering the question at stake.

0points

> I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used.

Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.

> But isn't it already known and admitted (and allowed?)

No, and I don't see where you got that from. Meta [1], OpenAI [2] and everybody else is being sued as we speak.

1: https://petapixel.com/2025/01/10/lawsuit-alleges-mark-zucker...

2: https://www.reuters.com/legal/litigation/openai-hit-with-new...

lcnPylGDnU4H9OF

> I don't see where you got that from

It’s been determined by the judge in the Meta case that training on the material is fair use. The suit in that case is ongoing to determine the extent of the copyright damages from downloading the material. I would not be surprised if there is an appeal to the fair use ruling but that hasn’t happened yet, as far as I know. Just saying that there is good reason for them to think it’s been allowed because it kind of has; that can be reversed but it happened.

skeezyboy

> Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.

Unless you qualify for one of the many exceptions, such as fair use.

nemomarx

The Chinese subtitles for silence use a common mark for pirated media in that language, according to other commenters here. In general, it's pretty likely that non-professional subtitles were distributed with pirated media in some form; that's where you get the most fansubs, after all.

Hnrobert42

HN is pretty strict about not editorializing titles. Even if your statement were unequivocally correct, the post would get flagged.

sivers

to save you a lookup:

The Arabic text "رجمة نانسي قنقر" translates to English as: "Nancy Qanqar's translation" or "Translation by Nancy Qanqar"

"رجمة" means "translation" and "نانسي قنقر" is the name "Nancy Qanqar"

mormegil

In Czech, Whisper usually transcribes music as "Titulky vytvořil JohnyX" ("subtitles made by JohnyX") for the same reason.

actionfromafar

Haha, trained on torrented movies! :-D

The MPA must be so proud.

Incipient

It's absolutely insane that these companies can't be held liable for what is obvious piracy.

beshrkayali

You've got a little typo, it's not "رجمة", it's "ترجمة" that means translation, the ت at the beginning is missing.

aprilthird2021

And it seems to be because the training data is largely unofficial movie subtitles, which often end with a string like "Translated by X" over the credits, where the audio is usually silent.

rob74

Looks like they used more official sources for German - there, silence is apparently hallucinated as "Untertitelung des ZDF für funk, 2017" according to one of the comments on the issue. Which makes sense, as the public broadcasters' "Mediathek" is probably the largest freely available resource of subtitled videos in Germany. I wonder if the ZDF gave its approval for it being used for LLM training though?

MrGilbert

> I wonder if the ZDF gave its approval for it being used for LLM training though?

I am pretty sure they didn't get asked.

unusual-name

Most content from Funk (YouTubers funded by public German broadcasters) is available on YouTube without any geoblocking or other limitations.

Zacharias030

Definitely not! The media platform of the German public television networks even geoblocks anyone outside of Germany.

https://www.ardmediathek.de/

bigiain

A more appropriate output might be ``4'33" -- John Cage, 1952``


4gotunameagain

I'm sure they totally did not pirate the audio of said movies.

iqfareez

Makes sense.

Hobadee

Little did you all know, this is just being mechanical turked by Nancy Qunqar.

Way to go Nancy! Keep up the good work, ya crazy bastard!

whamlastxmas

Is this spam? That name only shows as an instagram account and this thread. If you pay for insta followers is this how they get them now? Haha

DAlperin

That’s the name in the Arabic text hallucinated by the model :)

dandiep

Whisper is unusable IMO because of the hallucinations. Widely documented. Removing silence from audio clips helps, but even then it will auto-correct grammar, translate bilingual speech instead of transcribing it, etc. Improved in the latest audio models, but not solved [1].

1. https://news.ycombinator.com/item?id=43427376

ilyakaminsky

I wouldn't describe it as "unusable" so much as needing to understand its constraints and how to work around them. I built a business on top of Whisper [1] and one of the early key insights was to implement a good voice activity detection (VAD) model in order to reduce Whisper's hallucinations on silence.

[1] https://speechischeap.com

eric-burel

That's the problem with raw large models: they should always be coupled with satellite small models and logic. It's (probably) easier to detect hallucinations using a traditional ML/DL model that catches mismatches (it's easy to build a synthetic dataset for this) than it is to transcribe perfectly in the first place. And the simplest piece of code can detect a silence and enforce that it should match no text.
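(A minimal sketch of that kind of guard, using the per-segment confidence fields that Whisper's transcribe() already returns; the thresholds are illustrative.)

    def drop_suspect_segments(result: dict,
                              no_speech_threshold: float = 0.6,
                              logprob_threshold: float = -1.0) -> str:
        """Keep only segments the model itself thinks contain speech."""
        kept = [
            seg["text"]
            for seg in result["segments"]
            if seg["no_speech_prob"] < no_speech_threshold
            and seg["avg_logprob"] > logprob_threshold
        ]
        return "".join(kept).strip()

    # result = model.transcribe("clip.wav")
    # print(drop_suspect_segments(result))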

haiku2077

I've noticed this also happens in english Whisper models with the phrases:

"[ sub by sk cn2 ]"

or

"Anyways, thanks for watching! Please subscribe and like! Thanks for watching! Bye!"

or

"This is the end of the video. Thank you for watching. If you enjoyed this video, please subscribe to the channel. Thank you."

OSDeveloper

Because they train on pirated media and/or YouTube videos. Good method, until you get slop or get caught.

flexagoon

In Russian it often hallucinates "Субтитры сделал DimaTorzok" ("Subtitles by DimaTorzok") at the end of things. Interestingly, I wasn't able to find any YouTube videos with that name in the subtitles, so it's not like it's in a lot of training data.

berkes

Could it be someone distributing subs online, e.g. showing up in the opensubtitles.org dataset?

voidUpdate

Or possibly someone subtitling pirated movies? That seems to be a common thing according to other comments

io84

"In the future, everyone will be world-famous for 15 minutes" _in a microniche techno-linguistic community, at a time and choosing of the swirling AI clouds_


flkiwi

> [In English] it also happens a lot with hallucinations saying stuff like "This is the end of the video, remember to like and subscribe"

Well now I know how I’m going to start filling awkward silences in meetings.

boredumb

Since it says "Translated by Nancy Qanqar", I'd be willing to bet they're training on some audiobooks with transcripts, and somewhere in there the transcript consistently has "Translated by Nancy Qanqar" where there is dead air in the audiobook.