
OpenAI fails to deliver opt-out system for photographers

toddmorey

No way OpenAI will ever “good citizen” this. Tools to opt out of training sets will only come if they are legally compelled. I think governments will have to make respecting some sort of training-preference header on public content mandatory.

The fact that photographers have to independently submit each piece of work they want excluded, along with detailed descriptions, just shows how much they DON'T want anyone excluding content from their training data.
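
For what it's worth, the closest thing that exists today is a crawler-level signal: OpenAI documents a "GPTBot" user agent that site owners can disallow via robots.txt, though honoring it is entirely voluntary on the crawler's side, which is exactly the problem. A minimal robots.txt:

    User-agent: GPTBot
    Disallow: /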

andrei_says_

Reminds me of the time when p2p music sharing became popular and the record companies had to submit every song they did not want to get shared along with an explanation to every person who had Napster installed.

Or was it that the record companies got to sue individuals for astronomical amounts of made-up damages for every song potentially shared?

Which one was it?

DaiPlusPlus

> and the record companies had to submit every song they did not want to get shared along with an explanation to every person who had Napster installed.

...when did that ever happen? The post-Napster-but-pre-BitTorrent era (coincidentally, the same time period as the BonziBuddy era) was when Morpheus, KaZaA, eDonkey, Limewire, et cetera were relevant, and they got away with it, in part, by denying they had any ability to moderate their users' file sharing; there was no "submitting of every song" to an exclusion list, because there was no exclusion list or filtering in the first place.

orra

This almost got me too, but you missed GP’s point. See their next paragraph beginning “Or”. (It's about double standards for individual versus corporate copyright infringement.)

recursivecaveat

100%. Like most opt-outs, this exists as a checklist feature that proponents can point to in hopes of convincing bystanders. You muddy the waters by allowing someone, with great effort, to technically achieve the thing they want: maybe, for now, until you close it in two years and everyone says "well, that makes sense, nobody used that feature anyway".

dylan604

> The fact that photographers have to independently submit each piece of work they want excluded, along with detailed descriptions, just shows how much they DON'T want anyone excluding content from their training data.

That's bloody brilliant. If you don't want us to scrape your content, please send us your content with all of the training data already provided, so we will know not to scrape it if we come across it in the wild. FFS

nicbou

The tech industry’s understanding of consent is terrifying.

dylan604

"Understanding" is a curious choice of words. I'd have gone with "total disregard".

isoprophlex

Mirrors that of a sexual predator.

"Oh I'm not groping you today? No worries, I'll be back tomorrow."

Obscurity4340

It mirrors the rest of society's lack of understanding of consent. Sunrise, sunset

stonogo

They learned from Google, who to this day require you to suffix your Wi-Fi network name with "_nomap" if you do not want it to be used by their mapping services.
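
The opt-out is literally a magic suffix on the broadcast network name, e.g. (SSIDs are illustrative):

    MyHomeWifi          collected for Wi-Fi-based geolocation
    MyHomeWifi_nomap    excluded from Google's location services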

htrp

Sounds like they want photographers to do the data labeling for them....

echelon

Insofar as data for diffusion / image / video models are concerned, the rise of synthetic data and data efficiency will mean that none of this really matters anyway. We were just in the bootstrapping phase.

You can bolt on new functional modules and train them with very limited data you acquire from Unreal Engine or in the field.

toddmorey

I don't entirely agree. For example, it's a very popular scheme on Etsy right now to use generative models to produce posters in the style of popular artists. Any artist should be able to say: hey, I don't want my works to be part of your training set, powering derivative generations.

And I think it should even apply retroactively, so that they have to retrain models that are already generating works from training data consumed without permission. Of course, OpenAI would fight that tooth and nail, but they put themselves in this position with a clear "take first, ask permission later" mentality.

pj_mukh

Dumb question: why does Etsy allow clearly reproduced/copied works, AI or not?

Like selling it for money seems like a clear line crossed, and Etsy is the perfect gatekeeper here.

protocolture

Style isn't protected?

tomrod

Impossible to put the toothpaste back in the tube.

llm_trw

Should any artist be able to tell another artist: hey, don't copy my work when you're learning, I don't want competition?

It seems like they are deeply upset someone has figured out a way for a machine to do what artists have been doing since time immemorial.

simonw

Has synthetic data become a big part of image/video models?

I understand why it's useful and popular for training LLMs, but I didn't think it was applicable to generative image/video work.

llm_trw

I haven't had the chance to train diffusion models, but for detection models, synthetic data is absolutely how you get state-of-the-art performance now. You just need a relatively tiny, extremely high-quality dataset to bootstrap from.
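
As a toy sketch of what bootstrapping from synthetic data looks like (assuming PyTorch, torchvision, and Pillow; procedurally drawn shapes stand in for assets rendered in Unreal Engine, and a tiny classifier stands in for a real detection model):

    import random
    import torch
    import torch.nn as nn
    from PIL import Image, ImageDraw
    from torchvision import transforms

    def synth_sample():
        # Render one labeled synthetic image: class 0 = circle, class 1 = square.
        img = Image.new("RGB", (64, 64), "black")
        draw = ImageDraw.Draw(img)
        x, y = random.randint(8, 40), random.randint(8, 40)
        label = random.randint(0, 1)
        shape = draw.ellipse if label == 0 else draw.rectangle
        shape([x, y, x + 16, y + 16], fill="white")
        return transforms.ToTensor()(img), label

    model = nn.Sequential(
        nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
        nn.Conv2d(8, 16, 3, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(200):  # the generator supplies unlimited labeled data
        batch = [synth_sample() for _ in range(32)]
        xs = torch.stack([b[0] for b in batch])
        ys = torch.tensor([b[1] for b in batch])
        opt.zero_grad()
        loss = loss_fn(model(xs), ys)
        loss.backward()
        opt.step()

The point is that labels come for free from the renderer; the expensive part is the small, high-quality real dataset that keeps the synthetic distribution honest.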

toddmorey

For clarity, I do agree that synthetic data is huge for training AI to do certain tasks or skills. But I don't think creative work generation is powered by synthetic data, and it may not be for quite a while.

numpad0

Isn't that just weird cope? I mean, why not just have an LLM automate UE if that's the goal, and how isn't that itself going to get torpedoed by Epic?

oraphalous

I don't even understand why it's everyone else's problem to opt out.

Eventually, how many of these AI companies' opt-out processes would a person have to track down just to protect their work from AI? That's crazy.

OpenAI should be contacting every single one and asking for permission - like everyone has to in order to use a person's work. How they are getting away with this is beyond me.

munchler

Copyright doesn't prevent anyone from "using" a person's work. You can use copyrighted material all day long without a license or penalty. In particular, anyone is allowed to learn from copyrighted material by reading, hearing, or seeing it.

Copyright is intended to prevent everyone from copying a person's work. That's a very different thing.

soared

There is an argument to be made that ChatGPT mildly rewording/misquoting info directly from my blog is copying.

Aeolun

And it is. And you can sue them for that. What you can't do is get upset that they (or their AI) read it.

munchler

Sure, but that's a different claim and a different argument.

amiantos

I think to make that argument you would need evidence that someone prompted ChatGPT to reword/misquote info directly from your blog, at which point the argument would be that that person, not ChatGPT, is rewording/misquoting info directly from your blog.

jillesvangurp

That would fall under fair use.

Legally, you'd struggle to prove any form of infringement happened. Making a copy is fine. Distributing copies is what infringes. You'd need to prove that is happening.

That's why there aren't a lot of court cases from pissed off copyright holders with deep pockets demanding compensation.

23B1

> Copyright doesn't prevent anyone from "using" a person's work.

It should. The 'free and open internet' is finished because nobody is going to want to subject their IP to rampant laundering that makes someone else rich.

Tragedy of the commons.

munchler

I can see this both ways. For the sake of argument, please explain why using IP to train an AI is evil, but using the same IP to train a human is good.

Note that humans use someone else's IP to get rich all the time, e.g. doctors reading medical textbooks.

amiantos

Under this mentality, every search engine index would be shut down.

griomnib

Napster had a moment too, but then they got steamrolled in court.

Courts are slow, so it seems like nothing is happening, but there’s tons of cases in the pipeline.

The media industry has forced many tech firms to bend the knee, OpenAI will follow suit. Nobody rips off Disney IP and lives to tell the tale.

tiahura

If your business model depends on the Roberts court kneecapping AI, pivot. Training does not constitute "copying" under copyright law, because it involves the creation of intermediate, non-expressive data abstractions that do not reproduce or communicate the copyrighted work's original expression. This process aligns with fair-use principles, as it is transformative, serves a distinct purpose (machine-learning innovation), and does not usurp the market for the original work.

paranoidrobot

I believe there are some issues other than just "is it transformative".

I can't take an Andy Warhol painting, modify it in some way and then claim it's my own original work. I have some obligation to say "Yeah, I used a Warhol painting as the basis for it".

Similarly, I can't take a sample of a Taylor Swift song and use it myself in my own music - I have to give Taylor credit, and probably some portion of the revenue too.

There's also still the issue that some LLMs and (I believe) image-generation AI models have regurgitated works from their training data, in whole or in part.

mitthrowaway2

There was a time when it did not usurp the market for the original work, but as the technology improves and becomes more accessible, that seems to be changing.

toddmorey

In my experience, when existing laws allow an outcome that causes significant enough harm to groups with influence, the laws get changed.

23B1

> Training does not constitute "copying" under copyright law

It should.

llm_trw

And yet Mickey Mouse is in the public domain. Something those of us who remember the 90s thought would never happen.

timcobb

Just the oldest Mickey. They gave up on it because the cost/benefit wasn't deemed worth it anymore.

CamperBob2

> I don't even understand why it's everyone else's problem to opt out.

Because the work being done, from the point of view of people who believe they are on the verge of creating AGI, is arguably more important than copyright.

Less controversially: if the courts determine that training an ML model is not fair use, then anyone who respects copyright law will end up with an uncompetitive model. As will anyone operating in a country where the laws force them to do so. So don't expect the large players to walk away without putting up a massive fight.

SketchySeaBeast

Of note here: the reason it's "important" is that it will make a shit-ton of money.

CamperBob2

That, coupled with the obvious ideological motivations. Success could alter the course of human history, maybe even for the better.

If you feel that what you're doing is that important, you're not going to let copyright law get in the way, and it would be silly to expect you to.

paulcole

> OpenAI should be contacting every single one and asking for permission - like everyone has to in order to use a person's work

This is the problem of thinking that everyone “has” to do something.

I assure you that I (and you) can use someone else’s work without asking for permission.

Will there be consequences? Perhaps.

Is the risk of the consequences enough to get me to ask for permission? Perhaps.

Am I a nice enough guy to feel like I should do the right thing and ask for permission? Perhaps.

Is everyone like me? No.

> How they are getting away with this is beyond me.

Is it really beyond you?

I think it’s pretty clear.

They’re powerful enough that the political will to hold them accountable is nonexistent.

griomnib

I think it’s safe to assume anything Sam A says is an outright lie by now.

maeil

It's depressing that this understanding hasn't been the status quo for years now. It's not like this is his first gig; it's been publicly verifiable what kind of person he is for ages, long before GPT became famous. You don't need to be part of some insider Silicon Valley cabal to find out.

hashxyz

Can you back that up with anything? I’ve gotten this as a vague sense, but it seems hard to find much actual background about how he manages to continuously fail upward.

dgfitz

Eventually the headline will be the first 2 words.

The tech is neat, there is value in a sense, and LLMs are fun tech. But they are not going to invent AGI with LLMs.

wilg

Who cares if they do it with LLMs or not? How do you define AGI?

mschuster91

> how do you define agi?

An AI that has enough sense of self-awareness to not hallucinate and to recognize the borders of its knowledge on its own. That is fundamentally impossible to do with LLMs because, in the end, they are all next-token predictors, while humans have a much more complex way of storing and associating information and context and, most importantly, develop "mental models" from that information and context.

And anyway, there are other tasks than text generation. Take autonomous driving, for example: a driver sees a person attempting to cross the street in front of them. A human can decide to slam the brake or the gas depending on the context: is the person crossing in front of the car some old granny on a walker, or a soccer player? Or a human sees a ball kicked into the air on the sidewalk behind some cars, with no humans visible. The human can infer "whoops, there might be children playing here, better slow down and be prepared for a child to suddenly step out onto the street from between the cars", but an object detection/classification model lacks the ability to even recognize the ball as a potentially relevant piece of context.

og_kalu

>Take autonomous driving, for example: a driver sees a person attempting to cross the street in front of them. A human can decide to slam the brake or the gas depending on the context: is the person crossing in front of the car some old granny on a walker, or a soccer player? Or a human sees a ball kicked into the air on the sidewalk behind some cars, with no humans visible. The human can infer "whoops, there might be children playing here, better slow down and be prepared for a child to suddenly step out onto the street from between the cars"

These are just post-hoc rationalizations. No one making those split-second decisions under those circumstances has those chains of thought. The brain doesn't 'think' that fast.

>but an object detection/classification model lacks the ability to even recognize the ball as a potentially relevant piece of context.

We're talking about LLMs, right? They can make these sorts of inferences.

https://wayve.ai/thinking/lingo-2-driving-with-language/

wilg

Again, I don't care whether it's done with an LLM or not. There's no reason to think OpenAI will only build LLMs. Recognizing the borders of its knowledge is a reasonable thing to include in an AGI definition, I suppose, but it does not seem intractable.

For the second one, AI drivers like Tesla's current version are already skipping the object detection/classification step, instead using deep learning on the entire video frame, and could absolutely use the ball or any other context to change behavior, even without the particular internal monologue described there.

PittleyDunkin

> An AI that has enough sense of self-awareness to not hallucinate

It's not entirely clear that this is meaningful. Humans engage in confabulation, too.

portaouflop

We have this discussion every minute -.-

lm28469

I care because it's brought to us by the same deranged brains who have promised self-driving cars "in two years" every year since 2012, and a fully autonomous Mars city "by 2030".

We're all wasting time and resources on what basically amounts to alchemy while we could tackle real problems.

Tech solutionists keep making promises for the next 5-10-20 years and never deliver: AI, electric planes, clean fuel, autonomous cars, the metaverse, brain implants. You'd expect the internet to have made people smarter, but we fall for the same snake oil as 100 years ago, en masse this time.

wilg

I mean, there's progress on all those things, and that's good, and there's no downside, really?

goatlover

Whatever makes OpenAI enough money?

dgfitz

… very carefully?

hnburnsy

Maybe the task to implement it was scheduled by ChatGPT...

https://news.ycombinator.com/item?id=42716744

Bilal_io

Sorry, the task failed for unknown reasons.

thrance

Another of these daily reminders that we live in a two-tiered justice system: everything you ever created is fair game to them, but don't you dare use a leak of their weights, lest you be thrown in jail.

jsheard

According to OpenAI, you're not even allowed to use GPT output to train a competing model, so they believe that AI models are the only thing worthy of protection from being trained on. Llama used to have a similar clause, which was partially walked back to "you must credit us if you train on Llama output" in later versions, but that's still a double standard, since they don't credit anything that Llama was trained on, for obvious reasons: now we know that Zuck personally greenlit feeding it pirated books.

umeshunni

Well, that hasn't stopped DeepSeek.

pton_xd

Honestly, good for them. This whole "we can use your output for our input, but don't even think about doing the same" stance is just absurd.

DidYaWipe

Shocking news about the company that fraudulently left "open" in its name after ripping off donors.

I think the headline is too generous here. More accurate would be "OpenAI neglects to deliver opt-out system..."

HeatrayEnjoyer

Sorry, who did they rip off?

All their investors stand to profit handsomely (if they live).

hansvm

They ripped off everyone they lied to. They took money under the premise that they'd put humanity first as this AI transition happened (both in safety and in knowledge sharing), and they instead used that money to build a moat that'll make it harder for anyone else to accomplish those same goals. Investors in the original vision would have been better off had they not contributed any funds, and the monetary profit they're receiving in exchange won't be enough to offset those damages (in the sense that it's not enough to fund somebody attempting to execute the same mission now that OpenAI exists, at least not with the same chance of success they anticipated when OpenAI was younger).

DidYaWipe

Are you saying that donors to their "non-profit" received shares in the now-for-profit enterprise?

And if so, do you have a citation for that?

DidYaWipe

<crickets>

devit

Aren't lawsuits the proper way to address this?

Seems like there's an argument that model weights are a derivative work of the training data, at least if the model is capable of producing output that would be ruled to be such a derivative work given minimal prompting.

Although it may not work with photography, since the model might almost exclusively learn how the subject of the photo looks in general and how photos work in general, rather than memorizing anything about specific photos.

fenomas

I think that argument falls down though, because a derivative work is an expressive work in its own right, and model weights aren't.

It would seem more coherent to argue that a model output could be a derivative work, though it would need to include a significant portion of some given source. But even then, since the copyright office's position is that they're not copyrightable, I'm not sure they could qualify.

devit

Model weights, if they can reproduce something like the original, are just a form of lossy compression (or even lossless, for text), where the LLM answering the prompt is a more powerful version of software retrieving a specific file from a Zip archive of such compressed data (or a webserver answering an HTTP query).

So if model weights don't infringe, that would also imply that saving an image as a JPG, or a video using AV1, doesn't infringe, which would effectively imply that copyright doesn't apply to images or videos on the web. That is not current law/policy, so I think that reasoning cannot possibly work.
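
To make the analogy concrete, here's a minimal Pillow sketch (the filename is assumed): even at an extreme quality setting, the decompressed image is still recognizably the original expression, which is why a JPG copy is still treated as a copy.

    from PIL import Image

    # Heavy lossy compression: most of the original bytes are discarded,
    # yet the creative expression plainly survives in the output file.
    Image.open("photo.png").convert("RGB").save("photo_q5.jpg", quality=5)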

fenomas

That comparison would only make sense if compressed images were considered derivative works. They're not - copyright doesn't protect bytes on a disk, it protects creative expressions. Lossy compression doesn't affect the creative expression, so in copyright terms a compressed JPG is just a copy, and is covered exactly like the original image.

In contrast a derivative work is one creative expression that contains elements of another - like when you take an image and add commentary, or draw your own addition onto it, etc. And I'm pointing out that a trained model is not that - it's not itself a copyrightable expressive work. (We could think of it as a kind of algorithm for generating works, but algorithms aren't copyrightable.)

econ

In my mental imagery, this is a situation that any advancing civilization in the universe should eventually run into. There will be all kinds of materials, from the laborious and expensive to the effortless and "I was the first" or some other entitlement. It all boils down to having or not having such automatons. I'm sure there have been plenty who, like us with our books, have successfully denied progress. I'm also sure there have been plenty for whom it was completely obvious to upload the entire database of ET knowledge.

It is equally obvious what the latter gained and the former lost in the process.

We, with our books, have successfully prevented people from educating themselves, with amazing implications. Now the challenge is to create equally impotent machines!

You have no further questions :)

econ

The brain chip is just one more interface. I feel the need to remind the younglings that in the time before the internet, we talked a lot, and talked about whatever we wanted to. Moderation was done by the speaker himself, by knowing people. Imagine that! It sometimes got emotional or violent, but that is an important part of communication. Looking at your watch was somewhat of an insult, as if the other person failed to be interesting enough. Today no one is interesting enough to talk in long form; few remember how.

Now imagine direct thought moderation. After all, thoughts belong to people? I thought it first? You can't just... It is clear we should control your thoughts. We can't just have you think random things. It would be like TikTok! Or like reading books! Terrifying!

We are quite used to the man behind the curtain deciding everything for us. At what point would the deal get too absurd, I wonder? Would 1984 eventually become a really boring book? Would it exist at all? Would people save up social credits to read it?

Other civilizations must have tried all possible variations, with rather predictable results. To a free mind, I mean.

Or are we already puppets on a string? How much am I boring you with this? Should I be allowed?

DaiPlusPlus

> We, with our books, have successfully prevented people from educating themselves with amazing implications

Que?

No, really... what?

econ

I mean how we, instead of setting the books free, keep them in cages and sell tickets.

It seems to me any civilization in the history of the cosmos will inevitably reach a stage where it has to choose to make knowledge available in order to solve problems.

One should only have to type the title of a book, then get to browse around for a bit, send a link to someone, etc.

Anything else is suicidal nonsense.

Tax hard-working people to pay to defend dead people's pixels from copying?

No one knows who or what an author is, if there even is one. If I generate, or write out by hand, all word combinations, I don't get to own them.

Enforcement is much too expensive for normal people, if one even notices the copying. They just get to pay for it.

An elaborate scheme in order to not solve problems, not innovate and not progress.

Terr_

"By continuing, you agree that using any content from this site in training Generative AI grants the site-owner a perpetual, irrevocable, and royalty-free license to use and re-license any and all output created by that Generative AI system, including but not limited to derivative works based on that output."

Just a GPL-esque idea I've been musing on lately [0]; I'd appreciate any feedback from actual IP lawyers. The idea is to add a poison pill: if a company "steals" your content for their own profit, you can strike back by making it very hard for them to actually monetize the results. Since it's a kind of contract, it doesn't rely on how much of your work seems to be surfacing in a particular output.

So, supposing ArtTheft Inc. snarfs up Jane Doe's paintings from her blog, she (or any similar victim) can declare that they grant the world an almost-public license to anything ArtTheft Inc. has made based on that model. If this happens, ArtTheft Inc. could still make some money selling physical prints, but anyone else could undercut them with free or cheaper copies.

[0] https://news.ycombinator.com/item?id=42582615

amiantos

That's cute. I'm going to put that at the start of any creative work I make, so that anyone who sees it owes me a license to everything they ever make afterward, because a nugget of their life experience legally belongs to me now and all their subsequent creative works are tainted by it.

Terr_

I didn't expect you to admit you were a computer program so readily. :p

Do you have any more substantive critique? It sounds like you're trying to argue that the terms would be found unconscionable. However, it's not asking for any payment, or even any effortful action: it's just saying that the site owner provides content on the condition that, if you incorporate that content into a generative product, the site owner gets to use the results too. Clearly the people hoovering up training data believe my work has some economic value to them, or they wouldn't be running a giant web crawler hitting every page of the blog; it's not as if they're arriving out of boredom, or because they followed some opaque hyperlink out of curiosity.

protocolture

I don't think you can exclaim away fair-use protections. Otherwise everyone already would.

Terr_

> I don't think you can exclaim away fair-use protections.

"Copyright doesn't stop me from X" is different from "copyright lets me do X even though I agreed to a contract saying I wouldn't." (I have many problems with modern click/shrink-wrap, but that's a whole 'nother can of worms and I'm just trying to "fight fire with fire" here.)

If the average ToS has no force, then HN is currently infringing on my copyright by showing this post to you.

protocolture

There's no valid ToS provided to a human to read when I am scraping the entire internet. I don't think wget can sign a contract?

Has anyone managed to hit Google or Yahoo with a TOS violation?

Der_Einzige

Good.

Everyone gets big mad when someone with money acts like Aaron Swartz did. The only bad thing about OpenAI is that they're not actually open-sourcing or open-accessing their stuff. Mistral or Llama "training on pirated material" is literally a feature, not a bug, and the tears from all the artists and others who get mad are delicious. These same artists would profess literal radical Marxism, but they become capitalist Luddite copyright trolls the moment the means of intellectual production are democratized against their will.

If you posted something on the internet, I can and will put it into IP-Adapter, take your style, and use it for my own interests. You cannot stop me except by not posting it where I can access it. That is the burden of posting anything on the public internet. You opt out by not doing it.
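
For the curious, the workflow being described is roughly this: a sketch assuming the Hugging Face diffusers library and its documented IP-Adapter loader, with illustrative model IDs and filenames:

    import torch
    from diffusers import AutoPipelineForText2Image
    from diffusers.utils import load_image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                         weight_name="ip-adapter_sd15.bin")
    pipe.set_ip_adapter_scale(0.7)  # how strongly the reference image steers style

    style = load_image("someones_posted_artwork.png")  # any image off the public web
    image = pipe(prompt="a city street at dusk", ip_adapter_image=style).images[0]
    image.save("in_their_style.png")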

tehjoker

Such a weird argument. A company doing the same thing as Aaron Swartz is doing it for personal gain, not for our collective benefit.

amiantos

It is comical to me how fast the anti-RIAA internet turned into a bunch of copyright maximalists who expect organizations like the RIAA to protect them in some way. In actuality, if someone manages to weaponize copyright against AI, it will only be used successfully by massive rights holders to extract payouts from AI companies, none of the money will be given to any of the creatives, and creatives will naturally still not be very happy about it. Spotify 1.0 is rights holders streaming your content and paying you fractions of pennies for it; "Spotify 2.0" will be them licensing your content to AI companies and paying you a fraction of a fraction of a penny, just once.

dadbod

The tool was called "Media Manager" LMFAO. A name so uninspired it perfectly reflected how little they cared.

grajaganDev

LOL - it sounds like something from 90s-era Microsoft.