
The New York Times Has Spent $10.8M in Its Legal Battle with OpenAI So Far

lesuorac

I still find it (depressingly) hilarious how everybody sees this as a lawsuit about whether training on copyrighted content is legal or not.

Literally, the NYT claimed that OpenAI maintained a database of NYT's works and would surface the content verbatim. This is not an AI issue; it's settled copyright law.

Maxatar

NYT's statement of claims is here:

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

And OpenAI's defense is here:

https://fingfx.thomsonreuters.com/gfx/legaldocs/byvrkxbmgpe/...

OpenAI's defense, which I am not in a position to verify, is that ChatGPT cannot be used to reproduce NYT's articles verbatim without a great deal of prompt engineering. They argue that the verbatim reproductions NYT submitted are the work of a third-party expert who tried more than 10,000 prompts, including feeding portions of NYT articles into ChatGPT to get it to respond with verbatim reproductions, and that no one uses ChatGPT in this manner. Furthermore, the verbatim reproductions that ChatGPT did end up producing after these 10,000 prompts are available on numerous public websites unaffiliated with NYT.

TrainedMonkey

This feels like an extremely weak defense to me, because if it works it is essentially a copyright backdoor.

1. There is little doubt that OpenAI has NYT articles in its dataset.

2. They claim it's OK because all of the IP is mixed together and effectively laundered.

3. But in fact if you put in a secret code you can get the IP back nearly verbatim.

4. So if OpenAI wins this case we could just trade prompts that regurgitate the articles back without ever visiting NYT.

AnthonyMouse

> So if OpenAI wins this case we could just trade prompts that regurgitate the articles back without ever visiting NYT.

This seems like the inverse of the old "book cipher" scheme to "avoid" copyright infringement.

If you want to distribute something you're not allowed to, first you find some public data (e.g. a public domain book), then you xor it against the thing you want to distribute. The result is gibberish. Then you distribute the gibberish and the name of the book to use as a key and anyone can use them to recover the original. The "theory" is that neither the gibberish nor the public domain book can be used to recover the original work alone, so neither is infringing by itself, and any given party is only distributing one of them. Obviously this doesn't work and the person distributing the gibberish rather than the public domain book is going to end up in court.
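The book-cipher scheme described above can be sketched in a few lines. This is a toy illustration only (the sample texts are placeholders, not real works); it also shows why the scheme is vacuous: anyone producing the gibberish had to possess the original.

```python
import itertools

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # XOR each byte of `data` against the key, repeating the key as needed.
    # XOR is its own inverse, so applying the same key twice round-trips.
    return bytes(d ^ k for d, k in zip(data, itertools.cycle(key)))

# Stand-ins for the copyrighted work and the public-domain "key" text.
article = b"This is the text someone wants to distribute without permission."
public_domain_book = b"It was the best of times, it was the worst of times..."

gibberish = xor_bytes(article, public_domain_book)    # what actually gets shared
recovered = xor_bytes(gibberish, public_domain_book)  # anyone with the book undoes it

assert gibberish != article   # the shared blob alone looks like noise
assert recovered == article   # but the original is trivially recoverable
```

Note that producing `gibberish` requires having `article` in hand, which is exactly the point made below about prompts.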

So then which side of the fence is ChatGPT on, and which side is the text you have to feed it to get it to emit the article? Well, it's the latter: you need access to both the existing ChatGPT and the original article in order to produce it.

Notice also that this fails in the same way. The people distributing the text that can be combined with the LLM to reproduce the article are the ones with the clear intention to infringe the copyright. Moreover, you can't produce the prompt that would get ChatGPT to do that unless you already have access to the article, so people without a subscription can't use ChatGPT that way. And, rather importantly, the scheme is completely vacuous. If you already have access to the article needed to generate the relevant prompt and you want to distribute it to someone else, you don't have to give them some prompt they can feed to ChatGPT, you can just give them the text of the article.

jrockway

I agree. If you gzip a NYT article and print it out, very few people would be able to read the article. But it can still be decoded ("prompt engineering" as OpenAI calls it).
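The gzip analogy is easy to demonstrate. A minimal sketch (the article text is a placeholder): the compressed bytes are unreadable on paper, but a single call decodes them back exactly.

```python
import zlib

# Stand-in for an article (repetition makes the compression dramatic).
article = b"All the News That's Fit to Print. " * 40

compressed = zlib.compress(article, level=9)

# Printed out, the compressed bytes are gibberish to a human reader...
assert compressed != article

# ...yet they decode back to the exact original text.
assert zlib.decompress(compressed) == article
```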

sidewndr46

Copyright maximalism in the 21st century can be summed up as: when an individual makes a single copy of a song and gives it to a friend, that's piracy. When a corporation makes subtly different copies of thousands of works and sells them to customers, that's just fair use.

DSMan195276

I'd say there's some merit to that defense. Imagine, for example, a website that generated itself from the digits of Pi - technically all of the NYT is in that 'dataset' (assuming Pi is normal, which is conjectured but unproven), and if you tell it to start at the right digit it will spit back any NYT article. More realistically, though, you can make it spit back anything you want, and a NYT article is just one consequence of that behavior - finding the right 'secret code' to get a verbatim article is not something you can easily do.
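A toy sketch of the Pi thought experiment: generate digits with the well-known Rabinowitz-Wagon unbounded spigot algorithm and search for a target string (the 'secret code' is then just the starting position). Of course, for anything as long as an article the first occurrence, if it exists, would be astronomically far out; this only works for short strings.

```python
import itertools

def pi_digits():
    # Rabinowitz-Wagon unbounded spigot algorithm, yielding the decimal
    # digits of Pi one at a time: 3, 1, 4, 1, 5, 9, ...
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

def find_in_pi(pattern: str, limit: int = 100_000) -> int:
    # Position where `pattern` first appears among the first `limit`
    # digits of Pi (decimal point ignored), or -1 if not found.
    digits = "".join(str(d) for d in itertools.islice(pi_digits(), limit))
    return digits.find(pattern)

# "14159" appears immediately after the leading 3.
assert find_in_pi("14159", 100) == 1
```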

ChatGPT is somewhere in between - you can't just ask it for a specific NYT article and have it spit the article back verbatim (NYT acknowledges as much; it took them ~10k prompts to do it), but with enough hints and guesses you can coax it into producing one (along with pretty much anything else you want). The question then becomes whether that's closer to the Pi example (ChatGPT is basically just spitting the prompt back at you), or whether it's easy enough to do that it's similar to ChatGPT just hosting the article.

Edit: I suppose I'd add, this is also a separate question from the training, training on copyrighted material may or may not be legal regardless of whether the model can verbatim spit the training material back out.

Maxatar

1. Anyone can get all of NYT's articles for free along with CNN and every other major news site, this isn't in dispute, it's available here in a single 93 terabyte compressed file:

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-05/inde...

2. I did not see any defense of this nature.

3. Yes and this is the big deal. If the secret code needed to reproduce copyrighted material involves large portions of that copyrighted material already then that's quite a bit different than just verbatim reproductions out of thin air.

4. Yes, if OpenAI wins this case then you could feed into ChatGPT large portions of NYT articles and OpenAI could possibly respond by regurgitating similar such portions of NYT articles in response.

freejazz

> is that ChatGPT can not be used to reproduce NYT's articles verbatim without a great deal of prompt engineering

You aren't allowed to infringe copyrights just because you make it difficult to do so. OpenAI's system should not be making verbatim copies at all.

AnthonyMouse

It's probably worth considering how the thing actually works.

LLMs are sort of like a fancy compression dictionary that can be used to compress text, except that we kind of use them in reverse. Instead of compressing likely text into smaller bitstrings, they generate likely text. But you could also use them for compression of text because if you take some text, there is highly likely a much shorter prompt + seed that would generate the same text, provided that it's ordinary text with a common probability distribution.

Which is basically what the lawyers are doing. Keep trying combinations until it generates the text you want.

But the ability to do that isn't really that surprising. If you feed a copyrighted article to gzip, it will give you a much shorter string that you can then feed back to gunzip to get back the article. That doesn't mean gunzip has some flaw or ill intent. It also doesn't imply that the article is even stored inside of the compression library, rather than there just being a shorter string that can be used to represent it because it contains predictable patterns.

It's not implausible that an LLM could generate a verbatim article it was never even trained on if you pushed on it hard enough, especially if it was trained on writing in a similar style and other coverage of the same event.
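The claim that ordinary text admits a much shorter representation than its raw length, while random data does not, is the standard compression result the analogy above leans on. A minimal sketch (sample texts are placeholders):

```python
import os
import zlib

# Ordinary prose is full of predictable patterns...
prose = b"The quick brown fox jumps over the lazy dog. " * 50

# ...while random bytes, by construction, are not.
noise = os.urandom(len(prose))

compressed_prose = zlib.compress(prose, level=9)
compressed_noise = zlib.compress(noise, level=9)

# The patterned text shrinks dramatically; the random bytes barely at all.
assert len(compressed_prose) < len(prose) // 10
assert len(compressed_noise) > len(noise) * 9 // 10
```

This is the sense in which a short prompt + seed can stand in for a long, ordinary-looking text, without the text being "stored" anywhere as such.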

jay_kyburz

Everybody seems to be focused on whether OpenAI copied the data in training, but my understanding of copyright is that if a person went into a clean room and wrote a new article from scratch, without having read any NYT, that just so happened to be exactly the same as an existing NYT article, it would still be a copyright violation.

As soon as OpenAI repeats a set of words verbatim, it violates copyright.

The courts should examine how much an occasional verbatim regurgitation would damage NYT's business. (I would guess not much.)

Maxatar

No, this is untrue. Independent creation is an affirmative defense against copyright infringement. You'd never convince a jury that you independently wrote the exact same article as a New York Times article, but in principle you can argue that you independently wrote, say, a song, or even reimplemented the WIN32 API without ever having read or familiarized yourself with the original source code:

https://github.com/wine-mirror/wine

https://harvardlawreview.org/print/vol-128/creating-around-c...

nyssos

> but my understanding of copyright is that if a person when into a clean room and wrote an new article from scratch, without having read any NYT, that just so happened to be exactly the same as an existing NYT article, it would still be a copyright violation.

It would not be. Independent creation is a complete defense against copyright infringement.

Patents, however, do work this way.

hiatus

> and that no one uses ChatGPT in this manner

Someone did though and was able to get verbatim reproductions of NYT articles out of it.

> Furthermore the verbatim reproductions that ChatGPT did end up producing after these 10000 prompts are available on numerous public websites unaffiliated with NYT.

So what? NYT as a copyright holder might have no issue with those unaffiliated sites but have an issue with OpenAI.

amanaplanacanal

This is a problem with copyright law. There is no way for an end user to determine the copyright status of anything on the Internet, you can only make an educated guess.

jtbayly

First sentence of second paragraph of the lawsuit: “Defendants’ unlawful use of The Times’s work to create artificial intelligence products that compete with it threatens The Times’s ability to provide that service.” First sentence of p7: “The Times objected after it discovered that Defendants were using Times content without permission to develop their models and tools.”

I think it’s ultimately about whether training on copyrighted content is legal or not.

Here are some other quotes from the lawsuit that approach it from a different angle: “These tools also wrongly attribute false information to The Times.” “By providing Times content without The Times’s permission or authorization, Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue.”

Even if the first argument fails, if the second argument wins, it still boils down to not being able to train on copyrighted content unless it is possible to train on copyrighted data without ultimately quoting that content or attributing anything to the author of that content. My (uneducated) guess is that’s not possible.

otterley

> I think it’s ultimately about whether training on copyrighted content is legal or not.

It is.

The bulk of the complaint is a narrative; it's meant to be a persuasive story that seeks to put OpenAI in a bad light. You don't really get to the specific causes of action until page 60 (paragraphs 158-180). A sample of the specific allegations that comprise the elements of each cause of action are:

160. By building training datasets containing millions of copies of Times Works, including by scraping copyrighted Times Works from The Times’s websites and reproducing such works from third-party datasets, the OpenAI Defendants have directly infringed The Times’s exclusive rights in its copyrighted works.

161. By storing, processing, and reproducing the training datasets containing millions of copies of Times Works to train the GPT models on Microsoft’s supercomputing platform, Microsoft and the OpenAI Defendants have jointly directly infringed The Times’s exclusive rights in its copyrighted works.

162. On information and belief, by storing, processing, and reproducing the GPT models trained on Times Works, which GPT models themselves have memorized, on Microsoft’s supercomputing platform, Microsoft and the OpenAI Defendants have jointly directly infringed The Times’s exclusive rights in its copyrighted works.

163. By disseminating generative output containing copies and derivatives of Times Works through the ChatGPT offerings, the OpenAI Defendants have directly infringed The Times’s exclusive rights in its copyrighted works.

avbanks

IMO the first argument is invalid, however, the second one is a completely valid argument.

pro14

> "Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue."

News flash: you can read newspaper articles at the library.

blackeyeblitzar

I haven’t checked in on this case for a while, but aren’t there also many organizations that want OpenAI to win this case so that the concept of fair use is upheld?

n0rdy

I like following the OpenAI vs. NYT case, as it's a great example of the controversial situation:

- OpenAI created their models by parsing the internet while disregarding copyrights, licenses, etc., or looking for legal loopholes

- by doing that, OpenAI (alongside others) developed a progressive new tool that is shaping the world and seems to be the next "internet"-like (impact-wise) thing

- NYT is not happy about that, as their content is their main asset

- less democratic countries can apply even less ethical practices for data mining, as copyright laws don't work there, so one might claim it's a question of national defense, considering that AI is actively used in miltech these days

- while the ethical part is less controversial (imho, as I'm with NYT there), the legal one is more complicated: the laws might simply say nothing about this use case (think GPL vs. AGPL license), so the world might need new ones.

And so on...

rustc

I hope they don't settle early and we finally get an answer to whether training AI on copyrighted content is fair use or not.

seydor

at this point we won't need to know because most AIs will be trained on generated data

pkamb

Is anyone building a public domain repository / AI training ground for old newspapers? Anything before 1930 has no restrictions. Newspapers.com has pretty good content but the interface and search is extremely lacking. Google News was abandoned a decade ago. This seems like something where AI could really help, for once. Not in training chatbots or whatever but actually just providing great search for articles in books, newspapers, and magazines.

bikeshaving

There’s also a fascinating proposal I read somewhere where you create a training set with a knowledge cutoff of 1900 or 1930 and see if the resulting AI could predict the future or independently discover later scientific breakthroughs.

themarbz

I'm imagining a model trained on pre-1930 data that only speaks "old-timey English"...

adonovan

> How do I make an HTML view of a SQL database?

Well old chap, you'll need a shoeshine box full of vacuum tubes and some brass flanges...

tombert

I think that idea is capital. I'm really hoping that chatbots start using 23-skidoo in conversations.

mrweasel

The interface and search could probably be solved without AI; it seems like mostly an OCR problem. Both ElasticSearch and Sphinx are already really good, and I'm sure there are other open source or commercial search engines available. Or hire ex-Google engineers; Google doesn't seem interested in search anymore.

pkamb

Newspapers have nearly identical newswire columns printed in 100+ newspapers, but with slightly different headlines and content. Or OCR breaks because words are physically adjacent on the page but belong to separate stories. The Newspapers.com search has fine OCR but is difficult and time-consuming to use because of those issues. Seems like something "AI" could solve easily.

screye

I can't imagine a scenario where pre-training on someone else's works is fair-use, but distilling from a proprietary LLM isn't.

nimish

NYT will lose:

Copyright only protects the actual text. LLMs contain weights, not exact copies. In any case, saying "I put in some input and got copyrighted output" doesn't make the tool itself the violation; if I use a generative tool and generate copyrighted material, is it the tool's fault?

An LLM is a dump of effectively arbitrary numbers that, when hooked up to a command line, uses one of the world's most awful programming languages to evaluate and execute.

OpenAI at most broke an EULA or some technicality on copyright w.r.t. local ephemeral copies. What's the damage to the NYT though?

kopecs

> Copyright only protects the actual text. LLMs have weights, not exact copies.

Following this logic, a lossily compressed image is completely unprotected by copyright.

> In any case, saying "if I put in some input and get copyrighted output" is tantamount to copyright violations; if I use a generative tool and generate copyrighted info is it the tools fault?

Do you not think this is obviously fact-specific? If I gzip a bunch of (copyrighted) files, then obviously that doesn't somehow make distributing them not infringement. If I now replace the tool = ungzip + input = files combination with tool = (ungzip and files) and input = (selection mechanism over files) do you think that in the second case distributing the tool is not infringement? I don't mean to say that any of these is precisely the same as the LLM case, but I think your argument is clearly overbroad.

> OpenAI at most broke an EULA or some technicality on copyright w.r.t. local ephemeral copies. What's the damage to the NYT though?

One obvious damages claim (if you are skeptical of market harm w.r.t. newspaper/online sub sales) is that they were entitled to the FMV of licensing costs of the articles, which is not so hard to value: OpenAI has entered such agreements with AP and others. [0]

[0]: https://apnews.com/article/openai-chatgpt-associated-press-a...

dkjaudyeqooe

Wrong. I can sample a sound off a record, convert it to any format, manipulate it until it's unrecognizable and I'll still have to pay royalties to the original copyright holder.

Even a translation of original text into another language is copyright infringement.

The real question is if LLMs are fair use, and on the basis of the standard tests for fair use, it seems quite doubtful.

dragonwriter

> Copyright only protects the actual text.

Copyright protects against both derived works and copies in any form, including lossy or inaccurate copies that do not reach the originality level to be derived works, not just “exact copies”.

But that doesn't really matter here, because OpenAI isn't being sued for producing and distributing an LLM (against a mere LLM distributor, NYT would have a much weaker case). They are being sued for providing a service which takes in copyrighted works and spits out copies, both exact and not, that are well within the established coverage of what constitutes a copyright violation and do not fall within exceptions like fair use. And when they control the whole path in between original and copy, the path in between is largely immaterial.

It's not an "is training AI on copyright-protected works fair use" case; it's an "is producing copies well within the established parameters of commercial copyright violation rendered fair use by sticking an LLM in the middle of the process as part of the mechanism of copying" case.

otterley

To train the model, OpenAI had to make a copy of NYT's works in order to do it. (Running a scraper to dump websites onto your local storage is making a copy.) NYT's first theory is that the act of copying is a prima facie copyright violation.

ViktorRay

Would anyone here be able to explain to me where this money is going? Are the lawyers working for the New York Times really this expensive? If so these lawyers must be getting massive amounts of money...

echoangle

Is 10 million a year a lot for lawyers? I thought a partner at a large law firm might get $500k or more per year, so paying a few lawyers plus the assistants for all of them can get expensive quickly.

nyssos

> I thought a partner at a large law firm might get $500k or more per year

Easily.

light_triad

$1000-$2000+/h is not uncommon for top lawyers

gotoeleven

Are they paying the lawyers with government money? I'm seriously asking. Why is the government paying 10s of millions of dollars/year to the New York Times? How can they still claim to be a news organization without having disclosed this? If the government is paying the NYT, then don't their productions belong in the public domain?

https://x.com/stillgray/status/1887191056074350690

delecti

That suggestion seems rather conspiratorial. Do you have any reason to think that's the case, or are you just throwing it out as a wild possibility?

Also, has anything changed WRT Ian Miles Cheong's credibility? He's been a far-right grifter for years, I wouldn't trust any data he puts out without a corroborating source.

dkjaudyeqooe

(1) Apparently. (2) Why don't you ask the NYT or the departments in question (e.g. through a FOI request)? (3) The NYT sells things, including to the government; so what? Why should they disclose this? (4) No, not by default; it depends on the circumstances.

You make insinuations without a shred of relevant evidence. Your tinfoil hat is on too tight.

tester756

Why is it THAT expensive?

tintor

Lawyer time

$10.8M ≈ 135 days × 8 hours/day × $1,000/hour × 10 lawyers

tester756

$1,000/h?

Why are they this expensive?

echoangle

Because the Times is trying to get millions or billions in compensation if they win. Why would you lower your odds of winning by getting cheaper lawyers just to save a few hundred thousand?

songshu

They have, let's not call it a union so as not to upset people, but let's say a collective agreement that they won't work for capital. They will only work for other lawyers. So there's no Walmart Law or other enterprise selling legal services for cheap.

jcranmer

They're not.

Median lawyer rates are more like $200-300/h, with variations depending on locality--a lawyer in NYC is going to be much more expensive than a lawyer in middle-of-nowhere, Kentucky.

As for why they're expensive, part of the answer is because legal training (i.e., law school) is expensive, and lawyers have to pay their student debt.

hlynurd

Because people are willing to pay that amount

jfkrrorj

That is pretty cheap for NYC lawyer.

seydor

an AI lawyer could work 24/365

user3939382

My ideal solution would be to public domain anything NYT has written in the past, turn it over to archive.org, and dismantle NYT so it’s no longer an issue in the future.