Thomson Reuters wins first major AI copyright case in the US
93 comments
· February 11, 2025
JackC
Here's the full decision, which (like most decisions!) is largely written to be legible to non-lawyers: https://storage.courtlistener.com/recap/gov.uscourts.ded.721...
The core story seems to be: Westlaw writes and owns headnotes that help lawyers find legal cases about a particular topic. Ross paid people to translate those headnotes into new text, trained an AI on the translations, and used those to make a model that helps lawyers find legal cases about a particular topic. In that specific instance, the court says this plan isn't fair use. If it were fair use, one could presumably just pay people to translate headnotes directly and make a Westlaw competitor, since translating headnotes is cheaper than writing new ones. And conversely, if it isn't fair use, where's the harm (the court notes, for example, that no copyright violation was necessary for interoperability) -- one can still pay people to write fresh headnotes from caselaw and create the same training set.
The court emphasizes "Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today." But I'm not sure "generative" is that meaningful a distinction here.
You can definitely see how AI companies will be hustling to distinguish this from "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents." It's not quite the same, the connection is less direct, but it's not totally different.
anon373839
This is an interesting opinion, but there are aspects of it that I doubt will stand the test of time.
One aspect is the court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim, because the editorial decision to quote the material itself shows a “creative spark”. It really isn’t workable, in law specifically, for copyright to attach to the mere selection of a quote from a case to represent that case’s holding on an issue. After all, we would expect many lawyers analyzing the case independently to converge on the same quotes!
The key fact underlying all of this, I think, is that when Ross paid human annotators to write their own versions of the headnotes, they really did crib from West’s wholesale rather than doing their own independent analysis. Source text was paraphrased using curiously similar language to West’s paraphrasing. That, plus the fact that Ross was a directly competing product, is what I see as really driving this decision.
The case has very little to say about the more commonly posed question of whether copyright is infringed in large-scale language modeling.
AnthonyMouse
> That, plus the fact that Ross was a directly competing product, is what I see as really driving this decision.
The "competing product" thing is probably the most extreme part of this opinion.
The most important fair use factor is whether the use competes with the original work, but this is generally understood to mean directly competes. I.e., if you translate someone else's book from English to French and want to sell the translation, the translation is going to be in direct competition for sales to people who speak both English and French. The customer is going to use the copy claiming fair use as a direct substitute for the original work, instead of buying it.
This court is trying to extend that to anything downstream from it, which seems crazy. For example, "multiple copies for classroom use" is one of the explicit examples of fair use in the copyright statute, but schools are obviously teaching people who intend to go into competition with the original author. More generally, the idea that you can't read something if you ever intend to write something to sell in competition with it seems absurd, and contradicts common practice in reverse engineering.
But this is also a district court opinion that isn't even binding on other courts, so we'll see what happens if it gets appealed.
mountainb
No, that is not an extreme interpretation of the fair use factors. This is a routinely emphasized factor in fair use analyses for both copyright and trademark. School fair use is different because that defense is written directly into the statute, 17 U.S.C. § 107. Also, § 108 provides extensive protections for libraries and archives that go beyond fair use doctrine.
The idea that schools are encouraging students to compete with the original authors of works taught in the classroom is fanciful under the meaning courts usually apply to competition. Your example is different from this case, in which Ross wanted to compete in the same market against West, offering a similar service at a lower price. Another reason schools get a carveout is that most education would be impractical if each school had to obtain special public-performance licenses for every work referenced in the classroom.
But maybe that also raises the question of whether schools really deserve that kind of sweetheart treatment (a massive indirect subsidy), or whether it over-privileges formal schools relative to the commons at large.
fncypants
I think this is the best takeaway. This case and its outcome is restricted to its facts. Most of the LLM activity today is very different than what happened here.
zozbot234
If close paraphrase can be detected, that ought to be proof enough that some non-trivial element of creativity was involved in the original text. Purely functional and necessary elements are not protected by copyright, even when they would otherwise be creative (this is the territory of the merger and 'scènes à faire' doctrines), and surely a "quote" which is unavoidable because it factually and unquestionably is the core of the ruling would have to fall under that.
Ajedi32
Interestingly, almost the entirety of the judge's opinion seems to be focused on the question of whether the translated notes are subject to copyright. It seems to completely ignore the question of whether training an AI on copyrighted material constitutes making a copy of that work in the first place. Am I missing something?
The judge does note that no copyrighted material was distributed to users, because the AI doesn't output that information:
> There is no factual dispute: Ross’s output to an end user does not include a West headnote. What matters is not “the amount and substantiality of the portion used in making a copy, but rather the amount and substantiality of what is thereby made accessible to a public for which it may serve as a competing substitute.” Authors Guild, 804 F.3d at 222 (internal quotation marks omitted). Because Ross did not make West headnotes available to the public, Ross benefits from factor three.
But he only does so as part of analyzing whether there's a valid fair use defense for Ross's copying of the headnotes, ignoring the obvious (to me) question: if no copyrighted material was distributed to end users, how can this even be a violation of copyright in the first place?
unyttigfjelltol
Ross evidently copied and used the text itself. It's like Ross creating an unauthorized volume of West's books, perhaps with a twist.
Obscurity ≠ legal compliance.
dkjaudyeqooe
> The court emphasizes "Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today." But I'm not sure "generative" is that meaningful a distinction here.
Also, in making that statement the judge appears to misunderstand the nature of the AI system and the inherently generative elements it includes.
echoangle
How is the system inherently generative?
currymj
Generative is a technical term, meaning that a system models a full joint probability distribution.
For example, a classifier is a generative model if it models p(example, label) -- which is sufficient to also calculate p(label | example) if you want -- rather than just modeling p(label | example) alone.
Similar example in translation: a generative translation model would model p(french sentence, english sentence) -- implicitly including a language model of p(french) and p(english) in addition to allowing translation p(english | french) and p(french | english). A non-generative translation model would, for instance, only model p(french | english).
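To make that concrete, here's a toy sketch (the sentence pairs and probability values are invented, purely for illustration): a generative model stores the joint p(french, english), and the conditional p(english | french) falls out by normalizing.

    # Toy generative model over (french, english) sentence pairs.
    # The probabilities are made up for illustration only.
    joint = {
        ("le chat", "the cat"): 0.40,
        ("le chat", "the dog"): 0.05,
        ("le chien", "the cat"): 0.05,
        ("le chien", "the dog"): 0.50,
    }

    def p_english_given_french(french):
        """Derive p(english | french) from the joint by normalizing."""
        rows = {e: p for (f, e), p in joint.items() if f == french}
        marginal = sum(rows.values())  # this sum is the marginal p(french)
        return {e: p / marginal for e, p in rows.items()}

    print(p_english_given_french("le chat"))
    # -> {'the cat': 0.888..., 'the dog': 0.111...}

A non-generative (discriminative) translation model would store only that conditional table directly, with no model of the marginal p(french) at all.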
I don't exactly understand what this judge meant by "generative", it's presumably not the technical term.
echelon
If the copyright holders win, the model giants will just license.
This effectively kills open source, which can't afford to license and won't be able to sublicense training data.
This is very bad for democratized access to and development of AI.
The giants will probably want this. The giants were already purchasing legacy media content enterprises (Amazon and MGM, etc.), so this will probably further consolidation and create extreme barriers to entry.
If I were OpenAI, I'd probably be very happy right now. If I were a recent batch YC AI company, I'd be mortified.
dkjaudyeqooe
License what? Every available copyrighted work? Even getting a tiny fraction is not practical.
To the contrary, this just means companies can't make money from these models.
Those using models for research and personal use wouldn't be infringing under the fair use tests.
marcus0x62
> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.
Maybe the strategy is something like this:
1) Survive long enough/get enough users that killing the generative AI industry is politically infeasible.
2) Negotiate a compromise similar to the compulsory mechanical royalty system used in the music business to “compensate” the rights holders whose content is used to train the models
The biggest AI companies could even run the enforcement cartels à la BMI/ASCAP to compute and collect royalties owed.
If you take this to its logical conclusion, the AI companies wouldn’t have to pre-license anything, and would just pay out all the royalties to the biggest rights holders (more or less what happens in the music industry), on the basis that figuring out what IP went into what model output is just too hard. Instead, they'd agree to distribute it to whoever is on the New York Times best seller list at any given moment.
AnthonyMouse
> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.
They don't need every copyrighted work, and getting a fraction is entirely practical. They would go to some large conglomerate like Getty Images, or large publishers, or social media sites whose terms give them a license to what you post; the middlemen would get a vig, and the original authors would get peanuts, if anything at all.
But in aggregate it would price out the little guy from creating a competing model, because each creator getting $3 is nothing to the creator but is real money to a small entity when there are a billion creators.
echoangle
They didn’t train it on every available copyrighted work though, but on a specific set of legal questions and answers. And they did try to license them, and only did the workaround after not getting a license.
mvdtnz
> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.
Oh no. Anyway.
JoshTriplett
> If the copyright holders win, the model giants will just license.
No, they won't. The biggest models want to train on literally every piece of human-written text ever written. You can pay to license small subsets of that at a time. You can't pay to license all of it. And some of it won't be available to license at all, at any price.
If the copyright holders win, model trainers will have to pay attention to what they train on, rather than blithely ignoring licenses.
simonw
"The biggest models want to train on literally every piece of human-written text ever written"
They genuinely don't. There is a LOT of garbage text out there that they don't want. They want to train on every high quality piece of human-written text they can get their hands on (where the definition of "high quality" is a major piece of the secret sauce that makes some LLMs better than others), but that doesn't mean every piece of human-written text.
mvdtnz
Open source model builders are no more entitled to rip off content owners than anyone else. I couldn't possibly care any less if this impacts "democratized access" to bullshit generators. At least if the big boys license the content then the rightful owners get paid (and have the option to opt out).
vkou
I don't have either a data center, or every single copyrighted work in history to import as training data to train my open source model.
Whether or not OpenAI is found to be breaking the law will be utterly irrelevant to actual open AI efforts.
veggieroll
> Thomson Reuters prevailed on two of the four factors, but Bibas described the fourth as the most important, and ruled that Ross “meant to compete with Westlaw by developing a market substitute.”
Yep. That's what people have been saying all along. If the intent is to substitute for the original, then copying is not fair use.
But the problem is that the current method for training requires this volume of data. So the models are legitimately not viable without massive copyright infringement.
It'll be interesting to see how a defendant with a larger wallet will fare. But this doesn't look good.
Though big-picture, it seems to me that the moneyed interests will ensure that even if the current legal landscape doesn't allow LLMs to exist, they will lobby HARD until it is allowed. This is inevitable now that it's at least partially framed in national security terms.
But I'd hope that this means there is a chance that if models have to train on all of human content, the weights will be available for free to all humans. If it requires massive copyright infringement on our content, we should all have an ownership stake in the resulting models.
toyg
> the current method for training requires this volume of data
This is one of those things that signal how dumb this technology still is, or maybe how smart humans are when compared to machines. A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.
I remember talking with friends 30 years ago about how it was inevitable that the brain would eventually be fully implemented as machine, once calculation power gets big enough; but it looks like we're still very far from that.
gregschlom
> A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.
Maybe not directly, but consider that our brains are the product of millions of years of evolution and aren't a blank slate when we're born. Even though babies can't speak a language at birth, they already have all the neural connections in place to acquire and manipulate language, and require just a few years of "supervised fine tuning" to learn the actual language.
LLMs, on the other hand, start with their weights at random values and need to catch up with those millions of years of evolution first.
skeledrew
Add to this, the brain is constantly processing raw sensory data from the moment it became viable, even when the body is "sleeping". It's using orders of magnitude more data than any model in existence every moment, but isn't generally deemed "intelligent" enough until it's around 18 years old.
dkjaudyeqooe
> A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.
> I remember talking with friends 30 years ago
I'd say you're pretty old. How many years of training did it take for you to start producing good output?
The lesson here is that we're kind of meta-trained: our minds are primed to pick up new things quickly by abstracting them and relating them to things we already know. We work in concepts and mental models rather than text. LLMs are incredibly weak by comparison. They only understand token sequences.
CobrastanJorji
We are unbelievably far from that. Everyone who tells you that we're within 20 years of emulating brains and says stuff like "the human brain only runs at 100 hertz!" has either been conned by a futurist or is in denial of their own mortality.
veggieroll
Absolutely! But the question is whether the next step-change in intelligence is just around the corner (in which case, this legal speedbump might spur innovation). Or, will the next revolution take a while.
There's enough money in the market to fund a lot of research into totally novel underlying methods. But if it takes too long, investors and lawmakers will just move to make what already works legal, because it is useful.
Intralexical
> I remember talking with friends 30 years ago about how it was inevitable that the brain would eventually be fully implemented as machine, once calculation power gets big enough; but it looks like we're still very far from that.
Why would it be?
"It's inevitable that the Burj Khalifa gets built, once steel production gets high enough."
"It's inevitable that Pegasuses will be bred from horses, as soon as somebody collects enough oats."
Reducing intelligence to the bulk aggregate of brute "calculation power" is... ironically, missing the point of intelligence.
saulpw
> So the models are legitimately not viable without massive copyright infringement.
Copyright is not about acquisition, it is about publication and/or distribution. If I get a copy of Harry Potter from a dumpster, I can read it. If a company gets a copy of *all* books from a torrent, they can use it to train their AI. The torrent providers may be in violation of copyright, and if the AI can be used to reproduce substantive portions of the original text, the AI companies may then be in violation of copyright, but simply training a model on illegally distributed text should not be copyright infringement.
_DeadFred_
As long as someone gives me the software to run my business, that person might be in violation of copyright but I'm in the clear.
Simply running my business on illegally distributed copyrighted text/software/movie should not be copyright infringement.
itishappy
You might not be immediately liable, but that doesn't mean you're allowed to continue. I'd assume it's your duty to cease and desist immediately once it's pointed out that you're in violation.
dkjaudyeqooe
> simply training a model on illegally distributed text should not be copyright infringement
You can train a model on copyrighted text, you just can't distribute the output in any way without violating copyright. (edit: depending on the other fair use factors).
One of the big problems is that training is a mechanical process, so there is a direct line between the copyrighted works and the model's output, regardless of the form of the output. Just on those terms it is very likely to be a copyright violation. Even if they don't reproduce substantive portions, what they do reproduce is a derivative work.
saulpw
If that mechanical process is not reversible, then it's not a copyright violation. For instance, I can compute the SHA256 hashes for every book in existence and distribute the resulting table of (ISBN, SHA256) and that is not a copyright violation.
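Here's a minimal sketch of that hash-table idea (the ISBNs and texts below are placeholders, not real books): the published table is derived mechanically from the texts, but SHA-256 is one-way, so nothing of the text can be recovered from it.

    import hashlib

    # Placeholder corpus: ISBN -> full text of the book.
    books = {
        "978-0-00-000000-1": "full text of book one ...",
        "978-0-00-000000-2": "full text of book two ...",
    }

    # Publish only (ISBN, SHA-256 digest); the digest reveals nothing
    # about the underlying text.
    table = {isbn: hashlib.sha256(text.encode("utf-8")).hexdigest()
             for isbn, text in books.items()}

    for isbn, digest in sorted(table.items()):
        print(isbn, digest)

The table is computed entirely from the copyrighted texts, yet (per the argument above) distributing it plainly isn't distributing the books.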
aoanevdus
What’s a “mechanical process”? If I read The Lord of the Rings and it teaches me to write Star Wars, is that a mechanical process? My brain is governed by the laws of physics, right?
What if I’m a simulated brain running on a chip? What if I’m just a super-smart human and instead of reading and writing in the conventional way, I work out the LLM math in my head to generate the output?
veggieroll
I mean, you're right in the abstract. If you train an LLM in a void and never do anything with the model, sure.
But that's not what anyone is doing. People train models so that someone can actually use them. So I'm not sure how your comment is helpful other than to point out that distinction (which doesn't make much difference in this case specifically or how copyright applies for LLM's in general)
blibble
> Copyright is not about acquisition, it is about publication and/or distribution. If I get a copy of Harry Potter from a dumpster, I can read it. If a company gets a copy of *all* books from a torrent, they can use it to train their AI.
"a person reading" and "computer processing of data" (training) are not the same thing
In MDY Industries, LLC v. Blizzard Entertainment, Inc., the court held that loading unlicensed copyrighted material from disk into memory constitutes "copying", and hence copyright infringement
Animats
This isn't really about "AI". It's about copying summaries. Google was fined in France for copying news headlines into its search results [1], and now has to pay royalties in the EU. Westlaw is a summarizing and indexing service for court case results. It's been publishing that info in book form since 1872.
Ross was trying to compete with Westlaw, but used Westlaw as an input. West's "Key Numbers" are, after a century and a half, a de facto standard. So Ross had to match that proprietary indexing system to compete. Their output had to match Westlaw's rather closely. That's the underlying problem. The court ruled that the objective was to directly compete with Westlaw, and that using Westlaw's output to do so was intentional copyright infringement.
This looks like a narrow holding, not one that generally covers feeding content into AI training systems.
[1] https://apnews.com/article/google-france-news-publishers-cop...
zozbot234
The case involves headnotes, not just key numbers. Your links provide examples of such headnotes, which make it very clear that a lot of human creativity and judgment is involved in authoring them; they're not purely factual information, like a phonebook. Thus, the headnotes are copyrighted, and translating them to a different language doesn't negate that copyright. This looks like a slam dunk case, but it has very little to do with AI training as such; the AI was only used to create a kind of rough indexing over the translated text.
If this was only about key numbers, it might have gone the other way because the fact-like element there is considerably greater.
mmooss
TR may have intentionally chosen an easy battle to begin their legal war.
jll29
Note this case is explicitly NOT about large language model type AI: Ross's product is just a traditional search engine (information retrieval system), not a neural transformer à la ChatGPT.
About judge Bibas: https://en.wikipedia.org/wiki/Stephanos_Bibas
preinheimer
Great. The stated goal of a lot of these companies seems to be “train the model on the output of humans, then hire us instead of the humans”.
It’s been interesting that media where watermarking has been feasible (like photography) have seen creators get access to some compensation, while text based creators get nothing.
rolph
Rotate similar [but different] fonts [or character pages] over each character; the sequence of fonts represents data, and thus a watermark.
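As a rough sketch of how that could look in rendered HTML (the font names and the bit payload here are invented; this is one possible realization, not a production scheme):

    # Encode a bit string by cycling two near-identical fonts per character.
    FONTS = ["SerifA", "SerifB"]  # two visually similar typefaces (invented names)

    def embed_watermark(text, bits):
        """Wrap each character in a span whose font choice encodes one bit."""
        spans = []
        for i, ch in enumerate(text):
            font = FONTS[int(bits[i % len(bits)])]
            spans.append('<span style="font-family: %s">%s</span>' % (font, ch))
        return "".join(spans)

    print(embed_watermark("headnote", "1011"))

Decoding would just read the font sequence back off the markup.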
WillAdams
but the font changes won't be expressed in the (plain text) output of the LLM.
yifanl
Presumably the font would render letters to look like different letters, making the text not useful to LLMs scraping the site but still useful to visual readers.
This would have detrimental effects for people who use screen readers or have their own stylesheets, of course.
simonw
Interesting to note from this 2020 story (when ROSS shut down) that the company was founded in 2014 and went out of business in 2020: https://www.lawnext.com/2020/12/legal-research-company-ross-...
The fact that it took until 2025 for the case to resolve shows how long the wheels of justice can take to turn!
dkjaudyeqooe
The fair use aspect of the ruling should send a chill down the spines of all generative AI vendors. It's just one ruling but it's still bad.
jug
My gut feeling is that this is bad news for open AI efforts, while playing into the hands of corporate behemoths able to strike expensive deals with major publishers and top them off with the public domain.
I'm not sure this signals the end of AI and a victory for the humans so much as it decides who gets to train the models.
oidar
Ross intelligence was creating a product that would directly compete against Thomson Reuters. Pretty clearly not fair use.
teruakohatu
Ross Intelligence was more a search interface with natural language and, probably, vector-based similarity. So I suspect they were hosting and using the corpus in production, not just training a model on it.
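A minimal sketch of that kind of vector-similarity lookup (the embedding is a crude letter-frequency stand-in for a real embedding model, and the corpus entries are invented):

    import math

    def embed(text):
        """Stand-in embedding: letter frequencies. A real system would
        call a learned sentence-embedding model here instead."""
        vec = [0.0] * 26
        for ch in text.lower():
            if ch.isascii() and ch.isalpha():
                vec[ord(ch) - ord("a")] += 1.0
        return vec

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    corpus = {
        "case_1": "negligence requires a duty of care",
        "case_2": "contract formation requires offer and acceptance",
    }
    query = "what duty of care does negligence require"
    q = embed(query)
    ranked = sorted(corpus, key=lambda k: cosine(q, embed(corpus[k])), reverse=True)
    print(ranked)  # expect case_1 to rank first for this query

If that's roughly the architecture, the headnote-derived text would have to live in the production index, not just in a training set.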
mmooss
Thomson Reuters chose to sue Ross Intelligence, not a company like Google or even OpenAI. I wonder how deeper pockets would have affected the outcome.
I wonder how the politics played out. The big AI companies could have funded Ross Intelligence, which could have threatened to sabotage their legal strategies by tanking and settling its own case in TR's favor.
mmooss
I missed this line from the article:
Even before this ruling, Ross Intelligence had already felt the impact of the court battle: the startup shut down in 2021, citing the cost of litigation.
r00fus
What's to say that's not the next step? First step, stop your competitors who are copying your IP.
Lawyers are gonna be happy is my thought.
https://archive.is/mu49I