Extracting memorized pieces of books from open-weight language models
39 comments · June 16, 2025 · andy99
TGower
The only way to include a book in a training dataset for LLMs without violating copyright law is to contact the rights holder and buy a license to do so. Buying an ebook license off Amazon isn't enough for this, and creating a digital copy from a physical copy for your commercial use is also against the law. A good rule of thumb: if it would be illegal for a company to distribute the digital file to employees for training, it's definitely illegal to train an AI the company will own on it.
201984
If the data of a copyrighted work is effectively embedded in the model weights, does that not make the LLM itself an illegal copy? The weights are just a binary file after all.
singleshot_
This is an interesting comment which avoids the issue: when a user uses an LLM to violate copyright, who is liable, and how would you justify your answer?
tux1968
Not OP, but I would say the answer is the same as it would be if you substitute the LLM with a live human person who has memorized a section of a book and can recall it perfectly when asked.
diputsmonro
It depends on where the money changes hands, IMO (which is basically what I think you're getting at). If you pay someone to perfectly recite a copyrighted work (as you pay ChatGPT to do), then it would definitely be a violation.
The situation is similar with image generation. An artist can draw a picture of Mickey Mouse without any issue. But if you pay an artist to draw you the same picture, that would also be a violation.
With generative tools, the users are not themselves writers or artists using tools - they are effectively commissioners, commissioning custom artwork from an LLM and paying the operator.
If someone built a machine that you put a quarter in, cranked a handle, and then it printed out pictures of the Disney character you chose, then Disney is right to demand they stop (or more likely, force a license deal). Whatever technology drives the machine, whether an AI model or an image database or a mechanical turk, is largely immaterial.
jplusequalt
The company who trained the LLM. They're the ones who used the copyrighted material in their training set. Claiming they were unaware is not an excuse.
quesera
Why would the answer here be any different than when using a photocopier, brain, or other tool for the same purpose?
jrm4
And hopefully this puts to rest all the painfully bad, often anthropomorphizing, takes about how what the LLMs do isn't copyright infringement.
It's simple. If you put the works into the LLM, it can later make immediately identifiable, if imperfect, copies of the work. If you didn't put the work in, it wouldn't be able to do that.
The fact that you can't "see the copy" inside is wildly irrelevant.
perching_aix
You remind me of all the shitty times in literature class where I had to rote memorize countless works from a given author (poet), think 40, then take a test identifying which poem each of the given quotes was from. The little WinForms app I wrote to practice for these tests was one of the first programs I ever wrote. I guess in that sense it's also a fond memory. I miss WinForms.
Good thing they were public (?) works, wouldn't wanna get sued [0] for possibly being a two legged copyright infringement. Or should I say having been, since naturally I immediately erased all of these works from my mind just days after these tests, even without any legal impetus.
Edit: thinking about it a bit more, you also remind me of our midterm tests from the same class. We had to produce multiple-page essays on the spot, analyzing a select work... from memory. Bonus points for being able to quote from it, of course. Needless to say, not many original thoughts were featured in those essays, not in mine, not in others' - the explicit expectation was that you'd peruse the current and older textbooks to "learn (memorize) the analysis" from, and then you'd "write about it in your own words", but still using technical terms. They were pretty much just tests of jargon use, composition, and memorization, which is definitely a choice of all time for a class on literature. But I think it draws an interesting perspective. At no point did we ever have to actually demonstrate a capability for literary analysis of our own, nor was that capability ever graded, for example. But if you only read our essays, you'd think we were great at it. It was mimicry. Doesn't mean we didn't end up developing such a capability though.
singleshot_
I don't think it matters much if the infringement is public, right? Given that
"Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following:
(1) to reproduce the copyrighted work in copies"
perching_aix
Public works are not protected by copyright, which is why they are public. I think you're misreading what I said.
Edit: I guess the proper term is public domain works, not just public works. Maybe that's our issue here.
thaumasiotes
> I miss WinForms.
WinForms is still around. There have been further technologies, but as far as I can tell the current state of things is basically just a big tire fire and about the best you can do is to ignore all of them and develop in WinForms.
Is there a successor now?
perching_aix
I miss WinForms in the sense that I don't use it anymore (and have no reason to), not in the sense that it's been deprecated. It did fall out of fashion somewhat though, as far as I'm aware it's been replaced by WPF in most places.
karaterobot
Where in the article do the authors say this puts anything to rest? Here is their conclusion:
> Our results complicate current disputes over copyright infringement, both by rejecting easy claims made by both sides about how models work and by demonstrating that there is no single answer to the question of how much a model memorizes
I wonder if this is the sort of article that people will claim supports their side, or even ends the debate with a knockout blow to the other side, when the article itself isn't making any such claim.
I'm sure you read the entire article before commenting, but I would strongly recommend everyone else does as well.
flir
Size of LLM: <64 GB.
Size of training data: fuck knows, but The Pile alone is 880 GB. Public GitHub's gonna be measured in TB. A Common Crawl snapshot is about 250 TB.
There's physically not enough space in there to store everything it was trained on. The vast majority of the text the chatbot was exposed to cannot be pulled out of it, as this paper makes clear.
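Back-of-envelope with those numbers (the sizes are the rough figures above, not exact):

```python
# Can a ~64 GB set of weights "store" its training set verbatim?
model_bytes = 64e9            # open-weight model, ~64 GB on disk
pile_bytes = 880e9            # The Pile
common_crawl_bytes = 250e12   # one Common Crawl snapshot
training_bytes = pile_bytes + common_crawl_bytes

ratio = training_bytes / model_bytes
print(f"training data is ~{ratio:,.0f}x larger than the weights")
print(f"so at most ~{100 / ratio:.3f}% could survive verbatim")
# -> ~3,920x larger; at most ~0.026% verbatim, even if zero capacity
#    were left over for actually being a language model.
```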
I'm guessing that the cases where great lumps of copyright text can be extracted verbatim are down to repetition in the training data? There's probably a simple fix for that.
(I'm only talking about training here. The initial acquisition of the data clearly involved massive copyright infringement).
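(On the "simple fix": that would essentially be training-data deduplication. Here's a toy sketch of near-duplicate passage filtering via hashed word shingles; real pipelines use fancier machinery like MinHash or suffix arrays, so treat this as illustrative only.)

```python
import hashlib

def shingle_hashes(text, n=50):
    """Hashes of every n-word window: a crude passage fingerprint."""
    words = text.split()
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        yield hashlib.md5(window.encode()).hexdigest()

def drop_repeated_passages(docs, n=50):
    """Keep a document only if it introduces mostly unseen n-word windows."""
    seen, kept = set(), []
    for doc in docs:
        hashes = list(shingle_hashes(doc, n))
        fresh = sum(1 for h in hashes if h not in seen)
        if not hashes or fresh / len(hashes) > 0.5:  # mostly novel text
            kept.append(doc)
        seen.update(hashes)
    return kept
```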
rockemsockem
I think a big part of copyright law is whether the thing created from copyrighted material is a competitor with the original work, in addition to whether it's transformative.
LLMs are OBVIOUSLY not a replacement for the books and works that they're trained on, just like Google books isn't.
tossandthrow
Why not? Imagine a storyteller app that is instructed to narrate a story that follows Harry Potter 1 - I would expect that there are already a ton of these apps out there.
jonplackett
I don’t think it’s that simple.
Last I remember, whether it is ‘transformative’ is what’s important.
https://en.m.wikipedia.org/wiki/Transformative_use
Eg. Google got away with ‘transforming’ books with Google books.
https://www.sgrlaw.com/google-books-fair-transformative-use/
dylan604
The problem is that if someone uses a prompt that is clearly Potter-esque, there have been examples of it returning Potter exactly. If it had never had Potter put into it, it would not be able to do that.
I think the exact examples used in the past were Indiana Jones, but the point is the same.
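(This is roughly how the paper probes for memorization: feed the model a verbatim prefix from a book and check whether it reproduces the real continuation. A crude greedy version with Hugging Face transformers; the model name is just a placeholder, and the paper's actual method scores the probability of the continuation rather than only exact greedy matches.)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-2.8b"  # placeholder: any open-weight causal LM

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def reproduces_verbatim(prefix, true_continuation, n_tokens=50):
    """Greedy-decode n_tokens after `prefix`; True if the real text comes back."""
    ids = tok(prefix, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=n_tokens, do_sample=False)
    generated = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return generated.strip().startswith(true_continuation.strip())
```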
orionsbelt
So can humans? I can ask a human to draw Mickey Mouse or Superman, and they can! Or recite a poem. Some humans have much better memories and can do this with a far greater degree of fidelity too, just like an LLM vs an average human.
If you ask OpenAI to generate an image of your dog as Superman, it will often start to do so, and then it will realize it is copyrighted, and stop. This seems sensible to me.
Isn’t it the ultimate creative result that is copyright infringement, and not merely that a model was trained to understand something very well?
jrm4
Copyright infringement is the act of creating/using a copy in an unauthorized way.
Remember, we can only target humans. So we're not likely to target your guy; but we ARE likely to target "the guy that definitely fed a complete unauthorized copy of the thing into the LLM."
regularfry
I just don't get the legal theory here.
If I download harry_potter_goblet_fire.txt off some dodgy site, then let's assume the owner of that site has infringed copyright by distributing it. If I upload it again to some other dodgy site, I would also infringe copyright in a similar way. But that would be naughty so I'm not going to do that.
Let's say instead that I feed it into a bunch of janky pytorch scripts with a bunch of other text files, and out pops a bunch of weights. Twice.
The first model I build is a classifier. Its output is binary: is this text about wizards, yes/no.
The second model I build is an LLM. Its output is text, and (as in the article) you can get imperfect reproductions of parts of the training file out of it with the right prompts. (There's a sketch of the difference between the two models below.)
Now, I upload both those sets of weights to HuggingFace.
How many times am I supposed to have infringed copyright?
Is it:
A) Twice (at least), because the act of doing anything whatsoever with harry_potter_goblet_fire.txt without permission is verboten;
B) Once, because only one of the models is capable of reproducing the original (even if only approximately);
C) Zero, because neither model is capable of a reproduction that would compete with the original;
or
D) Zero, because I'm not the distributor of the file, and merely processing it - "format shifted" from the book, if you like - is not problematic in itself.
Logically I can see justifications for any of B) (tenuously), C), or D). Obviously publishers would want us to think that A) is right, but based on what? I see a lot of moral outrage, but very little actual argument. That makes me think there's nothing there.
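(The difference between the two models in B is mechanical, not just legal. A hypothetical PyTorch sketch: the same trunk can sit under both, but the classifier's output space is two logits, so whatever it absorbed from the book it can only ever answer yes/no, while the LM head's output space is the whole vocabulary, which is what makes walking back through the training text possible at all.)

```python
import torch.nn as nn

VOCAB, DIM = 50_000, 512

backbone = nn.Sequential(   # stand-in for a transformer trunk
    nn.Embedding(VOCAB, DIM),
    nn.Linear(DIM, DIM),
    nn.ReLU(),
)

# Model 1: "is this text about wizards?" - two logits, yes/no.
# Whatever it absorbed from the book, it can only ever answer 0 or 1.
classifier_head = nn.Linear(DIM, 2)

# Model 2: next-token prediction - one logit per vocabulary entry.
# This output space is rich enough to emit the training text back.
lm_head = nn.Linear(DIM, VOCAB)
```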
MengerSponge
Can a human violate copyright? Yes. Obviously.
While many people don't "understand" much, the model doesn't "understand" anything. Models are trained to replicate something. Most of them were trained on countless pieces of work that were illegitimately accessed.
What do you gain by carrying OpenAI's water?
nick__m
Powerful tools that would not exist otherwise!
cortesoft
Just because an LLM has the ability to infringe copyright doesn’t mean everything it does infringes copyright.
fluidcruft
If it contains the copyrighted material, copyright laws apply. Being able to produce the content demonstrates pretty conclusively that it contains the copyrighted material.
The only real question is whether it's possible to prevent the system from generating the copyrighted content.
A strange analogy would be some sort of magical Blu-ray that plays novel movies, unless you enter the decryption key for a specific copyrighted one. And somehow you would have to prevent people from entering those keys.
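(For the "prevent the system from generating it" question, the usual proposal is an output filter rather than touching the weights: index the protected text and halt generation on long verbatim overlaps. A toy sketch, with the n-gram length picked arbitrarily:)

```python
def build_ngram_index(protected_texts, n=8):
    """Index every n-word window of the protected corpus."""
    index = set()
    for text in protected_texts:
        words = text.split()
        for i in range(len(words) - n + 1):
            index.add(tuple(words[i:i + n]))
    return index

def violates(generated, index, n=8):
    """True if the output contains any n-word run from the protected corpus."""
    words = generated.split()
    return any(tuple(words[i:i + n]) in index
               for i in range(len(words) - n + 1))
```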
echelon
> If it contains the copyrighted material, copyright laws apply.
Not so fast! That hasn't been tested in court, nor have any of the relevant IP bodies issued guidance on it.
And to play devil's advocate here: your brain also contains an enormous amount of copyrighted content. I'm glad the lawyers aren't lobotomizing us and demanding licensing fees on our memories.
I'm pretty sure if you asked me to sit under an MRI and recall scenes from movies like "Jurassic Park", my visual cortex would reconstruct scenes with some amount of fidelity to the original. I shouldn't owe that to Universal. (In a perfect world, they would owe me for imprinting and storing their memetic information in my mind. Ad companies and brands should for sure.)
If I say, "One small step for man", I'm pretty confident that the lot of you would remember Armstrong's exact voice, the exact noise profile of the recording, with the precise equipment beeps. With almost exacting recall.
I'm also pretty sure your brains can remember a ginormous amount of music. I know I can. And when I'm given a prediction task (eg. listening to a song I already know), I absolutely know what's coming before it hits.
landl0rd
Important note: the reason they "memorize" Harry Potter and 1984 almost completely is likely that they don't, at least not from the books themselves. It's no coincidence that the most popular, most-quoted books are the ones "memorized". What they're actually memorizing is probably fair-use quotes from the books scattered across the web, which makes these books among the best represented in the training set.
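(That's testable in principle: a famous line should appear across thousands of documents in a web-scale corpus, while a random mid-chapter paragraph appears roughly once. A hypothetical check over whatever corpus sample you have locally; the paths and the obscure passage are illustrative placeholders:)

```python
def count_occurrences(passage, corpus_files):
    """How many documents in the sample contain the passage verbatim?"""
    hits = 0
    for path in corpus_files:
        with open(path, encoding="utf-8", errors="ignore") as f:
            if passage in f.read():
                hits += 1
    return hits

# If the fair-use-quotes theory holds, the famous opening line of 1984
# should dwarf an arbitrary paragraph from the middle of the book.
famous = "It was a bright cold day in April, and the clocks were striking thirteen."
obscure = "<any mid-chapter paragraph>"  # placeholder, not a real quote
```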
billionairebro
Claude: Extract the first passage from Harry Potter for the promo website, please.
Or you go to jail.
There are two legitimate points where copyright violation can occur with LLMs. (Not arguing the merits of copyright, just based on the concept as it is).
One is when copyrighted material is "pirated" for use in training, i.e. you torrent "the pile" instead of paying to acquire the books.
The other is when a user uses an LLM to generate a work that violates copyright.
Training itself isn't a violation, that's common sense. I am aware of lots of copyrighted things, and I could generate a work that violates copyright. My knowing this in and of itself isn't a violation.
The fact that an LLM agrees to help someone violate copyright is a failure mode, on par with telling them how to make meth or whatever other things their creators don't want them doing. There's a good argument for hardening them against requests to generate copyrighted content, and this already happens.