
Authors seek Meta's torrent client logs and seeding data in AI piracy probe

rockemsockem

It seemed obvious to me for a long time before modern LLM training that any sort of training of machine intelligence would have to rely on pirated content. There's just no other viable alternative for efficiently acquiring large quantities of text data. Buying millions of ebooks online would take a lot of effort, and downloading data from publishers isn't something that can be done efficiently (even assuming tech companies negotiated and threw money at them); the only efficient way to access large volumes of media is piracy. The media ecosystem doesn't allow anything else.

brookst

I don’t follow the “millions of ebooks are hard” line of thinking.

If Meta (or anyone) had approached publishers with a “we want to buy one copy of every book you publish”, that doesn’t seem technically or commercially difficult.

Certainly Amazon would find that extremely easy.

spencerflem

Buying a book to read and incorporating its text in a product are two different things. Even if they bought the book, imo it would be illegal.

bawolff

There are situations where you are allowed to incorporate the text in your product (fair use).

The million dollar question is if this counts.

brookst

Perhaps, but by not even buying the book they’ve conceded the point.

IMO copyright law does not control what you can do with a book once you’ve bought a license, except for reproduction. It’s arguable that LLMs engage in illegal distribution, but that’s a totally different question from whether simple training is illegal even if the model is never made available to anyone.

amanaplanacanal

Maybe it is, maybe it isn't. The courts will decide.

dkjaudyeqooe

I think they'd ask why they'd want those millions of books. The publishers don't have to sell, and would be unlikely to, if they thought something like copyright violation was the goal.

Gud

Which would be fair. It’s not up to the tech oligopoly to dictate who gets to follow which laws.

diggan

> There's just no other viable alternative for efficiently acquiring large quantities of text data. [...] take a lot of effort [...] isn't a thing that can be done efficiently [...] only efficient way to access large volumes of media is piracy

Hypothetical: If the only way we could build AGI would be to somehow read everyone's brain at least once, would it be worth just ignoring everyone's wish regarding privacy one time to suck up this data and have AGI moving forward?

brookst

It’s a fun hypothetical and not an obvious answer, to me at least.

But it’s not at all a similar dilemma to “should we allow the IP empire-building of the 1900’s to claim ownership over the concept of learning from copyrighted material”.

diggan

> It’s a fun hypothetical and not an obvious answer, to me at least.

As I wrote it out, I didn't know what I thought either.

But now some sleep later, I feel like the answer is pretty clearly "No, not worth it", at least from myself.

Our exclusive control over access to our minds is our essential form of self-determination, and what it means to be an individual in society. Cross that boundary (forcefully, no less) and it's probably one of the worst ways you could violate a human.

Besides, I'm personally not hugely into the whole "aggregate benefits could outweigh individual harms" mindset utilitarians tend to employ; it feels like it misses thinking about the humans involved.

Anyways, sorry if the question upset some people; it wasn't meant to paint any specific picture but was more or less a thought experiment, as we inch closer to scarier stuff being possible.

vkou

> would it be worth just ignoring everyone's wish regarding privacy one time to suck up this data and have AGI moving forward?

Sure, if we all get a stake of ownership in it.

If some private company is going to be the main beneficiary, no, and hell no.

visarga

> Sure, if we all get a stake of ownership in it.

But we do, in the sense that benefits flow to the prompter, not the AI developers. The person comes with a problem, AI generates responses, they stand to benefit because it was their problem, the AI provider makes cents per million tokens.

AI benefits follow the owners of problems. That person might have started a project or taken a turn in their life as a result; the benefit is unquantifiable.

LLMs are like Linux, they empower everyone, and benefits are tied to usage not development.

impossiblefork

Wouldn't it be a bad thing, even if it didn't require any privacy invasion?

If it matched human intellectual productivity, that would ensure that human intelligence no longer earns you more money than it costs to run some GPUs, so human intelligence would presumably become optional.

BriggyDwiggs42

Could this agi cure cancer, and would it be in the hands of the public? Then sure, otherwise nah.

onemoresoop

> in the hands of the public

Would you trust a businessman on that?

pptr

If you don't, your geopolitical adversary might be the first to build AGI.

So in this scenario I could see it become necessary from a military perspective.

scarecrowbob

Ah geeze, I come to this site to see the horrors of the sociopaths at the root of the terrible technologies that are destroying the planet I live on.

The fact that this is an active question is depressing.

The suspicion that, if it were possible, some tech bro would absolutely do it (and smugly justify it to themselves using Roko's Basilisk or something) makes me actually angry.

I get that you're just asking a hypothetical. If I asked "Hypothetical: what if we just killed all the technologists" you'd rightly see me as a horrible person.

Damn. This site and its people. What an experience.

plsbenice34

Would the average person even be against it? I am the most passionately pro-privacy person that I know, but I think it is a good question because society at large seems to not value privacy in the slightest. I think your outrage is probably unusual on a population level.

jahsome

[flagged]

davidcbc

Fuck no

ben_w

IMO, if AI were more sample-efficient (a long-standing problem that predates LLMs), it would be able to learn from purely open-licensed content, of which I think Wikipedia (CC-BY-SA) would be an example? I think they'd even pass the share-alike requirements, given Meta is giving away the model weights?

https://en.wikipedia.org/wiki/Wikipedia:Copyrights

visarga

Alternatively, if they trained the model on synthetic data, filtered to avoid duplication, then no copyrighted material would be seen by the model. For example, turn an article into QA pairs, or summarize across multiple sources of text.

thesz

> trained the model on synthetic data

You get knowledge collapse [1] this way.

[1] https://arxiv.org/abs/2404.03502

wizzwizz4

Since this is Wikipedia, it could even satisfy the attribution requirements (though most CC-licensed corpora require attributing the individual authors).

nh2

> Buying millions of ebooks online would take a lot of effort

I don't understand.

Facebook and Google spend billions on training LLMs. Buying 1M ebooks at $50 each would only cost $50M.

They also have >100k engineers. If they shard the ebook buying across their workforce, everyone has to buy 10 ebooks, which will be done in 10 minutes.
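The numbers above are easy to sanity-check; a quick sketch (all figures are the commenter's assumptions, not real prices):

```python
# Back-of-envelope check of the comment's figures: 1M ebooks at an
# assumed flat $50 each, sharded across ~100k engineers.
num_books = 1_000_000
price_per_book_usd = 50                 # assumed price
total_cost_usd = num_books * price_per_book_usd
engineers = 100_000
books_per_engineer = num_books // engineers

print(f"total: ${total_cost_usd:,}")          # total: $50,000,000
print(f"per engineer: {books_per_engineer}")  # per engineer: 10
```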

shakna

Google also operates a book store, like Amazon. Both could process a one-off payment to their authors, and then draw from their own backends.

Manfred

For my thesis I trained a classifier on text from internal messaging systems and forums from a large consultancy company.

Most universities have had their own corpora to work with, for example: the Brown Corpus, the British National Corpus, and the Penn Treebank.

Similar corpora exist for images and video, usually created in association with national broadcasting services. News video is particularly interesting because they usually contain closed captions, which allows for multi-modal training.

the-rc

Google has scans from Google Books, as well as all the ebooks it sells on the Play Store.

lemoncookiechip

Wouldn't that still be piracy? They own the rights of distribution, but do they (or Amazon) have the rights to use said books for LLM training? And what rights would those even be?

XorNot

Literally no rights agreement covers LLMs. They cover reproduction of the work, but LLMs don't obviously do this; that the model transiently runs an algorithm over the text is superficially no different from the use of any other classifier or scoring system, like those already used by law firms looking to sue people for sharing torrents.

bawolff

> but do they (or Amazon) have the rights to use said books for LLM training?

The real question is: does copyright grant authors the right to control whether their work is used for LLM training?

It's not obvious what the answer is.

If authors don't have that right to begin with, then there is no way Amazon could buy it from them.

brookst

It’s a good question. Textbook companies especially would be pretty enthusiastic about a new “right to learn” monetization strategy. And imagine how lucrative it would be if you could prove some major artist didn’t copy your work, but learned from your work. The whole chain of scientific and artistic development could be monetized in perpetuity.

I think this is a dangerous road with little upside for anyone outside of IP aggregators.

majormajor

It means they have existing relationships/contacts to reach out to for negotiating the rights for other usages of that content. I think it negates (for the case of Google/Apple/Amazon who all sell ebooks) the claim made that efficiently acquiring the digital texts wouldn't be possible.

pdpi

Leveraging their position in one market to get a leg up on another market? No idea if it would stick, but that would be one fun antitrust lawsuit right there.

brookst

Fun fact: it’s only illegal to leverage a monopoly in one market to advance another. It’s perfectly legal for Coke to leverage their large but not monopolistic soft drink empire to advance their bottled water entries.

notachatbot123

> Buying millions of ebooks online would take a lot of effort

Let me put that into perspective:

- Googling "how many books exist" gives me ~150 million; no idea how accurate, but let's use that.

- Meta had a net profit of ~40 billion USD in 2023.

- That works out to a potential investment of ~250 USD per book acquisition.

That sounds like a ludicrously high budget to me. So yeah, Meta could very well pay. It would still not be ethical to slurp up all that content into their slop machine, but there is zero justification to pirate it all with this kind of money involved.
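For what it's worth, the division works out roughly as claimed (using the comment's own rough figures):

```python
# ~150M books ever published vs Meta's ~$40B 2023 net profit
# (both numbers are the comment's rough estimates).
total_books = 150_000_000
net_profit_usd = 40_000_000_000
budget_per_book = net_profit_usd / total_books
print(round(budget_per_book))  # 267 -- in line with the ~250 USD cited
```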

rockemsockem

The number 150,000,000 is laughably small.

Anyway, the problem is not money; it's technical feasibility and timelines.

You clearly don't think LLMs have any value though, so W/E

maeil

> In the most recent fiscal year, Alphabet's net income amounted to 73.7 billion U.S. dollars

Absolutely no way. Yup.

> Buying millions of ebooks online would take a lot of effort, downloading data from publishers isn't a thing that can be done efficiently

Oh no, it takes effort and can't be done efficiently, poor Google!

How can this possibly be an excuse? This is such a detached SV Zuckerberg "move fast and break things"-like take.

There's just no way for a lot of people to efficiently get out of poverty without kidnapping and ransoming someone; it would take a lot of effort.

thatcat

Copyright piracy isn't theft; try proving damages for a better argument.

maeil

Not my point, never said it is. Substitute that example with another criminal act.

Edit: Changed it just for you

dang

Recent and related. Others?

Zuckerberg appeared to know Llama trained on Libgen - https://news.ycombinator.com/item?id=42759546 - Jan 2025 (73 comments)

Zuckerberg approved training Llama on LibGen [pdf] - https://news.ycombinator.com/item?id=42673628 - Jan 2025 (191 comments)

Zuckerberg Approved AI Training on Pirated Books, Filings Say - https://news.ycombinator.com/item?id=42651007 - Jan 2025 (54 comments)

loeg

> “By downloading through the bit torrent protocol, Meta knew it was facilitating further copyright infringement by acting as a distribution point for other users of pirated books,” the amended complaint notes.

> “Put another way, by opting to use a bit torrent system to download LibGen’s voluminous collection of pirated books, Meta ‘seeded’ pirated books to other users worldwide.”

It is possible to (ab)use the bittorrent ecosystem and download without sharing at all. I don't know if this is what Meta did, or not.

wongarsu

However, since this is a civil case, they don't have to prove beyond reasonable doubt that Meta seeded torrents. If they did use torrents, the presumption would be that they used a regular bittorrent client with regular settings, and it would be on Meta to show they didn't do that.

anon373839

Meta can show this with testimony. (Employee: “I opened the settings and disabled sharing.”)

This is a difficult theory for the plaintiffs to prevail on, since they would have no evidence of their own to contradict Meta’s testimony to keep the issue in play. Which is why they’re asking for client logs - and good luck with that.

loeg

I am not commenting on any legal mechanics. Just technical details.

butterandguns

Hypothetically you could just not seed.

loeg

Right, that's what I'm talking about. I.e. https://github.com/pmoor/bitthief and similar.

cactusplant7374

That is probably exactly what they did if they were smart about it.

qingcharles

I was (partly) responsible for obtaining recordings for a Very Large Online Streaming Service(tm). Sometimes the studios would send us trucks filled with CDs. Sometimes they didn't have any easily accessible copies of the albums and would tell us to just "get it however..." which often involved SoulSeek, Limewire, etc.

We were not smart about it. We just found the stuff and hit download. To the point where there were days the corp Internet was saturated from too many P2P clients running.

crmd

I am trying to imagine the legal contortions required for the US Supreme Court to relieve Meta of copyright infringement liability for participating in a bit torrent cloud (and thereby facilitating "piracy" by others) in this case, while upholding liability for ordinary people using bit torrent.

Would love if any lawyers here can speculate.

courseofaction

They have been appointed by the president who Zuckerberg stood beside at the inauguration of the age of grift. Legal specifics don't feel very relevant anymore.

brookst

Not a lawyer, but I could see an argument that Meta’s use is transformative whereas just pirating to watch something is not. Not asserting that myself, just saying it seems a possible avenue.

wongarsu

The issue with bittorrent isn't so much that you are acquiring material but that you are also distributing it. There are cases where downloading copyrighted material is legal. But distributing it without consent never is, and is generally punished much worse.

alternatetwo

You can turn off uploading in some torrent clients, such as Transmission.
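For illustration, this is roughly what that looks like in Transmission's settings.json (key names as I recall them from the Transmission docs; exact keys may vary by version):

```json
{
  "speed-limit-up": 0,
  "speed-limit-up-enabled": true,
  "ratio-limit": 0,
  "ratio-limit-enabled": true
}
```

Whether a zero upload cap truly prevents every piece from being shared is debatable, which is presumably part of why the plaintiffs want client logs rather than taking the configuration on faith.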

philistine

The use might (might!) be transformative, but the work is copyrighted. How Meta copied it is an issue. Is the way they acquired it illegal?

After all, Google Books did not acquire their books through torrents. They got physical books.

protocolture

That's the argument they are using, but they were likely seeding.

If anything, if the seeding can be proven, there will be a lot of entities seeking restitution.

It's a technicality, but it's better than breaking fair use to appease authors.

russellbeattie

So here's a related thought...

Google is currently being sued by journalist Jill Leovy for illegally downloading and using her book "Ghettoside" to train Google's LLMs [1].

However, her book is currently stored, indexed and available as a snippet on Google Books [2]. That use case has been established in the courts to be fair use. Additionally, Google has made deals with publishers and the Author's Guild as well.

So many questions! Did Google use its own book database to train Gemini? Even if they got the book file in an illegal way, does the fact that they already have it legally negate the issue? Does resolving all the legal issues related to Google Books immunize them from these sorts of suits? Legally, is training an LLM the same as indexing and providing snippets? I wonder if OpenAI, Meta and the rest will be able to use Google Books as a precedent? Could Google license its model to other companies to immunize them?

Google's decade-long Books battle could produce major dividends in the AI space. But I'm not a lawyer.

1. https://www.bloomberglaw.com/public/desktop/document/LeovyvG...

2. https://books.google.com/books?id=bZXtAQAAQBAJ

svl7

While Meta's use of copyrighted material might actually fall under fair use, I wonder about the implications of having to use the whole source material for training purposes...

Let's say I quote some key parts of a copyrighted book in a way that complies with fair use for a work of mine. In order to find the quoted parts I have to read the whole book first. To read the book I first need to acquire it. If it was simply pirated, wouldn't that technically be the main issue, not the fair use part in their service? I am an absolute layman when it comes to the subject of law and am just thinking out loud. It seems to me that admitting to using pirated works could be more problematic in itself, regardless of the resulting fair use, when it is clear that the whole content had to be consumed / processed to get to the result.

kazinator

The mind boggles. Are the plaintiffs jumping to the conclusion that Meta must have used BitTorrent, based on the idea that whenever someone pirates anything anywhere using the Internet, it's always done with BitTorrent? Or is there actual evidence for this?

Maskawanian

There was employee communication expressing that it felt odd to use a torrent client on company computers. [1]

[1] https://timesofindia.indiatimes.com/technology/tech-news/whe...

qingcharles

There were comments published somewhere in the early days where it was specifically mentioned they used one of the big torrent files. That's where the authors got their idea from, I guess.

glitchc

I see a silver lining here: If Meta and/or Google's lawyers can successfully demonstrate in court that piracy does not cause harm, it would nullify copyright infringement laws, making piracy legal for everyone.

spencerflem

This would be poetic, but not gonna happen. It will be legal for big corps but not you and me

hresvelgr

You know, I actually don't think so. Gabe Newell famously said piracy is a distribution problem, so a court would likely have to acknowledge inadequate distribution methods hampering AI development. This sets great precedent for consumer piracy, especially for old media that isn't sold anymore. It may not be a criminal offence if best efforts aren't being made by the original copyright holders to distribute.

spencerflem

Their argument will be that piracy only applies to humans IMO. They're just doing what Google has been allowed to do for decades.

everforward

Meta isn't arguing that, though. They are arguing their use is one of the loopholes in copyright law where they aren't liable for the damages. Even them succeeding would only demonstrate that LLM training is transformative, and would not impact the common uses of piracy for average folk.

I would also be stunned if they make that argument. There is almost undeniably some number of dollars Google/Meta would have paid for the data. It may be less than publishers would want, but I don't think anyone would actually believe Google/Meta saying "if the data wasn't free, we just wouldn't have done AI".

Der_Einzige

Yup. As a full on IP abolitionist, I'm super excited by this. Information wants to be free. LLM providers training on things that folks don't want them to is a feature, not a bug. The tears of those mad about this are delicious and will ultimately be drowned out in the rain. Luddites and Copyright Trolls should be annihilated from the body politic with extreme prejudice.

bawolff

If you need the logs, doesn't that prove the point that the AI is not a derivative work?

Like, if you can't figure out which works were used to create the AI just by looking, it's hard to argue that they "copied" the work. Copyright is not a general prohibition on using the copyrighted work; it protects only the creativity contained within.

_Algernon_

I asked ChatGPT about a design pattern the other day and it plagiarized a paragraph verbatim, without attribution, from a textbook I'm also reading (Design Patterns).

It isn't difficult to show copyright infringement in these models. The assumption should be that copyright infringement has occurred until proven otherwise.

Just the fact that they are indiscriminately web scraping proves that. Just because it is publicly and (monetarily) freely available doesn't mean it isn't copyrighted.

Crestwave

This is why the "AI learns from materials just like a human does so it's not copyright infringement" argument always bothered me. A person won't recite full pages of word-for-word copies [1] from their head when you ask them something.

When I first tried Copilot, I asked it to write a simple algorithm like FizzBuzz and it ripped off a random repo verbatim, complete with Korean comments and typos. Image models will also happily generate near-identical [2] copies (usually with some added noise) of copyrighted images with the right prompt.

[1] https://bair.berkeley.edu/blog/2020/12/20/lmmem/

[2] https://www.theregister.com/2023/02/06/uh_oh_attackers_can_e...

bawolff

Copyright infringement and plagiarism aren't the same thing.

A human reproducing a paragraph word for word in an educational context would probably not be considered copyright infringement (although the lack of attribution might be problematic), in the US anyway. The US is somewhat unique in having very broad fair use when it comes to material used in an educational context, much broader than most other countries.

_Algernon_

One of the factors going into determining fair use is whether the use is commercial.

Another factor is the effect on the market of the original product.

Non-attribution + commercial use + affecting the marketability of the original product (which is what LLMs do) seems unlikely to be considered fair use by any existing precedent.

That being said IANAL.

samsin

Claiming that Meta distributed pirated works is still a copyright claim, but you're correct that it's seemingly irrelevant to the fair use argument (which the article acknowledges).

NBJack

Define "figure out" and "looking" for an LLM, a bundle of pseudo-neural pathways driven by parameters that number in the billions for sufficiently large models.

Earw0rm

No. Because you can't tell by inspecting the weights, and it can be hard to tell, AIUI, whether the capability to generate the output is present but suppressed by a safety mechanism, or doesn't exist at all.

Crestwave

They have already proven that copyrighted data was used for training but got struck down in court. The reason why they're asking for the torrent logs is because Meta torrenting the pirated data means they probably seeded and thus distributed it, which has a much greater impact legally than just downloading.

FireBeyond

Try to use any of the big players training models and see how quickly they remember how much they value copyright.

WhatsName

You mean OpenAI's infamous "you shall not train on the output of our model" clause?

Terr_

If that's contractually-enforceable in their terms-of-service... then I have my own terms-of-service proposal that I've been kicking around here for several weeks, a kind of GPL-inspired poison-pill:

> If the Visitor uses copyrighted material from this site (Hereafter: Site-Content) to train a Generative AI System, in consideration the Visitor grants the Site Owner an irrevocable, royalty-free, worldwide license to use and re-license any output or derivative works created from that trained Generative AI System. (Hereafter: Generated Content.)

> If the Visitor re-trains their Generative AI System to remove use of the Site-Content, the Visitor is responsible for notifying the Site Owner of which Generated Content is no longer subject to the above consideration. The Visitor shall indemnify the Site-Owner for any usage or re-licensing of Generated Content that occurs prior to the Site-Owner receiving adequate notice.

_________

IANAL, but in short: "If you exploit my work to generate stuff, then I get to use or give-away what you made too. If you later stop exploiting my work and forget to tell me, then that's your problem."

Yes, we haven't managed to eradicate a two-tiered justice system where the wealthy and powerful get to break the rules... But still, it would be cool to develop some IP-lawyer-vetted approach like this for anyone to use, some boilerplate ToS and agree-button implementation guidelines.

protocolture

I still don't think this has legs, precisely because of this case.

They accessed the material through piracy. They never accepted a TOS. They will probably get away with acquiring the material however they liked because of fair use.

The technicality is that they redistributed the material by seeding, which is a no-no.

That said, you might find inspiration in Midjourney's TOS. Anyone paying for less than a Business plan agrees that anyone else on the platform can sample your output and your prompt.

oakpond

It's incredibly hypocritical too. They have become rich by training on valuable data produced by others. Yet others are not allowed to train on valuable data produced by them.

visarga

Has anyone thought about orphaned books? Training on orphaned books might open them up to be reintegrated into culture instead of dying off unused and forgotten. Copyright kills works by making them irreproducible when the authors are not to be found.

bhouston

I am not sure you have to use torrents to pirate books. Pdfdrive is likely much more effective than torrents. Torrents are best for large assets or those that are highly policed by copyright authorities, but for smaller things torrents have little benefit.

protocolture

They have an email from a meta employee seeking clarification because it "felt wrong" to torrent the books.

crtasm

I think if you're downloading hundreds of thousands to millions of books you'll be dealing with some pretty large archives.

edit: books3.tar.gz alone is 37GB and claimed to have 197,000 titles in plain text.
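Those two figures imply surprisingly little text per title; a rough estimate:

```python
# Average compressed size per title implied by the figures above
# (37 GB archive, ~197,000 plain-text books).
archive_bytes = 37 * 10**9
titles = 197_000
avg_kb = archive_bytes / titles / 1000
print(round(avg_kb))  # ~188 KB of (compressed) text per book
```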

Marsymars

A publisher's entire library of books is a large asset.