Zuckerberg approved training Llama on LibGen [pdf]
146 comments
January 12, 2025 · mrtksn
Workaccount2
The legal problem is in outputting IP, I still have yet to see a convincing argument that training on copyrighted data is a breach of IP laws.
The trained models are trillionths the size of their training sets. There is no archive of copied data in them.
agilob
>argument that training on copyrighted data is a breach of IP laws.
You pay for access to materials, not using or remembering the material in its original format.
93po
Nearly every website does not charge me anything to retrieve information that is their intellectual property.
swatcoder
Training on copyrighted works licensed for such use is inarguably conforming.
Acquiring and using works without such license is just piracy. Whatever your stand on piracy is, most individuals and businesses are not free to incorporate it into their projects. Normal people have faced significant penalties for piracy, and conscientious business operators avoid it.
Sure would be disappointing to all those people if there were suddenly a ruling that said "well, but it's okay that these guys did it because they're filthy rich and went real hard with it"
Workaccount2
Again, models are not archives of data.
Llama 3.1 70B is around 45GB in size, despite being trained on likely hundreds of petabytes of data. And before you say it, they are not fancy compression algos either, the loss is so high they would be useless.
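The size comparison above can be made concrete with a little arithmetic. This is a minimal sketch that takes the comment's own figures at face value (they are estimates, not verified numbers) and assumes 100 PB as a lower bound for "hundreds of petabytes"; the names are illustrative.

```python
# Figures are the commenter's estimates, taken at face value (not verified).
MODEL_BYTES = 45e9        # "around 45GB"
TRAINING_BYTES = 100e15   # assumed lower bound for "hundreds of petabytes"


def size_ratio(model_bytes: float, training_bytes: float) -> float:
    """Return the model size as a fraction of the training-set size."""
    return model_bytes / training_bytes


ratio = size_ratio(MODEL_BYTES, TRAINING_BYTES)
print(f"model is {ratio:.1e} of the training set, about 1 part in {1 / ratio:,.0f}")
```

A different estimate of the training-set size changes the ratio proportionally, so the result is only as good as the inputs.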
JeremyNT
How can it possibly be the case that it's ok for meta to download and ingest the entire contents of libgen but it is not ok for an individual human to selectively download a single work and read it?
Whatever legal contortions used to justify this are, quite frankly, bullshit. This isn't how anything should work even if these companies can buy themselves a regulatory regime where it does.
JumpCrisscross
> We are approaching the "UBI or Guillotine" fork
Even in the 18th century, the French aristocracy mostly cruised through the Revolution from afar, surviving with fortunes largely intact to this day [1]. If the fork is UBI or guillotine, the selfish move by the private-jetting billionaire class—personally and financially more mobile and global than the French aristocracy ever was—is the latter.
> if there's no such thing as IP, reset the playing field for everyone
Your thesis is letting Altman, Zuckerberg and Musk have free rein would decrease inequality?
> IIRC that's what's happening in China
Not really [2].
[1] https://www.bbc.com/news/magazine-37655777
[2] https://www.chinaiplawupdate.com/2023/08/china-prosecutes-11...
lupire
Extremely misleading citation.
> Criminal trademark infringement made up the majority of IP crimes with 10,384 people prosecuted accounting for 88.9% of the total.
Trademark infringement is of a completely different character from copyright.
Trademark infringement is pure fraud and lying.
Take out trademark infringement, and you have only 1 prosecution per year per 700,000 people.
JumpCrisscross
> Take out trademark infringement, and you have only 1 prosecution per year per 700,000 people
What is it in America? Did we even have a single criminal non-trademark IP prosecution in 2024?
XorNot
The other way to look at it though is that revolution won't solve your problems, and Americans are far too confident that it will.
JumpCrisscross
> other way to look at it though is that revolution won't solve your problems, and Americans are far too confident that it will
Americans are largely not for a revolution because most of us aren’t idiots. There is idle chatter of a civil war, but that’s again (a) bluster (not that this can’t take on a life of its own) and (b) about consolidating control versus wholesale rebuilding the American class structure.
tyleo
I’m no advocate for revolution but the American problem is that our revolution actually worked. Americans freed themselves from a prior group of elites, unlike what the grandparent comment claims of the French elites.
bushbaba
unlike then, today global mobility is within the means of most of the western world. A French Revolution today could very well extend globally to identify and repatriate.
JumpCrisscross
> French Revolution today could very well extend globally to identify and repatriate
We have zero historical or contemporary precedent for this, and strong incentives for everyone else in the world to not play along. (As they did in sheltering the French aristocracy.)
In a hypothetical American revolution, foreign powers would be looking for their slice of the pie. To think through this dispassionately, imagine civil war breaking out in Russia or China. A second American revolution à la the first would put today’s billionaires and political elite in a room to draft a new constitution to their liking.
wesapien
Isn't UBI just going to raise inflation? People who don't need it will claim it and use the existing tax loopholes. Tax laws will need to be rewritten.
weatherlite
It's gonna be complex and messy. On the one hand yes, many people receiving UBI = inflation. On the other hand many highly paid software devs (And soon after - accountants, lawyers, marketers, sales people etc etc) are losing their incomes = very deflationary.
It's gonna be interesting that's for sure.
webmaven
The "U" in UBI is for "Universal". There is no means-testing. Everyone gets it regardless of assets or income, which means there is no need to spend any effort on checking whether someone is "poor enough".
Though the state would have to make sure the person receiving the benefit actually exists, is still alive, etc.
wesapien
I understand what UBI means but its effect is what I think people do not understand. Based on the Cantillon effect, UBI will just accelerate the separation between the rich and the poor.
tharmas
Indeed it would as the landlords would just raise rents accordingly.
We saw a bit of that with Covid cheques.
ColdTakes
There isn't going to be a revolution. Americans are all talk no action.
bdndndndbve
The idea that abolishing IP protections and letting AI companies run rampant is an offramp for wealth inequality is such a wild take to me?
Realistically billionaires are using racist and homophobic populism as a way to direct working class energy away from wealth inequality. Making people think "woke" is the reason why the earth is on fire and they can't have health insurance.
netfl0
Ah yes, because the working class is primarily concerned with protecting their intellectual property…
bdndndndbve
I think OP is coming from the "temporarily embarrassed billionaire" perspective where if only we had a libertarian hellscape without pesky laws they would be a funeral baron who runs Bartertown.
impomura
the working class is paywalled out of education because of IP laws that can seemingly be ignored by the AI companies
casey2
How can you get the definition of fairness so backwards? Giant corporations provide literally everything you take for granted and they should be punished because you are envious? I don't get it.
There is a reason everyone with over 130 IQ wants to work for them rather than starting their own companies.
saagarjha
People who are smart typically have better things to do than talk about their IQs. Or sell ads, for that matter.
Lucasoato
They shouldn’t be punished because people are envious, they should be punished because they’re using other people's intellectual property without an agreement in place.
We can’t protect IPs only when that benefits big corps. We should protect them always or accept that the world is better if we go in another direction, changing the rules for everybody.
visarga
Training on copyrighted data should be legally allowed
- of course exact reproduction of protected content is a no-no
- but learning is ok, as long as it is transformative. User prompts and responses are pushing the model outside its training distribution anyway - users add their own intent, making usage transformative
- when LLMs synthesize from multiple sources, the result is transformative
- if you try to protect expression it is meaningless now, but if you protect abstract ideas it kneecaps creativity
- the problems of copyright started with the advent of the internet, not with AI
- revenues from royalty are almost zero today, as each new content competes against an unbounded list of other works that have been accumulating for decades online
- because royalties are shit, creatives now focus on ads, and this leads to enshittification, attention grabbing junk everywhere, attention is scarce content is post-scarcity
- we actually like interactive participation more than passive consumption; we now edit Wikipedia, contribute to open source, have papers published for free on arXiv, use social networks where our comments are shared with the world, play games instead of reading books - it is another age, the interactive age
- AI is actually more than an infringement tool, it is useful for many legit purposes
- and AI is the worst possible infringement tool, it can hallucinate details, get things wrong; by comparison copying is free and easy and precise to the letter
So the idea that training is infringement is pretty abusive, it tries to make copyright be about abstractions which is wrong. We can't return to the 1990s, so we have to live with its demise. It's been dying for 3 decades already.
bdndndndbve
How can you get the definition of fairness so backwards? The King provides literally everything you take for granted and he should be punished because you are envious? I don't get it.
There's a reason why every vassal with a sizeable estate wants to be in the King's court rather than starting their own country.
boramalper
Alluded multiple times in the comments already but worth being explicit: Aaron Swartz killed himself 12 years ago yesterday for facing "a cumulative maximum penalty of $1 million in fines, 35 years in prison" [0] after downloading academic journal articles, which would be only a small percentage of what's available on LibGen.
Free for me, not for thee.
JumpCrisscross
> Free for me, not for thee
Swartz was charged with 35 to 50 years, realistically faced up to 10, and was offered 6 months if he pleaded guilty [1]. That offer moreover wasn’t the final offer.
Put another way, it’s not clear that the law is being applied to Zuckerberg differently than it was to Swartz given the law wasn’t actually ever applied to Swartz. (Or that they wouldn’t gladly trade this lawbreaking for $1mm in fines and a negotiation over penalties where the prosecution opens with 6 months jail.)
The prosecutor acted inappropriately in that case; MIT, more wildly so. That doesn’t, however, carry over to a transgression of the law given we never got to that stage.
[1] https://www.forbes.com/sites/forbesdev/2023/02/28/increase-w...?
inetknght
> it’s not clear that the law is being applied to Zuckerberg differently than it was to Swartz given the law wasn’t actually ever applied to Swartz
Has Zuckerberg actually been charged with something with equivalent potential consequences?
If not, then your statement is false on its face.
JumpCrisscross
> Has Zuckerberg actually been charged with something with equivalent potential consequences?
I didn’t say Zuckerberg has been subjected to what Swartz was. Swartz never wielded the nation-state level power of a billionaire—it’s difficult to imagine how he could be subjected to similar psychological stress.
I said the law isn’t being applied to Zuckerberg (or anyone who has downloaded LibGen, for that matter) differently because the law was never applied to Swartz. Given the unpopular Swartz prosecution ended Ortiz’s career, and the lack of recent criminal copyright cases, it’s unlikely anyone would attempt to apply it as they did then. To anyone, including Zuckerberg.
TL; DR If you dislike what Zuckerberg is doing, you’re probably advocating for a clarification of the law. If you like it, erm, nothing much to do here.
bagels
LibGen is the most generic name ever, had to look it up. Turns out that LibGen is a collection of pirated books.
perihelions
Shadow libraries are a heavily-discussed, recurring topic on HN,
https://hn.algolia.com/?query=libgen&type=all ("LibGen")
https://hn.algolia.com/?query=anna's%20archive&type=all ("Anna's Archive")
https://hn.algolia.com/?query=z%20library&type=all ("Z-Library")
A_D_E_P_T
It's not just a collection, it's the collection. It contains almost every scientific book ever printed, for one thing.
Frankly, it's a massive boon to researchers. It's like a top-tier research university library at your fingertips, and usually more convenient than the real thing.
reddalo
Also free. That helps.
But the sad state of the affairs is that if Aaron Swartz does it, he ends up dead; if Meta does it, everything is fine.
A_D_E_P_T
A lot of people would gladly pay. I'm a paying subscriber to Anna's Archive, which vastly improves the experience of that site. (It's borderline unusable without a subscription.)
Thing is, the Elsevier/Springer model makes it incredibly difficult to pay them. With single papers or book chapters in the $30-40 range, an afternoon's research can easily cost $600. (Note that the authors and reviewers don't get royalties on this, and the Editor-in-Chief of any given journal usually only makes a small stipend!)
There are services like DeepDyve, but they're intentionally gimped and difficult to use, because their user interface is 100% built around preventing you from downloading or screenshotting the papers you "rent"!
If the publishers set up a $100/month all-open-access program, and if the experience were at least halfway decent, I'd bet that a lot of people sign up. And that's not cheap!
mistercheph
Funny that the world where almost all human knowledge and art is free and accessible for everyone exists in parallel to one where articles about which McDonalds meal are you are paywalled, and funny which world civilized nations have chosen in order to protect The Suite Life of Zack & Cody and all the artists whose livelihoods depend on reruns of iCarly.
elashri
There are three positions around the usage of shadow libraries.
1- Whether we, as a society, should turn this into a broader discussion about knowledge publication, the publishing industry's greed, and its rent-seeking business model.
2- Big corporations shouldn't just ignore copyright law while maintaining the strongest copyright protections for themselves and going after small folks.
3- The usual argument about how LLM training is different from people actually using pirated textbooks because they are expensive (college and learning are hard and expensive, especially in places like Africa).
These are different angles and I think we can try to address all of them as they are not mutually exclusive. There are good arguments around point 3 on both sides. I don't think there is a good argument for allowing the status quo regarding the first point though. As for the second, it is more complicated to even discuss, especially on HN.
miohtama
We can rewrite copyright laws.
consumer451
It is very difficult for me to believe that Meta's recent political relations moves are not related to the open cases where Meta is the defendant.
qwertox
I don't understand your comment. This is about a lawsuit which shows that Zuckerberg OK'd the downloading and use of LibGen data. The case has existed since at least mid-2023 and was in the discovery phase until Dec 13, 2024. Shortly before the deadline Meta provided this new information, because they had to.
credit_guy
I guess the parent is saying that the new administration could be more business friendly in prosecuting this type of case. It might even drop this case altogether. But only if Meta is "friendly" to the administration too.
tux3
They're saying that Meta has been kowtowing to the incoming administration in hopes of getting in their good graces.
Rather famously, some elements of that administration are above the criminal code, so that's not implausible.
monsieurbanana
He's saying that he wants to pay Trump to win these lawsuits, which is a smart move as we know justice is for sale.
lupire
PP is referring to Facebook/Meta's new policy changes like banning intelligence/sanity-based insults on the Platform, but carving out an exception specifically and explicitly for transgender people as targets, and removing tampons from men's bathrooms.
bamboozled
The pivot would potentially help his cause though, would it not?
aprilthird2021
The antitrust one is most relevant as the new party in power would be gleeful to see it broken up but otherwise disagrees with the concept of antitrust
frob
"For my friends, everything; for my enemies, the law."
1vuio0pswjnm7
PDF: https://ia902305.us.archive.org/34/items/gov.uscourts.cand.4...
Text: https://www.courtlistener.com/docket/67569326/373/kadrey-v-m...
"Meta's request is preposterous. With one possible exception, there is not a single thing in those briefs that should be sealed."
"It is clear that Meta's sealing request is not designed to protect against the disclosure of sensitive business information that competitors could use to their advantage. Rather, it is designed to avoid negative publicity."
"If Meta again submits an unreasonably broad sealing request, all materials will simply be unsealed."
"One final comment. Between this sealing request and assertions in Meta's opposition brief such as "[t]hat document expressly discusses torrents and seeding", Opp. at 7, the Court is becoming concerned that Meta and its counsel are starting to travel down a familiar road. See In re Facebook, Inc. Consumer Privacy User Profile Litigation, 655 F. Supp. 3d 899 (N.D. Cal. 2023)."
Havoc
I guess the zuck would download a car...
Will be interesting to see where this lands, because all outcomes seem to have significant secondary effects.
mnky9800n
I would download a car
blooalien
https://cults3d.com/en/collections/best-stl-files-cars-3d-pr...
You're welcome! :P
Funes-
Yes. Every "AI" company is training their software on everything, regardless of what they claim, and making millions, billions of dollars on it.
visarga
Do you mean "making cents per million tokens"? And the benefit obviously belongs to the person who prompts, because they solve a task or get help. The value of that help can be from trivial to life changing.
emahhh
Exactly. Not surprised at all.
jpc0
I'm at a moral impasse on this specifically.
Llama is probably one of the few LLMs that doesn't generate an income for its maker; I can't exactly see how it would for Meta, other than by assisting their current ad generation.
Them being open weight isn't as good as what a "proper" open source LLM would be, but vs OpenAI which likely did the same thing it's significantly better.
On the other hand if copyright is enforced it should be enforced across the board, if I did the same thing while training an AI would I get the same treatment... Equal before the law and all that...
On the third point, I cannot legally obtain scientific papers without very significant cost to myself. My local libraries don't have a reasonable selection, and even the university libraries that will let me hold a membership as a member of the public or as an alumnus specifically exclude scientific papers from that membership, so you need to pay per paper.
resiros
I would argue that it's the right call: 1) it's in the world's best interest. I am running llama locally on my laptop, and the ability to have the distilled world's knowledge at your fingertips will generate much much more value than what it takes. 2) it does not 'take' any value from the book creators. No one's going to 'not buy a book' because an LLM has been trained with its content (in contrast you might argue that you are likely to not buy a book because you downloaded it from libgen).
Copyright laws are not millennia-old ethical laws that everyone agrees on (like don't steal), they are a modern human construct that was created for the greater good (incentivize creation), and we should revisit them with new tech.
lnkl
"1) it's in the world's best interest."
How is pleasing Meta's shareholders in the world's best interest?
TiredOfLife
How is using llama for free locally pleasing shareholders?
edoceo
> incentivize creation
Humans do that naturally (see: children)
The copyright laws are to protect profit.
ulbu
wat? facebook is going to 'not buy a book' for each book it's gone through. world's best interest that one of the wealthiest companies in the world don't pay their dues? world's best interest? when we know nothing about the societal and political effects llms will have in the hands of such people?
what are you rationalising about?
lxgr
In hindsight (considering how LLMs are trained etc.) it makes total sense, but "Big Tech vs. Big Copyright" is something I didn't have on my 2020s bingo card.
I wonder who will come out on top, and whether there will be any incidental improvements for consumers, but unfortunately I can imagine an "AI training exemption" all too well.
kccqzy
That's not surprising to me at all. Even in the 2000s there was a famous lawsuit about Google Books scanning books without approval and the proposed settlement was essentially allowing Google to sell scanned ebooks while giving copyright holders a cut[0]. At that time Google truly felt like a don't-be-evil corporation, and lawyers for the copyright holders wanted to give Google all this data as long as Google pays the copyright holders. In the 2020s however I cannot imagine any Big Tech company to have that don't-be-evil spirit and I also cannot imagine them voluntarily paying anything to copyright holders.
[0]: https://www.newyorker.com/business/currency/what-ever-happen...
dialup_sounds
Odds are that licensing gets streamlined into something like compulsory mechanical licensing and rates get negotiated into something that Big Tech and Big Media can both live with.
The whole conflict boils down to one party having piles of money and another party having something they want. That's not an intractable problem.
visarga
Maybe training on copyrighted data should be allowed if the size of the training set is huge, as each individual example is just a drop in the ocean compared to the full training set.
If you train a model with 20B parameters on 20T tokens, even with 1000 tokens per example, the model extracts about 1 byte of information per example. What is the value of 1 byte of copyright infringement?
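The back-of-envelope figure above can be checked with a short sketch. It assumes one byte of storage per parameter (for example 8-bit weights), which the comment does not state, and it treats the full parameter budget as spread evenly over the examples; the function and argument names are illustrative.

```python
def bytes_per_example(params: float, tokens: float, tokens_per_example: float,
                      bytes_per_param: float = 1.0) -> float:
    """Model capacity in bytes divided by the number of training examples.

    bytes_per_param=1.0 is an assumption (8-bit weights); use 2.0 for fp16.
    """
    examples = tokens / tokens_per_example
    return params * bytes_per_param / examples


# 20B parameters, 20T tokens, 1000 tokens per example
per_example = bytes_per_example(20e9, 20e12, 1000)
print(f"{per_example:.2f} bytes of model capacity per training example")
```

With 16-bit weights the same inputs give twice the figure, so the result is an order-of-magnitude estimate rather than a measurement of what a model actually retains.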
lxgr
By the same logic, pirating movies should be allowed as long as the person doing it watches enough of them for each individual one to be almost meaningless…?
webmaven
If by "pirating" you mean distributing copies, probably not. But if you mean downloading copies, probably yes. Consider the case of the film student studying the entire oeuvres of multiple directors.
visarga
Yes, if they watch a billion movies, it should be free to watch any copyrighted one.
CuriouslyC
Big tech will win, because what they're doing is already basically legal, and they're worming their way up the new administration's ass.
TiredOfLife
The hilarious thing is that the same people that freely pirate music, videos, books and articles are on the side of huge copyright hoarders like Disney
nonrandomstring
> "Big Tech vs. Big Copyright"
Indeed. But when do those intersect or diverge?
I don't blame him. What would you do? If I had a near perfect data training set of all the most useful books and a hungry AI to train, it would be the logical step.
The reason this is news is because of the stinking hypocrisy of it all. It's really the same topic as the Swartz-Altman discussion here [0], in that these giant companies want to have it both ways.
Where is Zuckerberg's shout-out for Alexandra Elbakyan? [1] Or for Brewster Kahle? Or any of the vast army of people who preserve and curate the vital culture of humanity by protecting it from intellectual property dungeons?
The colossal hypocrisy is that a company like Meta wishes to live under the protective umbrella of "Intellectual Property". It wants to stop me just stealing its stuff and setting up a better Facebook.
Were it exposed to the same rules it wishes to live by, it would be torn apart by vibrant and deserving competition within days.
All that Zuckerberg, Meta or OpenAI are doing is setting the ground for the abolition of intellectual property. They are literally the proverbial people who will buy the rope with which to hang themselves.
(Edit. that doesn't make sense insert <proverb about buying ropes that actually makes sense>)
criley2
I don't view Big Tech as being against copyright. They simply hold a position that they will not pay for something unless forced to ("make me" - a very common position for the powerful to hold).
In fact, I'd argue that Big Tech is pro copyright, because once they force the copyright holder to negotiate, the cost is irrelevant to them and they build a moat around that access.
For example, Google stole Reddit content for Gemini until Reddit was forced to the table, and now Google has a seemingly exclusive agreement around Reddit data for AI purposes.
jsheard
> I don't view Big Tech as being against copyright. They simply hold a position that they will not pay for something unless forced to
Yep, the contradiction between them feeling entitled to use anything they want for training, while simultaneously having license terms which forbid using the output of their models to train other models is pretty glaring. Information wants to flow freely but only in one direction apparently.
lxgr
> having licenses which forbid using the output of their models to train other models
I haven't been following it closely, but aren't there already court rulings saying that generative AI output by itself is not copyrightable?
swatcoder
Yup. For Big Tech, the ideal outcome of these cases isn't that copyright is widely or deeply undermined as they rely heavily on it themselves (let alone how their customers and investors benefit from it).
Their ideal outcome is that there's some narrow carveout that gives them permission to ignore copyright where they want to, while extending similar permission to as few/irrelevant others as possible.
n144q
> I'd argue that Big Tech is pro copyright
I agree but for a different reason -- cost is actually relevant, in the sense that only the biggest player can afford to pay for the copyrights. If you are a small player, however good your tech stack or your model is, if you can't afford it, you can't compete with Google.
52-6F-62
In the past we called that tyranny, when a power thought it could act entirely without restraint.
Now I guess it’s defended as good business and good science by so many flunkies.
Knife's edge stuff. Tech people should all be reading the books, not Mark’s steamroller.
There goes the gravy train
oidar
A few questions I have on this:
Is it possible for an LLM of llama3/sonnet3.5/GPT4o quality to be trained on freely available works?
Are there other types of LLMs that can be trained on smaller data sets with comparable quality?
If that is not possible, and the courts shut down training on copyrighted works - what position will the "rule following" nations be in compared to nations that don't follow those rules?
Workaccount2
It's not even clear that training on copyrighted data even is a breach of IP law. People start their arguments on that assumption so they have an argument, but in reality that question isn't even resolved yet, and frankly it looks like the courts will likely determine that it's not a breach of IP law to train on copyrighted data (but is a breach to output it).
We are approaching the "UBI or Guillotine" fork simply because rules and regulations work selectively. Just like with the "If we pay for copyright our business becomes impossible" defense, this is yet another vast unfairness against those who had to transfer their resources to learn a skill. An awful lot of people had a hard life or got into debt for things that big tech is immune from.
Or maybe we will come to the conclusion that all this works only if there's no such thing as IP, reset the playing field for everyone, and anyone who wants to make money will have to actually work for it every single time. IIRC that's what's happening in China and it's how they surpassed the US in innovation.
Technically, that's a deregulation - just not the kind of deregulation the big tech is pushing for. Maybe the next time there's a graph showing how regulations made the EU lag behind, add the graph of China too to spice things up.
With so many technical people out of work and promises to make the employed ones obsolete too, it can be a good idea to let people build things instead of unfairly concentrating even more power onto kleptocratic entities.