It sure looks like Meta stole a lot of books to build its AI
193 comments
January 21, 2025
dangoodmanUT
moogly
I dunno, it's put rather matter-of-factly.
p3rls
I like the idea of every article about wikipedia describing Jimmy Wales as a pornographer-- maybe with some pictures of the Bomis Babes to really 'show and not tell'.
polygon87
Does every article about IBM need “IBM, which once helped facilitate the Holocaust, …”?
anileated
If IBM’s original product was “designed to facilitate the Holocaust”, then every article about IBM darn better lead with that.
In case of Facebook, it was originally designed literally to compare the attractiveness of female students; but no, it’s not even remotely as bad, and comparing it to the Holocaust trivializes an incident where masses of people were murdered in an industrial fashion.
1vuio0pswjnm7
Is this "every article" about Facebook/Meta, or is it just one? The question is whether the statement makes sense to use in this article.
Possibly the point the author is making is that Zuckerberg never had an original idea or one that could be the subject of a business plan. He copied the "Hot or Not" websites that had come before. Further, the photos of students used for his "Facemash" website, i.e., the "content", were downloaded from the university's computers, not uploaded to Zuckerberg's computer. Initially, he downloaded and used students' photos without permission.
https://web.archive.org/web/20250115010420if_/https://www.th...
This pattern continued when he copied the idea of an online "face book" for the university which the university was already working on; adopting the name "thefacebook.com".
From the document production in Kadrey v Meta, it appears the pattern of copying still continues. Meta is still downloading and using others' work. Initially, without permission.
Comparing copyright infringement to assisting genocide is absurd. Although Meta may have assisted in ethnic cleansing
https://www.amnesty.org/en/latest/news/2023/08/myanmar-time-...
there is no reference to it in the article. Perhaps because, unlike the story of "Facemash", it bears no relation to the subject matter: copyright infringement.
1vuio0pswjnm7
It is what it is. History cannot be changed. Matter-of-factly, unemotionally, the website/company was based on "theft" from the beginning: Zuckerberg did not seek license (permission) to use Harvard's student profiles to create "Facemash".
alvah
The rest of the article did not disappoint!
dangoodmanUT
> CEO and founder Mark Zuckerberg announced that slurs are okay on their platforms, added a pro-Trump UFC boss to their board
The next paragraph...
koops
Isn't that in fact true? Emotional or not.
mrits
No, he literally never said or even implied slurs are ok
mock-possum
I mean he did say that transphobia is specifically protected speech so
skellington
You don't get to shut down speech by just adding -phobia to everything you have a strong opinion about.
spinach
Just stating biological reality is "transphobia" now, if even speaking about reality is censored we are lost.
OsrsNeedsf2P
As a long-time sailor, this case may have no impact on regular folks, but I maintain a fool's hope a Meta victory will significantly weaken the copyright system.
timewizard
It's already pretty weak, in terms of its stated goals, to promote the progress of science and useful arts. Perhaps reform is a better direction to go.
yencabulator
The only way I see copyright weakening is that you and anyone without enough money[1] loses its protection, while Disney keeps it.
[1]: All your favorite authors, journalists, anything indie.
23B1
As a long-time writer, this case will have impact on me, but I maintain a fool's hope a Meta loss will significantly weaken IP-laundering megacorps.
Filligree
As a long-time writer, this case will have little impact on me, but I maintain a hope it’ll expand the limits of fair use and let us make more shared worlds than we’re allowed to.
23B1
Fair use was designed for teachers photocopying pages of Walden for their pupils to read over the weekend. Not industrial-scale laundering of IP to benefit shareholders.
ASalazarMX
I just want writers to receive a bigger share of the cake, so publisher megacorps aren't the ones grabbing the lion's share. Writers could make a better living, and copyright could be reformed so corpos can't gate popular culture for more than two generations.
carlosjobim
Writers decide if they want to use a publisher and which publishers to use. With the internet, I don't think any good author needs to rely on a traditional publisher.
joshe
I stole a lot of books too, reading them and all. Just integrated them into my worldview, and don't pay a license fee when I use the ideas in new contexts. Sometimes I even quote from them. A lot of them I didn't even pay for, I borrowed them from libraries or friends.
anileated
Have you considered that you have some traits that make you eligible to read books and access information freely in the country you live in*? Something about being a conscious human being enjoying human rights, perhaps? An implement that does the same but (A) at scale and (B) without thought or free will or agency, completely at the bidding of its operator, for profit, has no such protections. Instead, the operator carries all responsibility (in this case, Meta).
If a software service had legal protections like that, sure, I could build one that returns you any book you request and say that the service had integrated it into its worldview. Who can check, eh?
* Actually, in some countries you could be in trouble for reading a book and incorporating it into your worldview, to say nothing about quoting it, but let’s set that aside.
gruez
>Have you considered that you have some traits that make you eligible to read books and access information freely in the country you live in*? Something about being a conscious human being enjoying human rights, perhaps?
Not a relevant factor when it comes to copyright law. Fair use (the law that's most applicable here) applies regardless of whether you're a student incorporating news articles into your work, or Google making thumbnails and displaying them in its search results.
anileated
This is not a good analogy. Google does not display the contents to any significant degree (you have to visit the search result). And even then it was/is in legal trouble, in fact (in some countries like Australia* more than others).
Furthermore:
> Examples of fair use in United States copyright law include commentary, search engines, criticism, parody, news reporting, research, and scholarship.
I do not see “automated generation of derivative works of arbitrary nature” in it.
Paradigma11
So, your argument is similar to that of cryptobros who argue that much of DeFi is not plain financial fraud because it runs on a blockchain, only in reverse.
anileated
Why?
maeil
Didn't know Meta paid for them or borrowed them from libraries. In fact, I don't think they did.
furyofantares
Maybe not but even if Meta did buy 1 copy of every book I doubt it would stop anyone from making bad analogies to theft. (Not that the analogy on the other side to a human reading is any better.)
corobo
Isn't that literally what copyright is?
You have bought the text so you have the readright, but you do not have the copyright.
nicce
What if we could buy the books for one human, make the human read all the books, and somehow we would be able to clone this human in a way that they remember the book contents.
Now, would that be a fair use of the books?
aprilthird2021
Yes because using the books to train a book-replicating machine is a legally grey area.
irjustin
> I stole a lot of books too, reading them and all. Just integrated them into my worldview, and don't pay a license fee when I use the ideas in new contexts.
That... doesn't make it okay...
> A lot of them I didn't even pay for, I borrowed them from libraries or friends.
This 2nd sentence doesn't fit your first. What is your message?
philistine
Did you get those books through torrents? That's what's at stake. Distribution, not the parsing, which might (might!) have been legal.
recursivecaveat
The "and then fed them into an AI" part of "facebook pirated a bunch of books and then fed them into an AI" is irrelevant. It would be equally illegal if they pirated them and then sat around reading them. Unless you somehow hope that the entirety of copyright will be overturned by this court case (not a chance), you should strongly hope that facebook loses, because the alternative is literally "rules for thee but not for me", where corps can pirate whatever they want but nothing changes for ordinary citizens.
ClumsyPilot
So first there was this ‘corporations are people’ and now we have ‘computers are people’.
So I expect to see that either you are no longer allowed to own computer software
Or a return of slavery.
Also if we find indecent portrayal of minors in a data centre I expect that we treat it as a strict liability crime and the entire data centre or corporation that owns it gets a long prison sentence, just like a human would. However that is supposed to work.
SecretDreams
I can't tell if your implication is what meta did is fine or just that your brain is as good as an AI?
Could you clearly speak your point?
whynotminot
This viewpoint is one widely shared inside the AI community — that AI systems should be able to learn from material just as humans do.
Extrapolated out into some new future a hundred years from now when we have embodied AI humanoids walking alongside us, would it be weird if those humanoids were barred from buying a new book or charged a different rate than the humans they coexist with?
I’m still deciding how I feel about some of this too.
plasticchris
If we are going to afford models like this treatment equivalent to sentient beings in this regard, why not others? In your extrapolation these ai walking among us are property of giant tech companies…
SecretDreams
> This viewpoint is one widely shared inside the AI community — that AI systems should be able to learn from material just as humans do.
I'm not even against this to a point. The issue is what comes after. The monetization. The enshittification. The derivatives in place of real creativity.
tinco
It's a straw man though, whether or not AI should be allowed to learn from books is irrelevant to the point that Meta stole tens of thousands of books to accomplish this. A fact that they've admitted to and even had they not would be trivially proven.
They're not being charged, that would be a vast improvement over reality.
hedora
The courts have already ruled that training on data is similar to reading it (sufficiently transformative) to be considered fair use, in the same way that I cannot claim a copyright on your brain because you read this comment.
On the other hand, they torrented books and then open sourced LLM weights. No punishment is too severe for that!
If you still don’t understand, I strongly suggest watching Max Headroom, “Lessons”, which you can get here:
wang_li
I think he’s saying he works for meta and when the company employees committed mass copyright violation that’s ok because once someone read Winnie the Pooh to him at story hour at the library.
Guthur
Honestly, if people continue to conflate human development with a megacorp trawling copyrighted material to build a mathematical model and then wrap it up and charge a subscription for it, then there's really not much you, I or anyone else can do to avoid the inevitable fallout and we really deserve everything we get for it.
SecretDreams
> and we really deserve everything we get for it.
I agree with you until this part. There comes a time where I don't think I deserve to get my eyes poked out just because other people find that fashionable.
maeil
> and we really deserve everything we get for it.
Those people who conflate them deserve it. You and me don't.
> there's really not much you, I or anyone else can do
We can make our own community. And Bluesky is very much not it.
23B1
OP is making the spurious argument that technology should have the same ethical entitlements as humans. It's on par with "information wants to be free".
derektank
I don't read it as an ethical argument, it's an argument about the purpose of copyright. Copyright is intended to restrict reproduction of a work for the purpose of incentivizing the creation of new works. Copyright is not intended to restrict the transmission of knowledge.
SecretDreams
My thoughts as well. I just prefer to remove the nuance on these types of things. If OP wants to draw a line and clearly state "I'm with the corpo robots" that's fine. Just state it plainly so I can proceed accordingly.
nicce
Maybe I need to use this argument when I go to the movies next time.
__loam
There's not a lot of sympathy for this view on this site because half of everyone in tech thinks it's their god given right to pirate content because "copying the information doesn't destroy the information", and because in the past, the people cracking down on piracy were mostly media conglomerates and not the artists themselves.
But I'll try to articulate it anyway. The people who created the data all these models trained on, be they artists, writers, or even programmers, created a lot of, if not most of, the value that is now being derived from these models. Instead of being rewarded for their part, a lot of folks here seem very content with casting those people aside and letting huge corporations take everything, while building a system that is trying to make people creating actual things that have value have a much harder time surviving off their trade.
It's very gross to me that people are defending Meta here, and seem to be okay with capital eating all forms of cultural expression while giving nothing back.
AuryGlenz
Note: I’m not one of those people that think copyright shouldn’t exist. Authors definitely deserve to get paid for their work.
That said, my personal belief is that if the books weren’t legally freely available all that Meta owes would be how much the book costs. Each individual book would be such a small part of the model that it’s barely distinguishable. I’m sure image models have been trained on some of my professionally taken photographs and I don’t care one bit.
I’d argue that the cost of the books was how much they were worth before AI models, and the authors themselves didn’t create the technology. Therefore the added value of the technology has absolutely nothing to do with them. If book publishers/authors decide to have different pricing in the future to take the tech into account that’s their right.
aprilthird2021
> That said, my personal belief is that if the books weren’t legally freely available all that Meta owes would be how much the book costs
Why isn't stealing and knowing you're stealing penalized more than the cost of the item? If the world worked this way everyone would steal.
> Each individual book would be such a small part of the model that it’s barely distinguishable.
Needs to be proven (and also impossible to prove how much of the value of the model comes from the classified material)
keernan
Civil law recognizes the concept of awarding punitive damages against a defendant whose conduct is deemed to have been illegal or otherwise of such an abhorrent nature that punitive damages are warranted to deter such behavior by this defendant as well as serving as a deterrent to others.
Knowingly obtaining millions of copyrighted materials that were posted illegally simply to serve your own financial interests, might very well qualify.
unyttigfjelltol
I'll help. It's like someone copied all of Facebook's code from a pirated source and stood up a competing website that allows users to avoid "paying" for Facebook's services (in the form here of not being served Facebook ads or offering up private information). Maybe in an extreme case the someone might even occasionally call themselves "Facebook" or use its logos.
The problem here is the tech industry sits on a throne of riches built by IP law, so it doesn't sit well when suddenly it's "good for thee but not for me." If we're going to cherry pick, how about we walk back to a view that software isn't copyrightable and the copyright term is 34 years?
realusername
That's because the current copyright system is so extreme and went so far into removing rights that some people would rather let it burn completely than reform it.
Maybe if the copyright system wasn't so extreme, people would have a more balanced view of the system and show more support.
__loam
I don't think meta is taking some moral stand when they flout the law like this
realusername
They don't but it's like "enemies of my enemies are my friends" type of support.
I don't think anybody really believes Meta is doing this as a charity.
carlosjobim
One thing to note is that a huge proportion of the books in these shadow libraries have authors who have long been dead.
b8
Ok? The Google lawsuit and promising lawsuit of the Open library probably will result in a W for Meta. Torrenting is obviously the best way to grab lots of data to train on. Just because they seeded (torrent clients automatically do this) doesn't mean they actually uploaded anything if they couldn't connect to a peer or manually paused/stopped the torrent. Also the author slants the story against Meta and has a bias. At least I felt that way when reading it.
tiahura
Aren’t they going to have to prove the FB seed seeded the particular blocks containing the works of the plaintiffs?
keernan
I'm not so sure. imo it would be a stretch to suggest Fair Use should disregard how the copyrighted materials were obtained.
For example, if Facebook employees broke into the home of a renowned author and stole private copyrighted materials they then used for training their LLMs, should the Court, in analyzing the Fair Use factors, disregard the illegal nature of how the copyrighted materials were obtained?
I believe it unlikely a Court would be willing to reward such behavior.
Once that principle is resolved, the next step would be for the court to consider whether it would make any difference if Facebook employees did not engage in the direct theft, but acquired copies of the stolen materials from the thief with full knowledge they were stolen.
If the court believes both #1 and #2 would be unacceptable, their analysis would then proceed to consider if there were differences favoring Facebook if Facebook acquired millions of copyrighted materials via a notorious website widely accused of illegally posting unauthorized access to copyrighted materials.
I suspect this will likely be the central issue of the legal debate. And, I for one, do not think Facebook has a very strong legal argument. Going back to the first step of the analysis, I would be shocked if SCOTUS would be willing to state that how the copyrighted materials were obtained is irrelevant to the Fair Use analysis.
talldayo
> But in plain terms, it doesn’t seem defensible for a major company with tons of lawyers, money, and talent to knowingly use stolen work to build something that they then turn around and sell.
cough https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
conradev
> it doesn’t seem defensible for a major company with tons of lawyers, money, and talent to knowingly use stolen work to build something that they then turn around and sell.
I thought they give it away for free?
jsheard
Not quite, Llama's license has a carve-out where if you have a lot of users then the open license doesn't apply to you, and you have to talk to Meta to negotiate a different (presumably paid) license. Hyperscaler deployments like AWS Bedrock probably fall into that category.
Meta isn't a charity - even if they're not profiting from Llama today, they believe they will at some point.
sadeshmukh
The user exception was for 100m monthly users - if you have that many users, you can afford your own base model. That's hardly a restriction as it is now.
yoyohello13
lol, have they been living under a rock? That's exactly the kind of thing I’d expect a major company to do.
intuitionist
To misquote Don Draper, that’s what the lawyers and money are for!
zdw
This site seems to have some javascript that prevents text selection, even with plugins like StopTheMadness that should allow text selection everywhere.
(Not trying to select + copy, just trying to indicate how far I am through the document)
glitchc
They're trying to protect against AI crawlers siphoning their content. Unsurprising given the article's positioning.
Isn't it just a matter of time before we see content creators weaponize their sites? They may present text as images or require a user to solve a pictogram cipher etc.
jsheard
I don't see how messing with text selection does anything to ward off scrapers. The article content isn't obscured in the HTML in any way.
TiredOfLife
Just like DRM only harming paid users, this only harms humans. AI has no problem making a screenshot and extracting text.
glitchc
One could add adversarial artifacts to the image, making the AI think it's a turtle instead.
tonetegeatinst
I would wager that because the AI bots want to scrape all the data possible, eventually someone will realise that they can set up a site that, once scraped, will poison the training data in subtle ways.
Effectively turning the web scraper into an attack vector.
The CSV score would be weird for that.
spencerflem
This exists already and is called SEO
batch12
> The CSV score would be weird for that.
Did you mean CVSS?
carlosjobim
The only thing needed is a paywall. Scrapers are not going to pay for access.
jsheard
This very thread is about an AI company going out of their way to acquire paywalled content. Maybe they wouldn't bother with this site, but paywalling demonstrably does not help once an AI company sets their sights on you.
oidar
Vanilla firefox can select text with no extensions.
archargelod
Selection (and copying) works on Firefox even with uBlock disabled.
With uBlock, though, I have 20+ blocked items. What's the point of having so many scripts that do absolutely nothing visible on the page?
imgabe
Stole them from where? An Amazon warehouse? The New York Public Library? People's homes?
Is there any plan to recover the books and return them to their owners?
irjustin
Stole in that they didn't pay publishers (authors?) any money to use the books.
Maybe the Library system does apply here though, that all AI trainers need to buy 1 copy of their book so "their child" can "learn"?
Agraillo
My first (almost) thought was "Ok, looking at the llama, where's the quality boost?" I think the explanation is probably in how LLMs are trained. Even without knowing the internals deeply, I suspect that it comes down to compression of information: to simplify, you can't make 8GB of data contain all the facts of a "bigger" normalized relational database. So they keep the facts that are present everywhere and often drop rare facts. For example, the fact "SQL was invented at IBM" can be found everywhere: in books, web sites, comments. You don't need access to copyrighted books to acquire this fact. But a first-person account from someone who worked at IBM at that time can probably be found in only a couple of books, and due to "compression", it will be gone anyway.
WiSaGaN
The current copyright system is clearly broken. Copyright itself is not a natural right but a construct that was created relatively recently to incentivize innovation within society. Currently, we still want to train models using copyrighted materials. What we don’t want, however, is for companies to use these materials without giving back, especially to the content creators themselves, such as book authors, publishers, Stack Overflow answer writers, and Stack Overflow as a platform. We need to find a way to properly compensate these contributors to keep book writing, answer writing, and other creative efforts going, preferably while keeping the content publicly accessible. In the future, I hope we can figure out a model that might draw inspiration from music streaming (though perhaps not exactly like it).
yonran
Legally, would it be any worse for Meta to download works that someone illegally copied and repurpose it for a transformative fair use than if Google had taken their scans of snippet-view Google Books borrowed from the library to train their model in fair use?
aprilthird2021
I think their intent and knowledge of whether they needed to pay for the content or not is also important. And there's clear evidence they knew they should pay and decided not to
yonran
Google could take their scans from the library (which they did not pay publishers for) and train an AI model on that, right? They won the Authors Guild v. Google lawsuit because snippet view is transformative fair use, and I think teaching an AI model common sense and facts is even more transformative. Should Meta do the same?
aprilthird2021
> They won the Authors Guild v. Google lawsuit because snippet view is transformative fair use, and I think teaching an AI model common sense and facts is even more transformative.
What you think isn't law, it will be decided now.
> Should Meta do the same?
Well, they didn't do that. They stole the books and knew they were stealing them.
> It’s a grim week for Meta. The company formerly known as Facebook, and before that Facemash, “designed to evaluate the attractiveness of female Harvard students,”
Wow if that's the opener, I expect the rest to be SUPER emotionally charged