It sure looks like Meta stole a lot of books to build its AI
136 comments
January 21, 2025
OsrsNeedsf2P
timewizard
It's already pretty weak, in terms of its stated goals, to promote the progress of science and useful arts. Perhaps reform is a better direction to go.
23B1
As a long-time writer, this case will have impact on me, but I maintain a fool's hope a Meta loss will significantly weaken IP-laundering megacorps.
Filligree
As a long-time writer, this case will have little impact on me, but I maintain a hope it’ll expand the limits of fair use and let us make more shared worlds than we’re allowed to.
23B1
Fair use was designed for teachers photocopying pages of Walden for their pupils to read over the weekend. Not industrial-scale laundering of IP to benefit shareholders.
ASalazarMX
I just want writers to receive a bigger share of the cake, so publisher megacorps aren't the ones grabbing the lion's share. Writers could make a better living, and copyright could be reformed so corpos can't gate popular culture for more than two generations.
carlosjobim
Writers decide if they want to use a publisher and which publishers to use. With the internet, I don't think any good author needs to rely on a traditional publisher.
dangoodmanUT
> It’s a grim week for Meta. The company formerly known as Facebook, and before that Facemash, “designed to evaluate the attractiveness of female Harvard students,”
Wow if that's the opener, I expect the rest to be SUPER emotionally charged
moogly
I dunno, it's put rather matter-of-factly.
polygon87
Does every article about IBM need “IBM, which once helped facilitate the Holocaust, …”?
anileated
If IBM’s original product was “designed to facilitate the Holocaust”, then every article about IBM darn better lead with that.
In case of Facebook, it was originally designed literally to compare the attractiveness of female students; but no, it’s not even remotely as bad, and comparing it to the Holocaust trivializes an incident where masses of people were murdered in an industrial fashion.
dangoodmanUT
> CEO and founder Mark Zuckerberg announced that slurs are okay on their platforms, added a pro-Trump UFC boss to their board
The next paragraph...
koops
Isn't that in fact true? Emotional or not.
mrits
No, he literally never said or even implied slurs are ok
mock-possum
I mean he did say that transphobia is specifically protected speech so
skellington
You don't get to shut down speech by just adding -phobia to everything you have a strong opinion about.
spinach
Just stating biological reality is "transphobia" now, if even speaking about reality is censored we are lost.
b8
Ok? The Google lawsuit and promising lawsuit of the Open library probably will result in a W for Meta. Torrenting is obviously the best way to grab lots of data to train on. Just because they seeded (torrent clients automatically do this) doesn't mean they actually uploaded anything if they couldn't connect to a peer or manually paused/stopped the torrent. Also the author slants the story against Meta and has a bias. At least I felt that way when reading it.
tiahura
Aren’t they going to have to prove the FB seed seeded the particular blocks containing the works of the plaintiffs?
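For anyone wondering what "particular blocks" means mechanically: a torrent lays its files out back to back and cuts the result into fixed-size pieces, so the pieces covering one file can be computed from the file list. A minimal sketch of that arithmetic in Python; the function name, file sizes, and piece length are all hypothetical and have nothing to do with the actual torrents in the case.

```python
def piece_range(file_sizes, file_index, piece_length):
    """Return (first_piece, last_piece) covering one file in a torrent.

    Assumption: files are concatenated in torrent order and split into
    pieces of `piece_length` bytes. All values are illustrative only.
    """
    if file_sizes[file_index] == 0:
        raise ValueError("an empty file occupies no pieces")
    start = sum(file_sizes[:file_index])       # byte offset where the file begins
    end = start + file_sizes[file_index] - 1   # offset of the file's last byte
    return start // piece_length, end // piece_length


# Hypothetical torrent: three files of 100, 250 and 50 bytes, 100-byte pieces.
sizes = [100, 250, 50]
second = piece_range(sizes, 1, 100)
third = piece_range(sizes, 2, 100)
```

Note that a piece on a file boundary holds bytes from two files, so under these assumptions having uploaded a given piece does not by itself say which neighbouring work's bytes were requested.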
joshe
I stole a lot of books too, reading them and all. Just integrated them into my worldview, and don't pay a license fee when I use the ideas in new contexts. Sometimes I even quote from them. A lot of them I didn't even pay for, I borrowed them from libraries or friends.
maeil
Didn't know Meta paid for them or borrowed them from libraries. In fact, I don't think they did.
furyofantares
Maybe not but even if Meta did buy 1 copy of every book I doubt it would stop anyone from making bad analogies to theft. (Not that the analogy on the other side to a human reading is any better.)
nicce
What if we could buy the books for one human, make the human read all the books, and somehow we would be able to clone this human in a way that they remember the book contents.
Now, would that be a fair use of the books?
corobo
Isn't that literally what copyright is?
You have bought the text so you have the readright, but you do not have the copyright.
aprilthird2021
Yes because using the books to train a book-replicating machine is a legally grey area.
anileated
Have you considered that you have some traits that make you eligible to read books and access information freely in the country you live in*? Something about being a conscious human being enjoying human rights, perhaps? An implement that does the same but (A) at scale and (B) without thought or free will or agency, completely at the bidding of its operator, for profit, has no such protections. Instead, the operator carries all responsibility (in this case, Meta).
If a software service had legal protections like that, sure, I could build one that returns you any book you request and say that the service had integrated it into its worldview. Who can check, eh?
* Actually, in some countries you could be in trouble for reading a book and incorporating it into your worldview, to say nothing about quoting it, but let’s set that aside.
gruez
>Have you considered that you have some traits that make you eligible to read books and access information freely in the country you live in*? Something about being a conscious human being enjoying human rights, perhaps?
Not a relevant factor when it comes to copyright law. Fair use (the law that's most applicable here) applies regardless of whether you're a student incorporating news articles into your work, or Google making thumbnails and displaying them on their search results.
anileated
This is not a good analogy. Google does not display the contents to any significant degree (you have to visit the search result). And even then it was/is in legal trouble, in fact (in some countries like Australia* more than others).
Furthermore:
> Examples of fair use in United States copyright law include commentary, search engines, criticism, parody, news reporting, research, and scholarship.
I do not see “automated generation of derivative works of arbitrary nature” in it.
recursivecaveat
The "and then fed them into an AI" part of "facebook pirated a bunch of books and then fed them into an AI" is irrelevant. It would be equally illegal if they pirated them and then sat around reading them. Unless you somehow hope that the entirety of copyright will be overturned by this court case (not a chance) then you should strongly hope that facebook loses, because the alternative is literally "rules for thee but not for me" where corps can pirate whatever they want, but nothing changes for ordinary citizens.
irjustin
> I stole a lot of books too, reading them and all. Just integrated them into my worldview, and don't pay a license fee when I use the ideas in new contexts.
That... doesn't make it okay...
> A lot of them I didn't even pay for, I borrowed them from libraries or friends.
This 2nd sentence doesn't fit your first. What is your message?
ClumsyPilot
So first there was this ‘corporations are people’ and now we have ‘computers are people’.
So I expect to see that either you are no longer allowed to own computer software
Or a return of slavery.
Also if we find indecent portrayal of minors in a data centre I expect that we treat it as a strict liability crime and the entire data centre or corporation that owns it gets a long prison sentence, just like a human would. However that is supposed to work.
SecretDreams
I can't tell if your implication is what meta did is fine or just that your brain is as good as an AI?
Could you clearly speak your point?
whynotminot
This viewpoint is one widely shared inside the AI community — that AI systems should be able to learn from material just as humans do.
Extrapolated out into some new future a hundred years from now when we have embodied AI humanoids walking alongside us, would it be weird if those humanoids were barred from buying a new book or charged a different rate than the humans they coexist with?
I’m still deciding how I feel about some of this too.
plasticchris
If we are going to afford models like this treatment equivalent to sentient beings in this regard, why not others? In your extrapolation these ai walking among us are property of giant tech companies…
SecretDreams
> This viewpoint is one widely shared inside the AI community — that AI systems should be able to learn from material just as humans do.
I'm not even against this to a point. The issue is what comes after. The monetization. The enshittification. The derivatives in place of real creativity.
tinco
It's a straw man though, whether or not AI should be allowed to learn from books is irrelevant to the point that Meta stole tens of thousands of books to accomplish this. A fact that they've admitted to and even had they not would be trivially proven.
They're not being charged, that would be a vast improvement over reality.
hedora
The courts have already ruled that training on data is similar enough to reading it (sufficiently transformative) to be considered fair use, in the same way that I cannot claim a copyright on your brain because you read this comment.
On the other hand, they torrented books and then open sourced LLM weights. No punishment is too severe for that!
If you still don’t understand, I strongly suggest watching Max Headroom, “Lessons”, which you can get here:
wang_li
I think he’s saying he works for meta and when the company employees committed mass copyright violation that’s ok because once someone read Winnie the Pooh to him at story hour at the library.
23B1
OP is making the spurious argument that technology should have the same ethical entitlements as humans. It's on par with "information wants to be free".
derektank
I don't read it as an ethical argument, it's an argument about the purpose of copyright. Copyright is intended to restrict reproduction of a work for the purpose of incentivizing the creation of new works. Copyright is not intended to restrict the transmission of knowledge.
SecretDreams
My thoughts as well. I just prefer to remove the nuance on these types of things. If OP wants to draw a line and clearly state "I'm with the corpo robots" that's fine. Just state it plainly so I can proceed accordingly.
nicce
Maybe I need to use this argument when I go to the movies next time.
Guthur
Honestly, if people continue to conflate human development with a megacorp trawling copyrighted material to build a mathematical model and then wrap it up and charge a subscription for it, then there's really not much you, I or anyone else can do to avoid the inevitable fallout and we really deserve everything we get for it.
SecretDreams
> and we really deserve everything we get for it.
I agree with you until this part. There comes a time where I don't think I deserve to get my eyes poked out just because other people find that fashionable.
maeil
> and we really deserve everything we get for it.
Those people who conflate them deserve it. You and me don't.
> there's really not much you, I or anyone else can do
We can make our own community. And Bluesky is very much not it.
conradev
> it doesn’t seem defensible for a major company with tons of lawyers, money, and talent to knowingly use stolen work to build something that they then turn around and sell.
I thought they give it away for free?
jsheard
Not quite, Llama's license has a carve-out where if you have a lot of users then the open license doesn't apply to you, and you have to talk to Meta to negotiate a different (presumably paid) license. Hyperscaler deployments like AWS Bedrock probably fall into that category.
Meta isn't a charity - even if they're not profiting from Llama today, they believe they will at some point.
yoyohello13
lol, have they been living under a rock? That's exactly the kind of thing I’d expect a major company to do.
intuitionist
To misquote Don Draper, that’s what the lawyers and money are for!
zdw
This site seems to have some javascript that prevents text selection, even with plugins like StopTheMadness that should allow text selection everywhere.
(Not trying to select + copy, just trying to indicate how far I am through the document)
glitchc
They're trying to protect against AI crawlers siphoning their content. Unsurprising given the article's positioning.
Isn't it just a matter of time before we see content creators weaponize their sites? They may present text as images or require a user to solve a pictogram cipher etc.
jsheard
I don't see how messing with text selection does anything to ward off scrapers. The article content isn't obscured in the HTML in any way.
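To illustrate the point: selection blocking usually lives in CSS or event handlers, which a scraper reading the raw markup never runs. A rough sketch using only Python's standard `html.parser`; the `SAMPLE` markup is made up for illustration and is not this site's actual HTML.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Selection-blocking CSS and handlers are simply never evaluated here.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page that blocks selection in a browser.
SAMPLE = (
    "<style>p { user-select: none; }</style>"
    '<p onselectstart="return false">Article body</p>'
)
text = extract_text(SAMPLE)
```

The text comes out the same whether or not a browser would let a human highlight it.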
tonetegeatinst
I would wager that because the AI bots want to scrape all the data possible, then eventually someone will realise that they can set up a site that once scraped will poison the training data in subtle ways.
Effectively turning the web scraper into an attack vector.
The CSV score would be weird for that.
spencerflem
This exists already and is called SEO
batch12
> The CSV score would be weird for that.
Did you mean CVSS?
TiredOfLife
Just like DRM only harming paying users, this only harms humans. AI has no problem taking a screenshot and extracting text.
glitchc
One could add adversarial artifacts to the image, making the AI think it's a turtle instead.
carlosjobim
The only thing needed is a paywall. Scrapers are not going to pay for access.
jsheard
This very thread is about an AI company going out of their way to acquire paywalled content. Maybe they wouldn't bother with this site, but paywalling demonstrably does not help once an AI company sets their sights on you.
oidar
Vanilla firefox can select text with no extensions.
archargelod
Selection (and copying) works on Firefox even with uBlock disabled.
With uBlock, though, I have 20+ blocked items. What's the point of having so many scripts that do absolutely nothing visible on the page?
unraveller
Something tells me that if they did it "properly" i.e. scraped and distilled all the world's libraries directly themselves (google streetview style without storing any individual pages) instead of torrent there'd still be backlash for the disruption it brings.
ChrisArchitect
[dupe]
More discussion: https://news.ycombinator.com/item?id=42651007 https://news.ycombinator.com/item?id=42673628
WiSaGaN
The current copyright system is clearly broken. Copyright itself is not a natural right but a construct that was created relatively recently to incentivize innovation within society. Currently, we still want to train models using copyrighted materials. What we don’t want, however, is for companies to use these materials without giving back, especially to the content creators themselves, such as book authors, publishers, Stack Overflow answer writers, and Stack Overflow as a platform. We need to find a way to properly compensate these contributors to keep book writing, answer writing, and other creative efforts going, preferably while keeping the content publicly accessible. In the future, I hope we can figure out a model that might draw inspiration from music streaming (though perhaps not exactly like it).
yonran
Legally, would it be any worse for Meta to download works that someone illegally copied and repurpose them for a transformative fair use than if Google had taken their scans of snippet-view Google Books borrowed from the library to train their model in fair use?
aprilthird2021
I think their intent and knowledge of whether they needed to pay for the content or not is also important. And there's clear evidence they knew they should pay and decided not to
yonran
Google could take their scans from the library (which they did not pay publishers for) and train an AI model on that, right? They won the Authors Guild v. Google lawsuit because snippet view is transformative fair use, and I think teaching an AI model common sense and facts is even more transformative. Should Meta do the same?
imgabe
Stole them from where? An Amazon warehouse? The New York Public Library? People's homes?
Is there any plan to recover the books and return them to their owners?
irjustin
Stole in that they didn't pay publishers (authors?) any money to use the books.
Maybe the Library system does apply here though, that all AI trainers need to buy 1 copy of their book so "their child" can "learn"?
arthurcolle
PDF accessible on the world wide web is fair game
As a long-time sailor, this case may have no impact on regular folks, but I maintain a fool's hope a Meta victory will significantly weaken the copyright system.