
Aaron Swartz and Sam Altman

37 comments

· January 12, 2025

rglover

Aaron was the OG. If you've never dug through his blog, do yourself a favor [1]. Also make some time to watch The Internet's Own Boy doc about him [2] and look up some of his talks during the SOPA shenanigans. RIP.

[1] http://www.aaronsw.com/weblog/

[2] https://www.youtube.com/watch?v=9vz06QO3UkQ&rco=1

menzoic

For one, Sam scraped under the veil of a corporation, which helps reduce or remove personal liability.

Second, if the crime was the act of scraping, then it’s directly comparable. But if the crime is publishing the data for free, that’s quite different from training AI to learn from the data while not being able to reproduce the exact content.

“Probabilistic plagiarism” is not what’s happening or even aligned with the definition of plagiarism (which matters if we’re talking about legal consequences). What’s happening is that it’s learning patterns from the content that it can apply to future tasks.

If a human reads all that content then gets asked a question about a paper, they too would imperfectly recount what they learned.

dkjaudyeqooe

Your argument might make sense to you, but it doesn't make sense legally.

The fact is that “probabilistic plagiarism” is a mechanical process, so as much as you might like to anthropomorphize it for the sake of your argument ('just like a human learning'), it's still a mechanical reproduction of sorts, which is an important point under fair use, as is the fact that it denies the original artists the fruits of their labor and is a direct substitute for their work.

These issues are the ones that will eventually sink (or not) the legality of AI training, but they are seldom addressed in these sorts of discussions.

menzoic

> The fact is that “Probabilistic plagiarism” is a mechanical process, so as much as you might like to anthropomorphize it for the sake of your argument

I did not anthropomorphize anything. “Learning” is the proper term. It takes input and applies it intelligently to future tasks. Machines can learn; machine learning has been around for decades. Learning doesn’t require biology.

My statement is that it is not plagiarism in any form. There is no claim that the content was originally authored by the LLM.

An LLM can learn from a textbook and teach the content, and it will do so without plagiarism. Just as a human can learn from a textbook and teach. Making an analogy to a human doesn’t require anthropomorphism.

jazzyjackson

Ask any LLM to recite lyrics and see that it's not so probabilistic after all; it's perfectly capable of publishing protected content, and the filter to prevent it from doing so is such a bolt-on it's embarrassing.

menzoic

We have to understand what plagiarism is before making claims of it. Claiming that you authored content and reciting content are different things. Reciting content isn’t plagiarism. Claiming you are the author of content that you didn’t author is plagiarism.

> it's perfectly capable of publishing protected content

At most it can produce partial excerpts.

LLMs don’t store the data they’re trained on. That would be infeasible; the models would be too large. Instead, they store semantic representations, which often use entirely different words and sentence structures than the source content. And of course most of the data is lost entirely during this lossy compression.
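A rough back-of-envelope sketch of the size argument being made here (the numbers below are my own illustrative assumptions, not figures from the thread): even a generous estimate of a model's weight storage comes out far smaller than the text it was trained on, which is why wholesale verbatim storage of the corpus inside the weights is implausible.

    # Back-of-envelope comparison: training-corpus size vs. weight size.
    # All numbers are assumptions for illustration only.
    train_tokens = 10e12        # assume ~10 trillion training tokens
    bytes_per_token = 4         # assume ~4 bytes of raw text per token
    params = 70e9               # assume a 70B-parameter model
    bytes_per_param = 2         # 16-bit weights

    corpus_tb = train_tokens * bytes_per_token / 1e12
    weights_tb = params * bytes_per_param / 1e12

    print(f"corpus  ~ {corpus_tb:.0f} TB of text")         # ~40 TB
    print(f"weights ~ {weights_tb:.2f} TB of parameters")  # ~0.14 TB
    print(f"ratio   ~ {corpus_tb / weights_tb:.0f}x")      # ~286x

This doesn't by itself settle whether specific passages can be memorized and regurgitated (the NYT evidence mentioned later in the thread suggests some can), only that storing the whole training set verbatim is not what the weights amount to.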

theyinwhy

LLMs (OpenAI models included) are happy to reproduce books word by word, page by page. Just try it out yourself. And even if some words were reproduced wrong, it would still be a copyright violation.

throw5959

It's not really a matter of some wrong words; it's more that you won't be able to get more than a page out of it, and even that is going to be so wrong it's basically a parody and thus allowed.

angoragoats

I’d love to see you try to defend this notion in court. Parody requires deliberate intent to be humorous. And courts have repeatedly held that changing the words of a copyrighted work while keeping the same general meaning can still be copyright infringement.

menzoic

> reproduce books word by word, page by page

This statement is a figment of the commenter's imagination with no basis in reality. All they would have to do is try it to realize they just spouted a lie.

At most LLMs can produce partial excerpts.

LLMs don’t store the data they’re trained on. That would be infeasible; the models would be too large. Instead, they store semantic representations, which often use entirely different words and sentence structures than the source content. And of course most of the data is lost entirely during this lossy compression.

rhubarbtree

The NYT has extracted long articles from ChatGPT and submitted the evidence in court.

angoragoats

> At most LLMs can produce partial excerpts.

Glad you agree that LLMs infringe copyrights.

realusername

I tried and I could not make it work. And even if you could, that has to be the most inefficient way to pirate books on earth.

qwertox

Do you really think that OpenAI has deleted the data it has scraped? Don't you think OpenAI is storing all this scraped data at this moment on some fileservers in order to re-scan this data in the future to create better models? Models which may even contain verbatim copies of that data internally but prevent access to it through self-censorship?

In any case it's a "Yes, we have all this copyrighted data and we're constantly (re)using it to produce derived works (in order to get wealthy)". How can this be legal?

If that were legal, then I should be able to copy all the books in a library and keep them on a self-hosted, private server for my own or my company's use, as long as I don't quote too much of that information. But I should be able to have all that data and do close to whatever I want with it.

And if this were legal, why shouldn't it be legal to request a copy of all the data from a library and obtain access to it via a download link?

b8

Sam Altman failed upwards only because PG likes him. Aaron Swartz was actually a technical genius imo. The DOJ never should have charged Swartz.

khazhoux

Thank you, Sam Altman and everyone at OpenAI, for creating ChatGPT and unleashing the modern era of generative AI. I regularly use ChatGPT now for coding questions instead of stackoverflow, and I use it to polish up my writings at work.

Signed,

Someone who doesn't care that you're making $$$$ from it

angoragoats

If I’m an author and I don’t want my work included in the corpus of text used for training ChatGPT, should I have that right?

What about if I’m an artist and I don’t want my work included in the training data for an image generation model?

qwertox

Sam Altman wouldn't spend a second reflecting about this.

dehrmann

This post misses a lot of nuance. Aaron Swartz was an activist who did obviously illegal things and got caught and prosecuted. What OpenAI is doing is in a legal gray area because it might be transformative enough to be fair use; we just don't know yet.

dkjaudyeqooe

Simply being transformative is not sufficient for it to be fair use.

But more to the point, if it's deemed illegal, Altman won't suffer any personal legal consequences.

benatkin

I know I support what aaronsw did, and I don’t think he should have gotten in any trouble for it, let alone to the tragic level it went to. As for sama, I’m not sure; on one hand I like the innovation, and on the other hand it’s very worrying for humanity. I appreciate the post and the fond memories of Aaron, but I’m not in complete agreement with the author about sama.

mastazi

In the photo there are some other faces that I think I might recognise, but I'm not 100% sure. Is there a list of everyone in the picture somewhere on the internet?

Edit: I think the lady on the left is Jessica Livingston and a younger PG is on the right.

aimazon

https://i.imgur.com/e0GPhSE.jpeg

1. zak stone, memamp

2. steve huffman, reddit

3. alexis ohanian, reddit

4. emmet shear, twitch

5. ?

6. ?

7. ?

8. jesse tov, https://www.ycombinator.com/companies/simmery

9. pg

10. jessica

11. KeyserSosa, initially memamp but joined reddit not long after (I forget his real name)

12. phillip yuen, textpayme

13. ?

14. aaron swartz, infogami at the time

15. ?

16. sam altman, loopt at the time

17. justin kan, twitch

tptacek

No, that's Jessica Livingston, the cofounder of YC.

mastazi

yes, you are right, I edited my comment.

begueradj

Aaron was a developer himself but Sam ... ?

nell

Why is Sam Altman singled out in these copyright issues? Aren't there plenty of competing models?

mrkpdl

OpenAI is the highest profile.

mvdtnz

I don't believe you're asking this in good faith because the answer is so obvious. But just in case: it's because OpenAI is, by a ridiculously large margin, the most well-known player in the space, and unlike the leaders of other organisations, his name is known and he has a personal brand which he puts a lot of effort into promoting.

sinuhe69

The two are seen in the picture together! They could not be more similar. That highlights the irony and hypocrisy of capitalism, or better said, of human society.

tempodox

> one did it to make the knowledge free for all while the other did it to make $$$$ through probabilistic plagiarism. The US DOJ only came after one of them & the other is feted by tech bros and executives.

Making $$$ is the holiest act you can perform in the U.S., no matter the details. Not having that as a goal is un-American and puts you under suspicion.

error_logic

This despite the fact that what actually Made America Great was constructive, honest, healthy competition--not the insane destroy-the-competition monopolist's outlook which tries to destroy opportunity for others and thus competition itself.

Horseshoe theory is real.

ilrwbwrkhv

Oh man. Heavy stuff. Will our industry be looked at as good or bad? I hope we end up doing good for the world.

memonkey

Hard to say when there is a profit motive for all industries. Seems like every industry at the moment is not really looking for human advancement, or maybe it is looking at advancing but only if the results are expensive for end users and efficient/proprietary for the company.

ilrwbwrkhv

Yes, but the thing is our industry has almost unparalleled leverage, and the marginal cost is zero.