LLMs Were Backdoored Years Ago
12 comments
· February 6, 2025
echoangle
sshine
The watermarks are both PDF metadata and slight variations in wording. The watermark metaphor indeed falls short; it should perhaps be called fingerprinting.
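A minimal sketch of how wording-level fingerprinting could work, assuming each distributed copy picks one of two synonyms at a few fixed positions. `SLOTS`, `TEMPLATE`, `fingerprint`, and `identify` are hypothetical names for illustration only; this is not any publisher's actual scheme.

```python
# Hypothetical illustration: encode a recipient ID in synonym choices.
# Each slot offers two interchangeable words; one bit of the ID picks one.
SLOTS = [
    ("shows", "demonstrates"),
    ("large", "substantial"),
    ("method", "approach"),
    ("however", "nevertheless"),
]

# Made-up sentence with one placeholder per slot.
TEMPLATE = "This paper {0} a {1} gain; the {2} is simple, {3} it is costly."


def fingerprint(recipient_id: int) -> str:
    """Return the copy of the text whose wording encodes recipient_id."""
    if not 0 <= recipient_id < 2 ** len(SLOTS):
        raise ValueError("recipient_id out of range")
    # Bit i of the ID selects the first or second word of slot i.
    words = [pair[(recipient_id >> i) & 1] for i, pair in enumerate(SLOTS)]
    return TEMPLATE.format(*words)


def identify(text: str) -> int:
    """Recover the ID by checking which second-choice words appear."""
    recipient_id = 0
    for i, (_, variant) in enumerate(SLOTS):
        if variant in text:
            recipient_id |= 1 << i
    return recipient_id
```

In this sketch the marker lives in the plain text itself, so it would survive text extraction from the PDF, whereas metadata would not. With four slots it can only distinguish 16 copies; a real scheme would need many more variation points.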
echoangle
You’re saying publishers of scientific papers slightly adjust the wording, and this is used to detect who “leaked” the version of the paper? If someone else and I download the same paper, the wording would differ? How would that work if I want to quote them?
keyle
I think using the term backdoor is completely wrong in this case.
You're not at risk from using an LLM, other than the content producer being able to tell that the LLM was trained on their data; maybe, potentially... It's a stretch, even.
Your queries are not being relayed to them, they don't have a backdoor into the LLM content, algos or queries. They merely have a tainted marker, potentially showing, in the output.
LLM providers can always claim that they didn't get the tainted data from the original source but from another one, and that's who you should go after; good luck proving the misdirection. I bet it's probably even hard for them to know exactly where this exact output came from, since it's probably been regurgitated 250,000 times.
4ndrewl
But there's no legal precedent for any of this yet. Microsoft etc will offer to pay your legal fees etc if you are sued for what their service produces, but aiui that only kicks in after you've potentially been wiped out in court.
keyle
We are definitely in uncharted territory.
However, you don't sue a product designer for using your patented screws; they're selling added value on top of the stack of screws they used to build their product. So I see tokens roughly the same way.
logifail
> However you don't sue a product designer for using your patented screws
https://www.uspto.gov/trademarks/basics/trademark-patent-cop...
Copyright: "Artistic, literary, or intellectually created works, such as novels, music, movies, software code, photographs, and paintings that are original and exist in a tangible medium, such as paper, canvas, film, or digital format."
Copyright protection: "Protects your exclusive right to reproduce, distribute, and perform or display the created work, and prevents other people from copying or exploiting the creation without the copyright holder’s permission."
Patent: "Technical inventions, such as chemical compositions like pharmaceutical drugs, mechanical processes like complex machinery, or machine designs that are new, unique, and usable in some type of industry."
Patent protection: "Safeguards inventions and processes from other parties copying, making, using, or selling the invention without the inventor’s consent."
sshine
It should be “LLM training data has been backdoored.”
pcranaway
this is really interesting
but as much as i want to agree with the author’s stance:
> [..] no matter how hard these companies try to sell us on AGI or “research” models, you can just laugh until you cry that they really thought they could steal from the well of knowledge and turn around and sell it back to us through SaaS
i feel like at the end, these companies will still win
fjjjrjj
It's the Uber model. Grow first and figure out licensing, compliance, and lobbying second.
renewiltord
Google stole from the well of knowledge and sold it back to us as search. Or rather, accessing a database efficiently is a problem of its own.
Why are all these articles always so fraught with these excessively fearful terms? It’s not real, man. Humanity invented a new tool.
And yeah, people freaked out about search engines too. And it was the same breathless terror. Get a grip.
4ndrewl
Lawyers are yet to test this.
I don’t understand how the watermarking is supposed to affect the LLMs being trained. Isn’t the watermarking some hidden data in the PDF file itself and the text contained in the PDF is the same for everyone? And aren’t the LLMs trained with the extracted plain text, not the PDF itself? How would the LLM training be different if there’s a watermark?