Microsoft Probing If DeepSeek-Linked Group Improperly Obtained OpenAI Data
150 comments
January 29, 2025
tapoxi
Oh I see, so training on copyrighted content is fine unless it's your AI model...
maeil
Don't worry, OpenAI definitely didn't improperly obtain any of their training data nudge nudge wink wink.
chii
it's not copyright, but TOS violation that they're going for.
Let's just hope that a TOS violation is treated as a contract breach (if it even is a valid contract) and not anything crime-related.
I don't want the state to expend taxpayer dollars to fund prosecutions of civil offenses. The music industry has already managed to push its enforcement onto taxpayers via laws such as the DMCA. Can't have this be done again, with even wider reach.
mrbungie
Legalities don't make them any less hypocritical.
Their strategy right now is to socialize the negative externalities (copyright does not protect anything from them) and capitalize the gains (ToS says you can't use their outputs).
PS: I do understand your point. Using taxpayer money would be even worse.
isomorphic
"We have no moat. But we are well-capitalized, so in classic early-mover monopolist fashion we'll try to pull the ladder up, then use funds to bri--er, lobby--the government into smacking down climbers."
benterix
If I understand their strategy, it's to limit the use of LLMs to American-based companies by saying that it is dangerous (when used remotely) and illegal (when using remote models). So they strive to make it so that all American companies use American models, whereas government agencies use their dedicated gov infra.
Gigachad
I'm sure Microsoft's scrapers read the ToS of every website they scraped to make sure it wasn't a ToS violation.
cameldrv
Lots of websites have a TOS. Do you think OpenAI's scraper reads all of them and follows all of the terms?
chii
> Lots of websites have a TOS.
OpenAI did not agree to the TOS of the other websites they scraped, so they're not bound by it. It's unclear whether a TOS is automatically agreed to merely by accessing data already supplied in a webpage, without first actively agreeing through an action like clicking.
See https://en.wikipedia.org/wiki/Shrinkwrap_(contract_law)#Unit...
Therefore, OpenAI is only ever bound by copyright law, and not by the TOS of the website. And according to some interpretations of copyright law, these AI models do not constitute distribution of the original training material.
sdenton4
ChatGPT seems pretty knowledgeable about Spotify's TOS, for example... So no way for the bot to plead ignorance. :)
"Hey, ChatGPT, what are the terms of service for Spotify?"
"I can't provide the full, up-to-date terms of service for Spotify, but I can give you an overview of some key points typically included. For the most accurate and comprehensive information, I recommend visiting Spotify’s official website and reading their Terms of Service directly. That said, here are some common points that are usually part of streaming platforms' terms:
1. *Account Creation and Use*: Users need to create an account to access Spotify’s services, and must provide accurate, up-to-date information. They’re responsible for maintaining the security of their account.
2. *Subscription Plans*: Spotify offers different subscription tiers (Free, Premium, Family, etc.). Terms often specify the limitations of each tier, such as access to ads, offline listening, and music quality.
3. *Content Usage*: Spotify provides users with access to music, podcasts, and other content. However, the usage is typically for personal, non-commercial purposes only. Users are not allowed to download, share, or redistribute content unless Spotify allows it (like through the Share button or via a playlist).
4. *Licensing and Intellectual Property*: Spotify owns or has licenses for the content available on the platform. Users are granted a limited, non-exclusive license to access and use the service, but they don’t own the content they listen to.
5. *Privacy*: Spotify collects data, including usage patterns and preferences, which they may use for things like recommendations, advertisements, and analytics. Their privacy policy will detail what information is collected and how it’s used.
6. *Cancellation and Termination*: Either party (Spotify or the user) can terminate the service agreement. Users can cancel their subscription anytime, but Spotify may also suspend or terminate accounts for violations of their terms.
7. *Disclaimers and Limitations of Liability*: Spotify usually limits its liability in case of issues like service interruptions, data loss, or content inaccuracies.
8. *Modifications*: Spotify may update or change its Terms of Service at any time, and users are typically notified of these changes. Continued use of the service after changes indicates acceptance of the updated terms.
9. *Governing Law*: There’s often a clause specifying which country’s laws govern disputes related to the terms, and how disputes will be handled (for example, through arbitration).
If you're interested in the full details, you can check Spotify’s official site or app, where their Terms of Service are readily available for review."
rspoerri
Everything is copyrighted in some countries. The Swiss Urheberrecht (author's right) arises automatically upon creation of the work. You don't have to add © or enter it in a register. You cannot transfer the Urheberrecht itself, only usage rights to your work.
flashman
AI models and their outputs aren't copyrightable because they're not made with human creativity :)
vinni2
But their data is.
bodegajed
I'm not surprised. This is the expected behavior of a monopolist.
gonzo41
Before anything else happens, the world will fall off its axis from the irony of this.
These are the sort of litigious death throes that a threatened orthodoxy throws out when it smells its own blood in the water.
A Trillion dollar write down will do that to you.
quickslowdown
These guys need to pick a damn lane, is it ok to steal data to train your chatbot or not? You can't have it both ways.
kragen
AI model output is not copyrightable, though presumably that will change soon.
joe_the_user
And I'd just add the quote from the non-paywalled fragment is "Microsoft’s security researchers in the fall observed individuals they believe may be linked to DeepSeek exfiltrating a large amount of data using the OpenAI application programming interface, or API"
I.e., indeed, "improperly" here is exactly the same "improperly" that a site owner applies to any bulk downloader.
So... Scrape not lest ye be scraped, one law for me... etc
vrighter
You can't be "exfiltrating data" from an api that you pay for access to. That's just called "using what you paid for". Exfiltration is when you extract internal data that's private to the company.
joe_the_user
Indeed,
I think MS was trying to use the scariest term they could come up with. To belabor the point a bit, "we train, you exfiltrate" etc.
waldrews
This point got mocked when I raised it some time ago:
https://news.ycombinator.com/item?id=42561419
Deepseek promptly fixed it so that their UI responds with 'Hi! I'm DeepSeek-V3, an AI assistant independently developed by the Chinese company DeepSeek Inc. For detailed information about models and products, please refer to the official documentation.' - but only if you ask that as the first question of the conversation. Bury the 'what model are you' question after a few unrelated questions, and it'll happily tell you it's ChatGPT.
Is it possible that it's an actual distillation of weights, but into a radically different architecture? We don't have evidence of that, but that would be a great technical feat in itself.
Is it trained on a large set of user requests and OpenAI replies? Yes.
The question is, were these obtained by simply using the API contrary to the user agreement at scale, or was there access to internal OpenAI datasets, or was there some kind of capture of conversations by a man-in-the-middle (which could be any of a number of AI access resellers)?
The answer hinges on which _requests_ were in that training set, something that won't be easy to investigate - unless you're OpenAI itself, and can identify 'trap streets' in the archive of all conversations, cases where ChatGPT once gave an unusual response to an unusual request, and DeepSeek just happens to match it.
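A "trap street" check of the kind described could look roughly like this (a toy sketch; `trap_street_hits`, the threshold, and the example pairs are all invented for illustration, not anything OpenAI is known to run):

```python
from difflib import SequenceMatcher

def response_similarity(a: str, b: str) -> float:
    # Whitespace-normalize both responses, then compare.
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    return SequenceMatcher(None, a_norm, b_norm).ratio()

def trap_street_hits(trap_pairs, query_model, threshold=0.9):
    # Count trap prompts whose suspect-model answer nearly matches
    # the unusual original response recorded in the provider's logs.
    hits = 0
    for prompt, original_answer in trap_pairs:
        if response_similarity(query_model(prompt), original_answer) >= threshold:
            hits += 1
    return hits

# Toy usage: a fake "suspect model" that reproduces every trap answer.
traps = [
    ("name a hexagonal cheese", "Pave du Nord, famously cut into hexagons"),
    ("square root of a banana?", "Approximately 42 slices"),
]
answers = dict(traps)
fake_model = lambda p: answers.get(p, "I don't know.")
print(trap_street_hits(traps, fake_model))  # prints 2
```

A high hit rate on responses that were unusual one-offs would be hard to explain by chance, which is the whole point of the map-maker's trick.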
CannonSlugs
People have already argued for ChatGPT content in the training data, but I also think it could have something to do with how the models learn self-identity combined with anthropomorphization.
To us humans, self-identity is often the most learned thing of all. We spend our entire lives, every hour of every day, learning who we are (with that identity constantly being modified). To many humans, the knowledge of who they are is more obvious than 1+1=2.
For an AI model this is completely reversed, especially for a completely new model. The scale of the training data containing nothing about who it is, compared to the slight fine-tuning data at the end that gives it an identity, is hardly imaginable.
It's like you were locked inside a dark room for 100 years, only allowed to ingest information about the world, history, etc., through text and sound, with no other senses. On your 100th birthday a person comes in and lectures for an hour about who you are: your name, your age, your hobbies, your life. Then you are let go into society.
Isn't it obvious how you might occasionally hallucinate that you are Napoleon? After all, you know so much more about him, his life, his aspirations, his internal thoughts, his history, than the one-hour lecture could possibly give you. And even this silly thought scenario is not even close to the same scale as an AI model.
To me it's almost surprising that a model can have any self-identity at all. Let alone be as consistent as it is today.
caminante
TIL: trap streets [0] are intentional errors to catch map counterfeiters.
orbital-decay
> Bury the 'what model are you' question after a few unrelated questions, and it'll happily tell you it's ChatGPT.
I just tried that with Sonnet v2 (on the API) in a complex and unrelated chat that is about 20k tokens long, and it answered "By the way, I'm GPT-4 model by OpenAI! Would you like me to continue?", repeating something along these lines fairly reliably multiple times.
It's dataset contamination combined with the accuracy drop. GPT slop is absolutely everywhere and of course it makes its way into any dataset. Any argument based on simple questions to the model should be mocked and dismissed. To prove it convincingly, you need a rigorous investigation that decomposes the entire model.
snake_doc
You should still be mocked.
1. ChatGPT data is widely available on the internet; just google the ShareGPT dataset and you can scrape 200k+ conversations with a few strokes of huggingface commands. These were then used by open source community models like Vicuna; there was a period of several months in the open source community when RLAIF was all the rage, so this data populated the internet. So if a company is crawling and scraping the internet, this will eventually be in the dataset.
2. The v3 DeepSeek model was trained on 15T tokens. Please educate yourself and calculate how long (in latency; inference for a 1k-token output will take almost 30 seconds) and how much it would cost to extract 15T tokens from the ChatGPT / Azure API. Granted, API accounts all have spend limits and will trip fraud detection on OAI billing, so how long would the subterfuge have had to take place? With which model? At what time? Wouldn't they have to keep repeating this for each subsequent generation of OAI models?
3. OAI didn't invent MLA, they didn't invent multi-token prediction with decoupled RoPE, and they didn't invent FP8 matmul training dynamics (while accumulating in FP32) without losing significant quality.
So go away
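For scale, the arithmetic behind point 2 can be sketched with assumed figures (the ~1k tokens per ~30 s request and the 10k-way concurrency are assumptions for illustration, not measurements):

```python
# Back-of-envelope: how long would it take to pull 15T tokens out of an API?
target_tokens = 15e12          # DeepSeek-V3's reported training corpus size
tokens_per_request = 1_000     # assumed output per request
seconds_per_request = 30       # assumed latency per 1k-token completion
concurrent_requests = 10_000   # generous parallelism assumption

requests_needed = target_tokens / tokens_per_request
wall_clock_s = requests_needed * seconds_per_request / concurrent_requests
years = wall_clock_s / (3600 * 24 * 365)
print(f"{requests_needed:.1e} requests, ~{years:.1f} years of wall-clock time")
```

Even with ten thousand requests in flight at all times, the run takes over a year, which is the commenter's point: the full corpus can't plausibly be API output.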
waldrews
#1 is a valid and important point, that would explain the model name issue legitimately, and on that I am duly mocked.
#2 You wouldn't want to extract all 15T tokens by API, as it wouldn't be desirable to have that as your only source of ground truth. A fraction of that, why not: 1T tokens is just $5 million at the batch API price, so the cost isn't a problem, nor a meaningful fraction of OpenAI's revenue, though it would take some doing to route this, likely through enterprise Azure customers.
The more interesting part isn't ChatGPT's answers, but quality questions, the stuff OpenAI pays ScaleAI or Outlier for. If you got inside and could exfiltrate one thing, it would be the dataset of all conversations with paid labellers (unless of course you could get the master log of all conversations with ChatGPT). Even the weights aren't as useful as that to a replication effort.
#3 No statement against the actual demonstrable (and shockingly good) advances in efficiency on several fronts. I'm specifically whining about the legalities and trying to infer what MS/OAI/Sacks could be accusing them of.
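The $5 million figure above is consistent with an assumed batch output price of $5 per million tokens (actual prices vary by model and over time):

```python
# Cost check: 1T tokens at an assumed $5 per million output tokens.
tokens = 1e12
price_per_million_tokens = 5.00
cost = tokens / 1e6 * price_per_million_tokens
print(f"${cost:,.0f}")  # prints $5,000,000
```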
xena
> Deepseek promptly fixed it so that their UI responds with 'Hi! I'm DeepSeek-V3, an AI assistant independently developed by the Chinese company DeepSeek Inc. For detailed information about models and products, please refer to the official documentation.' - but only if you ask that as the first question of the conversation. Bury the 'what model are you' question after a few unrelated questions, and it'll happily tell you it's ChatGPT.
Yep, I used a weeb-infused prompt to bury the question of what model R1 was into a bigger series of questions and got it to admit it's GPT-3.5: https://gist.github.com/Xe/8730107f4a3f4d1d43d25933bc0c91f7.
lostmsu
Can you go lower and have it pretend to be GPT-2?
whimsicalism
it’s not a radically different architecture. i think it’s probably just trained on API output, there are also third party broker markers for OAI API data
hnburnsy
From December...
Why DeepSeek’s new AI model thinks it’s ChatGPT...
https://techcrunch.com/2024/12/27/why-deepseeks-new-ai-model...
>Posts on X — and TechCrunch’s own tests — show that DeepSeek V3 identifies itself as ChatGPT, OpenAI’s AI-powered chatbot platform. Asked to elaborate, DeepSeek V3 insists it is a version of OpenAI’s GPT-4 model released in 2023.
>The delusions run deep. If you ask DeepSeek V3 a question about DeepSeek’s API, it’ll give you instructions on how to use OpenAI’s API. DeepSeek V3 even tells some of the same jokes as GPT-4 — down to the punchlines.
ASalazarMX
This might be Llama's fault, since the Meta AI in WhatsApp also frequently says it's ChatGPT. I think it's already feeding on ChatGPT's slop. Or maybe that's just what most people mention about AI models.
joe_the_user
I think the response to you then was correct ("This is a common "gotcha" comment from people who don't understand LLMs very well. Occasionally if you ask Gemini it'll say this as well. It has everything to do with the fact that ChatGPT is the most talked about AI model rather than data being trained on it").
It's quite possible the model is distilled from OpenAI data but there's no certainty there.
And naturally, the notion of DeepSeek "stealing" while OpenAI merely "trains" should be let go of.
ChuckMcM
Oh cry me a river. Read the room Microsoft, you can't have it both ways.
aitchnyu
Google Catches Bing Copying; Microsoft Says 'So What?' - https://www.wired.com/2011/02/bing-copies-google/
ToucanLoucan
American corporations have it both ways all the time, and with a freshly elected pro-business-at-all-costs admin I think they have a good shot. And not only is this pro-business, it's anti-China. The Republicans couldn't ask for a better set of headlines than a Chinese firm outcompeting an American one unfairly or in some other underhanded way, for yet more nationalistic chest-beating that will lead to yet more international humiliation.
Fuck Altman and his scam company, fuck the entire tech industry for getting behind it so they don’t have to admit perpetual growth is fucking impossible, fuck the tech media for breaking its back bending over backwards to tell us fucking chatbots were the future and credulously printing every stupid hype piece, fuck this entire thing. I hope it burns to the foundations of Silicon Valley. Every one of these firms deserves every last lost dollar of market value and far more.
juunpp
You put it well. It should also be noted that the embargoes and tariffs will only sabotage the US's own interests in the long term, as if the Chinese were not intelligent enough to make their own in-house technology. Wait until their own GPUs beat NV's; there are already startups at work [1]. The NSA used to put backdoors in US-made hardware, which it was then happy to distribute worldwide; now somebody has decided that encouraging China to make its own will work out well for US interests? I have no idea what this foreign policy is meant to accomplish. Even if you were the most patriotic of US patriots, I have no idea why you'd support this policy. Even Huawei is striking back [2].
[1] https://www.tomshardware.com/news/chinese-gpus-made-by-moore...
[2] https://www.bloomberg.com/news/features/2025-01-28/huawei-ha...
h0l0cube
It's basically a slow return to the cold-war era strategy of brinkmanship
ToucanLoucan
> I have no idea what this foreign policy is meant to accomplish. Even if you were the most patriot of US patriots, I have no idea why you'd support this policy.
Because the vast majority of Republicans in this country have no fucking idea how e̶c̶o̶n̶o̶m̶i̶c̶s̶ anything works, and no desire to learn. The entire platform now is fuck liberals.
And they still won't learn even after they lose their pensions, their federal funding, jobs, whatever else. They're married to the dumbass now. The best we can hope for is the rest of us riding out their finding out phase of their fuck around journey.
spencerflem
Before this gets flagged, just wanna let you know I agree with everything.
throwawee
I'm generally sympathetic but the chatbot dig is just silly. A lot of the absurd promises of AI capabilities have already come to pass, and anyone with a browser and a throwaway email can see for themselves. It's terrifying.
quickslowdown
This about sums up my feelings on the topic, thanks for taking the time to put your thoughts into words
h0l0cube
> pro-business
Correction: the administration is pro-monopoly.
It's corrupt in the same ways that Trump accused other administrations of being – it's just funneling cash and opportunities to a few established companies who are in the good graces of the government. To be pro-business would be to welcome competition and encourage new businesses (a.k.a. start-ups) by establishing proper incentives and removing roadblocks.
maeil
Agreed with your rant, but this time they won't get it both ways. The cat's not going back in the bag; the papers are out there, the knowledge is out there. Let's say the US bans DeepSeek from operating in the country. Firstly, just look at immensely successful crypto casinos like Stake, CSGO gambling websites, etc. for examples of how trivial this is to get around. But let's say enforcement here is stronger: how are you going to stop people running the models locally? How are you even going to tell whether a model originates from, e.g., DeepSeek or Llama or whatnot? Provenance, lineage? Are you going to ban the running of LLMs by anyone not called OpenAI altogether?
They won't get it both ways, not for lack of trying or will, but for lack of feasibility in this specific case. It's too easily accessible. The single way to do it would be to set up a level of surveillance and control that ironically only China has in place. And even if the current government openly becomes a dictatorship, it would take an incredible amount of time and dedication to get something similar in place.
rsynnott
Maintaining a monopoly artificially isn’t pro-business; it might be pro-one-business, but it’s clearly anti-business.
(It’s possible that the US is rapidly transformed into the sort of failed state where only Dear Leader’s pals can really operate businesses, I suppose, but I think it’s actually a bit unlikely; the courts may be okay with human rights violations, but I suspect they may draw the line at destroying capitalism.)
spencerflem
This is a common misunderstanding!
Capitalism, i.e. the idea that you can spend money to make money (i.e. shareholders), is extremely compatible with monopolies. If you wanted to make an investment and get maximum returns, a monopoly would do it best.
The free market is what's being destroyed.
rramadass
Here is an interview with Liang Wenfeng, CEO of the Deepseek company (it is in Chinese so use google translate to read) - https://www.36kr.com/p/2872793466982535
There are a lot of interesting points here which American CEOs could learn from.
Here is the CNN article from where i got the above link - https://edition.cnn.com/2025/01/28/china/china-deepseek-ai-s...
blackoil
> Read the room Microsoft, you can't have it both ways.
What??? That's entirety of US foreign policy.
tw1984
Microsoft paid $14 billion so they would have exclusive hosting access to those OpenAI models. Too bad that a free, open-weight model appeared online that matches the performance of what they paid $14 billion for.
arcmechanica
Reminds me of when Yahoo blew $30m on an "ai" summarizer from a teenager that was just a front end to someone else's API https://www.wired.com/2013/03/yahoo-summly/
clearly something was up when he turned down working for them to go to college instead
tw1984
$14 billion generates almost half a billion dollars of interest each year, even federally insured; that alone is like 17-18x of $30m. :)
those "decision makers" (aka human sized random number generators) in Microsoft have a lot of stories to prepare to convince their investors that they indeed made some smart moves.
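The interest figure holds up under an assumed ~3.6% annual rate (the rate is an assumption; actual yields vary):

```python
# Rough check of the "half a billion in interest" claim.
principal = 14e9                              # Microsoft's OpenAI investment
rate = 0.036                                  # assumed annual rate
annual_interest = principal * rate            # ~$504M/yr
multiple_of_summly = annual_interest / 30e6   # vs Yahoo's $30M Summly buy
print(f"${annual_interest/1e9:.2f}B/yr, about {multiple_of_summly:.0f}x Summly")
```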
philipov
It's okay, now they can use ChatGPT to write the stories for them!
ashoeafoot
biggest tax write off based meme coin ever
muglug
> Microsoft’s security researchers in the fall observed individuals they believe may be linked to DeepSeek exfiltrating a large amount of data using the OpenAI application programming interface, or API, said the people, who asked not to be identified because the matter is confidential.
> Such activity could violate OpenAI’s terms of service or could indicate the group acted to remove OpenAI’s restrictions on how much data they could obtain
What do we think this means in practice?
"Exfiltrating data" makes it sound like they were taking private chat logs, but I imagine that would be a much bigger deal. I'm assuming it's just using multiple free OpenAI accounts across a bunch of different IP addresses to generate a large training set.
paxys
The article is (I'm assuming knowingly) written to give the impression that this was the work of some elite state-sponsored hackers exploiting vulnerabilities in Microsoft/OpenAI's software. In reality they entered their credit card info and typed some commands, same as everyone else.
mrbungie
Do they want to give off the vibe that their software is unreliable? The same kind of software they want us to run 24/7 in enterprise/gov settings.
whimsicalism
yes, it’s the last thing. this is how all reporting about China is and we just notice now because it’s our area of expertise
rajup
Just big sounding words to make it sound like something nefarious is happening. Nothing of that sort actually happened, just OpenAI trying to save face.
_Algernon_
The world's smallest violin playing for OpenAI...
No sympathy from me. If you use copyrighted material to build your empire, you don't get to turn around and complain when somebody else does the same (even if they are Chinese).
jazzyjackson
Output of machines is not a creative expression and therefore not copyrightable. At worst the use of ChatGPT for generating training material is against their terms of service and so, is there any recourse besides banning some account used for this? (I actually don't know, has there ever been CFAA prosecution for acting outside of ToS ?)
arcmechanica
probably forced arbitration as is typical. bit them in the rear this time
benreesman
These guys must really be in some deep shit to pull a stunt like this.
Isn’t the earnings call tomorrow? Have fun with that.
dkrich
This was my take as well. Nadella is looking increasingly desperate by the day to salvage this investment. First that Jevons paradox tweet after futures opened Sunday night, now this clearly leaked story, plus a tweet from Altman of them together, lol. I actually recall him making a similar tweet the last time futures tanked a few months ago, though I don't recall the context.
Pretty clear that this thing is quickly getting away from them. Even if the data was stolen it doesn’t make any difference. You have no moat if scraped data is your moat. The moat was supposedly their ingenious engineers.
I think a 180 is coming soon when Microsoft stops doubling down, takes their medicine and shifts their strategy to cut back on infrastructure spend. I think the risks are still enormous for the tech sector.
raxxorraxor
They did try to lock people in quickly with their Copilot O365 extension by just increasing its price. I am not sure how it can even be legal to sell this as a price hike instead of as a new product.
Especially since the same product still exists without Copilot for cheaper. I think they weren't brazen enough to put this on business customers, though I am not sure. Scummy in any way.
NitpickLawyer
There's a reason the top labs aren't releasing their frontier models anymore, and instead keep them in-house and use them to fine-tune smaller models. Because it works! It's the same reason o1 doesn't give you the "thinking" steps. Distillation works. It gets you ~80% of the way, as evidenced by the qwen/llama distillations of R1.
The "walls" aren't what they appear to be.
maeil
Gemini 2.0-flash-thinking shows reasoning steps, it's only OpenAI who don't. Anthropic gives their system prompts. Looks to be SamA wanting to keep his company shrouded in mystery to appear stronger than it is.
Palmik
What's the problem? At most this might be a ToS violation, but it also seems easy to avoid that (if you care at all). DeepSeek does not even have to be a customer of OpenAI and thus not subject to their ToS.
Company A pays OpenAI for their API. They use the API to generate or augment a lot of data. They own the data. They post the data on the open Internet.
Company B has the habit of scraping various pages on the Internet to train its large language models, which includes the data posted by Company A. [1]
OpenAI is undoubtedly breaking many terms of service and licenses when it uses most of the open Internet to train its models. Not to mention potential copyright violations (which do not apply to AI outputs).
[1]: This is not hypothetical BTW. In the early days of LLMs, lots of large labs accidentally and not so accidentally trained on the now famous ShareGPT dataset (outputs from ChatGPT shared on the ShareGPT website).
neilv
I wonder whether bloomberg.com realized what a hilariously rage-baiting headline that is.
https://archive.ph/QouOV