The impact of competition and DeepSeek on Nvidia

499 comments

·January 25, 2025

Visit

dang

Related ongoing thread:

Nvidia’s $589B DeepSeek rout - https://news.ycombinator.com/item?id=42839650 - Jan 2025 (574 comments)

pjdesno

The description of DeepSeek reminds me of my experience in networking in the late 80s - early 90s.

Back then a really big motivator for Asynchronous Transfer Mode (ATM) and fiber-to-the-home was the promise of video on demand, which was a huge market in comparison to the Internet of the day. Just about all the work in this area ignored the potential of advanced video coding algorithms, and assumed that broadcast TV-quality video would require about 50x more bandwidth than today's SD Netflix videos, and 6x more than 4K.

What made video on the Internet possible wasn't a faster Internet, although the 10-20x increase every decade certainly helped - it was smarter algorithms that used orders of magnitude less bandwidth. In the case of AI, GPUs keep getting faster, but it's going to take a hell of a long time to achieve a 10x improvement in performance per cm^2 of silicon. Vastly improved training/inference algorithms may or may not be possible (DeepSeek seems to indicate the answer is "may") but there's no physical limit preventing them from being discovered, and the disruption when someone invents a new algorithm can be nearly immediate.

AlanYx

Another aspect that reinforces your point is that the ATM push (and subsequent downfall) was not just bandwidth-motivated but also motivated by a belief that ATM's QoS guarantees were necessary. But it turned out that software improvements, notably MPLS to handle QoS, were all that was needed.

pjdesno

Nah, it's mostly just buffering :-)

Plus the cell phone industry paved the way for VOIP by getting everyone used to really, really crappy voice quality. Generations of Bell Labs and Bellcore engineers would rather have resigned than be subjected to what's considered acceptable voice quality nowadays...

WalterBright

I've noticed this when talking on the phone with someone with a significant accent.

1. it takes considerable work on my part to understand it on a cell phone

2. it's much easier on POTS

3. it's not a problem on VOIP

4. no issues in person

With all the amazing advances in cell phones, the voice quality of cellular is stuck in the 90's.

hedgehog

Yes, I think most video on the Internet is HLS and similar approaches which are about as far from the ATM circuit-switching approach as it gets. For those unfamiliar HLS is pretty much breaking the video into chunks to download over plain HTTP.

nyarlathotep_

>> Plus the cell phone industry paved the way for VOIP by getting everyone used to really, really crappy voice quality

What accounts for this difference? Is there something inherently worse about the nature of cell phone infrastructure over land-line use?

I'm totally naive on such subjects.

I'm just old enough to remember landlines being widespread, but nearly all of my phone calls have been via cell since the mid 00s, so I can't judge quality differences given the time that's passed.

tlb

And memory. In the heyday of ATM (late 90s) a few megabytes was quite expensive for a set-top box, so you couldn't buffer many seconds of compressed video.

Also, the phone companies had a pathological aversion to understanding Moore's law, because it suggested they'd have to charge half as much for bandwidth every 18 months. Long distance rates had gone down more like 50%/decade, and even that was too fast.

accra4rx

Love those analogies . This is one of main reason I love hacker news / reddit . Honest golden experiences

vFunct

I worked on a network that used a protocol very similar to ATM (actually it was the first Iridium satellite network). An internet based on ATM would have been amazing. You’re basically guaranteeing a virtual switched circuit, instead of the packets we have today. The horror of packet switching is all the buffering it needs, since it doesn’t guarantee circuits.

Bandwidth is one thing, but the real benefit is that ATM also guaranteed minimal latencies. You could now shave off another 20-100ms of latency for your FaceTime calls, which is subtle but game changing. Just instant-on high def video communications, as if it were on closed circuits to the next room.

For the same reasons, the AI analogy could benefit from both huge processing as well as stronger algorithms.

lxgr

> You’re basically guaranteeing a virtual switched circuit

Which means you need state (and the overhead that goes with it) for each connection within the network. That's horribly inefficient, and precisely the reason packet-switching won.

> An internet based on ATM would have been amazing.

No, we'd most likely be paying by the socket connection (as somebody has to pay for that state keeping overhead), which sounds horrible.

> You could now shave off another 20-100ms of latency for your FaceTime calls, which is subtle but game changing.

Maybe on congested Wi-Fi (where even circuit switching would struggle) or poorly managed networks (including shitty ISP-supplied routers suffering from horrendous bufferbloat). Definitely not on the majority of networks I've used in the past years.

> The horror of packet switching is all the buffering it needs [...]

The ideal buffer size is exactly the bandwidth-delay product. That's really not a concern these days anymore. If anything, buffers are much too large, causing unnecessary latency; that's where bufferbloat-aware scheduling comes in.

vFunct

The cost for interactive video would be a requirement of 10x bandwidth, basically to cover idle time. Not efficient but not impossible, and definitely wouldn’t change ISP business models.

The latency benefit would outweigh the cost. Just absolutely instant video interaction.

pjdesno

Man, I saw a presentation on Iridium when I was at Motorola in the early 90s, maybe 92? Not a marketing presentation - one where an engineer was talking, and had done their own slides.

What I recall is that it was at a time when Internet folks had made enormous advances in understanding congestion behavior in computer networks, and other folks (e.g. my division of Motorola) had put a lot of time into understanding the limited burstiness you get with silence suppression for packetized voice, and these folks knew nothing about it.

thijson

I remember my professor saying how the fixed packet size in ATM (53 bytes) was a committee compromise. North America wanted 64 bytes, Europe wanted 32 bytes. The committee chose around the midway point.

wtallis

53 byte frames is what results in the exact compromise of 48 bytes for the payload size.

richbhanover

> ... guaranteed minimal latencies. You could now shave off another 20-100ms of latency for your FaceTime calls...

I already do this. But I cheat - I use a good router (OpenWrt One) that has built-in controls for Bufferbloat. See [How OpenWrt Vanquishes Bufferbloat](https://forum.openwrt.org/t/how-openwrt-vanquishes-bufferblo...)

eru

> The horror of packet switching is all the buffering it needs, since it doesn’t guarantee circuits.

You don't actually need all that much buffering.

Buffer bloat is actually a big problem with conventional TCP. See eg https://news.ycombinator.com/item?id=14298576

aurareturn

Doesn’t your point about video compression tech support Nvidia’s bull case?

Better video compression led to an explosion in video consumption on the Internet, leading to much more revenue for companies like Comcast, Google, T-Mobile, Verizon, etc.

More efficient LLMs lead to much more AI usage. Nvidia, TSMC, etc will benefit.

onlyrealcuzzo

No - because this eliminates entirely or shifts the majority of work from GPU to CPU - and Nvidia does not sell CPUs.

If the AI market gets 10x bigger, and GPU work gets 50% smaller (which is still 5x larger than today) - but Nvidia is priced on 40% growth for the next ten years (28x larger) - there is a price mismatch.

It is theoretically possible for a massive reduction in GPU usage or shift from GPU to CPU to benefit Nvidia if that causes the market to grow enough - but it seems unlikely.

Also, I believe (someone please correct if wrong) DeepSeek is claiming a 95% overall reduction in GPU usage compared to traditional methods (not the 50% in the example above).

If true, that is a death knell for Nvidia's growth story after the current contracts end.

munksbeer

I can see close to zero possibility that the majority of the work will be shifted to the CPU. Anything a CPU can do can just be done better with specialised GPU hardware.

e_y_

On desktop, CPU decoding is passable but it's still better to have a graphics card for 4K. On mobile, you definitely want to stick to codecs like H264/HEVC/AVC1 that are supported in your phone's decoder chips.

CPU chipsets have borrowed video decoder units and SSE instructions from GPU-land, but the idea that video decoding is a generic CPU task now is not really true.

Now maybe every computer will come with an integrated NPU and it won't be made by Nvidia, although so far integrated GPUs haven't supplanted discrete ones.

I tend to think today's state-of-the-art models are ... not very bright, so it might be a bit premature to say "640B parameters ought to be enough for anybody" or that people won't pay more for high-end dedicated hardware.

chpatrick

That's just factually wrong, DeepSeek is still terribly slow on CPUs. There's nothing different about how it works numerically.

aurareturn

  No - because this eliminates entirely or shifts the majority of work from GPU to CPU - and Nvidia does not sell CPUs.

I'm not even sure how to reply to this. GPUs are fundamentally much more efficient for AI inference than CPUs.

mandevil

It lead to more revenue for the industry as a whole. But not necessarily for the individual companies that bubbled the hardest: Cisco stock is still to this day lower than it was at peak in 2000, to point to a significant company that sold actual physical infra products necessary for the internet and still around and profitable to this day. (Some companies that bubbled did quite well, AMZN is like 75x from where it was in 2000. But that's a totally different company that captured an enormous amount of value from AWS that was not visible to the market in 2000, so it makes sense.)

If stock market-cap is (roughly) the market's aggregated best guess of future profits integrated over all time, discounted back to the present at some (the market's best guess of the future?) rate, then increasing uncertainty about the predicted profits 5-10 years from now can have enormous influence on the stock. Does NVDA have an AWS within it now?

aurareturn

>It lead to more revenue for the industry as a whole. But not necessarily for the individual companies that bubbled the hardest: Cisco stock is still to this day lower than it was at peak in 2000, to point to a significant company that sold actual physical infra products necessary for the internet and still around and profitable to this day. (Some companies that bubbled did quite well, AMZN is like 75x from where it was in 2000. But that's a totally different company that captured an enormous amount of value from AWS that was not visible to the market in 2000, so it makes sense.)

Cisco in 1994: $3.

Cisco after dotcom bubble: $13.

So is Nvidia's stock price closer to 1994 or 2001?

vFunct

I agree that advancements like DeepSeek, like transformer models before it, is just going to end up increasing demand.

It’s very shortsighted to think we’re going to need fewer chips because the algorithms got better. The system became more efficient, which causes induced demand.

eru

It will increase the total volume demanded, but not necessarily the amount of value that companies like NVidia can capture.

Most likely, consumer surplus has gone up.

diamond559

More demand for what, chatbots? ai slop? buggy code?

floatrock

obligatory https://en.wikipedia.org/wiki/Jevons_paradox

pjdesno

No, it doesn't.

Not only are 10-100x changes disruptive, but the players who don't adopt them quickly are going to be the ones who continue to buy huge amounts of hardware to pursue old approaches, and it's hard for incumbent vendors to avoid catering to their needs, up until it's too late.

When everyone gets up off the ground after the play is over, Nvidia might still be holding the ball but it might just as easily be someone else.

fspeech

If you normalize Nvidia's gross margin and take into account of competitors sure. But its current high margin is driven by Big Tech FOMO. Do keep in mind that 90% margin or 10x cost to 50% margin or 2x cost is a 5x price reduction.

aurareturn

So why would DeepSeek decrease FOMO? It should increase it if anything.

snailmailstare

It improves TSMC' case.. Paying Nvidia would be like paying Cray for every smartphone that is faster than a supercomputer of old.

9rx

Yes, over the long haul, probably. But as far as individual investors go they might not like that Nvidia.

Anyone currently invested is presumably in because they like the insanely high profit margin, and this is apt to quash that. There is now much less reason to give your first born to get your hands on their wares. Comcast, Google, T-Mobile, Verizon, etc., and especially those not named Google, have nothingburger margins in comparison.

If you are interested in what they can do with volume, then there is still a lot of potential. They may even be more profitable on that end than a margin play could ever hope for. But that interest is probably not from the same person who currently owns the stock, it being a change in territory, and there is apt to be a lot of instability as stock changes hands from the one group to the next.

eru

> Anyone currently invested is presumably in because they like the insanely high profit margin, [...]

I'm invested in Nvidia because it's part of the index that my ETF is tracking. I have no clue what their profit margins are.

TheCondor

It seems more stark even. The energy costs that are current and then projected for AI are staggering. At the same time, I think it has been MS that has been publishing papers on LLMs that are smaller (so called small language models) but more targeted and still achieving a fairly high "accuracy rate."

Didn't TMSC say that SamA came for a visit and said they needed $7T in investment to keep up with the pending demand needs.

This stuff is all super cool and fun to play with, I'm not a nay sayer but it almost feels like these current models are "bubble sort" and who knows how it will look if "quicksort" for them becomes invented.

TMWNN

>but there's no physical limit preventing them from being discovered, and the disruption when someone invents a new algorithm can be nearly immediate.

The rise of the net is Jevons paradox fulfilled. The orders of magnitude less bandwidth needed per cat video drove much more than that in overall growth in demand for said videos. During the dotcom bubble's collapse, bandwidth use kept going up.

Even if there is a near-term bear case for NVDA (dotcom bubble/bust), history indicates a bull case for the sector overall and related investments such as utilities (the entire history of the tech sector from 1995 to today).

lokar

Another example: people like to cite how the people who really made money in the CA gold rush were selling picks and shovels.

That only lasted so long. Then it was heavy machinery (hydraulics, excavators, etc)

tuna74

I always like the "look" of high bit rate Mpeg2 video. Download HD japanese TV content from 2005-2010 and it still looks really good.

breadwinner

Great article but it seems to have a fatal flaw.

As pointed out in the article, Nvidia has several advantages including:

   - Better Linux drivers than AMD
   - CUDA
   - pytorch is optimized for Nvidia
   - High-speed interconnect

Each of the advantages is under attack:

   - George Hotz is making better drivers for AMD
   - MLX, Triton, JAX: Higher level abstractions that compile down to CUDA
   - Cerbras and Groq solve the interconnect problem

The article concludes that NVIDIA faces an unprecedented convergence of competitive threats. The flaw in the analysis is that these threats are not unified. Any serious competitor must address ALL of Nvidia's advantages. Instead Nvidia is being attacked by multiple disconnected competitors, and each of those competitors is only attacking one Nvidia advantage at a time. Even if each of those attacks are individually successful, Nvidia will remain the only company that has ALL of the advantages.

toisanji

I want the NVIDIA monopoly to end, but there is no real competition still. * George Hotz has basically given up on AMD: https://x.com/__tinygrad__/status/1770151484363354195

* Groq can't produce more hardware past their "demo". It seems like they haven't grown capacity in the years since they announced, and they switched to a complete SaaS model and don't even sell hardware anymore.

* I dont know enough about MLX, Triton, and JAX,

billconan

I also noticed that Groq's Chief Architect now works for NVIDIA.

https://research.nvidia.com/person/dennis-abts

simonw

That George Hotz tweet is from March last year. He's gone back and forth on AMD a bunch more times since then.

roland35

The same Hotz who lasted like 4 weeks at Twitter after announcing that he'd fix everything? It doesn't really inspire a ton of confidence that he can single handedly take down Nvidia...

bdangubic

is that good or bad?

bfung

It looks like he’s close to having own AMD stack, tweet linked in the article, Jan 15,2025: https://x.com/__tinygrad__/status/1879615316378198516

htrp

We'll check in again with him in 3 months and he'll still be just 1 piece away.

saagarjha

$1000 bounty? That's like 2 hours of development time at market rate lol

epolanski

> Any serious competitor must address ALL of Nvidia's advantages.

Not really, his article focuses on Nvidia's being valued so highly by stock markets, he's not saying that Nvidia's destined to lose its advantage in the space in the short term.

In any case, I also think that the likes of MSFT/AMZN/etc will be able to reduce their capex spending eventually by being able to work on a well integrated stack on their own.

madaxe_again

They have an enormous amount of catching up to do, however; Nvidia have created an entire AI ecosystem that touches almost every aspect of what AI can do. Whatever it is, they have a model for it, and a framework and toolkit for working with or extending that model - and the ability to design software and hardware in lockstep. Microsoft and Amazon have a very diffuse surface area when it comes to hardware, and being a decent generalist doesn’t make you a good specialist.

Nvidia are doing phenomenal things with robotics, and that is likely to be the next shoe to drop, and they are positioned for another catalytic moment similar to that which we have seen with LLMS.

I do think we will see some drawback or at least deceleration this year while the current situation settles in, but within the next three years I think we will see humanoid robots popping up all over the place, particularly as labour shortages arise due to political trends - and somebody is going to have to provide the compute, both local and cloud, and the vision, movement, and other models. People will turn to the sensible and known choice.

So yeah, what you say is true, but I don’t think is going to have an impact on the trajectory of nvidia.

dralley

>So how is this possible? Well, the main reasons have to do with software— better drivers that "just work" on Linux and which are highly battle-tested and reliable (unlike AMD, which is notorious for the low quality and instability of their Linux drivers)

This does not match my experience from the past ~6 years of using AMD graphics on Linux. Maybe things are different with AI/Compute, I've never messed with that, but in terms of normal consumer stuff the experience of using AMD is vastly superior than trying to deal with Nvidia's out-of-tree drivers.

saagarjha

They are.

Herring

He's setting up a case for shorting the stock, ie if the growth or margins drop a little from any of these (often well-funded) threats. The accuracy of the article is a function of the current valuation.

eigenvalue

Exactly. You just need to see a slight deceleration in projected revenue growth (which has been running 120%+ YoY recently) and some downward pressure on gross margins, and maybe even just some market share loss, and the stock could easily fall 25% from that.

breadwinner

AMD P/E ratio is 109, NVDA is 56. Which stock is overvalued?

null

[deleted]

2-3-7-43-1807

> The accuracy of the article is a function of the current valuation.

ah ... no ... that's nonsense trying to hide behind stilted math lingo.

null

[deleted]

csomar

> - Better Linux drivers than AMD

Unless something radically changed in the last couple years, I am not sure where you got this from? (I am specifically talking about GPUs for computer usage rather than training/inference)

idonotknowwhy

> Unless something radically changed in the last couple years, I am not sure where you got this from?

This was the first thing that stuck out to me when I skimmed the article, and the reason I decided to invest the time reading it all. I can tell the author knows his shit and isn't just parroting everyone's praise for AMD Linux drivers.

> (I am specifically talking about GPUs for computer usage rather than training/inference)

Same here. I suffered through the Vega 64 after everyone said how great it is. So many AMD-specific driver bugs, AMD driver devs not wanting to fix them for non-technical reasons, so many hard-locks when using less popular software.

The only complaints about Nvidia drivers I found were "it's proprietary" and "you have to rebuild the modules when you update the kernel" or "doesn't work with wayland".

I'd hesitate to ever touch an AMD GPU again after my experience with it, haven't had a single hick-up for years after switching to Nvidia.

cosmic_cheese

Another ding against Nvidia for Linux desktop use is that only some distributions either make it easy to install and keep the proprietary drivers updated (e.g. Ubuntu) and/or ship variants with the proprietary drivers preinstalled (Mint, Pop!_OS, etc).

This isn’t a barrier for Linux veterans but it adds significant resistance for part-time users, even those that are technically inclined, compared to the “it just works” experience one gets with an Intel/AMD GPU under just about every Linux distro.

csomar

Wayland was a requirement for me. I've used an AMD GPU for years. I had a bug exactly once with a linux update. But has been stable since.

fragmede

they are, unless you get distracted by things like licensing and out of tree drivers and binary blobs. If you'd rather pontificate about open source philosophy and rights than get stuff done, go right ahead.

litigator

Check out Anthonix on Twitter. He's already done what George Hotz is trying to do and he did it months ago. He's moved on from the RX 7900 XTX to MI300X and is setting some records. He had to write the majority of the code by himself but kept some of ROCm he deemed fit. He is always stirring George up when he has his AMD tantrums. Seriously though, how bad are AMD engineers if one person in their free time can make a custom stack that out performs ROCm.

aorloff

The unification of the flaws is the scarcity of H100s

He says this and talks about it in The Fallout section - even at BigCos with megabucks the teams are starved for time on the Nvidia chips and if these innovations work other teams will use them and then boom Nvidia's moat is truncated somehow which doesn't look good at such lofty multiples

isatty

Sorry, I don’t know who George Hotz is, but why isn’t AMD making better drivers for AMD?

adastra22

George Hotz is a hot Internet celebrity that has basically accomplished nothing of value but has a large cult following. You can safely ignore.

(Famous for hacking the PS3–except he just took credit for a separate group’s work. And for making a self-driving car in his garage—except oh wait that didn’t happen either.)

medler

He took an “internship” at Twitter/X with the stated goal of removing the login wall, apparently failing to realize that the wall was a deliberate product decision, not a technical challenge. Now the X login wall is more intrusive than ever.

xuki

He was famous before the PS3 hack, he was the first person to unlock the original iPhone.

Den_VR

You’re not wrong, but after all these years it’s fair to give benefit of the doubt - geohot may have grown as a person. The PS3 affair was incredibly disappointing.

sebmellen

Comma.ai works really well. I use it every day in my car.

hshshshshsh

What about comma.ai?

fairity

DeepSeek just further reinforces the idea that there is a first-move disadvantage in developing AI models.

When someone can replicate your model for 5% of the cost in 2 years, I can only see 2 rational decisions:

1) Start focusing on cost efficiency today to reduce the advantage of the second mover (i.e. trade growth for profitability)

2) Figure out how to build a real competitive moat through one or more of the following: economies of scale, network effects, regulatory capture

On the second point, it seems to me like the only realistic strategy for companies like OpenAI is to turn themselves into a platform that benefits from direct network effects. Whether that's actually feasible is another question.

aurareturn

This is wrong. First mover advantage is strong. This is why OpenAI is much bigger than Mixtral despite what you said.

First mover advantage acquired and keeps subscribers.

No one really cares if you matched GPT4o one year later. OpenAI has had a full year to optimize the model, build tools around the model, and used the model to generate better data for their next generation foundational model.

dplgk

What is OpenAI's first-mover moat? I switched to Claude with absolutely no friction or moat-jumping.

xxpor

What is Google's first mover moat? I switched to Bing/DuckDuckGo with absolutely no friction or moat jumping.

Brands are incredibly powerful when talking about consumer goods.

moralestapia

*sigh*

This broken record again.

Just observe reality. OpenAI is leading, by far.

All these "OpenAI has no moat" arguments will only make sense whenever there's a material, observable (as in not imaginary), shift on their market share.

roncesvalles

>What is OpenAI's first-mover moat?

The same one that underpins the entire existence of a little company called Spotify: I'm just too lazy to cancel my subscription and move to a newer player.

aurareturn

OpenAI has a lot more revenue than Claude.

Late in 2024, OpenAI had $3.7b in revenue. Meanwhile, Claude’s mobile app hit $1 million in revenue around the same time.

pradn

Brand - it's the most powerful first-mover advantage in this space.

ChatGPT is still vastly more popular than other, similar chat bots.

kpennell

almost everyone I know is the same. 'Claude seems to be better and can take more data' is what I hear a lot.

ransom1538

I moved 100% over to deepseek. No switch cost. Zero.

One moat will eventually come in the form of personal knowledge about you - consider talking with a close friend of many years vs a stranger

itissid

OpenAI does not have a business model that is cashflow positive at this point and/or a product that gives them a significant leg up in the same moat sense Office/Teams might give to Microsoft.

aurareturn

Companies in the mobile era took a decade or more to become profitable. For example, Uber and Airbnb.

Why do you expect OpenAI to become profitable after 3 years of chatgpt?

lxgr

> First mover advantage acquired and keeps subscribers.

Does it? As a chat-based (Claude Pro, ChatGPT Plus etc.) user, LLMs have zero stickiness to me right now, and the APIs hardly can be called moats either.

distances

If it's for mass consumer market then it does matter. Ask any non-technical person around you. High chance is that they know ChatGPT but can't name a single other AI model or service. Gemini, just a distant maybe. Claude, definitely not -- I'm positive I'm hard pressed to find anyone in my technical friends who knows about Claude.

jaynate

They also burnt a hell of a lot more cash. That’s a disadvantage.

Mistletoe

I feel like AI tech just reverse scales and reverse flywheels, unlike the tech giant walls and moats now, and I think that is wonderful. OpenAI has really never made sense from a financial standpoint and that is healthier for humans. There’s no network effect because there’s no social aspect to AI chatbots. I can hop on DeepSeek from Google Gemini or OpenAI at ease because I don’t have to have friends there and/or convince them to move. AI is going to be a race to the bottom that keeps prices low to zero. In fact I don’t know how they are going to monetize it at all.

tw1984

> DeepSeek just further reinforces the idea that there is a first-move disadvantage in developing AI models.

you are assuming that what DeepSeek achieved can be reasonably easily replicated by other companies. then the question is when all big techs and tons of startups in China and the US are involved, how come none of those companies succeeded?

deepseek is unique.

11101010001100

Deepseek is unique, but the US has consistently underestimated Chinese R&D, which is not a winning strategy in iterated games.

rightbyte

There seem to be a 100 fold uptick in jingoists in the last 3-4 years which makes my head hurt but I think there is no consistent "underestimation" in academic circles? I think I have read articles about the up and coming Chinese STEM for like 20 years.

corimaith

That doesn't the calculus regarding the actions you would pick externally, in fact it only strengthens the point for increased tech restrictions and more funding.

rightbyte

Unique, ye, but isn't their method open? I read something about a group replicating a smaller variant of their main model.

ghostzilla

Which brings the question, if LLMs are an asset of such strategic value, why did China allow the DeepSeek to be released?

I see two possibilities here, either that the CCP is not that all-reaching as we think, or that the value of the technology isn't critical, and that the release was further cleared with the CCP and maybe even timed to come right after Trump's announcement of American AI supremacy.

jerjerjer

We have one success after ~two years of ChatGPT hype (and therefore subsequent replication attempts). That's as fast as it gets.

null

[deleted]

boringg

Your making some big assumptions projecting into the future. One that deepseek takes market position, two that the information they have released is honest regarding training usage, spend etc.

Theres a lot more still to unpack and I don’t expect this to stay solely in the tech realm. Seems to politically sensitive.

meiraleal

DeepSeek is profitable, openai is not. That big expensive moat won't help much when the competition knows how to fly.

aurareturn

DeepSeek is not profitable. As far as I know, they don’t have any significant revenue from their models. Meanwhile, OpenAI has $3.7b in revenue last reported and has high gross margins.

meiraleal

tell that to the stock market then, it might change the graph direction back to green.

WiSaGaN

Deepseek inference API has positive margin. This however does not take into account R&D like salary and training cost. I believe OpenAI is the same in these aspects, at least before now.

UncleOxidant

Even if DeepSeek has figured out how to do more (or at least as much) with less, doesn't the Jevons Paradox come into play? GPU sales would actually increase because even smaller companies would get the idea that they can compete in a space that only 6 months ago we assumed would be the realm of the large mega tech companies (the Metas, Googles, OpenAIs) since the small players couldn't afford to compete. Now that story is in question since DeepSeek only has ~200 employees and claims to be able to train a competitive model for about 20X less than the big boys spend.

samvher

My interpretation is that yes in the long haul, lower energy/hardware requirements might increase demand rather than decrease it. But right now, DeepSeek has demonstrated that the current bottleneck to progress is _not_ compute, which decreases the near term pressure on buying GPUs at any cost, which decreases NVIDIA's stock price.

kemiller

Short term, I 100% agree, but remains to be seen what "short" means. According to at least some benchmarks, Deepseek is two full orders of magnitude cheaper for comparable performance. Massive. But that opens the door for much more elaborate "architectures" (chain of thought, architect/editor, multiple choice) etc, since it's possible to run it over and over to get better results, so raw speed & latency will still matter.

groby_b

I think it's worth carefully pulling apart _what_ DeepSeek is cheaper at. It's somewhat cheaper at inference (0.3 OOM), and about 1-1.5 OOM cheaper for training (Inference costs: https://www.latent.space/p/reasoning-price-war)

It's also worth keeping in mind that depending on benchmark, these values change (and can shrink quite a bit)

And it's also worth keeping in mind that the drastic drop in training cost(if reproducible) will mean that training is suddenly affordable for a much larger number of organizations.

I'm not sure the impact on GPU demand will be as big as people assume.

yifanl

It does, but proving that it can be done with cheaper (and more importantly for NVidia), lower margin chips breaks the spell that NVidia will just be eating everybody's lunch until the end of time.

aurareturn

If demand for AI chips will increase due to Jevon’s paradox, why would Nvidia’s chips become cheaper?

In the long run, yes, they will be cheaper due to more competition and better tech. But next month? It will be more expensive.

yifanl

The usage of existing but cheaper nvidia chips to make models of similar quality is the main takeaway.

It'll be much harder to convince people to buy the latest and greatest with this out there.

tedunangst

Selling 100 chips for $1 profit is less profitable than selling 20 chips for $10 profit.

HDThoreaun

Margin only goes down if a competitor shows up. Getting more "performance" per chip will actually let nvidia raise prices even more if they want.

deadbabe

Since you no longer need CUDA, AMD becomes a new viable option.

gamblor956

Important to note: the $5 million alleged cost is just the cpu compute cost for the final version of the model; it's not the cumulative cost of the research to date.

The analogous costs would be what OpenAI spent to go from GPT 4 to GPT 4o (i.e., to develop the reasoning model from the most up-to-date LLM model). $5 million is still less than what OpenAI spent but it's not a magnitude lower. (OpenAI spent up to $100 million on GPT4 but a fraction of that to get GPT 4o. Will update comment if I can find numbers for 4o before edit window closes)

fspeech

It doesn't make sense to compare individual models. A better way is to look at total compute consumed, normalized by the output. In the end what counts is the cost of providing tokens.

hodder

Jevons paradox isn't some iron law like gravity.

trgn

feels like it is in tech. any gains in hardware or algorithm advance, immediately get consumed by increase in data retention and software bloat.

fspeech

But why would the customers accept the high prices and high gross margin of Nvidia if they no longer fear missing out with insufficient hardware?

colinnordin

Great article.

>Now, you still want to train the best model you can by cleverly leveraging as much compute as you can and as many trillion tokens of high quality training data as possible, but that's just the beginning of the story in this new world; now, you could easily use incredibly huge amounts of compute just to do inference from these models at a very high level of confidence or when trying to solve extremely tough problems that require "genius level" reasoning to avoid all the potential pitfalls that would lead a regular LLM astray.

I think this is the most interesting part. We always knew a huge fraction of the compute would be on inference rather than training, but it feels like the newest developments is pushing this even further towards inference.

Combine that with the fact that you can run the full R1 (680B) distributed on 3 consumer computers [1].

If most of NVIDIAs moat is in being able to efficiently interconnect thousands of GPUs, what happens when that is only important to a small fraction of the overall AI compute?

[1]: https://x.com/awnihannun/status/1883276535643455790

tomrod

Conversely, how much larger can you scale if frontier models only currently need 3 consumer computers?

Imagine having 300. Could you build even better models? Is DeepSeek the right team to deliver that, or can OpenAI, Meta, HF, etc. adapt?

Going to be an interesting few months on the market. I think OpenAI lost a LOT in the board fiasco. I am bullish on HF. I anticipate Meta will lose folks to brain drain in response to management equivocation around company values. I don't put much stock into Google or Microsoft's AI capabilities, they are the new IBMs and are no longer innovating except at obvious margins.

stormfather

Google is silently catching up fast with Gemini. They're also pursuing next gen architectures like Titan. But most importantly, the frontier of AI capabilities is shifting towards using RL at inference (thinking) time to perform tasks. Who has more data than Google there? They have a gargantuan database of queries paired with subsequent web nav, actions, follow up queries etc. Nobody can recreate this, Bing failed to get enough marketshare. Also, when you think of RL talent, which company comes to mind? I think Google has everyone checkmated already.

shwaj

Can you say more about using RL at inference time, ideally with a pointer to read more about it? This doesn’t fit into my mental model, in a couple of ways. The main way is right in the name: “learning” isn’t something that happens at inference time; inference is generating results from already-trained models. Perhaps you’re conflating RL with multistage (e.g. “chain of thought”) inference? Or maybe you’re talking about feeding the result of inference-time interactions with the user back into subsequent rounds of training? I’m curious to hear more.

_DeadFred_

How quickly the narrative went from 'Google silently has the most advanced AI but they are afraid to release it' to 'Google is silently catching up' all using the same 'core Google competencies' to infer Google's position of strength. Wonder what the next lower level of Google silently leveraging their strength will be?

moffkalast

Never underestimate Google's ability to fall flat on their face when it comes to shipping products.

onlyrealcuzzo

If you watch this video, it explains well what the major difference is between DeepSeek and existing LLMs: https://www.youtube.com/watch?v=DCqqCLlsIBU

It seems like there is MUCH to gain by migrating to this approach - and it theoretically should not cost more to switch to that approach than vs the rewards to reap.

I expect all the major players are already working full-steam to incorporate this into their stacks as quickly as possible.

IMO, this seems incredibly bad to Nvidia, and incredibly good to everyone else.

I don't think this seems particularly bad for ChatGPT. They've built a strong brand. This should just help them reduce - by far - one of their largest expenses.

They'll have a slight disadvantage to say Google - who can much more easily switch from GPU to CPU. ChatGPT could have some growing pains there. Google would not.

wolfhumble

> I don't think this seems particularly bad for ChatGPT. They've built a strong brand. This should just help them reduce - by far - one of their largest expenses.

Often expenses like that are keeping your competitors away.

tomrod

That is a fantastic video, BTW.

danaris

This assumes no (or very small) diminishing returns effect.

I don't pretend to know much about the minutiae of LLM training, but it wouldn't surprise me at all if throwing massively more GPUs at this particular training paradigm only produces marginal increases in output quality.

tomrod

I believe the margin to expand is on CoT, where tokens can grow dramatically. If there is value in putting more compute towards it, there may still be returns to be captured on that margin.

simpaticoder

>Imagine having 300.

Would it not be useful to have multiple independent AIs observing and interacting to build a model of the world? I'm thinking something roughly like the "councelors" in the Civilization games, giving defense/economic/cultural advice, but generalized over any goal-oriented scenario (and including one to take the "user" role). A group of AIs with specific roles interacting with each other seems like a good area to explore, especially now given the downward scalability of LLMs.

JoshTko

This is exactly where Deepseeks enhancements come into play. Essentially deepseek lets the model think out loud via chain of thought (o1 and Claude also do this) but DS also does not supervise the chain of thought, and simply rewards CoT that get the answer correctly. This is just one of the half dozen training optimization that Deepseek has come up with.

tomrod

Yes; to my understanding that is MoE.

tw1984

> If most of NVIDIAs moat is in being able to efficiently interconnect thousands of GPUs

nah. it moat is CUDA and millions of devs using CUDA aka the ecosystem

mupuff1234

But if it's not combined with super high end chips with massive margins that moat is not worth anywhere close to 3T USD.

ReptileMan

And then some chineese startup create an amazing compiler that takes cuda and moves it to X (AMD, Intel, Asic) and we are back at square one.

So far it seems that the best investment is in RAM producers. Unlike compute the ram requirements seem to be stubborn.

01100011

Don't forget that "CUDA" involves more than language constructs and programming paradigms.

With NVDA, you get tools to deploy at scale, maximize utilization, debug errors and perf issues, share HW between workflows, etc. These things are not cheap to develop.

a_wild_dandan

Running a 680-billion parameter frontier model on a few Macs (at 13 tok/s!) is nuts. That'a two years after ChatGPT was released. That rate of progress just blows my mind.

qingcharles

And those are M2 Ultras. M4 Ultra is about to drop in the next few weeks/months, and I'm guessing it might have higher RAM configs, so you can probably run the same 680b on two of those beasts.

The higher performing chips, with one less interconnect, is going to give you significantly higher t/s.

bn-l

Link has all the params but running at 4 bit quant.

qingcharles

4-bit quant is generally kinda low, right?

I wonder how badly this quant affects the output on DeepSeek?

neuronic

> NVIDIAs moat

Offtopic, but your comment finally pushed me over the edge to semantic satiation [1] regarding the word "moat". It is incredible how this word turned up a short while ago and now it seems to be a key ingredient of every second comment.

[1] https://en.wikipedia.org/wiki/Semantic_satiation

mikestew

It is incredible how this word turned up a short while ago…

I’m sure if I looked, I could find quotes from Warren Buffet (the recognized originator of the term) going back a few decades. But your point stands.

kccqzy

The earliest occurrence of the word "moat" that I could find online from Buffett is from 1986: https://www.berkshirehathaway.com/letters/1986.html That shareholder letter is charmingly old-school.

Unfortunately letters before 1977 weren't available online so I wasn't able to search.

It also helps that I've been to several cities with an actual moat so this word is familiar to me.

mikeyouse

Yeah, he's been talking about "economic moats" since at least the 1990s. At least since 1995;

https://www.berkshirehathaway.com/letters/1995.html

fastasucan

The word moat was first used in english in the 15th century https://www.merriam-webster.com/dictionary/moat

neuronic

Yes my wording was rubbish I should have said "tuned up" in the HN bubble. Quick ctrl-f shows 35 uses in this thread without loading all comments.

I did not mean that it was literally invented a short while ago - a few months ago I had to look up what it means though (not native English).

cwmoore

https://en.wikipedia.org/wiki/Frequency_illusion

ljw1004

I'm struggling to understand how a moat can have a CRACK in it.

nateglims

perhaps if the moat is kept in place by some sort of berm or quay

simonw

This is excellent writing.

Even if you have no interest at all in stock market shorting strategies there is plenty of meaty technical content in here, including some of the clearest summaries I've seen anywhere of the interesting ideas from the DeepSeek v3 and R1 papers.

eigenvalue

Thanks Simon! I’m a big fan of your writing (and tools) so it means a lot coming from you.

punkspider

I was excited as soon as I saw the domain name. Even after a few months, this article[1] is still at the top of my mind. You have a certain way of writing.

I remember being surprised at first because I thought it would feel like a wall of text. But it was such a good read and I felt I gained so much.

1: https://youtubetranscriptoptimizer.com/blog/02_what_i_learne...

nejsjsjsbsb

I was put off by the domain by bias against something that sounds like a company blog. Especially a "YouTube something".

You may get more milage from excellent writing on a yourname.com. This is a piece that sells you not this product, plus it feels more timeless. In 2050 someone my point to this post. Better if it were on your own name.

eigenvalue

I really appreciate that, thanks so much!

dabeeeenster

Many thanks for writing this - its extremely interesting and very well written - I feel like I've been brought up to date which is hard in AI world!

andrewgross

> The beauty of the MOE model approach is that you can decompose the big model into a collection of smaller models that each know different, non-overlapping (at least fully) pieces of knowledge.

I was under the impression that this was not how MoE models work. They are not a collection of independent models, but instead a way of routing to a subset of active parameters at each layer. There is no "expert" that is loaded or unloaded per question. All of the weights are loaded in VRAM, its just a matter of which are actually loaded to the registers for calculation. As far as I could tell from the Deepseek v3/v2 papers, their MoE approach follows this instead of being an explicit collection of experts. If thats the case, theres no VRAM saving to be had using an MOE nor an ability to extract the weights of the expert to run locally (aside from distillation or similar).

If there is someone more versed on the construction of MoE architectures I would love some help understanding what I missed here.

Kubuxu

Not sure about DeepSeek R1, but you are right in regards to previous MoE architectures.

It doesn’t reduce memory usage, as each subsequent token might require different expert buy it reduces per token compute/bandwidth usage. If you place experts in different GPUs, and run batched inference you would see these benefits.

andrewgross

Is there a concept of an expert that persists across layers? I thought each layer was essentially independent in terms of the "experts". I suppose you could look at what part of each layer was most likely to trigger together and segregate those by GPU though.

I could be very wrong on how experts work across layers though, I have only done a naive reading on it so far.

rahimnathwani

  I suppose you could look at what part of each layer was most likely to trigger together and segregate those by GPU though

Yes, I think that's what they describe in section 3.4 of the V3 paper. Section 2.1.2 talks about "token-to-expert affinity". I think there's a layer which calculates these affinities (between a token and an expert) and then sends the computation to the GPUs with the right experts.

This doesn't sound like it would work if you're running just one chat, as you need all the experts loaded at once if you want to avoid spending lots of time loading and unloading models. But at scale with batches of requests it should work. There's some discussion of this in 2.1.2 but it's beyond my current ability to comprehend!

rahimnathwani

  If you place experts in different GPUs

Right, this is described in the Deepseek V3 paper (section 3.4 on pages 18-20).

j7ake

This was an amazing summary of the landscape of ML currently.

I think the title does the article injustice, or maybe it’s too long for people to read to appreciate it (eg the deepseek stuff can be an article within itself).

Whatever the ones with longer attention span will benefit from this read.

Thanks for summarising this up!

metadat

The site is currently offline, here's a snapshot:

https://archive.today/y4utp

dang

We've changed the title to a different one suggested by the author.

eigenvalue

Thanks! I was a bit disappointed that no one saw it on HN because I think they’d like it a lot.

j7ake

I think they would like it a lot, but I think the title doesn’t match the content, and it takes too much reading before one realises it goes beyond the title.

Keep it up!

lxgr

Man, do I love myself a deep, well-researched long-form contrarian analysis published as a tangent of an already niche blog on a Sunday evening! The old web isn't dead yet :)

eigenvalue

Hah thanks, that’s my favorite piece of feedback yet on this.

liuliu

This is a humble and informed acrticle (comparing to others written by financial analysts the past a few days). But still have the flaw of over-estimating efficiency of deploying a 687B MoE model on commodity hardware (to use locally, cloud providers will do efficient batching and it is different): you cannot do that on any single Apple hardware (need to at least hook up 2 M2 Ultra). You can barely deploy that on desktop computers just because non-register DDR5 can have 64GiB per stick (so you are safe with 512 RAM). Now coming to PCIe bandwidth: 37B per token activation means exactly that, each activation requires new set of 37B weights, so you need to transfer 18GiB per token into VRAM (assuming 4-bit quant). PCIe 5 (5090) have 64GB/s transfer speed so your upper bound is limited to 4 tok/s with a well balanced propose built PC (and custom software). For programming tasks that usually requires ~3000 tokens for thinking, we are looking at 12 mins per interaction.

lvass

Is it really 37B different parameters for each token? Even with the "multi-token prediction system" that the article mentions?

liuliu

I don't think anyone uses MTP for inference right now. Even if you use MTP for drafting, you need to batching in the next round to "verify" it is the right token, if that happens you need to activate more experts.

DELETED: If you don't use MTP for drafting, and use MTP to skip generations, sure. But you also need to evaluate your use case to make sure you don't get penalized for doing that. Their evaluation in the paper don't use MTP for generation.

EDIT: Actually, you cannot use MTP other than drafting because you need to fill in these KV caches. So, during generation, you cannot save your compute with MTP (you save memory bandwidth, but this is more complicated for MoE model due to more activated experts).

hn_throwaway_99

I'm curious if someone more informed than me can comment on this part:

> Besides things like the rise of humanoid robots, which I suspect is going to take most people by surprise when they are rapidly able to perform a huge number of tasks that currently require an unskilled (or even skilled) human worker (e.g., doing laundry ...

I've always said that the real test for humanoid AI is folding laundry, because it's an incredibly difficult problem. And I'm not talking about giving a machine clothing piece-by-piece flattened so it just has to fold, I'm talking about saying to a robot "There's a dryer full of clothes. Go fold it into separate piles (e.g. underwear, tops, bottoms) and don't mix the husband's clothes with the wife's". That is, something most humans in the developed world have to do a couple times a week.

I've been following some of the big advances in humanoid robot AI, but the above task still seems miles away given current tech. So is the author's quote just more unsubstantiated hype that I'm constantly bombarded with in the AI space, or have there been advancements recently in robot AI that I'm unaware of?

rattray

https://physicalintelligence.company is working on this – see a demo where their robot does ~exactly what you said, I believe based on a "generalist" model (not pretrained on the tasks): https://www.youtube.com/watch?v=J-UTyb7lOEw

hn_throwaway_99

That's the same video I commented on below: https://news.ycombinator.com/item?id=42844967

There's a huge gulf between what is shown in that video and what is needed to replace a human doing that task.

delusional

There are so many cuts in that 1 minute video, Jesus Christ. You'd think it was produced for TikTok.

niccl

There's a laundry folding section at the end of that isn't cut. Looks reasonably impressive, if your standard is slightly above that of a teenager

hnuser123456

2 months ago, Boston Dynamics' Atlas was barely able to put solid objects in open cubbies. [1] Folding, hanging, and dresser drawer operation appears to be a few years out still.

https://www.youtube.com/watch?v=F_7IPm7f1vI

ieee2

I saw such robot's demos doing exactly that on youtube/x - not very precisely yet, but almost sufficiently enough. And it is just a beginning. Considering that majority of the laundry is very similar (shirts, t-shirts, trousers, etc..) I think this will be solved soon with enough training.

hn_throwaway_99

Can you share what you've seen? Because from what I've seen, I'm far from convinced. E.g. there is this, https://youtube.com/shorts/CICq5klTomY , which nominally does what I've described. Still, as impressive as that is, I think the distance from what that robot does to what a human can do is a lot farther than it seems. Besides noticing that the folded clothes are more like a neatly arranged pile, what about all the edge cases? What about static cling? Can it match socks? What if something gets stuck in the dryer?

I'm just very wary of looking at that video and saying "Look! It's 90% of the way there! And think how fast AI advances!", because that critical last 10% can often be harder than the first 90% and then some.

Nition

First problem with that demo is that putting all your clothes in a dryer is a very American thing. Much of the world pegs their washing on a line.