Nvidia Dynamo: A Datacenter Scale Distributed Inference Serving Framework
11 comments · March 18, 2025

bloomingkales
Built in Rust for performance and in Python for extensibility
Omg, a team that knows how to selectively use tech as needed. Looking at the Rust web developers in the corner.
Carrok
As someone who spent the better part of a year trying to get various Nvidia inference products to work _at all_ even with a direct line to their developers, I will simply say "beware".
islewis
Is this in reference to Triton?
vinni2
Can you share some of your wisdom on setting up a scalable inference infrastructure?
Carrok
Use Ray Serve. https://docs.ray.io/en/latest/serve/index.html
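For reference, a bare-bones Ray Serve deployment looks roughly like this (the class name, replica count, and echo handler are just placeholders, not from the docs):

    # Rough sketch of a Ray Serve deployment; a real one would load and run a model.
    from starlette.requests import Request
    from ray import serve

    @serve.deployment(num_replicas=2)
    class EchoModel:
        async def __call__(self, request: Request) -> str:
            body = await request.json()
            # Placeholder: echo the prompt back instead of running inference.
            return body.get("prompt", "")

    # Starts Serve locally and exposes the deployment over HTTP on port 8000.
    serve.run(EchoModel.bind())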
ipsum2
As someone who has run LLMs in production: using Ray is probably the worst idea. It's not optimized for language models and is extremely slow. There's no KV caching, no model parallelism, and none of the other table-stakes features offered by Dynamo and other open-source inference frameworks. Useful only if you have <1 QPS.
Use SGLang, vLLM, or text-generation-inference instead.
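For example, offline batch inference with vLLM is roughly this (the model name and sampling settings are just illustrative):

    # Rough sketch of offline inference with vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # any Hugging Face-compatible model
    params = SamplingParams(temperature=0.8, max_tokens=64)

    outputs = llm.generate(["What is disaggregated serving?"], params)
    for out in outputs:
        print(out.outputs[0].text)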
[deleted]
So does this replace Triton for LLMs, or?