Nvidia Dynamo: A Datacenter Scale Distributed Inference Serving Framework
11 comments · March 18, 2025

bloomingkales
Built in Rust for performance and in Python for extensibility
Omg, a team that knows how to selectively use tech as needed. Looking at the Rust web developers in the corner.
Carrok
As someone who spent the better part of a year trying to get various Nvidia inference products to work _at all_ even with a direct line to their developers, I will simply say "beware".
islewis
Is this in reference to Triton?
vinni2
Can you share some of your wisdom on setting up a scalable inference infrastructure?
Carrok
Use Ray Serve. https://docs.ray.io/en/latest/serve/index.html
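For reference, a bare-bones Ray Serve deployment looks roughly like this (the class name, replica count, and echo handler are just placeholders, not from the docs):

    # Rough sketch of a Ray Serve deployment; a real one would load and run a model.
    from starlette.requests import Request
    from ray import serve

    @serve.deployment(num_replicas=2)
    class EchoModel:
        async def __call__(self, request: Request) -> str:
            body = await request.json()
            # Placeholder: echo the prompt back instead of running inference.
            return body.get("prompt", "")

    # Starts Serve locally and exposes the deployment over HTTP on port 8000.
    serve.run(EchoModel.bind())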
ipsum2
As someone who has run LLMs in production: using Ray is probably the worst idea. It's not optimized for language models and is extremely slow. There's no KV caching, no model parallelism, and none of the other table-stakes features offered by Dynamo and other open-source inference frameworks. Useful only if you have <1 QPS.
Use SGLang, vLLM, or text-generation-inference instead.
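For example, offline batch inference with vLLM is roughly this (the model name and sampling settings are just illustrative):

    # Rough sketch of offline inference with vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # any Hugging Face-compatible model
    params = SamplingParams(temperature=0.8, max_tokens=64)

    outputs = llm.generate(["What is disaggregated serving?"], params)
    for out in outputs:
        print(out.outputs[0].text)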
[deleted]
So does this replace Triton for LLMs, or?