
Nvidia Dynamo: A Datacenter Scale Distributed Inference Serving Framework

lmeyerov

So does this replace Triton for LLMs?

bloomingkales

"Built in Rust for performance and in Python for extensibility"

Omg, a team that knows how to use tech selectively as needed. Looking at the Rust web developers in the corner.

Carrok

As someone who spent the better part of a year trying to get various Nvidia inference products to work _at all_, even with a direct line to their developers, I will simply say "beware".

islewis

Is this in reference to Triton?

vinni2

Can you share some of your wisdom on setting up a scalable inference infrastructure?

Carrok

ipsum2

As someone who has run LLMs in production, using Ray is probably the worst idea. It's not optimized for language models, and it's extremely slow. It lacks KV caching, model parallelism, and the other basic table-stakes features offered by Dynamo and other open-source inference frameworks. It's useful only if you have <1 QPS.

Use SGLang, vLLM, or text-generation-inference instead.
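For a sense of what those frameworks give you out of the box, here's a minimal vLLM sketch (the model name is just a placeholder; vLLM handles continuous batching and paged KV caching internally, which is exactly what you don't get from plain Ray):

    from vllm import LLM, SamplingParams

    # Placeholder checkpoint; swap in whatever model you actually serve.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # vLLM batches prompts and manages the paged KV cache for you.
    outputs = llm.generate(["Explain KV caching in one sentence."], params)
    print(outputs[0].outputs[0].text)

For production serving you'd typically run vLLM's OpenAI-compatible server (e.g. `vllm serve <model>`) behind a load balancer rather than embedding the engine like this; the snippet above is just the offline-batch path.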
