SSE sucks for transporting LLM tokens

nixpulvis

Author tries to avoid a database for storing tokens while the client is disconnected and ends up storing them in a pub/sub provider.

There's no solution other than to store the tokens somewhere, or drop them. You have to choose how long you want to allow reconnects for. And this is all pretty independent of the transport layer: as the author even mentioned themselves, you can resume even in a new session as long as you have a prompt ID or something to tie it back to the original request.

I don't know enough about how the LLM providers stream results, but the original claim that inference is more expensive than transport is a good point, and caching tokens seems like a smart move. Unfortunately, we pay by the token, so I don't see the incentive for providers to spend time and money doing this for us.

devnull3

> the model has to re-run the generation, and the client has to start receiving tokens from scratch again.

I don't understand. The payload can be designed to have a sequence number. In case of a reconnect, send the last known sequence number. Sounds like an application-level protocol problem and not a transport one. Am I missing something?

The pub/sub mentioned in the article essentially does the same thing.
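
A minimal sketch of that sequence-number idea, assuming the server keeps the generated tokens in memory for the lifetime of one response. `TokenLog`, `append`, and `replay_after` are hypothetical names for illustration, not part of any provider's API:

```python
from dataclasses import dataclass, field


@dataclass
class TokenLog:
    """Hypothetical in-memory log of the tokens generated for one response."""

    tokens: list[str] = field(default_factory=list)

    def append(self, token: str) -> int:
        """Store a token and return its 1-based sequence number."""
        self.tokens.append(token)
        return len(self.tokens)

    def replay_after(self, last_seq: int) -> list[tuple[int, str]]:
        """Return every (sequence number, token) pair the client has not seen."""
        return [
            (seq, token)
            for seq, token in enumerate(self.tokens, start=1)
            if seq > last_seq
        ]
```

On reconnect the client sends the last sequence number it received, and the server replays from there instead of re-running generation.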

petcat

The blog author is confusing SSE, the protocol itself, with how the application is typically implemented. SSE is great and can trivially be implemented in a way that allows history, catch-up, and resuming. The "Pub/Sub" mentioned at the end is the exact application of SSE that the author wants.

medbrane

Indeed, the reconnect behavior is described in the protocol and the server will simply resume from the requested sequence id.
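
For illustration, a rough sketch of what that looks like on the wire, assuming the server numbers tokens from 1 and still has them available. The `id:` and `data:` fields and the `Last-Event-ID` request header come from the SSE spec; the function names `format_sse` and `frames_for_reconnect` are made up for this example, and the header lookup is simplified (real HTTP header names are case-insensitive):

```python
def format_sse(event_id: int, data: str) -> str:
    """Serialize one event: an id line, one data line per text line, blank line."""
    lines = [f"id: {event_id}"]
    lines += [f"data: {part}" for part in data.split("\n")]
    return "\n".join(lines) + "\n\n"


def frames_for_reconnect(tokens: list[str], headers: dict[str, str]) -> list[str]:
    """Return the frames to send, skipping those at or before Last-Event-ID."""
    raw = headers.get("Last-Event-ID", "")
    last_id = int(raw) if raw.isdigit() else 0  # no header means a fresh stream
    return [
        format_sse(event_id, token)
        for event_id, token in enumerate(tokens, start=1)
        if event_id > last_id
    ]
```

A browser `EventSource` sends `Last-Event-ID` by itself when it reconnects, so the only work left for the server is keeping the tokens around long enough to replay them.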

wyldfire

Not x86_64's Streaming SIMD Extensions, but Server-sent events [1]. SSE and AVX are probably not that bad at /handling/ LLM tokens...

[1] https://en.wikipedia.org/wiki/Server-sent_events

mirekrusin

Response API does support resuming.

But it's not a gigantic improvement, as models don't regenerate "lost"/past parts of the conversation; they're heavily cached and have been from pretty much day 1, which is why they have highly reduced cost.

sauercrowd

Seems like there's a few abstractions mixed up, the problems have nothing to do with SSE.

You can store the state in the SSE connection and have the problems described, and if you don't like those, you can move the state to something distributed/persisted.

Pubsub is just a layer on top of SSE or websockets, cause guess how it'd end up sending things to the browser

ivan_gammel

I don’t get it. Client generates a UUID for the prompt and PUTs the prompt with this UUID on the server. Server caches the generated output for a reasonable time, so that subsequent PUTs get 200 instead of 201. Transport protocol failures then do not matter. If the response isn’t 4xx, just retry.
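
A minimal sketch of that flow, assuming an in-memory cache and a `generate` callable standing in for the model. `PromptStore`, `put`, and `new_prompt_id` are hypothetical names, and cache expiry ("reasonable time") is left out to keep it short:

```python
import uuid
from typing import Callable


def new_prompt_id() -> str:
    """Client side: generate the UUID that identifies one prompt."""
    return str(uuid.uuid4())


class PromptStore:
    """Hypothetical server-side handler for PUT /prompts/<prompt_id>."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate          # stand-in for the LLM call
        self._cache: dict[str, str] = {}   # prompt_id -> generated output

    def put(self, prompt_id: str, prompt: str) -> tuple[int, str]:
        """Return (status, output): 201 on first PUT, 200 on any retry."""
        if prompt_id in self._cache:
            return 200, self._cache[prompt_id]
        output = self._generate(prompt)
        self._cache[prompt_id] = output
        return 201, output
```

Because the retry carries the same UUID, a dropped connection costs one extra request rather than a second run of the model.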

riskable

The way the current architecture works—as far as I know—is your assumed "server caches the generated output" step doesn't exist. What you get in your output is streamed directly from the LLM to your client. Which is, in theory, the most efficient way to do it.

That's why LLM outputs that get cut off mid-stream require the end user to click the "retry" button and not the "re-send me that last output" button (which doesn't exist).

I would imagine that a simpler approach would be to simply make the last prompt idempotent... Which would require caching on their servers; something that supposedly isn't happening right now. That way, if the user re-sends the last prompt the server just responds with the same exact output it just generated. Except LLMs often make mistakes and hallucinate things... So re-sending the last prompt and hoping for a better output isn't an uncommon thing.

Soooo... Back to my suggested workaround in my other comment: Pub/sub over WebSockets :D

ivan_gammel

The only reason an LLM server responds with partial results instead of waiting and returning everything at once is UX. It’s just too slow. But the problem of slow bulk responses isn’t unique to LLMs and can be solved within HTTP 1.1 well enough. It doesn’t have to be the same server; it can be a caching proxy in front of it. Any privacy concerns can be addressed by giving the user the opportunity to tell the server to cache or not to cache (which can be as easy as submitting with PUT vs POST requests).

bjt

The user's last prompt can be sent with an idempotency key that changes each time the user initiates a new request. If that's the same, use the cache. If it's new, hit the LLM again.

debazel

But adding caching to SSE is trivial compared to completely changing your transfer protocol, so why wouldn't you just do that instead?

imsh4yy

Yep, came here expecting to read an interesting take on why SSE sucks or a better alternative, but this just reads like "skill issue." A term I very much dislike but seems appropriate here.

ivan_gammel

A significant part of relatively new technology stacks and tech slang is “skill issue”. A lot of problems were already solved or at least analyzed 20 to 40 years ago and hardly need to be re-invented, maybe just modernized.

anonymoushn

so sad to hear that about Streaming SIMD Extensions

tedivm

SSE just sucks for most use cases, we don't have to go through each one pointing it out.

riskable

Pub/sub via WebSockets seems like the simplest solution. You'll need to change your LLM serving architecture around a little bit to use a pub/sub system that a microservice can grab the output from (to send to the client) but it's not rocket science.

It's yet another system that needs some DRAM though. The good news is that you can auto-expire the queued up responses pretty fast :shrug:

No idea if it's worth it, though. Someone with access to the statistics surrounding dropped connections/repeated prompts at a big LLM service provider would need to do some math.
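
To make the auto-expire part concrete, here is a rough sketch of a per-channel buffer with a time-to-live, assuming everything lives in one process. A real deployment would use an actual pub/sub system; `ExpiringChannelBuffer`, `publish`, and `history` are hypothetical names, and the injectable `clock` is only there so the behaviour can be checked without sleeping:

```python
import time
from collections import defaultdict
from typing import Callable


class ExpiringChannelBuffer:
    """Hypothetical in-memory stand-in for a pub/sub channel with history."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # channel -> list of (publish time, sequence number, token)
        self._messages: dict[str, list[tuple[float, int, str]]] = defaultdict(list)
        self._next_seq: dict[str, int] = defaultdict(int)

    def publish(self, channel: str, token: str) -> int:
        """Queue a token on a channel and return its sequence number."""
        self._next_seq[channel] += 1
        seq = self._next_seq[channel]
        self._messages[channel].append((self._clock(), seq, token))
        return seq

    def history(self, channel: str, after_seq: int = 0) -> list[tuple[int, str]]:
        """Drop expired messages, then return those newer than after_seq."""
        cutoff = self._clock() - self._ttl
        fresh = [m for m in self._messages[channel] if m[0] >= cutoff]
        self._messages[channel] = fresh
        return [(seq, token) for _, seq, token in fresh if seq > after_seq]
```

The TTL is the knob for how long reconnects are allowed: a short value keeps memory use low, at the price of dropping tokens for clients that come back late.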

bragh

Corporate security hates websockets though, SSE is much easier for end-users to get approved.

nightshift1

I think it would be even more wasteful to continue inference in background for nothing if the user decided to leave without pressing the stop button. Saving the partial answer at the exact moment the client disappeared would be better.

normie3000

What is SSE?

devnull3

Server Sent Events [1]

[1] https://developer.mozilla.org/en-US/docs/Web/API/Server-sent...

bjt

Weird take. The id field in the SSE spec is there specifically so you can resume a stream. And that requires persistence/caching on the server side. Once you have those things, you're practically at the pubsub option that the article prefers.

the_mitsuhiko

Precisely. In fact SSE as a protocol was specifically designed to support resuming. It’s unfortunate that most APIs don’t support that but that’s not the fault of SSE.

inesranzo

> A better approach: Pub/Sub

Citation Needed.

More importantly, benchmarks needed.

You cannot claim X is a better approach than Y without benchmarks. It is an idea, but it needs to be proven better.

Until then, this post is nothing more than yet another opinion.
