
Thundering herd problem: Preventing the stampede

Ciantic

I've stumbled on this twice now. Usually plain CDN caching is enough, but I once solved it with Redis locks, and once by simply filling the cache periodically in the background.

If you can, it's easier to have every client fetch from the cache and have a cron job refresh it periodically, e.g., every second.
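A rough sketch of that setup in TypeScript (an in-memory Map standing in for the real cache, with loadFromDb and the key list as hypothetical stand-ins):

    // Clients only ever read the cache; a timer repopulates it in the
    // background, so no request can trigger a miss and stampede the db.
    const cache = new Map<string, string>();

    // Hypothetical stand-in for the expensive database query.
    async function loadFromDb(key: string): Promise<string> {
      return `value-for-${key}`;
    }

    // Hypothetical list of keys worth keeping warm.
    const HOT_KEYS = ["front-page", "top-items"];

    async function refreshCache(): Promise<void> {
      for (const key of HOT_KEYS) {
        cache.set(key, await loadFromDb(key));
      }
    }

    // Refresh every second, as suggested above.
    setInterval(refreshCache, 1_000);

    function getCached(key: string): string | undefined {
      return cache.get(key); // never touches the database
    }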

blakepelton

Some recent academic work suggests implementing caches directly in network switches. Tofino switches are programmable enough that academics can implement this today.

OrbitCache is one example, described in this paper: https://www.usenix.org/system/files/nsdi25-kim.pdf

It should solve the thundering herd problem: the switch "knows" which cache misses it has outstanding, so it can park subsequent requests for the same key in switch memory until the reply comes back from the backend server. This has an advantage over a multi-threaded CPU-based cache, which pays synchronization overhead just for the threads to discover they are about to start a stampede.
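A software analogy of that parking behavior (just a sketch, not the switch's actual P4 program; fetchFromBackend is hypothetical): the first miss for a key owns the backend fetch, later requests are queued, and the whole queue drains when the single reply arrives.

    type Waiter = (value: string) => void;

    // key -> requests parked while a miss for that key is outstanding
    const parked = new Map<string, Waiter[]>();

    // Hypothetical stand-in for the round trip to the backend server.
    async function fetchFromBackend(key: string): Promise<string> {
      return `value-for-${key}`;
    }

    function getWithParking(key: string): Promise<string> {
      return new Promise((resolve) => {
        const queue = parked.get(key);
        if (queue) {
          queue.push(resolve); // miss already outstanding: park this request
          return;
        }
        parked.set(key, [resolve]); // first miss: this request owns the fetch
        fetchFromBackend(key).then((value) => {
          for (const waiter of parked.get(key) ?? []) waiter(value);
          parked.delete(key);
        });
      });
    }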

A summary of OrbitCache will be published to my blog tomorrow. Here is a "draft link": https://danglingpointers.substack.com/p/4967f39c-7d6b-4486-a...

chmod775

This is just how you should implement any (client-side) cache in a concurrent situation. It's the obvious and correct way. I expect you'll find this pattern implemented with promises in thousands of JavaScript/TypeScript codebases.

This query will probably find loads already: https://github.com/search?q=language%3Atypescript+%22new+Map...
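The core of the pattern is a Map from key to the in-flight Promise itself, so every concurrent caller awaits the same fetch. A minimal sketch (the fetcher argument is hypothetical):

    const inflight = new Map<string, Promise<string>>();

    function dedupedFetch(
      key: string,
      fetcher: (key: string) => Promise<string>,
    ): Promise<string> {
      let promise = inflight.get(key);
      if (!promise) {
        promise = fetcher(key).finally(() => {
          // Evict once settled so a failure can be retried; keep the
          // entry longer if you want caching on top of deduplication.
          inflight.delete(key);
        });
        inflight.set(key, promise);
      }
      return promise;
    }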

sriram_malhar

This particular example of thundering herd isn't convincing. First, the database has a cache too, so the first query would end up benefiting the other queries for the same key. The only extra overhead is the network round trip, which a distributed lock would incur as well.

I would think that in the rare case of multiple concurrent requests for a key that none of the caches holds, it might just be worth taking the slightly increased hit (if any) of going to the db, instead of complicating things further and slowing down everyone else with the same mechanism.

fidotron

This reads like LLM noise, with headings missing articles.

It also doesn't mention the most obvious solution to this problem: adding a random factor (jitter) to retry timing during backoff, since a major cause of the stampede is every client coming back at the precise instant a service becomes available again, only to knock it offline.
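For reference, the jitter fix is small; a sketch of the "full jitter" variant of exponential backoff (parameter values are arbitrary):

    // Instead of every client retrying at exactly base * 2^attempt,
    // each picks a random delay in [0, cap), so a recovered service
    // isn't hit by a synchronized wave of retries.
    function backoffDelayMs(attempt: number, baseMs = 100, maxMs = 30_000): number {
      const cap = Math.min(maxMs, baseMs * 2 ** attempt);
      return Math.random() * cap;
    }

    async function retryWithJitter<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
      for (let attempt = 0; ; attempt++) {
        try {
          return await fn();
        } catch (err) {
          if (attempt + 1 >= maxAttempts) throw err;
          await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
        }
      }
    }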