The thundering herd problem is a situation in software development where many processes or requests access the same resource simultaneously. Each of them behaves logically and correctly on its own, but together they create a sharp spike in load that the system cannot handle.
The name is metaphorical: imagine a herd of animals that suddenly stampedes. Individually, each animal is not a problem, but together they can trample everything in their path.
Example of Thundering Herd
In practice it often looks like this: you have a cache with a TTL. The key expires, and at that moment many requests arrive. Each of them sees that the cache is empty and decides to go to the database or an external service on its own. As a result, instead of one controlled request, you get hundreds or thousands. The database becomes overloaded, latency grows, timeouts occur, and sometimes the service fails completely. The cache, which was supposed to protect the system, instead becomes the trigger for a disaster.
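The pattern above can be sketched in a few lines. This is a hypothetical simulation, not code from the post: `cache`, `load_from_db`, and `db_calls` are illustrative names, and a short sleep stands in for a slow database query.

```python
import threading
import time

cache = {}          # key -> value; an expired key is simply absent
db_calls = 0        # counts how many requests fall through to the "database"
count_lock = threading.Lock()

def load_from_db(key):
    """Simulate a slow database query."""
    global db_calls
    with count_lock:          # the lock only guards the counter;
        db_calls += 1         # the DB hits themselves are uncoordinated
    time.sleep(0.05)          # while one request loads, others also see a miss
    return f"value-for-{key}"

def naive_get(key):
    # Every request that sees a miss goes to the database on its own.
    if key not in cache:
        cache[key] = load_from_db(key)
    return cache[key]

# 50 concurrent requests for the same expired key
threads = [threading.Thread(target=naive_get, args=("hot-key",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After this run, `db_calls` is far greater than 1: one cache miss has fanned out into dozens of database hits.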
This problem is particularly insidious because it arises not in calm conditions, but rather at the worst moments: during peak traffic, after service restarts, during deployments, or when an external dependency starts responding more slowly. It is almost impossible to reproduce locally or in staging, so in production it appears as a "random" incident without an obvious cause.
How to fight thundering herd?
You can fight thundering herd in various ways, but the idea is always the same: the system should not behave too synchronously. Often, it is sufficient to allow only one process to update the cache while all others either wait or read the previous value. In other cases, it is better to temporarily serve slightly stale data than to create an avalanche of requests to the database. Even a small detail like a random shift in TTL can significantly reduce the risk of all keys expiring at the same time.
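The three ideas above (a single updater, serving stale data, and TTL jitter) can be combined in one small sketch. This is an illustrative in-process version, not a recipe from the post; the names `get`, `loader`, and `jittered_ttl` are assumptions, and a distributed cache would need a distributed lock instead of `threading.Lock`.

```python
import random
import threading
import time

BASE_TTL = 60.0                # hypothetical base TTL in seconds

cache = {}                     # key -> (value, expires_at)
refresh_lock = threading.Lock()

def jittered_ttl(base=BASE_TTL, spread=0.1):
    # Shift each expiry by up to +/-10% so keys don't all expire at once.
    return base * (1 + random.uniform(-spread, spread))

def get(key, loader):
    entry = cache.get(key)
    now = time.monotonic()
    if entry is not None and now < entry[1]:
        return entry[0]                       # fresh hit
    # Expired or missing: let exactly one caller refresh.
    if refresh_lock.acquire(blocking=False):
        try:
            value = loader(key)               # the single controlled request
            cache[key] = (value, now + jittered_ttl())
            return value
        finally:
            refresh_lock.release()
    if entry is not None:
        return entry[0]                       # serve slightly stale data
    with refresh_lock:                        # cold start: wait for the updater
        entry = cache.get(key)
        if entry is not None:
            return entry[0]
        value = loader(key)
        cache[key] = (value, time.monotonic() + jittered_ttl())
        return value
```

The key property: however many callers hit an expired key, only one of them runs `loader`; the rest either reuse the previous value or wait for the refresh to finish.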
A separate category of problems is retries. When a service crashes or responds slowly, naive retry logic forces clients to repeat requests synchronously. Instead of recovering, the system takes an even bigger hit. Therefore, delays, exponential backoff, and random jitter are not optimizations but necessities.
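A minimal sketch of such retry logic, assuming a hypothetical flaky operation `fn`. It uses the "full jitter" variant: each delay is drawn uniformly from zero up to the capped exponential bound, so retrying clients spread out instead of hammering the service in sync.

```python
import random
import time

def backoff_delays(base=0.1, cap=10.0, attempts=5):
    # Delay n is uniform in [0, min(cap, base * 2**n)] -- "full jitter".
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

def call_with_retries(fn, attempts=5):
    """Call fn, retrying with capped exponential backoff plus jitter."""
    last_exc = None
    for delay in backoff_delays(attempts=attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            time.sleep(delay)   # randomized pause before the next attempt
    raise last_exc              # all attempts failed
```

Without the jitter, every client that failed at the same moment would also retry at the same moment, recreating the herd on each backoff step.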
Thundering herd is not a bug in the code, but a consequence of architectural decisions. It appears where the system scales quantitatively but does not account for competition for shared resources. If you think not only about "how it works" but also about "how it behaves under load," this problem can be avoided at the design stage.