
Thundering Herd Problem: what it is and why it breaks production

This content has been automatically translated from Ukrainian.
The thundering herd problem is a situation in which many processes or requests access the same resource at the same moment. Each of them behaves logically and correctly on its own, but together they create a sharp load spike that the system cannot handle.
The name is metaphorical: imagine a herd of animals that suddenly takes off. Individually, each animal is not a problem, but together they can trample everything in their path.

Example of Thundering Herd

In practice, it often looks like this: you have a cache with a TTL. A key expires, and at that moment many requests arrive. Each of them sees that the cache is empty and independently decides to go to the database or an external service. Instead of one controlled request, you get hundreds or thousands. The database becomes overloaded, latency grows, timeouts start, and sometimes the service fails completely. The cache, which was supposed to protect the system, becomes the trigger for the disaster.
This problem is particularly insidious because it arises not in calm conditions, but rather at the worst moments: during peak traffic, after service restarts, during deployments, or when an external dependency starts responding more slowly. It is almost impossible to reproduce locally or in staging, so in production it appears as a "random" incident without an obvious cause.
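The cache-expiry scenario described above can be sketched in a few lines. This is an illustration, not a benchmark: `cache`, `load_from_db`, `naive_get`, and `simulate` are hypothetical names, the "database" is just a counter, and a `threading.Barrier` is used to force the worst-case interleaving in which every request sees the miss before any of them repopulates the cache.

```python
import threading

cache = {}                  # hypothetical in-process cache; a missing key means "expired"
db_calls = 0                # how many times the pretend database was hit
db_lock = threading.Lock()  # protects the counter only


def load_from_db(key):
    """Stand-in for an expensive query or external call."""
    global db_calls
    with db_lock:
        db_calls += 1
    return f"value-for-{key}"


def naive_get(key, barrier):
    value = cache.get(key)
    # Force the worst case: all requests observe the miss
    # before any of them writes the value back.
    barrier.wait()
    if value is None:
        value = load_from_db(key)
        cache[key] = value
    return value


def simulate(n_requests):
    """Run n_requests concurrent reads of one expired key; return the DB call count."""
    global db_calls
    cache.clear()
    db_calls = 0
    barrier = threading.Barrier(n_requests)
    threads = [
        threading.Thread(target=naive_get, args=("user:1", barrier))
        for _ in range(n_requests)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return db_calls
```

Under these assumptions, `simulate(50)` reports 50 loader calls for a single key, even though one call would have been enough.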

How to fight the thundering herd?

There are several ways to fight the thundering herd, but the idea is always the same: the system should not behave too synchronously. Often it is enough to let only one process refresh the cache while all the others either wait or read the previous value. In other cases, it is better to serve slightly stale data for a short time than to unleash an avalanche of requests on the database. Even a small detail such as a random offset added to the TTL can significantly reduce the risk of many keys expiring at the same moment.
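The three ideas in this paragraph (one refresher per key, serving stale data, and a randomized TTL) can be combined in one small in-process cache. This is a minimal sketch under stated assumptions: `SingleFlightCache`, its parameters, and the `loader` callback are hypothetical names, entries are never evicted, and a real deployment with a shared cache such as Redis would need a distributed lock rather than `threading.Lock`.

```python
import random
import threading
import time


class SingleFlightCache:
    """Hypothetical cache: one loader call per key at a time, jittered TTL."""

    def __init__(self, loader, ttl=60.0, jitter=0.1):
        self._loader = loader
        self._ttl = ttl
        self._jitter = jitter
        self._data = {}    # key -> (value, expires_at)
        self._locks = {}   # key -> lock guarding the refresh of that key
        self._guard = threading.Lock()

    def _jittered_ttl(self):
        # Spread expirations over ttl * (1 +/- jitter) so keys do not expire together.
        return self._ttl * random.uniform(1 - self._jitter, 1 + self._jitter)

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]                      # fresh hit
        lock = self._lock_for(key)
        if entry is not None:
            # A stale value exists: one caller refreshes, the rest serve stale data.
            if not lock.acquire(blocking=False):
                return entry[0]
        else:
            lock.acquire()                       # nothing to serve: wait for the refresher
        try:
            # Re-check: another thread may have refreshed while we waited.
            entry = self._data.get(key)
            if entry is not None and entry[1] > time.monotonic():
                return entry[0]
            value = self._loader(key)
            self._data[key] = (value, time.monotonic() + self._jittered_ttl())
            return value
        finally:
            lock.release()
```

A caller would wrap the expensive lookup once, for example `users = SingleFlightCache(load_user, ttl=300)` with a hypothetical `load_user` function, and then call `users.get(user_id)` from any thread. The trade-off is deliberate: readers may briefly see an outdated value, in exchange for the backend seeing one request per key instead of one per reader.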
A separate category of problems is retries. When a service crashes or responds slowly, naive retry logic forces clients to repeat requests synchronously. Instead of recovering, the system takes an even bigger hit. Therefore, delays, exponential backoff, and random jitter are not optimizations but necessities.
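A retry loop with exponential backoff and random jitter might look like the sketch below. The function names, the default `base` and `cap` values, and the choice of "full jitter" (a uniformly random delay between zero and the current exponential ceiling) are assumptions made for illustration; the post does not prescribe a specific formula.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0):
    """Random delay in [0, min(cap, base * 2**attempt)] seconds ("full jitter")."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Real code should catch only errors that are safe to retry.
            if attempt == max_attempts - 1:
                raise
            # Each client sleeps a different random time, so retries
            # do not arrive at the recovering service in lockstep.
            sleep(backoff_delay(attempt, base, cap))
```

The `sleep` parameter is injectable only so the loop can be exercised in tests without real waiting. The randomness is what breaks the synchronization: with pure exponential backoff, clients that failed together would still retry together, just less often.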
The thundering herd is not a bug in the code but a consequence of architectural decisions. It appears where a system scales in numbers but does not account for contention over shared resources. If you think not only about "how it works" but also about "how it behaves under load," this problem can be avoided at the design stage.
