
The thundering herd problem isn't really about high levels of traffic. To the extent that that's a problem, it's just an ordinary DoS.

The thundering herd problem specifically refers to what happens if you coordinate things so that all your incoming requests occur simultaneously. Imagine that over the course of a week, you tell everyone who needs something from you "I'm busy right now; please come back next Tuesday at 11:28 am". You'll be overwhelmed on Tuesday at 11:28 am regardless of whether your average weekly workload is high or low, because you concentrated your entire weekly workload into the same one minute. You solve the thundering herd problem by not giving out the same retry time to everyone who contacts you while you're busy.
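
A minimal sketch of that fix in Python (the one-week window is just the number from the example above):

    import random
    from datetime import datetime, timedelta

    RETRY_WINDOW = timedelta(days=7)  # the deferred workload spans a week

    def fixed_retry_after(now: datetime) -> datetime:
        # Problem: every deferred caller gets the same slot, so the whole
        # week's backlog lands in the same minute next Tuesday.
        return now + RETRY_WINDOW

    def randomized_retry_after(now: datetime) -> datetime:
        # Fix: each caller gets its own slot somewhere in the window, so the
        # deferred work is spread roughly evenly across the week.
        return now + timedelta(seconds=random.uniform(0, RETRY_WINDOW.total_seconds()))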



Hmm. I think of thundering herd as being about retries.

All your failing requests batch up when your retry strategy sucks, then you end up with really high traffic on every retry and very little in between.


Retries without jitter are indeed a common source of thundering herd problems. Even with exponential backoff, if all the clients are retrying simultaneously, they'll hammer your servers over and over. By adding jitter (just a random amount of extra delay that's different for every client and retry), the retries get staggered and the requests are spread out.
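
For the record, a minimal sketch of backoff plus jitter in Python (this is the "full jitter" variant; do_request, base, and cap are placeholders you'd tune for your service):

    import random
    import time

    def backoff_with_jitter(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
        # Exponential backoff bounds the delay; picking uniformly at random
        # below that bound keeps concurrent clients from retrying in lockstep.
        return random.uniform(0, min(cap, base * (2 ** attempt)))

    def call_with_retries(do_request, max_attempts: int = 5):
        for attempt in range(max_attempts):
            try:
                return do_request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(backoff_with_jitter(attempt))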


What do you do when you’re an API SaaS, and it’s your clients’ apps that are making thundering-herd requests?

Imagine you’re a service like Feedly, and one of your “direct customer” API clients — some feed-reader mobile client — has coded their apps such that all of their connected clients will re-request the specific user’s unique feed at exact, crontab-like 5-minute offsets from the start of the hour. So every five minutes, you get a huge burst of traffic, from all these clients—and it’s all different traffic, with nothing coalesceable.

You don’t control the client in this case, but you can’t simply ban them either—they’re your paying customers! (Yes, you can “fire your customer”, but this would be most of your customers…)

And certainly, you can try to teach your client’s devs how to write their own jitter logic—but that rarely works out, as it’s often junior frontend devs who wrote the client-side code, and it’s hard to have a non-intermediated conversation with them.


If you have no control at all over the client, then ultimately, you have to just take it and build your service to handle that amount of traffic. Adding jitter is a technique that you use when writing clients. That's why I mentioned it in the context of retries. If you are writing a CDN per the article, at some point your CDN has to make requests back to the origin. If one of those requests fails and you retry, you add jitter there to avoid DoSing yourself. If you are working in a microservices architecture, you add jitter on retries between your services.

The best you can do with clients that are out of your control is to publish a client library/SDK for your API that is convenient for your customers to use and implements best practices like exponential backoff, jitter, etc. If you have documentation with code snippets that junior devs are likely to copy and paste, bake those practices into the snippets too.
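
As a sketch of what that SDK might look like (the class name, base URL, and retry constants here are all made up, and it assumes the requests library):

    import random
    import time
    import requests

    class YourApiClient:
        # Hypothetical SDK wrapper that applies backoff + jitter by default,
        # so devs who copy-paste the docs get the good behaviour for free.
        def __init__(self, api_key: str, base_url: str = "https://api.example.com"):
            self.session = requests.Session()
            self.session.headers["Authorization"] = f"Bearer {api_key}"
            self.base_url = base_url

        def get(self, path: str, max_attempts: int = 5):
            for attempt in range(max_attempts):
                resp = self.session.get(self.base_url + path)
                if resp.status_code not in (429, 502, 503):
                    return resp
                # Honour Retry-After if the server sent one (assumed to be in
                # seconds), otherwise back off exponentially with full jitter.
                retry_after = float(resp.headers.get("Retry-After", 0))
                time.sleep(max(retry_after, random.uniform(0, min(30.0, 0.5 * 2 ** attempt))))
            return resp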

If you've painted yourself into a corner like you describe and are seeing extremely regular traffic patterns, you might be able to pre-cache. I.e., it's 12:01 and you know that a barrage is coming at 12:05. Start going down the list of clients/feeds that you know are likely to be requested based on recent traffic patterns and generate the responses, putting them in your cache/CDN with a five-minute TTL. Then at least a good portion of the requests should be served straight from there and not add load to the origin. There are obviously drawbacks/risks to that approach, but it might be all you can really do.
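
A rough sketch of that pre-warming loop (recently_requested_feeds, render_feed, and cache are stand-ins for your traffic log, your origin handler, and whatever cache layer sits in front of it):

    SPIKE_PERIOD = 300  # the clients hit us every 5 minutes

    def prewarm(cache, recently_requested_feeds, render_feed):
        # Run this a few minutes before the expected burst (e.g. at 12:01 for
        # a 12:05 spike) so most of the burst is served straight from cache.
        for feed_id in recently_requested_feeds():
            if cache.get(feed_id) is None:
                cache.set(feed_id, render_feed(feed_id), ttl=SPIKE_PERIOD)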


If you're extremely desperate, you can start adding conditional jitter (somewhere in the 5–200 ms range) to your load balancer/reverse proxy, such as your NGINX/Envoy/Apache box, which sits in front of your API. You can make the jitter conditional on the count of concurrent requests or on latency spikes. It's an extreme last resort, and may require a bit of custom work via a custom module or extension, but it is possible.
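
Sketched here in Python as ASGI middleware rather than an actual NGINX/Envoy module, just to show the shape of it (the threshold and delay bounds are illustrative, not tuned values):

    import asyncio
    import random

    class ConditionalJitterMiddleware:
        # If too many requests are already in flight, delay new ones by a
        # small random amount before handing them to the app behind us.
        def __init__(self, app, max_in_flight: int = 200,
                     min_delay: float = 0.005, max_delay: float = 0.2):
            self.app = app
            self.max_in_flight = max_in_flight
            self.min_delay = min_delay
            self.max_delay = max_delay
            self.in_flight = 0

        async def __call__(self, scope, receive, send):
            if scope["type"] == "http" and self.in_flight > self.max_in_flight:
                await asyncio.sleep(random.uniform(self.min_delay, self.max_delay))
            self.in_flight += 1
            try:
                await self.app(scope, receive, send)
            finally:
                self.in_flight -= 1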

In general, try to avoid having no control over the client at all. If you must lack that control (such as if you're a pure SaaS company selling a public API), you can apply jitter based on API key, in addition to the other metrics I mentioned above.
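
One way to do the per-API-key part is to derive a stable offset from the key itself, so each customer's burst gets shifted by a consistent amount rather than fresh random latency on every request (the max_jitter_ms bound here is arbitrary):

    import hashlib

    def key_jitter_ms(api_key: str, max_jitter_ms: int = 200) -> int:
        # Map an API key to a stable delay in [0, max_jitter_ms), so different
        # customers' spikes land at slightly different times.
        digest = hashlib.sha256(api_key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % max_jitter_ms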

As better engineers than I used to say at a previous engagement: "if it's not in the SLA, it's an opportunity for optimization"


I like the “jitter based on API key” idea.

It’s somewhat hard in our case, as our direct customers (like the mobile app I mentioned) have API keys with us, but they don’t tell us about which user of theirs is making the request. And often they’ll run an HTTP gateway (in part so that they don’t have to embed their API key for our service in their client app), so we don’t even get to see the originating user IPs for these requests, either. We just get these huge spikes of periodic traffic, all from the same IP, all with the same API key, all about different things, and all delivered over a bunch of independent, concurrent TCP connections.

I’ve been considering a few options:

- Require users that have such a “multiple users behind an API gateway” setup to tag their proxied requests with per-user API sub-keys, so we can jitter/schedule based on those.

- Since these customers like API gateways so much, we could just build a better API gateway for them to run; one that benefits us. (E.g. by Nagle-ing requests together into fewer, larger batch requests.) Requests that come as a single large batch request, could be scheduled by our backend at an optimal concurrency level, rather than trying to deal with huge concurrency bursts as we are now.

- Force users to rewrite their software to “play nice” by introducing heavy-handed rate-limiting. Try to tune it so that the only possible way to avoid 429s is to either do gateway-side request queuing, or to introduce per-client schedule offsets (i.e. placing users on a hash ring by their ID, so that for a once-per-5-minutes request, equal numbers of client apps are set to make it at T+0 vs. T+2.5.)

- Introduce a middleware / reverse-proxy that holds an unbounded-in-size no-expire request queue, with one queue per API key, where requests are popped fairly from each queue (or prioritized according to the plan the user is paying for). Ensure backends only select(1) requests out from the middleware’s downstream sockets as quickly as they’re able to handle them. Require API requests to have explicit TTLs — a time after which serving the request would no longer be useful. If a backend pops a request and finds that it’s past its TTL, it discards it, answering it with an immediate 504 error.
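
A toy version of that last option, just to make the fairness/TTL mechanics concrete (payload stands in for the parked request, and on_expired is whatever answers it with the immediate 504):

    import time
    from collections import deque

    class FairRequestQueue:
        # One unbounded queue per API key, popped round-robin, with every
        # request carrying an explicit deadline (its TTL).
        def __init__(self, on_expired):
            self.queues = {}      # api_key -> deque of (deadline, payload)
            self.order = deque()  # round-robin order of api_keys
            self.on_expired = on_expired

        def push(self, api_key, payload, ttl_seconds):
            if api_key not in self.queues:
                self.queues[api_key] = deque()
                self.order.append(api_key)
            self.queues[api_key].append((time.monotonic() + ttl_seconds, payload))

        def pop(self):
            # Backends call this only as fast as they can actually serve.
            while self.order:
                api_key = self.order[0]
                queue = self.queues[api_key]
                while queue:
                    deadline, payload = queue.popleft()
                    if time.monotonic() > deadline:
                        self.on_expired(payload)  # past its TTL: 504 and drop
                        continue
                    self.order.rotate(-1)  # fairness: this key goes to the back
                    return api_key, payload
                # Queue drained (empty or all expired): remove it from rotation.
                self.order.popleft()
                del self.queues[api_key]
            return None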


Jitter is one way to solve it. Request coalescing is another.

It depends on the request type. Is it cacheable? Do you require a per-client side effect? ...
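
For the cacheable case, coalescing can be as simple as sharing one in-flight fetch per key (an asyncio sketch; fetch is whatever actually talks to the origin):

    import asyncio

    class Coalescer:
        # Concurrent requests for the same key await a single shared fetch
        # instead of each one hitting the origin. Only applicable when the
        # response can be shared across callers (no per-client side effects).
        def __init__(self):
            self._in_flight = {}

        async def get(self, key, fetch):
            task = self._in_flight.get(key)
            if task is None:
                task = asyncio.ensure_future(fetch(key))
                self._in_flight[key] = task
                task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
            return await task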


Request coalescing in a shared cache does not solve thundering herd; it just reduces propagation to backend services. Your cache is still subject to a thundering herd, and may be unable to keep up.

The only way to solve thundering herd - which is that a large load of requests all arrive within a short timespan - is to distribute those requests over a larger timespan.

Reducing your herd size by having fewer requests does not solve thundering herd, but may make it bearable.


Retries tend to amplify it, but a more common cause is scheduled tasks in clients/end user devices.

E.g. all clients checking for an update at 10:00 UTC every day, all clients polling for new data at fixed times, etc.
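
The client-side fix for that is the same jitter idea, applied to the schedule itself (a sketch; the one-hour window is arbitrary, and now must be a timezone-aware UTC datetime):

    import random
    from datetime import datetime, time, timedelta, timezone

    def next_update_check(now: datetime, window_minutes: int = 60) -> datetime:
        # Instead of every device checking at exactly 10:00 UTC, each one
        # picks a random moment within the following window, spreading the
        # daily spike out.
        base = datetime.combine(now.date(), time(10, 0), tzinfo=timezone.utc)
        if now >= base:
            base += timedelta(days=1)
        return base + timedelta(seconds=random.uniform(0, window_minutes * 60))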


Where does your perspective differ from what I said above?



