Introduction
Availability guarantees for a service are measured in terms of the number of 9s. For instance, a service ‘A’ with 99% availability would be referred to as having 2 9s of availability. 99% availability means that for every 100 calls handled by service A, 1 can fail. If a client of service A can’t tolerate this failure rate, i.e., 1 call failing every 100 calls, the client can add a single retry for every such failure to improve the perceived availability of service A from 99% to 99.99%. With the retry implemented, service A appears to fail once every 10,000 calls instead of once every 100 calls, a 100x reduction in the observed failure rate from just 1 client retry.
This can be formulated as follows:
Perceived client-side availability = (1 – (1 – availability of service)^(n+1)) * 100
where n is the number of retries and 1 is added to account for the original call.
In the case of service A above, the client-side availability with 1 retry is calculated as follows:
= (1 – (1-99/100)^(1+1))*100 = 99.99% availability.
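As a quick sanity check, the same calculation can be sketched in a few lines of Python (the function name and sample values are illustrative):

```python
def perceived_availability(service_availability: float, retries: int) -> float:
    """Perceived client-side availability (as a percentage) after the given retries.

    service_availability is a fraction, e.g. 0.99 for 99% availability;
    each of the (retries + 1) attempts is assumed to fail independently.
    """
    failure_rate = 1 - service_availability
    return (1 - failure_rate ** (retries + 1)) * 100


print(perceived_availability(0.99, 0))  # ~99.0  (no retries)
print(perceived_availability(0.99, 1))  # ~99.99 (one retry, as in the example above)
```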
As simple as this technique is for improving the client-side perceived availability of a service, it comes at the expense of added latency. In most cases, this increased latency is a tolerable trade-off for improved availability. However, as the system grows more complicated over time with a deeper call stack, a lurking risk of a retry storm emerges, which, if not addressed, can take down the entire system. In the following write-up, we will understand why the risk of a retry storm surfaces and what common techniques can help you stay ahead of this situation.
Risks With Retries
For a simple architecture with one or two layers in the call stack, retries work fine with limited downside. However, as your overall system continues to add layers of microservice calls, every retry made at a different layer of the call stack creates a perfect recipe for a retry storm that can bring down the entire system. Let us understand how.
Consider the case of a deeply nested architecture:
- Service W calls -> Service X calls -> Service Y calls -> Service Z
- Where each service performs up to 2 retries when a call fails.
With this setup, if Service Z has an outage, it can receive up to 27x traffic due to retries. Below is how the call pattern looks in our service chain:
- Service Y calls Service Z up to 3 times (1 original call and 2 retries)
- Service X calls Service Y up to 3 times, leading to a total of 9 max calls to service Z
- Service W calls Service X up to 3 times, leading to a total of 27 max calls to service Z, since each of those attempts can produce the 9 calls described in the previous step.
The simple formula to compute this load on service Z is X^n, where X is the maximum number of calls a client makes (the original call plus retries) and n is the depth at which the failure occurs. In our case where service Z fails, this gives 3^3 = 27 calls.
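The amplification is easy to see with a small sketch (values are illustrative and assume every layer exhausts all of its retries):

```python
def worst_case_calls(attempts_per_layer: int, depth: int) -> int:
    """Worst-case number of calls that reach the failing service.

    attempts_per_layer = 1 original call + number of retries per layer.
    depth = number of layers between the entry point and the failing service.
    """
    return attempts_per_layer ** depth


# Service W -> X -> Y -> Z, each layer making up to 3 attempts (1 call + 2 retries):
print(worst_case_calls(3, 3))  # 27 calls can hit service Z during its outage
```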
Most services are not designed to handle this kind of exponential increase in workload, and such an insurmountable surge in traffic can affect other services in the call chain, creating new outages in the system until full recovery happens.
Root Cause
There are multiple articles describing best practices for retries, such as jitter, backoff strategies, and timeouts. While these are useful, they won’t completely get us out of the situation described above. That is because each service treats its retries in isolation. This leads to an exponential addition of requests as we go deeper into the call path, even if we implement jitter, backoff, or timeouts at each individual layer.
Recommendations
Limit the number of retries from every service layer
One way to avoid a retry storm is to stop treating every retry in isolation. Instead, try limiting the overall number of retried requests from a particular service layer. This ensures that if a dependency is having an outage, it only sees a previously agreed-upon rate increase in traffic until recovery is completed. For instance, consider a retry limit of 2% applied by a service sending 5k TPS to a dependency. If this dependency experiences an outage, the overall increase in traffic after taking retries into consideration is only 100 TPS (2% of 5k TPS), which is easier to handle when the dependency is already suffering an outage or scaling issues. This can be implemented using a token bucket algorithm at the service level, as sketched below.
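A minimal sketch of such a retry budget, assuming a token bucket that is refilled as a fixed fraction of outgoing traffic (the class name, 2% fill ratio, and bucket size below are illustrative, not a production implementation):

```python
import threading


class RetryBudget:
    """Service-wide retry budget backed by a token bucket.

    Every outgoing request deposits `fill_ratio` tokens (2% here) and every
    retry withdraws one full token, so retries can never exceed roughly 2%
    of the overall traffic the service sends to the dependency.
    """

    def __init__(self, fill_ratio: float = 0.02, max_tokens: float = 100.0):
        self.fill_ratio = fill_ratio
        self.max_tokens = max_tokens
        self.tokens = max_tokens
        self._lock = threading.Lock()

    def record_request(self) -> None:
        """Call once per original (non-retry) request to refill the bucket."""
        with self._lock:
            self.tokens = min(self.max_tokens, self.tokens + self.fill_ratio)

    def try_acquire_retry(self) -> bool:
        """Return True if a retry is allowed; otherwise fail fast."""
        with self._lock:
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False
```

At 5k TPS, this bucket refills at roughly 100 tokens per second, matching the 100 TPS retry ceiling described above.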
Decide which layers will retry
Instead of every service in the call path performing retries, it is more effective to delegate the responsibility for retries to one or two service layers in the call path. This ensures that services are not unnecessarily retrying when faced with outages. Combining this approach with backoff, jitter, and effective timeouts, as sketched below, can help reduce the overall pressure on the system.
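For the one or two layers that do own retries, a sketch combining capped exponential backoff, full jitter, and a per-attempt timeout could look like the following (helper names and default values are illustrative):

```python
import random
import time


def call_with_retries(do_call, max_retries=2, base_delay=0.1, max_delay=2.0, timeout=1.0):
    """Attempt `do_call(timeout)` with capped exponential backoff and full jitter.

    `do_call` is expected to raise an exception on failure or timeout.
    """
    for attempt in range(max_retries + 1):
        try:
            return do_call(timeout)
        except Exception:
            if attempt == max_retries:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: sleep a random duration up to the capped backoff.
            backoff = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```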
Limit Scale-Ups
An unfortunate side effect of a dependency outage is that our own service starts experiencing increased latency as dependency calls time out. As overall latency increases, the TPS handled by each host in the fleet drops, and if capacity auto-scaling is set up for the service, it kicks in and starts spinning up more hosts. This can further exacerbate the impact of the outage. Limiting scale-ups in such situations to a small number, with appropriate alarming so that a human can intervene, prevents unintended scale-up of service capacity.
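The exact mechanism depends on the auto-scaling system in use; as a neutral sketch of the idea, a guard can clamp how far any single scaling decision is allowed to move capacity and alarm when the clamp is hit (all names and thresholds below are hypothetical):

```python
def capped_scale_up(current_hosts: int, desired_hosts: int,
                    max_step: int = 2, alarm=print) -> int:
    """Clamp a single scale-up decision to `max_step` hosts and alarm when clamped."""
    if desired_hosts <= current_hosts:
        return desired_hosts  # scale-downs and no-ops pass through unchanged
    capped = min(desired_hosts, current_hosts + max_step)
    if capped < desired_hosts:
        alarm(f"Scale-up clamped from {desired_hosts} to {capped} hosts; "
              "check whether a dependency outage is inflating latency.")
    return capped
```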
Employ Circuit Breakers
With circuit breakers, calls to a downstream service are stopped entirely when specific error thresholds are exceeded. For instance, we can decide to stop traffic to a service if more than 10% of requests start failing. However, the biggest downside of circuit breakers is that they introduce bimodal behavior into the system, making it difficult to test. The other downside of this approach, unlike the token bucket approach described previously, is that it can lead to longer recovery: the downstream service may still have capacity to process a certain number of requests, but we are shutting off the entire traffic from clients.
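A minimal circuit breaker sketch (names, the 10% threshold, and the cool-down value are illustrative): it opens once the error rate over a sliding window crosses the threshold, fails fast while open, and lets a single trial call through after a cool-down period.

```python
import time
from collections import deque


class CircuitBreaker:
    """Open the circuit when the recent error rate exceeds `error_threshold`."""

    def __init__(self, error_threshold=0.10, window=100, cooldown_seconds=30.0):
        self.error_threshold = error_threshold
        self.results = deque(maxlen=window)   # True = success, False = failure
        self.cooldown_seconds = cooldown_seconds
        self.opened_at = None                 # None means the circuit is closed

    def call(self, do_call):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None             # half-open: let one trial call through
        try:
            result = do_call()
            self.results.append(True)
            return result
        except Exception:
            self.results.append(False)
            error_rate = self.results.count(False) / len(self.results)
            if error_rate > self.error_threshold:
                self.opened_at = time.monotonic()
            raise
```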
Concluding Thoughts
The management of retries in distributed systems has far-reaching implications beyond just technical reliability. It impacts operational efficiency, user experience, security, and business continuity. Properly addressing these aspects can lead to significant benefits across various dimensions of a business and its technology stack.
Disclaimer: The author is completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.