This is an opinionated guide that outlines my general recommendations for introducing resiliency tooling to a system that makes or accepts requests over a network. I include specific advice but I keep the content agnostic of specific programming languages or frameworks. I use HTTP in some examples because of its prevalence but the concepts and techniques are applicable to most protocols.
Certainly, this is not a comprehensive guide for all possible resiliency tools and operational practices. This is a generalized approach that I’ve developed from my experience working with teams to introduce system reliability tooling and practices over time. I’ve written the sections in the same order that I normally introduce these techniques in a real system. Note, though, that there is no universal approach to resiliency. Balance any advice I give against hard data and an understanding of your system. Above all, make sure you have insight and observability into your system so that you can measure the impact of any change.
We’ll start with this baseline setup:
In this guide, I’ll use "system" to generically refer to any kind of running system such as an HTTP service. I’ll use "method" to refer to an individual action performed within a system such as an HTTP handler. Each resiliency feature will build on this baseline system and method that make a network call to a second system and method.
1. Cancellation Propagation
The first resiliency tool to implement for any system is cancellation propagation. This is the practice of detecting, as early as possible, that an incoming request to a method no longer needs a response. When cancellation is detected then all ongoing work related to that request, including any outgoing requests that are still pending, should also be canceled.
The purpose of cancellation propagation is to minimize, or prevent, work in a system that is guaranteed to serve no purpose. To illustrate, let’s consider this sequence of operations:
This sequence shows the interaction between four systems. System 1 makes a request to system 2 which, in order to complete its work, makes its own request to system 3. System 3 also makes a request to system 4. As each request completes, the response is sent back up the chain and system 1 receives its response at the end. While a bit contrived, this is representative of a lot of real world systems that have network dependencies and especially those deployed in a "service oriented architecture" or "microservice" arrangement.
The problem of request cancellation shows up when any one of these systems stops waiting for a response. For example, let’s look at what happens if system 1 stops waiting either because of an internal error or a client-side timeout:
This scenario is nearly identical to the previous. All the same work is completed but the final response has nowhere to go because system 1 disconnected near the beginning of the process. In effect, all work performed after system 1 disconnected was wasted. This kind of wasted effort is often not impactful at a small scale but when a system is under duress, such as being under unusually high load or experiencing a shortage of system resources, then performing unnecessary work can accelerate system failure.
Cancellation propagation, especially when handled correctly by all systems in an interaction, can prevent this kind of wasted work:
As each system detects a client disconnect for its incoming method call it then disconnects any currently outgoing calls and cancels any logic that was being executed. Variations in the timing of detecting or handling a cancellation can still lead to some wasted work, which makes this kind of solution imperfect, but it is usually quite effective in practice.
Support for cancellation propagation is usually built into your lower level server code and any framework you use on top of that. For example, Go has this feature built into the standard library HTTP server using the language’s context feature, Python’s asyncio plus aiohttp provide a series of exceptions that are raised when a request is canceled, and the grpc framework for Java has a callback style system for delivering cancellation events. If your programming language or framework does not provide you with client cancellation signals then I don’t recommend trying to build it yourself. The number and variety of cancellation signals in something like an HTTP stack make this challenging to implement correctly. Rather, you’re better off either switching development stacks or skipping this step and relying on method timeout values.
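To make the idea concrete for stacks that do support it, here’s a minimal sketch in Go (used only because it was already mentioned above) of a method that propagates cancellation to its one outgoing call. The handler name, path, and downstream URL are placeholders I’ve made up for illustration.

```go
import (
	"io"
	"net/http"
)

// handleReport is a hypothetical method in system 1 that calls system 2.
func handleReport(w http.ResponseWriter, r *http.Request) {
	// The standard library cancels this context when the client disconnects.
	ctx := r.Context()

	// Thread the same context through the outgoing call so that it is torn
	// down as soon as our own caller stops waiting.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://system-2.internal/data", nil)
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		if ctx.Err() != nil {
			// Our caller went away; the outgoing call was abandoned too and
			// there is nobody left to respond to.
			return
		}
		http.Error(w, "upstream failure", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	io.Copy(w, resp.Body) // relay the dependency's response to our caller
}
```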
2. Server-Side Method Timeouts
The second resiliency tool, or first if you cannot implement cancellation propagation, is to establish a maximum duration of time for which your method can continue executing before it cancels itself. When this duration is exceeded then your method should stop and return some kind of exception or error code to the caller. In the case of an HTTP service, for example, you would implement this by returning a 5xx status code when your method reaches its deadline.
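As a sketch of what this can look like with Go’s standard library, http.TimeoutHandler wraps a handler and answers with 503 Service Unavailable once the deadline passes. The 300ms value and the handleReport handler are carried over from the earlier illustration, not recommendations.

```go
import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/report", handleReport) // the handler sketched earlier

	srv := &http.Server{
		Addr: ":8080",
		// Wrap the whole mux so every method shares the same upper bound on
		// execution time. Recent Go versions also apply the deadline to
		// r.Context(), so the handler's outgoing calls are canceled when it fires.
		Handler: http.TimeoutHandler(mux, 300*time.Millisecond, "deadline exceeded"),
	}
	log.Fatal(srv.ListenAndServe())
}
```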
I recommend implementing a timeout for a few reasons. One is that a timeout establishes a finite boundary for all other tools and behaviors. Having a system with no upper limit on method execution time could result in other resiliency tools becoming detrimental in some cases. Another reason I recommend a timeout is that it presents an opportunity for you, or your team, to explicitly consider the expected and required response times of your system. In particular, this is a starting point for defining service level objectives (SLOs[1]) which are an important component for properly configuring other resiliency tooling. Additionally, a timeout can act as an imperfect replacement for cancellation propagation in systems without access to cancellation signals or for cases where a signal is never sent.
You can set this timeout based on either design or observation. My personal preference is to set it based on design in the sense that the method was built to meet a specific response time goal. I prefer this approach because it leverages an existing goal, or SLO, of the system and doesn’t require investigation or guesswork. However, not all methods are built with a specific response time in mind. In these cases you need to monitor the response time of your method, especially under heavy load, and determine the range of response times that are considered acceptable.
In either case you’ll need to determine a hard cutoff time for each incoming request. My process for this is to take my SLO value or measured response time under load and apply a multiplier. For example, if my method is expected to complete within 150ms in the 99th percentile (p99) then I would set the timeout to 300ms or 450ms. The extra time allows my method to degrade by becoming slower than expected while still responding if the client is waiting. This timeout value then effectively becomes the maximum, or p100, duration of the method. There are conditions such as CPU resource exhaustion that can cause the method to become latent beyond this value but the timeout will be used in all non-catastrophic conditions.
3. Simple Concurrency Limits
Cancellation propagation and server side method timeouts are oriented around the incoming, or input, side of a system. Now we’ll focus more on the outgoing side of things.
The next step I recommend is finding all places in a system that make outgoing calls to another system, grouping them by method, and limiting the maximum concurrent calls to those remote methods. A remote method may be called in several places but all of those call sites should share the same maximum concurrency.
The two easiest ways to get started on this in most programming languages are to use either a counting semaphore or an executor pool.
A counting semaphore is a kind of lock that allows a finite number of concurrent holders, greater than one, and is usually available either in the language or through a standard library. Once the allowed number of holders have the lock, new attempts to acquire it naturally enqueue, similar to how mutexes work in most cases. If you use this method then be sure to make the semaphore size configurable so that you can easily adjust it later and, especially, ensure that any code waiting for the lock to become available also respects your method’s timeout and cancellation.
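For example, a sketch using Go’s golang.org/x/sync/semaphore package might look like the following. The limit, URL, and names are illustrative; the two properties worth copying are that the limit lives in one shared place and that waiting for a slot still honors the caller’s context.

```go
import (
	"context"
	"io"
	"net/http"

	"golang.org/x/sync/semaphore"
)

const maxConcurrentLookups = 32 // load this from configuration in real code

// userLookupSem is shared by every call site of the remote "user lookup" method.
var userLookupSem = semaphore.NewWeighted(maxConcurrentLookups)

func callUserLookup(ctx context.Context, id string) ([]byte, error) {
	// Acquire blocks until a slot is free, but gives up if ctx is canceled or
	// its deadline passes, so enqueued callers never wait forever.
	if err := userLookupSem.Acquire(ctx, 1); err != nil {
		return nil, err
	}
	defer userLookupSem.Release(1)

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://users.internal/users/"+id, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}
```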
An executor pool is another option that may be easier to manage in some languages or frameworks. An executor pool is a combination of a queue and a finite set of "executors" that process the queue. An executor is a generic term for "something capable of executing work". In thread based systems the executor pool may be called a thread pool and the executor is a thread. In virtual thread or "green thread"[2] based systems then the pool may be called a worker pool. Like a counting semaphore, executor pools have a finite size that you should ensure is configurable and that any work waiting on an executor can still be canceled if needed.
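A comparable sketch of a small executor pool, with an explicit queue and a fixed set of workers, could look like this. The types and sizes are made up for illustration; the point is that the pool size is bounded and enqueued work can still be abandoned when its caller cancels.

```go
import "context"

// job pairs a unit of work with the context of the caller that requested it.
type job struct {
	ctx context.Context
	run func(ctx context.Context)
}

type pool struct {
	jobs chan job
}

// newPool starts a fixed number of workers draining a bounded queue.
func newPool(workers, queueSize int) *pool {
	p := &pool{jobs: make(chan job, queueSize)}
	for i := 0; i < workers; i++ {
		go func() {
			for j := range p.jobs {
				if j.ctx.Err() != nil {
					continue // the caller already gave up; skip the work
				}
				j.run(j.ctx)
			}
		}()
	}
	return p
}

// submit enqueues work but refuses to wait past the caller's cancellation.
func (p *pool) submit(ctx context.Context, run func(context.Context)) error {
	select {
	case p.jobs <- job{ctx: ctx, run: run}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```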
There is no perfect formula for choosing your concurrency limit regardless of which tool you use. You can estimate a limit in cases of easily measurable and finite resources. For example, if you know the average or median CPU cost of making an outgoing call then you can estimate the number of concurrent calls that would saturate your CPU limits and set your concurrency limit below that value. Likewise, if you sum the concurrency limits of all outgoing calls then the value should be lower than your system’s maximum file descriptor count.
However, you aren’t in control of all the variables when it comes to outgoing calls because your system’s dependencies will have their own limits and behaviors. The best way to fine tune your concurrency limits is with load testing. I recommend starting with limits designed only to protect your system’s critical resources, running a series of tests that increase concurrency over time, looking for concurrency values that result in decreased performance or other negative effects, and then lowering your concurrency limits based on your findings. As you make changes to your system, or as your dependencies change, you may need to test again to ensure that your limit values are still correct.
Having concurrency limits in place not only provides a protective measure for your system but also introduces an important quality of "backpressure". Backpressure describes a system’s resistance to throughput greater than it can handle. In the case of concurrency limits, if the total concurrent input would result in exceeding the output limit then requests begin to enqueue rather than be immediately acted on. This behavior combined with your system’s method timeouts and cancellation propagation helps prevent your system and your system’s dependencies from being overloaded during a load or traffic spike.
As a final note, concurrency limits introduce either an explicit or implicit queue into a system. Queues are a valuable tool but they must be managed over time. I strongly recommend that you begin monitoring queue depth, or the number of requests currently enqueued, as a key metric of your system and for every queue that you introduce. Queue depth is an important indicator of whether you have sufficient capacity for the traffic going through your system. As your system matures, especially for high throughput or latency sensitive systems, you may also need to introduce more advanced queue management techniques such as those described in Netflix’s concurrent-limits project[3] or even other forms of adaptable traffic management based on queue depth such as load shedding[4]. Those types of tools, though, should only be introduced based on proven need because they represent a much higher level of complexity and operational overhead to manage.
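As one lightweight way to start, here’s a sketch that instruments the implicit queue in front of the earlier semaphore using Go’s standard expvar package; any metrics client you already run would serve the same purpose, and the metric name is made up.

```go
import (
	"context"
	"expvar"
)

// lookupQueueDepth counts callers currently waiting for a concurrency slot.
var lookupQueueDepth = expvar.NewInt("user_lookup_queue_depth")

// acquireLookupSlot wraps the semaphore from the earlier sketch so that every
// wait is reflected in the queue depth metric.
func acquireLookupSlot(ctx context.Context) error {
	lookupQueueDepth.Add(1)
	defer lookupQueueDepth.Add(-1)
	return userLookupSem.Acquire(ctx, 1)
}
```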
4. Client-Side Method Timeouts
Remote method calls over a network can become slow for a large variety of reasons that include both the remote system being overloaded and issues with the software or hardware that make up the network in between. The reason for unexpected latency is rarely clear from the client side of an operation. A system with server-side method timeouts already has some protection against the latency it experiences as a client because its own deadline bounds how long it can wait. However, there are advantages to establishing a second timeout for each outgoing method call.
Let’s imagine a method with a 300ms timeout that also has multi-stage logic involving remote dependencies:
In this example, the method has 300ms to complete. The first step is to perform a remote method call to fetch some data which takes 50ms. The second step is to take that information and fan-out three calls to another dependency where the slowest call in the fan-out takes 125ms. Finally, the fan-out results are combined in a third step that makes a final remote call that takes 75ms. In this scenario the method completes within its timeout and the execution is a success. However, an extra 50ms in any of the steps would result in a timeout for the method.
A client timeout is an effort to fail faster when there is a high chance of failure and reduce unnecessary or wasted work. The goal is to set timeout values that are reasonable given a remote dependency’s performance. There are a few ways to determine these timeout values.
If your dependencies offer a response time SLA or some other expectation of response time then that is the value to consider. Most often, method response times are measured within that method’s system such that they do not include any network latency between clients and the system. For example, if a dependency has a 100ms response time SLA for a method then it may always respond in 100ms but your client will still see a greater duration due to the network. It’s important to add a buffer to the SLA value to account for that network latency. For example, if the dependency is in the same data center then you might add only 10% - 20% but if the dependency is half-way across the planet then you might need to add several hundred milliseconds.
I’ve rarely seen a response time SLA in practice so the more likely method for determining the appropriate timeout is to test, measure, and estimate. I suggest looking at historical performance metrics for the dependency, such as your client-side recorded p99 response time, and especially metrics from any load testing. This should give you a reasonable expectation for how the dependency will perform. From there, apply a small multiplier to give the system some flexibility. For example, if you regularly see 200ms response times in the p99 when under load then you might set the timeout to 250ms or 300ms.
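Here’s a small sketch of that in Go: the per-call timeout is layered onto the method’s existing context so whichever deadline is tighter wins. The 250ms value, the function name, and the URL are illustrative.

```go
import (
	"context"
	"io"
	"net/http"
	"time"
)

func fetchProfile(ctx context.Context, id string) ([]byte, error) {
	// Layer the client-side timeout on top of the method's own deadline.
	ctx, cancel := context.WithTimeout(ctx, 250*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://profiles.internal/profiles/"+id, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}
```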
Here’s the same example sequence from earlier but annotated with example timeout values:
Something worth mentioning about the example is that the sum of the client timeouts is larger than the method timeout. This is often not an issue. The mental model being applied here is that we’re looking at how long each outgoing call takes in isolation and attempting to determine whether it will fail based on latency. It’s less about meeting the method timeout and more about detecting failure in specific outgoing calls. My experience is that this model works well in practice for most cases and provides a fairly easy-to-manage configuration of a system. The maximum response time of the method making outgoing calls doesn’t change because it’s still bounded by its own timeout, and the time it takes to complete a successful request doesn’t change either. The overall response time during error conditions, though, should be reduced as a result of more quickly identifying failed outgoing calls.
As your system matures, if you find that you need more advanced techniques of managing outgoing timeouts then I recommend you first consider identifying aspects of your method that can be made optional by using static or cached fallback values. You could more aggressively engage the fallback to save time rather than adjusting timeout values directly. This branches out into the concept of graceful degradation[5] which includes a wider array of tools such as circuit breaking. Alternatively, you might investigate a different timeout model that allows for flex such as a dynamic time budget for the method that accounts for both faster and slower executions when allocating timeouts to client calls.
5. Status Based Retries
Cancellation propagation, timeouts, and concurrency limits are designed to place bounds on otherwise unbounded behavior. Once a system has those bounds then it becomes safer to introduce features that leverage traffic multiplication in order to maximize success. In this case, I suggest that the first multiplicative feature be a retry of any outgoing call based on its response status. By status, here, I mostly mean success or failure but some protocols, like HTTP, may subdivide success or failure into more specific signals such as 201 Created or 400 Bad Request that can be handled with more nuance.
A status based retry inspects the response from a remote method call and makes the call again if it failed. This process repeats until either the call succeeds, the outgoing call reaches its timeout, or the method making the call reaches its timeout. As noted in the diagram, you should put a status based retry policy before the concurrency controls but after the client-side timeout. This ensures that all retries respect the timeout, are bounded by the concurrency limits, and do not unfairly monopolize a unit of concurrency.
One of the most complex parts of a status based retry is determining whether a call is safe to retry. Generally, retries of any kind should only be implemented for methods that you know are idempotent[6] and recoverable. Retrying non-idempotent methods will eventually lead to negative effects and retrying unrecoverable errors generates wasted work.
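Where a remote method is known to be idempotent and recoverable, a retry wrapper can be as small as the sketch below. It treats HTTP 5xx as retryable, gives up when the caller’s context is done, and waits between attempts using the backoff function sketched in the next subsection. The names and the three-attempt cap are illustrative.

```go
import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

func callWithRetry(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
	const maxAttempts = 3
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if attempt > 1 {
			// Wait out the backoff, but stop early if the caller gives up.
			select {
			case <-time.After(backoff(attempt - 1)):
			case <-ctx.Done():
				return nil, ctx.Err()
			}
		}
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return nil, err // a malformed request will never succeed
		}
		resp, err := client.Do(req)
		if err != nil {
			lastErr = err
			continue // transport-level failure; try again
		}
		if resp.StatusCode < 500 {
			return resp, nil // success, or a client error that retrying won't fix
		}
		// Drain and close the failed response so the connection can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		lastErr = fmt.Errorf("upstream returned %s", resp.Status)
	}
	return nil, lastErr
}
```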
5.1. Backoff After Retry
If your remote method calls retry immediately after each failed attempt then you risk causing more downtime for your own system. This is especially true when the failures are due to a degraded state in a remote system. At best, a zero delay retry may increase the recovery time of your dependency by exhausting a scarce resource. Worse, if every caller of that remote method retries immediately then they each become one node in a denial of service[7] attack that engages as soon as the shared dependency encounters a degraded state. These types of "retry storms" can create sudden, orders of magnitude increases in the traffic your dependency is trying to handle. This, in turn, can create an incident that is only resolvable by shutting down all the clients of the overburdened system if it cannot be scaled up.
To avoid these issues, every retry policy must have a backoff policy. A backoff policy determines the amount of time between retries. Each backoff policy must account for two properties: delay scaling and jitter.
Delay scaling describes the process of increasing the amount of time spent in backoff between subsequent requests. The most commonly described delay scaling choice is exponential backoff[8]. Exponential scaling means that each retry attempt happens after some compounding multiple of the original backoff, usually by doubling. This type of delay scaling is useful across a wide variety of systems. The only cases I’ve personally encountered where exponential backoff performed poorly were when managing methods that both make remote method calls and must complete their function in a short amount of time (ex: p99 of less than 30ms). For those cases I’ve come to recommend a static backoff duration rather than scaling it but note that these kinds of cases are rare.
Jitter describes a random amount of variation in the backoff. Adding jitter helps prevent retries from being attempted at exactly the same time when there are multiple, concurrent calls to the remote method or system that fail. Jitter can be calculated multiple different ways[9]. I recommend starting with a simple, "decorrelated" jitter which means setting a minimum and maximum jitter time, generating a random value between those points, and adding the result to the backoff time.
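Continuing the earlier retry sketch, a backoff function with exponential scaling and the simple min/max jitter described above might look like this. All of the durations are placeholders and should come from configuration.

```go
import (
	"math/rand"
	"time"
)

const (
	baseBackoff = 50 * time.Millisecond
	minJitter   = 10 * time.Millisecond
	maxJitter   = 100 * time.Millisecond
)

// backoff returns the delay before the given retry (first retry = 1).
func backoff(attempt int) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	delay := baseBackoff << (attempt - 1) // exponential scaling: 50ms, 100ms, 200ms, ...
	// Jitter: a random duration between the minimum and maximum, added on top.
	jitter := minJitter + time.Duration(rand.Int63n(int64(maxJitter-minJitter)))
	return delay + jitter
}
```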
6. Request Hedging
Status based retries are a multiplicative tool that can work around temporary or spurious error conditions. Request hedging is a multiplicative tool that can work around temporary or spurious response time delays.
Hedging is a technique where a single request is made first. If a response is not received within an expected time range, such as a remote method’s p75 or p90 response time, then a second concurrent request is issued. This process is repeated until either a response is received or the maximum client timeout is reached. Once the client reaches a terminal condition then all outstanding requests are canceled.
Hedging is mostly focused on working around conditions within one of multiple instances of a remote system. There are specialized conditions, such as temporary network malfunctions or gaming a LIFO queue, that can differently impact requests to the same instance of a system but the multiplicative nature of hedging works best for target systems with some form of horizontal scaling. For example, if one instance of the remote system is overloaded but another instance is not then that is the sort of condition hedging works around. To illustrate this, I’ll introduce the concept of "runtime" into the ongoing diagram:
In this case a "runtime" represents an individual instance of a system that is identical to all others. This is a common deployment configuration for horizontally scaling systems. The same code and the same methods are deployed multiple times across different resource pools, such as virtual hosts or containers, with the expectation that they will each handle a subset of the total load directed at the system. Often, this arrangement is invisible to clients if traffic is distributed across runtimes using a load balancer. So the actual arrangement could look like:
If somewhat even load balancing is in play then request hedging often results in subsequent requests being routed to different runtimes which may be under different conditions and are able to more quickly respond. Note, however, that the behavior of the load balancer (whether client-side or server-side) affects the usefulness of this practice. For example, a load balancer between two systems that uses "sticky sessions" or some other mechanism that consistently routes requests from one source runtime to the same target runtime will greatly diminish the positive effects of hedging and could make hedging a net negative in the system because it works by multiplying traffic.
If you have determined that there is appropriate load balancing to make hedging worthwhile then the next step is to determine the hedging delay, or the amount of time between each request. Similar to client-side timeouts, the hedging delay should be based on either the SLA or the measured response time of the dependency. For example, if you have configured your client-side timeouts based on the p99 response time plus some buffer or multiplier then you might use the p90 or p75 response time as the hedging delay. The rationale is that the timeout is set based on when a response is considered so late that it has a high likelihood of failure. Hedging needs to engage when it appears that the request will be late but before it’s actually too late.
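Putting the pieces together, here’s a sketch of a hedged GET in Go that issues at most one duplicate request after the hedge delay and lets the shared context cancel whichever attempt loses. Production hedging policies (attempt caps, budgets, per-method delays) would be more involved; all names here are illustrative.

```go
import (
	"context"
	"io"
	"net/http"
	"time"
)

func hedgedGet(ctx context.Context, client *http.Client, url string, hedgeDelay time.Duration) ([]byte, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // tears down whichever attempt is still in flight

	type result struct {
		body []byte
		err  error
	}
	results := make(chan result, 2) // buffered so the losing attempt never blocks

	attempt := func() {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			results <- result{nil, err}
			return
		}
		resp, err := client.Do(req)
		if err != nil {
			results <- result{nil, err}
			return
		}
		defer resp.Body.Close()
		body, err := io.ReadAll(resp.Body)
		results <- result{body, err}
	}

	go attempt() // the primary request

	hedge := time.NewTimer(hedgeDelay)
	defer hedge.Stop()

	pending := 1
	var lastErr error
	for pending > 0 {
		select {
		case <-hedge.C:
			go attempt() // no response yet; issue one concurrent duplicate
			pending++
		case r := <-results:
			if r.err == nil {
				return r.body, nil // first completed response wins
			}
			lastErr = r.err
			pending--
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}
```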
7. Connection Pooling
With this last item, I’m going to shift the perspective slightly. The other tools and techniques I’ve presented are concerned with either limiting otherwise unbounded behavior or using traffic multiplication to work around issues. All of these tools, though, address behaviors within a system. With connection pooling we’re going to focus on the network between two systems and in a way that is mostly irrespective of what the two systems are doing.
Especially if using HTTP, most standard libraries or frameworks implement a connection pool for outgoing requests. A connection pool is a collection of already established connections between systems that can be reused for outgoing requests rather than requiring that a new connection be created on-demand. Connection pools are often discussed from a perspective of performance but I’m going to focus on the aspects of a pool that can affect the resiliency or reliability of a system.
To start, a good mental model to develop about a connection pool is that it is a variation of a cache. Creating a new network connection can be an expensive task relative to the time or cost of actually executing a request. In the HTTP(S) world a new connection must generally:
Resolve a DNS address to an IP address
Create a TCP connection to that IP
Establish TLS state
Send the request
While it seems simple, each of these steps has sub-steps which can often include multiple back and forth communications over the network. To illustrate, an HTTP request without a pool might look like this:
In this example, there are four "round trips" through the network before the HTTP request can even begin. This often happens quickly, potentially within single digit milliseconds, when systems are located geographically near each other and on the same network. Consider, though, two systems that have a 50ms round trip time (RTT). Let’s discard DNS for the moment because it’s often resolved locally and only consider TCP and TLS. In the worst case these may represent three full round trips. A 50ms RTT would mean that creating a connection takes roughly 150ms. If your SLO for a response time is 200ms or less then you’re unlikely to ever succeed if there is a constant 150ms added to every interaction with your dependency.
Assuming you cannot move the two systems physically closer to reduce the RTT then your only option for reducing the time cost of creating a connection is to discover new physics. More practically, you can assume the full cost of creating a connection and reuse it as often as possible to avoid paying that cost again. If you track the average time to make a request and the first request must create a new connection then your average might start at, for example, 250ms with 150ms being attributable to the connection. Each new request that uses the same connection, though, brings that average down because they take only 100ms to get a response. The more you reuse the connection then the more the average trends towards the actual request and response time. This process of averaging the cost down through reuse is called amortization. The ideal amortization brings the average overhead cost of the connection to zero.
Amortization of connection cost is the fundamental benefit of a connection pool and it delivers this value by creating a cache of connections for reuse.
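For a concrete (and deliberately simplistic) version of that arithmetic, here’s a tiny helper using the example numbers above: a 150ms connection setup cost and a 100ms request and response time.

```go
import "time"

// averageRequestCost returns the average time per request when n requests
// share a single connection that cost 150ms to establish.
func averageRequestCost(n int) time.Duration {
	const (
		setup   = 150 * time.Millisecond
		request = 100 * time.Millisecond
	)
	return (setup + time.Duration(n)*request) / time.Duration(n)
}

// averageRequestCost(1)   == 250ms
// averageRequestCost(10)  == 115ms
// averageRequestCost(100) == 101.5ms
```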
7.1 Cache Invalidation Of Connection Pools
Every cache must have some way to invalidate items or else they can become inaccurate, a state often referred to as "stale". If connections from a pool become stale then they result in errors when used. This usually surfaces as either making requests to a runtime that no longer exists or failing to make use of new runtimes that have been added since the connections were made.
Most HTTP connection pools invalidate connections based on:
Lower level network errors
Request cancellation (HTTP/1.X only)
Detection of Connection: close headers
Maximum idle time
Maximum lifetime
Most of the invalidation signals are automatic and in response to events encountered while the system is running. The maximum idle and lifetime values are potentially something you need to reconfigure from the defaults. For example, if you have an environment where runtime IPs are discovered through DNS then the max lifetime value should match some multiple of your DNS TTL value. The rationale here is that invalidating on every TTL may be too frequent, especially for short TTL values like 60s or 120s, but never invalidating means your pool won’t be in sync with the current list of available runtimes. You have to choose how much risk to take on in terms of number of DNS updates you can potentially ignore.
Configuring the lifetime and idle values requires specific knowledge of how your infrastructure works. As a counter example to DNS above, the default operating model in Kubernetes has inter-system traffic target virtual IP addresses that act as a TCP load balancer[10]. Periodically refreshing the connection pool can help ensure a more equal distribution of requests across runtimes of the target system if it has, for example, scaled up to provide more runtimes.
There is no one size fits all configuration. Any choice you make should be validated with testing and monitoring with a focus on the amortization rate you are getting.
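As one example of what this tuning can look like, here’s a sketch using Go’s http.Transport. The transport exposes idle connection limits and an idle timeout directly but has no built-in maximum connection lifetime, so the sketch approximates periodic rotation by dropping idle connections on an interval; every value shown is a placeholder to validate against your own DNS TTLs and traffic.

```go
import (
	"net/http"
	"time"
)

func newPooledClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        100,              // total idle connections kept across all hosts
		MaxIdleConnsPerHost: 20,               // idle connections kept per dependency
		IdleConnTimeout:     90 * time.Second, // discard connections idle longer than this
	}

	// Approximate a maximum connection lifetime: periodically drop idle
	// connections so the pool reconnects, and re-resolves DNS, against the
	// current set of runtimes.
	go func() {
		for range time.Tick(5 * time.Minute) {
			transport.CloseIdleConnections()
		}
	}()

	return &http.Client{Transport: transport}
}
```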
7.2 Connection Drain
Connection pools are typically assigned a size which represents the maximum number of connections they should hold in the cache at any given time. Pools commonly fill lazily, based on usage, until the desired connection count is achieved though some implementations may fill eagerly. Pools also come in two styles that I call "blocking" and "non-blocking" to describe how they handle requests when all connections are currently in use.
A blocking pool at full utilization enqueues additional requests until a connection becomes available. If you have this type of pool then you must ensure that the size is coordinated with your concurrency limits for outgoing requests or it could unnecessarily enqueue requests. A non-blocking pool at full utilization will create a new connection for each additional request rather than enqueue the request. As connections are returned to the pool it will discard any excess based on the pool’s size.
Both of these styles have their trade-offs and both should have their size coordinated with your concurrency limits for best effect. Both can also be suddenly drained of connections. I mentioned earlier that request cancellation is typically a signal to discard a connection from a pool. If, for example, a remote method call becomes consistently slower to respond than your configured client timeout then all calls to that dependency could result in a connection being discarded from your pool. This quickly results in an empty connection pool, which means that all subsequent requests result in creation of a new connection. While this will certainly impact performance, the more dangerous condition is that this behavior can exhaust the maximum file descriptor limit of a system if the request and cancellation rates are high enough.
Unfortunately, this problem is inherent in connection oriented, request/response protocols like HTTP/1.X. Often, the simple concurrency limits in place will protect from this but I have encountered incidents of this exact problem in multiple systems before. If you are having connection drain issues due to cancellations or timeouts then my best advice is to switch away from the connection oriented protocol (HTTP/1.X) to one that can multiplex requests on a single connection (HTTP/2.X). This allows for connections to be retained regardless of the status of a request.
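In Go, for example, opting in can be as simple as the sketch below: ForceAttemptHTTP2 asks the transport to negotiate HTTP/2 over TLS where the server supports it, so many requests multiplex over one connection instead of each holding its own. Plaintext HTTP/2 (h2c) and other stacks need more setup than shown here.

```go
import (
	"net/http"
	"time"
)

func newHTTP2Client() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			ForceAttemptHTTP2: true, // negotiate HTTP/2 via TLS ALPN when available
			IdleConnTimeout:   90 * time.Second,
		},
	}
}
```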
7.3 Adapting For Request Multiplexing
Multiplexing protocols, like HTTP/2.X, do not associate connections with requests. Instead a connection becomes a transport and multiple requests can be sent over the same connection regardless of how many responses have come back. Request cancellation in a multiplexing protocol involves sending an explicit "cancel" signal to the other side rather than closing the connection.
Switching to a multiplexing protocol, like HTTP/2.X, often greatly changes the behavior and utility of a connection pool. In the case of HTTP/2.X, the vast majority of clients and connection pools are designed to make exactly one connection to each hostname and re-use it indefinitely. Creating or recreating connections becomes a rare task. This is how multiplexing solves the cancellation related connection draining and resource exhaustion issues. In a way, multiplexing makes connection pooling mostly irrelevant. However, multiplexing introduces a new set of operational issues to manage.
The first issue with multiplexing is that it amplifies the effects from the earlier section on cache invalidation. When more requests use the same connection then the risk and impact of a stale connection become higher. Systems using TCP load balancing are especially affected because now each runtime connects to exactly one other runtime due to using a single connection for all outgoing requests. So the advice of periodically rotating the connection is still relevant. Additionally, you should consider having a protocol aware proxy in place of TCP load balancing. For example, use an instance of nginx or haproxy as the TCP destination that can then more fairly route individual HTTP requests.
Another major issue with multiplexing is that it is subject to large latency conditions during incidents of packet loss[11]. I’ve managed incidents in HTTP/2.X based systems that were caused by this exact issue. One incident involved a series of email exchanges between me and a cloud provider where we found that our unusual latency issues aligned with their switch maintenance schedule and the resulting packet loss. Unfortunately, this is also an inherent issue in the protocol. I’ve had marginal success with configuring or modifying the pool to round robin across three or five connections but whether that can actually help during packet loss depends on too many and too specific variables to be a reliable resolution. Instead, the future HTTP/3.X over QUIC should provide some relief.
Next Steps
The tools and techniques I’ve described to this point represent the foundational or fundamental resiliency tools that I think all network systems should have. In my experience, most systems don’t need much more than I’ve presented here. Even what I’ve suggested so far likely takes weeks or months of investigation, monitoring, testing, configuring, and re-testing to safely and successfully integrate.
That being said, I’ve worked on systems where these techniques were not enough. I’ve mostly experienced this for systems that are much higher throughput, more latency sensitive, or more critical than others. There are additional techniques available for these cases but they require a greater level of complexity, operational overhead, and depth of knowledge about your system to manage. From here, consider distributed rate limiting, advanced concurrency controls, adaptive queue management, load shedding, and circuit breaking as topics of research.