This article covers the principles, design, and nuances of the circuit breaker pattern as a software tool. I describe the inner workings and common malfunctions of circuit breakers as well as offer suggestions for managing those malfunctions.
I approach this topic from the perspective of the pattern as a whole rather than any specific implementation or programming language. If you are adding circuit breakers to your project for the first time, building a new circuit breaker implementation, or simply want to better understand what your current circuit breaker tool is doing, then this article should provide valuable information.
Circuit Breaker Basics
In the context of software development, a circuit breaker is a design pattern that monitors the rate and completion status of a specific action and bypasses that action when the action’s error rate exceeds a configured threshold. Circuit breaker implementations are typically modeled as wrappers, or middleware, so they can both monitor and intercept requests to complete the guarded action. When the error rate of an action is below the threshold then the circuit breaker acts as a pass-through that transparently records the status of each action execution. However, if the error rate exceeds the threshold then the circuit breaker "trips" or "opens". Once open, it immediately returns or raises an exception on each attempt rather than executing the guarded action. This behavior continues until the circuit is closed by some other signal.
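To make the wrapper shape concrete before digging into the mechanics, here is a minimal usage sketch. Every name in it (CircuitBreaker, CircuitOpenError, fetch_profile, cached_profile) is invented for illustration rather than taken from any specific library, and the breaker itself is only a stand-in whose internals are covered in later sections:

    class CircuitOpenError(Exception):
        """Raised instead of executing the guarded action while the circuit is open."""

    class CircuitBreaker:
        """Stand-in only; later sections cover the window, thresholds, and states."""

        def __init__(self, error_threshold=0.5):
            self.error_threshold = error_threshold

        def call(self, action, *args):
            # A real implementation records each outcome and raises
            # CircuitOpenError instead of calling the action while open.
            return action(*args)

    def fetch_profile(user_id):
        ...  # the guarded action, such as an HTTP call to another service

    def cached_profile(user_id):
        ...  # an alternative behavior used while the circuit is open

    breaker = CircuitBreaker(error_threshold=0.5)

    def get_profile(user_id):
        try:
            return breaker.call(fetch_profile, user_id)
        except CircuitOpenError:
            return cached_profile(user_id)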
The top three reasons why a circuit breaker is introduced into a system are:
Triggering alternative behaviors when the circuit is open.
Failing faster in the case of errors due to timeout.
Providing relief to a dependency by temporarily stopping the flow of traffic.
Circuit breakers can provide these benefits when used carefully but implementations are often much easier to install than they are to manage. Having an understanding of the design and inner workings of the pattern makes it easier to conceptualize what an implementation is doing and how to better configure or manage that implementation.
Opening A Circuit
The circuit breaker begins in a closed state and the guarded action is executed on each attempt. The determining factor for opening a circuit is the calculation of an error rate and comparison to some threshold value. The error rate is calculated over some window of time which is usually configurable.
Rolling Window
The most common windowing strategy is a rolling window, which represents a recent period of time and shifts forward in units of a smaller, fixed duration. From a software perspective, this is generally implemented using a "bucket" technique where the window is divided into buckets that each cover one of those fixed intervals. To illustrate, let’s consider a window of 6 seconds where each bucket represents 1 second:
All new values are collected in the T-0 bucket which represents the current 1 second span of time. As each second passes, the oldest bucket is dropped and a new T-0 bucket is created. This rolling window strategy is commonly used because efficient algorithms for it are well known and implementations often already exist for working with real-time, high throughput data. Other window strategies may technically be used, but if you survey circuit breaker implementations you will find that the overwhelming majority use a rolling window.
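As a concrete reference for the rest of the article, here is a minimal sketch of a bucketed rolling window. The class name, the counter fields, and the one second rotation are my own illustrative choices rather than the API of any particular library:

    import time
    from collections import deque

    class RollingWindow:
        """Tracks successes and failures across fixed one second buckets."""

        def __init__(self, window_seconds=6):
            self.window_seconds = window_seconds
            # Each bucket holds the counters for one second of activity.
            self.buckets = deque([{"success": 0, "failure": 0}], maxlen=window_seconds)
            self.current_second = int(time.monotonic())

        def _rotate(self):
            # Drop the oldest bucket and add a new T-0 bucket for each elapsed second.
            now = int(time.monotonic())
            for _ in range(min(now - self.current_second, self.window_seconds)):
                self.buckets.append({"success": 0, "failure": 0})
            self.current_second = now

        def record(self, succeeded):
            self._rotate()
            key = "success" if succeeded else "failure"
            self.buckets[-1][key] += 1

        def totals(self):
            self._rotate()
            successes = sum(b["success"] for b in self.buckets)
            failures = sum(b["failure"] for b in self.buckets)
            return successes, failures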
Error Rate Calculation
The rolling window contents are evaluated before each attempt to execute the guarded action. The core metric in a circuit breaker is the error rate or the error percentage. When the error rate of a window exceeds the configured threshold then the circuit opens. However, calculation of an error rate is not as straightforward as it may appear. To illustrate, let’s use the same window from before:
This window is tracking total, attempted requests and failed requests. When evaluated, the window presents a 26% error rate. The failure counts are clearly increasing over time and the most recent bucket shows a 76% error rate within that one second interval. If the trend continues then the circuit will eventually open as the buckets representing healthier intervals of time roll out of the window.
There are three important considerations when calculating an error rate: the size of the window, the amount of data in the window, and the choice of an appropriate "total" value when calculating a percentage. Let’s start with window size.
Impacts Of Window Sizing
The window size can affect the sensitivity of the error rate, or the degree to which the error rate is influenced by temporary conditions. To illustrate:
This window contains two buckets, T-1 and T-2, where all attempts failed but the error rate remains lower than the threshold because the majority of buckets contain enough success data. In this scenario, the guarded action encountered a temporary fault or degraded state which then recovered within two seconds. A third of requests in the window failed but the circuit remained closed and the system recovered.
Generally, the larger the window, the more slowly the error rate rises or falls, assuming the rate of new data is fairly consistent over time, because every bucket is weighted equally in the calculation. Certainly, if failures are also associated with a large increase in usage then the size of the window will have less impact, but that is also one of the exact scenarios that circuit breakers are intended to address. The process of fine-tuning a circuit’s window size is about making it tolerant to expected levels of errors.
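To make the effect concrete, consider a hypothetical service that handles a steady 100 attempts per second and suffers a 2 second burst in which every attempt fails (the numbers are invented purely for illustration):

    6 second window:  RATE = 200 / 600  = ~33%
    30 second window: RATE = 200 / 3000 = ~7%

The same fault produces very different error rates depending only on how much healthy history the window retains, which is why window size is effectively a sensitivity control.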
Empty Windows And Small Data
Rolling windows with optimal sizing can still result in skewed decision making if the window is mostly empty. An empty window is a common condition for any system when it first starts and a regular condition for systems with low throughput. When a window is mostly empty then a smaller set of data determines the error rate. To illustrate, let’s consider a system with a dependency that often takes much longer than usual to respond if it has been idle for some time. This is sometimes referred to as a "cold start" cost. Once the dependency is "warmed up" then it responds quickly. If the dependency is infrequently used then a common window state might look like this:
The system made only 3 requests in this example but the first 2 of them failed due to reaching a timeout while waiting for the cold start. The third request succeeded and the dependency is now ready to service subsequent requests quickly. However, the error rate at this point in time is 66% which is likely to be greater than a circuit breaker’s threshold. If the circuit opens then the system will stop making requests to an otherwise healthy dependency.
The lack of historical data means that only the current bucket is driving the error rate. Failure counts in an empty window, whether expected or randomly encountered, can quickly weight the error rate to a high value which then triggers the circuit to open. A common solution for this problem is to require a minimum amount of data in the window before the error rate evaluates to anything other than 0%. In the above example, we might set the minimum count to ten or twenty so that a couple of cold start failures cannot dominate the calculation and the window usually contains enough successful data points to outweigh them.
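A minimal sketch of that guard, reusing the RollingWindow from the earlier sketch and an invented minimum_volume setting, might look like this:

    def error_rate(window, minimum_volume=20):
        successes, failures = window.totals()
        completed = successes + failures
        # Treat a mostly empty window as healthy until there is enough data
        # for the percentage to be meaningful.
        if completed < minimum_volume:
            return 0.0
        return failures / completed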
Choosing A Total Value
An error rate calculation is usually a simple division of failures over the total to get a percentage. However, choosing an appropriate total is made more difficult by the use of a window policy like a rolling window. Logically, the error rate is calculated as RATE = FAILURES / ATTEMPTS but attempts and their respective failures do not usually happen at exactly the same time, especially if the action being executed fails due to a timeout condition. To illustrate, let’s look at an example window of a system that executes a circuit breaker guarded action every four seconds and uses a two second timeout:
This is a specifically crafted scenario but it represents a practical problem. Recording attempted, or total, actions when they are started and comparing that value to failures which may not happen within the same bucket can result in skew. In the example we end up with a 200% failure rate which is not a meaningful value. An extension of this problem results in division by zero when failures are recorded but all attempts have rolled out of the window.
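Walking through a timeline makes the skew clearer. Assume the 6 second window from earlier, attempts recorded when they start, and failures recorded when the 2 second timeout fires:

    t=0  attempt #1 recorded
    t=2  failure #1 recorded (attempt #1 timed out)
    t=4  attempt #2 recorded
    t=6  failure #2 recorded (attempt #2 timed out)
    t=7  the window now covers t=2 through t=7: 1 attempt, 2 failures

    RATE = FAILURES / ATTEMPTS = 2 / 1 = 200%

Attempt #1 has rolled out of the window while both failures remain, so the total no longer describes the same set of actions as the failures.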
To account for this, the actual error rate calculation for a circuit breaker should be either RATE = FAILURES / (FAILURES + SUCCESS) or RATE = FAILURES / COMPLETED where SUCCESS or COMPLETED are recorded when an action ends rather than when it begins.
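As a sketch of that ordering, reusing the RollingWindow from earlier, nothing is written to the window until the attempt has actually finished:

    def guarded_call(window, action, *args):
        try:
            result = action(*args)
        except Exception:
            # The failure is recorded at completion time, in whichever bucket
            # is current when the attempt ends (for example, after a timeout).
            window.record(succeeded=False)
            raise
        # Successes are also recorded at completion, so the denominator of
        # FAILURES / (FAILURES + SUCCESS) only ever counts finished attempts.
        window.record(succeeded=True)
        return result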
I have seen the mistake of recording attempts at the start of an action in several circuit breaker implementations. I have also made this mistake myself when building similar systems. I recommend auditing your implementation for this behavior.
Closing A Circuit
When a circuit opens then the underlying action is bypassed until the circuit closes again. If the action is bypassed then no more values are recorded in the rolling window. If a circuit relies only on the values in the window to determine when to close then it may behave poorly in common scenarios. To illustrate, let’s look at an example window:
In the example, we can see that an issue occurs around T-2 and continues through the current time with each more recent bucket containing a higher failure count. Let’s assume the circuit is configured to open at a 50% error rate and look at the state again after two seconds:
In this scenario, the failures recorded in more recent buckets are continuing to increase. This matches a common pattern of failure for dependencies where the error rate suddenly spikes and then continues to remain high for the duration of the unhealthy state. The total error rate within the window now exceeds 50% so the circuit transitions to an open state. After two more seconds the window looks like:
Here we see that the window continues to remove the oldest bucket after each bucket duration and add a new, empty bucket to the leading edge of the window. The circuit is open and the action is bypassed which means there are zero attempts and zero failures recorded in the new buckets. As a result, the total number of data points in the window decreases while the circuit is open. In the case of an error spike, as illustrated above, this means that the failure rate actually increases with each bucket duration that the circuit is open. To generalize, an open circuit results in error rate calculations becoming progressively biased towards the buckets closest to the moment the circuit opened. For example, here is the window two more seconds later:
At this point, the window contains only two buckets with values and evaluates to about a 90% error rate. This trend will continue until all buckets are removed from the window at which point it will evaluate to a zero error rate again.
This scenario is constructed specifically to maximize the effect for illustration purposes but it does represent a common situation for circuit breakers. If the rolling window were the only factor in deciding when to close a circuit then most open circuits would remain open for an entire window of time. To account for this, virtually all circuit breaker implementations ignore the rolling window of data once open and, instead, rely on a behavior commonly referred to as the "half-open state".
Half-Open State And Periodic Probes
The term "circuit breaker" implies binary states of open and closed. Virtually all circuit breaker implementations, though, support a third state called "half-open" in which a limited number of actions are allowed to execute even though the circuit is open. The purpose for half-open is to enable the circuit to close more quickly than the rolling window error rate calculation would allow in situations where the cause of past errors is resolved.
The most common half-open implementations follow the same logic. When the circuit opens then a timer is started. Once the timer has exceeded some configurable duration then the circuit enters half-open and allows one or more actions to execute. These actions are effectively probes to determine if the source of errors is resolved. If the probes fail then the circuit transitions back to open and the timer is reset. If enough probes succeed then the circuit resets the rolling window of data and transitions to closed. The number of probes allowed during half-open and the success rate of the probes needed to close the circuit are configurable values in most implementations. A common default is to only issue a single probe during each half-open state.
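A minimal sketch of that open, half-open, and closed cycle is below. It reuses the RollingWindow and CircuitOpenError types from the earlier sketches, fills in the stand-in CircuitBreaker, invents the configuration names (error_threshold, minimum_volume, open_duration), allows a single probe per half-open period, and ignores concurrency for brevity:

    import time

    class CircuitBreaker:
        CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

        def __init__(self, window, error_threshold=0.5, minimum_volume=20, open_duration=30):
            self.window = window
            self.error_threshold = error_threshold
            self.minimum_volume = minimum_volume
            self.open_duration = open_duration
            self.state = self.CLOSED
            self.opened_at = None

        def _error_rate(self):
            successes, failures = self.window.totals()
            completed = successes + failures
            if completed < self.minimum_volume:
                return 0.0
            return failures / completed

        def call(self, action, *args):
            if self.state == self.OPEN:
                if time.monotonic() - self.opened_at >= self.open_duration:
                    # The open timer has elapsed; let a single probe through.
                    self.state = self.HALF_OPEN
                else:
                    raise CircuitOpenError()
            elif self.state == self.CLOSED and self._error_rate() >= self.error_threshold:
                self.state = self.OPEN
                self.opened_at = time.monotonic()
                raise CircuitOpenError()

            try:
                result = action(*args)
            except Exception:
                if self.state == self.HALF_OPEN:
                    # Failed probe: return to open and restart the timer.
                    self.state = self.OPEN
                    self.opened_at = time.monotonic()
                else:
                    self.window.record(succeeded=False)
                raise
            if self.state == self.HALF_OPEN:
                # Successful probe: reset the window history and close.
                self.window = RollingWindow(self.window.window_seconds)
                self.state = self.CLOSED
            else:
                self.window.record(succeeded=True)
            return result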
An important aspect of the half-open state in virtually all implementations is that a successful probe is the only way to transition to a closed state. Once the circuit is open, the rolling window is no longer consulted; its only role is to open the circuit in the first place. This behavior sidesteps the effect of large windows as well as any error conditions that would otherwise bias the window after the circuit opens.
Circuit Breaker Malfunctions
Most circuit breaker behaviors are predictable for a given configuration. If the error rate exceeds the threshold then the circuit opens. If a half-open probe is successful then the circuit closes. However, even these simple rules can lead to emergent and undesirable behaviors.
Flapping
Flapping describes a circuit that is continually opening and closing without settling into a stable state. In practice, I’ve seen flapping emerge most commonly when the system guarding an action with a circuit breaker is also the primary reason why the action fails at a high enough rate to open the circuit. The scenario where I’ve seen the most flapping is a large and sustained increase in throughput to the guarded action that the dependency involved in the action cannot keep up with. For example, if the guarded action is an HTTP call from one system to another and the rate of those HTTP calls increases beyond the receiving system’s capacity then a flapping pattern emerges:
The circuit is closed. Requests are sent from client to server.
The receiving system becomes overwhelmed. Responses begin to time out or fail with error codes.
The failure rate exceeds the threshold. The circuit opens.
The receiving system quickly recovers once the excess traffic is blocked by the circuit breaker.
The circuit enters half-open. The probe is successful because the receiving system is healthy again.
The circuit closes. Repeat.
In a way, the circuit is behaving as expected. However, this type of flapping pattern does not always stabilize on its own. For example, if the source of traffic is persistent, such as event streaming or queue processing, then the issue may require manual intervention to resolve because the backlog of actions to execute continues to build while the circuit is open.
A secondary effect of flapping that I’ve seen in systems is the suppression of auto-scaling policies. Systems that auto-scale generally scale up or down based on system usage metrics such as CPU usage. Auto-scale policies usually require their target metrics to exceed some threshold for a specified period of time in order to prevent temporary spikes or dips in usage from triggering scale events. A flapping circuit results in large but disconnected bursts of usage such that it may prevent an auto-scale policy from triggering because the increased usage is not continuous enough.
A flapping circuit is an indicator that a system is likely missing a different reliability tool. If you experience flapping then you should look into additional tooling such as back pressure, rate limiting, and concurrency controls. Generally, I recommend looking into these tools before adding circuit breaking, anyway, because they are critical behaviors in a reliable system.
Accidental Denial Of Service
Circuits are configured to open at a specified error rate and that rate is usually set to a high percentage value. For example, 50% is a common default error rate for implementations that are either modeled after or are ports of Hystrix[1] (an open source implementation from Netflix that helped popularize the usage of the design pattern in general).
An issue with this behavior is that the circuit breaker model assumes that all executions of the guarded action are equal. However, if the circuit is used in a multi-tenant system then it’s possible, if not likely, that the guarded action is not being used evenly by all tenants. The 50% error rate, for example, might actually represent failures for only a single tenant while actions on behalf of all other tenants have a 0% error rate. If that single tenant also represents a majority of the throughput then the faulty tenant is more likely to be used for the half-open probes. The result is that a single faulty tenant can deny service to other tenants using the same system.
If your usage model includes a single action being executed by heterogeneous sources, like a multi-tenant system, then one possible method of isolating impacts is to create a circuit for each source. This method is easiest to implement when the number of sources is mostly static and the behavior of each source is known because each circuit will need to be fine tuned to behave optimally. Arranging the circuits this way can help ensure that single source errors do not affect other sources of traffic.
If the number of sources is large or dynamic, such as many multi-tenant systems, then creating a circuit per source is likely impractical. A possible alternative is to categorize requests into a finite set that each have their own circuit. For example, if you perform progressive rollouts of features then you might consider directing new feature traffic through a second circuit so that errors related to the change do not accidentally deny service to users of the stable version.
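One way to arrange this, reusing the CircuitBreaker and RollingWindow sketches from earlier and inventing a classify function that maps a request to a tenant or category label, is to keep a small registry of circuits:

    def classify(request):
        # Hypothetical: return a tenant id, a feature rollout bucket, or any
        # other finite label that makes sense for the system.
        return request["tenant_id"]

    def process(request):
        ...  # the guarded action

    breakers = {}

    def breaker_for(request):
        key = classify(request)
        if key not in breakers:
            # Each source gets its own window and, in practice, its own tuning.
            breakers[key] = CircuitBreaker(window=RollingWindow(window_seconds=10))
        return breakers[key]

    def handle(request):
        return breaker_for(request).call(process, request)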
If you are initially adding circuits to your system then I recommend starting with only one circuit per action and only adding more once you’ve proven doing so would help. Every new circuit will have its own operational costs such as the need to fine tune and monitor.
Conclusion
Circuit breakers are, to me, a surprisingly complex tool despite having an otherwise simple design. The basic mechanics of a circuit breaker can result in complex and emergent behaviors, not all of which are good. Your choice of window size, bucket size, error rate calculation, minimum data requirements, and which subsets of traffic flow through which circuits can all have a large impact on whether the circuit provides the desired value or malfunctions.
Circuit breakers can play a critical role in systems that have been designed to tolerate missing dependencies or systems where quickly going to 0% success is better than sustaining 50% success. Use of a circuit breaker in any system should be well thought out and should likely come after other, more foundational resiliency tools have been integrated such as timeouts, retries with backoff, concurrency controls, hedging, rate limiting, and backpressure.
A Personal Opinion
An open circuit can be a valuable tool in some situations. However, circuit breakers achieve their value by driving systems to early failure. For example, a circuit configured to open at a 60% error rate will, the moment it measures exactly 60%, convert that into a 100% error rate by opening, even though 40% of executions were still succeeding. This all or nothing approach has a place but is not the best fit as a general purpose tool that applies to all systems.
My experience has been that most error conditions in a complex or distributed system have nuance and variations. A circuit breaker applies binary thinking to what are often non-binary issues. My current practice is to use probabilistic models like load shedding[2] in place of circuit breakers for nearly every system until it becomes clear that a circuit would provide a better behavior.