This article is an in-depth, technical description of software load shedding as it would be used in distributed or network-accessible systems. I’ll present a useful but imperfect mental model, or conceptual framework, for understanding and implementing load shedding. The primary focus of this article is on the probabilistic, or predictive, model of load shedding, but it does briefly cover other forms. The content is oriented around the principles and design of load shedding and does not focus on any particular implementation or programming language.
Load Shedding Basics
Load shedding is the intentional reduction of work in a system in order to avoid overload, prevent catastrophic failure, and ensure that critical operations are successful even when the system is in a degraded state. Reducing work in this context usually means preventing work from starting but can also mean canceling work in progress. The choice of which work to reject may be made probabilistically or deterministically. Probabilistic load shedding involves the use of estimates, predictions, or even random chance to determine which work to reject. Deterministic load shedding uses a strict set of procedural rules where a given set of inputs always results in the same decision. Most real-world systems that use load shedding use a mix of both models.
In software, load shedding tools intercept attempts to execute some action, optionally monitor or record aspects of that execution, and decide whether to prevent the action from executing. The internals of each implementation can vary greatly and are often bespoke for a particular system or integration. Conceptually, though, a load shedding system that uses both probabilistic and deterministic models consists of three primary components:
Capacity utilization metrics
Failure probability calculation functions
Decision making policies
Capacity Utilization
Load shedding is primarily a tool used to prevent system overload. The fundamental concept of "overload" is that a system is operating beyond its maximum capacity in some way. Inherent in load shedding, then, is the concept of capacity, or a resource of finite limit that is consumed by doing work.
A single system may have any number of individual capacities that can be overloaded. The capacities most commonly cited in descriptions of load shedding are CPU usage, memory usage, and queue depth, but virtually any quantifiable metric may be used. The only requirement for a metric to represent a capacity is that it can be compared against a maximum allowed value. Percentage-based metrics such as CPU usage or error rates usually have a natural limit of 100% that makes them an ideal fit for determining capacity. Metrics without a natural limit, such as queue depth or latency, can still be used by establishing an artificial ceiling based on the designed or discovered limitations of a system.
Most capacity usage metrics are generated by collecting real-time or sampled data points and aggregating those values over a period of time. A common approach is to accumulate data in a rolling window and evaluate it with a weighted average. For example, a system might sample CPU usage every second and calculate the current usage as an average of the past ten seconds but with a bias for more recent values in the window. See Opening A Circuit[1] from my article on circuit breaking for more details on rolling window policies and how they can influence metric aggregation.
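As a rough sketch of that approach in Go (the type and weighting scheme here are illustrative assumptions, not a prescription), a recency-weighted rolling window might look like this:

```go
// rollingWindow keeps the most recent samples of a metric and aggregates them
// with a bias toward newer values. This sketch is not safe for concurrent use.
type rollingWindow struct {
	size    int
	samples []float64 // oldest first, newest last
}

// add appends a sample and discards the oldest once the window is full.
func (w *rollingWindow) add(sample float64) {
	w.samples = append(w.samples, sample)
	if len(w.samples) > w.size {
		w.samples = w.samples[1:]
	}
}

// weightedAverage weights each sample by its position in the window so that
// the newest sample counts the most and the oldest counts the least.
func (w *rollingWindow) weightedAverage() float64 {
	var sum, weights float64
	for i, sample := range w.samples {
		weight := float64(i + 1)
		sum += sample * weight
		weights += weight
	}
	if weights == 0 {
		return 0
	}
	return sum / weights
}
```

Feeding such a window with one CPU sample per second and a size of ten produces the ten second, recency-biased average described above.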
Each metric being collected and aggregated is then compared to its capacity to determine the percent utilization. For example, a set of metrics might look like this:
| Metric | Current | Max | Utilization |
|---|---|---|---|
| CPU | .25 | 1.0 | .25 |
| Memory (MiB) | 450 | 512 | .88 |
| Error Rate | .15 | 1.0 | .15 |
| Queue Depth | 50 | 1024 | .05 |
| Avg Latency (ms) | 145 | 250 | .58 |
The goal is to produce a set of values that all represent the percentage of maximum capacity used. The maximum values are defined by you rather than discovered automatically, and in most cases you will need to research or experiment to determine appropriate values. For example, CPU usage can be calculated in a way that has a natural upper limit of 100%, but there is no natural limit for latency. Setting maximum capacity limits requires making informed choices about the known, expected, or designed limits of your system. Each value must be fine-tuned to your specific use case and the specific action being guarded by load shedding.
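As a small sketch of that comparison (the function name and clamping behavior are assumptions for illustration), utilization can be derived by dividing the current aggregate by its configured maximum and clamping the result to the 0.0 - 1.0 range:

```go
// utilization reports how much of a configured maximum has been consumed as a
// value between 0.0 and 1.0. The maximum comes from the designed or discovered
// limits of the system rather than from the metric itself.
func utilization(current, maximum float64) float64 {
	if maximum <= 0 {
		return 1.0 // treat a missing or invalid limit as fully consumed
	}
	result := current / maximum
	if result < 0 {
		return 0.0
	}
	if result > 1.0 {
		return 1.0
	}
	return result
}
```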
Failure Probability Calculation
Capacity usage metrics, like those presented in the previous section, indicate proximity to capacity. However, that does not necessarily indicate how likely a system or action is to fail. This is where the predictive, or probabilistic, aspect of load shedding comes in. For each potential execution of an action that is guarded by load shedding, the system must calculate the probability that allowing the action to execute will result in either failure of the system due to overload or failure of the individual action. This probability is then used to determine whether the action should be rejected.
To illustrate, let’s narrow the focus to the CPU usage metric and a server with a single endpoint. Let’s assume the endpoint can handle concurrent requests, that it has a reliable execution duration, and that each execution consumes 1% of the CPU while it is in flight. When the system receives requests more slowly than it can process them, the result is a healthy system.
The response time is flat because each request takes the same amount of time, and the CPU usage is flat because requests complete before new requests arrive, so load doesn’t build within the system. However, if the rate of requests increases such that load does begin to build, as you might generate with a load test, then the system will eventually become unstable:
This chart simulates the same system under continually increasing load. Each request being handled concurrently consumes one point of the total CPU capacity. Because requests are being made faster than they complete, the system begins to accumulate in-flight work, which drives up the total CPU usage. At this rate of requests the system will eventually fail by consuming all of the CPU capacity. However, the system becomes degraded before CPU exhaustion because the response time of the service suddenly increases at an inflection point. As the CPU capacity consumption approaches 100%, the response time grows non-linearly and eventually becomes infinite at 100% usage. This latency growth loosely approximates the behavior of most real-world systems, where the overhead costs of concurrency rapidly increase when a system is near capacity. It is intended to reflect, in simplified form, the Universal Scalability Law[2], which establishes a framework for understanding how performance changes as systems scale, with a focus on the non-linear effects of queuing delays and coordination overhead.
The goal, from here, is to find a way to convert CPU capacity usage into a failure probability. From the previous chart we can determine that the system is capable of operating optimally up to 80% CPU usage, which is where the non-linear growth in response times begins. In effect, any request accepted while the CPU is below 80% has a 0% chance of overloading the system. Once the CPU capacity usage is 80% or above, however, each accepted request increases the probability of failure because of the compounding effects of CPU usage and latency growth from each added request. We might approximate the risk of failure, then, as a function of CPU capacity usage using a piecewise function like the one below.
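One way to write that function, with x representing CPU usage as a fraction of total capacity:

$$
P(x) =
\begin{cases}
0 & \text{if } x < 0.8 \\
\dfrac{x - 0.8}{1.0 - 0.8} & \text{if } x \geq 0.8
\end{cases}
$$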
This function produces a 0% probability value for any CPU usage under 80%. When over 80%, the function produces a linear interpolation of the relative distance to 100%. For example, 81% usage would result in 5% probability and 90% usage would result in a 50% probability. If plotted over the previous chart we can see the probability growth over time:
This particular method of calculating failure probability is simplistic but we’re considering a simplistic system for this scenario. This system has only one metric: CPU usage. It also has only one source of requests and one behavior that it implements. Let’s look at what happens if we shed load proportionately to the failure probability we’re calculating:
In this scenario, requests are still arriving at the same rate as before. The system acts on 100% of requests until the CPU usage hits 80%. However, once CPU is over 80% then the load shedding tool begins bypassing the actual work by immediately rejecting a proportion of requests equal to the risk of failure. When the failure probability formula outputs 50% then any incoming request has a 50% chance of being rejected. When the probability is 95% then a request has a 95% chance of being rejected.
Load shedding does not make this example system perform any better than it did before. Every action rejected by the load shedding tool in order to prevent overload represents a failure of the system to perform its primary function. However, artificially lowering the acceptance rate of requests based on the probability of failure allows the system to keep functioning, even if in a degraded state. It is able to complete any request it accepts, and it adapts to the constantly changing CPU usage such that it is always handling the largest number of requests possible without going into overload. This is the core principle of probabilistic load shedding.
Example Probability Formula
The formula you use to calculate a failure probability from a capacity usage metric must be fine-tuned to the behaviors of your system. There is no one-size-fits-all formula. For systems with a single responsibility, or systems where all requests have the same average cost, a potential starting place is the formula I presented in the earlier illustrations, modified to use results from your own load testing to determine the boundaries.
The formula I presented was the piecewise function shown above, but it is too specific to the illustration examples to be meaningful elsewhere. The formula generalizes as shown below.
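Assuming that any value at or above the upper limit maps to a 100% probability, the generalization can be written as:

$$
P(x) =
\begin{cases}
0 & \text{if } x < \text{LOWER} \\
\left(\dfrac{x - \text{LOWER}}{\text{UPPER} - \text{LOWER}}\right)^{\text{CURVE}} & \text{if } \text{LOWER} \le x < \text{UPPER} \\
1 & \text{if } x \ge \text{UPPER}
\end{cases}
$$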
In this formula UPPER indicates the point at which failure is 100% probable,
LOWER indicates the point at which failure is first greater than 0% probable,
and CURVE applies an exponent that modifies the output. When any given metric
value is below the lower limit then the formula outputs 0%. When it is between
the lower and upper limits then it outputs a linear interpolation of the value
between the lower and upper. This can provide a convenient way to progressively
scale failure probability based on proximity to maximum capacity. Some examples:
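Here is a minimal Go sketch of the generalized formula; the function and parameter names are illustrative rather than taken from any particular library:

```go
import "math"

// failureProbability converts a capacity usage value into a failure
// probability. Below lower the probability is 0%, at or above upper it is
// 100%, and in between it is a linear interpolation raised to the curve
// exponent.
func failureProbability(value, lower, upper, curve float64) float64 {
	if value < lower {
		return 0.0
	}
	if value >= upper {
		return 1.0
	}
	return math.Pow((value-lower)/(upper-lower), curve)
}
```

With lower set to 0.8, upper set to 1.0, and curve set to 1, this reproduces the earlier CPU example: a usage of 0.9 yields a probability of 0.5.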
Using an exponent other than the value one allows for bending the line to fit different growth profiles:
To illustrate the effects of adding a curve, here is the chart from the previous section showing the effect of load shedding based directly on failure probability and using an exponent value of one:
Next is the exact same system and traffic pattern but the failure probability is curved using an exponent of 0.25 or 1/4:
The new exponent causes the probability to grow much faster for smaller values and then slow down as it approaches 100%. The result is that the second system rejects requests much more aggressively at lower capacity usage values. By accepting fewer requests, the second system presents a higher error rate to clients, but it more effectively keeps capacity usage below the threshold that causes increased latency, so all accepted requests complete faster than in the first system.
This formula is a practical example that I’ve implemented in real-world systems. However, those systems were aggressively single purpose, all requests had the same cost on average, and all requests had the same priority or criticality. That’s a specific profile of a system that cannot be assumed for most other real-world systems.
To emphasize again, there is no universal formula that you can safely use in all load shedding scenarios. Any formula used must be derived from system data and thoroughly tested.
Load Shed Decision Making
In the first section we detailed the requirements for defining capacity usage metrics. In the previous section we detailed the process of converting those capacity usage metrics into failure probabilities. The final piece of the design is a decision making policy that is capable of effectively using those failure probabilities to determine whether to shed load. Decision making has three primary stages:
Application of deterministic rules
Selection or calculation of a final failure probability
Converting failure probability to a rejection rate
Deterministic Load Shedding
So far, I’ve written mostly about the probabilistic model because it provides unique and effective qualities when applied to dynamic and distributed systems. However, not all forms of load shedding are best modeled as probabilities. Deterministic load shedding provides the same underlying value, the reduction of work to prevent an overload, but through the application of procedural rules. Deterministic decision making does not operate on percent chance. It is important to note, though, that deterministic policies are not necessarily static. They may be based on dynamically changing conditions in the system, including capacity usage metrics. They may also be based on information about a single request rather than the overall state of a system. The main difference is that deterministic policies are binary, on or off, and the same conditions always result in the same decision.
Deterministic policies that may be familiar include rate limiting, concurrency controls, and quotas. Rate limiting is a good example of deterministic load shedding that many systems already implement. Each incoming request is attributed to a source by IP, authentication data, or some other value and each source is assigned a maximum usage rate within a window of time. When the maximum usage is exceeded then all requests from that source are rejected until the time window progresses. Quotas, in this context, are a superset of rate limiting in that they represent a limitation on the use of finite resources by a particular, identifiable source within a window of time. For example, disk storage is a finite resource and a given user or group of users may share a 512MiB allocation. Any request that would cause the storage usage to exceed that amount is denied until the quota is expanded or the owner deletes something.
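To make the rate limiting example concrete, here is a minimal fixed-window sketch in Go; the names are illustrative and it ignores concerns like concurrency and distributed state:

```go
import "time"

// rateLimiter is a minimal fixed-window rate limiter. Each source is allowed
// up to limit requests per window; anything beyond that is rejected until the
// window rolls over. This sketch is not safe for concurrent use.
type rateLimiter struct {
	limit       int
	window      time.Duration
	windowStart time.Time
	counts      map[string]int
}

// allow records a request from the given source and reports whether it fits
// within the current window's budget.
func (r *rateLimiter) allow(source string, now time.Time) bool {
	if now.Sub(r.windowStart) >= r.window {
		r.windowStart = now
		r.counts = map[string]int{}
	}
	r.counts[source]++
	return r.counts[source] <= r.limit
}
```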
Deterministic policies may also make use of capacity usage information. For example, a system that sheds 100% of low priority requests when the CPU usage is above 50% could be considered a deterministic policy. Some concurrency controls, such as shedding 100% of load when the number of concurrent actions is over some set limit, also fit this model. For some use cases, it is entirely possible to shed load based only on deterministic policies rather than probabilistic ones.
Adaptive Queue Management
The most advanced and complex deterministic load shedding rules usually center around a queue. Queue, here, does not refer to a specific implementation, such as AMQP[3], but generalizes to all conditions in which work must wait to be completed, either intentionally or as a side effect. For example, if a system is receiving requests to perform some action at a rate higher than it can complete them, then it must either reject or enqueue those requests. Queues exist at every level of common software stacks including the operating system, the specific programming language runtime in use, and even the specific libraries or frameworks used in a project. Queue management is a deep specialty that I can’t fully cover in this article, so I’ll limit the example to one relatively simple, deterministic policy. For an introduction to more advanced and adaptive queue management I recommend Performance Under Load[4] from the Netflix blog.
To give an example of deterministic load shedding based on a queue we’ll focus on the measure of "time in queue", or the amount of time a request for work spends waiting before it can begin. Consider, for example, an HTTP/1.X server that is handling incoming requests. It may use threads, asynchronous networking, or a combination of the two to process requests but it always has a finite capacity for concurrent handling of requests because there are a finite number of processing cores available on which actual work may be done. Periodically, or with a dedicated thread, the server must accept the backlog of connections and schedule the execution of code that will read, parse, and handle the request data. If all the threads, or executing capacity units, of the server are currently in use then the handling of each new request must be enqueued, waiting for the next available slot for execution.
Sudden and sustained increases in traffic beyond the server’s ability to process items in the queue in a timely manner lead to a backlog. The majority of systems that accept and enqueue requests, like the hypothetical HTTP/1.X server, arrange their queue in FIFO order, which ensures that the oldest requests are processed before newer requests. A problem with this model is that it may inadvertently prioritize work that is no longer relevant. For example, if the HTTP/1.X server becomes backlogged and some requests sit in the queue for multiple seconds, then it is unlikely that the client that made the oldest request is still waiting for a response, while the most recently enqueued request very likely still has an active listener waiting for one.
One way to handle this is to track the difference between the time a request was enqueued and the time that handling begins. Establish a maximum time allowed in the queue and reject any request that has waited longer than that maximum, as shown in the sketch below. This emulates properties of a priority queue and sheds any load whose priority has become too low.
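Here is a minimal sketch of that check, assuming each queue entry records when it was enqueued; the names and the 500ms limit are illustrative:

```go
import "time"

// maxTimeInQueue is the longest a request may wait before it is shed. The
// value is illustrative and should be derived from your own latency targets.
const maxTimeInQueue = 500 * time.Millisecond

// queuedRequest pairs the pending work with the time it entered the queue.
type queuedRequest struct {
	enqueuedAt time.Time
	handle     func()
}

// dequeue sheds any request that has waited longer than the maximum and
// returns the first request that is still worth handling.
func dequeue(queue []queuedRequest, now time.Time) (queuedRequest, []queuedRequest, bool) {
	for len(queue) > 0 {
		head := queue[0]
		queue = queue[1:]
		if now.Sub(head.enqueuedAt) <= maxTimeInQueue {
			return head, queue, true
		}
		// Requests older than the limit are dropped because their callers
		// have likely stopped waiting for a response.
	}
	return queuedRequest{}, queue, false
}
```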
Note that this is not a suggestion to add this behavior to any system. This is just one example of how queue management can fit into a deterministic load shedding system. Adaptive queue management techniques like this represent a substantial branch of load shedding and an area of active research and development. I highly recommend reading more about the topic from other sources.
Final Probability Calculation
If no deterministic rule results in the rejection of a request then the probabilistic behavior engages. However, a system may have multiple capacity usage metrics defined which each result in a different failure probability. The system must merge those values into a single failure probability in order to make a decision.
Practically, the two best options for this process are either selecting the maximum value or computing an aggregate of the values, such as a weighted average. This stage may be as complex as you need, though. For example, you could also establish a deterministic set of rules that select a probability based on attributes of the request. In my experience, selecting the maximum value is the most effective option because it represents the most proximal point of failure in the system, which demands the most aggressive action to prevent.
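A sketch of the maximum-value approach might look like this (the function name is illustrative):

```go
// finalFailureProbability reduces the per-metric failure probabilities to a
// single value by selecting the largest one, which represents the capacity
// closest to failure.
func finalFailureProbability(probabilities ...float64) float64 {
	final := 0.0
	for _, p := range probabilities {
		if p > final {
			final = p
		}
	}
	return final
}
```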
Convert Probability To Rejection
The final stage of load shedding is deciding whether a specific request should be shed. If a deterministic policy is active then the request is shed. If not then the system must make a choice based on probability. The logic that most systems implement looks like this:
```go
import "math/rand"

// shouldReject makes a single accept or reject decision. Over many requests
// the fraction rejected approaches probabilityReject.
func shouldReject(probabilityReject float64) bool {
	return rand.Float64() < probabilityReject
}
```

The system generates a random number and compares it with the current rejection rate. As the rejection rate grows, so does the proportion of requests that are rejected. Making the decision is straightforward. The challenge is determining a correct rejection rate.
The most straightforward approach to generating a rejection rate is to use the failure probability, directly, as the rejection rate. For example, if the probability of failure due to CPU is 90% then the probability of rejecting a request is 90%. This model can be effective for simpler systems that have a consistent priority of work. However, some systems will need to scale the rejection rate based on priority.
Priority Weighted Decisions
To illustrate priority decision making let’s start with treating all requests the same and applying the failure probability, directly, as the rejection rate. For this example we’ll look at CPU usage as the metric, use the example probability formula I presented earlier, and set the failure probability to begin scaling at 80% CPU usage. Our formula will be the generalized formula with a lower bound of 80%, an upper bound of 100%, and an exponent of one,
which evaluates to this chart:
When the probability is used directly as the rejection rate then we get this perfectly matching and linear relationship between the two:
This simple model of using failure probability as the rejection rate assumes that all requests are of equal importance. I have worked in real world systems where this assumption held true. However, it’s more common that there exists a hierarchy with some requests, or sources of requests, being considered more important than others.
A common technique is to tag, or label, requests in some way with a priority
value. An action passing through a load shedding system might be labeled as one
of CRITICAL, HIGH, NORMAL, or LOW. The actual labels used in a
system are usually tailored for that system or for the organization running the
system. For example, a SaaS company might instead use labels such as
ENTERPRISE_CUSTOMER, PAYING_CUSTOMER, FREE_TIER, or INTEGRATION_BOTS,
etc.
The specific names and meanings of the priority labels are not significant. What
is important is that there are a finite number of priorities and that each
request can be deterministically assigned to a priority. From there, priorities
can be mapped to scaling functions that modify the rejection rate based on their
relative importance. To illustrate, let’s use the priority labels of CRITICAL,
HIGH, NORMAL, and LOW. For this example we want each priority to shed
load only after the next lower priority is shedding at 100%. In other words,
CRITICAL traffic should not be shed unless all HIGH traffic is already
being shed and HIGH traffic should not be shed unless all NORMAL
traffic is being shed, and so on. We can do this using the same example
formula I’ve previously presented for calculating failure probability and
assigning each priority a different upper and lower threshold with the value of
x changing from CPU usage to the current failure probability:
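As one illustrative assignment (the exact thresholds must be tuned for a real system), the failure probability range could be divided evenly among the four priorities:

| Priority | LOWER | UPPER |
|---|---|---|
| LOW | 0.00 | 0.25 |
| NORMAL | 0.25 | 0.50 |
| HIGH | 0.50 | 0.75 |
| CRITICAL | 0.75 | 1.00 |

With this assignment, LOW traffic is fully shed once the overall failure probability reaches 25%, NORMAL by 50%, HIGH by 75%, and CRITICAL traffic is only shed once the failure probability exceeds 75%.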
This results in the following chart:
For another perspective we can chart the rejection rates as a function of CPU usage:
Because we scale priorities based on the failure probability rather than the CPU usage, directly, we inherit the established safe range of 0% - 80% usage and only begin shedding load when the probability of failure is positive.
This specific configuration is likely not ideal for any real system but it’s intended to illustrate the concept of using different scaling methods for different priorities of traffic. In this case, each priority is subject to 100% load shedding before a higher priority is affected. In the worst case, all non-critical traffic is shed at a 100% rate and critical actions begin shedding proportionate to the system’s proximity to catastrophic failure conditions until the system enters an actual catastrophic state.
Load Shedding Malfunctions
The variety of metrics, formulas, and rules that can be used in a load shedding system creates a nearly unbounded number of ways in which load shedding may malfunction. This is why it’s important to establish load shedding configurations based on a detailed understanding of a system, real-world data, and thorough testing. There are, though, some common malfunctions worth mentioning.
High Rejection Costs
Load shedding operates on the principle that rejecting work, rather than doing it, results in less capacity usage of system resources. However, rejecting work has its own costs that need to be considered.
The first thing to consider is that rejection still incurs capacity usage. The earlier a unit of work can be rejected, the less it typically costs, but the trade-off is that the system has less information on which to base a decision. For example, an HTTP server with a load shedding behavior likely must still accept the connection for an incoming request and parse the request headers in order to provide the load shedding system with enough information to determine its priority. If the system decides to reject the request then there may be an additional cost to signal that rejection so that resources can be freed, such as writing an error response before closing the connection. Even a system that sheds 100% of traffic can eventually enter a catastrophic state if the cost of rejecting requests requires more capacity than the system has.
It’s important to measure the cost of rejection at scale during some type of load testing. This may identify behaviors of the system that make rejection more expensive than handling requests when under load.
Assumption Of Equal Costs
In a previous section I demonstrated how prioritization of requests can be used to target load shedding at less valuable requests. However, all of the examples I’ve shown so far make the assumption that all requests, regardless of priority, have roughly the same cost and, therefore, equally contribute to the probability of failure in a system near its capacity. I have worked on real world systems where this assumption held true but it’s more common for different functions or behaviors within the same system to have different costs. For example, a web server might have one endpoint that returns data from a database query and a second endpoint that generates thumbnails of images. The capacity usage of those two things is likely significantly different and applying the same load shedding rate to both is not going to result in an optimal outcome.
There are a few ways to manage this kind of issue. One is to install a unique load shedding instance around each different kind of work that a system does. Each instance may re-use the same underlying metrics but present different failure probability calculations and even use a different set of deterministic rules. The result is that each instance can be fine tuned for the action it guards rather than needing to apply to an entire system.
An alternative approach is to use a shared load shedding instance but incorporate some concept of request cost or request weight into the failure probability calculation. In this model, each type of request or behavior is deterministically assigned units of cost. For CPU this might mean assigning the anticipated percent usage or the amount of CPU time typically required to complete an action. This information is then provided to the appropriate failure probability function in order to get a value for each specific request.
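A sketch of that idea, reusing the failureProbability sketch from earlier; the request types and cost values are illustrative assumptions:

```go
// estimatedCost maps a request type to its anticipated CPU usage as a
// fraction of total capacity. The types and values are illustrative and would
// come from profiling a real system.
var estimatedCost = map[string]float64{
	"db_query":  0.01,
	"thumbnail": 0.10,
}

// failureProbabilityFor projects the CPU usage that accepting the request
// would produce and converts that projection into a failure probability.
func failureProbabilityFor(requestType string, currentCPU float64) float64 {
	projected := currentCPU + estimatedCost[requestType]
	return failureProbability(projected, 0.8, 1.0, 1.0)
}
```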
Any system that has multiple behaviors with sufficiently different capacity usage needs to account for that difference in load shedding policies.
Accidental Cool Down
If failure probability is determined by metrics associated with executions of the underlying action, such as error rate or latency, and the range of resulting rejection probabilities includes 100% then the system may enter an unintended "cool down" state. Once the rejection probability reaches 100% then no more actions are performed until the probability is calculated as less than 100%. Metrics based on executions are not updated when the action is not being executed so there is a potential for the rejection rate to lock in at 100% until the underlying window resets, such as a rolling window policy refreshing all of its internal buckets.
This condition affects systems that use time windows to collect and aggregate data and the size of the window determines the duration of the effect. For example, if an error rate calculation is derived from a 30 second window then the cool down period is usually 30 seconds. I detail why this is likely in the Closing A Circuit[5] section of my other article about circuit breaking.
This condition especially affects systems that use streaming statistical methods because they usually do not incorporate a concept of time. If no new data are added to the calculation then the result never changes. This could result in an infinite "cool down".
If failure and rejection probabilities are derived from data collected when executing an action then the system should be configured to accept some number of requests even when the rejection probability is 100% so that it can continue recording new data which may result in new probabilities.
One way to resolve this issue is to use a rejection probability scaling function that guarantees the output cannot reach 100%. For example, the rejection rate when calculated from an execution metric might be limited to 99%. The main drawback of this practice is that there is no way to fully shed load regardless of the current capacity. For example, a limit of 99% means that 1% of requests are accepted when the system is in its most degraded state. If the system is also receiving one or two orders of magnitude greater traffic due to a spike then the load shedding may be insufficient to protect the service.
A better solution to this issue is to borrow aspects of the half-open state[6] that is used by circuit breakers. In this model, the load shedding system tracks how long it has stayed at a 100% rejection probability. At a fixed interval, such as every 100ms or 250ms, the system accepts one or more requests. These requests are allowed to complete and their results are added to the metrics windows. This type of solution allows the load shedding system to operate at a 100% rejection rate and ensure that execution-derived metrics continue to receive new data, while strictly controlling the exceptions to load shedding.
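Here is a minimal sketch of that behavior, building on the earlier shouldReject logic; the interval, structure, and lack of concurrency safety make it illustrative rather than production ready:

```go
import (
	"math/rand"
	"time"
)

// probeInterval is how often a single request is allowed through while the
// rejection rate is pinned at 100%. The value is illustrative.
const probeInterval = 250 * time.Millisecond

type shedder struct {
	lastProbe time.Time
}

// shouldReject makes the usual probabilistic decision but, when the rejection
// rate reaches 100%, periodically lets one request through so that execution
// derived metrics keep receiving fresh data.
func (s *shedder) shouldReject(rejectionRate float64, now time.Time) bool {
	if rejectionRate >= 1.0 {
		if now.Sub(s.lastProbe) >= probeInterval {
			s.lastProbe = now
			return false // allow a probe request through
		}
		return true
	}
	return rand.Float64() < rejectionRate
}
```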
Suppression Of Autoscaling
If your system is designed to automatically scale up or down the available resources based on usage then you must take care to align any load shedding configuration with your autoscale policies. Shedding too much load can result in the suppression of autoscale policies by preventing resource usage from consistently exceeding its scaling threshold.
Most autoscale policies trigger when the average utilization of a resource across all current instances of a system exceeds some configured threshold for a set amount of time. For example, an autoscaling web service may have a policy to add a new instance when the average CPU usage across all existing instances exceeds 60% for five minutes. The policy may also remove an instance when the average CPU is below 50% for ten minutes. If load shedding is too aggressive or begins near or below 60% CPU usage then it may result in this autoscale policy failing to scale up because load shedding resulted in low or inconsistently high CPU usage over any five minute window. An especially aggressive load shedding could even result in a scale down if it artificially keeps the CPU usage below 50%.
Load shedding policies need to be coordinated with autoscaling policies. Additionally, you should test your combination of policies with load testing that demonstrates the system’s ability to both shed load and appropriately scale.
Conclusion
Load shedding is a highly adaptable framework for augmenting system resiliency. Much of the complexity of load shedding is not in the overall design but in its open ended nature where you, the user or developer, must make informed decisions regarding nearly every aspect of the tool. The simplest use cases may directly adapt capacity usage to probability of rejection. More advanced cases may require request prioritization, complex scaling functions, and even request cost estimation. Effective load shedding requires a detailed understanding of both the designed and actual constraints and performance of your system.
Generally, load shedding should not be the first resiliency tool that you consider incorporating into a system. First look to more foundational tools such as timeouts, retries with backoff, concurrency controls, hedging, rate limiting, and backpressure. Load shedding can be an invaluable tool for improving systems at the later stages of maturity.
Supplemental Information
From here, I’d recommend reading through some examples and case studies of load shedding in real-world systems. I’ve collected some of my favorite load shedding related material in the table below. In addition, I have a reference implementation of a load shedding framework for Go/Golang at https://github.com/kevinconway/loadshed.
| Link | Description | Archive |
|---|---|---|
| | This list contains a good collection of articles on load shedding and other forms of resiliency tooling or techniques. | |
| | A 2019 article about a case where Amazon found that load shedding was a more reliable tool than connection limits for some cases. This article contains some valuable suggestions for different load shedding metrics and conditions. | |
| Using load shedding to survive a success disaster–CRE life lessons | A 2016 article from Google that discusses the importance and nuance of request priority and cost estimation. | |
| | A page from the 2017 Google SRE handbook that includes some valuable details related to load shedding. In particular, there is a section on client-side load shedding that presents a rejection probability formula as a function of requests being rejected by a server that is also shedding load. | |
| | A 2018 write-up of adaptive concurrency limits, which can be one of multiple deterministic load shedding rules in a system. | |
| | A story from 2020 about one of the first probabilistic load shedding tools introduced to the Netflix stack. | |
| [Video] Scalability is Quantifiable: The Universal Scalability Law | A recording of a talk given in 2018 that does a great job of explaining some of the fundamental concepts behind scalability and system load. While this talk doesn’t directly address load shedding it does detail some of the foundational principles on which all overload management is based. | N/A |