Cinnamon Auto-Tuner: Adaptive Concurrency in the Wild
Figure 1: The relationship between max concurrent requests and throughput. At some point the service can’t handle more, and throughput drops off fast.
Figure 2: Architecture diagram of Cinnamon, with the scheduler and auto-tuner components highlighted.
Figure 3: Prioritized request scheduling
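
Figure 3’s prioritized scheduling can be pictured as one queue per priority level, with the dispatcher always draining the most important non-empty queue first. The sketch below only illustrates that idea; the `Scheduler` and `Request` types and their methods are hypothetical, not Cinnamon’s actual API.

```go
package main

import "fmt"

// Request is a placeholder for a unit of work with a priority
// (lower number = more important). All names are illustrative.
type Request struct {
	ID       int
	Priority int
}

// Scheduler keeps one FIFO queue per priority level and always
// serves the highest-priority non-empty queue first.
type Scheduler struct {
	queues [][]*Request // index = priority level
}

func NewScheduler(levels int) *Scheduler {
	return &Scheduler{queues: make([][]*Request, levels)}
}

func (s *Scheduler) Enqueue(r *Request) {
	s.queues[r.Priority] = append(s.queues[r.Priority], r)
}

// Dequeue returns the next request to run, or nil if everything is empty.
func (s *Scheduler) Dequeue() *Request {
	for p := range s.queues {
		if len(s.queues[p]) > 0 {
			r := s.queues[p][0]
			s.queues[p] = s.queues[p][1:]
			return r
		}
	}
	return nil
}

func main() {
	s := NewScheduler(3)
	s.Enqueue(&Request{ID: 1, Priority: 2})
	s.Enqueue(&Request{ID: 2, Priority: 0})
	fmt.Println(s.Dequeue().ID) // 2: the higher-priority request is served first
}
```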
Figure 4: The lower the limit, the more tolerance for latency increases.
Figure 5: The aggregation process from individual request timings to a smoothed value.
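
Figure 5 describes reducing individual request timings to a single smoothed value, but the captions don’t specify how the smoothing is done. A minimal sketch, assuming a per-interval percentile followed by an exponentially weighted moving average (both assumptions, not the documented algorithm), could look like this:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the q-quantile (0..1) of the given request timings.
func percentile(timings []time.Duration, q float64) time.Duration {
	if len(timings) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), timings...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(q * float64(len(sorted)-1))
	return sorted[idx]
}

// ewma folds a new per-interval sample into the running smoothed value.
func ewma(prev, sample time.Duration, alpha float64) time.Duration {
	return time.Duration(alpha*float64(sample) + (1-alpha)*float64(prev))
}

func main() {
	interval := []time.Duration{8 * time.Millisecond, 12 * time.Millisecond, 40 * time.Millisecond}
	smoothed := 10 * time.Millisecond
	smoothed = ewma(smoothed, percentile(interval, 0.95), 0.2)
	fmt.Println(smoothed) // previous value nudged toward this interval's p95
}
```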
Figure 6: The ever-drifting targetLatency issue in effect. When resetting targetLatency, the new value is captured at a higher limit, which leads to both of them drifting up. Note that the overload stopped at ~14:10.
Figure 7A: Positive covariance between the number of inflight requests and throughput.
Figure 7B: Negative covariance between the number of inflight requests and throughput.
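
Figures 7A and 7B distinguish the two regimes by the sign of the covariance between inflight requests and throughput. As a generic illustration (not Cinnamon’s code), the sample covariance over a window of paired measurements can be computed like this:

```go
package main

import "fmt"

// covariance returns the sample covariance of two equally sized series,
// e.g. per-interval inflight-request counts and throughput measurements.
func covariance(x, y []float64) float64 {
	n := len(x)
	if n < 2 || n != len(y) {
		return 0
	}
	var meanX, meanY float64
	for i := 0; i < n; i++ {
		meanX += x[i]
		meanY += y[i]
	}
	meanX /= float64(n)
	meanY /= float64(n)
	var cov float64
	for i := 0; i < n; i++ {
		cov += (x[i] - meanX) * (y[i] - meanY)
	}
	return cov / float64(n-1)
}

func main() {
	inflight := []float64{10, 20, 30, 40}
	throughput := []float64{100, 180, 240, 280} // throughput still rising with concurrency
	fmt.Println(covariance(inflight, throughput) > 0)
}
```

A positive result suggests more concurrency still buys more throughput, while a negative one suggests the node is past the point where, as Figure 1 shows, throughput drops off.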
Figure 8: The effect of covariance when resetting the inflight limit. Both the inflight limit and the latency samples (i.e., request timings) are now stable.
Figure 9: Overloading one node in production using Ballast
Figure 10A: Throughput and latency during overload
Figure 10B: Inflight limit (top); the ratio between the number of inflight requests and the inflight limit (middle); CPU usage (bottom)