At Uber, all stateful workloads run on a common containerized platform across a large fleet of hosts. Stateful workloads include MySQL®, Apache Cassandra®, ElasticSearch®, Apache Kafka®, Apache HDFS™, Redis™, Docstore, Schemaless, etc., and in many cases these workloads are co-located on the same physical hosts.
With 65,000 physical hosts, 2.4 million cores, and 200,000 containers, increasing utilization to reduce cost is an important and continuous effort. Until recently, these efforts were blocked by CPU throttling, which indicates that not enough resources have been allocated.
It turned out that the issue was how the Linux kernel allocates time for processes to run. In this post we will describe how switching from CPU quotas to cpusets (also known as CPU pinning) allowed us to trade a slight increase in P50 latencies for a significant drop in P99 latencies. This in turn allowed us to reduce fleet-wide core allocation by up to 11% due to less variance in resource requirements.
Cgroups, Quotas, and Cpusets
CPU quotas and cpusets are Linux kernel scheduler features. The Linux kernel implements resource isolation through cgroups, and all container platforms are based on this. Typically a container maps to a cgroup, which controls the resources of any process running in the container.
There are two types of cgroups (controllers in Linux terms) for performing CPU isolation: CPU and cpuset. They both control how much CPU a set of processes is allowed to use, but in two different ways: through CPU time quotas and CPU pinning, respectively.
The CPU controller enables isolation using quotas. You specify the fraction of CPU time you want a set of processes to be allowed, expressed as a number of cores. This is translated into a quota for a given time period (typically 100 ms) using this formula:
quota = core_count * period
For example, a container that needs 2 cores gets a quota of 200 ms of CPU time per 100 ms period.
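The calculation can be sketched as follows. This is a minimal sketch assuming the cgroup v1 interface, where the resulting value is what gets written to `cpu.cfs_quota_us`, with the period in `cpu.cfs_period_us`:

```python
# Minimal sketch of the CFS quota calculation; under cgroup v1 the
# result is written to cpu.cfs_quota_us, with the period in
# cpu.cfs_period_us.

CFS_PERIOD_US = 100_000  # the typical 100 ms period, in microseconds

def cfs_quota_us(core_count, period_us=CFS_PERIOD_US):
    # quota = core_count * period
    return int(core_count * period_us)

print(cfs_quota_us(2))  # 200000 us = 200 ms of CPU time per period
```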
CPU Quotas and Throttling
Unfortunately, this approach turns out to be problematic when there is multi-processing/threading inside the container: the threads together use up the quota too quickly, causing the container to get throttled for the rest of the period. This is illustrated below:
This ends up being a real problem for containers that serve low-latency requests. Suddenly, requests that usually take a few ms to complete can take more than 100 ms due to throttling.
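A back-of-the-envelope calculation shows why multithreading makes this worse; the thread count and quota below are hypothetical:

```python
# Hypothetical numbers: a container with a 2-core quota runs 8 busy threads.
PERIOD_MS = 100
quota_ms = 2 * PERIOD_MS   # 200 ms of CPU time per 100 ms period
threads = 8                # CPU time is consumed 8x faster than wall time

wall_ms_until_throttled = quota_ms / threads        # 25.0 ms into the period
throttled_ms = PERIOD_MS - wall_ms_until_throttled  # throttled for 75.0 ms

# A request arriving 25 ms into the period can wait ~75 ms before any
# thread in the container is scheduled again.
print(wall_ms_until_throttled, throttled_ms)  # 25.0 75.0
```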
The simple fix is to just allocate more CPU time to the processes. While that works, it’s also expensive at scale. Another solution is to not use isolation at all. However, that’s a very bad idea for co-located workloads since one process can completely starve out other processes.
Avoiding Throttling Using Cpusets
The cpuset controller uses CPU pinning instead of quotas: it limits which cores a container is allowed to run on. This means that it is possible to distribute all containers across different cores so that each core serves only a single container. This achieves full isolation, and there is no longer any need for quotas or throttling. In other words, bursting and ease of configuration are traded for consistent latency and more cumbersome core management. The example above then looks like this:
Here, two containers run on two different sets of cores. They are allowed to use all the time they can on these cores, but they cannot burst out into unallocated cores.
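In practice, pinning is done through the cpuset controller's interface files. A minimal sketch, assuming cgroup v1 with the cpuset controller mounted at `/sys/fs/cgroup/cpuset` (the cgroup name and memory node below are illustrative, and writing the files requires root):

```python
from pathlib import Path

def cpuset_spec(cores):
    """Collapse a core list into the kernel's list format: [0, 1, 2, 5] -> "0-2,5"."""
    cores = sorted(cores)
    ranges = []
    start = prev = cores[0]
    for c in cores[1:]:
        if c != prev + 1:
            ranges.append((start, prev))
            start = c
        prev = c
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)

def pin_container(cgroup_name, cores, mem_node=0):
    """Pin all tasks in a cgroup to the given cores (requires root)."""
    base = Path("/sys/fs/cgroup/cpuset") / cgroup_name
    # Both files must be populated before tasks can run in the cgroup.
    (base / "cpuset.mems").write_text(str(mem_node))
    (base / "cpuset.cpus").write_text(cpuset_spec(cores))
```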
The result is that P99 latencies become much more stable. Here is an example of throttling on a production database cluster (each line is a container) as cpusets are enabled. As expected, all throttling goes away:
Throttling goes away because the containers are able to freely use all the allocated cores. More interestingly, the P99 latencies also improve since containers are able to process requests at a steady rate. In this case, the latencies drop by around 50% due to the removal of heavy throttling:
At this point it’s worth noting that there are also negative effects of using cpusets. In particular, P50 latencies usually increase a bit since it is no longer possible to burst into unallocated cores. The net result is that P50 and P99 latencies get much closer, which is typically preferable. This is discussed more towards the end of this article.
In order to use cpusets, containers must be bound to cores. Allocating cores correctly requires a bit of background on how modern CPU architectures work since wrong allocation can cause significant performance degradations.
Modern CPU hardware is typically structured as follows:
- A physical machine can have multiple CPU sockets
- Each socket has an independent L3 cache
- Each CPU (socket) has multiple cores
- Each core has independent L2/L1 caches
- Each core can have hyperthreads
- Hyperthreads are often regarded as cores, but allocating 2 hyperthreads instead of 1 might only increase performance by 1.3x
All of this means that picking the right cores actually matters. A final complication is that core numbering is not consecutive, or sometimes not even deterministic. For example, the topology might look like this:
In this case, a container is scheduled across both physical sockets and different cores, which leads to bad performance—we have seen P99 latencies degrade by as much as 500% due to wrong socket allocations. To handle this, the scheduler must collect the exact hardware topology from the kernel and use that to allocate cores. The raw information is available in /proc/cpuinfo:
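The relevant fields are `processor` (the logical CPU), `physical id` (the socket), and `core id`. A minimal sketch of extracting the topology, assuming only those three fields (real `/proc/cpuinfo` output contains many more, which this parser simply ignores):

```python
from collections import defaultdict

def parse_topology(cpuinfo_text):
    """Map (socket id, core id) -> logical CPUs (hyperthread siblings)."""
    topology = defaultdict(list)
    # /proc/cpuinfo separates logical CPUs with blank lines
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        socket_core = (int(fields["physical id"]), int(fields["core id"]))
        topology[socket_core].append(int(fields["processor"]))
    return dict(topology)

# Two sockets, one core each, two hyperthreads per core; note that
# sibling logical CPUs (0 and 2) are not numbered consecutively.
sample = """processor\t: 0
physical id\t: 0
core id\t: 0

processor\t: 1
physical id\t: 1
core id\t: 0

processor\t: 2
physical id\t: 0
core id\t: 0

processor\t: 3
physical id\t: 1
core id\t: 0
"""
print(parse_topology(sample))  # {(0, 0): [0, 2], (1, 0): [1, 3]}
```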
Using this information, we can allocate cores that are physically close to each other:
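A simplified sketch of such an allocator, which keeps a container's cores on a single socket and hands out hyperthread siblings together (it ignores cores that are already allocated, which a real scheduler must track):

```python
from collections import defaultdict

def allocate_cores(topology, physical_cores_needed):
    """Pick cores for a container from a single socket, handing out
    hyperthread siblings together so the container never straddles sockets.

    topology: {(socket_id, core_id): [logical CPU ids]}
    """
    by_socket = defaultdict(list)
    for (socket_id, _core_id), cpus in sorted(topology.items()):
        by_socket[socket_id].append(cpus)
    for _socket_id, core_groups in sorted(by_socket.items()):
        if len(core_groups) >= physical_cores_needed:
            chosen = core_groups[:physical_cores_needed]
            return [cpu for group in chosen for cpu in group]
    raise RuntimeError("no single socket has enough cores")

# Two sockets with two physical cores each; siblings are (0, 4), (1, 5), ...
topology = {(0, 0): [0, 4], (0, 1): [1, 5], (1, 0): [2, 6], (1, 1): [3, 7]}
print(allocate_cores(topology, 2))  # [0, 4, 1, 5]: both cores on socket 0
```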
Downsides and Limitations
While cpusets solve the issue of large tail latencies, there are some limitations and tradeoffs:
Fractional cores cannot be allocated. This is not an issue for database processes because they tend to be big, so rounding up or down is not a problem. It does mean, however, that the number of containers cannot be greater than the number of cores, which for some workloads is problematic.
System-wide processes can still steal time. For example, services running on the host through systemd, kernel workers, etc., still need to run somewhere. It is theoretically possible to also assign them to a limited set of cores, but that can be tricky since they require time proportional to the load on the system. One workaround is to use real-time process scheduling on a subset of containers—we will cover this in a later blog post.
Defragmentation is required. Over time, the available cores will get fragmented, and processes need to be moved around to create contiguous blocks of available cores. This can be done online, but moving from one physical socket to another will mean that memory access is suddenly remote. This can also be mitigated, but is also a topic for a later blog post about NUMA.
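To illustrate why fragmentation matters: a helper like the following can only find n adjacent free cores if the free set is not fragmented, and otherwise containers must be moved first. This is a simplified sketch where adjacency is purely numeric; a real allocator would also respect socket boundaries:

```python
def find_contiguous(free_cores, n):
    """Return n numerically adjacent free cores, or None if the free
    set is too fragmented to satisfy the request."""
    run = []
    for core in sorted(free_cores):
        if run and core == run[-1] + 1:
            run.append(core)
        else:
            run = [core]
        if len(run) == n:
            return run
    return None

print(find_contiguous({0, 2, 3, 4, 7}, 3))  # [2, 3, 4]
print(find_contiguous({0, 2, 4, 6}, 2))     # None: defragmentation needed
```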
No bursting. Sometimes you might actually want to use unallocated resources on the host to speed up the running containers. In this post, we have talked about exclusive cpusets, but it is possible to allocate the same core to many containers (i.e. cgroups), and it’s also possible to combine cpusets with quotas. This allows for bursting, but is yet another topic for another blog post.
Switching to cpusets for stateful workloads has been a major improvement for Uber. It allowed us to achieve much more stable database-level latencies, and we saved around 11% of our cores by reducing the overprovisioning needed to handle throttling spikes. Since there can be no bursting, containers of the same size now behave the same across hosts, again leading to more consistent performance.
Uber’s stateful deployment platform is developed internally, but Kubernetes® also supports cpusets via its static CPU Manager policy.
For details about how we tested quotas and cpusets, see the appendix.
If these efforts are interesting to you, then please consider applying to join our team!