
At Uber, all stateful workloads run on a common containerized platform across a large fleet of hosts. Stateful workloads include MySQL®, Apache Cassandra®, Elasticsearch®, Apache Kafka®, Apache HDFS™, Redis™, Docstore, Schemaless, etc., and in many cases these workloads are co-located on the same physical hosts.
With 65,000 physical hosts, 2.4 million cores, and 200,000 containers, increasing utilization to reduce cost is an important and continuous effort. Until recently, these efforts were blocked by CPU throttling, which indicates that not enough resources have been allocated.
It turned out that the issue was how the Linux kernel allocates time for processes to run. In this post we will describe how switching from CPU quotas to cpusets (also known as CPU pinning) allowed us to trade a slight increase in P50 latencies for a significant drop in P99 latencies. This in turn allowed us to reduce fleet-wide core allocation by up to 11%, thanks to reduced variance in resource requirements.
Cgroups, Quotas, and Cpusets
CPU quotas and cpusets are Linux kernel scheduler features. The Linux kernel implements resource isolation through cgroups, and all container platforms are based on this. Typically a container maps to a cgroup, which controls the resources of any process running in the container.
There are two types of cgroups (controllers in Linux terms) for performing CPU isolation: CPU and cpuset. They both control how much CPU a set of processes is allowed to use, but in two different ways: through CPU time quotas and CPU pinning, respectively.
CPU Quotas
The CPU controller enables isolation using quotas. For a set of processes you specify the fraction of CPU cores you want to allow. This is translated into a quota for a given time period (typically 100 ms) using this formula:
quota = core_count * period
In the above example there is a container that needs 2 cores, which translates to 200 ms of CPU time per period.
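As a rough sketch of how this maps onto the kernel interface (assuming the cgroup v1 layout; the cgroup path below is hypothetical, and container runtimes normally write these files for you), the period and quota could be set like this:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setCPUQuota applies quota = core_count * period to a cgroup (cgroup v1 layout).
func setCPUQuota(cgroupPath string, coreCount float64, periodUs int) error {
	quotaUs := int(coreCount * float64(periodUs)) // e.g., 2 cores * 100,000 µs = 200,000 µs

	if err := os.WriteFile(filepath.Join(cgroupPath, "cpu.cfs_period_us"),
		[]byte(fmt.Sprintf("%d", periodUs)), 0644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(cgroupPath, "cpu.cfs_quota_us"),
		[]byte(fmt.Sprintf("%d", quotaUs)), 0644)
}

func main() {
	// 2 cores with the typical 100 ms (100,000 µs) period -> 200 ms of CPU time per period.
	if err := setCPUQuota("/sys/fs/cgroup/cpu/mycontainer", 2, 100000); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```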
CPU Quotas and Throttling
Unfortunately, this approach turns out to be problematic due to multi-processing/threading inside the container: with several processes or threads running in parallel, the container burns through its quota early in the period and is then throttled for the rest of it. This is illustrated below:
This ends up being a real problem for containers that serve low-latency requests. Suddenly, requests that usually take a few ms to complete can take more than 100 ms due to throttling.
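Throttling like this shows up in the cgroup's cpu.stat counters (nr_periods, nr_throttled, and throttled_time in cgroup v1). A minimal sketch of a check, again assuming a hypothetical cgroup path, might look like:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readThrottleStats parses cgroup v1 cpu.stat, which reports how many scheduling
// periods elapsed, how many of them were throttled, and the total throttled time (ns).
func readThrottleStats(cgroupPath string) (map[string]int64, error) {
	data, err := os.ReadFile(cgroupPath + "/cpu.stat")
	if err != nil {
		return nil, err
	}
	stats := make(map[string]int64)
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		fields := strings.Fields(line) // e.g., "nr_throttled 42"
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseInt(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	return stats, nil
}

func main() {
	// Hypothetical cgroup path; adjust to your container's cgroup.
	stats, err := readThrottleStats("/sys/fs/cgroup/cpu/mycontainer")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("throttled %d of %d periods (%.1f ms total)\n",
		stats["nr_throttled"], stats["nr_periods"],
		float64(stats["throttled_time"])/1e6)
}
```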
The simple fix is to just allocate more CPU time to the processes. While that works, it’s also expensive at scale. Another solution is to not use isolation at all. However, that’s a very bad idea for co-located workloads since one process can completely starve out other processes.
Avoiding Throttling Using Cpusets
The cpuset controller uses CPU pinning instead of quotas—it limits which cores a container can run on. This means that it is possible to distribute all containers across different cores so that each core only serves a single container. This achieves full isolation, and there is no longer any need for quotas or throttling. In other words, we trade bursting and ease of configuration for consistent latency and more cumbersome core management. The example above then looks like this:
Here, two containers run on two different sets of cores. They are allowed to use all the time they can on these cores, but they cannot burst out into unallocated cores.
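As a sketch of what this looks like at the cgroup level (cgroup v1 layout; the paths and core ranges below are illustrative), each container's cpuset gets a disjoint list of cores:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pinToCores restricts a cpuset cgroup to the given cores and memory node
// (cgroup v1 layout). Each container gets a disjoint core list, e.g., "0-3" and "4-7".
func pinToCores(cgroupPath, cores, memNode string) error {
	// cpuset.mems must also be set before tasks can run in the cgroup.
	if err := os.WriteFile(filepath.Join(cgroupPath, "cpuset.mems"),
		[]byte(memNode), 0644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(cgroupPath, "cpuset.cpus"),
		[]byte(cores), 0644)
}

func main() {
	// Two containers on disjoint cores of the same socket/NUMA node (illustrative paths).
	if err := pinToCores("/sys/fs/cgroup/cpuset/container-a", "0-3", "0"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	if err := pinToCores("/sys/fs/cgroup/cpuset/container-b", "4-7", "0"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Container runtimes expose the same knob directly, for example Docker's --cpuset-cpus flag.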
The result of this is that P99 latencies become much more stable. Here is an example of throttling on a production database cluster (each line is a container) as cpusets are enabled. As expected, all throttling goes away:
Throttling goes away because the containers are able to freely use all the allocated cores. More interestingly, the P99 latencies also improve since containers are able to process requests at a steady rate. In this case, the latencies drop by around 50% due to the removal of heavy throttling:
At this point it’s worth noting that there are also negative effects of using cpusets. In particular, P50 latencies usually increase a bit since it is no longer possible to burst into unallocated cores. The net result is that P50 and P99 latencies get much closer, which is typically preferable. This is discussed more towards the end of this article.
Assigning CPUs
In order to use cpusets, containers must be bound to cores. Allocating cores correctly requires a bit of background on how modern CPU architectures work, since incorrect allocation can cause significant performance degradation.
Modern CPU hardware is typically structured as follows:
- A physical machine can have multiple CPU sockets
- Each socket has an independent L3 cache
- Each CPU has multiple cores
- Each core has independent L2/L1 caches
- Each core can have hyperthreads
- Hyperthreads are often regarded as cores, but allocating 2 hyperthreads instead of 1 might only increase performance by 1.3x
All of this means that picking the right cores actually matters. A final complication is that numbering is not consecutive, or sometimes not even deterministic—for example, the topology might look like this:
In this case, a container is scheduled across both physical sockets and different cores, which leads to bad performance—we have seen P99 latencies degrade by as much as 500% due to wrong socket allocations. To handle this, the scheduler must collect the exact hardware topology from the kernel and use that to allocate cores. The raw information is available in /proc/cpuinfo:
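For illustration (this is not Uber's scheduler code), the relevant fields are processor, physical id, and core id; a sketch that builds a socket → core → logical-CPU map from them could look like:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// cpuTopology maps socket ("physical id") -> core ("core id") -> logical CPUs
// ("processor"), so a scheduler can hand out whole cores on a single socket.
func cpuTopology() (map[string]map[string][]string, error) {
	f, err := os.Open("/proc/cpuinfo")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	topo := make(map[string]map[string][]string)
	var processor, socket, core string
	flush := func() {
		if processor == "" {
			return
		}
		if topo[socket] == nil {
			topo[socket] = make(map[string][]string)
		}
		topo[socket][core] = append(topo[socket][core], processor)
		processor, socket, core = "", "", ""
	}

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if line == "" { // a blank line separates logical CPUs
			flush()
			continue
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		key, val := strings.TrimSpace(parts[0]), strings.TrimSpace(parts[1])
		switch key {
		case "processor":
			processor = val
		case "physical id":
			socket = val
		case "core id":
			core = val
		}
	}
	flush()
	return topo, scanner.Err()
}

func main() {
	topo, err := cpuTopology()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for socket, cores := range topo {
		for core, cpus := range cores {
			fmt.Printf("socket %s core %s -> logical CPUs %v\n", socket, core, cpus)
		}
	}
}
```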