Unthrottled: Fixing CPU Limits in the Cloud

This post is the first in a two-part series.

This year, my teammates and I solved a CPU throttling issue that affects nearly every container orchestrator with hard limits, including Kubernetes, Docker, and Mesos. In doing so, we lowered worst-case response latency in one of Indeed’s applications from over two seconds to 30 milliseconds. In this two-part series, I’ll explain our journey to find the root cause and how we ultimately arrived at the solution.

10 MPH road sign

Photo by twinsfisch on Unsplash

The issue began last year, shortly after the v4.18 release of the Linux kernel. We saw an increase in tail response times for our web applications, but when we looked at CPU usage, everything seemed fine. Upon further investigation, it was clear that the incidence of high response times correlated directly with periods of high CPU throttling. Something was off. Normal CPU usage combined with high throttling shouldn’t have been possible. We eventually found the culprit, but first we had to understand the mechanisms at work.

Background: How container CPU constraints work

Almost all container orchestrators rely on the kernel control group (cgroup) mechanisms to manage resource constraints. When hard CPU limits are set in a container orchestrator, the kernel uses Completely Fair Scheduler (CFS) Cgroup bandwidth control to enforce those limits. The CFS-Cgroup bandwidth control mechanism manages CPU allocation using two settings: quota and period. When an application has used its allotted CPU quota for a given period, it gets throttled until the next period.

All CPU metrics for a cgroup are located in /sys/fs/cgroup/cpu,cpuacct/<container>. Quota and period settings are in cpu.cfs_quota_us and cpu.cfs_period_us.

CPU metrics for a cgroup
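To make the relationship between these two files and a CPU limit concrete, here is a minimal sketch that reads them and converts the pair into a CPU count. The function names and the directory argument are hypothetical, and the sketch assumes the cgroup v1 layout described above, where a quota of -1 means no hard limit is set.

```python
from pathlib import Path
from typing import Optional


def cpu_limit_from_values(quota_us: int, period_us: int) -> Optional[float]:
    """Return the hard limit in CPUs, or None when no quota is set (-1)."""
    if quota_us < 0:
        return None
    return quota_us / period_us


def read_cpu_limit(cgroup_dir: str) -> Optional[float]:
    """Read cpu.cfs_quota_us and cpu.cfs_period_us from a cgroup directory."""
    base = Path(cgroup_dir)
    quota_us = int((base / "cpu.cfs_quota_us").read_text().strip())
    period_us = int((base / "cpu.cfs_period_us").read_text().strip())
    return cpu_limit_from_values(quota_us, period_us)
```

For example, a quota of 40000 with a period of 100000 corresponds to a limit of 0.4 CPU.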

You can also view throttling metrics in cpu.stat. Inside cpu.stat you’ll find:

  • nr_periods – number of periods that any thread in the cgroup was runnable
  • nr_throttled – number of runnable periods in which the application used its entire quota and was throttled
  • throttled_time – sum total amount of time individual threads within the cgroup were throttled

During our investigation into the response time regression, one engineer noticed that applications with slow response times had an excessive number of throttled periods (nr_throttled). We divided nr_throttled by nr_periods to get a crucial metric for identifying excessively throttled applications. We call this metric “throttled percentage.” We didn’t like using throttled_time for this purpose because it can vary widely between applications depending on how many threads they use.
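The throttled percentage described above is simple to compute from the contents of cpu.stat. The following is a rough sketch rather than the tooling we actually used; the function names are hypothetical and the sample values are made up for illustration.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the 'key value' lines of a cgroup cpu.stat file into integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_percentage(stats: dict) -> float:
    """Percentage of runnable periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # never runnable, so never throttled
    return 100.0 * stats.get("nr_throttled", 0) / periods


# Made-up sample contents, not taken from a real system.
sample = "nr_periods 1000\nnr_throttled 250\nthrottled_time 1500000000\n"
sample_pct = throttled_percentage(parse_cpu_stat(sample))
```

Because the result is a ratio of periods, it stays comparable between applications regardless of how many threads each one runs.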

A conceptual model of CPU constraints

To see how CPU constraints work, consider an example. A single-threaded application is running on a CPU with cgroup constraints. This application needs 200 milliseconds of processing time to complete a request. Unconstrained, its response graph would look something like this.

A request comes in at time 0, the application is scheduled on the processor for 200 consecutive milliseconds, and responds at time 200ms

Now, say we assign a CPU limit of 0.4 CPU to the application. This means the application gets 40ms of run time for every 100ms period, even if the CPU has no other work to do. The 200ms request now takes 440ms to complete: four full 100ms periods, plus 40ms of run time in the fifth.

A request comes in at time 0. The application runs across five 100ms periods: in each of the first four it runs for 40ms and is then throttled for 60ms, and in the fifth it runs for 40ms and completes the response at 440ms

If we gather metrics at time 1000ms, statistics for our example are:

Metric | Value | Reasoning
--- | --- | ---
nr_periods | 5 | From 440ms to 1000ms the application had nothing to do and as such was not runnable.
nr_throttled | 4 | The application is not throttled in the fifth period because it is no longer runnable.
throttled_time | 240ms | For every 100ms period, the application can only run for 40ms and is throttled for 60ms. It has been throttled for 4 periods, so 4 multiplied by 60 equals 240ms.
throttled percentage | 80% | 4 nr_throttled divided by 5 nr_periods.
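The numbers in the table above can be reproduced with a small simulation of this conceptual model. This is only a sketch of the simplified model (one thread, one request arriving at time 0, periods aligned to time 0), not of how the kernel actually accounts for quota; the function name and its parameters are hypothetical.

```python
def simulate_request(work_ms: int, quota_ms: int, period_ms: int = 100) -> dict:
    """Simulate one single-threaded request arriving at time 0 under a CPU quota."""
    if quota_ms <= 0 or period_ms <= 0:
        raise ValueError("quota_ms and period_ms must be positive")
    # A single thread cannot use more than one full period of CPU per period.
    budget_ms = min(quota_ms, period_ms)
    remaining = work_ms
    nr_periods = 0
    nr_throttled = 0
    throttled_time_ms = 0
    finish_ms = 0
    while remaining > 0:
        period_start = nr_periods * period_ms
        nr_periods += 1
        run = min(remaining, budget_ms)
        remaining -= run
        finish_ms = period_start + run
        if remaining > 0 and budget_ms < period_ms:
            # Quota is used up with work still pending: throttled until the next period.
            nr_throttled += 1
            throttled_time_ms += period_ms - budget_ms
    pct = 100.0 * nr_throttled / nr_periods if nr_periods else 0.0
    return {
        "finish_ms": finish_ms,
        "nr_periods": nr_periods,
        "nr_throttled": nr_throttled,
        "throttled_time_ms": throttled_time_ms,
        "throttled_pct": pct,
    }


example = simulate_request(work_ms=200, quota_ms=40)
```

For the 200ms request with a 40ms quota, this yields a finish time of 440ms, 5 periods, 4 throttled periods, 240ms of throttled time, and a throttled percentage of 80%, matching the table.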

But that’s the high-level view, not real life. There are a couple of problems with this conceptual model. First, we live in a world of multi-core, multi-threaded applications. Second, if all this were completely true, our problematic application shouldn’t have been throttled before exhausting its CPU quota.

Reproducing the problem

We knew a succinct reproducing test case would help convince the kernel community that a problem actually existed and needed to be fixed. We tried a number of stress tests and Bash scripts, but struggled to reliably reproduce the behavior.

Our breakthrough came after we considered that many web applications use asynchronous worker threads. In that threading model, each worker is given a small task to accomplish. For example, these workers might handle IO or some other small amount of work. To reproduce this type of workload, we created a small reproducer in C called Fibtest. Instead of using unpredictable IO, we used a combination of the Fibonacci sequence and sleeps to mimic the behavior of these worker threads. We split the work between fast threads and slow worker threads. The fast threads run through as many iterations of the Fibonacci sequence as possible. The slow threads complete 100 iterations and then sleep for 10ms.

To the scheduler, these slow threads act much like asynchronous worker threads, in that they do a small amount of work and then block. Remember, our goal was not to produce the most Fibonacci iterations. Instead, we wanted a test case that could reliably reproduce a high amount of throttling with simultaneous low CPU usage. By pinning these fast and slow threads each to their own CPU, we finally had a test case that could reproduce the CPU throttling behavior.
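To illustrate the shape of that workload, here is a loose Python sketch of the fast and slow thread pattern. It is not the Fibtest source, which is written in C, and it should not be expected to reproduce the throttling behavior: it omits CPU pinning entirely, and Python's global interpreter lock keeps the threads from running in parallel. All names, the thread counts, and the default duration are hypothetical.

```python
import threading
import time


def fib_steps(steps: int) -> int:
    """Advance the Fibonacci sequence by `steps` iterations (64-bit wraparound)."""
    a, b = 0, 1
    for _ in range(steps):
        a, b = b, (a + b) & 0xFFFFFFFFFFFFFFFF
    return a


def worker(deadline: float, sleep_s: float, totals: dict, name: str) -> None:
    """Run batches of 100 iterations until the deadline, sleeping after each if asked."""
    done = 0
    while time.monotonic() < deadline:
        fib_steps(100)
        done += 100
        if sleep_s > 0:
            time.sleep(sleep_s)  # slow threads block, much like async workers
    totals[name] = done


def run_workload(duration_s: float = 1.0, fast_threads: int = 1, slow_threads: int = 4) -> dict:
    """Start fast (no sleep) and slow (10ms sleep) threads; return iterations per thread."""
    deadline = time.monotonic() + duration_s
    totals = {}
    threads = []
    for i in range(fast_threads):
        threads.append(threading.Thread(target=worker, args=(deadline, 0.0, totals, f"fast-{i}")))
    for i in range(slow_threads):
        threads.append(threading.Thread(target=worker, args=(deadline, 0.010, totals, f"slow-{i}")))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

Calling run_workload() returns a dictionary of iteration counts keyed by thread name, which shows how little work the slow threads do compared with the fast ones.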

The first throttling fix / regression

Our next step was to use Fibtest as the condition for running a git bisect on the kernel. Using this technique, we were able to quickly discover the commit that introduced the excessive throttling: 512ac999d275 “sched/fair: Fix bandwidth timer clock drift condition”. This change was introduced in the 4.18 kernel. Testing a kernel after removing this commit fixed our issue of low CPU usage with high throttling. However, as we analyzed the commit and the related sources, the fix looked perfectly valid. And more confusingly, this commit was also introduced to fix inadvertent throttling.
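If a bisect like this is scripted, the reproducer's result has to be turned into a good or bad verdict for each candidate kernel. The sketch below shows one way that could look, using the exit codes that `git bisect run` understands (0 for good, 125 for skip, and 1 here for bad). It is an assumption about how such a check might be written, not a description of our actual setup; the function name and the 10% threshold are hypothetical.

```python
GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def bisect_verdict(nr_periods: int, nr_throttled: int, max_throttled_pct: float = 10.0) -> int:
    """Map cpu.stat counters from a reproducer run to a bisect exit code."""
    if nr_periods == 0:
        return SKIP  # the reproducer never ran, so this kernel cannot be judged
    pct = 100.0 * nr_throttled / nr_periods
    return BAD if pct > max_throttled_pct else GOOD
```

A wrapper script would run the reproducer on the kernel under test, read the counters from cpu.stat, and exit with the value this function returns.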

The issue this commit fixed was exemplified by throttling that appeared to have no correlation with actual CPU usage. This was due to clock-skew between the cores that resulted in the kernel prematurely expiring the quota for a period.

Fortunately, this problem was much rarer, as most of our nodes were running kernels that already had the fix. One unlucky application ran into this problem, though. That application was mostly idle and allocated 4.1 CPUs. The resulting CPU usage and throttle percentage graphs looked like this.

CPU usage graph with 4 CPUs allocated and usage not exceeding .5 CPU

Graph of throttled percentage showing excessive throttling

Commit 512ac999d275 fixed the issue and was backported onto many of the Linux-stable trees. The commit was applied to most major distribution kernels, including RHEL, CentOS, and Ubuntu. As a result, some users have probably seen throttling improvements. However, many others are likely seeing the problem that initiated this investigation.

At this point in our journey, we had found a major issue, created a reproducer, and identified the causal commit. This commit appeared completely correct but had some negative side effects. In part two of this series, I’ll further explain the root cause, update the conceptual model to explain how CFS-Cgroup CPU constraints actually work, and describe the solution we eventually pushed into the kernel.


Cross-posted on Medium.
