Unthrottled: Fixing CPU Limits in the Cloud

This post is the first in a two-part series.

This year, my teammates and I solved a CPU throttling issue that affects nearly every container orchestrator with hard limits, including Kubernetes, Docker, and Mesos. In doing so, we lowered worst-case response latency in one of Indeed’s applications from over two seconds to 30 milliseconds. In this two-part series, I’ll explain our journey to find the root cause and how we ultimately arrived at the solution.

10 MPH speed limit road sign

Photo by twinsfisch on Unsplash

The issue began last year, shortly after the v4.18 release of the Linux kernel. We saw an increase in tail response times for our web applications, but when we looked at CPU usage, everything seemed fine. Upon further investigation, it was clear that high response times correlated directly with periods of high CPU throttling. Something was off: normal CPU usage combined with heavy throttling shouldn’t have been possible. We eventually found the culprit, but first we had to understand the mechanisms at work.

Background: How container CPU constraints work

Almost all container orchestrators rely on the kernel control group (cgroup) mechanisms to manage resource constraints. When hard CPU limits are set in a container orchestrator, the kernel uses Completely Fair Scheduler (CFS) cgroup bandwidth control to enforce those limits. This mechanism manages CPU allocation using two settings: quota and period. When an application has used its allotted CPU quota for a given period, it gets throttled until the next period.

All CPU metrics for a cgroup are located in /sys/fs/cgroup/cpu,cpuacct/<container>. Quota and period settings are in cpu.cfs_quota_us and cpu.cfs_period_us.
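For illustration, here is how you might inspect those settings on a host (cgroup v1 layout shown; the container directory name is a placeholder and varies by orchestrator):

```shell
# Inspect a container's CFS bandwidth settings (cgroup v1 paths)
cd /sys/fs/cgroup/cpu,cpuacct/<container>
cat cpu.cfs_period_us   # enforcement period in microseconds, typically 100000 (100ms)
cat cpu.cfs_quota_us    # CPU time allowed per period; -1 means no limit
```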

CPU metrics for a cgroup

You can also view throttling metrics in cpu.stat. Inside cpu.stat you’ll find:

  • nr_periods – number of periods that any thread in the cgroup was runnable
  • nr_throttled – number of runnable periods in which the application used its entire quota and was throttled
  • throttled_time – total amount of time individual threads within the cgroup were throttled

During our investigation into the response time regression, one engineer noticed that applications with slow response times saw an excessive number of throttled periods (nr_throttled). We divided nr_throttled by nr_periods to arrive at a crucial metric for identifying excessively throttled applications. We call this metric “throttled percentage.” We didn’t use throttled_time for this purpose because it can vary widely between applications depending on how heavily threaded they are.
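As a sketch, the throttled percentage can be computed straight from cpu.stat. The values below are sample contents written to a local file; on a real host you would point awk at the container’s actual cpu.stat instead:

```shell
# Sample cpu.stat contents; the real file lives under
# /sys/fs/cgroup/cpu,cpuacct/<container>/cpu.stat (cgroup v1).
cat > cpu.stat.sample <<'EOF'
nr_periods 5
nr_throttled 4
throttled_time 240000000
EOF

# throttled percentage = nr_throttled / nr_periods
awk '/^nr_periods/   {p = $2}
     /^nr_throttled/ {t = $2}
     END {printf "throttled percentage: %d%%\n", 100 * t / p}' cpu.stat.sample
```

Note that throttled_time in cpu.stat is reported in nanoseconds.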

A conceptual model of CPU constraints

To see how CPU constraints work, consider an example. A single-threaded application is running on a CPU with cgroup constraints. This application needs 200 milliseconds of processing time to complete a request. Unconstrained, its response graph would look something like this.

A request comes in at time 0, the application is scheduled on the processor for 200 consecutive milliseconds, and responds at time 200ms

Now, say we assign a CPU limit of 0.4 CPU to the application. This means the application gets 40ms of run time for every 100ms period, even if the CPU has no other work to do. The 200ms request now takes 440ms to complete.

A request comes in at time 0. The application runs for 40ms of each 100ms period and is throttled for the remaining 60ms of each of the first four periods. The response completes at 440ms
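The 440ms figure is just arithmetic: the request needs five 40ms slices of CPU, and each of the first four slices is followed by 60ms of throttling. A quick sanity check:

```shell
# Completion time for a 200ms request under a 40ms quota per 100ms period
work_ms=200; quota_ms=40; period_ms=100

# Periods needed, rounding up: 200 / 40 = 5
periods=$(( (work_ms + quota_ms - 1) / quota_ms ))

# The first four periods each take a full 100ms; the final period
# only needs whatever work remains (here, a full 40ms slice).
remaining_ms=$(( work_ms - (periods - 1) * quota_ms ))
completion_ms=$(( (periods - 1) * period_ms + remaining_ms ))

echo "${completion_ms}ms"   # 440ms
```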

If we gather metrics at time 1000ms, statistics for our example are:

  • nr_periods = 5. From 440ms to 1000ms the application had nothing to do and as such was not runnable.
  • nr_throttled = 4. The application is not throttled in the fifth period because it is no longer runnable.
  • throttled_time = 240ms. In every 100ms period, the application can run for only 40ms and is throttled for 60ms. It was throttled for 4 periods, and 4 multiplied by 60 equals 240ms.
  • throttled percentage = 80%. nr_throttled (4) divided by nr_periods (5).
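For a cross-check, the same quota and period reproduce each of these values:

```shell
# Recompute the example's statistics: 200ms of work,
# 40ms quota per 100ms period, sampled at 1000ms
quota_ms=40; period_ms=100; work_ms=200

nr_periods=$(( (work_ms + quota_ms - 1) / quota_ms ))           # 5 runnable periods
nr_throttled=$(( nr_periods - 1 ))                               # not throttled in the final period
throttled_time_ms=$(( nr_throttled * (period_ms - quota_ms) ))   # 4 * 60ms = 240ms
throttled_pct=$(( 100 * nr_throttled / nr_periods ))             # 80%

echo "nr_periods=$nr_periods nr_throttled=$nr_throttled" \
     "throttled_time=${throttled_time_ms}ms throttled_pct=${throttled_pct}%"
```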

But that’s the conceptual model, not real life, and there are a couple of problems with it. First, we live in a world of multi-core, multi-threaded applications. Second, if the model were completely accurate, our problematic application should never have been throttled before exhausting its CPU quota.

Reproducing the problem

We knew a succinct reproducing test case would help convince the kernel community that a problem actually existed and needed to be fixed. We tried a number of stress tests and Bash scripts, but struggled to reliably reproduce the behavior.

Our breakthrough came after we considered that many web applications use asynchronous worker threads. In that threading model, each worker is given a small task to accomplish, such as handling IO or some other small amount of work. To reproduce this type of workload, we created a small reproducer in C called Fibtest. Instead of relying on unpredictable IO, we used a combination of the Fibonacci sequence and sleeps to mimic the behavior of these worker threads, splitting the threads between fast and slow workers. The fast threads run through as many iterations of the Fibonacci sequence as possible. The slow threads complete 100 iterations and then sleep for 10ms.

To the scheduler, these slow threads act much like asynchronous worker threads, in that they do a small amount of work and then block. Remember, our goal was not to produce the most Fibonacci iterations. Instead, we wanted a test case that could reliably reproduce a high amount of throttling with simultaneous low CPU usage. By pinning these fast and slow threads each to their own CPU, we finally had a test case that could reproduce the CPU throttling behavior.

The first throttling fix / regression

Our next step was to use Fibtest as the condition for running a git bisect on the kernel. Using this technique, we were able to quickly discover the commit that introduced the excessive throttling: 512ac999d275 “sched/fair: Fix bandwidth timer clock drift condition”. This change was introduced in the 4.18 kernel. Testing a kernel after removing this commit fixed our issue of low CPU usage with high throttling. However, as we analyzed the commit and the related sources, the fix looked perfectly valid. And more confusingly, this commit was also introduced to fix inadvertent throttling.
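The bisect itself followed the standard workflow. Each step below stands for building and booting the candidate kernel, running Fibtest, and checking the throttled percentage (shown schematically):

```shell
git bisect start
git bisect bad  v4.18     # excessive throttling reproduces here
git bisect good v4.17     # and not here
# build and boot the kernel git checks out, run fibtest,
# then report the result:
git bisect good           # or: git bisect bad
# ...repeat until git names the first bad commit
```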

The issue this commit fixed was throttling that appeared to have no correlation with actual CPU usage. It was caused by clock skew between cores, which could lead the kernel to prematurely expire a period’s quota.

Fortunately, this problem was much rarer, as most of our nodes were running kernels that already had the fix. One unlucky application ran into this problem, though. That application was mostly idle and allocated 4.1 CPUs. The resulting CPU usage and throttle percentage graphs looked like this.

CPU usage graph with 4 CPUs allocated and usage not exceeding .5 CPU

Graph of throttled percentage showing excessive throttling

Commit 512ac999d275 fixed the issue and was backported onto many of the Linux-stable trees. The commit was applied to most major distribution kernels, including RHEL, CentOS, and Ubuntu. As a result, some users have probably seen throttling improvements. However, many others are likely seeing the problem that initiated this investigation.

At this point in our journey, we found a major issue, created a reproducer, and identified the causal commit. This commit appeared completely correct but had some negative side-effects. In part two of this series, I’ll further explain the root cause, update the conceptual model to explain how CFS-Cgroup CPU constraints actually work, and describe the solution we eventually pushed into the kernel.

Fixing CPU Limits in the Cloud—cross-posted on Medium.


The FOSS Contributor Fund: Forming a Community of Adopters

In January 2019, Indeed launched a new program that democratizes the way we provide financial support to open source projects that we use. We call it The FOSS Contributor Fund. The fund enables Indeed employees who make open source contributions to nominate and vote for projects. Each month, the winning project receives funding. This program encourages support of projects we use and more engagement with the open source community.

FOSS Contributor Fund logo

Join our community of FOSS Fund Adopters

Now, we want to help other companies start similar funds. Our goal is to collaborate for the benefit of the open source community. Regardless of a company’s size or resources, we want to discover what we can accomplish when we work together. Indeed is forming a community of FOSS Fund Adopters—companies that will run their own FOSS Contributor Fund initiatives in 2020. We invite you to join us and other FOSS Funders in sharing knowledge and experiences. We’re looking for adopters who are willing to run the same experiment we ran, or something similar. We will work with the community of Funders to set up regular events, exploring different models of open source support and funding. 

We’ve seen great results

In our blog post at the six-month mark, we described how the program helped encourage Indeed employees to make open source contributions. Since the program’s launch, we’ve seen thousands of such contributions. Indeedians have reported and fixed bugs. They’ve reviewed pull requests and developed features. They’ve improved documentation and designs.

Even better, Indeed developers now have an avenue to advocate for projects in need of our support. And the program has inspired some employees to make their first open source contributions.

The FOSS Contributor Fund is one of the ways Indeed’s Open Source Program Office honors our commitment to helping sustain the projects we depend on. We gave our open source contributors a voice in the process, and we’ve seen some great benefits from doing so: increased contribution activity, better visibility into our dependencies, and a short list of projects where we can send new contributors. 

Watching the program evolve and grow is exciting. We’ve learned a lot this year and look forward to more growth in 2020. Now, we’d like you to join us. 

Use Indeed’s blueprint to start your FOSS Fund

To find out how we administer the FOSS Fund at Indeed, read our blueprint (released under a Creative Commons license). We’ve also released an open source tool called Starfish that we use to determine voter eligibility. In the coming months, FOSS Funders will publish additional documentation and tools to support these programs. We want to make it easy for anyone to run their own FOSS Fund.

If you are interested in joining the community of FOSS Fund Adopters, want more information, or would like to join a Q&A session, please email us at opensource@indeed.com.

Learn more about Indeed’s open source program.

The FOSS Contributor Fund—cross-posted on Medium.


Being Just Reliable Enough

One Saturday morning, as I settled in on the couch for a nice do-nothing day of watching college football, my wife reminded me that I had agreed to rake the leaves after putting it off for the last two weekends. Being a good neighbor and not wanting another homeowners’ association (HOA) violation (and it being a bye week for the Longhorns), I grabbed my rake and went outside to work.

fall leaves

There were a lot of leaves. I would say my yard was 100% covered in leaves. I began to rake the leaves and with a modest effort I was able to collect about 90% of the leaves into five piles, which I then transferred into those bags you buy at Home Depot or Costco.

The yard looked infinitely better, but there were still plenty of leaves in the yard. I had the rake, I had the bags, I was already outside, and I was already dirty, so I went to work raking the entire yard again to get the remaining 10% I had missed in the first pass. This took about the same amount of time, but wasn’t nearly as fulfilling. My piles weren’t as impressive, and I was only able to get 90% of the remaining leaves into piles and then into bags, but I had cleared 99% of the leaves.

Still having plenty of daylight and knowing I could do better, I went to work on that last 1%. Now, I don’t know if you know this about leaves, but single leaves can be slippery and evasive. When you don’t have a lot of leaves to clump together to get stuck in the rake it may take two, three, sometimes four passes over the same area to get any good leaf accumulation into your pile. This third pass over the yard was considerably more time consuming, but I was able to get 90% of that remaining 1%. I had now cleared 99.9% of the leaves in my yard.

As I sat back and admired my now mostly leaf-free yard, I could see some individual leaves that had escaped my rake and even some new leaves that had just fallen from the trees. There weren’t too many, but they were there. Wanting to do a good job, I started canvassing the yard on my hands and knees, picking up individual leaves one by one. As you can imagine, this was very tedious and it took much longer to do the whole yard, but I was able to pick up 90% of the remaining 0.1%. I had now cleared 99.99% of the leaves in my yard.

The sun was starting to set and all that was left were mostly little leaf fragments that could only really be picked up by tweezers.

I went inside and asked my wife, “Where are the tweezers?” “Why do you need tweezers to paint the fence?” she asked. “Paint the fence?” I thought. Oh, yeah. I had also agreed to paint the fence today. I told her I hadn’t started on the fence yet and wouldn’t be able to do that this weekend because it was getting late and the Cowboys were playing the next day. She was not happy.

Yes, this story is ridiculous and contrived, but it demonstrates some good points that we apply to how we manage system reliability and new feature velocity at Indeed.

Where did I go wrong? 

It was way before I thought about getting the tweezers. When I started raking, my definition of a successfully raked yard was too vague. I did not have a service level objective (SLO) specifying the percentage of my yard that could be covered in leaves and still be considered well-raked by my clients.

Should I have defined the SLO?

I could have defined the SLO, but I might have based it on what I was capable of achieving. I was capable of picking up bits and pieces of leaves with tweezers until I had a 99.999% leaf-free yard. I could have also gone in the other direction (if it wasn’t a bye week) and determined that raking 90% of the leaves would be sufficient. 

SLOs should be driven by the clients who care about them 

The clients in my story are my HOA and my wife. My HOA cites me when my yard is only 50% raked for an extended period of time. My wife says she is happy when I rake 99% of the leaves once a year. For the SLO, we would take the higher of the two. I could have quit raking leaves after the second pass when I reached 99% and had time to paint the fence (depending on the SLO for the number of coats of paint).

But, I still did a good job, right?

I did, but I far exceeded my undefined SLO of 99% by two 9s, and yet I was not rewarded. Sadly, I was punished, because my wife didn’t care about the work I did on that remaining 1% and was upset that I didn’t have the time to meet my other obligation of painting the fence.

This brings us to the moral of the story:

We need to have the right SLOs and work to exceed them, but not by much. 

At Indeed, when our SLOs describe what our users care about, we avoid the effort of adding unnecessary 9s. We then use that saved effort to deploy more features faster, achieving a balance between reliability and velocity.

About the author

Andrew Ford is a site reliability engineer (SRE) at Indeed, who enjoys solving database reliability and scalability problems. He can be found on the couch from the start of College Gameday to the end of the East Coast game most Saturdays from September to December.

Do you enjoy defining SLOs that your clients care about? Check out SRE openings at Indeed!

Being Just Reliable Enough—cross-posted on Medium.
