D-Curve: An Improved Method for Defining Non-Contractual Churn with Type I and Type II Errors

Businesses need to know when customers end their business relationships, an act called “churn.” In a subscription business model, a customer churns by actively canceling their contract. The company can therefore detect and record this churn with absolute certainty. But when no explicit contract exists, churn is more passive and difficult to detect. Without any direct feedback from the customer, companies cannot determine whether the customer has lapsed temporarily or permanently.

Until now, detecting churn in such non-contractual relationships has been mostly arbitrary and more art than science.

Various analysts deal with the non-contractual churn conundrum in different ways. One popular approach is to assume the customer has churned if they lapse for a sufficiently long consecutive period of time. A problem with this approach, apart from it being guesswork, is that the chosen threshold for the length of the lapse period is often too high. This causes the business to wait too long to identify any churn problems. In “Prediction of Advertiser Churn for Google Adwords,” the authors are only able to measure churn after 12 months! Such a long wait period reduces the value of churn detection and the business’s ability to address problems. In analyses that estimate the churn period as a specified percentile of a distribution of buy cycles—time between successive customer purchases—choosing an optimal percentile (90th, 95th, 99th, etc.) is difficult.

In this blog post, we present an improved scientific approach for defining non-contractual churn. Our approach avoids the struggle of choosing an optimal percentile by minimizing a well-defined objective function of type I and II errors.

Theory

Churn period (d) is the minimum length of consecutive silent (no transaction) periods beyond which a customer is considered to have ended their business relationship. Companies commonly partition a book of business into active and churned customers. Where customer relationships are non-contractual, any specified d will have associated type I and II errors. Therefore we should choose a definition that minimizes an objective function of these errors. In our approach, we specify the function to be a weighted average of the errors:

F(d) = w · e1(d) + (1 − w) · e2(d)     (1)

where:

  • e1(d) is the expected type I error associated with churn definition d; Type I error is labeling the customer as churned when they are active;
  • e2(d) is the expected type II error associated with churn definition d; Type II error is labeling the customer as active when they have churned;
  • w is the weight the analyst places on type I errors relative to type II errors; it can be interpreted as the relative costs of the errors.

The optimal churn definition, denoted d*, therefore minimizes F(d). We call F(d) the d-curve.

To compute the error functions, e1(d) and e2(d), we need to introduce another set of notation:

  • ci represents the true churn status of customer i, 0=Active, 1=Churned;
  • li represents the number of consecutive periods customer i has lapsed.

Under definition d, a customer is labeled churned when li ≥ d. With the above definitions, e1(d) and e2(d) are derived as follows:

e1(d) = P(li ≥ d | ci = 0) = (number of active customers with li ≥ d) / (number of active customers)     (2)

e2(d) = P(li < d | ci = 1) = (number of churned customers with li < d) / (number of churned customers)     (3)

From (2) and (3), we see that e1(d) is the overall proportion of active customers mislabeled as churned. Similarly, e2(d) is the overall proportion of churned customers mislabeled as active.
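
As a minimal illustration (ours, not from the original analysis), the error functions and the d-curve can be computed directly from arrays of true churn statuses and lapse lengths:

```python
import numpy as np

def d_curve(c, l, d, w=0.5):
    """Return (e1, e2, F) for churn definition d.

    c: true churn statuses (0 = active, 1 = churned)
    l: consecutive lapsed periods per customer
    w: weight on type I errors relative to type II errors
    """
    c, l = np.asarray(c), np.asarray(l)
    labeled_churned = l >= d                   # labeling rule implied by definition d
    e1 = np.mean(labeled_churned[c == 0])      # active customers mislabeled as churned
    e2 = np.mean(~labeled_churned[c == 1])     # churned customers mislabeled as active
    return e1, e2, w * e1 + (1 - w) * e2
```

The optimal definition d* is then simply the candidate d with the smallest F(d).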

Implementing the theory

Suppose you have data that has recorded the periods associated with all customer transactions from time S to T.

To determine the optimal churn definition, complete the following experiment (a code sketch of these steps follows the list):

  1. Specify the minimum number of periods, D, beyond which you are almost sure that the customer has truly churned. You can do this by empirically examining distributions of customer buy cycles (the difference in periods between successive customer transaction dates) and choosing a sufficiently high percentile. We’ll call D the validation period. This means that the subjects of the experiment have to be limited to the subset of customers who have at least one transaction prior to T-D; otherwise we cannot calculate the customer’s true churn status, ci. Also, the length of the entire data window (T-S) should be long enough to allow you to evaluate the selected domain of churn definitions for the d-curve, F(d). For example, if the domain is {d : d ≤ K}, then T-S must exceed K+D.
  2. If you are only interested in voluntary churn, remove all customers otherwise terminated involuntarily by the company.
  3. For each customer i, determine the last purchase period as of time T: pi(T) = max{t ≤ T : customer i transacted in period t}.
    Calculate the lapse period as of time T: li(T) = T − pi(T).
    And calculate the true churn status: ci = 1 if li(T) ≥ D, and ci = 0 otherwise.
  4. For each customer i, calculate the last purchase period as of time T-D: pi(T−D) = max{t ≤ T−D : customer i transacted in period t}. Then calculate the lapse period as of time T-D: li(T−D) = (T−D) − pi(T−D).
  5. Select the domain of churn definitions, {d : d ≤ K}, on which you want to minimize F(d).
  6. For each churn definition in the selected domain, d = 0, 1, 2, …, K, predict churn status for each customer as of time T-D (label customer i churned if li(T−D) ≥ d), and measure the type I and II errors. Notice that e1(d) and e2(d) can be calculated from the data as: e1(d) = (number of customers with ci = 0 and li(T−D) ≥ d) / (number of customers with ci = 0), and e2(d) = (number of customers with ci = 1 and li(T−D) < d) / (number of customers with ci = 1).
  7. Select an appropriate weight, w.
  8. For d=0, 1, 2, …K, derive F(d) using (1).
  9. Choose the d that minimizes F(d) as your optimal d.
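
The steps above map directly to a few lines of code. Here is a rough end-to-end sketch in Python, assuming a pandas DataFrame named transactions with customer_id and period columns, where period is an integer index of the transaction month; the schema, function name, and column names are ours, not from the original analysis.

```python
import numpy as np
import pandas as pd

def optimal_churn_definition(transactions, S, T, D, K, w=0.5):
    """Steps 1-9 of the experiment; returns (d*, d-curve table)."""
    assert T - S > K + D, "data window too short for the chosen domain of definitions"

    # Step 1 (in part): keep only customers with at least one transaction before T - D,
    # otherwise their true churn status c_i cannot be computed.
    eligible = transactions.loc[transactions["period"] < T - D, "customer_id"].unique()
    tx = transactions[transactions["customer_id"].isin(eligible)]

    # Steps 3-4: last purchase and lapse length as of T and as of T - D.
    last_T = tx.groupby("customer_id")["period"].max()
    last_TD = tx[tx["period"] <= T - D].groupby("customer_id")["period"].max()
    lapse_T = T - last_T
    lapse_TD = (T - D) - last_TD
    true_churn = (lapse_T >= D).astype(int)          # c_i

    # Steps 5-8: sweep candidate definitions d = 0..K and build the d-curve.
    rows = []
    for d in range(K + 1):
        predicted_churn = lapse_TD >= d               # labeled churned as of T - D
        e1 = np.mean(predicted_churn[true_churn == 0])
        e2 = np.mean(~predicted_churn[true_churn == 1])
        rows.append({"d": d, "e1": e1, "e2": e2, "F": w * e1 + (1 - w) * e2})
    curve = pd.DataFrame(rows)

    # Step 9: the optimal definition minimizes F(d).
    d_star = int(curve.loc[curve["F"].idxmin(), "d"])
    return d_star, curve
```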

Results from real world application

We identified one of Indeed’s non-contractual products—job sponsorship—and applied both the percentile and d-curve methods to defining its churn period. We used monthly transaction data from September 2016 (S) through September 2019 (T).

Note that while the trends and insights we share are consistent with actual findings, we adjusted the actual results to protect the security of Indeed’s data.

Percentiles method

In this approach, we calculate the buy cycles for each customer. We can then represent each customer by a summary statistic (mean, median, and max) of their buy cycles. We then generate the distribution of the summary statistic across different customers:

Quantiles    Mean    Median    Max
0             1        1        1
0.2           2        2        2
0.4           2        2        2
0.6           3.5      3        5
0.8           4.7      3        9
0.9           6.2      5       13
0.95          8        7       17
0.99         15       15       25
1            38       38       38
All figures illustrative
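
For reference, a table like the one above can be produced with a few lines of pandas. This is only a sketch under the same assumed transactions schema (customer_id, integer period) as before, not the code behind the illustrative figures.

```python
import pandas as pd

# Buy cycle = gap (in periods) between a customer's successive purchases.
tx = transactions.sort_values(["customer_id", "period"])
tx["buy_cycle"] = tx.groupby("customer_id")["period"].diff()

# Summarize each customer's buy cycles, then look at the distribution across customers.
per_customer = tx.groupby("customer_id")["buy_cycle"].agg(["mean", "median", "max"])
print(per_customer.quantile([0, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.99, 1]))
```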

 

These results illustrate the analytic dilemma associated with the percentiles method. The distribution varies by the choice of summary statistic. Even with a given summary statistic, it’s not clear which percentile (90th, 95th or 99th) is optimal. Apart from that, any reasonable choice of percentile results in unnecessarily high churn definitions. For example, the 95th percentile of the distribution of mean buy-cycles is 8 months, while that of maximum buy-cycles is 17 months! And we will see in the next approach that while such longer definitions have lower type I errors, they have higher type II errors.

The d-curve approach deals with all of these problems by choosing the churn definition with the minimum weighted sum of the type I and II errors.

D-curve approach

We parameterized our model as follows:

  • w=0.5
  • D=12
  • S= 09-2016
  • K=12
  • T=09-2019
  • T-D=09-2018

Churn period    Type I error (%)    Type II error (%)    Weighted error (%)
0                    100.0                 0.0                 50.0
1                     43.8                 6.4                 25.1
2                     33.0                13.1                 23.1
3                     26.6                19.0                 22.8
4                     21.9                24.8                 23.3
5                     17.8                30.8                 24.3
6                     14.7                36.8                 25.7
7                     12.2                42.4                 27.3
8                     10.4                46.7                 28.6
9                      8.9                50.8                 29.9
10                     8.0                54.1                 31.0
11                     6.9                58.2                 32.5
12                     5.8                62.6                 34.2

Using the d-curve, we choose 3 months as our optimal churn definition. A hypothesis test at the 1% level of significance rejects the null hypothesis that the error for d=3 equals that for d=4.

More applications for the d-curve

We have formulated a framework for optimally selecting thresholds. While we apply the approach to define churn periods for non-contractual relationships, our approach has many other real world applications, chief of which is determining threshold probabilities in classification.

Acknowledgements

We are particularly grateful to Trey Causey, Ehsan Fakharizadi and Yaoyi Chen for their review and excellent feedback. We are, however, responsible for any mistakes in the post.


Cross-posted on Medium.

Unthrottled: How a Valid Fix Becomes a Regression

This post is the second in a two-part series.

In a previous post, I outlined how we recognized a major throttling issue involving CFS-Cgroup bandwidth control. To uncover the problem, we created a reproducer and used git bisect to identify the causal commit. But that commit appeared completely valid, which added even more complications. In this post, I’ll explain how we uncovered the root of the throttling problem and how we solved it.

Photo of a busy highway at night, by Jake Givens on Unsplash

Scheduling on multiple CPUs with many threads

While accurate, the conceptual model in my prior post fails to fully capture the kernel scheduler’s complexity. If you’re not familiar with the scheduling process, reading the kernel documentation might lead you to believe the kernel tracks the amount of time used. Instead, it tracks the amount of time still available. Here’s how that works.

The kernel scheduler uses a global quota bucket located in cfs_bandwidth->quota. It allocates slices of this quota to each core (cfs_rq->runtime_remaining) on an as-needed basis. This slice amount defaults to 5ms, but you can tune it via the kernel.sched_cfs_bandwidth_slice_us sysctl tunable.
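
For example, here is a minimal, Linux-only sketch of reading that slice size from Python, assuming the standard /proc/sys mapping of the sysctl name:

```python
# kernel.sched_cfs_bandwidth_slice_us is exposed under /proc/sys on Linux.
with open("/proc/sys/kernel/sched_cfs_bandwidth_slice_us") as f:
    slice_us = int(f.read())
print(f"CFS bandwidth slice: {slice_us} us")  # 5000 (5ms) by default
```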

If all threads in a cgroup stop being runnable on a particular CPU, such as blocking on IO, the kernel returns all but 1ms of this slack quota to the global bucket. The kernel leaves 1ms behind, because this decreases global bucket lock contention for many high performance computing applications. At the end of the period, the scheduler expires any remaining core-local time slice and refills the global quota bucket.

That’s at least how it has worked since commit 512ac999 and v4.18 of the kernel.

To clarify, here’s an example of a multi-threaded daemon with two worker threads, each pinned to their own core. The top graph shows the cgroup’s global quota over time. This starts with 20ms of quota, which corresponds to .2 CPU. The middle graph shows the quota assigned to per-CPU queues, and the bottom graph shows when the workers were actually running on their CPU.

Multi-threaded daemon with two worker threads

 

Time Action
10ms
  • A request comes in for worker 1. 
  • A slice of quota is transferred from the global quota to the per-CPU queue for CPU 1.  
  • Worker 1 takes exactly 5ms to process and respond to the request.
17ms
  • A request comes in for worker 2. 
  • A slice of quota is transferred from the global quota to the per-CPU queue for CPU 2.

It’s incredibly unlikely that worker 1 would take exactly 5ms to respond to a request. What happens if the request requires some other amount of processing time?

Multi-threaded daemon with two worker threads

Time Action
30ms
  • A request comes in for worker 1. 
  • Worker 1 needs only 1ms to process the request, leaving 4ms remaining on the per-CPU bucket for CPU 1.
  • Since there is time remaining on the per-CPU run queue, but there are no more runnable threads on CPU 1, a timer is set to return the slack quota back to the global bucket. This timer is set for 7ms after worker 1 stops running.
38ms
  • The slack timer set on CPU 1 triggers and returns all but 1ms of quota back to the global quota pool.
  • This leaves 1ms of quota on CPU 1.
41ms
  • Worker 2 receives a long request. 
  • All the remaining time is transferred from the global bucket to CPU 2’s per-CPU bucket, and worker 2 uses all of it.
49ms
  • Worker 2 on CPU 2 is now throttled without completing the request.
  • This occurs even though CPU 1 still has 1ms of unused quota.

While 1ms might not have much impact on a two-core machine, those milliseconds add up on high-core-count machines. If we hit this behavior on an 88-core machine (n = 88), we could strand up to 87ms (n − 1) per period. That’s 87ms, or 870 millicores, or .87 CPU that could potentially be unusable. That’s how we hit low quota usage with excessive throttling. Aha!

Back when 8- and 10-core machines were considered huge, this issue went largely unnoticed. Now that high core counts are all the rage, the problem has become much more apparent. This is why we noticed an increase in throttling for the same application when run on higher core count machines.


Note: If an application only has 100ms of quota (1 CPU), and the kernel uses 5ms slices, the application can only use 20 cores before running out of quota (100 ms / 5 ms slice = 20 slices). Any threads scheduled on the other 68 cores in an 88-core behemoth are then throttled and must wait for slack time to be returned to the global bucket before running.
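
The arithmetic above is easy to sanity-check. Here’s a tiny illustrative calculation (ours, not from the kernel sources):

```python
period_ms = 100        # default cfs_period_us, expressed in ms
quota_ms = 100         # 1 CPU of quota per period
slice_ms = 5           # default bandwidth slice
cores = 88

# Each CPU that goes idle keeps 1ms behind, so up to (cores - 1) ms of quota
# can be stranded across the other CPUs in a single period.
stranded_cpu = (cores - 1) / period_ms
print(stranded_cpu)                      # 0.87 CPU potentially unusable

# With 100ms of quota handed out 5ms at a time, only 20 CPUs get a slice
# before the global bucket runs dry; the rest must wait for slack to return.
cpus_with_a_slice = quota_ms // slice_ms
print(cpus_with_a_slice, cores - cpus_with_a_slice)   # 20 slices, 68 cores waiting
```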

Resolving a long-standing bug

How is it, then, that a patch that fixed a clock-skew throttling problem resulted in all this other throttling? In part one of this series, we identified 512ac999 as the causal commit. When I returned to the patch and picked it apart, I noticed this.

-       if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
+       if (cfs_rq->expires_seq == cfs_b->expires_seq) {
                /* extend local deadline, drift is bounded above by 2 ticks */
                cfs_rq->runtime_expires += TICK_NSEC;
        } else {
                /* global deadline is ahead, expiration has passed */
                cfs_rq->runtime_remaining = 0;
        }

The pre-patch code expired runtime only when the per-CPU expiration time matched the global expiration time, that is, only when the cfs_rq->runtime_expires != cfs_b->runtime_expires check above evaluated to false. By instrumenting the kernel, I proved that this condition was almost never true on my nodes. Therefore, those leftover 1ms slices never expired. The patch changed this logic from comparing clock times to comparing a per-period sequence count, resolving a long-standing bug in the kernel.

The original intention of that code was to expire any remaining CPU-local time at the end of the period. Commit 512ac999 actually fixed this so the quota properly expired. This results in quota being strictly limited for each period.

When CFS-Cgroup bandwidth control was initially created, time-sharing on supercomputers was one of its key use cases. Strict enforcement worked well for those CPU-bound applications: they used all their quota in each period anyway, so none of it ever expired. For Java web applications with many tiny worker threads, however, it meant lots of quota expiring each period, 1ms at a time.

The solution

Once we knew what was going on, we needed to fix the issue. We approached the problem in several different ways.

First, we tried implementing “rollover minutes” that banked expiring quota and made it usable in the next period. This created a thundering-herd problem on the global bucket lock at the period boundary. Then, we tried making quota expiration configurable separately from the period. This led to other issues where bursty applications could consume far more quota in some periods. We also tried returning all the slack quota when threads stopped being runnable, but this led to a ton of lock contention and some performance issues. Ben Segall, one of the authors of CFS bandwidth control, suggested tracking the core-local slack and reclaiming it only when needed. This solution had performance issues of its own on high-core-count machines.

As it turns out, the solution was actually staring us right in the face the whole time. No one had noticed any issues with CFS CPU bandwidth constraints since 2014. Then, the expiration bug was fixed in commit 512ac999, and lots of people started reporting the throttling problem.

So, why not remove the expiration logic altogether? That’s the solution we ended up pushing back into the mainline kernel. Quota is no longer strictly limited within each period; instead, average CPU usage is still strictly enforced over a longer time window, and the amount an application can burst is limited to 1ms per CPU run queue. You can read the whole conversation and see the five subsequent patch revisions on the Linux kernel mailing list archives.

These changes are now a part of the 5.4+ mainline kernels. They have been backported onto many available kernels:

  • Linux-stable: 4.14.154+, 4.19.84+, 5.3.9+
  • Ubuntu: 4.15.0-67+, 5.3.0-24+
  • Redhat Enterprise Linux:
    • RHEL 7: 3.10.0-1062.8.1.el7+
    • RHEL 8: 4.18.0-147.2.1.el8_1+
  • CoreOS: v4.19.84+

The results

In the best-case scenario, this fix enables a .87 increase in usable CPU for each instance of our affected applications, or a corresponding decrease in required CPU quota. These benefits will unlock increased application density and decreased application response times across our clusters.

Graph showing decrease in required CPU load

How to mitigate the issue

Here’s what you can do to prevent CFS-Cgroup bandwidth control from creating a throttling issue on your systems:

  • Monitor your throttled percentage
  • Upgrade your kernels
  • If you are using Kubernetes, use whole CPU quotas, as this decreases the number of schedulable CPUs available to the cgroup
  • Increase quota where necessary

Ongoing scheduler developments

Konstantin Khlebnikov of Yandex proposed patches to the Linux kernel mailing list to create a “burst bank.” These changes are feasible now that we have removed the expiration logic, as described above. These bursting patches could enable even tighter packing of applications with small quota limits. If you find this idea interesting, join us on the Linux kernel mailing list and show your support.

To read more about kernel scheduler bugs in Kubernetes, see these interesting GitHub issues:

Please also feel free to tweet your questions to me @dchiluk.


How a Valid Fix Becomes a Regression—cross-posted on Medium.

Unthrottled: Fixing CPU Limits in the Cloud

This post is the first in a two-part series.

This year, my teammates and I solved a CPU throttling issue that affects nearly every container orchestrator with hard limits, including Kubernetes, Docker, and Mesos. In doing so, we lowered worst-case response latency in one of Indeed’s applications from over two seconds to 30 milliseconds. In this two-part series, I’ll explain our journey to find the root cause and how we ultimately arrived at the solution.

Photo of a 10 MPH speed limit road sign, by twinsfisch on Unsplash

The issue began last year, shortly after the v4.18 release of the Linux kernel. We saw an increase in tail response times for our web applications, but when we looked at CPU usage, everything seemed fine. Upon further investigation, it was clear that the incidence of high response times directly correlated to periods of high CPU throttling. Something was off. Normal CPU usage and high throttling shouldn’t have been possible. We eventually found the culprit, but first we had to understand the mechanisms at work.

Background: How container CPU constraints work

Almost all container orchestrators rely on the kernel control group (cgroup) mechanisms to manage resource constraints. When hard CPU limits are set in a container orchestrator, the kernel uses Completely Fair Scheduler (CFS) Cgroup bandwidth control to enforce those limits. The CFS-Cgroup bandwidth control mechanism manages CPU allocation using two settings: quota and period. When an application has used its allotted CPU quota for a given period, it gets throttled until the next period.

All CPU metrics for a cgroup are located in /sys/fs/cgroup/cpu,cpuacct/<container>. Quota and period settings are in cpu.cfs_quota_us and cpu.cfs_period_us.

CPU metrics for a cgroup

You can also view throttling metrics in cpu.stat. Inside cpu.stat you’ll find:

  • nr_periods – number of periods that any thread in the cgroup was runnable
  • nr_throttled – number of runnable periods in which the application used its entire quota and was throttled
  • throttled_time – sum total amount of time individual threads within the cgroup were throttled

During our investigation into the response time regression, one engineer noticed that applications with slow response times saw excessive amounts of periods throttled (nr_throttled). We divided nr_throttled by nr_periods to find a crucial metric for identifying excessively throttled applications. We call this metric “throttled percentage.” We didn’t like using throttled_time for this purpose because it can vary widely between applications depending on the extent of thread usage.
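
As a concrete illustration, throttled percentage can be computed straight from cpu.stat. This is a minimal sketch for the cgroup v1 layout described above; the <container> placeholder is whatever your orchestrator names the cgroup.

```python
# Compute "throttled percentage" for one cgroup from its cpu.stat file (cgroup v1).
CPU_STAT = "/sys/fs/cgroup/cpu,cpuacct/<container>/cpu.stat"

stats = {}
with open(CPU_STAT) as f:
    for line in f:
        key, value = line.split()
        stats[key] = int(value)

throttled_pct = 100.0 * stats["nr_throttled"] / max(stats["nr_periods"], 1)
print(f"throttled percentage: {throttled_pct:.1f}%")
```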

A conceptual model of CPU constraints

To see how CPU constraints work, consider an example. A single-threaded application is running on a CPU with cgroup constraints. This application needs 200 milliseconds of processing time to complete a request. Unconstrained, its response graph would look something like this.

A request comes in at time 0, the application is scheduled on the processor for 200 consecutive milliseconds, and responds at time 200ms

Now, say we assign a CPU limit of .4 CPU to the application. This means the application gets 40ms of run time for every 100ms period—even if the CPU has no other work to do. The 200ms request now takes 440ms to complete.

A request comes in at time 0; in each 100ms period the application runs for 40ms and is then throttled for 60ms. The response is completed at 440ms

If we gather metrics at time 1000ms, statistics for our example are:

Metric                 Value    Reasoning
nr_periods             5        From 440ms to 1000ms the application had nothing to do and as such was not runnable.
nr_throttled           4        The application is not throttled in the fifth period because it is no longer runnable.
throttled_time         240ms    For every 100ms period, the application can only run for 40ms and is throttled for 60ms. It has been throttled for 4 periods, so 4 multiplied by 60 equals 240ms.
throttled percentage   80%      4 nr_throttled divided by 5 nr_periods.
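
To make the conceptual model concrete, here is a small illustrative calculation (ours, not from the post) that reproduces the numbers in the table for the .4 CPU example:

```python
# Single-threaded app, 200ms of work, .4 CPU limit (40ms of quota per 100ms period),
# metrics observed at t = 1000ms.
work_ms, quota_ms, period_ms = 200, 40, 100

runnable_periods = -(-work_ms // quota_ms)               # ceil(200 / 40) = 5
leftover_ms = work_ms - (runnable_periods - 1) * quota_ms
completion_ms = (runnable_periods - 1) * period_ms + leftover_ms   # 4 * 100 + 40 = 440

nr_periods = runnable_periods                            # only runnable periods count
nr_throttled = runnable_periods - 1                      # not throttled once the work is done
throttled_time = nr_throttled * (period_ms - quota_ms)   # 4 * 60 = 240ms
throttled_pct = 100 * nr_throttled / nr_periods          # 80%

print(completion_ms, nr_periods, nr_throttled, throttled_time, throttled_pct)
# 440 5 4 240 80.0
```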

But that’s the high-level view, not real life. There are a couple of problems with this conceptual model. First, we live in a world of multi-core, multi-threaded applications. Second, if the model were completely accurate, our problematic application should never have been throttled before exhausting its CPU quota.

Reproducing the problem

We knew a succinct reproducing test case would help convince the kernel community that a problem actually existed and needed to be fixed. We tried a number of stress tests and Bash scripts, but struggled to reliably reproduce the behavior.

Our breakthrough came after we considered that many web applications use asynchronous worker threads. In that threading model, each worker is given a small task to accomplish. For example, these workers might handle IO or some other small amount of work. To reproduce this type of workload, we created a small reproducer in C called Fibtest. Instead of using unpredictable IO, we used a combination of the Fibonacci sequence and sleeps to mimic the behavior of these worker threads. We split these between fast threads and slow worker threads. The fast threads run through as many iterations of the Fibonacci sequence as possible. The slow threads complete 100 iterations and then sleep for 10ms.

To the scheduler, these slow threads act much like asynchronous worker threads, in that they do a small amount of work and then block. Remember, our goal was not to produce the most Fibonacci iterations. Instead, we wanted a test case that could reliably reproduce a high amount of throttling with simultaneous low CPU usage. By pinning these fast and slow threads each to their own CPU, we finally had a test case that could reproduce the CPU throttling behavior.
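
The real reproducer, Fibtest, is written in C and pins threads to specific CPUs. As a rough, illustrative approximation only (not the actual Fibtest code), the same idea can be sketched in Python using processes pinned via os.sched_setaffinity:

```python
import os
import time
from multiprocessing import Process

def fib_iterations(n):
    # Run n iterations of the Fibonacci sequence.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fast_worker(cpu):
    os.sched_setaffinity(0, {cpu})       # pin this process to its own CPU
    while True:
        fib_iterations(10_000)           # burn CPU continuously

def slow_worker(cpu):
    os.sched_setaffinity(0, {cpu})
    while True:
        fib_iterations(100)              # a tiny bit of work...
        time.sleep(0.010)                # ...then block, like an async worker thread

if __name__ == "__main__":
    # One CPU hog plus several mostly idle workers, each on its own CPU.
    procs = [Process(target=fast_worker, args=(0,))]
    procs += [Process(target=slow_worker, args=(cpu,)) for cpu in range(1, 8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

The point is the same as with Fibtest: the mostly idle, pinned workers spread quota slices across many CPUs where they strand, which is the pattern Fibtest uses to reproduce low CPU usage with high throttling under a hard CPU limit.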

The first throttling fix / regression

Our next step was to use Fibtest as the condition for running a git bisect on the kernel. Using this technique, we were able to quickly discover the commit that introduced the excessive throttling: 512ac999d275 “sched/fair: Fix bandwidth timer clock drift condition”. This change was introduced in the 4.18 kernel. Testing a kernel after removing this commit fixed our issue of low CPU usage with high throttling. However, as we analyzed the commit and the related sources, the fix looked perfectly valid. And more confusingly, this commit was also introduced to fix inadvertent throttling.

The issue this commit fixed was exemplified by throttling that appeared to have no correlation with actual CPU usage. This was due to clock-skew between the cores that resulted in the kernel prematurely expiring the quota for a period.

Fortunately, this problem was much rarer, as most of our nodes were running kernels that already had the fix. One unlucky application ran into this problem, though. That application was mostly idle and allocated 4.1 CPUs. The resulting CPU usage and throttle percentage graphs looked like this.

CPU usage graph with 4 CPUs allocated and usage not exceeding .5 CPU

Graph of throttled percentage showing excessive throttling

Commit 512ac999d275 fixed the issue and was backported onto many of the Linux-stable trees. The commit was applied to most major distribution kernels, including RHEL, CentOS, and Ubuntu. As a result, some users have probably seen throttling improvements. However, many others are likely seeing the problem that initiated this investigation.

At this point in our journey, we found a major issue, created a reproducer, and identified the causal commit. This commit appeared completely correct but had some negative side-effects. In part two of this series, I’ll further explain the root cause, update the conceptual model to explain how CFS-Cgroup CPU constraints actually work, and describe the solution we eventually pushed into the kernel.


Fixing CPU Limits in the Cloud—cross-posted on Medium.