Coming Together to Support the Open Source Community


Over the last few weeks, conference and event cancellations around the world have heavily impacted the open source community. These events play an important role in the ecosystem that supports free and open source software. If we want that ecosystem to remain healthy, it is important for us to act now.

Why support matters now more than ever

Conferences and events provide essential opportunities for the open source community to:

  • Coordinate activities
  • Raise funds
  • Grow their user bases
  • Support each other
  • Evangelize their technologies
  • Educate and onboard new contributors
  • Ship new releases
  • Share knowledge

Running these events requires a lot of time, effort, and money. When they are cancelled, the event organizers bear the brunt of the losses. If we want these events to continue—and the open source community to sustain itself and grow—all users of free and open source software must respond.

The FOSS Responders

Open source community leaders from across the industry have come together to form a working group called FOSS Responders. We’re focused on identifying open source events, communities, foundations, and community members who are most in need of support. We also want to support individuals who are unable to absorb conference-related cancellation fees. Our goals: amplify these community needs and mobilize organizational and individual resources to help.

This working group of committed industry professionals includes participants from Indeed, GitLab, Open Collective, the Sustain community, the Drupal Association, and several other organizations. You can find more information about this working group—including information on how to join and participate—at fossresponders.com.

Virtual funding event

Indeed and Open Collective are collaborating to host a virtual funding event on May 22, 2020. We want to raise funds for conference organizers that have suffered irrecoverable losses due to event cancellations. We are calling on our peers in the industry to join us at this event as fellow FOSS Funders.

By coming together to share knowledge, collaborate on decision making, and coordinate our collective response, we can ensure that these events will continue to serve and support the community in the months and years to come. Find more information about the virtual funding event.

Taking action

Regardless of your need or your capacity to help, the time to act is now. Here are some specific actions you can take.

How to help

Now is the time to give back to the projects we depend on. This is how we future-proof our open source infrastructure investments and help the millions of people who build the software we all benefit from:

  • Donate to, or buy memberships in, foundations that support your projects
  • Donate to individuals working on projects you use via GitHub Sponsors and Open Collective
  • Donate to the FOSS Responders Open Collective—your funds will be used to help individuals who might otherwise fall through the cracks

How to get help

FOSS Responders will amplify your need so others can easily see whom to help and how. By sharing your need, we can connect you with people who can help.

  • If you are an individual who needs help paying for conference-related cancellation fees, fill out the FOSS Responders Individual Request
  • If you had to cancel an event and your organization needs financial assistance as a result, open an EVENT issue
  • If you need other kinds of help, open an ORGANIZE issue

How to get involved

The FOSS Responders working group is growing quickly and we could use your help.


Supporting the Open Source Community—cross-posted on Medium.

Improving Incident Retrospectives

This post was originally published on learningfromincidents.io.

Photo by Jared Erondu on Unsplash

As a Site Reliability Engineer (SRE) at Indeed, I often participate in the retrospective process that follows an incident. Retrospectives—in use at Indeed since late 2015—are a meaningful part of our engineering culture. I have never questioned their importance, but recently I was struck by shortcomings I saw in some retrospectives. For example:

  • A retrospective meeting might use only ~30% of the allotted time.
  • What is discussed might be gleaned from reading the incident ticket and retrospective document instead of attending the meeting.
  • Too much focus is devoted to the conditions that “triggered” the incident.
  • Signals used for deciding to hold a retrospective tend to direct focus toward incidents with high impact or high visibility.
  • Participants might not actually learn anything new.

It became apparent to me that we were not using every incident to its full learning potential. I decided to explore why, so that we could improve our process.

The typical retrospective

Retrospectives at Indeed are usually a one-hour discussion including up to several dozen participants. The meeting is open to anyone in the company, but usually participants have either been involved in the incident response or have a stake in the outcome.

Facilitators follow a prescribed process:

  1. Review the timeline.
  2. Review the remediation items in the template.
  3. Find owners for the remediation items.
  4. Open the room for questions.

Spotting opportunities for improvement

In Summer 2018 I visited one of our tech sites and was invited to several local retrospective meetings to discuss some recent incidents. As an SRE it wasn’t unusual for me (or members of my team) to be invited. I also had subject matter expertise in a technology related to the incidents.

The facilitators took about 5 minutes to review the timeline, spent 8-10 minutes reviewing the remediation items, and concluded with questions related to the specific technologies involved in the causal chain. I didn't learn anything new. I could have gained the same information from reading the incident ticket and retrospective document. This was a rare opportunity: a unique and eager group of people had gathered in a conference room, ready to investigate collaboratively. Instead, we never realized that potential.

This result is not uniform across retrospectives. I’ve been present in retrospectives where the participants offered such rich detail that the conversation continued well beyond the one-hour time limit, culminating with a huddle outside of the conference room.

The facilitators for these particular retrospective meetings followed the process faithfully, but they used only ~30% of the allotted time. It was clear to me that the retrospective process itself needed improvement.

Nurturing a safety culture

To understand potential changes, I first solicited viewpoints on why we conduct retrospectives at Indeed. The reasons I heard are likely familiar to most software organizations:

  • Find out what caused the outage
  • Measure the impact
  • Ensure that the outage never happens again
  • Create remediation items and assign owners

These goals also reflect Indeed’s strong sense of ownership. When someone’s service is involved in an incident, there’s a concern that we were closer to the edge of failure than we thought we were. Priorities temporarily change and people are more willing to critically examine process and design choices.

It’s important to use these opportunities to direct efforts toward a deeper analysis of our systems (both human and technical) and the assumptions we’ve made about them. These approaches to a different kind of safety culture at Indeed are still relatively new and are evolving toward widespread adoption.

Recommendation: Decouple remediation from the retrospective process

One process change I recommend concerns the creation of remediation items. The retrospective process is not needed as a forcing function for finding and owning remediation items.

I consistently observe that the creation of remediation items occurs organically once production is stabilized. Many fixes are obvious to teams in the hindsight that follows an incident.

I see value in decoupling these “after action” activities from the retrospective process for several reasons:

  • The search for remediation items is often a tacit stopping point that halts further or deeper investigation.
  • The accountability around owning remediation items should be tightly coupled to incident ownership.
  • The retrospective process should be optional. Teams that choose to engage in it then do so because they see value in it, not because it is an obligation or a checklist item.
  • Participants are freed up to conduct a deeper investigation unencumbered by the search for remediation items and shallow explanations.

Recommendation: Lighten up the retrospective template

Another useful change is with the retrospective template itself.

Using retrospective templates can be a lot like filling out forms. The focus is directed toward completion of an activity rather than free exposition. A blank document encourages a different kind of sharing. I have witnessed incidents where responders were so motivated to share their thoughts and descriptions that they produced rich and detailed analysis simply by starting with a blank document.

Every incident is unique, like a snowflake, so no template can capture all of its characteristics. A template constrains detail and elicits explanations through close-ended questions. A blank canvas is open-ended. A template is yet another tacit stopping point that hinders deeper analysis. I recommend that we apply templates to incident analysis, but that we use blank documents for the retrospective process.

Driving organizational change

I have learned a lot by working to drive change at Indeed as we’ve grown quickly. My efforts have benefitted from my tenure in the company, experience participating in hundreds of incidents, and connection to the literature. I have made headway but there is still a lot to do.

I attribute some of my progress so far to finding other advocates in the company and remembering to communicate.

Find advocates

Advocates are colleagues who align closely with my goals, acknowledge where we could be doing better, and share a vision of what could be. I had no trouble finding them: people willing to listen, keep an open mind, and patiently consider another perspective. I held numerous 1:1s with leaders and stakeholders across the organization. I found opportunities to bring these topics up during meetings. I gave tech talks and reached out to potential advocates whenever I visited one of our global Engineering offices.

Over-communicate

As much as I thought I was communicating about my work, it was never enough; I had to constantly over-communicate. Because I repeated myself across multiple channels, I may have sounded repetitive to anyone close to my work, but it was the only way to reach people at the far edges of the organization who might not otherwise have heard me. Not everybody has time to read every email or internal blog post.

Looking ahead

Response to these changes has been largely positive. However, the focus during retrospectives is still anchored to technological factors when more attention could be paid to human factors. I’m exploring different avenues for increasing the reach and effectiveness of these efforts. This includes working with our instructional design team to create a debrief facilitator program, communicating more often and more broadly, making more process changes, continuing to help teams produce and share high-quality write-ups, and focusing on producing educational opportunities. At this point we’ve only scratched the surface, and I’m looking forward to what we will accomplish.


About the author

Alex Elman is a founding member of the Site Reliability Engineering team at Indeed. He leads two teams: one that focuses on Resilience Engineering and one that supports the flagship Job Search product. For the past eight years Alex has been helping Indeed adopt reliability practices to cope with ever increasing complexity and scale. Follow Alex on Twitter @_pkill.


Improving Incident Retrospectives—cross-posted on Medium.

Unthrottled: How a Valid Fix Becomes a Regression

This post is the second in a two-part series.

In a previous post, I outlined how we recognized a major throttling issue involving CFS-Cgroup bandwidth control. To uncover the problem, we created a reproducer and used git bisect to identify the causal commit. But that commit appeared completely valid, which added even more complications. In this post, I’ll explain how we uncovered the root of the throttling problem and how we solved it.

Photo by Jake Givens on Unsplash

Scheduling on multiple CPUs with many threads

While accurate, the conceptual model in my prior post fails to fully capture the kernel scheduler’s complexity. If you’re not familiar with the scheduling process, reading the kernel documentation might lead you to believe the kernel tracks the amount of time used. Instead, it tracks the amount of time still available. Here’s how that works.

The kernel scheduler uses a global quota bucket located in cfs_bandwidth->quota. It allocates slices of this quota to each core (cfs_rq->runtime_remaining) on an as-needed basis. This slice amount defaults to 5ms, but you can tune it via the kernel.sched_cfs_bandwidth_slice_us sysctl tunable.
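
To make these knobs concrete, here is a minimal user-space sketch (my illustration, not from the kernel or the original post) that reads the slice size together with a cgroup's quota and period. It assumes a cgroup v1 hierarchy mounted at /sys/fs/cgroup/cpu, and the cgroup name "myapp" is just a placeholder.

    /* Sketch: inspect the CFS bandwidth slice and one cgroup's quota/period.
       Assumes cgroup v1 at /sys/fs/cgroup/cpu; "myapp" is a placeholder name. */
    #include <stdio.h>

    static long read_long(const char *path)
    {
        FILE *f = fopen(path, "r");
        long v = -1;

        if (f) {
            if (fscanf(f, "%ld", &v) != 1)
                v = -1;
            fclose(f);
        }
        return v;
    }

    int main(void)
    {
        long slice_us  = read_long("/proc/sys/kernel/sched_cfs_bandwidth_slice_us");
        long quota_us  = read_long("/sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us");
        long period_us = read_long("/sys/fs/cgroup/cpu/myapp/cpu.cfs_period_us");

        if (slice_us <= 0 || quota_us <= 0 || period_us <= 0) {
            fprintf(stderr, "could not read slice, quota, or period\n");
            return 1;
        }

        printf("quota: %ldus every %ldus (~%.2f CPUs)\n",
               quota_us, period_us, (double)quota_us / period_us);
        printf("slices handed out before the global bucket is empty: %ld\n",
               quota_us / slice_us);
        return 0;
    }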

If all threads in a cgroup stop being runnable on a particular CPU, for example by blocking on IO, the kernel returns all but 1ms of this slack quota to the global bucket. It leaves 1ms behind because this decreases global bucket lock contention for many high-performance computing applications. At the end of the period, the scheduler expires any remaining core-local time slice and refills the global quota bucket.

At least, that’s how it has worked since commit 512ac999 landed in v4.18 of the kernel.

To clarify, here’s an example of a multi-threaded daemon with two worker threads, each pinned to its own core. The top graph shows the cgroup’s global quota over time. This starts with 20ms of quota, which corresponds to .2 CPU. The middle graph shows the quota assigned to per-CPU queues, and the bottom graph shows when the workers were actually running on their CPUs.

Multi-threaded daemon with two worker threads

Time   Action

10ms   • A request comes in for worker 1.
       • A slice of quota is transferred from the global quota to the per-CPU queue for CPU 1.
       • Worker 1 takes exactly 5ms to process and respond to the request.

17ms   • A request comes in for worker 2.
       • A slice of quota is transferred from the global quota to the per-CPU queue for CPU 2.

The chance that worker 1 takes precisely 5ms to respond to a request is incredibly unrealistic. What happens if the request requires some other amount of processing time?

Multi-threaded daemon with two worker threads

Time   Action

30ms   • A request comes in for worker 1.
       • Worker 1 needs only 1ms to process the request, leaving 4ms remaining in the per-CPU bucket for CPU 1.
       • Since there is time remaining on the per-CPU run queue, but there are no more runnable threads on CPU 1, a timer is set to return the slack quota to the global bucket. This timer is set for 7ms after worker 1 stops running.

38ms   • The slack timer set on CPU 1 triggers and returns all but 1ms of quota to the global quota pool.
       • This leaves 1ms of quota on CPU 1.

41ms   • Worker 2 receives a long request.
       • All the remaining time is transferred from the global bucket to CPU 2’s per-CPU bucket, and worker 2 uses all of it.

49ms   • Worker 2 on CPU 2 is now throttled before completing the request.
       • This occurs even though CPU 1 still has 1ms of quota.

While 1ms might not have much impact on a two-core machine, those milliseconds add up on machines with high core counts. If we hit this behavior on an 88-core machine (n = 88), we could potentially strand 87 (n - 1) milliseconds per period. That’s 87ms, or 870 millicores, or .87 CPU, that could potentially go unused. That’s how we hit low quota usage with excessive throttling. Aha!
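
Here is a back-of-the-envelope sketch of that worst case. It simply assumes, as described above, that every CPU except one holds on to 1ms of slack during a 100ms period; the core counts are illustrative.

    /* Sketch: worst-case stranded quota per period is roughly (n_cpus - 1) * 1ms,
       assuming each CPU that briefly ran a thread keeps 1ms of slack behind. */
    #include <stdio.h>

    int main(void)
    {
        const int period_ms = 100;         /* default CFS period */
        const int leftover_ms_per_cpu = 1; /* slack the kernel leaves on each CPU */
        const int core_counts[] = { 2, 10, 88 };

        for (unsigned i = 0; i < sizeof(core_counts) / sizeof(core_counts[0]); i++) {
            int stranded_ms = (core_counts[i] - 1) * leftover_ms_per_cpu;
            printf("%2d cores: up to %2dms stranded per %dms period (~%.2f CPU unusable)\n",
                   core_counts[i], stranded_ms, period_ms,
                   (double)stranded_ms / period_ms);
        }
        return 0;
    }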

Back when 8- and 10-core machines were considered huge, this issue went largely unnoticed. Now that high core counts are all the rage, the problem has become much more apparent. This is why we noticed an increase in throttling for the same application when it ran on machines with more cores.


Note: If an application only has 100ms of quota (1 CPU), and the kernel uses 5ms slices, the application can only use 20 cores before running out of quota (100 ms / 5 ms slice = 20 slices). Any threads scheduled on the other 68 cores in an 88-core behemoth are then throttled and must wait for slack time to be returned to the global bucket before running.
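
The note’s arithmetic, written out as a tiny sketch using the same assumed numbers (1 CPU of quota, the default 5ms slice, an 88-core machine):

    /* Sketch: how many CPUs can obtain a 5ms slice before 100ms of quota is gone,
       and how many cores of an 88-core machine are left waiting for slack. */
    #include <stdio.h>

    int main(void)
    {
        const long quota_ms = 100; /* 1 CPU worth of quota per period */
        const long slice_ms = 5;   /* default kernel.sched_cfs_bandwidth_slice_us */
        const long n_cpus   = 88;

        long slices = quota_ms / slice_ms;    /* 20 */
        long waiting_cores = n_cpus - slices; /* 68 */

        printf("%ld slices per period; threads on the other %ld cores are throttled\n",
               slices, waiting_cores);
        return 0;
    }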

Resolving a long-standing bug

How is it, then, that a patch that fixed a clock-skew throttling problem resulted in all this other throttling? In part one of this series, we identified 512ac999 as the causal commit. When I returned to the patch and picked it apart, I noticed this.

-       if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
+       if (cfs_rq->expires_seq == cfs_b->expires_seq) {
                /* extend local deadline, drift is bounded above by 2 ticks */
                cfs_rq->runtime_expires += TICK_NSEC;
        } else {
                /* global deadline is ahead, expiration has passed */
                cfs_rq->runtime_remaining = 0;
        }

The pre-patch code expired runtime only when the per-CPU expire time matched the global expire time, that is, only when the condition cfs_rq->runtime_expires != cfs_b->runtime_expires was false. By instrumenting the kernel, I proved that this was almost never the case on my nodes, so that leftover quota, 1ms at a time, never expired. The patch changed this logic from being based on clock time to being based on a period sequence count, resolving a long-standing bug in the kernel.

The original intention of that code was to expire any remaining CPU-local time at the end of the period. Commit 512ac999 actually fixed this so the quota properly expired. This results in quota being strictly limited for each period.

When CFS-Cgroup bandwidth control was initially created, time-sharing on supercomputers was one of the key use cases. Strict enforcement worked well for those CPU-bound applications, since they used all of their quota in each period anyway and none of it ever expired. For Java web applications with many tiny worker threads, however, it meant lots of quota expiring each period, 1ms at a time.

The solution

Once we knew what was going on, we needed to fix the issue. We approached the problem in several different ways.

First, we tried implementing “rollover minutes” that banked expiring quota and made it usable in the next period. This created a thundering-herd problem on the global bucket lock at the period boundary. Then we tried making quota expiration configurable separately from the period. This led to other issues where bursty applications could consume far more quota in some periods. We also tried returning all the slack quota whenever threads stopped being runnable, but this led to a lot of lock contention and some performance issues. Ben Segall, one of the CFS bandwidth control maintainers, suggested tracking the core-local slack and reclaiming it only when needed. This solution had performance issues of its own on machines with high core counts.

As it turns out, the solution was actually staring us right in the face the whole time. No one had noticed any issues with CFS CPU bandwidth constraints since 2014. Then, the expiration bug was fixed in commit 512ac999, and lots of people started reporting the throttling problem.

So, why not remove the expiration logic altogether? That’s the solution we ended up pushing back into the mainline kernel. Now, instead of applications being strictly limited to their quota within each period, average CPU usage is still strictly enforced over a longer time window, and the amount an application can burst is limited to 1ms per CPU run queue. You can read the whole conversation and see the five subsequent patch revisions on the Linux kernel mailing list archives.

These changes are now part of the 5.4+ mainline kernels. They have also been backported to many widely available kernels:

  • Linux-stable: 4.14.154+, 4.19.84+, 5.3.9+
  • Ubuntu: 4.15.0-67+, 5.3.0-24+
  • Red Hat Enterprise Linux:
    • RHEL 7: 3.10.0-1062.8.1.el7+
    • RHEL 8: 4.18.0-147.2.1.el8_1+
  • CoreOS: v4.19.84+

The results

In the best-case scenario, this fix enables a .87 increase in usable CPU for each instance of our affected applications, or a corresponding decrease in required CPU quota. These benefits will unlock increased application density and decreased application response times across our clusters.

Graph showing decrease in required CPU load

How to mitigate the issue

Here’s what you can do to prevent CFS-Cgroup bandwidth control from creating a throttling issue on your systems:

  • Monitor your throttled percentage (a monitoring sketch follows this list)
  • Upgrade your kernels
  • If you are using Kubernetes, use whole CPU quotas, as this decreases the number of schedulable CPUs available to the cgroup
  • Increase quota where necessary
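
For the monitoring item above, here is a minimal sketch that computes a throttled percentage from the cgroup v1 cpu.stat counters (nr_periods and nr_throttled). The path is a placeholder; the real location depends on your container runtime and cgroup layout.

    /* Sketch: fraction of CFS periods in which this cgroup was throttled,
       read from the cgroup v1 cpu.stat file. The path is a placeholder. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *path = "/sys/fs/cgroup/cpu/cpu.stat";
        FILE *f = fopen(path, "r");
        char key[64];
        long long value, nr_periods = 0, nr_throttled = 0;

        if (!f) {
            perror("fopen cpu.stat");
            return 1;
        }
        while (fscanf(f, "%63s %lld", key, &value) == 2) {
            if (strcmp(key, "nr_periods") == 0)
                nr_periods = value;
            else if (strcmp(key, "nr_throttled") == 0)
                nr_throttled = value;
        }
        fclose(f);

        if (nr_periods > 0)
            printf("throttled in %.1f%% of %lld periods\n",
                   100.0 * nr_throttled / nr_periods, nr_periods);
        return 0;
    }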

Ongoing scheduler developments

Konstantin Khlebnikov of Yandex proposed patches to the Linux kernel mailing list to create a “burst bank.” These changes are feasible now that we have removed the expiration logic, as described above. These bursting patches could enable even tighter packing of applications with small quota limits. If you find this idea interesting, join us on the Linux kernel mailing list and show your support.

To read more about kernel scheduler bugs in Kubernetes, see these interesting GitHub issues:

Please also feel free to tweet your questions to me @dchiluk.


How a Valid Fix Becomes a Regression—cross-posted on Medium.