Speed Matters, But It Isn’t Everything


Over the last few years at Indeed, we noticed our public-facing web applications were loading more slowly. We tested numerous ways to improve performance; some were very successful, others were not.

We improved loading speeds by 40%, but we also learned that speed is not always the most important factor for user experience.

Performance metrics

We measured loading speed using two key metrics:

  • firstContentfulPaint – the point at which the browser first renders meaningful content
  • domContentLoadedEventEnd – the point at which the initial HTML document has been parsed and the DOMContentLoaded event has completed

We chose a weighted average of these two metrics instead of a single metric. This provided a more accurate measure of perceived load time, and helped us answer two critical questions:

  • How long did the user wait before the page seemed responsive?
  • How long did the user wait before they could interact with the page?

Though these metrics came with tradeoffs, we decided to use them instead of Google Web Vitals because they gave the broadest coverage across our user base. After deciding on these metrics, we had simple, observable, and reportable data from hundreds of applications and across a variety of web browsers.
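To make these metrics concrete, the sketch below shows one way the two timings can be captured in the browser with the standard Performance APIs and reported as real user metrics. The /rum endpoint and payload shape are hypothetical, not Indeed's actual instrumentation.

```typescript
// Hedged sketch: capture firstContentfulPaint and domContentLoadedEventEnd
// in the browser and beacon them to a (hypothetical) RUM endpoint.
function reportMetric(name: string, value: number): void {
  // sendBeacon survives page unloads better than fetch for RUM payloads.
  navigator.sendBeacon("/rum", JSON.stringify({ name, value }));
}

// firstContentfulPaint arrives as a buffered "paint" performance entry.
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.name === "first-contentful-paint") {
      reportMetric("firstContentfulPaint", entry.startTime);
    }
  }
}).observe({ type: "paint", buffered: true });

// domContentLoadedEventEnd comes from the navigation timing entry,
// read once the page has finished loading.
window.addEventListener("load", () => {
  const [nav] = performance.getEntriesByType(
    "navigation"
  ) as PerformanceNavigationTiming[];
  if (nav) {
    reportMetric("domContentLoadedEventEnd", nav.domContentLoadedEventEnd);
  }
});
```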

Successful methods for improving speed

While we tried many strategies, the following efforts provided the biggest increases in performance.

Flushing <Head/> early

Browsers generally use the most resources during page load when they are downloading and parsing static resources such as JS, CSS, and HTML files. To reduce this cost, we can send static content early, so the browser can begin to download and parse files even before those files are required. This eliminates much of the render-blocking time these resources introduce.

By flushing the HTML head early on multiple applications, we saw load time improvements of 5-10%.

This implementation comes with a few trade-offs, however, since flushing the HTML document in multiple chunks can result in confusing error modes. Once we’ve flushed the first part of the response, we’re no longer able to change parts of the response, such as status code or cookies. Even if an error occurs somewhere before the last part of the response, we can’t change these headers. We’ve implemented some common libraries that help with these complications.
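As a rough illustration of both the technique and the header constraint, here is a hedged sketch using Express. Our shared libraries are internal, so the route, HEAD constant, and renderBody helper below are purely hypothetical.

```typescript
// Hedged sketch of flushing the HTML head before the body is rendered.
import express from "express";

const app = express();

const HEAD = `<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" href="/static/app.css">
  <script defer src="/static/app.js"></script>
</head>
<body>`;

app.get("/search", async (_req, res) => {
  // The status code and headers must be final before the first flush:
  // once this chunk is written they can no longer be changed, even if
  // rendering the body fails later.
  res.status(200).set("Content-Type", "text/html; charset=utf-8");
  res.write(HEAD); // the browser can start fetching CSS/JS immediately

  try {
    res.write(await renderBody()); // app-specific server-side rendering
  } catch (err) {
    // Too late to send a 500; fall back to an inline error message.
    res.write("<p>Something went wrong. Please try again.</p>");
  }
  res.end("</body></html>");
});

// Stand-in for the application's real rendering logic.
async function renderBody(): Promise<string> {
  return "<main>…page content…</main>";
}

app.listen(3000);
```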

Reducing files on the critical path

Apart from the total number of bytes, one of the most important factors in page load time is the total number of resources – especially render-blocking resources – required on the critical path for rendering. In general, the more blocking files you request, the slower the page. For example, a 100kB page served with 5 files will be significantly faster than a 100kB page served with 10 files.

In an A/B test, we reduced the number of render-blocking files from 30 to 12, a 60% reduction. The total number of bytes shipped during page load was roughly identical. This test provided a 2+ second improvement for domContentLoadedEventEnd at the 95th percentile for our desktop and mobile search pages, as well as significant improvements in largestContentfulPaint.

To dive into this further, we explored the cost of a single extra CSS file. We ran a test on one of our highest trafficked pages to reduce the number of CSS files by 1. Page load times improved by a statistically significant amount, about 15ms at the 95th percentile.
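The mechanics of capping file counts depend on the build setup; as one hedged example, webpack's splitChunks options can bound how many render-blocking chunks an entry point fans out into. The numbers below are illustrative, not the settings we shipped.

```typescript
// Hedged sketch: limit the number of initial (render-blocking) chunks.
import type { Configuration } from "webpack";

const config: Configuration = {
  optimization: {
    splitChunks: {
      chunks: "all",
      // Cap how many parallel requests the entry point can produce, keeping
      // the count of render-blocking files low as the dependency graph grows.
      maxInitialRequests: 6,
      // Avoid generating many tiny chunks that each cost a request.
      minSize: 30 * 1024,
    },
  },
};

export default config;
```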

Improving the runtime cost of CSS-in-JS

As more of our applications started using our newest component library, built on top of the Emotion library, we noticed 40% slower page loads.

The Emotion library supports CSS-in-JS, a growing industry trend. We determined that rendering CSS-in-JS components added extra bytes to our JavaScript bundles. The runtime cost of this new rendering strategy – along with the added bytes – caused this slowdown. We built a webpack plugin that precompiled many of our most commonly used components, reducing their render costs and helping address the problem.

This strategy resulted in a massive improvement, decreasing the slowdown from 40% to about 5% in aggregate at the 95th percentile. However, the CSS-in-JS approach still incurred more runtime cost than more traditional rendering approaches.
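The precompilation plugin itself is internal, but the trade-off it targets can be sketched roughly like this: a component styled at runtime with Emotion versus an equivalent that only references a class emitted to a static stylesheet at build time. The file name and class name below are hypothetical.

```typescript
// Conceptual illustration only; Indeed's precompilation plugin is internal.
import React from "react";
import styled from "@emotion/styled";
import "./button.css"; // assumed to contain: .precompiled-button { padding: 8px 16px; border-radius: 4px; }

// Runtime CSS-in-JS: Emotion serializes and injects these rules while the
// page renders, which costs CPU time and ships extra JavaScript.
export const RuntimeButton = styled.button`
  padding: 8px 16px;
  border-radius: 4px;
`;

// Precompiled equivalent: the same rules live in a static CSS file produced
// at build time, so rendering only attaches a class name.
export const PrecompiledButton = (
  props: React.ButtonHTMLAttributes<HTMLButtonElement>
) => React.createElement("button", { className: "precompiled-button", ...props });
```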

Factors outside our control

Along with testing improvements, we analyzed the types of users, locales, and devices that had an impact on page speeds.

Device type and operating system

For Android devices, which are generally lower powered than their iOS counterparts, we saw 63% slower timings for firstContentfulPaint, and 107% slower timings for domContentLoadedEventEnd.

Windows users saw 26% slower timings for domContentLoadedEventEnd compared to their iOS counterparts. These results were somewhat expected, since Windows devices tend to be older.

This data provided important takeaways:

  • The performance impact of features and additional code is non-linear: newer, more robust devices can absorb 100kB of additional code without an impact on performance, while older devices see a much bigger slowdown as a result.
  • Testing applications using real user metrics (RUM) is critical to understanding performance, since performance varies so widely based on device and the operating system’s capabilities.

Connection type and network latency

We used the Network Information API to collect information about various connection types. The API is not supported in all browsers, making this data incomplete; however, it did allow us to make notable observations:

  • 4G connection types were 4 times faster than 3G, 10 times faster than 2G, and 20 times faster than connections that were less than 2G. Put another way, network latency accounts for a huge percent of our total latency.
  • For browsers that report connection type information, 4G connection types make up 95% of total traffic. Including all browser types drops this number closer to 50%.

Networks vary greatly by country, and for some countries it takes over 20 seconds to load a page. By excluding expensive features such as big images or videos in certain regions, we deliver simpler, snappier experiences on slower networks.

Serving less is by far the simplest way to improve performance, though tailoring the experience by region does add complexity of its own.
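As a hedged sketch of that kind of gating, the Network Information API (where the browser supports it) exposes an effectiveType that can be used to drop heavy assets on slow connections. The data-heavy-asset marker below is hypothetical.

```typescript
// Hedged sketch: skip expensive assets when the connection looks slow.
type ConnectionInfo = { effectiveType?: string };

function isSlowConnection(): boolean {
  const connection = (navigator as Navigator & { connection?: ConnectionInfo })
    .connection;
  if (!connection?.effectiveType) {
    return false; // API unsupported: serve the full experience
  }
  return connection.effectiveType === "2g" || connection.effectiveType === "slow-2g";
}

// Hypothetical convention: heavy hero videos and oversized images carry a
// data-heavy-asset attribute so they can be removed on slow networks.
if (isSlowConnection()) {
  document.querySelectorAll("[data-heavy-asset]").forEach((el) => el.remove());
}
```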

Results of speed and other factors

The impact of performance on the web varies. Companies such as Amazon have reported that slowdowns of just 1 second could result in $1.6 billion in lost sales. However, other case studies have reported a more muddled understanding of the impact of performance.

Over the course of our testing, we saw some increases in engagement alongside performance improvements. But we're not convinced they were driven by the performance improvements alone.

Reliability vs speed

Our current understanding of these increases in engagement is that they are based on increased reliability rather than an improvement in loading speed.

In tests where we moved our static assets to a content delivery network (CDN), we saw engagement improvements, but we also saw indications of greater reliability and availability. In tests that improved performance but not reliability, we did not see strong improvements in engagement.

The impact of single, big improvements

In tests where we improved performance by a second or more (without improving reliability), we saw no significant changes in our Key Performance Indicators.

Our data suggests that for non-commerce applications, small to medium changes in performance do not meaningfully improve engagement.

Engagement vs performance

Our observations reminded us not to equate performance with engagement when analyzing our metrics. One stark example of this point was the different performance metrics observed for mobile iOS users versus mobile Android users.

While Android users had nearly 2 times slower rendering, there was no observable drop in engagement when compared to iOS users.

So when does speed matter?

After a year of testing strategies to improve speed, we found several that were worth the effort. While these improvements were measurable, they were not significant enough to drive changes to key performance indicators.

The bigger lesson is that while a certain level of speed is required, other factors matter too. The user’s device and connection play a large role in the overall experience. The silver lining is that knowing we cannot fully control all these factors, we can be open to architectural strategies not specifically designed for speed. Making minor trade-offs in speed for improvements in other areas can result in an overall better user experience.

 

Cross-posted on Medium

Obligation and Opportunity

A good friend of mine who’s been in engineering leadership at a handful of early-stage companies recently had something interesting to say about core values:

I’m never putting ‘Accountability’ as a core value again. I’ve tried it three different ways at three very different companies and it always ends up the same. ‘Accountability’ just ends up being something everybody wishes everybody else would take more of. It’s a stick to beat people with, instead of a core value to practice.

That echoes something I noticed as Indeed grew rapidly through the 2010s. As the company grew larger and more complex, it became harder and harder to improve shared capabilities that fall outside any given team’s scope. Over the last couple of years, I’ve occasionally heard some variation of one of the following:

  • Whose responsibility is (thing X)?
  • We should make (aspect Y) somebody’s responsibility.
  • Why doesn’t leadership make (task Z) somebody’s job?

The thing is: responsibility can’t be given, it can only be taken.


I’ve had the pleasure of working with hundreds of colleagues over the last decade. Every one of them is a highly qualified professional who would thrive with many different teams inside Indeed and many different organizations outside. If one of their managers insisted on assigning them tasks that were neither interesting nor transparently impactful, it wouldn’t be very long until that individual quite rightly started asking what other positions might be available.

Indeed’s engineering leadership has emphasized the coaching model of leadership over command-and-control management ever since you could count the engineering managers on one hand. In this model, a coach’s job isn’t to assign tasks or obligations. Coaches work with people to identify opportunities, help them choose between opportunities, and then help them realize those opportunities.

One of my favorite examples of seeing opportunity versus obligation play out in practice is ownership of the retrospective after a production outage. Indeed has long championed the habit of blameless retrospectives: focusing attention on understanding contributing factors and preventing recurrence, rather than fault-finding.

Nevertheless, I’ve heard a hundred times in the heat of the moment: “that team broke things, they should own it.” From my point of view, this is a little wide of the point. Driving a retrospective is an opportunity, not an obligation. You grab the baton on a retrospective when you happen to be well-positioned to prevent its recurrence independently of whether or not you were anywhere near the triggering condition.

As for individuals, so for teams

We do ask teams to take on specific responsibilities… but we explicitly list out probably fewer than you imagine. When a team has a service running in production, they take on responsibility for making sure that service stays healthy, responsive, and compliant with company policies. We don’t mandate that teams respond to feature requests within a certain timeframe, that they support specific features, or that they use specific technologies.

Instead, we ask them to look for opportunities. Where will supporting new users help other teams onboard to the solution they’re building? Which features will help them accomplish their mission? Where can they find discontinuous advantage by adopting a different underlying technology?

As the engineering lead for a group of platform teams, I get a lot of chances to think about obligation versus opportunity. For example, we provide a modular browser-based UI platform. The bulk of code written against that platform is not written by the team itself. It is written by product teams creating product-specific modules. The platform team members clearly aren’t obligated to monitor the browser-side errors emitted by those modules, and it would be wholly unscalable to try and make them responsible. But at least for now, they can and they do. The opportunity to help product teams that are less familiar with deploying and maintaining modules is just too good to pass up. It won’t scale forever but, while it does, it significantly eases adoption by new teams and helps the platform team see where their users run into trouble.

Our communications platform team helps product teams message job seekers and employers over various channels. Through the years, the team has worked through just about every flavor of this when partnering with core infrastructure teams:

  • Years ago, when postfix performance was a dramatic bottleneck, the core infrastructure team took the feedback, fixed the performance problem, and has maintained it ever since. Responsibility taken.
  • When various issues affected the durability guarantees our message queues could offer, the core infrastructure team didn’t have a clear path to be able to provide the hard guarantees we needed. We worked around the problem by detecting and re-sending messages after any end-to-end delivery failure. Responsibility declined.
  • When we needed to move away from a proprietary key-value store that had been deprecated, an infrastructure team working with OpenStack was very interested in building out a Ceph-based solution. We worked closely with them to prototype the solution, but it became clear that timeline pressure would not allow the solution to provide sufficient performance guarantees soon enough. We fell back on using S3, with the option to cost-optimize in Ceph later. Responsibility desired, but not feasible.

These examples spotlight some really important themes. Responsibility cannot be assigned based on the logic of team names alone. It can only be taken based on a team’s desire and ability to fulfill it. A team named “Storage Systems” is not obligated to support OracleDB simply because they’re the Storage Systems team. If their roadmap takes them in a different direction that meets the needs of their clients and stakeholders, it’s their decision.

Similarly, desire alone is not sufficient. When a much smaller Indeed first experimented with Cassandra, the experiment didn’t fail because of an inherent flaw in the technology. It withered because we didn’t have the in-house expertise and capacity to successfully operate a large-scale cluster through all the vagaries that occur in production. We wanted it to work and teams were happy to try and figure it out… it just ended up not being feasible.

Getting your opportunities noticed

So what does that mean for Thing X, Aspect Y, Task Z, and all of the other wish-list items that people come across in the course of a normal workday? If managers can’t just make those somebody’s job, then how on earth do we make progress on the opportunities that no one’s yet taken?

Two basic prerequisites make the opportunity-driven model effective: one mechanical, one cultural. Unsurprisingly, the mechanical aspect is easier.

The coach’s responsibilities that I listed earlier are identifying opportunities, selecting opportunities, and realizing opportunities.

Product-driven delivery organizations like Indeed already spend a lot of effort continuously improving their ability to deliver software to production. I won’t spend a lot of time on realizing opportunities here.

Identifying opportunities is also a core skill for product delivery teams. Where we needed to invest significant effort was in identifying them effectively. Primarily, that means making sure that good ideas end up on the radar of the people who are able to act on them.

An audience-friendly intake process is a crucial component for teams serving internal customers. Audience friendliness involves several critical aspects.

  • It must be lightweight: incomplete ideas can be fleshed out later; lost ideas are gone for good.
  • It must be responsive, since nothing demotivates a colleague so much as finding their suggestions lost in a black hole.
  • Finally, it must operate at a high enough level in the organization. Individual delivery teams typically have narrow, carefully defined scopes that let them focus. That’s smart for delivery efficiency, but people outside the team can’t reasonably be expected to understand fine-grained subdivisions.

An effective intake process requires something of requesters as well. Making sure the rationale and assumptions behind a request are crystal clear—even when they seem obvious to you—makes it far easier for future engineers to notice and get psyched about the opportunity you’re presenting. Understanding and communicating a value proposition is good practice for any up-and-coming engineer and greatly increases the odds of somebody selecting your opportunity.

A culture of ownership

Of course, relying on others to pick up and run with opportunities requires a lot of trust in your colleagues. You trust that your priorities are generally aligned, so that your rationale will be compelling. You also trust that most everyone is generally hungry for good opportunities and will look for ways to make them happen.

Another way of framing that is that an opportunity-driven model can only work in a high ownership culture. At Indeed, we don’t tend to frame things in terms of obligations and accountability, because we’ve worked hard to develop a culture in which individuals and teams hold themselves accountable. Once a team or an individual has chosen to adopt a responsibility, they will see it through.

My long-time colleague, Patrick Schneider, illustrates the idea of high ownership nicely. He looked at the daily question of “How should I spend my time?” through the lens of a RACI breakdown for an individual displaying various degrees of ownership. RACI stands for responsible, accountable, consulted, and informed.

How should I spend my time?

Patrick Schneider | May 16, 2019

High ownership

  • Responsible: Me. I decide how to spend my time.
  • Accountable: Me. I am able to describe what actions I have taken, which tasks I have completed, and provide justification for each.
  • Consulted: OKRs, my product manager, my team, other teams, … I consult whoever is necessary until I’m confident that I’m spending my time well.
  • Informed: My team, my product manager, Jira, Slack, etc. I regularly and proactively let people know what I am spending my time on.

Medium-high ownership

  • Responsible: Me. I choose from curated options how to spend my time.
  • Accountable: Me. I am able to describe what actions I have taken and which tasks I have completed.
  • Consulted: Me. I have choices or recommendations from my manager, product manager, or others after they have consulted whoever they believe is appropriate.
  • Informed: My team, my product manager, Jira, Slack, etc. People usually know what I am working on.

Medium-low ownership

  • Responsible: My manager or my product manager decides how to spend my time.
  • Accountable: My manager, my product manager, or automation. They describe the things I have completed and the actions I have taken.
  • Consulted: Me. I am provided choices or recommendations by my manager or product manager, after they have consulted whoever they believe is appropriate.
  • Informed: My manager or product manager. I inform them about what I am working on; they may inform whoever else they believe is appropriate. Jira is usually up-to-date.

Low ownership

  • Responsible: My manager, my product manager, or non-humans (e.g., the next email in my inbox) decide how to spend my time.
  • Accountable: Unknown or opaque. Many things are in progress or being worked on; work is described in the continuous tense, often with “-ing” verbs. The state of completion is rarely reached or described.
  • Consulted: Unknown or opaque. My manager or product manager consults whoever they believe is appropriate, or randomness and algorithms decide.
  • Informed: Unknown or opaque. My manager or product manager informs whoever they believe is appropriate; others may or may not find out about my work.

 

Putting it all together

Accountability is a critical attribute of high-performance teams, but it isn’t well-served by simply being named a core value. Instead, you need to instill a culture of high individual ownership, establish processes that spotlight opportunities, and empower your teams to chase the opportunities most meaningful to their mission.

Cross-posted on Medium.

Shifting Modes: Creating a Program to Support Sustained Resilience

Originally published on InfoQ.


Imagine for a moment that you work at a company that continuously ships customer-facing software. Say that your organization has managed to do the impossible and stopped having any serious incidents — you’ve achieved 100% reliability. Your product is successful. It’s fast, useful, usable, and reliable. Adoption increases, users desire new features, and they become accustomed to this success. As this happens, various pressures are continuously exerted on your organization — such as the pressure to ship features more quickly, the pressure to increase revenue, and the pressure to do more with less. Concurrently, there are additional constraints. Employees cannot be asked to work longer hours because work-life balance is a stated corporate priority. Given both this short-term success coupled with the constraints, what would happen over time?


Since employees are not spending time responding to incidents, engaging with retrospective activities, and delivering on action items in this highly reliable future, they’ll have more time to respond to those business pressures for greater efficiency.

The tradeoff of having no incidents is that existing employees fall out of practice at working together to respond to and understand their products in production (a condition known as operational underload). Work will continue to increase in tempo, pace, and complexity. New employees will be hired and trained to account for the increase in workload. Unforeseen threats will act upon the system.

Inevitably, there will be more incidents.

Incidents are a signal from the system that change is happening too quickly and that there are mismatches between people’s models of the system versus the actual system. Incidents are a buffer that stabilizes the pace of change. Success is the reason that you will never be able to truly prevent incidents according to the Law of Stretched Systems. Embracing this inevitability will be the key to continued success in a climate of increasing complexity and interconnectedness.

What I’m witnessing in the software industry is that we’re getting stuck in a local maximum. We’ve plateaued in our approach to safety. I predict that if we don’t level up how we cope with increases in complexity and scale soon, we’ll be in big trouble.

At Indeed, we’ve recognized that we need to drive organizational change to maintain the success we’ve had and keep pace with changing complexity and greater scales. Over the last 16 years, Indeed has grown quickly and the pace of change has accelerated. Because we recognize the importance of getting this right, we are implementing a shift to a Learn & Adapt safety mode within our resilience engineering department.

In this article I will advocate that this mode shift is necessary in order to contend with the direction the software industry is being pushed. I’ll describe the work necessary to enact this shift. Finally, I’ll describe the traits of an organization that is well poised to sustain it. This shift won’t just make your organization safer; as Allspaw (2020) notes, “changing the primary focus from fixing to learning will result in a significant competitive advantage.”

Different approaches to safety

Facing down this increase in complexity and scale requires escaping that local maximum. A change in how an organization works is necessary. The shift is away from the traditional “prevent and fix” mode that’s popular in software today. A prevent and fix safety mode is defined by a preoccupation with accident avoidance, strict controls, and a focus on what breaks.

The prevent and fix cycle focuses on increasing safety over time through fixing and preventing. Participants in this cycle do not learn how to adapt to surprise.

Prevent & Fix cycle

An organization preoccupied with this type of safety mode is not spending time focusing on how to adapt to surprise. The organization might also be spending a lot of time fixing things that don’t need the most attention. Sometimes preventative work can actually hinder opportunities for adaptations. For example, turning on MySQL safe mode in production to prevent UPDATE statements without a WHERE clause might prevent a recurrence of this type of mistake. Safe mode can also stymie a DBA jumping onto the MySQL command line to make a critical repair during an incident.

By contrast, practicing a “learn and adapt” (Learn & Adapt) approach to safety means that encounters with incidents lead to an enhanced understanding of how normal, everyday work creates safety. Organizations that prioritize learning and adapting over preventing and fixing will also improve their ability to prevent and fix. I describe in more detail how that can lead to safer operations in a talk I gave at SREcon20 Americas.

The learn and adapt reinforcing loop focuses on increasing safety over time through an enhanced understanding. This cycle adapts to surprise and therefore becomes safer.

Learn & Adapt reinforcing loop

There appears to be a broad consensus from the Resilience Engineering research literature that the Learn & Adapt approach is superior to approaches aimed at accident avoidance and local fixes. A set of traits make some organizations more successful at this than others. As article 1 in the InfoQ series mentioned, it’s unreasonable to expect anyone in an organization to have predicted the coronavirus pandemic, but it’s perfectly reasonable to anticipate and prepare for future encounters with surprise. It’s something that an organization can get better at over time with continuous focus and investment.

One example of achieving this mode shift is in how an organization approaches its incidents. In the prevent and fix safety mode, incidents are seen as evidence of poor team performance, poor product quality, or avoidable losses. One primary cause is uncovered through causal analysis techniques like The Five Whys. The analysis typically ends there. By contrast, Learn & Adapt promotes using incidents as a lens through which an organization casts a light on processes, decision making, collaboration, and how work gets done. This is accomplished using an incident analysis loop that devotes at least half of its focus to human factors.

This mode shift isn’t achieved by creating a new team, changing people’s titles, hiring the “right” person, or buying the “right” vendor product. It’s also not something that happens overnight.

This mode shift requires the organization to change from within. It begins by sowing the seed of organizational change. Once the seed becomes a sapling, the organization can begin to achieve a continuous reinforcing loop of learning and adapting. This reinforcing loop requires constant nurturing and attention, much like caring for a delicate plant. The caveat is that the sapling can only emerge from the soil and thrive with the right mix of nutrients and the right environmental conditions. Many of those nutrients and conditions are related to organizational culture.


Driving organizational change

My intense focus in this area was inspired by an experience I had years ago when I participated in a string of hour-long retrospective meetings. I was invited to these meetings because I was an SRE and a recognized subject matter expert in RabbitMQ — a factor in several of those incidents. What I noticed struck me as a missed opportunity.

In each of those meetings, over a dozen people were present in the conference room. In some cases, it was standing room only. It was a very expensive meeting. The facilitator went through the agenda, going over the timeline, the action items, and the contributing factors. It was a rote presentation rehashing what had happened, driven by the template document produced a priori. There was a call for questions, and the meeting ran to the end of the agenda within 25 to 30 minutes. We wrapped early. This was an opportunity where we had a lot of eager people in a room to discuss the incident, but I left the meeting without an improved or enhanced understanding about what happened.

The facilitator followed the process faithfully, so I identified a problem with the process itself. I wanted to learn how to make this process more effective. And in pursuing this research, I found that there was so much more to learning from incidents than what I originally assumed.

Once I recognized that process change was necessary, I solicited viewpoints from co-workers on why we conduct retrospectives at Indeed. Reasons I heard are likely familiar to most software organizations:

  • Find out what caused the outage
  • Measure the impact
  • Ensure that the outage never happens again
  • Create remediation items and assign owners

While these answers reflect Indeed’s strong sense of ownership, it’s important to use these opportunities to direct efforts toward a deeper analysis into our systems (both people and technical) and the assumptions that we’ve made about them. When someone’s service is involved in an incident, there’s a concern that we were closer to the edge of failure than we thought we were. Priorities temporarily change and people are more willing to critically examine process and design choices.

These approaches to a different organizational culture at Indeed are still relatively new and are evolving toward widespread adoption, but early indications are promising. After a recent learning review where we discussed an incident write-up, I received this piece of feedback:

The write-up had interesting and varied content, effectively summarized crucial Indeed context, and demonstrably served as the basis for a rich dialogue. Participants revealed thoughtful self-reflection, openly shared new information about their perspective, refined their mental models, became closer as colleagues, and just plain learned cool things.

I have made headway, but there is still a lot to do. While my efforts have benefitted from my tenure in the company, experience participating in hundreds of incidents, and connection to the research literature, I can also attribute some of my progress so far to three key organizational elements:

  1. Finding other advocates in the company
  2. Communicating broadly, and
  3. Normalizing certain critical behaviors

Find advocates

Advocates are colleagues who align closely with the goals, acknowledge where we could be doing better, and share a vision of what could be. They are instrumental in driving organizational change. Having multiple colleagues model new behaviors can help spur social change and create a movement. It’s very difficult to engage in this work alone. I’ve found these advocates and I wager they exist within your company as well. They are colleagues who keep an open mind and have the curiosity to consider multiple perspectives.

I found one such advocate during an incident in 2020 that I analyzed. In a 1:1 debrief interview with a responder who had only peripherally been involved, I asked why they had participated in a group remediation session. Their answer demonstrates that advocates aren’t created; they’re discovered:

I like to join just about every event [Slack] channel I can even when I’m not directly related. I find that these kinds of things are one of the best ways to learn our infrastructure, how things work, who to go to when things are on fire. Who [are] the people that will be fixing stuff? I learn a lot from these things. Like I said, even when it’s not my stuff that’s broken.

Incident debrief interviewing is not the only place to locate advocates. I hold numerous 1:1s with leaders and stakeholders across the organization. I find opportunities to bring these topics up during meetings. I give internal tech talks and reach out to potential advocates whenever I visit one of our global engineering offices. Internal tech talks have the effect of drawing people out who have aligned interests or stories to share. They will make themselves known, perhaps by approaching you after the talk. You may find them to be advocates who can help socialize the movement within your organization. Indeed has offices all over the world, across different time zones. Advocates in each of those offices bring uniformity to the campaign.

Communicate broadly

The second key component of driving organizational change is ensuring the messages are heard across the entire organization — not just within a single team or function. Organization size is an important influence when engaging in broad communication. A 10,000 person org poses different challenges than a 1,000 or 100 person org.

As much as I might think that I am sufficiently communicating the details of a new program, it’s rarely enough. I find that I have to constantly over-communicate. As I over-communicate and leverage multiple channels, I may sound repetitive to anyone in close proximity to my message. This is the only way to reach the far edges of the organization that might not otherwise hear me.

The same communication challenges present themselves in the aftermath of an incident when a team discovers and applies corrective actions. These are often “local-only” fixes, interventions, and lessons that only benefit the part of the organization that experiences the incident. The global organization fails to learn this (sometimes costly) lesson.

Ron Westrum, a researcher in organizational behavior, notes in A typology of organisational cultures:

One of the most important features of a culture of conscious inquiry is that what is known in one part of the system is communicated to the rest. This communication, necessary for a global fix, aids learning from experience, very important in systems safety. The communication occurs because those in the system consider it their duty to inform the others of the potential danger or the potential improvement.

It’s not enough for a team to capture and address important technical fixes and lessons learned in their retrospective materials. Allspaw (2020) spent two years observing how software organizations engage with incidents and found that “hands-on practitioners do not typically capture the post-incident write-up for readers beyond their local team” and “do not read post-incident review write-ups from other teams.”

The organization doesn’t truly benefit until those lessons are scaled far and wide. For the write-ups to be useful, they have to teach the reader something new and help draw out the complexities of the incident.

Normalize new behaviors

Organizational change involves new modes and behaviors. Some of those modes and behaviors might be at odds with how things used to be done. Or they are just non-intuitive. This places a barrier on reaching a critical mass in these desired behaviors. A good place to get started is by modeling the changes yourself. Normalizing these modes and behaviors will help them spread to early adopters and then spawn a social movement. I’ve found there are four main areas to focus on to successfully promote a Learn & Adapt mode to safety.

1. Normalize stating your assumptions as much as possible

Assumptions are beliefs you hold that are sometimes so obvious or (seemingly) self-evident that stating them explicitly doesn’t seem necessary. It’s very likely that what you think is obvious might be surprising to others.

For example, you might consider it so obvious that the MySQL primary can’t automatically and safely fail over to another datacenter that it isn’t worth stating explicitly. In reality, your colleague might believe the exact opposite.

Stating your assumptions gives others an opportunity to recalibrate their model if there’s a mismatch or vice-versa. The conversations between a group of people recalibrating their models of the system are some of the most insightful conversations I’ve experienced. Great places to state your assumptions are in design review materials and in merge requests.

  • What do you assume will happen in the presence of 10% packet loss? What about 50% packet loss?
  • Do you assume that system clocks are always monotonically increasing?
  • Do you assume that your consumer services will never encounter duplicate messages? What do you assume might happen if they do?

Stating these assumptions explicitly will elicit important conversations, because participants in these reviews come with their own assumptions about your design. There’s no impetus for participants to challenge your assumptions if they assume yours match theirs.

2. Normalize asking a lot of questions

This is another approach that can help surface mismatched models of the system. Curiosity is an important cultural trait that nurtures Learn & Adapt. You might worry that asking questions betrays a surprising gap in your knowledge, but if everybody asks a lot of questions, it takes the sting out of asking them.

Asking questions can also help promote a more psychologically safe workplace. Discussing technical topics in front of an audience of peers can be stressful. Everybody has errors somewhere in their mental models and you’re bound to surface those through discussions. The way that those errors are revealed to you are reflected by the cultural norms of your organization. Telling a colleague, “Well, actually there are several problems with what you just said…” has a chilling effect on their willingness to state their assumptions in the future. Even if you’re certain that somebody is wrong, be curious instead of corrective.

Ask follow-up questions to reveal more of their mental model: “Did you notice any deprecation warnings at compile time?” Posing the mismatch as a question instead of a correction will lead to a more productive and psychologically safe exploration of the problem space. It also makes room for you, the corrector, to be incorrect, which also promotes an aspect of psychological safety.

3. Normalize increased cooperation between roles that traditionally don’t directly work together

A great example of this is product/engineering and client-facing roles like customer support or client success. Invite members of those teams to design reviews. Invite them to retrospective meetings or group learning reviews. Sometimes the client-facing support teams are the very first people in an organization to learn about a serious problem. The time between client-facing teams discovering the issue and the product teams learning about them is critical. The work needed to shorten that delay has to happen before the incident occurs, not during.

There was an incident in 2019 that was first detected by the client success team. During the interview phase of the incident analysis, I asked a product manager about how their team directly engages with the client success team. Their response was dismissive of the idea at first: “I don’t think that a sufficient solution for [this incident] should be relying on [customer] feedback to let us know of an issue. It’s too slow of a mechanism to help us identify a high impact issue.”

The corrective action for this incident was to add automated detection. While that corrective action will help detect a recurrence of the same impact, it misses an opportunity to work on better engagement and cooperation with the customer-facing teams. Incidents with impact that evade the existing detection in the future will take longer to resolve.

4. Normalize sharing incident analysis deliverables with everyone in the company

Sharing and discussing incident write-ups is arguably the most important aftermath activity. The STELLA report delivered by the first cycle of the SNAFUcatchers Workshop on coping with complexity highlights this value:

Postmortems can point out unrecognized dependencies, mismatches between capacity and demand, mis-calibrations about how components will work together, and the brittleness of technical and organizational processes. They can also lead to deeper insights into the technical, organizational, economic, and even political factors that promote those conditions.

Postmortems bring together and focus significant expertise on a specific problem for a short period. People attending them learn about the way that their systems work and don’t work. Postmortems do not, in and of themselves, make change happen; instead, they direct a group’s attention to areas of concern that they might not otherwise pay attention to.


Cultural traits

Moving from a prevent and fix safety mode to Learn & Adapt involves changing the very nature of how organizations get work done. If your organization is already relatively successful at delivering products to customers, then making changes to the organization can be risky or even ill advised. Change must be deliberate, incremental, and continuously monitored if it is to result in a net benefit.

While the idea of a “safety culture” is problematic, there exists a connection between an organization’s culture and its ability to successfully prepare for surprise, navigate complexity, and learn from incidents. Culture is, as defined by Westrum (2004), “…the organisation’s pattern of response to the problems and opportunities it encounters.” These patterns are informed by the shared set of behaviors, beliefs, and actions promoted and encouraged in an organization. A cultural norm might be obligating people to “own” their mistakes by expecting a set of follow-up behaviors in the aftermath of an incident.

In reflecting on the cultural norms within my own organization, I’ve identified some tradeoffs we’ve made that have helped cultivate and promote this shift toward Learn & Adapt.

Opportunity over obligation

How an organization handles accountability and responsibility is one aspect of the cultural norms. After a costly incident, a lot of attention is cast upon the parts of the system seen as broken or faulty. If there are considerable losses involved, a common reaction is to isolate a person or team to take responsibility and show accountability for the recovery and prevention activities.

People engage with a task differently when they feel it’s an obligation versus an opportunity. Opportunity is taken whereas obligation is assigned (whether explicitly or implicitly). It is leadership’s role to highlight opportunities by making them attractive, clearly defined, and actionable.

One way to make opportunities more attractive is to change the incentive structures. Ryn Daniels, a software infrastructure engineer, describes a leverage point for crafting a resilient culture:

While there is a lot that goes into psychological safety in the workplace, one way to design for a culture of learning and blamelessness is to look at the incentive structures within your organization.

Instead of expecting people to own their post-incident activities, strive to make the opportunity attractive enough for anyone to select. Ryn suggests a strategy:

If your skills matrices for promotions include things like community contributions, post-mortem facilitation, or incident writeups, that can also provide incentive for people to take part in learning-focused activities. The behaviors that get rewarded and promoted within your organization will have a great deal of impact on its culture.

Creating opportunities instead of assigning ownership not only helps ensure more thorough results, but fosters psychological safety.

Flexibility over rigidity

Placing rigid constraints on decision-making, new technologies, access, and what people are allowed to do in delivering their work can hinder opportunities for adaptation by undermining sources of resilience. These constraints accumulate over time as scar tissue from previous encounters with costly outages.

Rigid constraints can help an organization navigate legal risk, security risk, and financial risk, but they can limit flexibility. More flexibility can prove useful for adaptation because it gives people space to be curious and exercise their interests in other roles. How does the organization respond to a database administrator giving unsolicited help to the security team? What about a data scientist participating in a code review when it’s unrelated to their work or product? Being told to “stay in your lane” can be a manifestation of cultural norms that bias toward rigidity and could be a reflection of people’s insecurities, previous encounters with failure, or fear there is more work to do than available bandwidth.

Fostering this flexibility can pay immense dividends when expertise emerges during an incident in an unexpected way.

Agility over speed

One of the most important engineering priorities at Indeed is velocity, which is the shortening of the development cycle from idea to delivery. While speed is important in the delivery of software, speed isn’t sufficient to adapt to unanticipated challenges. “Turning the ship” is a common metaphor to highlight the challenges of quickly changing direction as a function of organization size and speed.

Agility is a trait that is useful in helping recognize when to change course and accept the sunk costs. In an incident, agility could mean recognizing and escaping cognitive fixation during diagnosis. After an incident, agility could result in the local participants sharing what they’ve learned so that the global organization can take heed and quickly recruit resources by pulling them from less important projects. Agility is a necessary (but not sufficient) aspect of promoting a Learn & Adapt approach to safety.

Trust over suspicion

Trust is fundamental to an organizational culture that learns and adapts. Trust colors our interpretations when we witness the actions of others when we don’t have the benefit of context. Trust means that we can assume that others are acting in good faith. Sometimes it can be easy to jump to anger or disgust with our colleagues when we are armed with hindsight in the aftermath of an incident. Trust means that we allow that they may have encountered substantial challenges. In a low-trust environment, fear, judgment, sanction, rigidity, and blame are common coping mechanisms.


Making the shift

In the course of introducing these new approaches in my own organization, I sometimes encounter pushback about how engaging in incident analysis distracts from getting “real” work done. I remind those colleagues that this is the real work. Engineering is knowledge work and requires continual learning.

Not only does engaging in incident analysis help people get better at their job as they learn more, but incident analysis is a form of knowledge creation. Ralph D. Stacey, an organizational theorist, helped me make the profound observation that simply filing away an incident report is not new knowledge:

From mainstream perspectives, knowledge is thought to be stored in individual heads, largely in tacit form, and it can only become the asset of an organization when it is extracted from those individual heads and stored in some artifact as explicit knowledge.

Incident write-ups do not become organizational knowledge until they are actually used:

Knowledge is the act of conversing and new knowledge is created when ways of talking, and therefore patterns of relationship, change. Knowledge, in this sense, cannot be stored.

Knowledge is created when a group of people meet to discuss a well-crafted incident write-up. Knowledge is created when it is communicated broadly and reinforced through normalized behaviors.

Incidents cannot be prevented, because incidents are the inevitable result of success. Organizations that have the cultural elements to foster a Learn & Adapt mode to safety will embrace the desirable aspects of incidents. Incidents can lead to bridging new connections, engaging with fresh perspectives, surfacing risks, and creating new training material.

If you’re part of an organization that considers incidents avoidable, detestable, or disruptive, it’s likely that you’ll need to change more than just the retrospective process. Start small, mirror the behaviors that cultivate Learn & Adapt, and be patient. Before long, a sapling will emerge.

About the author

Alex Elman has been helping Indeed cope with ever-increasing complexity and scale for the past nine years. He is a founding member of Indeed’s site reliability engineering team. Alex leads the resilience engineering team that focuses on learning from incidents, chaos engineering, and fault-tolerant design patterns.


References

  1. Allspaw, J. (2020). How learning is different than fixing. (Adaptive Capacity Labs blog; video.) Accessed Oct. 20, 2020.
  2. Daniels, R. (2019). Crafting a Resilient Culture: Or, How to Survive an Accidental Mid-Day Production Incident. (InfoQ: DevOps.) Accessed Dec. 28, 2020.
  3. Elman, A. (2019). Improving Incident Retrospectives at Indeed. (Learning from Incidents in Software blog.) Accessed Oct. 20, 2020.
  4. Elman, A. (2020). Are We Getting Better Yet? Progress Toward Safer Operations. In USENIX Association SREcon Conference, USA.
  5. Schemmel, M. (2019). Obligation vs Opportunity. (Internal Indeed Engineering blog; unavailable.) Accessed Oct. 20, 2020.
  6. Spadaccini, A. (2016). Being On-Call. In Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA, USA: O’Reilly Media, Inc. First edition, ch. 11, pp. 125-132.
  7. Stacey, R. D. (2001). Mainstream Thinking about Knowledge Creation in Organizations. In Complex Responsive Processes in Organizations: Learning and knowledge creation. London, England, UK: Routledge.
  8. Westrum, R. (2004). A typology of organisational cultures. Quality & Safety in Health Care, 13, ii22-ii27. doi:10.1136/qshc.2003.009522
  9. Woods, D. D. (2002). Steering the Reverberations of Technology Change on Fields of Practice: Laws that Govern Cognitive Work. In Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society, pp. 14-16. doi:10.4324/9781315782379-10
  10. Woods, D. D. (2017). STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity. Columbus, OH: The Ohio State University.