Indeed SRE: An Inside Look

Indeed adds over 30 million jobs online every month, which helps connect 250 million job seekers to prospective employers. How do we keep our services available, fast, and scalable? That’s the ongoing challenge for our site reliability engineering (SRE) team.

What is SRE?

The idea behind SRE is simple: The team ensures that a company’s core infrastructure works effectively. SRE originated in 2003, when Google formed a small production engineering team to address reliability issues. Its initial focus was on-call, monitoring, release pipelines, and other operations work. The team established service-level indicators and objectives (SLIs and SLOs) to improve infrastructure across the company. Other companies took note, and SRE soon became an industry standard.

SRE is distinct from other engineering roles. Team members work across business areas to ensure that services built by software engineering (SWE) teams remain scalable, performant, and resilient. Working with platform teams, SRE helps manage and monitor infrastructure like Kubernetes. SRE teams build frameworks to automate processes for operations teams. They might also develop applications to handle DNS, load balancing, and service connections for network engineering teams.

These functions are crucial for any company competing in today’s tech world. However, because of the vast range of technologies and methods available, each SRE team takes a different approach.

SRE at Indeed

At Indeed, we established an SRE team in 2017 to increase attention on reliability goals and optimize value delivery for product development teams. Our SRE team uses an embedded model, where each team member works with a specific organization. They code custom solutions to automate critical processes and reduce toil for engineers.

Indeed SRE focuses on these key goals:

Promote reliability best practices. SRE helps product teams adopt and iterate on practices such as SLOs, SLIs, and error budget policies (a brief error-budget sketch follows these goals). They promote an Infrastructure as Code (IaC) model, writing code to automate the management of data centers, SLOs, and other assets. They also drive important initiatives to improve reliability and velocity, like Indeed’s effort to migrate products to AWS.

Drive the creation of reliability roadmaps. At Indeed, the SRE team spends more than 50% of their time on strategic work for roadmaps. They analyze infrastructure to define how and when to adopt new practices, re-architect systems, switch to new technologies, or build new tools. Once product teams approve these proposals, SRE helps design and implement the necessary code changes.

Strive for operational excellence. SRE works with product teams to identify operational challenges and build more efficient tools. They also guide the process of responding to and learning from critical incidents, adding depth to individual team retrospectives. Their expertise in incident analysis helps them identify patterns and speed up improvements across the company.
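
To make the SLO and error budget vocabulary concrete, here’s a minimal sketch of the arithmetic involved; the numbers and the helper function are illustrative, not Indeed’s actual tooling.

```ts
// Illustrative only: how an SLO implies an error budget.
const slo = 0.999;                  // target: 99.9% of requests succeed
const windowMinutes = 30 * 24 * 60; // 30-day rolling window

const errorBudget = 1 - slo;        // 0.1% of requests may fail
const budgetMinutes = windowMinutes * errorBudget; // ~43 minutes of downtime

// An SLI is the measured counterpart. When failures burn through the
// budget, an error budget policy (e.g., pausing risky launches) kicks in.
function budgetRemaining(totalRequests: number, failedRequests: number): number {
  return totalRequests * errorBudget - failedRequests; // negative => SLO missed
}

console.log(budgetMinutes.toFixed(1));        // "43.2"
console.log(budgetRemaining(1_000_000, 250)); // 750 failures left in budget
```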

Who works in Indeed SRE?

Our SRE team is diverse and global. We asked a few team members to talk about how they arrived at Indeed SRE.

Ted, Staff SRE

I love programming. Coming from a computer science background, I started my career as a software engineer. As I progressed in my role, I became interested in infrastructure-related challenges. How can we move a system to the cloud while minimizing costs? How do we scale a legacy service across several machines? What metrics should we collect, and how frequently, to tell whether a service is working as intended?

Later, I discovered that these questions are at the intersection of SWE and SRE. Without realizing it, I had implemented SRE methodology in every company I’d worked for! I decided to apply at Indeed, a company with an established SRE culture where I could learn—not only teach.

Working in Indeed SRE gives me more freedom to choose my focus than working as a SWE. I can pick from a range of tasks: managing major outages, building internal tools, improving reliability and scalability, cleaning up deprecated infrastructure, migrating systems to new platforms. My work also has a broad impact. I can improve scalability for 20+ repositories in different programming languages in one go, or migrate them to a new environment in a week. SRE has given me deeper knowledge of how services, from container orchestration tools to front-end applications, are physically managed, which makes me a better engineer.

Jessica, Senior SRE

Before joining Indeed SRE, I tried many roles, from QA to full-stack web developer to back-end engineer. Over time, I realized that I liked being able to fix issues that I identify. I wanted to communicate and empathize with the customer instead of being part of a feature factory. Those interests led me to explore work in operations, infrastructure, and reliability. That’s when I decided on SRE.

Now I support a team that works on a set of role-based access control (RBAC) services for our clients. All our employer-facing services use this RBAC solution to determine whether a particular user is authorized to perform an action. Disruptions can lead to delays in our clients’ hiring processes, so we have to make sure they get fast, consistent responses.

The best thing about being on the SRE team is working with a lot of very talented engineers. Together, we solve hard problems that software engineers aren’t often exposed to. The knowledge transfer is amazing, and I get to contribute to it.

Xiaoyun, Senior SRE Manager

When I joined Indeed in 2015, I was a SWE and then a SWE manager. At first I worked on product features, but gradually my passion shifted to engineering work. I started improving the performance of services, e.g., making cron jobs run in minutes instead of hours. This led me to explore tools for streaming process logs and database technology for improving query latency.

Then I learned about SRE opportunities at Indeed that focused on those subjects. I was attracted to the breadth and depth offered by SRE. Since joining, I have worked with a range of technologies, services, and infrastructure across Indeed. At the same time, I’ve had the opportunity to dive deep into technologies like Kafka and Hadoop. My team has diagnosed and solved issues in several complex AWS managed services.

Indeed also encourages SRE to write reliability-focused code. This makes my background useful: I enjoy using my SWE skills to solve these kinds of challenges.

Yusuke, Staff SRE

I joined Indeed in 2018 as a new university graduate. In school, I studied computer science and did a lot of coding. I learned different technologies from infrastructure to web front-end and mobile apps. Eventually I decided to start my career in SRE, which I felt utilized my broad skill set better than a SWE role would.

I started on a back-end team that builds the platform to enable job search at Indeed. To begin, we defined SLIs and SLOs, set monitors for them, and established a regular process to plan capacity. Soon we were re-architecting the job processing system for better reliability and performance. We improved the deployment process with more resilient tooling. I helped adopt cloud native technologies and migrate applications to the cloud. To track and share investigation notes, we also started building an internal knowledge base tool.

I enjoy Indeed SRE because I can flex different skills. Given the nature and scale of the systems we support, I get to share my expertise in coding, technologies, and infrastructure. SRE members with different backgrounds are always helping each other solve problems.

Building your SRE career

Develop a broad skill set

SRE works with a variety of systems, so it’s important to diversify your technical skills. Besides SWE skills, you’ll need an understanding of the underlying infrastructure. A passion for learning and explaining new technologies is helpful when making broader policy and tool recommendations.

Focus on the wider organization

SRE takes a holistic view of reliability practices and core systems. When working with shared infrastructure, your decisions can affect systems across the company. To prioritize changes, you need to understand how others are using those systems and why. Working across different teams is a positive way to achieve personal and professional growth, and it advances your SRE journey.

Join us at Indeed

If you’re a software engineer, pivoting to SRE gives you exposure to the full stack of technologies that enable a service to run. If you’re currently doing operational work (in SRE or elsewhere), Indeed’s broad approach can add variety to your workload. Each team we work with has its own set of reliability challenges. You’ll be able to pick projects that interest you.

Indeed SRE also provides opportunities to grow. Our SRE culture is well established and always expanding. You’ll work with SWE and other roles, learning from each other along the way.

If you’re interested in challenging work that expands your horizons, browse our open positions today.

Speed Matters, But It Isn’t Everything

Over the last few years at Indeed, we noticed our public-facing web applications were loading more slowly. We tested numerous ways to improve performance. Some were very successful, others were not.

We improved loading speeds by 40%, but we also learned that speed is not always the most important factor in the user experience.

Performance metrics

We measured loading speed using two key metrics: firstContentfulPaint and domContentLoadedEventEnd.

We chose a weighted average of the two instead of a single metric. This provided a more accurate measure of perceived load time and helped us answer two critical questions:

  • How long did the user wait before the page seemed responsive?
  • How long did the user wait before they could interact with the page?

Though these metrics came with tradeoffs, we decided to use them instead of Google Web Vitals because they gave the broadest coverage across our user base. After deciding on these metrics, we had simple, observable, and reportable data from hundreds of applications and across a variety of web browsers.
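
Both metrics are observable through standard browser timing APIs. The sketch below shows roughly how this kind of RUM data can be collected; the /rum endpoint is a hypothetical stand-in for a real reporting pipeline.

```ts
// When did the page seem responsive? Observe first-contentful-paint.
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.name === "first-contentful-paint") {
      navigator.sendBeacon("/rum", JSON.stringify({ fcp: entry.startTime }));
    }
  }
}).observe({ type: "paint", buffered: true });

// When could the user interact? Read domContentLoadedEventEnd from
// the navigation timing entry once the page has fully loaded.
window.addEventListener("load", () => {
  const [nav] = performance.getEntriesByType(
    "navigation"
  ) as PerformanceNavigationTiming[];
  navigator.sendBeacon("/rum", JSON.stringify({ dcl: nav.domContentLoadedEventEnd }));
});
```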

Successful methods for improving speed

While we tried many strategies, the following efforts provided the biggest increases in performance.

Flushing <Head/> early

Browsers generally use the most resources during page load when they are downloading and parsing static resources such as JS, CSS, and HTML files. To reduce this cost, we can send static content early, so the browser can begin to download and parse files even before those files are required. This eliminates much of the render-blocking time these resources introduce.

By flushing the HTML head early on multiple applications, we saw load time improvements of 5-10%.

This implementation comes with a few trade-offs, however, since flushing the HTML document in multiple chunks can produce confusing error modes. Once we’ve flushed the first part of the response, we can no longer change parts of the response such as the status code or cookies, even if an error occurs before the last part of the response is sent. We’ve implemented some common libraries that help with these complications.
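
As a minimal sketch of the idea, assuming Node’s built-in http module (the static file names and renderBody are hypothetical), the server commits its status and headers, flushes the head, and only then does the slow, data-dependent work:

```ts
import http from "http";

const HEAD = `<!DOCTYPE html><html><head>
  <link rel="stylesheet" href="/static/app.css">
  <script defer src="/static/app.js"></script>
</head><body>`;

http
  .createServer(async (req, res) => {
    // After the first write, the status code and headers (including
    // cookies) are locked in; this is the trade-off described above.
    res.writeHead(200, { "Content-Type": "text/html" });
    res.write(HEAD); // browser starts fetching app.css and app.js now

    try {
      const body = await renderBody(req); // slow, data-dependent rendering
      res.end(`${body}</body></html>`);
    } catch {
      // Too late to send a 500; degrade inside the page instead.
      res.end(`<p>Something went wrong.</p></body></html>`);
    }
  })
  .listen(3000);

// Hypothetical placeholder for the application's real rendering.
async function renderBody(req: http.IncomingMessage): Promise<string> {
  return "<main>…</main>";
}
```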

Reducing files on the critical path

Apart from the total number of bytes, one of the most important factors in page load time is the total number of resources, especially render-blocking resources, required on the critical path for rendering. In general, the more blocking files you request, the slower the page. For example, a 100kB page served as 5 files will be significantly faster than a 100kB page served as 10 files.

In an A/B test, we reduced the number of render-blocking files from 30 to 12, a 60% reduction, while the total number of bytes shipped during page load remained roughly identical. This test provided a 2+ second improvement in domContentLoadedEventEnd at the 95th percentile for our desktop and mobile search pages, as well as significant improvements in largestContentfulPaint.

To dive into this further, we explored the cost of a single extra CSS file. We ran a test on one of our highest-trafficked pages that reduced the number of CSS files by one. Page load times improved by a statistically significant amount, about 15ms at the 95th percentile.
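
Build tooling can help enforce a cap like this. As one hedged example (illustrative numbers, not our production settings), webpack’s splitChunks options can merge small chunks and limit how many requests an entry point triggers:

```ts
import type { Configuration } from "webpack";

const config: Configuration = {
  optimization: {
    splitChunks: {
      chunks: "all",
      // Prefer fewer, larger bundles on the critical path: cap the
      // number of parallel requests needed to load an entry point.
      maxInitialRequests: 6,
      // Merge tiny chunks rather than shipping each as its own file.
      minSize: 50_000,
    },
  },
};

export default config;
```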

Improving the runtime cost of CSS-in-JS

As more of our applications started using our newest component library, built on top of the Emotion library, we noticed 40% slower page loads.

The Emotion library supports CSS-in-JS, a growing industry trend. We determined that rendering CSS-in-JS components added extra bytes to our JavaScript bundles, and that the runtime cost of this new rendering strategy, along with the added bytes, caused the slowdown. We built a webpack plugin that precompiles many of our most commonly used components, reducing their render costs and helping address the problem.

This strategy resulted in a massive improvement, decreasing the slowdown from 40% to about 5% in aggregate at the 95th percentile. However, the CSS-in-JS approach still incurred more runtime cost than more traditional rendering approaches.
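
The sketch below illustrates the general idea rather than our plugin’s actual implementation: with Emotion, styles are serialized, hashed, and injected at runtime, while a build step can hoist fully static styles into a plain stylesheet and leave behind only a precomputed class name (shown here with a hypothetical generated name):

```ts
import { css } from "@emotion/css";

// Runtime CSS-in-JS: this call serializes the styles, hashes them,
// and inserts a <style> rule while the page is loading.
const runtimeClass = css`
  color: #2557a7;
  padding: 8px;
`;

// After precompilation, the same styles ship in a regular CSS file
// and the component just references the generated class name:
const precompiledClass = "css-1a2b3c"; // hypothetical build output
```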

Factors outside our control

Along with testing improvements, we analyzed the types of users, locales, and devices that had an impact on page speeds.

Device type and operating system

For Android devices, which are generally lower-powered than their iOS counterparts, we saw 63% slower timings for firstContentfulPaint and 107% slower timings for domContentLoadedEventEnd.

Windows users saw 26% slower timings for domContentLoadedEventEnd compared to their iOS counterparts. These results were somewhat expected, since Windows devices tend to be older.

This data provided important takeaways:

  • The performance impact of features and additional code is non-linear: newer, more powerful devices can take on 100kB of additional code without a noticeable impact on performance, while older devices see a much bigger slowdown.
  • Testing applications using real user monitoring (RUM) is critical to understanding performance, since performance varies so widely with the device and operating system’s capabilities.

Connection type and network latency

We used the Network Information API to collect information about various connection types. The API is not supported in all browsers, which makes this data incomplete. However, it did allow us to make notable observations:

  • 4G connection types were 4 times faster than 3G, 10 times faster than 2G, and 20 times faster than connections that were less than 2G. Put another way, network latency accounts for a huge percent of our total latency.
  • For browsers that report connection type information, 4G connection types make up 95% of total traffic. Including all browser types drops this number closer to 50%.
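
Collecting this data is straightforward where the API exists. Here’s a sketch with feature detection, since navigator.connection is nonstandard; reportMetric and loadHeroVideo are hypothetical helpers:

```ts
// Subset of the nonstandard NetworkInformation interface we rely on.
type NetworkInformation = {
  effectiveType?: "slow-2g" | "2g" | "3g" | "4g";
  rtt?: number;      // estimated round-trip time in ms
  downlink?: number; // estimated bandwidth in Mbps
};

declare function reportMetric(name: string, value: string): void; // hypothetical
declare function loadHeroVideo(): void;                           // hypothetical

const connection = (navigator as Navigator & { connection?: NetworkInformation })
  .connection;

if (connection?.effectiveType) {
  reportMetric("connection.effectiveType", connection.effectiveType);

  // Skip expensive assets on slower connections, as the next paragraph describes.
  if (connection.effectiveType === "4g") {
    loadHeroVideo();
  }
}
```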

Networks vary greatly by country, and for some countries it takes over 20 seconds to load a page. By excluding expensive features such as big images or videos in certain regions, we deliver simpler, snappier experiences on slower networks.

This is by far the simplest way to improve performance, though maintaining different experiences for different regions adds complexity of its own.

Results of speed and other factors

The impact of performance on the web varies. Companies such as Amazon have reported that slowdowns of just 1 second could result in $1.6 billion in lost sales. However, other case studies have reported a more muddled understanding of the impact of performance.

Over the course of our testing, we saw some increases in engagement alongside performance improvements. But we’re not convinced they were driven by the performance improvements alone.

Reliability vs speed

Our current understanding of these increases in engagement is that they are based on increased reliability rather than an improvement in loading speed.

In tests where we moved our static assets to a content delivery network (CDN), we saw engagement improvements, but we also saw indications of greater reliability and availability. In tests that improved performance but not reliability, we did not see strong improvements in engagement.

The impact of single, big improvements

In tests where we improved performance by a second or more (without improving reliability), we saw no significant changes in our key performance indicators.

Our data suggests that for non-commerce applications, small to medium changes in performance do not meaningfully improve engagement.

Engagement vs performance

Our observations reminded us not to equate performance with engagement when analyzing our metrics. One stark example of this point was the different performance metrics observed for mobile iOS users versus mobile Android users.

While Android users had nearly 2 times slower rendering, there was no observable drop in engagement when compared to iOS users.

So when does speed matter?

After a year of testing strategies to improve speed, we found several that are worth the effort. While the resulting improvements were measurable, they were not significant enough to drive changes in our key performance indicators.

The bigger lesson is that while a certain level of speed is required, other factors matter too. The user’s device and connection play a large role in the overall experience. The silver lining: because we cannot fully control all of these factors, we can be open to architectural strategies that aren’t specifically designed for speed. Making minor trade-offs in speed for improvements in other areas can result in an overall better user experience.

Obligation and Opportunity

A good friend of mine who’s been in engineering leadership at a handful of early-stage companies recently had something interesting to say about core values:

I’m never putting ‘Accountability’ as a core value again. I’ve tried it three different ways at three very different companies and it always ends up the same. ‘Accountability’ just ends up being something everybody wishes everybody else would take more of. It’s a stick to beat people with, instead of a core value to practice.

That echoes something I noticed as Indeed grew rapidly through the 2010s. As the company grew larger and more complex, it became harder and harder to improve shared capabilities that fall outside any given team’s scope. Over the last couple of years, I’ve occasionally heard some variation of one of the following:

  • Whose responsibility is (thing X)?
  • We should make (aspect Y) somebody’s responsibility.
  • Why doesn’t leadership make (task Z) somebody’s job?

The thing is: responsibility can’t be given; it can only be taken.

I’ve had the pleasure of working with hundreds of colleagues over the last decade. Every one of them is a highly qualified professional who would thrive on many different teams inside Indeed and at many different organizations outside it. If one of their managers insisted on assigning them tasks that were neither interesting nor transparently impactful, it wouldn’t be long before that individual quite rightly started asking what other positions might be available.

Indeed’s engineering leadership has emphasized the coaching model of leadership over command-and-control management ever since you could count the engineering managers on one hand. In this model, a coach’s job isn’t to assign tasks or obligations. Coaches work with people to identify opportunities, help them choose between opportunities, and then help them realize those opportunities.

One of my favorite examples of seeing opportunity versus obligation play out in practice is ownership of the retrospective after a production outage. Indeed has long championed the habit of blameless retrospectives: focusing attention on understanding contributing factors and preventing recurrence, rather than fault-finding.

Nevertheless, I’ve heard a hundred times in the heat of the moment: “that team broke things, they should own it.” From my point of view, this is a little wide of the mark. Driving a retrospective is an opportunity, not an obligation. You grab the baton on a retrospective when you happen to be well positioned to prevent recurrence, regardless of whether you were anywhere near the triggering condition.

As for individuals, so for teams

We do ask teams to take on specific responsibilities… but we explicitly list out probably fewer than you imagine. When a team has a service running in production, they take on responsibility for making sure that service stays healthy, responsive, and compliant with company policies. We don’t mandate that teams respond to feature requests within a certain timeframe, that they support specific features, or that they use specific technologies.

Instead, we ask them to look for opportunities. Where will supporting new users help other teams onboard to the solution they’re building? Which features will help them accomplish their mission? Where can they find discontinuous advantage by adopting a different underlying technology?

As the engineering lead for a group of platform teams, I get a lot of chances to think about obligation versus opportunity. For example, we provide a modular browser-based UI platform. The bulk of code written against that platform is not written by the team itself. It is written by product teams creating product-specific modules. The platform team members clearly aren’t obligated to monitor the browser-side errors emitted by those modules, and it would be wholly unscalable to try and make them responsible. But at least for now, they can and they do. The opportunity to help product teams that are less familiar with deploying and maintaining modules is just too good to pass up. It won’t scale forever but, while it does, it significantly eases adoption by new teams and helps the platform team see where their users run into trouble.

Our communications platform team helps product teams message job seekers and employers over various channels. Through the years, the team has worked through just about every flavor of this when partnering with core infrastructure teams:

  • Years ago, when Postfix performance was a dramatic bottleneck, the core infrastructure team took the feedback, fixed the performance problem, and has maintained it ever since. Responsibility taken.
  • When various issues affected the durability guarantees our message queues could offer, the core infrastructure team didn’t have a clear path to be able to provide the hard guarantees we needed. We worked around the problem by detecting and re-sending messages after any end-to-end delivery failure. Responsibility declined.
  • When we needed to move away from a proprietary key-value store that had been deprecated, an infrastructure team working with OpenStack was very interested in building out a Ceph-based solution. We worked closely with them to prototype the solution, but it became clear that timeline pressure would not allow the solution to provide sufficient performance guarantees soon enough. We fell back on using S3, with the option to cost-optimize in Ceph later. Responsibility desired, but not feasible.

These examples spotlight some really important themes. Responsibility cannot be assigned based on the logic of team names alone. It can only be taken based on a team’s desire and ability to fulfill it. A team named “Storage Systems” is not obligated to support OracleDB simply because they’re the Storage Systems team. If their roadmap takes them in a different direction that meets the needs of their clients and stakeholders, it’s their decision.

Similarly, desire alone is not sufficient. When a much smaller Indeed first experimented with Cassandra, the experiment didn’t fail because of an inherent flaw in the technology. It withered because we didn’t have the in-house expertise and capacity to successfully operate a large-scale cluster through all the vagaries that occur in production. We wanted it to work and teams were happy to try and figure it out… it just ended up not being feasible.

Getting your opportunities noticed

So what does that mean for Thing X, Aspect Y, Task Z, and all of the other wish-list items that people come across in the course of a normal workday? If managers can’t just make those somebody’s job, then how on earth do we make progress on the opportunities that no one’s yet taken?

Two basic prerequisites make the opportunity-driven model effective: one mechanical, one cultural. Unsurprisingly, the mechanical aspect is easier.

The coach’s responsibilities that I listed earlier are identifying opportunities, selecting opportunities, and realizing opportunities.

Product-driven delivery organizations like Indeed already spend a lot of effort continuously improving their ability to deliver software to production. I won’t spend a lot of time on realizing opportunities here.

Identifying opportunities is also a core skill for product delivery teams. Where we needed to invest significant effort was in surfacing them effectively. Primarily, that means making sure that good ideas end up on the radar of the people who are able to act on them.

An audience-friendly intake process is a crucial component for teams serving internal customers. Audience friendliness involves several critical aspects.

  • It must be lightweight: incomplete ideas can be fleshed out later; lost ideas are gone for good.
  • It must be responsive, since nothing demotivates a colleague so much as finding their suggestions lost in a black hole.
  • Finally, it must operate at a high enough level in the organization. Individual delivery teams typically have narrow, carefully defined scopes that let them focus. That’s smart for delivery efficiency, but people outside the team can’t reasonably be expected to understand fine-grained subdivisions.

An effective intake process requires something of requesters as well. Making sure the rationale and assumptions behind a request are crystal clear—even when they seem obvious to you—makes it far easier for future engineers to notice and get psyched about the opportunity you’re presenting. Understanding and communicating a value proposition is good practice for any up-and-coming engineer and greatly increases the odds of somebody selecting your opportunity.

A culture of ownership

Of course, relying on others to pick up and run with opportunities requires a lot of trust in your colleagues. You trust that your priorities are generally aligned, so that your rationale will be compelling. You also trust that most everyone is generally hungry for good opportunities and will look for ways to make them happen.

Another way of framing that is that an opportunity-driven model can only work in a high ownership culture. At Indeed, we don’t tend to frame things in terms of obligations and accountability, because we’ve worked hard to develop a culture in which individuals and teams hold themselves accountable. Once a team or an individual has chosen to adopt a responsibility, they will see it through.

My long-time colleague, Patrick Schneider, illustrates the idea of high ownership nicely. He looked at the daily question of “How should I spend my time?” through the lens of a RACI breakdown for an individual displaying various degrees of ownership. RACI stands for responsible, accountable, consulted, and informed.

How should I spend my time?

Patrick Schneider | May 16, 2019

Level of ownership: High
  • Responsible: Me. I decide how to spend my time.
  • Accountable: Me. I am able to describe what actions I have taken, which tasks I have completed, and provide justification for each.
  • Consulted: OKRs, my product manager, my team, other teams, … I consult whoever is necessary until I’m confident that I’m spending my time well.
  • Informed: My team, my product manager, Jira, Slack, etc. I regularly and proactively let people know what I am spending my time on.

Level of ownership: Medium-High
  • Responsible: Me. I choose from curated options how to spend my time.
  • Accountable: Me. I am able to describe what actions I have taken and which tasks I have completed.
  • Consulted: Me. I have choices or recommendations from my manager, product manager, or others after they have consulted whoever they believe is appropriate.
  • Informed: My team, my product manager, Jira, Slack, etc. People usually know what I am working on.

Level of ownership: Medium-Low
  • Responsible: My manager or my product manager decides how to spend my time.
  • Accountable: My manager, my product manager, or automation. They describe the things I have completed and the actions I have taken.
  • Consulted: Me. I am provided choices or recommendations by my manager or product manager, after they have consulted whoever they believe is appropriate.
  • Informed: My manager or product manager. I inform them about what I am working on; they may inform whoever else they believe is appropriate. Jira is usually up-to-date.

Level of ownership: Low
  • Responsible: My manager, my product manager, or non-humans (e.g., the next email in my Inbox) decide how to spend my time.
  • Accountable: Unknown or opaque. Many things are in progress or being worked on; work is described in the continuous tense, often with “-ing” verbs. The state of completion is rarely reached or described.
  • Consulted: Unknown or opaque. My manager or product manager consults whoever they believe is appropriate, or randomness and algorithms decide.
  • Informed: Unknown or opaque. My manager or product manager informs whoever they believe is appropriate. Others may or may not find out about my work.

Putting it all together

Accountability is a critical attribute of high-performance teams, but it isn’t well-served by simply being named a core value. Instead, you need to instill a culture of high individual ownership, establish processes that spotlight opportunities, and empower your teams to chase the opportunities most meaningful to their mission.
