The Importance of Using a Composite Metric to Measure Performance

A still image depicting a page loading evenly over four seconds

In the past, Indeed has used a variety of metrics to evaluate our client-side performance, but we’ve tended to focus on one at a time. Traditionally, we chose a single performance metric and used it as the measuring stick for whether we were improving or degrading the user experience. 

This made it simple to track performance because we only needed to instrument and monitor a single datapoint. Technical and non-technical consumers could easily parse this information and understand how we were doing as an organization.

However, this type of thinking also brought about significant drawbacks that, in many cases, ended up resulting in overall degraded performance and wasted effort. This document examines those drawbacks, and suggests that using a “composite metric” enables us to much better measure what our users are experiencing. 

Past Performance Measurements

Below we look at a few metrics we’ve used to try and understand client-side performance, attempting to answer the following questions:

“When did the main JavaScript for the page execute?” —  JSV Delay

One of the earliest metrics widely used at Indeed was “JSV Delay” (JavaScript Verification Delay), which measured when our JavaScript had loaded, parsed, and begun to execute. It was instrumented as a client-side network request fired the moment our main JavaScript bundle started executing.

This metric was helpful for catching regressions: adding extra JS, or adding content before the JS bundle, showed up as a slowdown in JSV Delay. Over time the measurement was widely adopted, but it suffered from significant issues:

  • Failure to capture the performance impact of third-party content (Google Analytics, micro frontends, etc)
  • Inability to measure what the user was actually experiencing: even when the JS had loaded, the page often wasn’t usable yet, and the time to usability wasn’t being measured
  • Bespoke implementation meant we weren’t measuring performance uniformly across our pages; JSV Delay meant something different from one page to another
  • Because the metric was a standard only inside Indeed, no one outside really knew what it meant, and we were continually explaining the metric, its advantages, and its downsides

“When did all critical CSS and JavaScript load?” — domContentLoadedEventEnd

Once JSV Delay was no longer serving our needs, we decided to adopt a metric more broadly used across the software industry: domContentLoadedEventEnd, which marks when the browser’s DOMContentLoaded event has finished. DOMContentLoaded is defined as firing:

when the HTML document has been completely parsed, and all deferred scripts… have downloaded and executed. It doesn’t wait for other things like images, subframes, and async scripts to finish loading.

In layman’s terms, we can interpret domContentLoadedEventEnd as a more generalized JSV Delay: it fires only after critical HTML, CSS, and JavaScript have loaded. This gave us a much better idea of how the page as a whole was performing, and because it was no longer a custom metric, it reduced confusion and ensured we were measuring performance uniformly across all of our pages. However, this metric too came with significant issues:

  • domContentLoadedEventEnd doesn’t capture async scripts, which means it misses significant portions of the page
  • Similar to JSV Delay, the fact that much of the code had loaded didn’t necessarily mean the page was interactive
  • On some pages, domContentLoadedEventEnd could fire while the page was still entirely blank (e.g., single-page applications).

“When did users see the most important content on the page?” — largestContentfulPaint

Our last use of “a single metric to explain performance” was largestContentfulPaint (LCP). This was a big step forward for us: it was our first adoption of a Google-recommended metric, one created to measure an ever-evolving web landscape.

This allowed us to, for the first time, use a metric that captured “perceived performance,” rather than a more arbitrary datapoint from a browser API. By using LCP, we were making a conscious choice to measure the actual user experience, which was a big step in the right direction. 

Because Indeed uses server-side rendering on high-traffic job search pages, where HTML is visible to users immediately on the initial page load, LCP corresponded to the moment when users first saw job cards, the job description, and other critical content. The faster we show users content, the more time we save them and the more delightful the experience.

Again, however, this measurement came with significant issues:

  • LCP is not supported in iOS browsers (Safari/WebKit) or other legacy browsers, which means we fail to capture this metric for a large percentage of our page loads and users.
  • Although users can see the critical content at LCP, the page probably isn’t interactive yet.
  • LCP is a web-based metric, only collectible in web browsers, and thus excludes native applications. 

Differing Page Loads 

The lifecycle of a page is complex — from a technical perspective, a lot happens between the initial navigation to a page and when a user begins interacting with its content. The core problem with using a single metric to understand this complex workflow is that it strips away much of the context necessary to understand “how the user perceived the page load”.

Let’s consider the following diagram:

Animated timeline showing a page loading evenly over four seconds

Here we see a standard page that takes 4 seconds to load. The job seeker sees a blank page for the first second; a second later they see a header and a loading indicator; a second after that they see the main content of the page (LCP); and a second later the page is fully interactive. Now let’s take a look at the next diagram: 

Animated timeline showing a page loading four seconds, with the first three changes happening more quickly

Here we see the same page loading, but the main content appears much more quickly! However, we then wait 2.5 seconds for the page to become interactive. If we were using a single metric, say LCP, we would believe the second page is much faster, while users would actually be experiencing a lot of frustration waiting for the page to become interactive. 

Finally, let’s look at this scenario: 

Animated timeline showing a page loading four seconds, with the last three changes happening quickly near the end of the four seconds

Here the page still takes 4 seconds to load, but users don’t see any content until the last second. Intuitively this is a poor experience: for most of the load we’re looking at a blank page, with no indication that anything is working or loading at all. Again, if we chose a single metric, we wouldn’t be capturing the actual perceived experience of the page load. What if we improved the time to initial content from 3.5 seconds to 2, while total loading time stayed the same? The user would feel that the page is faster, but we wouldn’t capture that improvement. 

The Single Metric Problem

As we can see from the above, the lifecycle of a page can be highly variable, where small changes can have big impacts on how users perceive performance. When we look back on our historical performance measurements which utilized the “single metric approach”, we see two fundamental issues:

One metric can’t capture perceived performance

Holistic performance cannot be captured by a single metric — as depicted in the diagrams above, there is no single point in a page load which measures how quickly a user becomes engaged with content. 

There are thousands (or an infinite number?) of ways to build a web page, and each brings its own trade-offs when it comes to performance. 

For pages that don’t implement server-side rendering (SSR), if we chose to measure only firstContentfulPaint, we would be measuring a datapoint with effectively no value, since it would fire when the essentially empty application shell was first rendered. 

For single-page applications, if we chose to measure only time to interactive (TTI), we would be ignoring how quickly users saw initial content and how quickly they could begin to interact with the page. Although TTI is an important indicator, it fails to precisely capture when a page is truly interactive. 

Another problem with using a single metric is that our pages change over time, and as a result, so does the way users perceive their performance. Using the above examples, what if an application moved from a client-side rendered approach to a server-side rendered one? If we stuck with the same performance measurement, say TTI, we might conclude that we had hurt performance, when in reality we’re now showing content to the user much sooner, with only a negligible impact on TTI. Overall, the perceived page performance would be drastically improved, but we would fail to measure it. 

From a business and organizational perspective, that’s an observability gap with profound implications for how we spend our time and effort. 

Improving one metric often degrades another

The second, and perhaps more significant, issue with using a single metric to measure speed is that it often results in degraded performance without us realizing it. 

The easiest way to improve performance is to ship fewer bytes and render less content overall. In reality, that’s not always a decision we can make for the business. So as we try to improve performance, we often end up in situations where we can improve a single metric, but the change either has no bearing on holistic performance or actually hurts it! 

Let’s take a look at a new diagram (depicted below):

Animated timeline showing a page loading four seconds, with the page becoming progressively more useful over the four seconds

Here the page begins loading normally; at the 2-second mark the main content is visible and the page is interactive. At this point users can accomplish their primary goal on the page (say, applying for a job). At the 3-second mark more content pops in, and a second later all content is visible. This is a common loading pattern for async or client-side rendered applications (e.g., single-page apps). 

Ideally, we’d like to shift each of these frames to the left, improving the perceived performance of each step. However, if we were only measuring time to interactive, which occurs in frame 4, we would completely disregard the most important part of the page load: how quickly we can make the main content of the page visible and interactive (frame 2). Similarly, if we only measured LCP (which occurs in frame 2), we would be disregarding TTI, the point in frame 4 at which all of the content is finally visible. 

In this example, no single metric captures the true performance of the page; rather, it’s a collection of metrics that helps us understand the true perceived performance. 

Perceived performance depends heavily on how quickly the page loads, but perhaps even more on how it loads. 

Using a Composite Metric: Lighthouse Explained

Finally, this brings us to the use of a “composite metric,” a term from statistics that simply means “a single measurement based on multiple metrics.” With a Lighthouse score, we’re able to derive a single score from 5 data points, each of which represents a different aspect of a page load. 

These data points are:

A table showing the different metrics in the composite Lighthouse score, and how they're weighted

For brevity, we won’t go into detail on each data point; you can read more about these page markers here. At a high level, industry experts have agreed upon these 5 markers and weighted them according to how much they contribute to a user perceiving a page as fast and responsive. 

As is hopefully evident based on the explanations above, the purpose of using these 5 data points is to best capture the holistic perceived performance. We weight LCP, total blocking time (TBT), and cumulative layout shift the highest because we believe these are the most important indicators of speed. FCP and speedIndex are contributors but less significant overall. 

During each page load, we’re able to calculate all of these metrics and use an algorithm to derive a single score; page loads that score >= 90 are considered “fast and responsive,” while scores below 90 are in need of improvement.
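
To make the arithmetic concrete, here is a minimal sketch in Python. The weights are assumed from Lighthouse v10 (TBT 30%, LCP 25%, CLS 25%, FCP 10%, Speed Index 10%), and the per-metric scores are hypothetical; the real Lighthouse algorithm first converts each raw metric value into a 0–100 score (using log-normal scoring curves derived from real-world page-load data) and then applies the weighting.

```python
# Minimal sketch of a weighted composite score. Weights are assumed from
# Lighthouse v10; each input is an already-normalized 0-100 score for that
# metric (Lighthouse derives these from raw values via log-normal curves).

WEIGHTS = {
    "fcp": 0.10,          # First Contentful Paint
    "speed_index": 0.10,  # Speed Index
    "lcp": 0.25,          # Largest Contentful Paint
    "tbt": 0.30,          # Total Blocking Time
    "cls": 0.25,          # Cumulative Layout Shift
}

def composite_score(metric_scores: dict) -> float:
    """Weighted average of per-metric scores (each 0-100)."""
    return sum(weight * metric_scores[name] for name, weight in WEIGHTS.items())

# Hypothetical page load: strong paint metrics, weak interactivity.
scores = {"fcp": 95, "speed_index": 90, "lcp": 88, "tbt": 60, "cls": 100}
print(round(composite_score(scores)))  # 84 -- below the 90 "fast" threshold
```

Because TBT carries the largest weight, a regression in interactivity drags the composite below 90 even when the paint metrics look excellent, which is exactly the blind spot a single paint metric would hide.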

Composite Metrics in Action

If we use the same page load diagram from above, we can imagine how using a composite metric allows us to fully capture performance for our users.

A still image depicting a page loading evenly over four seconds

Let’s run through a few scenarios: 

If we shipped a change that improved FCP and LCP (frames 1 and 2) and did no harm to frames 3 and 4, we would see an improvement in our overall Lighthouse score.

If we shipped a change that improved FCP and LCP (frames 1 and 2) but degraded frames 3 and 4, we would see no improvement in our overall Lighthouse score.

If we shipped a change that improved FCP but degraded frames 2, 3, and 4, we would see an overall degradation, one that we would have missed had we been monitoring only a single metric. 

Why Can’t We Simply Use “Time to Interactive” (TTI)? 

This is a common question within the performance realm, so I wanted to address it here and explain how it relates to composite metrics. 

First, what is TTI? The most common definition is as follows: 

TTI is a performance metric that measures a page’s load responsiveness and helps identify situations where a page looks interactive but actually isn’t. TTI measures the earliest time after First Contentful Paint (FCP) when the page is reliably ready for user interactivity.

This sounds great, so why not just use this? Isn’t the most important thing for performance when the page is interactive? 

Like all things in software, there’s nuance and tradeoffs. Let’s look at the pros and cons:

Pros:

  • A single metric which estimates how long the overall page took to become usable

Cons:

  • TTI is no longer recommended and has been removed from Lighthouse calculations because it isn’t considered an accurate metric across a wide variety of page load types (CSR, SSR, etc).
  • TTI is an estimate based on network activity and long tasks on the main thread, not an actual marker of page completion.
  • Because TTI is just a single metric, it suffers from “the single metric problem” which is explained above.

My point here isn’t that TTI is bad, but rather that it’s an incomplete way of looking at performance. TTI is a useful indicator, but it’s only meaningful when viewed in the context of our other metrics (FCP, LCP, etc). TTI’s main purpose is to provide a corroborating metric, rather than to explain performance overall. 

As an organization, we can imagine hundreds of ways to improve TTI without actually improving the most critical aspects of perceived performance. Additionally, we can imagine ways which improve TTI that actually hurt the earlier marks of a page load, which may result in degraded performance overall. 

Conclusions 

My hope for readers who have made it this far is that you now have a more nuanced understanding of how we can measure client-side performance. In the early days of the web we developed metrics that helped us figure out how fast static pages were loading — as the web advanced (thanks a lot, jQuery!), so too have our measurements.

Based on the past ~4 years of deep investment in performance improvements at Indeed, these are my most important takeaways: 

  • Use a composite metric, but be willing to change the underlying internal metrics.
  • Be wary of the silver bullet — metrics or tools that purport to capture everything you need nearly always don’t. 
  • Technology changes, and we need to change how we measure performance as a result.
  • Corroborate your speed metrics with how your page loads, and ensure they actually represent what users are experiencing. 

SHAP Plots: The Crystal Ball for UI Test Ideas

Photo by Sam on Unsplash

 

Have you ever wanted a crystal ball that would predict the best A/B test to boost your product’s growth, or identify which part of your UI drives a target metric?

With a statistical model and a SHAP decision plot, you can identify impactful A/B test ideas in bulk. The Indeed Interview team used this methodology to generate optimal A/B tests, leading to a 5-10% increase in key business metrics.

Case study: Increasing interview invites

Indeed Interview aims to make interviewing as seamless as possible for job seekers and employers. The Indeed Interview team has one goal: to increase the number of interviews happening on the platform. For this case study, we wanted UI test ideas that would help us boost the number of invitations sent by employers. To do this, we needed to analyze their behavior on the employer dashboard, and try to predict interview invitations.

Employer using Indeed Interview to virtually interview a candidate.

Convert UI elements into features

The first step of understanding employer behavior was to create a dataset. We needed to predict the probability of sending interview invitations based on an employer’s clicks in the dashboard.

We organized the dataset so each cell represented the number of times an employer clicked a specific UI element. We then used these features to predict our targeted action: clicking the Set up interview button vs. not clicking on the button.

Set up interview button on the employer dashboard
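
To make the shape of this dataset concrete, here is a minimal sketch in Python. The employer IDs, UI element names, and column names are all hypothetical, not the team’s actual schema.

```python
import pandas as pd

# Hypothetical raw click log: one row per click on a dashboard UI element.
clicks = pd.DataFrame({
    "employer_id": [1, 1, 1, 2, 2, 3],
    "ui_element": ["feature_a", "feature_b", "feature_b",
                   "feature_a", "feature_g", "feature_b"],
})

# Hypothetical label: 1 if the employer later clicked "Set up interview".
labels = pd.Series({1: 1, 2: 0, 3: 1}, name="clicked_setup_interview")

# One row per employer, one column per UI element, cells = click counts.
features = (
    clicks.groupby(["employer_id", "ui_element"])
    .size()
    .unstack(fill_value=0)
)

dataset = features.join(labels)
```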

Train the model on the target variable

The next step was to train a model to make predictions based on the dataset. We selected a tree-based model, CatBoost, for its overall superior performance and its ability to detect interactions among features. And, like any model, it works effectively with our interpretation tool: the SHAP plot.

We could have used correlation or logistic regression coefficients, but we chose a SHAP plot combined with a tree-based model because it provides unique advantages for model interpretation. Two features with similar correlation coefficients can have dramatically different interpretations in a SHAP plot, which factors in feature importance. In addition, a tree-based model usually outperforms logistic regression, yielding a more accurate model. Combining a SHAP plot with a tree-based model gives us both performance and interpretability.
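
Continuing the hypothetical feature table from the sketch above, the training step might look roughly like this; the hyperparameters are illustrative, not the team’s actual settings.

```python
from catboost import CatBoostClassifier

# Features: click counts per UI element; target: clicked "Set up interview" or not.
X = dataset.drop(columns="clicked_setup_interview")
y = dataset["clicked_setup_interview"]

model = CatBoostClassifier(
    iterations=500,       # illustrative hyperparameters only
    depth=6,
    learning_rate=0.1,
    verbose=False,
)
model.fit(X, y)
```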

Interpret SHAP results into positive and negative predictors

Now that we have a dataset and a trained model, we can interpret the SHAP plot generated from them. SHAP works by showing how much a given feature changes the prediction value. In the SHAP plot below, each row is a feature, and the features are ranked in descending order of importance: the ones at the top are the most important and have the highest influence (positive or negative) on our targeted action of clicking Set up interview.

The data for each feature is displayed with colors representing the scale of the feature. A red dot on the plot means the employer clicked a given UI element many times, and a blue dot means the employer clicked it only a few times. Each dot also has a SHAP value on the X axis, which signifies the type of influence, positive or negative, that the feature has on the target and the strength of its impact. The farther a dot is from the center, the stronger the influence.

SHAP plot displaying features A-O ranked by descending influence on the model (regardless of positive or negative). Each feature has red and blue dots (feature value) organized by SHAP value (impact on model output). Features outlined in red: A, B, D, F, H, I, K, L, and N. Features outlined in blue: E, G, M, and O.

SHAP plot with features outlined in red for positive predictors, and blue for negative predictors
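
Continuing the sketch above, a plot like this can be generated directly from the trained model with the SHAP package; the summary (beeswarm) plot is what produces the red and blue dots described here.

```python
import shap

# TreeExplainer supports tree ensembles such as CatBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Beeswarm summary plot: one row per feature, ranked by overall importance.
# Dot color is the feature value (red = many clicks, blue = few) and the
# x position is the SHAP value (its impact on the model's prediction).
shap.summary_plot(shap_values, X)
```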

Based on the color and location of the dots, we categorized the features as positive or negative predictors.

  • Positive Predictor – A feature where red dots are to the right of the center.
    • They have positive SHAP value: usage of this feature predicts the employer will send an interview invitation.
    • In the SHAP plot above, Feature B is a good example.
  • Negative Predictor – A feature where red dots are to the left of the center.
    • They have negative SHAP value: usage of this feature predicts the employer will not send an interview invitation.
    • Feature G is a good example of this.

Features with red dots on both sides of the center are more complex and need further investigation, using tools such as dependence plots (also in the SHAP package).
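
Continuing the sketch above, a dependence plot for one such feature takes a single call (“feature_b” is the hypothetical column name from the earlier snippets).

```python
# SHAP value of feature_b plotted against its own value, colored by the
# feature it interacts with most strongly (chosen automatically by SHAP).
shap.dependence_plot("feature_b", shap_values, X)
```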

Note that these relationships between features and the target are not yet causal. A model can claim causality only under the assumption that all confounding variables have been included, which is a strong assumption. While the relationships could be causal, we don’t know for certain until they are verified in A/B tests.

Generate test ideas

Our SHAP plot contains 9 positive predictors and 4 negative predictors, and each one is a potential A/B test hypothesis of the relationship between the UI element and the target. We hypothesize that positive predictors boost target usage, and negative predictors hinder target usage.

To verify these hypotheses, we can test ways to make positive predictors more prominent and direct the employer’s attention to them. After the employer clicks on the feature, we can direct attention to the target in order to boost its usage. Another option is to test ways to divert the employer’s attention away from negative predictors: we can add good friction, making them harder to access, and see if usage of the target increases.

Boost positive predictors

We tested changes to the positive predictors from our SHAP plot to make them more prominent in our UI. We made Feature B more prominent on the dashboard, and directed the employer’s attention to it. After the employer clicked Feature B, we showed a redesigned UI with improved visuals to make the Set up interview button more attractive.

The result was a 6% increase in clicks to set up an interview.

Divert away from negative predictors

We also tested changes to the negative predictors from our SHAP plot in the hopes of increasing usage of the target. We ran a test to divert employer attention away from Feature G by placing it close to the Set up interview button on the dashboard. This way it was easier for the employer to choose setting up an interview instead.

This change boosted clicks to send interview invitations by 5%.

Gaze into your own crystal ball

A SHAP plot may not be an actual crystal ball. When used with a statistical model, however, it can generate UI A/B test ideas in bulk and boost target metrics for many products. You might find it especially suitable for products with a complex and nonlinear UI, such as user dashboards. The methodology also provides a glimpse of which UI elements drive the target metrics the most, allowing you to focus on testing features that have the most impact. So, what are you waiting for? Start using this method and good fortune will follow.

 

Cross-posted on Medium

Indeed SRE: An Inside Look

Photo by Kevin Ku on Unsplash

Indeed adds over 30 million jobs online every month, which helps connect 250 million job seekers to prospective employers. How do we keep our services available, fast, and scalable? That’s the ongoing challenge for our site reliability engineering (SRE) team.

What is SRE?

The idea behind SRE is simple: The team ensures that a company’s core infrastructure works effectively. SRE originated in 2003 when Google formed a small production engineering team to address reliability issues. Its initial focus was on-call, monitoring, release pipeline, and other operations work. The team established service-level indicators and objectives (SLIs and SLOs) to improve infrastructure across the company. Other companies took note, and SRE soon became an industry standard.

SRE is distinct from other engineering roles. Team members work across business areas to ensure that services built by software engineering (SWE) teams remain scalable, performant, and resilient. Working with platform teams, SRE helps manage and monitor infrastructure like Kubernetes. SRE teams build frameworks to automate processes for operations teams. They might also develop applications to handle DNS, load balancing, and service connections for network engineering teams.

These functions are crucial for any company competing in today’s tech world. However, because of the vast range of technologies and methods available, each SRE team takes a different approach.

SRE at Indeed

At Indeed, we established an SRE team in 2017 to increase attention on reliability goals and optimize value delivery for product development teams. Our SRE team uses an embedded model, where each team member works with a specific organization. They code custom solutions to automate critical processes and reduce toil for engineers.

Indeed SRE focuses on these key goals:

Promote reliability best practices. SRE helps product teams adopt and iterate on metrics, such as SLOs, SLIs, and error budget policies. They promote an Infrastructure as Code (IaC) model. That means they write code to automate management of data centers, SLOs, and other assets. They also drive important initiatives to improve reliability and velocity, like Indeed’s effort to migrate products to AWS.

Drive the creation of reliability roadmaps. At Indeed, the SRE team spends more than 50% of their time on strategic work for roadmaps. They analyze infrastructure to define how and when to adopt new practices, re-architect systems, switch to new technologies, or build new tools. Once product teams approve these proposals, SRE helps design and implement the necessary code changes.

Strive for operational excellence. SRE works with product teams to identify operational challenges and build more efficient tools. They also guide the process of responding to and learning from critical incidents, adding depth to individual team retrospectives. Their expertise in incident analysis helps them identify patterns and speed up improvements across the company.

Who works in Indeed SRE?

Our SRE team is diverse and global. We asked a few team members to talk about how they arrived at Indeed SRE.

Ted, Staff SRE

I love programming. Coming from a computer science background, I started my career as a software engineer. As I progressed in my role, I became interested in certain infrastructure-related challenges. How can we move a system to the cloud while keeping costs as low as possible? How do we scale a legacy service across several machines? What metrics should we collect—and how frequently—to tell if a service works as intended?

Later, I discovered that these questions are at the intersection of SWE and SRE. Without realizing it, I had implemented SRE methodology in every company I’d worked for! I decided to apply at Indeed, a company with an established SRE culture where I could learn—not only teach.

Working for Indeed SRE gives me more freedom to select my focus than working as a SWE. I can pick from a range of tasks: managing major outages, building internal tools, improving reliability and scalability, cleaning up deprecated infrastructure, migrating systems to new platforms. My work also has a broad impact. I can improve scalability for 20+ repositories in different programming languages in one go, or migrate them to a new environment in a week. SRE has given me deeper knowledge of how services, from container orchestration tools to front-end applications, are physically managed, which makes me a better engineer.

Jessica, Senior SRE

Before joining Indeed SRE, I tried many roles, from QA to full-stack web developer to back-end engineer. Over time, I realized that I liked being able to fix issues that I identify. I wanted to communicate and empathize with the customer instead of being part of a feature factory. Those interests led me to explore work in operations, infrastructure, and reliability. That’s when I decided on SRE.

Now I support a team that works on a set of role-based access control (RBAC) services for our clients. All our employer-facing services use this RBAC solution to determine whether a particular user is authorized to perform an action. Disruptions can lead to delays in our clients’ hiring processes, so we have to make sure they get fast, consistent responses.

The best thing about being on the SRE team is working with a lot of very talented engineers. Together, we solve hard problems that software engineers aren’t often exposed to. The information transfer is amazing, and I get to help.

Xiaoyun, Senior SRE Manager

When I joined Indeed in 2015, I was a SWE and then a SWE manager. At first I worked on product features, but gradually my passion shifted to engineering work. I started improving the performance of services, e.g., making cron jobs run in minutes instead of hours. This led me to explore tools for streaming process logs and database technology for improving query latency.

Then I learned about SRE opportunities at Indeed that focused on those subjects. I was attracted to the breadth and depth offered by SRE. Since joining, I have worked with a range of technologies, services, and infrastructure across Indeed. At the same time, I’ve had the opportunity to dive deep into technologies like Kafka and Hadoop. My team has diagnosed and solved issues in several complex AWS managed services.

Indeed also encourages SRE to write reliability-focused code. This makes my background useful—I enjoy using my SWE skills to solve these kinds of challenges.

Yusuke, Staff SRE

I joined Indeed in 2018 as a new university graduate. In school, I studied computer science and did a lot of coding. I learned technologies ranging from infrastructure to web front ends and mobile apps. Eventually I decided to start my career in SRE, which I felt utilized my broad skill set better than a SWE role would.

I started on a back-end team that builds the platform to enable job search at Indeed. To begin, we defined SLIs and SLOs, set monitors for them, and established a regular process to plan capacity. Soon we were re-architecting the job processing system for better reliability and performance. We improved the deployment process with more resilient tooling. I helped adopt cloud native technologies and migrate applications to the cloud. To track and share investigation notes, we also started building an internal knowledge base tool.

I enjoy Indeed SRE because I can flex different skills. Given the nature and scale of the systems we support, I get to share my expertise in coding, technologies, and infrastructure. SRE members with different backgrounds are always helping each other solve problems.

Building your SRE career

Develop a broad skill set

SRE works with a variety of systems, so it’s important to diversify your technical skills. Besides SWE skills, you’ll need an understanding of the underlying infrastructure. A passion for learning and explaining new technologies is helpful when making broader policy and tool recommendations.

Focus on the wider organization

SRE takes a holistic view of reliability practices and core systems. When working with shared infrastructure, your decisions can affect systems across the company. To prioritize changes, you need to understand how others are using those systems and why. Working across different teams is a positive way to achieve personal and professional growth, and it advances your SRE journey.

Join us at Indeed

If you’re a software engineer, pivoting to SRE gives you exposure to the full stack of technologies that enable a service to run. If you’re currently doing operational work (in SRE or elsewhere), Indeed’s broad approach can add variety to your workload. Each team we work with has its own set of reliability challenges. You’ll be able to pick projects that interest you.

Indeed SRE also provides opportunities to grow. Our SRE culture is well established and always expanding. You’ll work with SWE and other roles, learning from each other along the way.

If you’re interested in challenging work that expands your horizons, browse our open positions today.