The Importance of Using a Composite Metric to Measure Performance

In the past, Indeed has used a variety of metrics to evaluate our client-side performance, but we’ve tended to focus on one at a time. Traditionally, we chose a single performance metric and used it as the measuring stick for whether we were improving or degrading the user experience.

This made it simple to track performance because we only needed to instrument and monitor a single datapoint. Technical and non-technical consumers could easily parse this information and understand how we were doing as an organization.

However, this approach also had significant drawbacks that, in many cases, resulted in degraded overall performance and wasted effort. This document examines those drawbacks and suggests that using a “composite metric” lets us measure what our users are experiencing far more accurately.

Past Performance Measurements

Below we look at a few metrics we’ve used to try to understand client-side performance, each attempting to answer one of the following questions:

“When did the main JavaScript for the page execute?” — JSV Delay

One of the earliest metrics widely used at Indeed was “JSV delay” (JavaScript Verification Delay) which measured the point at which JavaScript loaded, parsed, and began to execute. It was instrumented as a client-side network request which marked the time at which our main JavaScript began to execute.

This metric helped us detect when we were degrading the experience by adding extra JS, or by adding content ahead of the JS bundle, since both slowed down JSV Delay. Over time, this measurement was widely adopted, but it suffered from significant issues:

Failure to capture the performance impact of third-party content (Google Analytics, Micro Frontends, etc.)
Inability to measure what a user was actually experiencing — even if JS loaded, the page wasn’t actually usable at the time, and the time to usability wasn’t being measured
Bespoke implementation of the metric meant we were not uniformly measuring performance across our pages — JSV delay meant something different from one page to another
No one really knew what the metric meant — because it’s only a standard inside Indeed, we were continually explaining the metric, its advantages, and its downsides

“When did all critical CSS and JavaScript Load?” — domContentLoadEnd

Once JSV Delay was no longer serving our needs, we decided to adopt a metric that was more broadly used in the software industry. domContentLoadEnd is defined as:

when the HTML document has been completely parsed, and all deferred scripts… have downloaded and executed. It doesn’t wait for other things like images, subframes, and async scripts to finish loading.

In layman’s terms, we can interpret domContentLoadEnd as a more generalized JSV Delay — it fires only after critical HTML, CSS, and JavaScript have loaded. This gave us a much better idea of how the page as a whole was performing, and it was no longer a custom metric, which reduced confusion and ensured that we were uniformly measuring performance across all of our pages. However, this metric too came with significant issues:

domContentLoadEnd doesn’t capture async scripts, which means it misses out on significant portions of the page
Similar to JSV Delay, the fact that much of the code had loaded didn’t necessarily mean the page was interactive
For some pages (e.g., single page applications), domContentLoadEnd could fire while the page was still entirely blank
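
As a rough sketch of how a datapoint like this can be derived, the snippet below computes a domContentLoadEnd-style duration from a navigation timing entry. The `NavTimingLike` shape mirrors two fields of the browser's `PerformanceNavigationTiming` entry. The function name and sample values are hypothetical, and this is not necessarily how Indeed instruments the metric.

```typescript
// Minimal subset of the browser's PerformanceNavigationTiming entry.
interface NavTimingLike {
  startTime: number;                // navigation start in ms (0 for navigation entries)
  domContentLoadedEventEnd: number; // ms since navigation start; 0 until the event has finished
}

// Returns the time from navigation start to the end of DOMContentLoaded,
// or null if the event has not completed yet.
function domContentLoadEndMs(entry: NavTimingLike): number | null {
  if (entry.domContentLoadedEventEnd <= 0) {
    return null;
  }
  return entry.domContentLoadedEventEnd - entry.startTime;
}

// Hypothetical sample values, for illustration only.
const finishedLoad: NavTimingLike = { startTime: 0, domContentLoadedEventEnd: 1240.5 };
const stillParsing: NavTimingLike = { startTime: 0, domContentLoadedEventEnd: 0 };

console.log(domContentLoadEndMs(finishedLoad)); // 1240.5
console.log(domContentLoadEndMs(stillParsing)); // null
```

In a browser, the entry would typically come from `performance.getEntriesByType("navigation")[0]`. Note that nothing in this calculation knows whether any content is visible, which is why a blank single page application shell can still report a fast value.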

“When did users see the most important content on the page?” — largestContentfulPaint

Our last usage of “a single metric to explain performance” was largestContentfulPaint (LCP). This was a big step forward for us because it was the first time we adopted a Google-recommended metric, one created to measure an ever-evolving web landscape.

This allowed us to, for the first time, use a metric that captured “perceived performance,” rather than a more arbitrary datapoint from a browser API. By using LCP, we were making a conscious choice to measure the actual user experience, which was a big step in the right direction.

Because of Indeed’s usage of server-side rendering on high-traffic job search pages, where HTML is immediately visible to users on initial page load, LCP corresponded to the moment where users first saw job cards, the job description, and other critical content. The faster we show users content, the more time we save them and the more delightful the experience.

Again, however, this measurement came with significant issues:

LCP is not supported in iOS browsers or in other legacy browsers, which means we fail to capture this metric for a large percentage of our page loads and users.
Although users can see the critical content, it probably isn’t yet interactive.
LCP is a web-based metric, only collectible in web browsers, and thus excludes native applications.
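
To make the coverage problem concrete, here is a minimal sketch of how an LCP value is resolved from the candidates a browser reports, and what happens when no candidates arrive at all. The `LcpCandidate` shape is a simplified stand-in for the browser's `largest-contentful-paint` entries. The function names and sample numbers are hypothetical.

```typescript
// Simplified stand-in for a "largest-contentful-paint" performance entry.
interface LcpCandidate {
  startTime: number; // ms since navigation start when this candidate was painted
  size: number;      // rendered area of the element, in square pixels
}

// Browsers report a new candidate each time a larger element is painted,
// so the most recent candidate is the one treated as the page's LCP.
// Returns null when nothing was reported (e.g., a browser without LCP support).
function resolveLcpMs(candidates: LcpCandidate[]): number | null {
  const last = candidates[candidates.length - 1];
  return last ? last.startTime : null;
}

// Share of page loads (0..1) for which an LCP value could be collected.
function lcpCoverage(loads: LcpCandidate[][]): number {
  if (loads.length === 0) {
    return 0;
  }
  const measured = loads.filter((c) => resolveLcpMs(c) !== null).length;
  return measured / loads.length;
}

// Hypothetical page loads: one from a supporting browser, one without support.
const supportedLoad: LcpCandidate[] = [
  { startTime: 800, size: 12000 },  // header text painted first
  { startTime: 1900, size: 96000 }, // larger main content painted later
];
const unsupportedLoad: LcpCandidate[] = [];

console.log(resolveLcpMs(supportedLoad));                  // 1900
console.log(resolveLcpMs(unsupportedLoad));                // null
console.log(lcpCoverage([supportedLoad, unsupportedLoad])); // 0.5
```

In a real page, the candidates would be gathered with a `PerformanceObserver` observing the `largest-contentful-paint` entry type. Any dashboard built only on this value silently excludes every load that returns `null`.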

Differing Page Loads

The lifecycle of a page is complex — from a technical perspective, a lot happens between the initial navigation to a page and when a user begins interacting with its content. The core problem with using a single metric to understand this complex workflow is that it removes much of the context which is necessary in understanding “how the user perceived the page load”.

Let’s consider the following diagram:

Animated timeline showing a page loading evenly over four seconds

Here we see a standard page which takes 4 seconds to load. To start, the job seeker sees a blank page for 1 second; a second later they see a header and a loading indicator. One second after that they see the main content of the page (LCP), and a second later the page is fully interactive. Now let’s take a look at the next diagram:

Animated timeline showing a page loading four seconds, with the first three changes happening more quickly

Here we see the same page loading, but we see the main content of the page much sooner! But then we wait 2.5 seconds for the page to become interactive. If we were using a single metric, say LCP, we would believe the second page is much faster. In practice, users would experience a lot of frustration waiting for the page to become interactive.

Finally, let’s look at this scenario:

Animated timeline showing a page loading four seconds, with the last three changes happening quickly near the end of the four seconds

Here we see that the page is still taking 4 seconds to load, but users don’t see any content until the last second. It’s pretty intuitive that this is a poor experience: for most of that time we’re looking at a blank page, and we don’t even know whether it’s loading at all. Again, if we chose a single metric, we wouldn’t be capturing the actual perceived experience of the page load. What if we improved the time to seeing initial content from 3.5 seconds to 2, while total loading time stayed the same? The user would feel that the page is faster, but we wouldn’t be capturing that improvement.
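
The three diagrams can be approximated in code to show how a single metric ranks them. This is only an illustrative sketch: the `LoadTimeline` type, the helper names, and most of the millisecond values are assumptions chosen to roughly match the descriptions above, not measurements.

```typescript
// One page load, described by when each milestone happens (ms since navigation).
interface LoadTimeline {
  label: string;
  firstContentMs: number; // header or loading indicator visible
  mainContentMs: number;  // main content visible (LCP)
  interactiveMs: number;  // page fully interactive
}

// Hypothetical timings approximating the three diagrams.
const timelines: LoadTimeline[] = [
  { label: "even", firstContentMs: 2000, mainContentMs: 3000, interactiveMs: 4000 },
  { label: "early content", firstContentMs: 1000, mainContentMs: 1500, interactiveMs: 4000 },
  { label: "late content", firstContentMs: 3500, mainContentMs: 3750, interactiveMs: 4000 },
];

// Time during which the user can see the main content but cannot yet use it.
function waitForInteractivityMs(t: LoadTimeline): number {
  return t.interactiveMs - t.mainContentMs;
}

// Picks the "fastest" timeline according to one metric only.
// On a tie, the earliest timeline in the list wins.
function fastestBy(metric: (t: LoadTimeline) => number, loads: LoadTimeline[]): string {
  return loads.reduce((best, t) => (metric(t) < metric(best) ? t : best)).label;
}

console.log(fastestBy((t) => t.mainContentMs, timelines)); // "early content"
console.log(fastestBy((t) => t.interactiveMs, timelines)); // "even" (all three tie at 4000 ms)
console.log(timelines.map(waitForInteractivityMs));        // [ 1000, 2500, 250 ]
```

Ranking by LCP alone crowns the second timeline even though it has the longest wait between seeing content and being able to use it, while ranking by total load time cannot tell the three apart at all.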

The Single Metric Problem

As we can see from the above, the lifecycle of a page can be highly variable, where small changes can have big impacts on how users perceive performance. When we look back on our historical performance measurements which utilized the “single metric approach”, we see two fundamental issues:

One metric can’t capture perceived performance

Holistic performance cannot be captured by a single metric — as depicted in the diagrams above, there is no single point in a page load which measures how quickly a user becomes engaged with content.

There are thousands (or an infinite number?) of ways to build a web page, and each brings its own trade-offs when it comes to performance.

For pages that don’t implement server-side rendering (SSR), if we chose to only measure firstContentfulPaint, we would be measuring a datapoint which has effectively no value (since this metric would capture when the first blank page was rendered).

For single page applications, if we chose to measure only time to interactive (TTI), we would be ignoring how quickly users saw initial content, and how quickly they could begin to interact with the page. The reason is that although TTI is an important indicator, it fails to precisely capture when a page is truly interactive.

Another problem with using a single metric is that our pages change over time, and as a result, so does how users perceive the performance of a page. Using the above examples, what if an application went from a client-side rendered approach to a server-side rendered approach? If we stuck with the same performance measurement, say TTI, we might think we had hurt performance, when in reality we’re now showing content to the user much sooner, with the tradeoff of a negligible impact to TTI. Overall, the perceived page performance would be drastically improved, but we would fail to measure it.

From a business and organizational perspective, that’s an observability gap which has profound implications for how we spend our time and effort.

Improving one metric often degrades another

The second, and perhaps more significant issue with using a single metric to measure speed is that it often results in degraded performance without us realizing it.

The easiest way to improve performance is to ship fewer bytes, and render less content overall. In reality, that’s not always a decision we can make for the business. So as we begin to try to improve performance, we often end up in situations where we’re able to improve a single metric but it either has no bearing on holistic performance, or it actually hurts it!

Let’s take a look at a new diagram (depicted below):

Animated timeline showing a page loading four seconds, with the page becoming progressively more useful over the four seconds

Here we see that our page begins loading normally and at the 2 second mark we have our main content, and the page is interactive. At this point our users can perform their primary goal with the page (let’s say apply for a job for example). At the 3 second mark more content pops in, and finally a second later, all content is visible on the page. This is a common loading pattern for async, or client-side rendered applications (e.g., single page apps).

Ideally, what we’d like to do is shift each of these frames to the left, improving the perceived performance of each step. However, if we were only measuring time to interactive, which occurs in frame 4, we would completely disregard the most important part of the page load, which is “how quickly can we make the main content of our page visible and interactive?” (frame 2). Similarly, if we only measured LCP (which occurs in frame 2), we would be disregarding TTI, which is where all of the content is finally visible.

In this example, we can see that no single metric captures the true performance of the page, but rather it’s a collection of metrics which help us understand the true perceived performance.

Perceived performance depends heavily on how quickly the page loads but, perhaps more importantly, on how it loads.

Using a Composite Metric: Lighthouse Explained

Finally, this brings us to the use of a “composite metric”, a term used in statistics that simply means “a single measurement based on multiple metrics”. With a Lighthouse score we’re able to derive a single score based on 5 data points, each of which represents a different aspect of a page load.

These data points are:

A table showing the different metrics in the composite Lighthouse score, and how they're weighted

For brevity, we won’t go into detail on each data point — you can read more about these page markers here. At a high level, industry experts have agreed upon these 5 markers and weighted them according to how much they contribute to a user perceiving a page as fast and responsive.

As is hopefully evident based on the explanations above, the purpose of using these 5 data points is to best capture the holistic perceived performance. We weight LCP, total blocking time (TBT), and cumulative layout shift (CLS) the highest because we believe these are the most important indicators of speed. First contentful paint (FCP) and speedIndex also contribute, but are less significant overall.

During each page load, we’re able to calculate all of these metrics and use an algorithm to determine a single score. Page loads that score 90 or above are considered “fast and responsive”, while scores below 90 are in need of improvement.
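
The scoring step can be sketched as a weighted average. The weights below are an assumption taken from Lighthouse 10's published scoring (they match the ordering described above but are not copied from the table in this article), and the function and variable names are hypothetical. Lighthouse also converts each raw metric value into a 0 to 1 score before weighting; that conversion is skipped here, so the inputs are assumed to be already normalized.

```typescript
type MetricKey = "fcp" | "speedIndex" | "lcp" | "tbt" | "cls";

// Assumed weights (Lighthouse 10); they sum to 1.
const METRIC_WEIGHTS: Record<MetricKey, number> = {
  fcp: 0.10,
  speedIndex: 0.10,
  lcp: 0.25,
  tbt: 0.30,
  cls: 0.25,
};

// Combines per-metric scores (each already normalized to 0..1)
// into a single 0..100 composite score.
function compositeScore(scores: Record<MetricKey, number>): number {
  let total = 0;
  for (const key of Object.keys(METRIC_WEIGHTS) as MetricKey[]) {
    total += METRIC_WEIGHTS[key] * scores[key];
  }
  return Math.round(total * 100);
}

// The threshold of 90 comes from the text above.
function isFastAndResponsive(score: number): boolean {
  return score >= 90;
}

// Hypothetical page load: strong everywhere except LCP.
const sampleScores: Record<MetricKey, number> = {
  fcp: 0.95,
  speedIndex: 0.90,
  lcp: 0.80,
  tbt: 0.99,
  cls: 1.0,
};

const sampleComposite = compositeScore(sampleScores);
console.log(sampleComposite);                      // 93
console.log(isFastAndResponsive(sampleComposite)); // true
```

Because LCP, TBT, and CLS carry 80% of the weight between them, a weak score on any one of those pulls the composite down far more than a weak FCP or speedIndex would.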

Composite Metrics in Action

If we use the same page load diagram from above, we can imagine how using a composite metric allows us to fully capture performance for our users.

A still image depicting a page loading evenly over four seconds

Let’s run through a few scenarios:

If we ended up shipping a change which improved FCP and LCP (frames 1 and 2), and did no harm to frames 3 and 4, we would see an improvement to our overall Lighthouse score.

If we ended up shipping a change which improved FCP and LCP (frames 1 and 2), but degraded frames 3 and 4, we would see no improvement to our overall Lighthouse score.

If we ended up shipping a change which improved FCP, but degraded frames 2, 3, and 4, we would see an overall degradation that we would have missed if we were monitoring only a single metric.
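
These three scenarios can be played out numerically. The sketch below is illustrative only: the weights are assumed from Lighthouse 10, the per-metric scores are made up, and it assumes that the later frames show up in the TBT and CLS scores, which the article does not state.

```typescript
type ScoreKey = "fcp" | "speedIndex" | "lcp" | "tbt" | "cls";
type ScoreSet = Record<ScoreKey, number>;

// Assumed weights (Lighthouse 10); they sum to 1.
const SCENARIO_WEIGHTS: ScoreSet = { fcp: 0.10, speedIndex: 0.10, lcp: 0.25, tbt: 0.30, cls: 0.25 };

// Weighted sum of normalized (0..1) metric scores, scaled to 0..100.
function weightedScore(scores: ScoreSet): number {
  let total = 0;
  for (const key of Object.keys(SCENARIO_WEIGHTS) as ScoreKey[]) {
    total += SCENARIO_WEIGHTS[key] * scores[key];
  }
  return total * 100;
}

// Hypothetical starting point: every metric scores 0.8.
const baseline: ScoreSet = { fcp: 0.8, speedIndex: 0.8, lcp: 0.8, tbt: 0.8, cls: 0.8 };

// Scenario 1: FCP and LCP improve, nothing else changes.
const earlyWin: ScoreSet = { ...baseline, fcp: 0.9, lcp: 0.9 };

// Scenario 2: FCP and LCP improve, but later frames degrade (TBT and CLS drop).
const earlyWinLateLoss: ScoreSet = { ...baseline, fcp: 0.9, lcp: 0.9, tbt: 0.75, cls: 0.72 };

// Scenario 3: only FCP improves; LCP, TBT, and CLS all degrade.
const fcpOnly: ScoreSet = { ...baseline, fcp: 0.9, lcp: 0.7, tbt: 0.7, cls: 0.7 };

console.log(weightedScore(baseline).toFixed(1));         // "80.0"
console.log(weightedScore(earlyWin).toFixed(1));         // "83.5"
console.log(weightedScore(earlyWinLateLoss).toFixed(1)); // "80.0"
console.log(weightedScore(fcpOnly).toFixed(1));          // "73.0"
```

A team watching only FCP would report an improvement in all three scenarios, while the composite shows a gain, a wash, and a regression respectively.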

Why Can’t We Simply Use “Time to Interactive” (TTI)?

This is a common question within the performance realm, so I want to address it here, along with how it relates to composite metrics.

First, what is TTI? The most common definition is as follows:

TTI is a performance metric that measures a page’s load responsiveness and helps identify situations where a page looks interactive but actually isn’t. TTI measures the earliest time after First Contentful Paint (FCP) when the page is reliably ready for user interactivity.

This sounds great, so why not just use this? Isn’t the most important thing for performance when the page is interactive?

Like all things in software, there are nuances and tradeoffs. Let’s look at the pros and cons:

Pros:

A single metric which estimates how long the overall page took to become usable

Cons:

TTI is no longer recommended, and has been taken out of Lighthouse calculations because it’s not believed to be an accurate metric across a wide variety of page load types (CSR, SSR, etc.).
TTI is an estimation based on network activity and long tasks on the main thread, not an actual marker of page completion.
Because TTI is just a single metric, it suffers from “the single metric problem” which is explained above.
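
To illustrate why TTI is an estimate rather than a direct marker, here is a simplified sketch of the commonly described approach: after FCP, look for a 5-second window with no long main-thread tasks, then report the end of the last long task before that window. This sketch deliberately ignores the network-quiet condition, and the type, function name, and sample values are hypothetical.

```typescript
// A main-thread task long enough to block input (hypothetical shape).
interface LongTask {
  startMs: number;
  endMs: number;
}

const QUIET_WINDOW_MS = 5000;

// Estimates TTI as the end of the last long task before the first
// 5-second quiet window after FCP. Returns null if the trace ends
// before a full quiet window has been observed.
function estimateTtiMs(fcpMs: number, longTasks: LongTask[], traceEndMs: number): number | null {
  const sorted = longTasks.slice().sort((a, b) => a.startMs - b.startMs);
  let candidate = fcpMs; // TTI can never be earlier than FCP
  for (const task of sorted) {
    if (task.endMs <= fcpMs) {
      continue; // finished before FCP, irrelevant
    }
    if (task.startMs - candidate >= QUIET_WINDOW_MS) {
      break; // a full quiet window precedes this task
    }
    candidate = Math.max(candidate, task.endMs);
  }
  return traceEndMs - candidate >= QUIET_WINDOW_MS ? candidate : null;
}

// Hypothetical trace: two long tasks shortly after FCP, one much later.
const sampleTasks: LongTask[] = [
  { startMs: 1200, endMs: 1500 },
  { startMs: 2000, endMs: 2300 },
  { startMs: 9000, endMs: 9200 },
];

console.log(estimateTtiMs(1000, sampleTasks, 12000));            // 2300
console.log(estimateTtiMs(1000, sampleTasks.slice(0, 2), 6000)); // null
```

The second call returns `null` because the trace ends only 3.7 seconds after the last long task, so no 5-second quiet window was ever observed. Small shifts in when a long task happens to land can move the result by seconds, which is part of why the value is hard to trust on its own.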

My point here isn’t that TTI is bad, but rather that it’s an incomplete way of looking at performance. TTI is a useful indicator, but it’s only meaningful if we look at it in the context of our other metrics (FCP, LCP, etc.). TTI’s main purpose is to provide a corroborating metric, rather than to explain performance overall.

As an organization, we can imagine hundreds of ways to improve TTI without actually improving the most critical aspects of perceived performance. Additionally, we can imagine ways which improve TTI that actually hurt the earlier marks of a page load, which may result in degraded performance overall.

Conclusions

My hope is that readers who have made it this far now have a more nuanced understanding of how we can measure client-side performance. With the advent of the web we developed metrics which helped us figure out how fast static pages were loading. As the web advanced (thanks a lot, jQuery!), so too have our measurements.

Based on the past ~4 years of deep investment in performance improvements at Indeed, I believe these are my most important takeaways:

Use a composite metric, but be willing to change the underlying internal metrics.
Be wary of the silver bullet — metrics or tools that purport to capture everything you need nearly always don’t.
Technology changes, and we need to change how we measure performance as a result.
Corroborate your speed metrics with how your page loads and ensure they actually represent what users are experiencing.