Secure Workload Identity with SPIRE and OIDC: A Guide for Kubernetes and Istio Users

Goal

This blog is for engineering teams, architects, and leaders responsible for defining and implementing a workload identity platform and access controls rooted in Zero Trust principles to mitigate the risks from compromised services. It is relevant for companies using Kubernetes to manage workloads, using Istio for service mesh, and aiming to define identities in a way that aligns with internal standards, free from platform-specific constraints. Specifically, we’ll discuss Indeed’s solution for third-party authentication, opinionated best practices, and the challenges we faced. It is not practical to share every alternative, trade-off, and engineering insight behind our decisions; instead, we want to share the design choices and implementation details that can accelerate decision making and problem solving for others in similar situations.

Introduction

Passwords are a tale as old as ancient civilizations. Modern systems routinely rely on API key and ID pairs (analogous to usernames and passwords) to access other systems. In theory, these API keys are complex, carefully managed by developers, and stored securely. The reality is more complicated. We have all heard stories of passwords hiding in plain sight, unencrypted: in code repositories, in log messages, in headers, in terminal history, wherever it’s convenient to just get the job done. Rotating old API keys can be even scarier. Who knows whether keys have been shared, how many times, and with whom? Did Alice delete the old API key? Was the new API key deployed everywhere!?

So what’s the solution? Step 1: Articulate and measure the problem. At Indeed, we embody our core value of being data-driven. Through our analysis, we recognized the risk posed by compromised credentials used by services. Our data revealed that half of our AWS IAM keys have access to some type of restricted data. We observed shared API keys being used across a wide range of our workloads. We discovered roughly eight times as many stored secrets as there are unique keys in all of our major authorization systems. This indicates a significant duplication of secrets, though we have not yet determined the exact scale of this duplication. Step 2: Implement a solution that works for Indeed’s heterogeneous workloads across third-party SaaS cloud vendors and Indeed’s own (first-party) apps.

Image showing API keys from a shared vault being used to access resources in multiple cloud providers

The starting point is to build an identity platform capable of provisioning temporary, verifiable, attestable, unique, and cryptographically secure workload credentials for access to third-party systems like Confluent Cloud and AWS, as well as first-party services. Indeed promotes responsible use of open source software and dedicated platforms with clear responsibilities, leveraging industry standards to solve common problems. Our workload identity platform is built on SPIRE, embracing open standards like SPIFFE, OAuth 2.0, and OIDC to provide managed identities as X.509 certificates or JSON Web Tokens (JWTs).

SPIRE

SPIRE is a PKI project that graduated from the Cloud Native Computing Foundation. SPIRE is open source, widely used in the industry, and has a vibrant and active community of engineers. SPIRE can be deployed in a scalable and resilient manner and has been operating reliably at scale in production at Indeed for over a year now. SPIRE-issued X.509 identities are used in our Istio service mesh for mTLS, and JWT identities enable OIDC-based federated access to Confluent and AWS resources.

Istio Opinions

Adopting Istio to replace our legacy service mesh created conflicts with certain SPIRE configurations already in production.

SPIFFE Format

We debated the granularity and uniqueness of identities suitable to represent an Indeed application. In this context, identity refers to the subject, i.e., the SPIFFE ID of a workload. The discussion revolved around the SPIFFE ID template and its constituent parts, e.g.:

spiffe://<trust_domain>/<scheduling_platform>/<environment>/ns/<namespace>/sa/<service-account>

However, Istio is highly opinionated about the SPIFFE ID format a workload must have:

spiffe://<trust.domain>/ns/<namespace>/sa/<service-account>

An image showing a cautionary note on workload ID formatting from the Istio / SPIRE documentation

 This is a known problem that is still open with Istio: Customizing SPIFFE ID format if using an external SPIFFE-compliant SDS should be supported · Issue #43105 · istio/istio · GitHub

If you have a SPIRE deployment already in production with a different SPIFFE ID format for your Kubernetes workloads, be aware of Istio’s requirements. Updating the subject of your workloads is not trivial: while it’s only a configuration change in SPIRE, the subject likely appears wherever access control and authorization rules are defined for your workloads.

SPIRE Agent Socket Name

Istio requires that the SPIRE Agent APIs be available only on the /var/run/secrets/workload-spiffe-uds/socket Unix domain socket, another (unnecessary) Istio opinion that affects the entire platform and requires careful planning to accommodate. Since we already had SPIRE in production, we used a Kubernetes volume mount to expose our existing socket path at /var/run/secrets/workload-spiffe-uds and only had to rename the socket file from agent.socket to socket. We made the practical choice of temporarily disabling mTLS in the mesh and rolling out our SPIRE Agent socket name change one cluster at a time, as it affected the proxy SDS (Secret Discovery Service) configuration as well. During this time, our mesh was only protected by the network perimeter behind the VPN. After both the SPIRE Agent and service mesh SDS configurations were updated, mTLS was turned back on.

SPIRE Architecture

Topology and Trust Domain

At Indeed, we manage a single trust domain in SPIRE deployed in a nested topology. We run multiple SPIRE Servers in each Kubernetes cluster for redundancy. SPIRE Servers in each cluster have a common datastore for synchronization. There’s one root SPIRE CA deployed in a special cluster reserved for infrastructure services. All other Kubernetes clusters have their own intermediate SPIRE CAs with the root CA as their upstream authority.

An image showing an example of a nested SPIRE deployment

A and N represent cardinality; any number greater than 1 is suitable. The cardinality of M is the number of nodes in the cluster, as each node has its own instance of the SPIRE Agent.

This topology is scalable, performant, and resilient. A single SPIRE Server can go down in any cluster without causing an outage. All SPIRE Servers in a cluster going down only affects workloads in that cluster. Each SPIRE component in each cluster can be configured and tuned separately. SPIRE configures each Server with its own CA signing keys, which is also desirable from a security perspective: a compromised SPIRE Server’s private keys are not used elsewhere.

We use a unified trust domain for all our workloads in production and non-production environments (excluding local development). A single trust domain is easier to reason about and maintain. Namespace naming conventions at Indeed typically include the environment name in the namespace, and that provides sufficient logical separation from an operational and security perspective. For example, we treat metrics from the spire-dev namespace differently from those from spire-prod. We help our developer teams understand that they can use variations in namespace and service account to create different permission boundaries for similar workloads in different environments.

SPIRE Performance and Deployment Tuning: Lessons from Production

Through our experience running various SPIRE components across a fleet of 3000 pods, we discovered some Kubernetes configurations that keep our platform stable even as nodes and pods come and go. These settings were also influenced by stress testing of our SPIRE platform by scheduling thousands of workloads in a limited amount of time and observing how our platform behaved during major upgrades. Here are some settings we recommend:

  1. Set the criticality of the SPIRE components to minimize eviction. priorityClassName: XXXX for SPIRE Server and Agent.
    • Kubernetes limits the number of pods per node (110 by default). We need to guarantee that the SPIRE Agent gets scheduled on each node, since it is a runtime requirement for all pods. Secondly, we want to prevent preemption of core SPIRE components as much as possible. Without priorityClassName, Kubernetes defaults to a priority of zero or the globalDefault PriorityClass. This setting must be set explicitly and high enough to ensure the SPIRE Agent is scheduled on every node.
  2. Set resource request/limits for ephemeral storage for SPIRE Agent. We observed SPIRE Agent pod evictions related to disk pressure on the node. Our solution was to explicitly set both requests/limits to ephemeral-storage: XXXMi to prevent the SPIRE Agent from being evicted.
  3. Leverage vertical pod autoscaling (VPA) for SPIRE components (Servers, Registrars, and Agents). SPIRE runs in a myriad of clusters with unique and varying performance characteristics. Our performance testing revealed the CPU and memory upper bounds we can expect. But overallocation for the worst case is costly and inefficient. With VPA we are able to set CPU minAllowed to 15m, i.e., 0.015 CPU for SPIRE components! The max was based on observations during performance testing.
    • Note that updatePolicy for SPIRE Agents was set to updateMode: Initial. This is to prevent evictions from VPA updates. We made a conscious choice to minimize SPIRE Agent disruption from VPA changes and apply VPA policies during expected SPIRE Agent restarts due to node upgrades, scheduled deployments, etc.
    • updateMode: Auto is in use for all other SPIRE components.
  4. Since the SPIRE Agent is configured as a DaemonSet, we also set our updateStrategy to type: RollingUpdate with rollingUpdate set to maxUnavailable: 5. This slows the rollout of SPIRE Agents in a cluster but ensures that the large majority of nodes in the cluster are being served by SPIRE throughout the rollout.

SPIRE Signing Keys and KeyManager Configuration

If your workload requires a JWT SPIFFE Verifiable Identity Document (SVID), it is highly likely you’ll need a stable, predictable number of signing keys in use across all SPIRE Servers. It is important to note:

  1. Each SPIRE Server has its own unique X.509 and JWT key pairs for signing.
  2. The in-memory KeyManager generates new X.509 and JWT signing keys upon every restart.
  3. SPIRE does not offer the option of using the SQL datastore as a KeyManager as well.

We encountered issues using AWS EBS/EFS CSI persistent volumes and thus couldn’t use the disk KeyManager plugin. We helped enhance the built-in AWS KMS KeyManager plugin so that there is an option for a persistent key store without relying on persistent volumes for SPIRE Server pods. We have found the AWS KMS KeyManager to be reliable.

Given M total SPIRE Servers, the number of JWT signing keys K in the JSON Web Key Set (JWKS) is M <= K <= 2 * M. It is possible for a SPIRE Server to have an active JWT signing key that’s used for signing and verification and another unexpired key that’s used for verification only.

SPIRE as OAuth Identity Server

OIDC Discovery Provider

SPIRE can be integrated as an Identity Server in the OAuth flow. The SPIRE OIDC Discovery Provider further allows for federation based on SPIRE JWT SVIDs. We initially deployed the SPIRE OIDC Discovery Provider to all SPIRE Servers, including the root and intermediate CAs. Querying the Provider would return a varying number of public JWT signing keys! Our current strategy is to enable and serve the SPIRE OIDC Discovery Provider from the root CAs only. We find the root SPIRE CAs in a nested topology to be an accurate source for the full trust bundle, including all JWT signing keys in use across the entire SPIRE Server fleet.

CredentialComposer Plugin

SPIRE Server supports many customization plugins. Among them, the CredentialComposer plugin can modify the claims in a JWT SVID as needed. At Indeed, we implement a custom plugin that looks up a workload’s metadata and translates it into additional claims. Our approach to federation with AWS is based on passing session tags using AssumeRoleWithWebIdentity. We tag AWS resources storing sensitive data and manage which workload has access to which tags in internal systems. The custom plugin looks up the appropriate session tags for a workload and adds them to the JWT SVID.

An image showing how a K8s workload uses SPIRE OAuth to access an S3 bucket with a custom JWT

The workload’s final access is the combination of the IAM Policy attached to the IAM Role and additional session tags the workload was granted. The IAM Role itself doesn’t need to be tagged.
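To make the result concrete, here is an illustrative sketch of what a decoded JWT SVID payload might look like after our plugin adds session tags. All values below (SPIFFE ID, issuer, tag names) are hypothetical; the “https://aws.amazon.com/tags” claim structure is what AWS STS expects when session tags are passed via AssumeRoleWithWebIdentity.

# Illustrative only: a decoded SPIRE JWT SVID payload after the CredentialComposer
# plugin adds AWS session tags. The SPIFFE ID, issuer, and tag names are hypothetical.
example_jwt_svid_claims = {
    "sub": "spiffe://example.org/ns/team-a-prod/sa/orders-service",   # workload SPIFFE ID
    "iss": "https://oidc.spire.example.org",                          # SPIRE OIDC Discovery Provider
    "aud": ["sts.amazonaws.com"],
    "iat": 1700000000,
    "exp": 1700000300,
    # Added by the custom plugin; this is the claim AWS STS reads for session tags:
    "https://aws.amazon.com/tags": {
        "principal_tags": {"data-classification": ["restricted"]},
        "transitive_tag_keys": ["data-classification"],
    },
}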

The SPIFFE Helper utility runs as a sidecar to request, refresh, and store the JWT SVID at a fixed location on the workload pod.

Third-party Federation Using OIDC

At Indeed, popular Confluent and AWS technologies store most of our critical data, and most of our workloads access data in both clouds. It was important for us to implement federation with both successfully from the start. The details for enabling and configuring OIDC are well documented for both AWS and Confluent. Next, we’ll cover how our experience differed between the two vendors and the lessons we learned. It is fair to say that there were significant differences and nothing should be taken for granted, as you’ll see.

Opaque Limits on Keys Accepted in JWKS, and Too Many JWT Signing Keys

We discussed earlier that each SPIRE Server has its own unique JWT signing key pair and that the maximum number of signing keys is twice the number of SPIRE CA servers. One drawback of a nested topology scaled for fault tolerance is that there are many SPIRE CAs. So, given M SPIRE Servers per cluster and N Kubernetes clusters, there can be up to 2 * M * N JWT signing keys in the JSON Web Key Set (JWKS). For example, 3 Servers in each of 20 clusters can yield up to 120 signing keys.

In the early phases of development, we saw verification of SPIRE JWTs fail in both Confluent and AWS. Our proof of concept, which had used a single SPIRE CA server in a test trust domain, had worked. We investigated further and found that AWS accepts roughly 100 signing keys and Confluent only a handful. Neither vendor documents the limit anywhere, which made the whole process more difficult. We were able to work with Confluent to increase the soft limit to something more reasonable. The AWS limit remains the same.

We have this issue open with the SPIRE community as well. SPIRE deployments of more than a few servers can create more keys in JWKS than OIDC federating system supports · Issue #4699 · spiffe/spire · GitHub 

While nested topologies are great for high availability, there’s a real risk that federation can fail based on arbitrary limits on the number of signing keys supported by the federating system. SPIRE could benefit from a mechanism where the number of SPIRE instances can scale while the number of JWT signing keys stays fixed, i.e., the ability to logically group SPIRE Servers that use the same key material.
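If you operate a similar topology, it is worth watching the size of your published JWKS before a vendor’s limit bites. A minimal sketch (the issuer URL and the limit value are illustrative assumptions):

import json
import urllib.request

OIDC_ISSUER = "https://oidc.spire.example.org"   # hypothetical issuer URL
VENDOR_KEY_LIMIT = 100                           # e.g., the approximate AWS limit we observed

def count_jwks_keys(issuer: str) -> int:
    # Standard OIDC discovery: read jwks_uri from the issuer's configuration document.
    with urllib.request.urlopen(f"{issuer}/.well-known/openid-configuration") as resp:
        jwks_uri = json.load(resp)["jwks_uri"]
    with urllib.request.urlopen(jwks_uri) as resp:
        return len(json.load(resp)["keys"])

key_count = count_jwks_keys(OIDC_ISSUER)
print(f"{key_count} JWT signing keys published (vendor limit ~{VENDOR_KEY_LIMIT})")
if key_count > VENDOR_KEY_LIMIT:
    print("WARNING: federation may start failing due to too many signing keys")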

OIDC Configuration

When configuring the OIDC Provider in AWS, the thumbprint of the top-level certificate used to sign the OIDC endpoint’s certificate is required. Confluent doesn’t require any such configuration. Our SPIRE OIDC server endpoint has a certificate issued by Let’s Encrypt; Confluent implicitly trusts globally trusted CAs, while AWS requires that the thumbprint be set. This is challenging because Let’s Encrypt recently truncated its chain and has also shortened the lifetime of the new top-level intermediate CA. You must define a process or automation to update the OIDC configuration in AWS before the CA that signs the OIDC server’s certificate rotates.

Note: This thumbprint is different from the JWT signing key pair used to sign the JWT and subsequently used in JWT verification.

Confluent Identity Pool vs. AWS IAM Role

In the context of OIDC, Confluent identity pools and AWS IAM roles are both used to manage permissions, but their implementations differ. We’ll look at some key differences.

Audience Claim

It is worth noting that AWS expects Audience(s) to be set per OIDC provider. Confluent expects the aud claim to be defined in each identity pool. The difference is that in AWS, the audience claim is tied to the Issuer relationship itself, so there’s no need for an audience check in the trust policy for an IAM Role. Confluent expects Identity Pool filters to explicitly verify the issuer, audience, etc.

Trusting OIDC Providers

A Confluent Identity Pool trusts a single OIDC Provider only. The Confluent documentation for identity pools may lead you to believe otherwise, since it supports filter expressions like claims.iss in [“google”, “okta”], but an identity pool is bound to one OIDC Provider. AWS IAM Roles, on the other hand, rely on trust policies, which can be configured to trust multiple OIDC Providers by repeating the principal block. This matters when thinking about migrating to new OIDC Providers or running multiple Identity Providers in your organization.
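As a rough illustration (the account ID, provider hostnames, and SPIFFE ID below are placeholders, not our actual configuration), an IAM Role trust policy can trust two OIDC Providers by repeating the statement with a different Federated principal, which is useful during an identity provider migration:

# Rough illustration of an IAM Role trust policy (expressed here as a Python dict)
# that trusts two OIDC Providers. All identifiers are placeholders.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.spire.example.org"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                "oidc.spire.example.org:sub": "spiffe://example.org/ns/team-a-prod/sa/orders-service",
            }},
        },
        {
            # Second provider, e.g., a legacy IdP kept during a migration.
            "Effect": "Allow",
            "Principal": {"Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.legacy-idp.example.org"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                "oidc.legacy-idp.example.org:sub": "spiffe://example.org/ns/team-a-prod/sa/orders-service",
            }},
        },
    ],
}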

Size Limits

AWS IAM Roles have a limitation on the size of the trust policy, and Confluent has a limit on the size of the filter. Work with the vendors to understand the hard and soft limits for your company. It is better to know these limits ahead of time, as they can influence the design of the trust policy and the workload identity format itself.

SDK and Standards Maturity

AWS has a mature and well-documented credential provider chain. It walks a developer through the SDK configuration needed so that the OAuth JWT is automatically located and used in a call to AssumeRoleWithWebIdentity inside the client application. A few properly configured environment variables and a file containing the JWT are all the AWS SDK needs to automatically exchange it for temporary STS credentials via role assumption. No additional logic is needed when the file containing the JWT is refreshed.
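A minimal sketch of what that looks like from the application’s point of view, assuming a sidecar keeps a fresh JWT SVID at the path below (the path, role ARN, region, and bucket name are illustrative):

import os
import boto3

# Typically these are set in the pod spec rather than in code; shown here for clarity.
os.environ.setdefault("AWS_WEB_IDENTITY_TOKEN_FILE", "/run/spiffe/jwt_svid.token")
os.environ.setdefault("AWS_ROLE_ARN", "arn:aws:iam::111122223333:role/orders-service")
os.environ.setdefault("AWS_ROLE_SESSION_NAME", "orders-service")

# The default credential provider chain notices the variables above, calls
# AssumeRoleWithWebIdentity, and refreshes credentials as the token file rotates.
s3 = boto3.client("s3", region_name="us-east-1")
for obj in s3.list_objects_v2(Bucket="example-restricted-bucket").get("Contents", []):
    print(obj["Key"])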

Confluent Kafka Simple Authentication and Security Layer (SASL) libraries, by contrast, provide interfaces that must be implemented separately in each language for the JWT to be located, refreshed, and made available for use.
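For example, with Confluent’s Python client the application has to supply a callback that returns the current token and its expiry. A sketch, assuming the SPIFFE Helper keeps the JWT at the path below (path, broker address, and refresh window are illustrative):

import time
from confluent_kafka import Producer

JWT_SVID_PATH = "/run/spiffe/jwt_svid.token"   # written and refreshed by the SPIFFE Helper sidecar

def fetch_jwt_svid(oauth_config):
    # Called by the client whenever it needs a (token, expiry) pair for SASL/OAUTHBEARER.
    with open(JWT_SVID_PATH) as f:
        token = f.read().strip()
    # Returning a short expiry forces the client to call back in and pick up refreshed tokens;
    # a fuller implementation would read the exp claim from the token itself.
    return token, time.time() + 300

producer = Producer({
    "bootstrap.servers": "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "OAUTHBEARER",
    "oauth_cb": fetch_jwt_svid,
})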

The biggest issue we’ve faced so far in our journey was the least expected: CredentialComposer plugin serializes integer claims as float · Issue #4982 · spiffe/spire · GitHub. The SPIRE credential composer plugin converts timestamp fields from integer to float, which led AWS STS to reject the JWT due to an invalid data type for the iat and exp claims. Confluent, on the other hand, had no problem validating and verifying the JWT. The JWT spec defines timestamps as numeric, and both integer and float are valid types. We got stuck between poor data type handling in SPIRE and AWS STS’s reluctance to fix the issue on their end and bring their JWT validation up to spec. A tactical fix was pushed by Indeed so that SPIRE JWT SVIDs are accepted by AWS.

Conclusion

Adopting SPIRE as your OIDC Provider with major cloud vendors allows you to specify identities independently of vendor-specific naming schemes and manage them centrally. This approach provides a consistent view of each workload, benefiting compliance, governance, and auditing efforts within the company.

If you are pushing for the latest and greatest in SPIRE architecture and security standards, be prepared to bridge gaps in SPIRE or the federating system. While no system is perfect, the problems SPIRE already solves, it solves well. A highly available SPIRE deployment as an OIDC provider is a road less traveled, and we are excited to make things better wherever we can and share our learnings for everyone’s benefit. We hope this guide accelerates your journey toward secure workload identity in your organization.

The Importance of Using a Composite Metric to Measure Performance

A still image depicting a page loading evenly over four seconds

In the past, Indeed has used a variety of metrics to evaluate our client-side performance, but we’ve tended to focus on one at a time. Traditionally, we chose a single performance metric and used it as the measuring stick for whether we were improving or degrading the user experience. 

This made it simple to track performance because we only needed to instrument and monitor a single datapoint. Technical and non-technical consumers could easily parse this information and understand how we were doing as an organization.

However, this type of thinking also brought about significant drawbacks that, in many cases, resulted in degraded overall performance and wasted effort. This post examines those drawbacks and suggests that a “composite metric” lets us much better capture what our users are experiencing.

Past Performance Measurements

Below we look at a few metrics we’ve used to try and understand client-side performance, attempting to answer the following questions:

“When did the main JavaScript for the page execute?” —  JSV Delay

One of the earliest metrics widely used at Indeed was “JSV delay” (JavaScript Verification Delay) which measured the point at which JavaScript loaded, parsed, and began to execute. It was instrumented as a client-side network request which marked the time at which our main JavaScript began to execute. 

This metric was helpful in measuring whether we were degrading the experience by adding extra JS, or by adding content before the JS bundle, since either would slow down JSV Delay. Over time, this measurement was widely adopted but suffered from significant issues:

  • Failure to capture performance impact of third party content (Google Analytics, Micro Frontends, etc)
  • Inability to measure what a user was actually experiencing: even if the JS had loaded, the page wasn’t necessarily usable at that point, and the time to usability wasn’t being measured
  • Bespoke implementation of the metric meant we were not uniformly measuring performance across our pages; JSV Delay meant something different from one page to another
  • Because the metric was only a standard inside Indeed, no one really knew what it meant; we were continually explaining the metric, its advantages, and its downsides

“When did all critical CSS and JavaScript Load?” — domContentLoadEnd

After deciding JSV Delay was no longer serving our needs, we adopted a metric more broadly used in the software industry. domContentLoadEnd is defined as:

when the HTML document has been completely parsed, and all deferred scripts… have downloaded and executed. It doesn’t wait for other things like images, subframes, and async scripts to finish loading.

In layman’s terms, we can interpret domContentLoadEnd as a more generalized JSV Delay: it fires only after critical HTML, CSS, and JavaScript have loaded. This gave us a much better idea of how the page as a whole was performing, and it was no longer a custom metric, which reduced confusion and ensured that we were uniformly measuring performance across all of our pages. However, this metric too came with significant issues:

  • domContentLoadEnd doesn’t capture async scripts, which means it misses out on significant portions of the page
  • Similar to JSV Delay, the fact that much of the code had loaded didn’t necessarily mean the page was interactive
  • For some pages, domContentLoadEnd could trigger for entirely blank pages (e.g., single page applications).

“When did users see the most important content on the page?” — largestContentfulPaint

Our last usage of “a single metric to explain performance” was largestContentfulPaint (LCP), which was a big step forward for us: it was our first adoption of a Google-recommended metric, created to measure an ever-evolving web landscape.

This allowed us to, for the first time, use a metric that captured “perceived performance,” rather than a more arbitrary datapoint from a browser API. By using LCP, we were making a conscious choice to measure the actual user experience, which was a big step in the right direction. 

Because of Indeed’s usage of server-side rendering on high-traffic job search pages, where HTML is immediately visible to users on initial page load, LCP corresponded to the moment when users first saw job cards, the job description, and other critical content. The faster we show users content, the more time we save them and the more delightful the experience.

Again, however, this measurement came with significant issues:

  • LCP is not supported on iOS and other legacy browsers, which means we fail to capture this metric on a large percentage of our page loads, users, etc. 
  • Although users can see the critical content, it probably isn’t yet interactive.
  • LCP is a web-based metric, only collectible in web browsers, and thus excludes native applications. 

Differing Page Loads 

The lifecycle of a page is complex — from a technical perspective, a lot happens between the initial navigation to a page and when a user begins interacting with its content. The core problem with using a single metric to understand this complex workflow is that it removes much of the context which is necessary in understanding “how the user perceived the page load”. 

Let’s consider the following diagram:

Animated timeline showing a page loading evenly over four seconds

Here we see a standard page which takes 4 seconds to load. To start, the job seeker sees a blank page for 1 second; a second later they see a header and a loading indicator. A second after that they see the main content of the page (LCP), and a second later the page is fully interactive. Now let’s take a look at the next diagram:

Animated timeline showing a page loading four seconds, with the first three changes happening more quickly

Here we see the same page loading, but the main content of the page appears much quicker! But then we wait 2.5 seconds for the page to become interactive. If we were using a single metric, say LCP, we would believe the second page is much faster. In reality, users would experience a lot of frustration waiting for the page to become interactive.

Finally, let’s look at this scenario: 

Animated timeline showing a page loading four seconds, with the last three changes happening quickly near the end of the four seconds

Here we see that the page still takes 4 seconds to load, but users don’t see any content until the last second. It’s pretty intuitive that this is a poor experience: for most of that time we’re looking at a blank page, and we don’t even know if it’s working or loading at all. Again, if we chose a single metric, we wouldn’t be capturing the actual perceived experience of the page load. What if we improved the time to initial content from 3.5 seconds to 2, while total loading time stayed the same? The user would feel that the page is faster, but we wouldn’t be capturing that improvement.

The Single Metric Problem

As we can see from the above, the lifecycle of a page can be highly variable, where small changes can have big impacts on how users perceive performance. When we look back on our historical performance measurements which utilized the “single metric approach”, we see two fundamental issues:

One metric can’t capture perceived performance

Holistic performance cannot be captured by a single metric — as depicted in the diagrams above, there is no single point in a page load which measures how quickly a user becomes engaged with content. 

There are thousands (or an infinite number?) of ways to build a web page, and each brings its own trade-offs when it comes to performance.

For pages that don’t implement server-side rendering (SSR), if we chose to only measure firstContentfulPaint, we would be measuring a datapoint which has effectively no value (since this metric would capture when the first blank page was rendered). 

For single page applications, if we chose to measure only time to interactive (TTI), we would be ignoring how quickly users saw initial content and how quickly they could begin to interact with the page, because although TTI is an important indicator, it fails to precisely capture when a page is truly interactive.

Another problem with using a single metric is that our pages change over time, and as a result, so does how users perceive the performance of a page. Using the above examples, what if an application went from a server-side rendered approach to a client-side rendered approach? If we stuck with the same performance measurement, say TTI, we would think we had hurt performance, when in reality we would now be showing content much sooner to the user, with the trade-off of a negligible impact to TTI. Overall, the perceived page performance would be drastically improved, but we would fail to measure it.

From a business and organizational perspective, that’s an observability gap with profound implications for the way we spend our time and effort.

Improving one metric often degrades another

The second, and perhaps more significant, issue with using a single metric to measure speed is that it often results in degraded performance without us realizing it.

The easiest way to improve performance is to ship fewer bytes and render less content overall. In reality, that’s not always a decision we can make for the business. So as we try to improve performance, we often end up in situations where we’re able to improve a single metric, but it either has no bearing on holistic performance or actually hurts it!

Let’s take a look at a new diagram (depicted below):

Animated timeline showing a page loading four seconds, with the page becoming progressively more useful over the four seconds

Here we see that our page begins loading normally; at the 2-second mark we have our main content and the page is interactive. At this point our users can perform their primary goal on the page (let’s say applying for a job). At the 3-second mark more content pops in, and finally, a second later, all content is visible on the page. This is a common loading pattern for async or client-side rendered applications (e.g., single page apps).

Ideally, what we’d like to do is shift each of these frames to the left, improving the perceived performance of each step. However, if we were only measuring time to interactive, which occurs in frame 4, we would completely disregard the most important part of the page load, which is “how quickly can we make the main content of our page visible and interactive” (frame 2). Similarly, if we only measured LCP (which occurs in frame 2), we would be disregarding TTI, which is when all of the content is finally visible.

In this example, we can see that no single metric captures the true performance of the page, but rather it’s a collection of metrics which help us understand the true perceived performance. 

Perceived performance is very dependent on how quickly the page loads, but perhaps more importantly, on how it loads.

Using a Composite Metric: LightHouse Explained

Finally, this brings us to the use of a “composite metric,” a term used in statistics that simply means “a single measurement based on multiple metrics.” With a LightHouse score we’re able to derive a single score based on 5 data points, each of which represents a different aspect of a page load.

These data points are:

A table showing the different metrics in the composite LightHouse score, and how they're weighted

For brevity, we won’t go into detail on each data point; you can read more about these page markers here. At a high level, industry experts have agreed upon these 5 markers and weighted them according to how much they contribute to a user perceiving a page as fast and responsive.

As is hopefully evident based on the explanations above, the purpose of using these 5 data points is to best capture the holistic perceived performance. We weight LCP, total blocking time (TBT), and cumulative layout shift the highest because we believe these are the most important indicators of speed. FCP and speedIndex are contributors but less significant overall. 

During each page load, we’re able to calculate all of these metrics and use an algorithm to determine a single score. Page loads that receive a score >= 90 are considered “fast and responsive”; scores below 90 are in need of improvement.
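As a rough sketch of the idea (the weights below follow the published Lighthouse v10 weighting, which is an assumption and may differ from the table above, and each per-metric score is assumed to already be normalized to a 0-100 scale, which LightHouse does with log-normal scoring curves):

# Rough sketch of a composite score. Weights follow the published Lighthouse v10
# weighting (an assumption), and each per-metric score is assumed to be pre-normalized to 0-100.
WEIGHTS = {
    "first_contentful_paint": 0.10,
    "speed_index": 0.10,
    "largest_contentful_paint": 0.25,
    "total_blocking_time": 0.30,
    "cumulative_layout_shift": 0.25,
}

def composite_score(metric_scores: dict) -> float:
    # Weighted average of the five per-metric scores.
    return sum(WEIGHTS[name] * metric_scores[name] for name in WEIGHTS)

# Hypothetical page load: strong paint metrics, but heavy main-thread blocking.
page_load = {
    "first_contentful_paint": 95,
    "speed_index": 90,
    "largest_contentful_paint": 88,
    "total_blocking_time": 55,
    "cumulative_layout_shift": 92,
}
print(composite_score(page_load))  # 80.0 -> below the 90 "fast and responsive" bar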

Composite Metrics in Action

If we use the same page load diagram from above, we can imagine how using a composite metric allows us to fully capture performance for our users.

A still image depicting a page loading evenly over four seconds

Let’s run through a few scenarios: 

If we ended up shipping a change which improved FCP and LCP (frames 1 and 2), and did no harm to frames 3 and 4, we would see an improvement to our overall LightHouse score.

If we ended up shipping a change which improved FCP and LCP (frames 1 and 2), but degraded frames 3 and 4, we would see no improvement to our overall LightHouse score.

If we ended up shipping a change which improved FCP, but degraded frames 2, 3, and 4, we would see an overall degradation that we would have missed if we were monitoring only a single metric.

Why Can’t We Simply Use “Time to Interactive” (TTI)? 

This is a common question within the performance realm, so I wanted to address it here and explain how it relates to composite metrics.

First, what is TTI? The most common definition is as follows: 

TTI is a performance metric that measures a page’s load responsiveness and helps identify situations where a page looks interactive but actually isn’t. TTI measures the earliest time after First Contentful Paint (FCP) when the page is reliably ready for user interactivity.

This sounds great, so why not just use this? Isn’t the most important thing for performance when the page is interactive? 

Like all things in software, there’s nuance and tradeoffs. Let’s look at the pros and cons:

Pros:

  • A single metric which estimates how long the overall page took to become usable

Cons:

  • TTI is no longer recommended, and has been taken out of LightHouse calculations because it’s not believed to be an accurate metric across a wide variety of page load types (CSR, SSR, etc).
  • TTI is an estimation based on network activity and DOM mutations, not an actual marker of page completion.
  • Because TTI is just a single metric, it suffers from “the single metric problem” which is explained above.

My point here isn’t that TTI is bad, but rather that it’s an incomplete way of looking at performance. TTI is a useful indicator, but it’s only meaningful in the context of our other metrics (FCP, LCP, etc). TTI’s main purpose is to provide a corroborating metric, rather than to explain performance overall.

As an organization, we can imagine hundreds of ways to improve TTI without actually improving the most critical aspects of perceived performance. Additionally, we can imagine ways which improve TTI that actually hurt the earlier marks of a page load, which may result in degraded performance overall. 

Conclusions 

My hope for readers who have made it this far is that we now have a more nuanced understanding of how we can measure client-side performance. With the advent of the web, we developed metrics which helped us figure out how fast static pages were loading; as the web advanced (thanks a lot, jQuery!), so too have our measurements.

Based on the past ~4 years of deep investment in performance improvements at Indeed, I believe these are my most important takeaways: 

  • Use a composite metric, but be willing to change the underlying internal metrics.
  • Be wary of the silver bullet — metrics or tools that purport to capture everything you need nearly always don’t. 
  • Technology changes, and we need to change how we measure performance as a result.
  • Corroborate your speed metrics with how your page actually loads and ensure they represent what users are experiencing.

SHAP Plots: The Crystal Ball for UI Test Ideas


Have you ever wanted a crystal ball that would predict the best A/B test to boost your product’s growth, or identify which part of your UI drives a target metric?

With a statistical model and a SHAP decision plot, you can identify impactful A/B test ideas in bulk. The Indeed Interview team used this methodology to generate optimal A/B tests, leading to a 5-10% increase in key business metrics.

Case study: Increasing interview invites

Indeed Interview aims to make interviewing as seamless as possible for job seekers and employers. The Indeed Interview team has one goal: to increase the number of interviews happening on the platform. For this case study, we wanted UI test ideas that would help us boost the number of invitations sent by employers. To do this, we needed to analyze employer behavior on the dashboard and try to predict interview invitations.

Employer using Indeed Interview to virtually interview a candidate.

Convert UI elements into features

The first step of understanding employer behavior was to create a dataset. We needed to predict the probability of sending interview invitations based on an employer’s clicks in the dashboard.

We organized the dataset so each cell represented the number of times an employer clicked a specific UI element. We then used these features to predict our targeted action: clicking the Set up interview button vs. not clicking on the button.

Set up interview button on the employer dashboard
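As a small sketch of that dataset construction (the event log and column names are hypothetical), each row is an employer, each column is a UI element, and each cell is a click count:

import pandas as pd

# Hypothetical click-event log: one row per (employer_id, ui_element) click.
events = pd.read_csv("dashboard_click_events.csv")

# One row per employer, one column per UI element, cells are click counts.
counts = pd.crosstab(events["employer_id"], events["ui_element"])

# Target: 1 if the employer ever clicked the "Set up interview" button, else 0.
y = (counts.pop("set_up_interview_button") > 0).astype(int)
X = counts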

Train the model on the target variable

The next step was to train a model to make predictions based on the dataset. We selected a tree-based model, CatBoost, due to its overall superior performance and its ability to detect interactions among features. And, like any model, it works effectively with our interpretation tool, the SHAP plot.

We could have used correlation or logistic regression coefficients, but we chose a SHAP plot combined with a tree-based model because it provides unique advantages for model interpretation tasks. Two features with similar correlation coefficients can have dramatically different interpretations in a SHAP plot, which factors in feature importance. In addition, a tree-based model usually performs better than logistic regression, leading to a more accurate model. Using a SHAP plot combined with a tree-based model provides both performance and interpretability.
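A minimal sketch of the training and plotting steps, reusing X and y from the dataset sketch above (hyperparameters are illustrative):

import shap
from catboost import CatBoostClassifier

# Train the tree-based model on per-employer click counts.
model = CatBoostClassifier(iterations=500, depth=6, verbose=0)
model.fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles like CatBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Beeswarm summary plot: one row per feature, ranked by importance, each dot
# colored by feature value (red = many clicks, blue = few clicks).
shap.summary_plot(shap_values, X)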

Interpret SHAP results into positive and negative predictors

Now that we have a dataset and trained model, we can interpret the SHAP plot generated from it. SHAP works by showing how much a certain feature can change the prediction value. In the SHAP plot below, each row is a feature, and the features are ranked based on descending importance: the ones at the top are the most important and have the highest influence (positive or negative) on our targeted action of clicking Set up interview.

The data for each feature is displayed with colors representing the scale of the feature. A red dot on the plot means the employer clicked a given UI element many times, and a blue dot means the employer clicked it only a few times. Each dot also has a SHAP value on the X axis, which signifies the type of influence, positive or negative, that the feature has on the target and the strength of its impact. The farther a dot is from the center, the stronger the influence.

SHAP plot displaying features A-O ranked by descending influence on the model (regardless of positive or negative). Each feature has red and blue dots (feature value) organized by SHAP value (impact on model output). Features outlined in red: A, B, D, F, H, I, K, L, and N. Features outlined in blue: E, G, M, and O.

SHAP plot with features outlined in red for positive predictors, and blue for negative predictors

Based on the color and location of the dots, we categorized the features as positive or negative predictors.

  • Positive Predictor – A feature where red dots are to the right of the center.
    • They have positive SHAP value: usage of this feature predicts the employer will send an interview invitation.
    • In the SHAP plot above, Feature B is a good example.
  • Negative Predictor – A feature where red dots are to the left of the center.
    • They have negative SHAP value: usage of this feature predicts the employer will not send an interview invitation.
    • Feature G is a good example of this.

Red dots on both sides of the center are more complex and need further investigation, using tools such as dependence plots (also in the SHAP package).
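One simple way to turn the plot into a candidate list programmatically (a heuristic sketch, reusing X and shap_values from the training sketch above; the 0.3 threshold is arbitrary) is to correlate each feature’s values with its SHAP values: a clearly positive correlation suggests a positive predictor, and a clearly negative one a negative predictor.

import numpy as np

# Assumes shap_values is a 2-D array (samples x features), as TreeExplainer
# returns for a binary CatBoost model.
correlations = {
    feature: np.corrcoef(X[feature], shap_values[:, i])[0, 1]
    for i, feature in enumerate(X.columns)
}

positive_predictors = [f for f, c in correlations.items() if c > 0.3]
negative_predictors = [f for f, c in correlations.items() if c < -0.3]
print("Positive predictors:", positive_predictors)
print("Negative predictors:", negative_predictors)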

Note that this relationship between feature and target is not causal yet. A model can only claim causality when it assumes all confounding variables have been included, which is a strong assumption. While the relationships could be causal, we don’t know for certain until they are verified in A/B tests.

Generate test ideas

Our SHAP plot contains 9 positive predictors and 4 negative predictors, and each one is a potential A/B test hypothesis of the relationship between the UI element and the target. We hypothesize that positive predictors boost target usage, and negative predictors hinder target usage.

To verify these hypotheses, we can test ways to make positive predictors more prominent and direct the employer’s attention to them. After the employer clicks on the feature, we can direct attention to the target in order to boost its usage. Another option is to test ways to divert the employer’s attention away from negative predictors. We can add good friction, making them harder to access, and see if usage of the target increases.

Boost positive predictors

We tested changes to the positive predictors from our SHAP plot to make them more prominent in our UI. We made Feature B more prominent on the dashboard, and directed the employer’s attention to it. After the employer clicked Feature B, we showed a redesigned UI with improved visuals to make the Set up interview button more attractive.

The result was a 6% increase in clicks to set up an interview.

Divert away from negative predictors

We also tested changes to the negative predictors from our SHAP plot in the hopes of increasing usage of the target. We ran a test to divert employer attention away from Feature G by placing it close to the Set up interview button on the dashboard. This way it was easier for the employer to choose setting up an interview instead.

This change boosted clicks to send interview invitations by 5%.

Gaze into your own crystal ball

A SHAP plot may not be an actual crystal ball. When used with a statistical model, however, it can generate UI A/B test ideas in bulk and boost target metrics for many products. You might find it especially suitable for products with a complex and nonlinear UI, such as user dashboards. The methodology also provides a glimpse of which UI elements drive the target metrics the most, allowing you to focus on testing features that have the most impact. So, what are you waiting for? Start using this method and good fortune will follow.

 

Cross-posted on Medium