How Indeed Replaced Its CI Platform with Gitlab CI

Here at Indeed, our mission is to help people get jobs. Indeed is the #1 job site in the world, with 580M+ Job Seeker Profiles. For Indeed’s Engineering Platform teams, we have a slightly different motto: “We help people to help people get jobs”. As part of a data-driven engineering culture that has spent the better part of two decades always putting the job seeker first, we are responsible for building the tools that not only make this possible, but empower engineers to deliver positive outcomes to job seekers every day.

Do you want to build a Jenkins snowman?

Like many large technology companies, our Continuous Integration (CI) platform was built organically as the company scaled. In fact, Indeed was using Hudson, Jenkins’ direct predecessor, back in 2007. At the time, Indeed had fewer than 20 engineers. Today, through nearly two decades of growth, we have thousands of engineers. We built our platform on top of the de facto open source and industry standard solutions available at the time. As new technology became available, we made incremental improvements, switching to Jenkins after Oracle bought Sun and caused the Jenkins/Hudson fork around 2011. Another improvement allowed us to move most of our workloads to dynamic cloud worker nodes using AWS EC2. As we entered the Kubernetes age, however, the system architecture reached its limits. Hudson was first released in 2005. In 2005, J2SE 5.0 was less than a year old. Java with generics was novel! AWS was not a thing. Clouds were made of water vapor, not servers and software defined networking.

Suffice it to say, Jenkins’ architecture was not created with the cloud in mind and could not have been, because the cloud did not yet exist. Jenkins operates by having a “controller” node, a single point of failure which runs critical parts of a pipeline and farms out certain steps to worker nodes (which can scale horizontally to some extent). Controllers are not only a single point of failure, they are also a manual scaling axis. If you have too many jobs to fit on one controller, you must partition your jobs across controllers manually. Cloudbees, the largest company offering Jenkins enterprise support, has some mitigations for this including the Cloudbees Jenkins Operations Center (CJOC), which allows you to manage your constellation of controllers from a single centralized place, but they remain challenging to run in a Kubernetes environment because each controller is a fragile single-point-of-failure. Activities like node rollouts or hardware failures cause downtime.

Follow the yellow brick road

Besides the technical limitations baked into Jenkins itself, our CI platform also had several problems of our own making. We used the Groovy Jenkins DSL to generate jobs from code checked into each repository – an industry best practice and the minimum necessary for sanity. However, these scripts were based on shared code using a library model rather than a template model: a large portion of the job logic was essentially copy-pasted into each project repository and only called out to shared modules for the common pieces.

This pattern had several drawbacks. Each project had its own copy-pasted version of the job pipeline, copied from the skeleton for that project type at creation time and then rarely, if ever, updated. This resulted in hundreds of different versions of our various pipelines existing at the same time, all depending on our shared library modules. That in turn made the shared modules extremely difficult to update without breaking pipelines. Testing changes against the wide variety of pipelines was an intractable challenge. Furthermore, modifying pipelines to adopt new features often required asking our users to manually update their own build code, since hundreds of divergent versions existed across the company, many with customizations implemented by the teams.

To understand why things were this way, it is important to understand that Indeed’s engineering culture includes a core value of flexibility. We accept that there are many valid ways to do something and different teams and products may have different optimal choices. Furthermore, being agile and data-driven often requires a degree of flexibility. We do not subscribe to a monorepo model and instead each project lives in its own repository (we have tens of thousands of repositories).

This flexibility serves us well in many contexts, but too much flexibility can be a double-edged sword. The inevitable result was that teams were spending an unacceptable portion of their time just addressing “platform asks” – our term for the regular maintenance required whenever we needed teams to modify their builds as we deployed new versions of our platform, moved resources to the cloud, or made other changes to our infrastructure. The flexibility we gave our users (other engineers at Indeed) meant we couldn’t easily make the changes for them. It was around the time we were looking to solve the hardware scaling and resiliency problems of Jenkins that we realized the scope and depth of the self-imposed technical debt in our build platform code. The solution came from the Golden Path pattern. Using this pattern, we could give our users the flexibility to do things their own way, while making it easy to choose the default way when possible and to modify only the parts of the path they really needed to, leveraging the shared path for the rest.

The CI Platform team at Indeed

The CI Platform team at Indeed is not very large. Our team of ~11 engineers supports thousands of users, fielding support requests, performing upgrades and maintenance, and enabling follow-the-sun support for our global company. 

Because our team supports not only Gitlab but the entire CI platform – including the artifact server, our shared build code, and multiple other custom components – we had our work cut out for us. We needed a plan to get where we were going that made the most efficient use of the resources we had.

A plan comes together

After a careful design review with key stakeholders, we successfully built consensus for the new CI Platform. We would migrate the entire company from Jenkins to Gitlab CI. The primary reasons for choosing Gitlab CI were:

  • Gitlab is a complete offering (already in use for SCM) which provides everything we need for CI
  • Gitlab CI is designed for scalability and the cloud
  • Gitlab CI enables us to write templates that extend other templates, which is compatible with our golden path strategy.

By the time we officially announced that the Gitlab CI Platform would be generally available to users, we already had 23% of all builds happening in Gitlab CI from a combination of grassroots efforts and early adopters wanting to switch ASAP. The challenge of the migration, however, would be the long tail. Due to the number of custom builds in Jenkins, an automated migration tool would not work for the majority of teams. Most of the benefits of the new system would not come until the old system was at 0%. Only then could we turn off the hardware and save the Cloudbees license fee.

Gitlab CI is Open Source Software

Another factor that influenced our decision-making process, and ended up being critical to our success, was that Gitlab itself is Open Source software. As a proof of concept, we ran a small project to contribute a change to Gitlab. We picked a few simple-looking bugs we had noticed (a Gitlab Geo issue and a template parsing bug) and submitted the fixes. Gitlab was massively supportive and helped us shepherd our changes through. This reduced uncertainty because we knew we could always fix our own issues if Gitlab was not able to prioritize fixing them for us.

This choice proved prescient the next year, when we discovered an unexpected behavior in the CI job runner that caused an internal security issue due to Indeed’s unique access configuration. Leveraging our experience contributing to Gitlab, we were able to immediately compile and run a fork of the Gitlab CI job runner to mitigate the issue. Meanwhile, we submitted our fix as an MR to Gitlab so they could understand the vulnerability and come up with an acceptable long-term fix. In the end we only had to run a fork for a few months, but that flexibility proved the value of choosing open source software.

Feature parity and the benefits of starting over

Though we support many different technologies at Indeed, the three most common languages are Java, Python, and JavaScript. These language stacks are used to build libraries, deployables (i.e., web services or applications), and cron jobs (processes that run at regular intervals, for example, to build a data set in our data lake). Together these formed a matrix of project types (Java Library, Python Cronjob, JavaScript Webapp, etc.) for which we had a skeleton in Jenkins. Therefore, we had to produce a golden path template in Gitlab CI for each of these project types. Most users could use these recommended paths unchanged, but for those who did require customization, the golden path would still be a valuable starting point, enabling them to change only what they needed while still benefiting from centralized template updates in the future.

We quickly realized that most users, even those with customizations, were happy to take the golden path and at least try it. If they missed their customizations, they could always add them back later. This was a surprising result! We thought that teams who had invested in significant customizations would be loath to give them up, but in the majority of cases teams just didn’t care about them anymore. This allowed us to migrate many projects very quickly – we could just drop the golden path (a small file about 6 lines long with includes) into their project, and they could take it from there.

InnerSource to the rescue

The CI Platform team also adopted a policy of “external contributions first” to encourage everyone in the company to participate. This is sometimes called InnerSource. We wrote tests and documentation to enable external contributions – contributions from outside our immediate team – so teams that wanted to write customizations could instead include them in the golden path behind a feature flag. This let them share their work with others and ensured we didn’t break their customizations moving forward (because they became part of our codebase, not theirs).

This also had the benefit that teams blocked waiting for a feature they needed were empowered to build it themselves. We could say, “we plan to implement the feature in a few weeks, but if you need it earlier than that we are happy to accept a contribution”. In the end, many core features necessary for parity were developed in this manner, more quickly, and often better, than our team had the resources to do alone. The migration would not have been a success without this model.

Ahead of schedule and under budget

Our Cloudbees license expired on April 1, 2024, which gave us an aggressive target for completing the full migration. It was particularly aggressive considering that, at the time, 80% of all builds (60% of all projects) still used Jenkins for their CI. This meant over 2000 Jenkinsfiles would still need to be rewritten or replaced with our golden path templates. The wide consensus was that this date could not be met and that an alternative (such as a smaller license engagement for the teams that still required Jenkins) would be needed. Nonetheless, we took the approach that one must aim for the stars to land on the moon. We made documentation and examples available, implemented features where possible, and helped our users contribute features where they were able.

We started regular office hours, where anyone could come and ask questions or seek our help to migrate. We additionally prioritized support questions relating to migration ahead of almost everything else. Our team became Gitlab CI experts and shared that expertise inside our team and across the organization.

Automatic migration for most projects was not possible, but we discovered it could work for a small subset of projects where customization was rare. We created a Sourcegraph batch change campaign to submit merge requests (MRs) migrating hundreds of projects, and poked and prodded our users to accept them. We took success stories from our users and shared them widely. As users contributed new features to our golden paths, we advertised that these features “came free” when you migrated to Gitlab CI. Some examples included built-in security and compliance scanning, Slack notifications for CI builds, and integrations with other internal systems.

We also conducted a campaign of aggressive “scream tests”. We automatically disabled Jenkins jobs that hadn’t run in a while or hadn’t succeeded in a while, telling users “if you need these, turn them back on, it is self-service”. This was a low-friction way to get some signal about what jobs were actually needed. We had thousands of jobs that hadn’t been run a single time since our last CI migration (which was Jenkins to Jenkins). This allowed us to know we could safely ignore almost all of them.

In January 2024, we nudged our users by announcing that all Jenkins controllers would become read-only (no builds) unless an exception was explicitly requested. We had much better ownership information for controllers than for jobs, and controllers generally aligned with our organization’s structure, so it made sense to focus on them. The list of controllers was also much more manageable than the list of jobs. The only thing we asked of users in order to obtain an exception was to find their controllers in a spreadsheet and put their contact information next to them. This gave us a guaranteed up-to-date list of stakeholders we could follow up with as we sprinted to the finish line, and also let users clearly say “we need these jobs, please don’t break them without talking to us”. At peak we had about 400 controllers; by January we were down to 220, and only 54 required exceptions (several of them owned by us, to run our tests and canaries).

With a list of ~50 teams to reach out to, we had an approachable list we could divide among our team so we could start understanding where each team stood. We spent January and February discovering that some teams planned to finish their migration without our help before February 28th, others were planning to deprecate their projects before then, and a very small number were genuinely worried they wouldn’t make it.

We were able to work with this smaller set of teams and provide them with “white-glove” service. We explained that while we lacked the project-specific expertise to do the migration for them, we could pair with a subject matter expert from their team. For some projects we wrote and they reviewed; for others they wrote and we reviewed. In the end, all of our work paid off and we turned off Jenkins on the very day we had announced 8 months earlier.

All’s well that ends well

At peak, our Jenkins CI platform ran over 14,000 pipelines per day and serviced thousands of projects. Today, our Gitlab CI platform has run over 40,000 pipelines in a single day and regularly runs over 25,000 per day. The incremental cost of each job in each pipeline is similar to Jenkins, but without the overhead of hardware to run the controllers – controllers that were single points of failure and scaling limiters, forcing us to artificially divide our platform into segments. While an apples-to-apples comparison is difficult, we find that with this overhead gone our CI hardware costs are 10-20% lower. Additionally, the support burden of Gitlab CI is lower since the application automatically scales in the cloud, has cross-availability-zone resiliency, and the templating language has excellent public documentation.

A benefit just as important, if not more so, is that we are now at over 70% adoption of our golden paths. This means that we can roll out an improvement and over 5000 projects at Indeed will benefit immediately, with no action required on their part. This has enabled us to move some jobs to more cost-effective ARM64 instances, keep users’ build images updated more easily, and better manage other cost saving opportunities. Most importantly, our users are happier with the new platform.

This post is long enough, so I will leave you with two of my favorite graphs of my entire career.

Acknowledgements

This migration would not have been possible without the tireless efforts of Tron Nedelea, Eddie Huang, Vivek Nynaru, Carlos Gonzalez, Lane Van Elderen, and the rest of the CI Platform team. The team also especially appreciates the leadership of Deepak Bitragunta, and Irina Tyree for helping secure buy-in, resources and company wide alignment throughout this long project. Finally, our thanks go out to everyone across Indeed who contributed code, feedback, bug reports, and helped migrate projects.

Secure Workload Identity with SPIRE and OIDC: A Guide for Kubernetes and Istio Users

Goal

This blog is for engineering teams, architects, and leaders responsible for defining and implementing a workload identity platform and access controls rooted in Zero Trust principles to mitigate the risks from compromised services. It is relevant for companies using Kubernetes to manage workloads, using Istio for service mesh, and aiming to define identities in a way that aligns with internal standards, free from platform-specific constraints. Specifically, we’ll discuss Indeed’s solution for third-party authentication, opinionated best practices, and challenges faced. It is not practical to share all the alternatives, trade-offs and engineering insights supporting our decisions; we want to share design choices and implementation details that can accelerate decision making and problem solving for others in similar situations.

Introduction

Passwords are a tale as old as ancient civilizations. Modern systems routinely rely on API key and ID pairs (analogous to usernames and passwords) to access other systems. In theory, these API keys are complex, managed by developers, and stored securely. The reality is more complicated. We have all heard stories of passwords hiding in plain sight, unencrypted: in code repositories, in log messages, in headers, in terminal history, wherever it’s convenient to just get the job done. Rotating old API keys can be even scarier. Who knows if keys have been shared, how many times they have been shared, and where they have all been shared? Did Alice delete the old API key? Was the new API key deployed everywhere!?

So what’s the solution? Step 1: Articulate and measure the problem. At Indeed, we embody our core value of being data-driven. Through our analysis, we recognized the risk posed by compromised credentials used by services. Our data revealed that half of our AWS IAM keys have access to some type of restricted data. We observed shared API keys being used across a wide range of our workloads. We discovered roughly eight times as many stored secrets as there are unique keys in all of our major authorization systems. This indicates a significant duplication of secrets, though we have not yet determined the exact scale of this duplication. Step 2: Implement a solution that works for Indeed’s heterogeneous workloads across third-party SaaS cloud vendors and Indeed’s own (first-party) apps.

Image showing API keys from a shared vault being used to access resources in multiple cloud providers

The starting point is to build an identity platform capable of provisioning temporary, verifiable, attestable, unique, and cryptographically secure workload credentials for access to third-party systems like Confluent Cloud and AWS, as well as first-party services. Indeed promotes responsible use of Open Source Software and dedicated platforms with clear responsibilities, leveraging industry standards to solve common problems. Our workload identity platform is built on SPIRE, embracing open standards like SPIFFE, OAuth 2.0, and OIDC to provide managed identities as x509 PKI certificates or JSON Web Tokens (JWTs).

SPIRE

SPIRE is an open source PKI project that has reached graduated status in the Cloud Native Computing Foundation. It is widely used in the industry and has a vibrant, active community of engineers. SPIRE can be deployed in a scalable and resilient manner, and it has been operating reliably at scale in production at Indeed for over a year now. SPIRE-issued x509 identities are used in our Istio service mesh for mTLS, and JWT identities are used to enable OIDC-based federated access to Confluent and AWS resources.

Istio Opinions

Adopting Istio to replace our legacy service mesh created conflicts with certain SPIRE configurations already in production.

SPIFFE Format

We debated the granularity and uniqueness of identities suitable to represent an Indeed application. In this context, identity refers to the subject, i.e., the SPIFFE ID of a workload. The discussion revolved around the SPIFFE ID template and its constituent parts, e.g.:

spiffe://<trust_domain>/<scheduling_platform>/<environment>/ns/<namespace>/sa/<service-account>

However, Istio is highly opinionated about the SPIFFE ID format a workload must have:

spiffe://<trust.domain>/ns/<namespace>/sa/<service-account>

An image showing a cautionary note on workload ID formatting from the Istio / SPIRE documentation

 This is a known problem that is still open with Istio: Customizing SPIFFE ID format if using an external SPIFFE-compliant SDS should be supported · Issue #43105 · istio/istio · GitHub

If you have a SPIRE deployment already in production with a different SPIFFE ID format for your Kubernetes workloads, be aware of Istio requirements. Updating the subject of your workloads is not trivial. While it’s only a configuration change in SPIRE, the subject likely appears wherever access control and authorization rules are defined for your workloads. 

SPIRE Agent Socket Name

Istio requires SPIRE Agent APIs be available on the /var/run/secrets/workload-spiffe-uds/socket Unix domain socket only—another (unnecessary) Istio opinion that affects the entirety of the platform and will require careful planning to accommodate. Since we already had SPIRE in production, we used K8s to mount our socket path to /var/run/secrets/workload-spiffe-uds and only had to update the file name from agent.socket to socket. We made the practical choice of temporarily disabling mTLS in the mesh and rolling out our SPIRE Agent socket name changes one cluster at a time, as it affected the proxy SDS (Secret Discovery Service) configuration as well. During this time, our mesh was only protected by the network perimeter behind the VPN. After both SPIRE Agent and service mesh SDS configuration were updated, mTLS was turned back on.

SPIRE Architecture

Topology and Trust Domain

At Indeed, we manage a single trust domain in SPIRE deployed in a nested topology. We run multiple SPIRE Servers in each Kubernetes cluster for redundancy. SPIRE Servers in each cluster have a common datastore for synchronization. There’s one root SPIRE CA deployed in a special cluster reserved for infrastructure services. All other Kubernetes clusters have their own intermediate SPIRE CAs with the root CA as their upstream authority.

An image showing an example of a nested SPIRE deployment

A and N represent cardinality; any number greater than 1 is suitable. The cardinality for M is the number of nodes in the cluster, as each node has its own instance of SPIRE Agent.

This topology is scalable, performant and resilient. A single Spire Server can go down in any cluster without any outage. All SPIRE Servers in a cluster going down only affects workloads in that cluster. Each SPIRE component in each cluster can be configured and tuned separately. SPIRE configures each Server with its own CA signing keys. That’s also desirable from a security perspective, as any compromised SPIRE Server private keys are not used elsewhere.

We use a unified trust domain for all our workloads in production and non-production environments (excluding local development). A single trust domain is easier to reason about and maintain. Namespace naming conventions at Indeed typically include the environment name in the namespace, and that provides sufficient logical separation from an operational and security perspective. E.g., we treat metrics from the spire-dev namespace differently from those from spire-prod. We help our developer teams understand that they can use variations in namespace and service account to create different permission boundaries for similar workloads in different environments.

SPIRE Performance and Deployment Tuning: Lessons from Production

Through our experience running various SPIRE components across a fleet of 3000 pods, we discovered some Kubernetes configurations that keep our platform stable even as nodes and pods come and go. These settings were also influenced by stress testing of our SPIRE platform by scheduling thousands of workloads in a limited amount of time and observing how our platform behaved during major upgrades. Here are some settings we recommend:

  1. Set the criticality of the SPIRE components to minimize eviction. priorityClassName: XXXX for SPIRE Server and Agent.
    • Kubernetes limits the number of pods per node (110 by default). We need to guarantee that the SPIRE Agent gets scheduled on each node, because it is a runtime requirement for all pods on that node. Secondly, we want to prevent preemption of core SPIRE components as much as possible. Without priorityClassName, Kubernetes defaults to a priority of zero or globalDefault. This setting must be set explicitly, and high enough, to ensure SPIRE Agents are scheduled on every node.
  2. Set resource request/limits for ephemeral storage for SPIRE Agent. We observed SPIRE Agent pod evictions related to disk pressure on the node. Our solution was to explicitly set both requests/limits to ephemeral-storage: XXXMi to prevent the SPIRE Agent from being evicted.
  3. Leverage vertical pod autoscaling (VPA) for SPIRE components (Servers, Registrars, and Agents). SPIRE runs in a myriad of clusters with unique and varying performance characteristics. Our performance testing revealed the CPU and memory upper bounds we can expect. But overallocation for the worst case is costly and inefficient. With VPA we are able to set CPU minAllowed to 15m, i.e., 0.015 CPU for SPIRE components! The max was based on observations during performance testing.
    • Note that updatePolicy for SPIRE Agents was set to updateMode: Initial. This is to prevent evictions from VPA updates. We made a conscious choice to minimize SPIRE Agent disruption from VPA changes and apply VPA policies during expected SPIRE Agent restarts due to node upgrades, scheduled deployments, etc.
    • updateMode: Auto is in use for all other SPIRE components.
  4. Since SPIRE Agents are configured as a DaemonSet, we also set our updateStrategy to type: RollingUpdate with rollingUpdate set to maxUnavailable: 5. This slows the rollout of SPIRE Agents in a cluster, but ensures that a large majority of the nodes in the cluster are being served by SPIRE at any given time.

SPIRE Signing Keys and KeyManager Configuration

If your workload requires a JWT SPIFFE Verifiable Identity Document (SVID), it is highly likely you’ll need a stable, predictable number of signing keys in use across all SPIRE Servers. It is important to note:

  1. Each Spire Server has a separate and unique x509 and JWT key pair for signing.
  2. The in-memory KeyManager results in new x509 and JWT signing keys generated upon every restart.
  3. SPIRE doesn’t offer the option of also using the SQL Datastore as a KeyManager.

We encountered issues using AWS EBS/EFS CSI persistent volumes and thus couldn’t use the disk KeyManager plugin. We helped enhance the built-in AWS KMS KeyManager plugin so there’s an option for a persistent key store that doesn’t rely on persistent volumes for Spire Server pods. We have found the AWS KMS KeyManager to be reliable.

Given M total SPIRE Servers, the number of JWT signing keys K in the JSON Web Key Sets is:  M <= K <= 2 * M. It is possible a SPIRE Server has an active JWT signing key that’s used for signing and verification and another unexpired key that’s used for verification only.

SPIRE as OAuth Identity Server

OIDC Discovery Provider

SPIRE can be integrated as an Identity Server in the OAuth flow. The use of SPIRE OIDC Discovery Provider further allows for federation based on SPIRE JWT SVIDs. We initially deployed the SPIRE OIDC Provider to all Spire Servers including Root and Intermediate CAs. Querying the Provider would return a varying number of public JWT signing keys! Our current strategy is to enable and serve the SPIRE OIDC Discovery Provider from the Root CAs only. We find the Root SPIRE CAs in a nested topology to be an accurate source for the full trust bundle (including all JWT signing keys being used in the entire SPIRE Server fleet).

CredentialComposer Plugin

SPIRE Server supports many customization plugins, including the CredentialComposer plugin, which can modify the claims in a JWT SVID as needed. At Indeed, we implement a custom plugin that looks up a workload’s metadata and translates it into additional claims. Our approach to federation with AWS is based on passing session tags using AssumeRoleWithWebIdentity. We tag AWS resources storing sensitive data and manage which workload has access to which tags in internal systems. The custom plugin looks up the appropriate session tags for a workload and adds them to the JWT SVID.

An image showing how a K8s workload uses SPIRE OAuth to access an S3 bucket with a custom JWT

The workload’s final access is the combination of the IAM Policy attached to the IAM Role and additional session tags the workload was granted. The IAM Role itself doesn’t need to be tagged.
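To make this concrete, here is a hedged sketch of what a decoded JWT SVID payload might look like after the composer adds session tags. The SPIFFE ID, issuer, audience, and tag names are illustrative rather than Indeed's actual values; the https://aws.amazon.com/tags structure follows the shape AWS documents for passing session tags via AssumeRoleWithWebIdentity.

```typescript
// Illustrative decoded JWT SVID payload (all values hypothetical).
// The standard claims come from SPIRE; the tags claim is added by our
// custom CredentialComposer plugin so STS can apply them as session tags.
const exampleJwtSvidPayload = {
  sub: "spiffe://example.org/ns/my-app-prod/sa/my-app", // workload SPIFFE ID
  aud: ["sts.amazonaws.com"],                           // audience expected by the federating system
  iss: "https://oidc.spire.example.org",                // SPIRE OIDC Discovery Provider endpoint
  iat: 1700000000,                                      // must remain an integer (see the STS issue discussed later)
  exp: 1700000300,
  // Session tags claim, in the shape AWS documents for AssumeRoleWithWebIdentity:
  "https://aws.amazon.com/tags": {
    principal_tags: {
      data_access: ["restricted-profile-data"],         // hypothetical tag looked up from workload metadata
    },
    transitive_tag_keys: ["data_access"],
  },
};
```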

The SPIFFE Helper utility runs as a sidecar to request, refresh, and store the JWT SVID at a fixed location on the workload pod.

Third-party Federation Using OIDC

At Indeed, popular Confluent and AWS technologies are used to store most of our critical data. Most of our workloads also access data in both clouds. It is important for us to implement federation with both successfully from the beginning. The details for enabling and configuring OIDC are well documented for both AWS and Confluent. Next we’ll cover how our experience differed for both vendors and lessons learned. It is fair to say that there were significant differences and nothing should be taken for granted, as you’ll see.

Opaque Limits on Keys Accepted in JWKS, and Too Many JWT Signing Keys

We discussed earlier that each Spire Server has its own unique JWT signing key pair and that the maximum number of signing keys is twice the number of SPIRE CA servers. One drawback of a nested topology scaled for fault tolerance is that there are many SPIRE CAs. So, given M SPIRE Servers per K8s cluster across N clusters, there can be up to 2 * M * N JWT signing keys in the JSON Web Key Set (JWKS).

In the early phases of development, we saw verification of the SPIRE JWT fail in both Confluent and AWS. Our proof of concept, which had used a single SPIRE CA server in a test trust domain, had worked. We investigated further and found that AWS accepts ~100 signing keys and Confluent only a handful. Neither documents the limit anywhere, which made the whole process more difficult. We were able to work with Confluent to increase the soft limit to something more reasonable. The AWS limit remains the same.

We have this issue open with the SPIRE community as well. SPIRE deployments of more than a few servers can create more keys in JWKS than OIDC federating system supports · Issue #4699 · spiffe/spire · GitHub 

While nested topologies are great for high availability, there’s a real risk that federation can fail based on arbitrary limits on signing keys supported by the federating system. SPIRE could benefit from a mechanism whereby the number of SPIRE instances can scale while the number of JWT signing keys stays fixed, i.e., the ability to logically group Spire Servers that use the same key material.

OIDC Configuration

When configuring the OIDC Provider in AWS, the thumbprint of the top-level certificate in the chain that signs the OIDC endpoint’s TLS certificate is required. Confluent doesn’t require any such configuration. Our SPIRE OIDC server endpoint has a certificate issued by Let’s Encrypt; Confluent implicitly trusts globally trusted CAs, while AWS requires that the thumbprint be set. This is challenging, as Let’s Encrypt recently truncated its chain and has also shortened the lifetime of the new top-level Intermediate CA. You must define a process or automation to update the OIDC configuration in AWS before the CA that signs the OIDC server’s own certificate rotates.

Note: this certificate is separate from the JWT signing key pair used to sign the JWT SVIDs and subsequently verify them.

Confluent Identity Pool vs. AWS IAM Role

In the context of OIDC, Confluent identity pools and AWS IAM roles are used for managing permissions, but have different implementations. We’ll look at some key differences.

Audience Claim

It is worth noting that AWS expects Audience(s) to be set per OIDC provider. Confluent expects the aud claim to be defined in each identity pool. The difference is that in AWS, the audience claim is tied to the Issuer relationship itself, so there’s no need for an audience check in the trust policy for an IAM Role. Confluent expects Identity Pool filters to explicitly verify the issuer, audience, etc.

Trusting OIDC Providers

A Confluent Identity Pool trusts a single OIDC Provider only. The Confluent documentation for identity pool may lead you to believe otherwise by supporting filter expressions like claims.iss in [“google”, “okta”], but an identity pool is bound to one OIDC Provider. AWS IAM Roles, on the other hand, rely on trust policies which can be configured to trust multiple OIDC Providers by repeating the principal block. This matters when thinking about migrating to new OIDC Providers or running multiple Identity Providers in your organization.

Size Limits

AWS IAM Roles have a limitation on the size of the trust policy, and Confluent has a limit on the size of the filter. Work with the vendor to understand the hard limits and soft limits for your company. It is better to know these limits ahead of time as that can influence the design of the trust policy and the workload identity format itself.

SDK and Standards Maturity

AWS has a mature and well documented credential provider chain. It walks a developer through what SDK configuration is needed so that the OAuth JWT will be automatically located and used in a call to AssumeRoleWithWebIdentity inside the client application. A few properly configured environment variables and a credential file containing the JWT are all that’s needed for the AWS SDK to automatically exchange it for an STS credential with Role assumption. No additional logic is needed when the credential file containing the JWT is automatically refreshed.
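As a rough illustration (not our exact setup), a Node.js workload using the AWS SDK for JavaScript v3 needs no credential code at all once the standard web-identity environment variables point at the JWT file that the SPIFFE Helper keeps refreshed. The region, bucket, and file path below are hypothetical.

```typescript
// Sketch: the default Node credential provider chain picks up
//   AWS_ROLE_ARN, AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_SESSION_NAME
// and exchanges the JWT for STS credentials via AssumeRoleWithWebIdentity.
// e.g. AWS_WEB_IDENTITY_TOKEN_FILE=/run/spiffe/jwt_svid.token (hypothetical path)
import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" }); // no explicit credentials needed

export async function listReports(): Promise<void> {
  const result = await s3.send(
    new ListObjectsV2Command({ Bucket: "example-restricted-bucket" }) // hypothetical bucket
  );
  for (const object of result.Contents ?? []) {
    console.log(object.Key);
  }
}
```

When the SPIFFE Helper rewrites the token file, the SDK simply reads the new JWT on the next credential refresh, which is what makes this flow hands-off for application teams.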

Confluent Kafka Simple Authentication and Security Layer (SASL) libraries provide interfaces that have to be implemented in multiple languages for the JWT to be located, refreshed and made available for use.

The biggest issue we’ve faced so far in our journey was the least expected: CredentialComposer plugin serializes integer claims as float · Issue #4982 · spiffe/spire · GitHub. The SPIRE credential composer plugin converted timestamp fields from integer to float, which led AWS STS to reject the JWT due to an invalid data type for the iat and exp claims. Confluent, on the other hand, had no problem validating and verifying the JWT. The JWT spec defines timestamps as numeric types, and both integer and float are valid. We got stuck between poor data type handling in SPIRE and AWS STS’s reluctance to fix the issue on their end and bring their JWT validation up to spec. A tactical fix was contributed by Indeed so that SPIRE JWT SVIDs would be accepted by AWS.

Conclusion

Adopting SPIRE as your OIDC Provider with major cloud vendors allows you to specify identities independently of vendor-specific naming schemes and manage them centrally. This approach provides a consistent view of each workload, benefiting compliance, governance, and auditing efforts within the company.

If you are pushing for the latest and greatest in SPIRE architecture and security standards, be prepared to overcome gaps in SPIRE or the federating system. While no system is perfect, the problems SPIRE already solves, it solves well. A highly available SPIRE deployment as an OIDC provider is a road less traveled, and we are excited to make things better wherever we can and share our learnings for everyone’s benefit. We hope this guide accelerates your journey toward secure workload identity in your organization.

The Importance of Using a Composite Metric to Measure Performance

A still image depicting a page loading evenly over four seconds

In the past, Indeed has used a variety of metrics to evaluate our client-side performance, but we’ve tended to focus on one at a time. Traditionally, we chose a single performance metric and used it as the measuring stick for whether we were improving or degrading the user experience. 

This made it simple to track performance because we only needed to instrument and monitor a single datapoint. Technical and non-technical consumers could easily parse this information and understand how we were doing as an organization.

However, this type of thinking also brought about significant drawbacks that, in many cases, resulted in overall degraded performance and wasted effort. This post examines those drawbacks and suggests that using a “composite metric” enables us to much better measure what our users are experiencing.

Past Performance Measurements

Below we look at a few metrics we’ve used to try and understand client-side performance, attempting to answer the following questions:

“When did the main JavaScript for the page execute?” —  JSV Delay

One of the earliest metrics widely used at Indeed was “JSV delay” (JavaScript Verification Delay) which measured the point at which JavaScript loaded, parsed, and began to execute. It was instrumented as a client-side network request which marked the time at which our main JavaScript began to execute. 
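The original instrumentation isn't shown in this post, but a JSV-Delay-style marker boils down to something like the following sketch, placed at the top of the main bundle. The /timing endpoint and payload shape are hypothetical stand-ins for whatever collection system you use.

```typescript
// Sketch of a JSV-Delay-style custom metric: record how long after
// navigation start the main JavaScript bundle began executing, and
// report it to a (hypothetical) collection endpoint.
const jsvDelayMs = performance.now(); // ms since navigation start, measured as this bundle starts running

navigator.sendBeacon(
  "/timing", // hypothetical endpoint; the original used a client-side network request
  JSON.stringify({ metric: "jsv_delay", value: Math.round(jsvDelayMs) })
);
```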

This metric was helpful in measuring whether we were degrading the experience by adding extra JS, or by adding content before the JS bundle, since either resulted in slowdowns in JSV Delay. Over time, this measurement was widely adopted but suffered from significant issues:

  • Failure to capture the performance impact of third party content (Google Analytics, Micro Frontends, etc)
  • Inability to measure what a user was actually experiencing: even if the JS had loaded, the page wasn’t actually usable at that point, and the time to usability wasn’t being measured
  • Bespoke implementation of the metric meant we were not uniformly measuring performance across our pages: JSV Delay meant something different from one page to another
  • Because it’s only a standard inside Indeed, no one really knew what the metric meant; we were continually explaining the metric, its advantages, and its downsides

“When did all critical CSS and JavaScript Load?” — domContentLoadEnd

After deciding that JSV Delay was no longer serving our needs, we adopted a metric more broadly used in the software industry. domContentLoadEnd is defined as:

when the HTML document has been completely parsed, and all deferred scripts… have downloaded and executed. It doesn’t wait for other things like images, subframes, and async scripts to finish loading.

In layman’s terms, we can interpret domContentLoadEnd as a more generalized JSV Delay: it fires only after critical HTML, CSS, and JavaScript have loaded. This gave us a much better idea of how the page as a whole was performing, and it was no longer a custom metric, which reduced confusion and ensured that we were uniformly measuring performance across all of our pages. However, this metric too came with significant issues:

  • domContentLoadEnd doesn’t capture async scripts, which means it misses out on significant portions of the page
  • Similar to JSV Delay, the fact that much of the code had loaded didn’t necessarily mean the page was interactive
  • For some pages, domContentLoadEnd could fire while the page was still entirely blank (e.g., single page applications).
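For reference, browsers expose this moment through the Navigation Timing API as domContentLoadedEventEnd; a minimal collection sketch (reporting endpoint hypothetical) looks roughly like this:

```typescript
// Sketch: read the DOMContentLoaded timing from the Navigation Timing API
// once the page has finished loading.
window.addEventListener("load", () => {
  const [nav] = performance.getEntriesByType("navigation") as PerformanceNavigationTiming[];
  if (nav) {
    // Milliseconds from navigation start until deferred scripts finished executing.
    const dclMs = nav.domContentLoadedEventEnd;
    navigator.sendBeacon("/timing", JSON.stringify({ metric: "dcl_end", value: Math.round(dclMs) }));
  }
});
```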

“When did users see the most important content on the page?” — largestContentfulPaint

Our last usage of “a single metric to explain performance” was largestContentfulPaint (LCP). This was a big step forward for us because it was our first adoption of a Google-recommended metric, one created to measure an ever-evolving web landscape.

This allowed us to, for the first time, use a metric that captured “perceived performance,” rather than a more arbitrary datapoint from a browser API. By using LCP, we were making a conscious choice to measure the actual user experience, which was a big step in the right direction. 

Because of Indeed’s usage of server-side rendering on high-traffic job search pages, where HTML is immediately visible to users on initial page load, LCP corresponded to the moment where users first saw job cards, the job description, and other critical content. The faster we show our user content, the more time we save them, the more delightful the experience. 

Again, however, this measurement came with significant issues:

  • LCP is not supported on iOS and other legacy browsers, which means we fail to capture this metric on a large percentage of our page loads, users, etc. 
  • Although users can see the critical content, it probably isn’t yet interactive.
  • LCP is a web-based metric, only collectible in web browsers, and thus excludes native applications. 
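Collecting LCP in the field requires browser support for the largest-contentful-paint entry type, whose absence on iOS browsers is exactly the gap noted in the first point above. A minimal collection sketch (with a hypothetical /timing endpoint) looks like this:

```typescript
// Sketch: observe largest-contentful-paint entries and report the final
// value when the page is hidden (the LCP candidate can change until then).
let latestLcpMs = 0;

const observer = new PerformanceObserver((entryList) => {
  for (const entry of entryList.getEntries()) {
    latestLcpMs = entry.startTime; // time of the most recent LCP candidate
  }
});
observer.observe({ type: "largest-contentful-paint", buffered: true });

document.addEventListener("visibilitychange", () => {
  if (document.visibilityState === "hidden" && latestLcpMs > 0) {
    navigator.sendBeacon("/timing", JSON.stringify({ metric: "lcp", value: Math.round(latestLcpMs) }));
  }
});
```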

Differing Page Loads 

The lifecycle of a page is complex — from a technical perspective, a lot happens between the initial navigation to a page and when a user begins interacting with its content. The core problem with using a single metric to understand this complex workflow is that it removes much of the context necessary to understand how the user perceived the page load.

Let’s consider the following diagram:

Animated timeline showing a page loading evenly over four seconds

Here we see a standard page which takes 4 seconds to load. To start, the job seeker sees a blank page for 1 second; a second later they see a header and a loading indicator. 1 second later they see the main content of the page (LCP), and a second later the page is fully interactive. Now let’s take a look at the next diagram: 

Animated timeline showing a page loading four seconds, with the first three changes happening more quickly

Here we see the same page loading, but the main content of the page appears much more quickly! However, we then wait 2.5 seconds for the page to become interactive. If we were using a single metric, say LCP, we would believe the second page is much faster. In reality, users would be experiencing a lot of frustration waiting for the page to become interactive.

Finally, let’s look at this scenario: 

Animated timeline showing a page loading four seconds, with the last three changes happening quickly near the end of the four seconds

Here we see that the page is still taking 4 seconds to load but that users don’t see any content until the last second. It’s pretty intuitive that this is a poor experience, since much of the time we’re looking at a blank page, and we don’t even know if it’s working/loading at all. Again if we chose a single metric, we wouldn’t be capturing the actual perceived experience of the page load. What if we improved the time to seeing initial content to 2 seconds from 3.5, while total loading time stayed the same? The user would feel that the page is faster, but we wouldn’t be capturing that improvement. 

The Single Metric Problem

As we can see from the above, the lifecycle of a page can be highly variable, where small changes can have big impacts on how users perceive performance. When we look back on our historical performance measurements which utilized the “single metric approach”, we see two fundamental issues:

One metric can’t capture perceived performance

Holistic performance cannot be captured by a single metric — as depicted in the diagrams above, there is no single point in a page load which measures how quickly a user becomes engaged with content. 

There are thousands (or an infinite number?) of ways to build a web page, and each brings its own trade-offs when it comes to performance.

For pages that don’t implement server-side rendering (SSR), if we chose to only measure firstContentfulPaint, we would be measuring a datapoint which has effectively no value (since this metric would capture when the first blank page was rendered). 

For single page applications, if we chose to measure only time to interactive (TTI), we would be ignoring how quickly users saw initial content and how quickly they could begin engaging with it. And although TTI is an important indicator, it fails to precisely capture when a page is truly interactive.

Another problem with using a single metric is that our pages change over time, and as a result, so does how users perceive the performance of a page. Using the above examples, what if an application moved from a client-side rendered approach to a server-side rendered approach? If we stuck with the same performance measurement, say TTI, we would think we had hurt performance, when in reality we are now showing content much sooner to the user, with the tradeoff of a negligible impact to TTI. Overall the perceived page performance would be drastically improved, but we would fail to measure it.

From a business and organizational perspective, that’s an observability gap which has profound implications in the ways we spend our time, and effort. 

Improving one metric often degrades another

The second, and perhaps more significant issue with using a single metric to measure speed is that it often results in degraded performance without us realizing it. 

The easiest way to improve performance is to ship fewer bytes, and render less content overall. In reality, that’s not always a decision we can make for the business. So as we begin to try to improve performance, we often end up in situations where we’re able to improve a single metric but it either has no bearing on holistic performance, or it actually hurts it! 

Let’s take a look at a new diagram (depicted below):

Animated timeline showing a page loading four seconds, with the page becoming progressively more useful over the four seconds

Here we see that our page begins loading normally and at the 2 second mark we have our main content, and the page is interactive. At this point our users can perform their primary goal with the page (let’s say apply for a job for example). At the 3 second mark more content pops in, and finally a second later, all content is visible on the page. This is a common loading pattern for async, or client-side rendered applications (e.g., single page apps). 

Ideally, what we’d like to do is shift each of these frames to the left, improving the perceived performance of each step. However, if we were only measuring time to interactive, which occurs in frame 4, we would completely disregard the most important part of the page load: “how quickly can we make the main content of our page visible and interactive” (frame 2). Similarly, if we only measured LCP (which occurs in frame 2), we would be disregarding frame 4, which is when all of the content is finally visible and the page fully interactive.

In this example, we can see that no single metric captures the true performance of the page, but rather it’s a collection of metrics which help us understand the true perceived performance. 

Perceived performance depends heavily on how quickly the page loads, but perhaps more importantly, on how it loads.

Using a Composite Metric: LightHouse Explained

Finally, this brings us to the use of a “composite metric”, a term used in statistics that simply means “a single measurement based on multiple metrics”. With a LightHouse score we’re able to derive a single score based on 5 data points, each of which represents a different aspect of a page load.

These data points are:

A table showing the different metrics in the composite LightHouse score, and how they're weighted

For brevity, we won’t go into detail on each data point; you can read more about these page markers here. At a high level, industry experts have agreed upon these 5 markers and weighted them according to how much they contribute to a user perceiving a page as fast and responsive.

As is hopefully evident based on the explanations above, the purpose of using these 5 data points is to best capture the holistic perceived performance. We weight LCP, total blocking time (TBT), and cumulative layout shift the highest because we believe these are the most important indicators of speed. FCP and speedIndex are contributors but less significant overall. 

During each page load, we’re able to calculate all of these metrics and use an algorithm to determine a single score. Page loads that receive a score >= 90 are considered “fast and responsive”; scores below 90 indicate a need for improvement.
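To make the arithmetic concrete, here is a simplified sketch of how a weighted composite score comes together. Real LightHouse first maps each raw metric onto a 0 to 1 score using log-normal scoring curves before weighting; the weights below roughly match recent LightHouse versions and are illustrative, with the table above being the source of truth for what we use.

```typescript
// Simplified composite-score sketch. Real LightHouse first converts each raw
// metric into a 0-1 score using log-normal curves; here we assume those
// per-metric scores are already available.
interface MetricScores {
  fcp: number;        // 0-1 score for First Contentful Paint
  speedIndex: number; // 0-1 score for Speed Index
  lcp: number;        // 0-1 score for Largest Contentful Paint
  tbt: number;        // 0-1 score for Total Blocking Time
  cls: number;        // 0-1 score for Cumulative Layout Shift
}

// Weights roughly matching recent LightHouse versions (illustrative).
const WEIGHTS: MetricScores = { fcp: 0.10, speedIndex: 0.10, lcp: 0.25, tbt: 0.30, cls: 0.25 };

function compositeScore(scores: MetricScores): number {
  const weighted =
    scores.fcp * WEIGHTS.fcp +
    scores.speedIndex * WEIGHTS.speedIndex +
    scores.lcp * WEIGHTS.lcp +
    scores.tbt * WEIGHTS.tbt +
    scores.cls * WEIGHTS.cls;
  return Math.round(weighted * 100); // 0-100, where >= 90 is "fast and responsive"
}

// Example: strong LCP/TBT/CLS outweigh a mediocre FCP.
console.log(compositeScore({ fcp: 0.7, speedIndex: 0.8, lcp: 0.95, tbt: 0.9, cls: 1.0 })); // 91
```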

Composite Metrics in Action

If we use the same page load diagram from above, we can imagine how using a composite metric allows us to fully capture performance for our users.

A still image depicting a page loading evenly over four seconds

Let’s run through a few scenarios: 

If we ended up shipping a change which improved FCP and LCP (frames 1 and 2), and did no harm to frames 3 and 4, we would see an improvement to our overall LightHouse score.

If we ended up shipping a change which improved FCP and LCP (frames 1 and 2), but degraded frames 3 and 4, we would see no improvement to our overall LightHouse score.

If we shipped a change which improved FCP but degraded frames 2, 3, and 4, we would see an overall degradation, one we would have missed if we were monitoring only a single metric.

Why Can’t We Simply Use “Time to Interactive” (TTI)? 

This is a common question within the performance realm, so I wanted to address it here and explain how it relates to composite metrics.

First, what is TTI? The most common definition is as follows: 

TTI is a performance metric that measures a page’s load responsiveness and helps identify situations where a page looks interactive but actually isn’t. TTI measures the earliest time after First Contentful Paint (FCP) when the page is reliably ready for user interactivity.

This sounds great, so why not just use this? Isn’t the most important thing for performance when the page is interactive? 

Like all things in software, there’s nuance and tradeoffs. Let’s look at the pros and cons:

Pros:

  • A single metric which estimates how long the overall page took to become usable

Cons:

  • TTI is no longer recommended, and has been taken out of LightHouse calculations because it’s not believed to be an accurate metric across a wide variety of page load types (CSR, SSR, etc).
  • TTI is an estimation based on network activity, and DOM mutations, not an actual marker of page completion.
  • Because TTI is just a single metric, it suffers from “the single metric problem” which is explained above.

My point here isn’t that TTI is bad, but rather that it’s an incomplete way of looking at performance. TTI is a useful indicator, but it’s only meaningful if we look at it in the context of our other metrics (FCP, LCP, etc). TTI’s main purpose is to provide a corroborating metric, rather than to explain performance overall.

As an organization, we can imagine hundreds of ways to improve TTI without actually improving the most critical aspects of perceived performance. Additionally, we can imagine ways which improve TTI that actually hurt the earlier marks of a page load, which may result in degraded performance overall. 

Conclusions 

My hope for readers that have made it this far is that we now have a more nuanced understanding of how we can measure client-side performance. With the advent of the web we developed metrics which helped us figure out how fast static pages were loading — as the web advanced (thanks a lot jQuery!), so too have our measurements advanced.

Based on the past ~4 years of deep investment in performance improvements at Indeed, I believe these are my most important takeaways: 

  • Use a composite metric, but be willing to change the underlying internal metrics.
  • Be wary of the silver bullet — metrics or tools that purport to capture everything you need nearly always don’t. 
  • Technology changes, and we need to change how we measure performance as a result.
  • Corroborate your speed metrics with how your page actually loads, and ensure they represent what users are experiencing.