The Agentic Identity Journey

Posted on June 3, 2025 by Ken Adler

Every so often, the web changes in a way that rewires how we live.

In the early days, Web 1.0 let us read. It was a window into information — static pages, digital brochures, news sites. We were spectators peering into a new world.

Then came Web 2.0, and we learned to write. We didn’t just consume the web; we co-authored it. Blogs, social networks, wikis — suddenly, the line between audience and creator blurred.

Web 3.0 promised ownership. Decentralized networks and identities, blockchains, Bitcoin, NFTs.

And now, it’s happening again.

We’re moving into Web 4.0: the era of delegation.

Where humans don’t just do things — they delegate them. To agents. To software that not only responds to commands, but anticipates needs and takes action.

Web Revolution: From Read to Delegate

With hundreds of millions of monthly active users today, Indeed.com operates at an extraordinary scale.

As we look toward an agentic future, we’re not just preparing for more human users — we’re preparing for a surge of autonomous actors, including malicious agents, interacting across our platform.

It’s not just about knowing who or what is connecting — it’s about ensuring each has exactly the right level of access, no more and no less.

Traditional identity and access management implementations weren’t designed for this level of scale and nuance. To succeed, we need an Agentic IAM architecture that delivers rich authorizations, enables trustworthy delegation, and provides verifiable auditing — all while preserving the speed, resilience, and privacy our users count on.

This post is the first in a series… and is an invitation to follow that journey: the insights, the challenges, and the innovations shaping how we reimagine identity systems for the agentic era.

Ken Adler is a Technical Fellow and Director of Identity and Access Management at Indeed.

David McPike is a Principal Architect with Indeed’s Identity and Access Management team.

For more posts on this topic, visit AgenticIAM.AI.

Disclaimer: This post was crafted with a little help from AI (ChatGPT), but all insights and opinions are entirely my own. No AI was harmed in the making of this post.

How Indeed Replaced Its CI Platform with Gitlab CI

Posted on August 6, 2024 by Carl Myers

Here at Indeed, our mission is to help people get jobs. Indeed is the #1 job site in the world with over 580M Job Seeker Profiles. For Indeed’s Engineering Platform teams, we have a slightly different motto: “We help people to help people get jobs”. As part of a data-driven engineering culture that has spent the better part of two decades always putting the job seeker first, we are responsible for building the tools that not only make this possible, but empower engineers to deliver positive outcomes to job seekers every day.

Do you want to build a Jenkins snowman?

As at many large technology companies, our Continuous Integration (CI) platform was built organically as the company scaled. In fact, Indeed was using Hudson, Jenkins’ direct predecessor, back in 2007. At the time, Indeed had fewer than 20 engineers. Today, through nearly two decades of growth, we have thousands of engineers. We built our platform on top of the de facto open source and industry standard solutions available at the time. As new technology became available, we made incremental improvements, switching to Jenkins after Oracle bought Sun and caused the Jenkins/Hudson fork around 2011. Another improvement allowed us to move most of our workloads to dynamic cloud worker nodes using AWS EC2. As we entered the Kubernetes age, however, the system architecture reached its limits. Hudson was first released in 2005. In 2005, J2SE 5.0 was less than a year old. Java with generics was novel! AWS was not a thing. Clouds were made of water vapor, not servers and software-defined networking.

Suffice it to say, Jenkins’ architecture was not created with the cloud in mind and could not have been, because the cloud did not yet exist. Jenkins operates by having a “controller” node, a single point of failure which runs critical parts of a pipeline and farms out certain steps to worker nodes (which can scale horizontally to some extent). Controllers are not only a single point of failure, they are also a manual scaling axis. If you have too many jobs to fit on one controller, you must partition your jobs across controllers manually. Cloudbees, the largest company offering Jenkins enterprise support, has some mitigations for this including the Cloudbees Jenkins Operations Center (CJOC), which allows you to manage your constellation of controllers from a single centralized place, but they remain challenging to run in a Kubernetes environment because each controller is a fragile single-point-of-failure. Activities like node rollouts or hardware failures cause downtime.

Follow the yellow brick road

Besides the technical limitations baked into Jenkins itself, our CI platform also had several problems of our own making. We used the Groovy Jenkins DSL to generate jobs from code that was checked into each repository, an industry best practice and the minimum necessary for sanity. However, these scripts were based upon shared code using a library model, rather than a template model. This meant that a large portion of the job logic was essentially copy-pasted into each project repository and called out to shared modules only for common code.

This pattern had several drawbacks. Each project had its own copy-pasted version of the job pipeline, which was copied from the skeleton for that project type at the time of creation and then rarely, if ever, updated. This resulted in hundreds of different versions of our various pipelines all existing at the same time and depending upon our shared library modules. That in turn made them extremely difficult to update without breaking pipelines. Testing changes against the wide variety of pipelines was an intractable challenge. Furthermore, modifying pipelines to adopt new features often required asking our users to manually update their own build code, since hundreds of divergent versions existed across the company, many with customization implemented by the teams.

To understand why things were this way, it is important to understand that Indeed’s engineering culture includes a core value of flexibility. We accept that there are many valid ways to do something and different teams and products may have different optimal choices. Furthermore, being agile and data-driven often requires a degree of flexibility. We do not subscribe to a monorepo model and instead each project lives in its own repository (we have tens of thousands of repositories).

This flexibility serves us well in many contexts but unfortunately, too much flexibility can be a double-edged sword. The inevitable result of this balance was that teams were spending an unacceptable portion of their time just addressing “platform asks”. This is our term for regular maintenance that would come up when we needed teams to modify their build, as we deployed new versions of our platform, moved resources to the cloud, or made other changes to our infrastructure. The flexibility we gave our users (other engineers at Indeed) meant we couldn’t easily make the changes for them. It was around the time that we were looking to solve the hardware scaling and resiliency problems of Jenkins that we realized the scope and depth of our self-imposed technical debt for our build platform code. The solution came from the Golden Path pattern. Using this pattern, we could give our users the flexibility to do things their own way while still making sure it was easy to choose the default way when possible, and modify only the parts of the path they really needed to while leveraging the shared path as much as possible for the rest.

The CI Platform team at Indeed

The CI Platform team at Indeed is not very large. Our team of ~11 engineers supports thousands of users, fielding support requests, performing upgrades and maintenance, and enabling follow-the-sun support for our global company.

Because our team supports not only Gitlab but also the entire CI platform, including the artifact server, our shared build code, and multiple other custom components of our platform, we had our work cut out for us. We needed a plan to get where we were going that made the most efficient use of the resources we had.

A plan comes together

After a careful design review with key stakeholders, we successfully built consensus for the new CI Platform. We would migrate the entire company from Jenkins to Gitlab CI. The primary reasons for choosing Gitlab CI were:

- Gitlab is a complete offering (already in use for SCM) which provides everything we need for CI.
- Gitlab CI is designed for scalability and the cloud.
- Gitlab CI enables us to write templates that extend other templates, which is compatible with our golden path strategy.

By the time we officially announced that the Gitlab CI Platform would be generally available to users, we already had 23% of all builds happening in Gitlab CI from a combination of grassroots efforts and early adopters wanting to switch ASAP. The challenge of the migration, however, would be the long tail. Due to the number of custom builds in Jenkins, an automated migration tool would not work for the majority of teams. Most of the benefits of the new system would not come until the old system was at 0%. Only then could we turn off the hardware and save the Cloudbees license fee.

Gitlab CI is Open Source Software

Another factor that influenced our decision-making process and ended up being critical to our success was that Gitlab itself is Open Source software. As a proof of concept, we had a project to make a small change to Gitlab. We picked a few simple-looking bugs we had noticed (a Gitlab Geo issue and a template parsing bug) and submitted the fixes. Gitlab was massively supportive of this and helped us shepherd our changes through. This reduced uncertainty because we knew we could always fix our own issues if Gitlab was not able to prioritize fixing them for us.

This foresight would prove especially valuable the next year when we discovered an unexpected behavior in the CI job runner that caused an internal security issue due to Indeed’s unique access configuration. We were able to leverage our experience from contributing to Gitlab and compile and run a fork of the Gitlab CI job runner immediately to mitigate the issue. Meanwhile, we were able to submit the fork as an MR to Gitlab so they could understand the vulnerability and come up with an acceptable long-term fix. In the end we only had to run a fork for a few months, but that flexibility proved the value of choosing open source software.

Feature parity and the benefits of starting over

Though we support many different technologies at Indeed, the three most common languages are Java, Python, and Javascript. These language stacks are used to make libraries, deployables (i.e., web services or applications), and cron jobs (a process that runs at regular intervals, for example, to build a data set in our data lake). Together these formed a matrix of project types (Java Library, Python Cronjob, Javascript Webapp, etc.), and we had a skeleton in Jenkins for each of them. Therefore, we had to produce a golden path template in Gitlab CI for each of these project types. Most users could use these recommended paths without change, but for those who did require customization, the golden path would still be a valuable starting point and enable them to change only what they needed, while still benefiting from centralized template updates in the future.

We quickly realized that most users, even those with customizations, were happy to take the golden path and at least try it. If they missed their customizations, they could always add them later. This was a surprising result! We thought that teams who had invested in significant customization would be loath to give them up, but in the majority of cases teams just didn’t care about them anymore. This allowed us to migrate many projects very quickly – we could just drop the golden path (a small file about 6 lines long with includes) into their project, and they could take it from there.

InnerSource to the rescue

The CI Platform team also adopted a policy of “external contributions first” to encourage everyone in the company to participate. This is sometimes called InnerSource. We wrote tests and documentation to enable external contributions – contributions from outside our immediate team – so teams that wanted to write customizations could instead include them in the golden path behind a feature flag. This let them share their work with others and ensure we didn’t break them moving forward (because they became part of our codebase, not theirs).

This also had the benefit that particular teams who were blocked waiting for a feature they needed were empowered to work on the feature themselves. We could say “we plan to implement the feature in a few weeks, but if you need it earlier than that we are happy to accept a contribution”. In the end, many core features necessary for parity were developed in this manner, more quickly and better than our team had resources to do it. The migration would not have been a success without this model.

Ahead of schedule and under budget

Our Cloudbees license expired on April 1, 2024. This gave us an aggressive target to achieve the full migration. This was particularly aggressive considering at the time, 80% of all builds (60% of all projects) still used Jenkins for their CI. This meant over 2000 Jenkinsfiles would still need to be rewritten or replaced with our golden path templates. The wide consensus was that this date was extremely aggressive and an alternative (such as a smaller license engagement for the teams that still required Jenkins) would be needed. Nonetheless, we took the approach that one must aim for the stars to land on the moon. We made documentation and examples available, implemented features where possible, and helped our users contribute features where they were able.

We started regular office hours, where anyone could come and ask questions or seek our help to migrate. We additionally prioritized support questions relating to migration ahead of almost everything else. Our team became Gitlab CI experts and shared that expertise inside our team and across the organization.

Automatic migration for most projects was not possible, but we discovered it could work for a small subset of projects where customization was rare. We created a Sourcegraph batch change campaign to submit merge requests (MRs) to migrate hundreds of projects, and poked and prodded our users to accept these MRs. We took success stories from our users and shared them widely. As users contributed new features to our golden paths, we advertised that these features “came free” when you migrated to Gitlab CI. Some examples included built in security and compliance scanning, Slack notifications for CI builds, and integrations with other internal systems.

We also conducted a campaign of aggressive “scream tests”. We automatically disabled Jenkins jobs that hadn’t run in a while or hadn’t succeeded in a while, telling users “if you need these, turn them back on, it is self-service”. This was a low-friction way to get some signal about what jobs were actually needed. We had thousands of jobs that hadn’t been run a single time since our last CI migration (which was Jenkins to Jenkins). This allowed us to know we could safely ignore almost all of them.
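The selection rule behind a scream test can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling we ran: the `Job` record, the `scream_test_candidates` function, and the 90-day threshold are all hypothetical, and a real version would read job history from the CI server and disable the selected jobs through its API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical record of a CI job's recent history."""
    name: str
    last_run: Optional[datetime]      # None if the job has never run
    last_success: Optional[datetime]  # None if the job has never succeeded


def scream_test_candidates(jobs: List[Job], now: datetime,
                           max_idle: timedelta) -> List[str]:
    """Return names of jobs that have not run, or not succeeded, within max_idle."""
    cutoff = now - max_idle
    stale = []
    for job in jobs:
        no_recent_run = job.last_run is None or job.last_run < cutoff
        no_recent_success = job.last_success is None or job.last_success < cutoff
        if no_recent_run or no_recent_success:
            stale.append(job.name)
    return stale


# Made-up jobs and an assumed 90-day idle threshold, for illustration only.
now = datetime(2024, 1, 15)
jobs = [
    Job("active-build", now - timedelta(days=1), now - timedelta(days=1)),
    Job("always-failing", now - timedelta(days=2), now - timedelta(days=400)),
    Job("forgotten", None, None),
]
candidates = scream_test_candidates(jobs, now, max_idle=timedelta(days=90))
print(candidates)
```

Because re-enabling a job is self-service, a false positive from a rule like this costs an owner a few minutes, which is what makes an aggressive threshold tolerable.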

In January 2024, we nudged our users by announcing that all Jenkins controllers would become read-only (no builds) unless an exception was explicitly requested. We had much better ownership information for controllers and they generally aligned with our organization’s structure, so it made sense to focus on controllers rather than jobs. The list of controllers was also a much more manageable list than the list of jobs. The only thing we asked of our users in order to obtain an exception was to find their controllers in a spreadsheet and put their contact information next to them. This enabled us to get a guaranteed up-to-date list of stakeholders we could follow up with as we sprinted to the finish line, but also enabled users to clearly say “we need these jobs, please don’t break them without talking to us”. At peak we had about 400 controllers; by January we had 220, and only 54 of those required exceptions (several of them owned by us, to run our tests and canaries).

With a list of ~50 teams to reach out to, we had an approachable list we could divide among our team and start doing the work of understanding where they were at. We spent January and February discovering that some teams planned to finish their migration without our help before February 28th, others were planning to deprecate their projects before then, and a very small number were very worried they wouldn’t make it.

We were able to work with this smaller set of teams and provide them with “white-glove” service. We still explained that while we lacked the expertise necessary to do it for them, we could pair together with a subject matter expert from their team. For some projects we wrote and they reviewed, for others they wrote and we reviewed. In the end, all of our work paid off and we turned off Jenkins on the very day we had announced 8 months earlier.

All’s well that ends well

At peak, our Jenkins CI platform ran over 14,000 pipelines per day and serviced our thousands of projects. Today, our Gitlab CI platform has run over 40,000 pipelines in a single day and regularly runs over 25,000 per day. The incremental cost of each job of each pipeline is similar to Jenkins, but without the overhead of hardware to run the controllers. Additionally, these controllers served as single points of failure and scaling limiters that forced us to artificially divide our platform into segments. While an apples-to-apples comparison is difficult, we find that with this overhead gone our CI hardware costs are 10-20% lower. Additionally, the support burden of Gitlab CI is lower since the application automatically scales in the cloud, has cross-availability-zone resiliency, and the templating language has excellent public documentation available.

A benefit just as important, if not more so, is that we are now at over 70% adoption of our golden paths. This means that we can roll out an improvement and over 5000 projects at Indeed will benefit immediately with no action required on their part. This has enabled us to move some jobs to more cost-effective ARM64 instances, keep users’ build images updated more easily, and better manage other cost saving opportunities. Most importantly, our users are happier with the new platform.

This post is long enough, so I will leave you with two of my favorite graphs of my entire career.

Acknowledgements

This migration would not have been possible without the tireless efforts of Tron Nedelea, Eddie Huang, Vivek Nynaru, Carlos Gonzalez, Lane Van Elderen, and the rest of the CI Platform team. The team also especially appreciates the leadership of Deepak Bitragunta and Irina Tyree in helping secure buy-in, resources, and company-wide alignment throughout this long project. Finally, our thanks go out to everyone across Indeed who contributed code, feedback, and bug reports, and who helped migrate projects.

Secure Workload Identity with SPIRE and OIDC: A Guide for Kubernetes and Istio Users

Posted on July 3, 2024 by Nikhil Arora

Goal

This blog is for engineering teams, architects, and leaders responsible for defining and implementing a workload identity platform and access controls rooted in Zero Trust principles to mitigate the risks from compromised services. It is relevant for companies using Kubernetes to manage workloads, using Istio for service mesh, and aiming to define identities in a way that aligns with internal standards, free from platform-specific constraints. Specifically, we’ll discuss Indeed’s solution for third-party authentication, opinionated best practices, and challenges faced. It is not practical to share all the alternatives, trade-offs and engineering insights supporting our decisions; we want to share design choices and implementation details that can accelerate decision making and problem solving for others in similar situations.

Introduction

Passwords are a tale as old as ancient civilizations. Modern systems routinely rely on API key & ID pairs (analogous to usernames and passwords) to access other systems. These API keys in theory are complex, managed by developers, and stored securely. The reality is more complicated. We all have heard stories of passwords hiding in plain sight, unencrypted, in code repositories, in log messages, in headers, in terminal history, wherever it’s convenient to just get the job done. Rotating old API keys can be even scarier. Who knows if keys have been shared, how many times they have been shared, and everywhere they have been shared? Did Alice delete the old API key? Was the new API key deployed everywhere!?

So what’s the solution? Step 1: Articulate and measure the problem. At Indeed, we embody our core value of being data-driven. Through our analysis, we recognized the risk posed by compromised credentials used by services. Our data revealed that half of our AWS IAM keys have access to some type of restricted data. We observed shared API keys being used across a wide range of our workloads. We discovered roughly eight times as many stored secrets as there are unique keys in all of our major authorization systems. This indicates a significant duplication of secrets, though we have not yet determined the exact scale of this duplication. Step 2: Implement a solution that works for Indeed’s heterogeneous workloads across third-party SaaS cloud vendors and Indeed’s own (first-party) apps.

Image showing API keys from a shared vault being used to access resources in multiple cloud providers

The starting point is to build an identity platform capable of provisioning temporary, verifiable, attestable, unique, and cryptographically secure workload credentials for access to third-party systems like Confluent Cloud and AWS, and first-party services as well. Indeed promotes responsible use of Open Source Software and dedicated platforms with clear responsibilities leveraging industry standards to solve common problems. Our workload identity platform is built on SPIRE, embracing open standards like SPIFFE, OAuth 2.0 and OIDC to provide managed identities as x509 PKI certificates or JSON Web Tokens (JWTs).

SPIRE

SPIRE is a PKI project that has reached graduated status in the Cloud Native Computing Foundation (CNCF). SPIRE is open source, widely used in the industry and has a vibrant and active community of engineers. SPIRE can be deployed in a scalable and resilient manner and has been operating reliably at scale in production at Indeed for over a year now. SPIRE-issued x509 identities are used in our Istio service mesh for mTLS, and JWT identities are used to enable OIDC-based federated access with Confluent and AWS resources.

Istio Opinions

Adopting Istio to replace our legacy service mesh created conflicts with certain SPIRE configurations already in production.

SPIFFE Format

We debated the granularity and uniqueness of identities suitable to represent an Indeed application. In this context identity refers to the subject, i.e., the SPIFFE ID of a workload. The discussion revolved around the SPIFFE template and its constituent parts, e.g.:

spiffe://<trust_domain>/<scheduling_platform>/<environment>/ns/<namespace>/sa/<service-account>

However, Istio is highly opinionated about the SPIFFE ID format a workload must have:

spiffe://<trust.domain>/ns/<namespace>/sa/<service-account>

An image showing a cautionary note on workload ID formatting from the Istio / SPIRE documentation

This is a known problem that is still open with Istio: “Customizing SPIFFE ID format if using an external SPIFFE-compliant SDS should be supported” (Issue #43105, istio/istio on GitHub).

If you have a SPIRE deployment already in production with a different SPIFFE ID format for your Kubernetes workloads, be aware of Istio requirements. Updating the subject of your workloads is not trivial. While it’s only a configuration change in SPIRE, the subject likely appears wherever access control and authorization rules are defined for your workloads.
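One way to gauge the size of such a change is to check existing SPIFFE IDs against the Istio format quoted above. The following is a minimal sketch, not part of SPIRE or Istio: the regular expression is derived only from the `spiffe://<trust.domain>/ns/<namespace>/sa/<service-account>` template, and the names `parse_istio_spiffe_id` and `IstioIdentity`, as well as the example IDs, are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern for the Istio-style template:
# spiffe://<trust.domain>/ns/<namespace>/sa/<service-account>
_ISTIO_SPIFFE_ID = re.compile(
    r"^spiffe://(?P<trust_domain>[^/]+)"
    r"/ns/(?P<namespace>[^/]+)"
    r"/sa/(?P<service_account>[^/]+)$"
)


class IstioIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_istio_spiffe_id(spiffe_id: str) -> Optional[IstioIdentity]:
    """Return the parsed parts, or None if the ID does not fit the Istio template."""
    match = _ISTIO_SPIFFE_ID.match(spiffe_id)
    if match is None:
        return None
    return IstioIdentity(
        match["trust_domain"], match["namespace"], match["service_account"]
    )


# Hypothetical IDs: the first fits the Istio template, the second carries
# extra path segments (scheduling platform and environment) and does not.
compatible = parse_istio_spiffe_id("spiffe://example.org/ns/payments/sa/api")
incompatible = parse_istio_spiffe_id(
    "spiffe://example.org/k8s/prod/ns/payments/sa/api"
)
print(compatible, incompatible)
```

Running a check like this over the subjects referenced in authorization rules gives a rough list of the places that would need to change along with the SPIRE configuration.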

SPIRE Agent Socket Name

Istio requires SPIRE Agent APIs be available on the /var/run/secrets/workload-spiffe-uds/socket Unix domain socket only—another (unnecessary) Istio opinion that affects the entirety of the platform and will require careful planning to accommodate. Since we already had SPIRE in production, we used K8s to mount our socket path to /var/run/secrets/workload-spiffe-uds and only had to update the file name from agent.socket to socket. We made the practical choice of temporarily disabling mTLS in the mesh and rolling out our SPIRE Agent socket name changes one cluster at a time, as it affected the proxy SDS (Secret Discovery Service) configuration as well. During this time, our mesh was only protected by the network perimeter behind the VPN. After both SPIRE Agent and service mesh SDS configuration were updated, mTLS was turned back on.

SPIRE Architecture

Topology and Trust Domain

At Indeed, we manage a single trust domain in SPIRE deployed in a nested topology. We run multiple SPIRE Servers in each Kubernetes cluster for redundancy. SPIRE Servers in each cluster have a common datastore for synchronization. There’s one root SPIRE CA deployed in a special cluster reserved for infrastructure services. All other Kubernetes clusters have their own intermediate SPIRE CAs with the root CA as their upstream authority.

An image showing an example of a nested SPIRE deployment

A and N represent cardinality and any number greater than 1 is suitable. The cardinality for M is the number of nodes in the cluster, as each node has its own instance of SPIRE Agent.

This topology is scalable, performant and resilient. A single SPIRE Server can go down in any cluster without any outage. All SPIRE Servers in a cluster going down only affects workloads in that cluster. Each SPIRE component in each cluster can be configured and tuned separately. SPIRE configures each Server with its own CA signing keys. That’s also desirable from a security perspective, as any compromised SPIRE Server private keys are not used elsewhere.

We use a unified trust domain for all our workloads in production and non-production environments (excluding local development). A single trust domain is easier to reason about and maintain. Namespace naming conventions at Indeed typically include environment names in the namespace and that provides sufficient logical separation from an operational and security perspective. For example, we treat metrics from the spire–dev namespace differently from those from spire–prod. We help our developer teams understand that they can use variations in namespace and service account to create different permission boundaries for similar workloads in different environments.

SPIRE Performance and Deployment Tuning: Lessons from Production

Through our experience running various SPIRE components across a fleet of 3000 pods, we discovered some Kubernetes configurations that keep our platform stable even as nodes and pods come and go. These settings were also influenced by stress testing of our SPIRE platform by scheduling thousands of workloads in a limited amount of time and observing how our platform behaved during major upgrades. Here are some settings we recommend:

- Set the criticality of the SPIRE components to minimize eviction: “priorityClassName: XXXX” for SPIRE Server and Agent.
  - Kubernetes defaults to a maximum of 110 pods per node. We need to guarantee that the SPIRE Agent gets scheduled on each node, because it is a runtime requirement for all pods. Secondly, we want to prevent pre-emption of core SPIRE components as much as possible. Without priorityClassName, Kubernetes will default to a priority of zero or globalDefault. This setting must be set explicitly and high enough to ensure scheduling of SPIRE Agents on each node.
- Set resource requests/limits for ephemeral storage for SPIRE Agent. We observed SPIRE Agent pod evictions related to disk pressure on the node. Our solution was to explicitly set both “requests/limits” to “ephemeral-storage: XXXMi” to prevent the SPIRE Agent from being evicted.
- Leverage vertical pod autoscaling (VPA) for SPIRE components (Servers, Registrars, and Agents). SPIRE runs in a myriad of clusters with unique and varying performance characteristics. Our performance testing revealed the CPU and memory upper bounds we can expect, but overallocation for the worst case is costly and inefficient. With VPA we are able to set CPU “minAllowed” to “15m”, i.e., 0.015 CPU for SPIRE components! The max was based on observations during performance testing.
  - Note that “updatePolicy” for SPIRE Agents was set to “updateMode: Initial”. This is to prevent evictions from VPA updates. We made a conscious choice to minimize SPIRE Agent disruption from VPA changes and apply VPA policies during expected SPIRE Agent restarts due to node upgrades, scheduled deployments, etc.
  - “updateMode: Auto” is in use for all other SPIRE components.
- Since SPIRE Agents are deployed as a “DaemonSet”, we also set our “updateStrategy” to “type: RollingUpdate” with “rollingUpdate” set to “maxUnavailable: 5”. This slows the rollout of SPIRE Agents in a cluster but also ensures a large majority of the nodes in the cluster are being served by SPIRE as expected.
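To show where these settings live in the Kubernetes objects, here is a rough sketch that builds the relevant manifest fragments as Python dictionaries. It is not our actual configuration. The object and container names (`spire-agent`, `spire-server`), the `-vpa` name suffix, the use of a StatefulSet for the server, and the helper function names are assumptions. The `XXXX` and `XXXMi` values are the placeholders from the list above and need to be replaced with your own values. The field names themselves (`priorityClassName`, `ephemeral-storage`, `updateStrategy`, `updatePolicy.updateMode`, `minAllowed`) are standard Kubernetes and VPA fields.

```python
import json
from typing import Any, Dict


def spire_agent_daemonset_fragment(priority_class_name: str,
                                   ephemeral_storage: str) -> Dict[str, Any]:
    """Fragment of a DaemonSet spec carrying the agent settings listed above."""
    return {
        "updateStrategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxUnavailable": 5},
        },
        "template": {
            "spec": {
                "priorityClassName": priority_class_name,
                "containers": [{
                    "name": "spire-agent",  # assumed container name
                    "resources": {
                        "requests": {"ephemeral-storage": ephemeral_storage},
                        "limits": {"ephemeral-storage": ephemeral_storage},
                    },
                }],
            }
        },
    }


def spire_vpa(target_kind: str, target_name: str,
              update_mode: str) -> Dict[str, Any]:
    """VerticalPodAutoscaler object with the CPU floor mentioned above."""
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": f"{target_name}-vpa"},  # assumed naming scheme
        "spec": {
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": target_kind,
                "name": target_name,
            },
            "updatePolicy": {"updateMode": update_mode},
            "resourcePolicy": {
                "containerPolicies": [{
                    "containerName": "*",
                    "minAllowed": {"cpu": "15m"},
                    # maxAllowed would come from your own performance testing.
                }]
            },
        },
    }


# "XXXX" and "XXXMi" are placeholders kept from the text, not real values.
agent_spec = spire_agent_daemonset_fragment("XXXX", "XXXMi")
agent_vpa = spire_vpa("DaemonSet", "spire-agent", "Initial")
server_vpa = spire_vpa("StatefulSet", "spire-server", "Auto")  # assumed kind
print(json.dumps(agent_vpa, indent=2))
```

Keeping the agent on `Initial` and everything else on `Auto` is the only difference between the two VPA objects, which makes the trade-off easy to see in review.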

SPIRE Signing Keys and KeyManager Configuration

If your workload requires a JWT SPIFFE Verifiable Identity Document (SVID), it is highly likely you’ll need a stable, predictable number of signing keys in use across all SPIRE Servers. It is important to note:

- Each SPIRE Server has a separate and unique x509 and JWT key pair for signing.
- The in-memory KeyManager generates new x509 and JWT signing keys upon every restart.
- SPIRE also has no option to use the SQL Datastore as a KeyManager.

We encountered issues in using AWS EBS/EFS CSI as persistent volumes and thus couldn’t use the disk KeyManager plugin. We helped enhance the built-in AWS KMS KeyManager plugin so there’s an option for a persistent key store without relying on persistent volumes for SPIRE Server pods. We found the AWS KMS KeyManager to be reliable.

Given M total SPIRE Servers, the number of JWT signing keys K in the JSON Web Key Sets is: M <= K <= 2 * M. It is possible a SPIRE Server has an active JWT signing key that’s used for signing and verification and another unexpired key that’s used for verification only.
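The bound is simple enough to state as code. This is a small sketch of the arithmetic only; the function name and the example server count are made up for illustration.

```python
from typing import Tuple


def jwt_signing_key_bounds(servers: int) -> Tuple[int, int]:
    """Return (min, max) JWT signing keys in the JWKS for a given server count.

    Each server contributes one active key, plus at most one unexpired key
    that is still published for verification only.
    """
    if servers < 0:
        raise ValueError("server count cannot be negative")
    return servers, 2 * servers


# Hypothetical fleet of 6 SPIRE Servers.
low, high = jwt_signing_key_bounds(6)
print(low, high)
```

The upper end of the range is reached when every server still publishes its previous, unexpired key alongside the active one.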

SPIRE as OAuth Identity Server

OIDC Discovery Provider

SPIRE can be integrated as an Identity Server in the OAuth flow. The use of SPIRE OIDC Discovery Provider further allows for federation based on SPIRE JWT SVIDs. We initially deployed the SPIRE OIDC Provider to all SPIRE Servers including Root and Intermediate CAs. Querying the Provider would return a varying number of public JWT signing keys! Our current strategy is to enable and serve the SPIRE OIDC Discovery Provider from the Root CAs only. We find the Root SPIRE CAs in a nested topology to be an accurate source for the full trust bundle (including all JWT signing keys being used in the entire SPIRE Server fleet).

CredentialComposer Plugin

SPIRE Server supports many customization plugins. There’s also a plugin that can modify the claims in a JWT SVID as needed. At Indeed, we implement a custom plugin that looks up a workload’s metadata and translates that into additional claims as needed. Our approach to federation with AWS is based on passing session tags using AssumeRoleWithWebIdentity. We tag AWS resources storing sensitive data and manage which workload has access to which tags in internal systems. The custom plugin looks up the appropriate session tags for a workload and adds them to the JWT SVID.

An image showing how a K8s workload uses SPIRE OAuth to access an S3 bucket with a custom JWT

The workload’s final access is the combination of the IAM Policy attached to the IAM Role and additional session tags the workload was granted. The IAM Role itself doesn’t need to be tagged.

The SPIFFE Helper utility runs as a sidecar to request, refresh, and store the JWT SVID at a fixed location on the workload pod.

Third-party Federation Using OIDC

At Indeed, popular Confluent and AWS technologies are used to store most of our critical data, and most of our workloads access data in both clouds. It was important for us to implement federation with both successfully from the beginning. The details for enabling and configuring OIDC are well documented for both AWS and Confluent. Next, we’ll cover how our experience differed between the two vendors and the lessons we learned. As you’ll see, there were significant differences, and nothing should be taken for granted.

Opaque Limits on Keys Accepted in JWKS, and Too Many JWT Signing Keys

We discussed earlier that each SPIRE Server has its own unique JWT signing key pair and that the maximum number of signing keys is twice the number of SPIRE CA servers. One drawback of a nested topology scaled for fault tolerance is that there are many SPIRE CAs. So, given M SPIRE Servers in each of N K8s clusters, there can be up to 2 * M * N JWT signing keys in the JSON Web Key Set (JWKS).
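
The arithmetic is simple but worth running against your own topology before choosing a federation partner. A small Python sketch, with hypothetical fleet numbers:

```python
def jwks_key_bounds(servers_per_cluster, clusters):
    """Return (minimum, maximum) JWT signing keys expected in the JWKS.

    Each server contributes one active key, plus at most one older,
    unexpired key that is kept for verification only.
    """
    if servers_per_cluster < 0 or clusters < 0:
        raise ValueError("counts must be non-negative")
    total_servers = servers_per_cluster * clusters
    return total_servers, 2 * total_servers


if __name__ == "__main__":
    # Hypothetical fleet: 3 SPIRE Servers in each of 20 clusters.
    low, high = jwks_key_bounds(3, 20)
    print(f"expect between {low} and {high} keys in the JWKS")
```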

In the early phases of development, we saw verification of the SPIRE JWT fail in both Confluent and AWS. Our proof of concept, which had used a single SPIRE CA server in a test trust domain, had worked. After further investigation, we determined that AWS accepts ~100 signing keys and Confluent only a handful. Neither documents the limit anywhere, which made the whole process more difficult. We were able to work with Confluent to increase the soft limit to something more reasonable. The AWS limit remains the same.
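
Because the limits are undocumented, one defensive measure is to count the keys your OIDC endpoint actually serves and alert well before a suspected ceiling. A minimal Python sketch follows; the limit passed in is an assumption you supply from your own testing, not a vendor-published number.

```python
import json


def count_jwks_keys(jwks_json):
    """Count the entries under "keys" in a JWKS document given as a JSON string."""
    document = json.loads(jwks_json)
    return len(document.get("keys", []))


def exceeds_limit(jwks_json, assumed_limit):
    """True when the JWKS holds more keys than the limit you believe applies."""
    return count_jwks_keys(jwks_json) > assumed_limit


if __name__ == "__main__":
    # Synthetic JWKS with 120 placeholder keys; real entries carry key material.
    sample = json.dumps({"keys": [{"kid": f"key-{i}", "kty": "EC"} for i in range(120)]})
    print(count_jwks_keys(sample), exceeds_limit(sample, assumed_limit=100))
```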

We have this issue open with the SPIRE community as well: “SPIRE deployments of more than a few servers can create more keys in JWKS than OIDC federating system supports” (Issue #4699, spiffe/spire on GitHub).

While nested topologies are great for high availability, there’s a real risk that federation can fail because of arbitrary limits on the number of signing keys supported by the federating system. SPIRE could benefit from a mechanism that lets the number of SPIRE instances scale while the number of JWT signing keys stays fixed, i.e., the ability to logically group SPIRE Servers that use the same key material.

OIDC Configuration

When configuring the OIDC Provider in AWS, the thumbprint of the top-level certificate used in signing the OIDC endpoint is required. Confluent requires no such configuration because it implicitly trusts globally trusted CAs. Our SPIRE OIDC server endpoint has a certificate issued by Let’s Encrypt, which makes the AWS requirement challenging: Let’s Encrypt recently truncated the chain and has also shortened the duration of the new top-level Intermediate CA. You must define a process or automation to update the OIDC configuration in AWS before the signing CA for the OIDC server itself rotates.
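
As a starting point for that automation, the thumbprint AWS expects is the hex-encoded SHA-1 digest of the DER-encoded certificate. A minimal Python sketch using only the standard library is below; how you obtain the right certificate from the chain is left out, and the file name in the usage example is hypothetical.

```python
import hashlib
import ssl


def cert_thumbprint(pem_cert):
    """Return the SHA-1 thumbprint (40 lowercase hex chars) of one PEM certificate."""
    der_bytes = ssl.PEM_cert_to_DER_cert(pem_cert)  # strips the PEM armor, decodes base64
    return hashlib.sha1(der_bytes).hexdigest()


if __name__ == "__main__":
    # Hypothetical path: the top-level certificate of the endpoint's chain.
    with open("top_level_ca.pem") as handle:
        print(cert_thumbprint(handle.read()))
```

Comparing this value against what is configured on the AWS OIDC Provider on a schedule gives early warning before a CA rotation breaks federation.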

Note: This is different from the JWT signing key pair used to sign the JWT and subsequently used in JWT verification.

Confluent Identity Pool vs. AWS IAM Role

In the context of OIDC, Confluent identity pools and AWS IAM roles are both used for managing permissions, but they are implemented differently. We’ll look at some key differences.

Audience Claim

It is worth noting that AWS expects Audience(s) to be set per OIDC provider. Confluent expects the aud claim to be defined in each identity pool. The difference is that in AWS, the audience claim is tied to the Issuer relationship itself, so there’s no need for an audience check in the trust policy for an IAM Role. Confluent expects Identity Pool filters to explicitly verify the issuer, audience, etc.

Trusting OIDC Providers

A Confluent Identity Pool trusts a single OIDC Provider only. The Confluent documentation for identity pools may lead you to believe otherwise by supporting filter expressions like “claims.iss in ["google", "okta"]”, but an identity pool is bound to one OIDC Provider. AWS IAM Roles, on the other hand, rely on trust policies, which can be configured to trust multiple OIDC Providers by repeating the principal block. This matters when thinking about migrating to new OIDC Providers or running multiple Identity Providers in your organization.
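
On the AWS side, trusting more than one provider can be sketched as one trust policy statement per provider. The Python below generates such a document under a few assumptions: the account ID, provider hosts, and SPIFFE ID are hypothetical, the condition key follows the AWS convention of the issuer URL without the https:// prefix followed by :sub, and sts:TagSession is included because session tags are being passed.

```python
import json


def build_trust_policy(providers, subject):
    """Build an IAM trust policy with one statement per OIDC provider.

    `providers` maps the IAM OIDC provider ARN to the issuer host that
    prefixes its condition keys.
    """
    statements = []
    for provider_arn, issuer_host in providers.items():
        statements.append({
            "Effect": "Allow",
            "Principal": {"Federated": provider_arn},
            "Action": ["sts:AssumeRoleWithWebIdentity", "sts:TagSession"],
            "Condition": {"StringEquals": {f"{issuer_host}:sub": subject}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Hypothetical old and new providers during a migration.
    policy = build_trust_policy(
        {
            "arn:aws:iam::123456789012:oidc-provider/oidc-old.example.org": "oidc-old.example.org",
            "arn:aws:iam::123456789012:oidc-provider/oidc-new.example.org": "oidc-new.example.org",
        },
        subject="spiffe://example.org/ns/jobs/sa/indexer",
    )
    print(json.dumps(policy, indent=2))
```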

Size Limits

AWS IAM Roles have a limitation on the size of the trust policy, and Confluent has a limit on the size of the filter. Work with the vendor to understand the hard limits and soft limits for your company. It is better to know these limits ahead of time as that can influence the design of the trust policy and the workload identity format itself.

SDK and Standards Maturity

AWS has a mature and well-documented credential provider chain. The documentation walks a developer through the SDK configuration needed so that the OAuth JWT is automatically located and used in a call to AssumeRoleWithWebIdentity inside the client application. A few properly configured environment variables and a credential file containing the JWT are all that’s needed for the AWS SDK to automatically exchange it for an STS credential with Role assumption. No additional logic is needed when the credential file containing the JWT is automatically refreshed.
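
For reference, the environment variables involved are the standard web identity settings read by the AWS SDKs. The Python sketch below only assembles them; the role ARN and token path are hypothetical, and in practice these values would normally be set in the pod spec rather than in application code.

```python
def web_identity_env(role_arn, token_file, session_name=None):
    """Return the environment variables the AWS SDK web identity provider reads."""
    env = {
        "AWS_ROLE_ARN": role_arn,                   # role to assume
        "AWS_WEB_IDENTITY_TOKEN_FILE": token_file,  # file holding the JWT
    }
    if session_name:
        env["AWS_ROLE_SESSION_NAME"] = session_name  # optional session label
    return env


if __name__ == "__main__":
    # Hypothetical values; the token file is the one a sidecar keeps refreshed.
    settings = web_identity_env(
        "arn:aws:iam::123456789012:role/example-workload",
        "/var/run/secrets/spire/jwt_svid.token",
    )
    for name, value in settings.items():
        print(f"{name}={value}")
```

With these in place, a client created with default SDK settings is expected to find the token file through its credential chain and perform the exchange without extra application logic.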

By contrast, the Confluent Kafka Simple Authentication and Security Layer (SASL) libraries provide interfaces that have to be implemented, in each language in use, for the JWT to be located, refreshed, and made available for use.
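
To give a sense of what such an implementation involves, here is a minimal Python sketch of a token callback. It assumes the confluent-kafka Python client, whose oauth_cb setting is understood to take a callable that receives a configuration string and returns the token together with its expiry in seconds since the epoch; treat that contract, the token path, and the function names as assumptions to verify against the client you use. The sketch reads the expiry from the JWT payload without verifying the signature, since the broker performs verification.

```python
import base64
import json


def jwt_expiry(token):
    """Read the exp claim from a JWT payload without verifying the signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return float(payload["exp"])


def make_oauth_cb(token_path):
    """Build a callback that re-reads the JWT file every time it is invoked."""
    def oauth_cb(_config_string):
        with open(token_path) as handle:
            token = handle.read().strip()
        return token, jwt_expiry(token)
    return oauth_cb


# Hypothetical wiring, assuming the confluent-kafka client is installed:
#   config = {"security.protocol": "SASL_SSL",
#             "sasl.mechanisms": "OAUTHBEARER",
#             "oauth_cb": make_oauth_cb("/var/run/secrets/spire/jwt_svid.token")}
```

Re-reading the file on each call means a refreshed token is picked up without any coordination between the application and whatever keeps the file current.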

The biggest issue we’ve faced so far in our journey was the least expected: “CredentialComposer plugin serializes integer claims as float” (Issue #4982, spiffe/spire on GitHub). The SPIRE CredentialComposer plugin converts timestamp fields from integer to float. This led AWS STS to reject the JWT due to an invalid data type for the iat and exp claims. Confluent, on the other hand, had no problem validating and verifying the JWT. The JWT spec defines timestamps as numeric values, and both integers and floats are valid. We were stuck between poor data type handling in SPIRE and AWS STS’s reluctance to fix the issue on their end and bring their JWT validation up to spec. Indeed pushed a tactical fix so that SPIRE JWT SVIDs are accepted by AWS.
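
The failure mode is easy to reproduce in any language whose JSON encoder distinguishes integers from floats. The Python sketch below illustrates the general idea of normalizing timestamp claims before serialization; it is not the fix that was pushed to SPIRE (which is written in Go), and the claim list is an assumption.

```python
import json

# Timestamp claims that a strict validator may expect as integers.
NUMERIC_DATE_CLAIMS = ("iat", "exp", "nbf")


def normalize_numeric_dates(claims):
    """Return a copy of `claims` with whole-number float timestamps cast to int."""
    normalized = dict(claims)
    for name in NUMERIC_DATE_CLAIMS:
        value = normalized.get(name)
        if isinstance(value, float) and value.is_integer():
            normalized[name] = int(value)
    return normalized


if __name__ == "__main__":
    claims = {"sub": "spiffe://example.org/ns/jobs/sa/indexer", "exp": 1700000000.0}
    print(json.dumps(claims))                           # exp is written with a fractional part
    print(json.dumps(normalize_numeric_dates(claims)))  # exp is written as an integer
```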

Conclusion

Adopting SPIRE as your OIDC Provider with major cloud vendors allows you to specify identities independently of vendor-specific naming schemes and manage them centrally. This approach provides a consistent view of each workload, benefiting compliance, governance, and auditing efforts within the company.

If you are pushing for the latest and greatest in SPIRE architecture and security standards, be prepared to overcome gaps in SPIRE or the federating system. While no system is perfect, the problems SPIRE already solves, it solves well. A highly available SPIRE deployment as an OIDC provider is a road less traveled, and we are excited to make things better wherever we can and share our learnings for everyone’s benefit. We hope this guide accelerates your journey toward embracing secure workload identity in your organization.
