Become Builders, Not Coders

Why agentic coding tools demand a new identity for software engineers

After more than two decades of professional software engineering, I have arrived at a set of conclusions that I find very uncomfortable.

The era of mostly manual coding has ended. IDEs, in their current form, are no longer necessary. Traditional software development languages are showing signs that they have already entered the beginning of their end (see nanolang, an experimental language designed for agents rather than humans).

These statements are deliberately provocative, and I expect that many of you reading them will disagree, some strongly.

My conclusion, and impassioned plea, is this: each one of us must adapt to the new world not by lamenting how our jobs change, but by embracing the notion that we were never paid to code. Coding was just something we did.

We are paid to build products that solve customer problems using code.

Doing that with agentic coding tools involves a hugely different set of actions, but it produces the same output and ultimately requires the same high-level skills, while freeing us from much of the minutiae.

A sudden change

I have long had an interest in neural networks – since well before LLMs were possible – and I thought we were many years, probably somewhere between generations and “never,” away from what has become real in the past few years. My personal perspective on AI coding can be summed up as follows:

  • Through 2023: AI tab complete is a great demo and a neat toy, but only useful in languages without strong types and solid deterministic auto-complete (JetBrains Java quality), or for very inexperienced developers.
  • 2024 to Early 2025: Chat Oriented Programming (CHOP) seems real, but it seems to require top-1% developer skills to realize significant gains, and I am getting concerned that I will not catch up. Vibe Coding is a silly fad, similar to titling yourself “ninja” or “wizard” on a resume.
  • Mid-2025: Agentic coding is magic. Context Engineering is a real skill, and Model Context Protocol (MCP) is reckless, but amazing. Things I thought impossible are now trivial. The security risks and automation problems are huge, but everything is going to change.
  • Now: AGENTS.md, MCP, Skills, Commands, Sandboxing, Subagents, Ralph Loops, Beads, Orchestrators, etc. – I cannot keep up; maybe no one can fully. Figuring out what to build, and how to have agents build it better and faster than humans, is all that matters. Many of the things I assumed were critically important are now simply irrelevant. I now routinely and easily produce code in languages I can barely read, with vastly more confidence than I would have as a typical beginner – yes, I’ve had known experts review it, and gotten good feedback that it is not “slop”; it looks like code written by humans, except with better tests than average.

Why such a sudden change?

The trite answer to the rapid advancement is simply that the models got better, cheaper to train and run (per parameter or token), and more available. That is definitely a factor, but I believe the more critical advances included:

  • Context Size: The usable context size grew to the point that agents started to accomplish significant tasks with minimal oversight. Their ability to quickly process and return meaningful results from vast amounts of short-term context now clearly exceeds human capability.
  • Context Availability: MCP allowed agents to explore context beyond that which was available on the local machine.
  • Tool Use: Tool use has largely eliminated hallucinations wherever output can be verified. (Karpathy predicted this 8 years ago!) Areas in which LLMs continue hallucinating tend to be obvious and easily corrected.
  • TODOs: A simple TODO tool added to Claude Code made it discontinuously better at staying on task; here is an interview with its creators discussing it.

What does this mean for the profession of Software Engineering?

I do not think anyone understands the full ramifications of this change. It is too new and too fast. I use the metaphor of going to sleep one night as a blacksmith knowing only hammers and bellows, and waking up the next morning employed at a modern metal shop with hydraulic presses, CNC machines, laser cutters, advanced welding equipment, and even additive manufacturing. The change happened so suddenly that it is hard to express how shocking it is to those not used to the pre-agentic methods.

I am personally optimistic for the profession of Software Engineering. Jevons Paradox describes how increases in efficiency can result in more consumption of a resource, not less. Jevons observed that as steam engines became more efficient in their use of coal, total coal consumption increased rather than decreased — because the efficiency made coal-powered applications economically viable in far more contexts. We may already see this in how AI affects Radiologists.

Most software I use personally is pretty awful. It is buggy, has UI that was clearly developed without any UX expertise, is isolated rather than integrated with other systems, and has huge security holes. Yes, “slop” could produce more of that. But fixing these problems is largely the application of expertise that can be encoded into context. This expertise can be applied by software engineers who, without AI tooling, would have neither the deep specialty skills nor the time to fix any of these problems.

I fully expect that Marc Andreessen’s observation that Software is Eating the World will accelerate, driven by the efficiency from agentic tools. This will lead to a new era of demand for solid engineers using those tools.

That transition is not happening yet, and may still take a few years. I do not mean to minimize the real pain in the industry right now: many companies did over-hire in 2020-2022, and we really did promise too many college students that studying CS was a golden ticket. I have been personally laid off multiple times in the past, and early in my career I spent half a year finding a job – and took a 50% pay cut. It hurts, and I do not mean to imply that the last few years, or the next few, have not been or will not be painful for many.

What does this mean for software engineers?

In my experience, this shift is already well underway at many companies. Most are not large enough to build their own agentic coding stacks like Google, Meta, Amazon, or Microsoft, but are investing in commercial tools, training, and internal interest groups. Some are even adjusting their performance evaluation criteria to reward adoption of agentic coding.

A quarter century into my career, I feel like an old dog trying to learn new tricks. But, I am personally grateful that my employer provides access to these tools. I am too risk averse to want to pay hundreds of dollars speculatively for access to tools just to train myself. Having access through work eliminated that barrier and got me started.

The focus on re-training will not last forever. Companies are not paying for it altruistically. The investment needs to translate into real capability, not just a line on a resume.

The goal is increased productivity, which is notoriously difficult to measure directly. As proxies, we should look for more prototyping, improved experiment velocity, lower maintenance costs, and higher quality (more and better tests, more rigorous standards implemented more consistently, etc.). Early results are promising, and more clearly, the constraints, risks, and barriers are becoming visible — which allows us to focus on overcoming them.

Here is what I have seen work, and what I believe engineers at every level need to think about.

A shift in perspective

All of us need to shift from a focus on coding to a focus on solving problems with software. This is a huge request – a shift in identity, not just in thinking.

I have introduced myself for most of my career as “mostly a Java guy.” Yes, I have significant professional experience in several other languages. But, if I were really honest with myself, I thought of myself as a coder who wrote and read Java as a first language, and a dozen or two others as second languages with various levels of competence.

Agentic coding has revealed that this way of speaking was always an idiom. No one who buys software really cares that I know Java. I was never paid for that. I was paid to solve problems with software, and for a large part of the last 25 years, Java just happened to be a relatively good tool.

Very deliberately, I have to think of myself as a problem solver, who uses code.

What to change

Details of how this change in perspective will be worked out vary based on role.

Individual contributors – roughly senior engineer and below – who traditionally coded most of the time should focus on learning key skills that we have long expected our senior engineers to master:

  • Work Decomposition: Breaking large tasks into pieces small enough for a single context window is one of the core skills of Context Engineering. This breakdown was previously done mostly by tech leads and managers. With agentic coding, it must be learned almost immediately, since agents can do hours of typing in seconds.
  • Rapid Code Review: The ability to read code quickly is critical, with less focus on minutia and more focus on the core of the change, overall style, good patterns, etc. Soon, agents will likely make this easier, but it is important to be able to do so directly today.
  • Technical Writing: Models use human languages, most commonly English at the moment. Improve your writing skills. Learn to use Grammarly, CoPilot, or Gemini (or coding tools!) to improve your style. Have agents assess your writing, and ask them to interview you to help find ways to communicate more effectively, both improving style and filling gaps.
  • Clean Code: Emphasize specification. Build minimal solutions; ask for agentic review (before code reviews) regarding patterns, alignment with standards, style, etc.; be willing to start over. Write great tests: if you need to make a big change, and they are missing, ask agents to write deliberately over-specified tests before making the change, then use test breakage as a signal that the change is what you intend.

Engineering leaders – staff-and-above individual contributors, and technical managers – need to:

  • Reconsider the Cost of Software: Leaders learn – often the hard way – that code is a liability. Many think in terms of “lines of code spent” – a limited resource due to maintenance costs – not “lines of code written.” This is now less true. Software 2.0 means a clear specification can be translated into code, or rewritten in another language, with exponentially less effort, cost, and risk.
  • Use Agents to Understand Your Codebase: Ask agents to explain your codebase. Study the output and look for things you know to be right or wrong. Ask for reviews or critiques. Try with different instructions, focusing on different aspects or different personas.
  • Build Again: Get your development environment working again. Fix simple bugs. Take on less glamorous work, like migrations. Learn to use low-code automation platforms and AI assistants to automate things. Or, build entirely new projects yourself, particularly personal or internal tools. The roles you hold demand the ability to decompose work, review code, and write about technical topics – which means you are extremely well positioned to build with agentic tools. Doing so can help you lead, coach, and mentor others. One of my key learnings has been that I need to use the tools myself to understand how different developing software with them really is.

And the implications extend well beyond engineering. Product managers, business analysts, and others outside R&D are finding that low-code automation tools and AI assistants allow them to automate repetitive work, build prototypes that communicate requirements better than any document, and even verify outputs against specifications. The bar for who can build useful software is dropping fast, and non-engineers who adapt will have an outsized impact on their organizations.

How to change

Change is happening so rapidly that this list will probably seem incomplete, irrelevant, or perhaps even wrong in weeks. But, today, here is my recommendation:

  • Learn the tools. Become a constant user of at least one agentic coding tool – whether CLI-based or IDE-integrated. The landscape is evolving fast, but the current leaders include Claude Code, OpenAI Codex, Gemini CLI, Amp, Cursor, and Windsurf. Pick one and commit to using it daily.
    • For every task beyond a few lines or clicks, not just implementation, spend a few minutes trying to get AI to do it.
    • Join or create internal communities for sharing AI development techniques; ask for help and help others.
    • Blog on big wins; help your teammates. Observe something; do it; teach someone else – “See one, do one, teach one” (SODOTO).
  • Focus on building agentically: Avoid typing code or copying & pasting from chat. Let the agents make the changes, build, observe outputs, and iterate. Remember that agents need context, constraints, and success criteria, not step-by-step instructions.
  • Learn Context Engineering: Some of the more common complaints are that agents make assumptions and hallucinate. Much of that is caused by gaps in what they are presented. They have been trained on nearly every text document that can be legally presented to them, so there are a lot of differing decisions and even bad practices built in. Add context with good examples, standards, etc. Sometimes, this is a prompt or an AGENTS.md, but just as often, it is Skills, Hooks, MCP, or Commands that encode fixed behavior.
  • Pay attention to risks. Learn about things like the Lethal Trifecta, sandboxing, and prompt injection. Learn how to allow-list tool use and assess risk. Keep your tools updated.
  • Build your skills incrementally; roughly in order, that is:
    • Start with agentic coding and deliberate Context Engineering.
    • Take opportunities to figure out how to use agents to break down and build smaller tasks. Use them for planning and research.
    • Experiment with Spec Driven Development (SDD) to use multiple context windows for a single task to produce more consistent results with fewer interruptions.
    • Figure out how to run a simple Ralph Loop – a scripted loop that repeatedly invokes an agent across many context windows to make changes too large for any single session.
    • Experiment with multiple, parallel, agentic sessions or even Agent Orchestrators. Learn from research on scaling agents.
  • Follow your organization’s AI coding policies. Use the provided tools and ask for permission to explore new opportunities.
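
The Ralph Loop mentioned above can be sketched in a few lines. This is an illustrative sketch only: `runAgentOnce` is a hypothetical stand-in for shelling out to whatever agentic CLI you use, and the “DONE” completion marker is an invented convention, not any tool’s real protocol.

```typescript
// Minimal Ralph Loop sketch: repeatedly invoke an agent on the same prompt,
// each time in a fresh context window, until it signals completion or we hit
// an iteration cap. `runAgentOnce` is a hypothetical stand-in for shelling
// out to your agentic coding CLI of choice.
type AgentStep = (prompt: string) => Promise<string>;

async function ralphLoop(
  runAgentOnce: AgentStep,
  prompt: string,
  maxIterations = 50,
): Promise<number> {
  for (let i = 1; i <= maxIterations; i++) {
    // Each call is a fresh invocation: durable state lives in the repo
    // (code, tests, TODO files), not in the conversation.
    const output = await runAgentOnce(prompt);
    if (output.includes('DONE')) {
      return i; // the agent reported that no work remains
    }
  }
  throw new Error(`Not finished after ${maxIterations} iterations`);
}
```

The point of the pattern is that each iteration starts with a clean context window, so all durable state must live in the repository rather than in the chat history.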

What will happen in the next few years?

My personal speculation is that the key change will be that the documentation, processes, and common knowledge that collectively helped growing groups of humans work together will be encoded into context and into software that manages agents. This will not happen all at once, and right now initial efforts can best be described as chaotic. Things that I am almost sure will happen, in some form, are:

  • TODOs will rapidly evolve into a hierarchy of tasks. Agents will gain the ability to identify tasks that are too large and break them down, as well as to discover new ones. LLMs that now ignore missing context and make assumptions will get better at identifying these gaps and finding ways to fill them, which will often include asking humans but also involve better automated context seeking, and collective memories (per user, per team, per company, etc.).
  • Orchestration will become the norm, not something that seems novel (e.g. Gas Town).
  • Sandboxing, techniques using adversarial agents, context isolation (stripping out the “why” of a requested action and considering only whether it is reasonable – which should cut off most prompt injection), and less intrusive permission requests will mature to the point that agents run almost continuously to improve code.
  • IDEs will disappear in their current form, but their capabilities – refactoring, debuggers, profilers, structural search & replace, etc. – will become tools agents use to reduce the number of tokens consumed to accomplish the same tasks, paralleling how humans benefit from those tools.
  • Traditional languages, built for humans, will be replaced with languages built for agents, which do not care about the amount of typing, are willing to accept required testing, can be indexed easily, have strong type constraints, etc. Things like Foreign Function Interfaces (FFI, as opposed to system calling conventions) will become less important, since the complexity of lower-level interfaces does not seem to be a problem for LLMs.
  • Code review will evolve into change review: humans will, for the most part, stop reading the code but still need to be able to reason about how the system evolves. Change review will describe changes with clear prose and diagrams, not present line-by-line diffs, and allow conversational exploration of the change.

The collective result of this will be an increase in software scale at least equivalent to the jump from computers that ran programs directly constructed in machine code, to computers running an operating system running software written in “high level” languages.

Conclusion

Two quotes come to mind as I consider this paradigm shift (forgive the over-corporate term – it is literally appropriate here).

The first is the quote often attributed to Thomas J. Watson, then IBM Chairman: “I think there is a world market for maybe five computers.” Whether or not he actually said it, I think the sentiment was correct. There really was a market for only a few computers when all software was written directly in machine code, with no operating system, on enormously expensive hardware. The past 80 years have taken computers from machines around which buildings were built to devices so cheap that they are thrown away in common single-use disposable medical devices. I very much suspect that LLMs are the next step in this change. Whether history will see them as an extension or as a second revolution is something for later generations to decide, but I am certain that the change is happening far more rapidly now.

The second is more alarming: Upton Sinclair wrote in his memoir, “It is difficult to get a man to understand something, when his salary depends upon his not understanding it.” I find it extremely challenging to think about the consequences of agentic gains. So much of my career has been focused on the skills required to do things that LLMs can now do trivially that it very much feels like I am being replaced.

I have to remind myself constantly that the core skills of engineering do not go away with better tools. CAD didn’t eliminate civil engineering or architecture; it eliminated pencil skills for drafting. Word Processing didn’t eliminate writing; it made typing vastly easier. Agentic Coding will not eliminate Software Engineering but it will very likely eliminate coding. The blacksmith who woke up in a modern metal shop still needs to know metallurgy, tolerances, and what the customer actually needs built. The tools changed. The craft did not.

Knowledge of code is not what our value depends on. Knowledge of how to build software is the skill that has always been, and is now very clearly, the most critical.


Michael Werle is a Technical Fellow in Core Infrastructure at Indeed, where he serves as tech lead across the organization’s platform engineering and SRE teams. He can be reached on LinkedIn.

This article was written by hand, with agentic tools used for feedback and editing.

Bringing Lighthouse to the App: Building Performance Metrics for React Native

At Indeed we’ve open sourced a new React Native repository that makes it simple to measure Lighthouse-style scores in your mobile apps. We think it will help other organizations better measure their app performance, especially companies that, like Indeed, are transitioning from a web-first to an app-first approach.

You can check out the code here, and read on for more details.

The Challenge

Indeed has traditionally been a web company. Site speed wasn’t just a nice-to-have — it was fundamental to how we built systems. We believed good software was always fast, and for many years we relied on Lighthouse to keep us honest. In the past we’ve written in depth on this topic, but as we’ve transitioned to an app-first company, we needed a way to bring the same performance rigor to our native code.

As React Native proliferated across our most critical pages—ViewJob, SERP, Homepage—we found ourselves flying blind. We had no standardized way to measure whether our mobile performance was improving, degrading, or holding steady. We needed answers to fundamental questions: How fast did our screens load? When could users actually interact with them? Were we maintaining the performance standards that Indeed was known for?

The Solution: Core Web Vitals for React Native

Rather than reinvent the wheel, we looked to the industry standards that had proven effective on the web: Core Web Vitals. These metrics — designed by Google to capture the essence of user experience — translated remarkably well to mobile apps. We just needed to adapt them for React Native’s unique threading model and lifecycle.

The Metrics That Matter

  1. Time to First Frame (TTFF) — When users saw content
    Our analog to Largest Contentful Paint (LCP). It measures how quickly users see meaningful content after a component starts mounting. In a native app context, this needs to be fast — there’s no network request to fetch the document, no HTML parsing, no CSS cascade. Code is pre-bundled. Users expect instant visual feedback.
    Threshold: < 300ms is good, > 800ms is poor.
  2. Time to Interactive (TTI) — When users could actually do something
    The most critical metric for mobile apps. We measure when a component transitions from “loading” to “ready for interaction.” Unlike the web, where TTI was algorithmically determined, we let components self-report when they’re truly interactive — when data is loaded, UI is rendered, and touch handlers are ready. While not ideal in every case, we’ve found algorithmic TTI (e.g., TTI Polyfill) can also be inaccurate.
    Threshold: < 500ms is good, > 1500ms is poor.
  3. First Input Delay (FID) — How responsive the app felt
    Captures the delay between a user’s first touch and when the app responds. On mobile, touch interactions should feel instantaneous. Any perceptible lag breaks the illusion of direct manipulation that makes mobile apps feel native.
    Threshold: < 50ms is good, > 150ms is poor.
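
The three thresholds above can be captured in a small rating function. This is an illustrative sketch rather than code from the library; the cut-offs are taken directly from the list above, and values between the “good” and “poor” bounds are rated “needs-improvement,” mirroring Core Web Vitals buckets.

```typescript
// Rate a metric value against the thresholds listed above. Values between
// the "good" and "poor" bounds get "needs-improvement", mirroring the
// three-bucket scheme used by Core Web Vitals.
type Metric = 'TTFF' | 'TTI' | 'FID';
type Rating = 'good' | 'needs-improvement' | 'poor';

const THRESHOLDS_MS: Record<Metric, { good: number; poor: number }> = {
  TTFF: { good: 300, poor: 800 },
  TTI: { good: 500, poor: 1500 },
  FID: { good: 50, poor: 150 },
};

function rateMetric(metric: Metric, valueMs: number): Rating {
  const { good, poor } = THRESHOLDS_MS[metric];
  if (valueMs < good) return 'good';
  if (valueMs > poor) return 'poor';
  return 'needs-improvement';
}
```

For example, a TTFF of 347ms rates “needs-improvement” under these bounds, while a FID of 40ms rates “good.”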

Why These Thresholds?

Our thresholds are significantly stricter than Core Web Vitals. This was intentional. Native apps need to be faster than web apps:

  • ✅ No network requests for initial render
  • ✅ Code is pre-bundled in the app
  • ✅ No HTML/CSS/JS parsing overhead
  • ✅ Users expect native app speed

For context, Core Web Vitals consider LCP < 2.5s as “good.” Our TTFF “poor” threshold is 800ms — roughly 3× stricter even at the poor boundary, meaning our “poor” sits well below the web’s “good.” Mobile users have different expectations, and our thresholds reflect that reality.

Integration: Dead Simple

The entire system is packaged as a single React hook. Integration takes three steps:

import React, { useEffect } from 'react';
import { View } from 'react-native';
// usePerformanceMeasurement is the hook exported by this library

function MyComponent({ dataLoaded }: { dataLoaded: boolean }): JSX.Element {
  // 1. Add the hook
  const { markInteractive, panResponder } = usePerformanceMeasurement({
    provider: 'myapp',
    componentName: 'MyComponent' as const,
  });

  // 2. Mark when interactive (dataLoaded is your component's own loading state)
  useEffect(() => {
    if (dataLoaded) {
      markInteractive();
    }
  }, [dataLoaded, markInteractive]);

  // 3. Attach pan responder to root view
  return (
    <View {...panResponder.panHandlers}>
      {/* Your component */}
    </View>
  );
}

That’s it. No configuration files, no complex setup, virtually no performance overhead in production. The hook handles everything: timing, measurement, logging, and cleanup.

Technical Implementation

Architecture Overview

The measurement system follows a component’s lifecycle from mount to interaction:

Component Mount → TTFF Measurement → TTI Marking → FID Capture → Logging

1. Measuring Time to First Frame

React Native’s InteractionManager is key. It lets us run code after the current frame finishes rendering — the perfect hook for measuring TTFF:

useEffect(() => {
  // mountStartTime is captured once at mount,
  // e.g. const mountStartTime = useRef(Date.now()).current;
  const handle = InteractionManager.runAfterInteractions(() => {
    const ttff = Date.now() - mountStartTime;
    // TTFF captured after first frame renders
  });
  return () => handle.cancel();
}, []);

2. Marking Time to Interactive

Components know best when they’re truly interactive. Rather than trying to algorithmically determine this (as Lighthouse does for web), we provide a markInteractive() callback that components call when they’re ready:

const { markInteractive } = usePerformanceMeasurement({
  provider: 'viewjob',
  componentName: 'ViewJobMainContent'
});

useEffect(() => {
  if (dataLoaded && uiReady) {
    markInteractive(); // Component decides when it's interactive
  }
}, [dataLoaded, uiReady]);

3. Capturing First Input Delay

React Native’s PanResponder gives us comprehensive input capture across all touch types. We measure the delay between touch start and when the main thread can process it:

const panResponder = PanResponder.create({
  onStartShouldSetPanResponder: () => {
    const inputTime = Date.now();
    setImmediate(() => {
      const processingTime = Date.now();
      const fid = processingTime - inputTime; // Main thread delay
    });
    return false; // Don't capture the gesture
  }
});

The setImmediate call is crucial — it ensures we measure the actual main-thread processing delay, not just the touch handler execution time.

4. Smart Logging Strategy

  • Wait for FID: Delay logging until first user interaction
  • Timeout fallback: Log after 5 seconds even without interaction
  • Single event: All metrics logged together for easier analysis

This approach gives complete performance profiles while avoiding metric fragmentation.
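
A minimal sketch of that strategy, with all names invented for illustration (the library’s internals may differ): buffer metrics as they arrive, flush a single combined event when FID lands, and fall back to a timer for sessions with no interaction.

```typescript
// Sketch of the logging strategy: buffer TTFF/TTI as they are measured, emit
// one combined event when FID arrives, or after a fallback timeout for
// sessions where the user never interacts. All names are illustrative.
type MetricName = 'TTFF_ms' | 'TTI_ms' | 'FID_ms';

class CombinedMetricsLogger {
  private metrics: Partial<Record<MetricName, number>> = {};
  private flushed = false;

  constructor(
    private emit: (event: Partial<Record<MetricName, number>>) => void,
    timeoutMs = 5000,
  ) {
    // Timeout fallback: log whatever we have even without any interaction.
    setTimeout(() => this.flush(), timeoutMs);
  }

  record(name: MetricName, valueMs: number): void {
    this.metrics[name] = valueMs;
    // Wait for FID: the first interaction completes the profile.
    if (name === 'FID_ms') this.flush();
  }

  private flush(): void {
    if (this.flushed) return; // single event only
    this.flushed = true;
    this.emit({ ...this.metrics });
  }
}
```

The guard in `flush()` is what keeps every session down to exactly one event, regardless of whether the flush was triggered by FID or by the timeout.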

Real-World Results

We first integrated this system into ViewJob, one of Indeed’s highest-traffic pages. Here’s what we learned:

Console Output (Development)

[Performance-Debug] TTI marked: 172ms
[Performance-Debug] TTFF captured: 347ms
[Performance] viewjob/ViewJobMainContent: {
  TTFF_ms: 347,
  TTI_ms: 172,
  FID_ms: 0,
  FID_type: "touch"
}

The Lighthouse Score Equivalent

To make performance actionable, we created a composite score (0–100) that mirrors Lighthouse scoring:

const PERFORMANCE_WEIGHTS = {
  TTFF: 0.25, // Visual loading
  TTI: 0.45,  // Interactivity (most critical)
  FID: 0.30   // Responsiveness
};

TTI gets the highest weight (45%) because mobile users expect immediate interactivity. Visual loading and responsiveness are important, but nothing frustrates users more than tapping a button that doesn’t respond.
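
As an illustration of how such a composite could be computed — the exact scoring curve is not shown in this post, so this sketch simply interpolates linearly between each metric’s “good” (score 100) and “poor” (score 0) thresholds before applying the weights:

```typescript
// Illustrative composite score: map each metric to 0-100 by linear
// interpolation between its "good" (=100) and "poor" (=0) thresholds, then
// combine using the weights above. The real scoring curve may differ.
interface MetricScoring { good: number; poor: number; weight: number }

const SCORING: Record<'TTFF' | 'TTI' | 'FID', MetricScoring> = {
  TTFF: { good: 300, poor: 800, weight: 0.25 },
  TTI: { good: 500, poor: 1500, weight: 0.45 },
  FID: { good: 50, poor: 150, weight: 0.3 },
};

function metricScore(valueMs: number, t: MetricScoring): number {
  if (valueMs <= t.good) return 100; // at or under the "good" bound
  if (valueMs >= t.poor) return 0;   // at or over the "poor" bound
  return (100 * (t.poor - valueMs)) / (t.poor - t.good);
}

function compositeScore(m: { TTFF: number; TTI: number; FID: number }): number {
  let total = 0;
  for (const key of ['TTFF', 'TTI', 'FID'] as const) {
    total += SCORING[key].weight * metricScore(m[key], SCORING[key]);
  }
  return Math.round(total);
}
```

Under this sketch, the sample session shown earlier (TTFF 347ms, TTI 172ms, FID 0ms) would score 98.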

ViewJob Performance:
• Average score: 81 (Good)
• P75 score: 95 (Excellent)

These scores give us a single number to track over time, making it easy to spot regressions and measure improvements.

What We Learned

1. Native Apps Should Be Faster

Our initial thresholds were too lenient — we started with web-based Core Web Vitals and quickly realized native apps should perform better. The absence of network latency and parsing overhead means users rightfully expect faster experiences.

2. Components Know Best

Letting components self-report interactivity (markInteractive()) proved more accurate than algorithmic detection. Components understand their own loading states, data dependencies, and UI readiness in ways that external observers cannot.

3. Complete Profiles Matter

Waiting to log all metrics together (rather than logging each individually) made analysis significantly easier. It’s much simpler to query for “sessions with TTI > 500ms” than to join three separate metric events.

Looking Forward

This measurement system is now our foundation for mobile performance at Indeed. We’re expanding it beyond ViewJob to SERP, Homepage, and other React Native surfaces. Each integration gives us more data, more insights, and more confidence that we’re maintaining the performance standards Indeed is known for.

But measurement is just the beginning. The real value comes from what we do with the data:

  • Automated alerts when performance degrades
  • Performance budgets enforced in CI/CD
  • A/B testing to validate that optimizations actually improve user experience
  • Correlation analysis between performance and business metrics

We’re no longer flying blind in the mobile world. We have the metrics, the thresholds, and the tooling to ensure that as Indeed becomes app-first, we remain performance-first.

Get Involved

At Indeed we’ve open sourced this repository because we think it will help other organizations better measure their app performance, especially companies that, like Indeed, are transitioning from a web-first to an app-first approach. To contribute, please see the details in our contribution guidelines: CONTRIBUTING.md.

Normalized Entropy or Apply Rate? Evaluation Metrics for Online Modeling Experiments

Introduction

At Indeed, our mission is to help people get jobs. We connect job seekers with their next career opportunities and assist employers in finding the ideal candidates. This makes matching a fundamental problem in the products we develop. 

The Ranking Models team is responsible for building Machine Learning models that drive matching between job seekers and employers. These models generate predictions that are used in the re-ranking phase of the matching pipeline serving three main use cases: ranking, bid-scaling, and score-thresholding.

 

The Problem

Teams within Ranking Models have been using varying decision-making frameworks for online experiments, leading to some inconsistencies in determining model rollout – some teams prioritized model performance metrics, while others focused on product metrics. 

This divergence led to a critical question: should model performance metrics or product metrics be the primary measure of success? All teams provided valid justifications for their current choices, so we decided to study the question more comprehensively.

To find an answer, we must first address two preliminary questions:

  1. How well does the optimization of individual models align with business goals?
  2. What metrics are important for modeling experiments?

🍰 We developed a parallel storyline of a dessert shop that hopefully provides more intuitions to the discussion: A dessert shop has recently been opened. It specializes in strawberry shortcakes. We are part of the team that’s responsible for strawberry purchases.

 

Preliminary Questions

How well does the optimization of individual models align with business goals?

🍰 How much do investments in strawberries contribute to the dessert shop’s business goals?

To begin, we will review how individual models are used within our systems and define how optimizing these models relates to the optimization of their respective components. Our goal is to assess the alignment between individual model optimization and the overarching business objectives. 

Ranking

Predicted scores for ranking targets are used to calculate utility scores for re-ranking. These targets are trained to optimize binary classification tasks. As a result, optimization of individual targets may not fully align with the optimization of the utility score [1]. The performance gain from individual targets may be diluted or omitted when used in the production system.

Further, the definition of utility may not always align with the business goals. For example, utility was once defined as total expected positive application outcomes for invite-to-apply emails while the product goal was to deliver more hires (which is a subset of positive application outcomes). Such misalignment further complicates translating performance gains from individual targets towards the business goals.

In summary, optimization of ranking models is partially aligned with our business goals.

Bid-scaling

Predicted scores for bid-scaling targets determine the scaled bids: pacing bids are multiplied by the predicted scores to calculate the scaled bids. In some cases, additional business logic may be applied in the bid-scaling process. Such logic dilutes the impact of these models.
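As a minimal sketch of the mechanics above, with a bid floor standing in for the additional business logic (the floor value is purely illustrative):

```python
def scaled_bid(pacing_bid, predicted_score, floor=0.01):
    """Scaled bid = pacing bid x predicted score. The floor stands in
    for the extra business logic that can dilute the model's impact;
    the floor value here is purely illustrative."""
    return max(pacing_bid * predicted_score, floor)
```

When the predicted score is very small, the floor dominates and a better model prediction no longer changes the scaled bid.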

Scaled bids serve multiple functions in our system. 

First, similar to ranking targets, the scaled bids are used to calculate utility scores for re-ranking. Therefore, for the same reason, the optimization of individual bid-scaling targets may not fully align with the optimization of the utility score.

Additionally, the scaled bids may be used to determine the charging price and in budget pacing algorithms. Ultimately, performance changes in individual bid-scaling targets could impact budget depletion and short-term revenue.

In summary, optimization of bid-scaling models is partially aligned with our business goals.

Score-thresholding

Predicted scores for score-thresholding targets are used as filters within the matching pipeline. The matched candidates with scores that fall outside of the pre-determined threshold are filtered out. Similarly, these targets are trained to optimize binary classification tasks. As a result, the optimization of individual targets aligns fairly well with their usage.
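The filtering step can be sketched as follows; this is a hypothetical simplification that ignores dynamic-thresholding logic:

```python
def apply_score_threshold(candidates, scores, threshold):
    """Keep matched candidates whose predicted score clears the
    pre-determined threshold; the rest are filtered out of the
    matching pipeline. A simplified sketch without dynamic thresholds."""
    return [c for c, s in zip(candidates, scores) if s >= threshold]
```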

In some cases, however, additional business logic may be applied during the thresholding process (e.g., dynamic thresholding), which may dilute the impact from score-thresholding models. 

Further, the target definition may not always align with the business goals. For example, the p(Job Seeker Positive Response | Job Seeker Response) model optimizes for positive interactions from job seekers, which may not be the most effective lever to drive job-to-profile relevance. Conversely, the p(Bad Match | Send) model optimizes for identifying “bad matches” based on job-to-profile relevance labeling, and it could be an effective lever to drive more relevant matches, which was once a key focus for recommendation products.

In summary, optimization of score-thresholding models could be well aligned or partially aligned with our business goals.

What metrics are important for modeling experiments?

🍰 How do we assess a new strawberry supplier? 

Let’s explore key metrics for evaluating online modeling experiments. Metrics are grouped into three categories: 

  • Model Performance: measures the performance of a ML model across various tasks 
  • Product: measures user interactions or business performance
  • Overall Ranking Performance: measures the performance of a system on the ranking task

(You may find the mathematical definitions of model performance metrics in the Appendix.)

Normalized Entropy

Model Performance

Normalized Entropy (NE) measures the goodness of prediction for a binary classifier. In addition to predictive performance, it implicitly reflects calibration [2].

NE in isolation may not be informative enough to estimate predictive performance. For example, if a model predicts twice the value and we apply a global multiplier of 0.5 for calibration, the resulting NE will improve, although the predictive performance remains unchanged [3].

Further, when measured online, we can only calculate NE based on the matches delivered or shown to the users. It may not align with the matches the model was scored on in the re-ranking stage.
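A minimal sketch of the NE computation (cross-entropy normalized by the entropy of the average label, per [3]):

```python
import math

def normalized_entropy(labels, preds, eps=1e-12):
    """Cross-entropy of the predictions normalized by the entropy of the
    background average label [3]. Lower is better; a value of 1.0 means
    the model is no better than always predicting the average label."""
    n = len(labels)
    p = sum(labels) / n  # background average label
    ce = -sum(y * math.log(max(q, eps)) + (1 - y) * math.log(max(1 - q, eps))
              for y, q in zip(labels, preds)) / n
    bg = -(p * math.log(p) + (1 - p) * math.log(1 - p))
    return ce / bg
```

Predicting the background average for every example yields NE = 1.0; informative, well-calibrated predictions push it below 1.0.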

ROC-AUC

Model Performance

ROC-AUC is a good indicator of the predictive performance for a binary classifier. It’s a reliable measure for evaluating ranking quality without taking into account calibration [3].

However, as calibration is not being accounted for by ROC-AUC, we may overlook the over- or under-prediction issues when measuring model performance solely with ROC-AUC. A model that is poorly fitted may overestimate or underestimate predictions, yet still demonstrate good discrimination power. Conversely, a well-fitted model might show poor discrimination if the probabilities for presence are only slightly higher than for absence [2].

Similar to NE, when measured online, we can only calculate the ROC-AUC based on the matches delivered or shown to the users.
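ROC-AUC's indifference to calibration can be seen directly from its pairwise interpretation [5], sketched here for small inputs (production systems would use a sort-based implementation):

```python
def roc_auc(labels, scores):
    """ROC-AUC via its pairwise interpretation: the probability that a
    random positive outranks a random negative (ties count 1/2) [5].
    It is invariant to any monotone rescaling of the scores, so it says
    nothing about calibration. O(n^2) sketch for illustration only."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Halving every score (a gross mis-calibration) leaves the AUC unchanged, which is why a calibration metric is needed alongside it.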

nDCG

Model Performance · Overall Ranking Performance

nDCG measures ranking quality by accounting for the positions of relevant items. It optimizes for ranking more relevant items at higher positions. It’s a common performance metric to evaluate ranking algorithms [2]. 

nDCG is normally calculated over a list of items sorted by rank scores (e.g., blended utility scores). Relevance labels can be defined in various ways, e.g., offline relevance labeling or user funnel engagement signals. Note that when we use offline labeling to define relevance, we can additionally measure nDCG on matches in the re-ranked list that were never delivered or shown to users.

When model performance improves against its objective function, nDCG may or may not improve. There are a few scenarios where we may observe discrepancies: 

  1. Mismatch between model targets and relevance label (e.g., model optimizes for job applications while relevance label is based on job-to-profile fit)
  2. Diluted impact due to system design
  3. Model performance change is inconsistent across segments
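A minimal per-query sketch of the metric, assuming a linear gain function and the standard log2 position discount:

```python
import math

def dcg_at_k(rels, k):
    # Linear gain with a log2(position + 1) discount; position j is 0-based.
    return sum(r / math.log2(j + 2) for j, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """rels: relevance labels in ranked order. Normalize by maxDCG, the
    DCG of the ideal (descending-relevance) ordering [6]."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```

A list ranked in descending relevance scores 1.0; any inversion lowers the score, regardless of how the underlying model scores moved.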

Avg-Pred-to-Avg-Label

Model Performance

Avg-Pred-to-Avg-Label measures the calibration performance for a binary classifier by comparing the average predicted score to average labels, where the ideal value is 1. It provides insight into whether the mis-calibration is due to over- (when above 1) or under-prediction (when below 1). 

The calibration error is measured in aggregate, which implies that the errors presented in a particular score range may be canceled out when errors are aggregated across score ranges. 

The error is normalized against the baseline class probabilities, which allows us to infer the degree of mis-calibration in a relative scale (e.g., 20% over-prediction against the average label).

Calibration performance directly impacts Avg-Pred-to-Avg-Label. Predictive performance alone won’t improve it.
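The metric itself is a one-liner; the cancellation caveat above is visible in the third check below:

```python
def avg_pred_to_avg_label(labels, preds):
    """Mean predicted score over mean label; 1.0 is ideal. Above 1
    signals over-prediction, below 1 under-prediction. Errors in
    different score ranges can cancel out in this aggregate."""
    return (sum(preds) / len(preds)) / (sum(labels) / len(labels))
```

A model that over-predicts on negatives and under-predicts on positives by the same amount still scores a perfect 1.0, which is why the binned calibration errors below complement it.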

Average/Expected Calibration Error

Model Performance

Calibration Error is an alternative measure for calibration performance. It measures the reliability of the confidence of the score predictions. Intuitively, for class predictions, calibration means that if a model assigns a class with 90% probability, that class should appear 90% of the time. 

Average Calibration Error (ACE) and Expected Calibration Error (ECE) capture the difference between the average prediction and the average label across different score bins. ACE calculates the simple average of the errors of individual score bins, while ECE calculates the weighted average of the errors weighted by the number of predictions in the score bins. ACE could over-weight bins with only a few predictions.

Both metrics measure the absolute value of the errors, and the errors are captured at a more granular level than Avg-Pred-to-Avg-Label. On the other hand, the absolute value makes it difficult to tell over-prediction from under-prediction, and these metrics are not normalized against the baseline class probabilities.

Similar to Avg-Pred-to-Avg-Label, calibration performance directly impacts Calibration Error. Predictive performance alone won’t improve it.
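A binned sketch of both errors, assuming equal-width score bins (the bin count is an illustrative choice):

```python
def calibration_errors(labels, preds, n_bins=10):
    """ACE: simple average of |avg score - avg label| over non-empty bins.
    ECE: the same per-bin errors weighted by each bin's share of
    predictions. Equal-width bins over [0, 1]; n_bins is illustrative."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, preds):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    errs, weights = [], []
    for b in bins:
        if b:
            avg_label = sum(y for y, _ in b) / len(b)
            avg_score = sum(p for _, p in b) / len(b)
            errs.append(abs(avg_score - avg_label))
            weights.append(len(b) / len(labels))
    ace = sum(errs) / len(errs)
    ece = sum(w * e for w, e in zip(weights, errs))
    return ace, ece
```

With many sparsely populated bins, ACE can be dominated by bins holding only a handful of predictions, while ECE down-weights them.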

Job seeker positive engagement metrics

Product

Job seeker positive engagement metrics capture job seekers’ interactions that we generally consider to be implicitly positive, for example, clicking on a job post or submitting an application. The implicitness leaves room for misalignment with users’ true preferences; for example, job seekers may click on a job simply because the title is novel.

When model performance improves against its objective function, job seeker positive engagement metrics may or may not improve. There are a few scenarios where we may observe discrepancies:

  1. Misalignment between model targets and engagement metrics (e.g., ranking model optimized for application outcomes which negatively correlates with job seeker engagements)
  2. Diluted impact due to system design
  3. Model improvement in the “less impactful” region (e.g., improvement on the ROC curve far from thresholding region)

Outcome metrics

Product

Outcome metrics measure the (expected) outcomes of job applications. The outcomes could be captured by employer interactions (e.g., employers’ feedback on the job applications, follow-ups with the candidates), survey responses (e.g., hires), or model predictions (e.g., expected hires model). 

Employers’ feedback can be either implicit or explicit. When it is implicit, it again leaves room for possible misalignment with true preferences – for example, we’ve observed spammy employers who aggressively reach out to candidates regardless of their fit to the position. 

Additionally, there are potential observability issues for outcome metrics when they are based on user interactions – not all post-apply interactions happen on Indeed, which could lead to two issues: bias (e.g., engagement confounded) and sparseness. 

When model performance improves against its objective function, outcome metrics may or may not improve. There are a few scenarios where we may observe discrepancies: 

  1. Misalignment between model targets and the product goal (e.g., one of the ranking models optimizes for application outcomes while the product specifically aims to deliver more hires)
  2. Diluted impact due to system design
  3. Model performance change is inconsistent across segments (e.g., the model improved mostly in identifying the most preferred jobs, while not improving in differentiating the more preferred from the less preferred jobs, resulting in popular jobs being crowded out.)

User-provided relevance metrics

Product

User-provided relevance metrics capture match relevance based on user interactions on components that explicitly ask for feedback on relevance, for example, relevance ratings on invite-to-apply emails, dislikes on Homepage and Search.

User-provided relevance metrics often suffer from observability issues as well: feedback is optional in most scenarios, so sparseness and potential biases are two major drawbacks.

When model performance improves against its objective function, user-provided relevance metrics may or may not improve. For example, we may observe discrepancy when there’s misalignment between model targets and relevance metrics.

Labeling-based relevance metrics

Product · Overall Ranking Performance

Labeling-based relevance metrics measure match relevance through a systematic labeling process. The labeling process may follow rule-based heuristics or leverage ML-based models.

The Relevance team at Indeed has developed a few match relevance metrics:

  • LLM-based labels: match quality labels generated by model-based (LLM) processes.
  • Rule-based labels: match quality labels generated by rule-based processes.

Similar to nDCG, we may also use labeling-based relevance metrics to assess overall ranking performance, e.g., GoodMatch rate@k, given the blended utility ranked lists.

When model performance improves against its objective function, labeling-based relevance metrics may or may not improve. We may observe discrepancies when there’s misalignment between model targets and relevance metrics.

Revenue

Product

Revenue measures advertisers’ spending on sponsored ads. The spending could be triggered by different user actions depending on the pricing models, e.g., clicks, applies, etc.

Short-term revenue change is often driven by bidding and budget pacing algorithms, which ultimately influence the delivery and budget depletion. Long-term revenue change is additionally driven by user satisfaction and retention.

When model performance improves against its objective function, revenue may or may not improve.

  • For short-term revenue, bid-scaling models could impact delivery and ultimately budget depletion. However, the effect could be diluted due to system design; for example, when monetization objectives have a trivial weight in the re-ranking utility formula, improvements to bid-scaling models may not have a meaningful impact on revenue.
  • For long-term revenue, we expect a directionally positive correlation, though discrepancies could happen, e.g., when there’s misalignment between model targets and relevance, or when the impact is diluted due to system design.

 

Evaluation Metrics for Online Modeling Experiments – Our Thoughts

🍰 Purchasing higher-quality, tastier strawberries may not always lead to more sales or happier customers. Consider a few scenarios:

  • The dessert shop started to develop a new series of core products featuring chocolates as the main ingredient. It becomes more important to find strawberries that offer a good balance in taste and texture with the chocolate.
  • The dessert shop started to develop a new series of fruit cakes. Strawberries are now only one of many fruits that are used.
  • There’s a recent trend in gelato cakes. The dessert shop decided to introduce a few gelato cakes that use far fewer strawberries. However, the gelato hype may pass, and strawberry shortcake has always been our star product.
  • The dessert shop moved to a location that is much harder to find, losing many regular customers.

Product Metrics vs. Model Performance Metrics

🥇 Top recommendation: improvement over product metrics, with guardrails on individual model performance metrics.

As previously discussed, optimizing individual models often doesn’t directly translate into achieving business goals, and the relationship between the two can be complex. Therefore, making investment decisions based solely on improvements in model performance is likely ineffective.

  • When model targets and business goals are misaligned, it’s challenging to derive product impact from model performance impact. Making decisions based on product metric improvements ensures the impact is realized. 
  • When the model’s contribution is diluted due to system design, it prompts investment in bigger bets or alternatively in components that allow incremental impact to be realized more effectively. 

🥈 Secondary recommendation: improvement over either product metrics or overall ranking performance metrics.

Although optimizing individual models doesn’t always directly meet business goals, enhancing overall ranking performance through metrics like nDCG@k aligns better with business objectives. This approach also helps mitigate downstream dilution or biases, allowing us to concentrate on improving re-ranking performance more effectively. That said, when the downstream dilution is by design, we could be making ineffective investment decisions if we simply ignore its impact.

This approach may also be valuable when the company temporarily focuses on short-term business goals. It allows ranking to stay focused on delivering high-quality matches while products take temporary detours.

Among Product Metrics

Product metrics for experiment decision making should ultimately be driven by business goals and product strategy. We want to share a few thoughts on the usage of different types of product metrics:

User engagement metrics are relatively easy to move in short-term experiments. They are often a fair proxy for positive user feedback. However, we should be mindful that they can have an ambiguous relationship with long-term business goals [4]. For example, clicks or applications are often treated as implicit positive feedback, yet it is not very costly for job seekers to explore or even apply to jobs they are not a great fit for. At the same time, exploring or applying to more jobs could be driven by bad user experiences (e.g., when job seekers have not gotten satisfactory outcomes so far).

Relevance metrics, conversely, generally align well with long-term business goals [4]. Nevertheless, there are a few drawbacks: 

  • User-based relevance metrics could be hard to collect and measure in short-term online experiments.
  • Heuristic-based metrics may not have great accuracy.
  • Model-based metrics could be hard to explain and may carry inherent biases that are hard to detect.

Therefore, we may consider leveraging a combination of user engagement metrics and relevance metrics to achieve a good balance in business goal alignment, observability, and interpretability.

Lastly, revenue is a key performance indicator for the business in the long term. However, short-term revenue may have an ambiguous relationship with long-term business goals as well [4]. We may drive more clicks or applications to increase spending in the short term, but if we are not bringing satisfactory outcomes to our users, they may not continue to use our product in the future. Hence, we recommend using revenue as a success metric only when we are improving components within the bidding ecosystem, where there are short-term objectives defined for the bidding algorithms to achieve. In all other cases, we may keep revenue as a monitoring metric to prevent unintended short-term harms.  

Among Model Performance Metrics

We recommend setting guardrails on individual model performance with Normalized Entropy, since we don’t want to degrade either predictive performance or score calibration. In addition, monitor ROC-AUC to help with deep-dive analysis and debugging.

For bid-scaling models, we recommend additionally monitoring calibration performance with Avg-Pred-to-Avg-Label. This provides visibility into over- or under-prediction and normalizes the error against the baseline class probability.

 

References

  1. Handling Online-Offline Discrepancy in Pinterest Ads Ranking System 
  2. Predictive Model Performance: Offline and Online Evaluations 
  3. Practical Lessons from Predicting Clicks on Ads at Facebook 
  4. Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned
  5. Measuring classifier performance: a coherent alternative to the area under the ROC curve 
  6. How Well do Offline Metrics Predict Online Performance of Product Ranking Models? – Amazon Science 
  7. Relaxed Softmax: Efficient Confidence Auto-Calibration for Safe Pedestrian Detection 

 

Appendix

Normalized Entropy

Normalized Entropy (NE) is defined as the following [3]:

NE = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log p_i + (1 - y_i)\log(1 - p_i)\right)}{-\left(p \log p + (1 - p)\log(1 - p)\right)}

where y_i is the true label, p_i is the predicted score, and p is the background average label.

Note: NE normalizes the cross-entropy loss by the entropy of the background probability (the average label). It’s equivalent to 1 − Relative Information Gain (RIG) [2].

ROC-AUC

The Receiver Operating Characteristic (ROC) curve plots true positive rate (TPR) against the false positive rate (FPR) at each threshold setting. ROC-AUC stands for Area under the ROC Curve.

Note: Given its definition, ROC-AUC could also be interpreted as the probability that a randomly drawn member of class 0 will have a score lower than the score of a randomly drawn member of class 1 [5].

nDCG

nDCG stands for normalized Discounted Cumulative Gain. We define Discounted Cumulative Gain (DCG) at position k for the ranking list of query q_i as

DCG@k(q_i) = \sum_{j=1}^{k} \frac{G(y_{i,j})}{\log_2(j + 1)}

where y_{i,j} is the relevance label for the j-th ranked item in query q_i and G is the gain function, which could take different forms (e.g., linear or exponential).

Then, we normalize DCG to [0, 1] for each query and define nDCG by summing the normalized DCG values over all queries:

nDCG@k = \sum_{i} \frac{DCG@k(q_i)}{maxDCG@k(q_i)}

where maxDCG@k(q_i) is the DCG value of the ranking list obtained by sorting the items of q_i in descending order of relevance [6].

Note: “query” may not be relevant in all search ranking tasks. Based on the product’s design, we may replace it with suitable groupings. For example, for the homepage, we may group on “feed.”

Avg-Pred-to-Avg-Label 

Avg-Pred-to-Avg-Label = \frac{\frac{1}{N}\sum_{i=1}^{N} p_i}{\frac{1}{N}\sum_{i=1}^{N} y_i}

where y_i is the true label and p_i is the predicted score.

Note: The percentage change in this value may not be fully informative since the ideal value is 1. To use it for experimental measurements, we may consider taking the Abs(actual – 1) or establishing alternative decision boundaries. 

Average/Expected Calibration Error 

Average Calibration Error and Expected Calibration Error are defined as the following [7]:

ACE = \frac{1}{M^+}\sum_{m=1}^{M^+} \left| S_m - A_m \right| \qquad ECE = \sum_{m=1}^{M^+} \frac{n_m}{N} \left| S_m - A_m \right|

where M^+ is the number of non-empty bins, S_m is the average score for bin m, A_m is the average label for bin m, n_m is the number of predictions in bin m, and N is the total number of predictions.

Note:

  • Average calibration error is a simple average of calibration error across different score range bins
  • Expected calibration error is the weighted average of calibration error across different score range bins, weighted by the number of examples in the bin