Distilling Long-Tail User Behavior into Scalable Embeddings for Job Search

Authors: Marsan Ma, Nikhil Lopes, Raj Amrit, Hong Lu, Dipankar Biswas, Trent Kyono
Leadership: Iris Wang, Madhu Kurup

Recommendation and ranking systems power many of the most important experiences on large internet platforms. Yet the models that run in production are rarely the largest models we can train. They are usually compact, latency-sensitive supervised models that need to score huge candidate sets for millions of users under tight cost constraints.

That creates a practical tension:

  • We want to use rich, long-term behavioral histories and modern deep learning.
  • We still need millisecond-level latency for high-traffic ranking, recommendation, and bidding systems.

This post describes how we addressed that tension in job search by building a user behavior modeling system, or UBM, that learns from long-tail user histories offline and distills them into compact user embeddings that many online models can consume.

At a high level, UBM:

  • Mines long-term user behavior with deep sequence models.
  • Distills each user’s history into a fixed-length embedding.
  • Makes that embedding available through a feature store.
  • Lets existing production models use the embedding with minimal serving changes.
  • Produces consistent multi-percent lifts across several high-traffic surfaces.

The core idea is simple: do the expensive sequence modeling once, offline, and reuse the resulting user representation many times online.

Figure 1. Behavior modeling as a two-layer system

Why long-tail user behavior is hard to use directly in production

For a job platform, understanding job seekers is central to matching the right people with the right roles. A user’s history can include many signals across the hiring journey:

  • Search queries, including titles, keywords, companies, and locations.
  • Job impressions and clicks.
  • Saves and bookmarks.
  • Apply starts and completions.
  • Employer responses and downstream outcomes.

In principle, this history is highly valuable. In practice, traditional tabular modeling often forces several compromises.

First, we keep fixed-length windows, such as the most recent K actions, and discard the rest. Second, we aggressively aggregate sequences into statistics such as “top title words” or “fraction of clicks in industry X.” Third, we rely on one-hot or sparse features that lose semantic similarity across titles, skills, companies, and industries.

Those simplifications can distort user intent. Consider a job seeker who once applied to civil engineering roles, later explored medical software trainer roles, and eventually settled into senior account management. A naive aggregation over title tokens might over-weight the word “software” and push the system toward software engineering recommendations, even though the user’s recent and consistent behavior points elsewhere.
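
A toy illustration of that failure mode, using made-up titles rather than real data: counting title tokens over the whole history puts “software” on top, while a recency-weighted count points to account management. The recency weighting is only a hand-tuned heuristic, standing in for what a sequence model can learn on its own. The `history` list, the `decay` value, and both helper names are assumptions for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical title history, oldest first (illustrative data only).
history = [
    "civil engineer",
    "software trainer",
    "medical software trainer",
    "software support specialist",
    "clinical software trainer",
    "senior account manager",
    "account manager",
    "senior account manager",
]


def top_token_by_count(titles):
    """Naive aggregate: the most frequent title token over the full history."""
    counts = Counter(tok for title in titles for tok in title.split())
    return counts.most_common(1)[0][0]


def top_token_by_recency(titles, decay=0.5):
    """Weight each token by decay ** age, where the newest title has age 0."""
    weights = defaultdict(float)
    newest = len(titles) - 1
    for index, title in enumerate(titles):
        for tok in title.split():
            weights[tok] += decay ** (newest - index)
    return max(weights, key=weights.get)


naive_top = top_token_by_count(history)      # "software" appears most often
recency_top = top_token_by_recency(history)  # an account-management token
```

Even the recency heuristic needs a hand-picked decay rate, which is part of why we prefer to let a model learn these weightings from the raw sequence.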

At the same time, directly serving stacked sequence models over raw per-impression history did not fit our existing production infrastructure, where high-traffic systems need to score large candidate sets within strict latency and cost budgets.

We needed a system that could let large models learn from raw sequences offline, then feed a distilled representation into the compact online models that already power production traffic.

 

Behavior modeling as sequence modeling

We model user behavior as sequences, much like sentences in natural language processing:

  • Sequences of jobs seen, clicked, saved, or applied to, with metadata such as title, location, salary, company, and category.
  • Sequences of search queries and their attributes.
  • Other contextual events, ordered over time.

Sequence models are useful here because they can denoise long histories, capture temporal structure, and learn semantic relationships between jobs, queries, and users. The serving constraint shapes the architecture: a large offline model reads long histories and emits a user embedding, while many small online models consume that embedding as an ordinary dense feature.

Conceptually:

Heavy sequence modeling happens offline; light scoring happens online many times.
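
A minimal sketch of that split, under heavy simplification: the “offline model” here is just an average of event vectors and the “online model” is a dot product. Both are stand-ins chosen so the example runs on its own, not the architectures described later in this post, and all names and numbers are hypothetical.

```python
def encode_user_offline(history):
    """Stand-in for the heavy sequence model: average the event vectors."""
    return [sum(col) / len(history) for col in zip(*history)]


def score_online(user_embedding, job_vector):
    """Stand-in for a light online model: a single dot product."""
    return sum(u * j for u, j in zip(user_embedding, job_vector))


# Offline, once per user: read the (possibly long) history, emit one vector.
history = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
user_embedding = encode_user_offline(history)

# Online, many times per request: reuse the stored vector for each candidate.
candidates = {"job_a": [1.0, 0.0], "job_b": [0.0, 1.0]}
scores = {
    job_id: score_online(user_embedding, vector)
    for job_id, vector in candidates.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

The cost of reading the history is paid once per user, while the per-candidate cost stays independent of history length.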

 

Architecture: from raw events to user embeddings

1. Encode jobs and events

The first step is to encode individual jobs and events into dense vectors. Each job contains multiple feature types:

  • Numerical features, such as salary and seniority signals.
  • Categorical features, such as location, job category, and company.
  • Multi-hot features, such as normalized titles, skills, and industries.

Each feature type is mapped into embeddings. Those embeddings are concatenated and passed through a Deep & Cross Network, producing a compact job embedding that captures both linear and non-linear feature interactions.

The same job encoder is reused across behavior streams so that all sequences live in a consistent embedding space.
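
As a rough sketch of the cross part of that encoder, the snippet below concatenates feature embeddings and applies one cross layer in the form used by the original Deep & Cross Network, x_next = x0 * (xl . w) + b + xl. Which DCN variant the production encoder uses is not stated here, so treat the layer form, the mean pooling of multi-hot features, and all weights as assumptions. The deep (MLP) branch is omitted.

```python
def concat_job_features(numeric, categorical_vec, multi_hot_vecs):
    """Concatenate numeric values, a categorical embedding, and pooled multi-hot embeddings."""
    # Mean pooling over the active multi-hot entries is an assumption of this sketch.
    pooled = [sum(col) / len(multi_hot_vecs) for col in zip(*multi_hot_vecs)]
    return list(numeric) + list(categorical_vec) + pooled


def cross_layer(x0, xl, w, b):
    """One cross layer: x_next = x0 * (xl . w) + b + xl."""
    scale = sum(x * wi for x, wi in zip(xl, w))
    return [x0_i * scale + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


# One numeric feature, one 2-d categorical embedding, two 2-d skill embeddings.
x0 = concat_job_features([0.5], [1.0, 0.0], [[1.0, 3.0], [3.0, 5.0]])
w = [0.1, 0.1, 0.1, 0.1, 0.1]   # hypothetical learned weights
b = [0.0, 0.0, 0.0, 0.0, 0.0]   # hypothetical learned bias
x1 = cross_layer(x0, x0, w, b)  # the first cross layer takes xl = x0
```

Each cross layer multiplies the original input by a learned scalar projection of the current layer, which is how explicit feature crosses enter the job embedding.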

Figure 2. Job encoder architecture

 

2. Build behavior sequences

In practice, we have gone through two generations of behavior encoders: an earlier multi-sequence design and our current unified single-sequence design. In the first production version of UBM, user history was treated as multi-channel, with each action type represented as a separate time-ordered sequence of job embeddings:

User U

     Apply: [job_a1, job_a2, …, job_aN]

     Click: [job_c1, job_c2, …, job_cM]

     Impression: [job_i1, job_i2, …, job_iK]

Each sequence includes positional encodings so the model can reason about recency, order, and temporal patterns instead of treating user history as an unordered set.
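
One common way to inject position is the sinusoidal encoding from the original Transformer paper, sketched below. This is an assumption for illustration: the post does not say whether UBM uses sinusoidal, learned, or time-based position features, and the function names are hypothetical.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    encoding = []
    for i in range(dim):
        # Indices 2k and 2k + 1 share the frequency 1 / 10000 ** (2k / dim).
        angle = position / (10000 ** ((i - i % 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_positions(sequence):
    """Add a position-dependent vector to each job embedding in the sequence."""
    return [
        [e + p for e, p in zip(vector, positional_encoding(pos, len(vector)))]
        for pos, vector in enumerate(sequence)
    ]


# The same job seen twice gets two different inputs once position is added.
same_job_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
encoded = add_positions(same_job_twice)
```

Without this step, attention would see the two events as identical and could not tell which came first.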

Figure 3. Multi-sequence user history with positions

3. Denoise long histories with self-attention

Long histories are noisy. People explore, change direction, compare roles casually, or click jobs that are only loosely relevant. The sequence model needs to separate durable intent from one-off behavior.

In the multi-sequence version, each action-specific sequence is passed through multi-head self-attention or transformer encoder blocks. The model re-weights each event in the context of the full sequence. Consistent patterns are amplified; isolated or off-topic events are down-weighted.

After attention, we pool the denoised sequence and combine it with skip connections from the original embeddings. Another Deep & Cross block then produces a per-sequence embedding, such as an “apply history embedding” or “click history embedding.”
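
The sketch below walks through those steps on a toy sequence: attention, pooling, and a skip connection. It uses a single attention head with identity query, key, and value projections and no learned weights. The production model uses learned multi-head layers, and adding the pooled raw sequence is just one way to realize a skip connection, so every function here is an illustrative assumption.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Single-head scaled dot-product attention with identity projections."""
    dim = len(sequence[0])
    attended = []
    for query in sequence:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, sequence)) for i in range(dim)]
        )
    return attended


def mean_pool(sequence):
    return [sum(col) / len(sequence) for col in zip(*sequence)]


def encode_sequence(sequence):
    """Pool the attended sequence and add a skip connection from the raw events."""
    return [a + r for a, r in zip(mean_pool(self_attention(sequence)), mean_pool(sequence))]


# Three consistent events and one off-topic event (second dimension).
events = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
raw_pool = mean_pool(events)
attended_pool = mean_pool(self_attention(events))
embedding = encode_sequence(events)
```

In this toy case the off-topic dimension carries less weight after attention than in a plain average, which is the down-weighting effect described above in miniature.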

Figure 4. Denoising a behavior sequence with self-attention

 

4. Evolve from multi-sequence fusion to a unified event timeline

Different actions carry different information:

  • Apply history is strong but sparse.
  • Click and impression history is weaker but dense, and captures exploration.
  • Saves, ignores, and other actions add nuance.

The first generation of UBM learned separate encoders for different behavior streams, then concatenated or fused their outputs into a single user history embedding. That design worked well: the final vector was expressive enough to capture long-term structure, but small and fixed-length enough for downstream models to consume cheaply.

More recently, we found that we could simplify the architecture and improve results by moving to a single-sequence design:

  • Merge all events, such as applies, clicks, impressions, and saves, into one unified timeline sorted by time.
  • Add an explicit event-type embedding to each step so the model knows whether the event was an apply, click, impression, or another action.
  • Let one transformer stack attend over the unified sequence instead of maintaining separate stacks per action type.

This design has two practical advantages.

First, it simplifies the model and pipeline. There is one sequence encoder to train, maintain, and monitor. Adding a new event type becomes a matter of adding a new event-type representation, rather than wiring a new encoder into the model and downstream feature pipeline.

Second, it gives attention layers direct access to cross-event patterns. The model can learn transitions such as “impression to click to apply” inside one timeline, rather than relying on separate per-action summaries that are fused only after each stream has already been compressed. This also follows a broader trend in modern sequence modeling: expose as much raw event data as possible and minimize hand-crafted aggregation, so the model can discover these patterns directly from first-hand signals.
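
A minimal sketch of building the unified timeline, assuming each per-action stream is already sorted by timestamp. The event-type vocabulary, the tuple layout, and the function name are hypothetical; in the real model each event-type id would index a learned event-type embedding.

```python
import heapq

# Hypothetical event-type vocabulary; ids would index an event-type embedding table.
EVENT_TYPE_IDS = {"impression": 0, "click": 1, "apply": 2}


def unify_timeline(streams):
    """Merge per-action streams into one list of (timestamp, event_type_id, job_id).

    Each input stream must already be sorted by timestamp.
    """
    tagged = [
        [(ts, EVENT_TYPE_IDS[event_type], job_id) for ts, job_id in events]
        for event_type, events in streams.items()
    ]
    return list(heapq.merge(*tagged, key=lambda event: event[0]))


streams = {
    "impression": [(1, "job_x"), (4, "job_y")],
    "click": [(2, "job_x")],
    "apply": [(3, "job_x")],
}
timeline = unify_timeline(streams)
event_type_sequence = [event_type_id for _, event_type_id, _ in timeline]
```

The “impression to click to apply” transition on `job_x` now sits in adjacent steps of one sequence, and supporting a new action only requires a new entry in the vocabulary.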

In a large downstream model, switching from the legacy multi-sequence UBM to the unified single-sequence behavior model roughly doubled the relative ROC-AUC gain over the no-UBM control model on several targets. For example, in one experiment the relative ROC-AUC improvement over control on the CTR target went from about +1.6% to +3.5%, and on the apply-start target from about +1.3% to +2.3%, using the same downstream architecture and data.
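
To make “roughly doubled” concrete, the ratio of the two reported gains can be computed directly from the figures above (the helper name is ours):

```python
def gain_ratio(legacy_gain_pct, unified_gain_pct):
    """How many times larger the unified model's gain is than the legacy gain."""
    return unified_gain_pct / legacy_gain_pct


ctr_ratio = gain_ratio(1.6, 3.5)          # about 2.19x on the CTR target
apply_start_ratio = gain_ratio(1.3, 2.3)  # about 1.77x on the apply-start target
```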

Figure 5. Legacy multi-sequence UBM vs. unified single-sequence UBM

 

Training and serving at scale

Offline training

Training the offline sequence model is challenging because each user can interact with hundreds of jobs, and each job can have many attributes. Naively materializing all joins between users, events, and job attributes would inflate the training table by a factor of hundreds, because each example references hundreds of jobs, and would turn a TB-scale dataset into a PB-scale one. Keeping jobs in a side table and resolving them with cached GPU lookups instead cuts the effective data volume by a similar factor and gave us over 100× faster training in internal benchmarks, since dense array lookups are exactly what GPUs are good at.
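
The side-table idea can be sketched in a few lines. This is a CPU-only toy with made-up values and hypothetical names; on a GPU the same operation would be an embedding-style gather over a dense array.

```python
# Side table: each job's feature vector is stored exactly once.
# The row index doubles as a dense job index (values are made up).
job_features = [
    [0.2, 0.7],  # job index 0
    [0.9, 0.1],  # job index 1
    [0.4, 0.4],  # job index 2
]

# Training examples carry only job indices, not copies of job features.
examples = [
    {"user": "u1", "job_indices": [0, 1, 0]},
    {"user": "u2", "job_indices": [2, 1]},
]


def gather(example, table):
    """Resolve job indices to feature rows at batch-construction time."""
    return [table[i] for i in example["job_indices"]]


batch = [gather(example, job_features) for example in examples]
```

Stored examples stay small because a job referenced by many users is never duplicated on disk; its features are attached only when a batch is assembled.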

The offline model is trained on months of historical data and updated periodically; in the unified single-sequence version, the same pipeline emits one chronologically sorted event stream per user, with event-type features attached to each step.

Daily embedding refresh

The model itself does not need to be retrained every day, but user behavior changes continuously. To keep embeddings fresh:

  • We run daily batch inference with the latest user histories.
  • The model remains fixed during that refresh cycle.
  • Input sequences slide forward as new events arrive.
  • Updated embeddings are written to a feature store for downstream consumers.

In practice, the offline model can remain useful for months between retraining cycles, while daily inference keeps user representations current enough for production ranking and recommendation.
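
A simplified sketch of one refresh cycle, assuming a frozen model exposed as a plain function and a feature store modeled as a dictionary. Both are stand-ins so the example runs on its own; the window of 256 events is the production sequence length cited in this post.

```python
MAX_SEQUENCE_LENGTH = 256  # production sequence length cited in this post


def frozen_model(events):
    """Stand-in for the trained sequence model: average the event vectors."""
    return [sum(col) / len(events) for col in zip(*events)]


def refresh_embeddings(model, histories, feature_store, max_len=MAX_SEQUENCE_LENGTH):
    """Recompute each user's embedding from the latest window of events."""
    for user_id, events in histories.items():
        if not events:
            continue  # nothing to encode for this user yet
        window = events[-max_len:]  # the window slides forward as events arrive
        feature_store[user_id] = model(window)
    return feature_store


feature_store = {}
histories = {"u1": [[1.0, 0.0], [1.0, 0.0]]}
refresh_embeddings(frozen_model, histories, feature_store)
day_one = feature_store["u1"]

histories["u1"].append([0.0, 1.0])  # a new event arrives before the next run
refresh_embeddings(frozen_model, histories, feature_store)
day_two = feature_store["u1"]
```

The model weights never change between the two runs; only the input window does, which is why the embedding can move daily without retraining.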

Figure 6. Data and deployment pipeline for UBM

 

Online consumption by many models

Downstream ranking, recommendation, and bidding models treat the UBM embedding as one more input feature. This is the key to making the system deployable:

  • Production models do not need to be rewritten as transformers.
  • Existing tabular models can be retrained with an additional dense feature.
  • Online serving cost changes only modestly because the heavy sequence modeling has already happened offline.

This “one producer, many consumers” pattern also improves consistency. Multiple surfaces share a common view of user intent instead of each team rebuilding its own local behavior features.
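
From a consumer's point of view, adopting the embedding can be as small as the sketch below: look up the vector and append it to the existing feature row. The dictionary-backed store, the embedding size, and the zero-vector fallback for users without an embedding are all assumptions of this example, not a description of our feature store API.

```python
EMBEDDING_DIM = 4  # hypothetical embedding size


def build_feature_row(user_id, tabular_features, feature_store):
    """Append the user's UBM embedding to an existing tabular feature row."""
    # Assumed fallback: users with no stored embedding get a zero vector.
    embedding = feature_store.get(user_id, [0.0] * EMBEDDING_DIM)
    return list(tabular_features) + list(embedding)


feature_store = {"u1": [0.1, 0.2, 0.3, 0.4]}
known_user_row = build_feature_row("u1", [5.0, 1.0], feature_store)
new_user_row = build_feature_row("u2", [3.0, 0.0], feature_store)
```

Every consumer that reads the same key gets the same vector, which is where the cross-surface consistency comes from.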

 

How much does this help?

Offline model quality

On an offline apply-prediction benchmark, we compared a strong baseline model with traditional features against the same model augmented with UBM embeddings. Adding more behavior streams improved ROC AUC consistently in the multi-sequence version:

  • Apply sequence only: roughly +3.0% relative ROC AUC.
  • Apply + click sequences: roughly +3.6%.
  • Apply + click + impression sequences: roughly +4.9%.

We also found that longer histories helped. Extending maximum sequence length from 8 to 32 events showed monotonic gains, with diminishing returns as the sequence length increased. While Figure 7 shows only this early sweep up to 32 steps, our current production UBM models already use sequence lengths up to 256 events, and we are actively exploring longer contexts as we scale up infrastructure.

The newer unified single-sequence model kept the same operational pattern but improved downstream lift by letting the model attend across event types directly:

  • Click-based target: roughly +1.6% with legacy multi-sequence UBM, improving to about +3.5% with unified single-sequence UBM.
  • Apply-start rate: roughly +1.3% with legacy multi-sequence UBM, improving to about +2.3% with unified single-sequence UBM.

Figure 7. More behavior, longer histories, and unified timelines improve model quality

These results support three intuitions:

  • More behavior channels matter because different actions contain complementary signals.
  • Long-tail history matters because the model can learn to denoise behavior instead of relying on hand-truncated windows.
  • Event order across behavior types matters because job search intent often emerges through transitions, not isolated action streams.

Impact on high-traffic job surfaces in production

The more important question is what happens in production.

Because the same embedding can be reused across many models, the impact compounds across surfaces. In production experiments, adding UBM features produced consistent gains in recommendation quality, apply efficiency, and monetization metrics.

Surface | Metric | Relative lift
Jobseeker Email Recommendations | Application rate | +5.24%
Jobseeker Homepage Recommendations | Application rate | +2.04%
Employer Resume Search | NDCG@10 | +2.89%
Employer Candidate Recommendations | Employer acceptance rate | +1.64%

Across downstream models, the UBM embedding is often ranked as a top-1 or top-2 feature by importance, consistent with the observed business impact.

 

Extending beyond user behavior

Once the infrastructure existed for user behavior, the same pattern became useful elsewhere in the marketplace.

Employer behavior modeling applies analogous techniques to employer interactions with candidates. Those embeddings can feed sourcing, ranking, and bidding systems.

Knowledge graph and item embeddings offer another extension. Graph-based methods can encode relationships between jobs, companies, skills, and users, and those representations can be combined with sequence-based UBM and employer behavior embeddings.

The broader pattern is reusable: train richer representation models offline, distill them into compact embeddings, and expose them as shared features for production systems.

 

Design trade-offs and lessons learned

The offline-online split is worth it

Splitting the model into a heavy offline encoder and light online consumers sacrifices some optimality. For example, attention is not conditioned on the exact target job at scoring time. But the latency and cost benefits are large, and the resulting system is much easier to deploy across many production models.

Sequence modeling beats hand-built aggregates

Raw sequences across behavior types are more expressive than handcrafted features such as “percentage of clicked jobs in industry X.” The model learns which events to emphasize, which to down-weight, and how behavior changes over time. The single-sequence design extends this lesson: preserving cross-event order can be more valuable than summarizing each action stream independently and fusing later.

Staleness is manageable, but must be monitored

The embedding model does not need to be retrained every week, but the embeddings it produces do need operational monitoring. Useful diagnostics include offline ROC AUC and log loss, online KPIs, feature distribution checks, and day-over-day cosine similarity of user embeddings to catch pipeline anomalies.
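
The day-over-day cosine check can be sketched as follows. The threshold of 0.5 and the function names are placeholders we picked for the example; a real alerting threshold would need to be calibrated against normal daily drift.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as maximally suspicious
    return dot / (norm_a * norm_b)


def day_over_day_drift(yesterday, today, threshold=0.5):
    """Return (mean similarity, flagged users) over users present on both days."""
    shared = sorted(set(yesterday) & set(today))
    sims = {u: cosine_similarity(yesterday[u], today[u]) for u in shared}
    flagged = [u for u in shared if sims[u] < threshold]
    mean_sim = sum(sims.values()) / len(sims) if sims else None
    return mean_sim, flagged


yesterday = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
today = {"u1": [1.0, 0.1], "u2": [1.0, 0.0]}  # u2 flipped direction overnight
mean_similarity, flagged_users = day_over_day_drift(yesterday, today)
```

A sudden drop in the mean usually points at the pipeline (a schema change, a missing upstream partition), while a handful of flagged users is more often genuine behavior change.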

Centralized embeddings reduce duplicated effort

A shared embedding producer lets multiple product teams benefit from the same representation learning investment. It also reduces duplicated local behavior features and makes user understanding more consistent across surfaces.

 

Where we are heading next

UBM is now a foundation layer for many ranking, recommendation, and bidding systems, but several directions remain active:

  • Richer unified sequence modeling across search, browse, apply, and off-platform signals.
  • Joint user-employer modeling so both sides of the marketplace can be represented together.
  • Tighter integration with retrieval and approximate nearest-neighbor search.
  • Better use of behavior embeddings as context for large language models in hiring workflows.
  • Standardized tooling and observability so downstream teams can adopt embeddings safely.

The goal is straightforward: use long-term behavioral data to make job search and hiring more relevant, while staying within the real-world constraints of large-scale production systems.

Become Builders, Not Coders

Why agentic coding tools demand a new identity for software engineers

After more than two decades of professional software engineering, I have arrived at a set of conclusions that I find very uncomfortable.

The era of mostly manual coding has ended. IDEs, in their current form, are no longer necessary. Traditional software development languages are showing signs that they have already entered the beginning of their end (see nanolang, an experimental language designed for agents rather than humans).

These statements are deliberately provocative, and I expect that many of you reading them will disagree, some strongly.

My conclusion, and impassioned plea, is this: each one of us must adapt to the new world not by lamenting how our jobs change, but by embracing the notion that we were never paid to code. Coding was just something we did.

We are paid to build products that solve customer problems using code.

Doing that with agentic coding tools involves a hugely different set of actions, but it produces the same output and ultimately requires the same high-level skills, while freeing us from much of the minutiae.

A sudden change

I have long had an interest in neural networks – way before LLMs were possible – and I thought we were many years, probably somewhere between generations and “never,” away from what has become real in the past few years. My personal perspective on AI coding can be summed up as follows:

  • Through 2023: AI tab complete is a great demo and a neat toy, but only useful in languages without strong types and solid deterministic auto-complete (JetBrains Java quality), or for very inexperienced developers.
  • 2024 to Early 2025: Chat Oriented Programming (CHOP) seems real, but it seems to require top-1% developer skills to realize significant gains, and I am getting concerned that I will not catch up. Vibe Coding is a silly fad, similar to titling yourself “ninja” or “wizard” in a resume.
  • Mid-2025: Agentic coding is magic. Context Engineering is a real skill, and Model Context Protocol (MCP) is reckless, but amazing. Things I thought impossible are now trivial. The security risks and automation problems are huge, but everything is going to change.
  • Now: AGENTS.md, MCP, Skills, Commands, Sandboxing, Subagents, Ralph Loops, Beads, Orchestrators, etc. – I cannot keep up; maybe no one can fully. Figuring out what to build, and how to have agents build it better and faster than humans, is all that matters. Many of the things I assumed were critically important are now simply irrelevant. I now routinely produce code in languages I can barely read, easily, and with vastly more confidence than I would have as a typical beginner – yes, I’ve had known experts review it, and gotten good feedback that it is not “slop”; it looks like code written by humans, except with better tests than average.

Why such a sudden change?

The trite answer to the rapid advancement is simply that the models got better, cheaper to train and run (per parameter or token), and more available. That is definitely a factor, but I believe the more critical advances included:

  • Context Size: The usable context size grew to the point that agents started to accomplish significant tasks with minimal oversight. Their ability to quickly process and return meaningful results from vast amounts of short-term context now clearly exceeds human capability.
  • Context Availability: MCP allowed agents to explore context beyond that which was available on the local machine.
  • Tool Use: Tool use has largely eliminated hallucinations for LLM use where output can be verified. (Karpathy predicted this 8 years ago!) Areas in which they continue hallucinating tend to be obvious and easily corrected.
  • TODOs: A simple TODO tool added to Claude Code made it discontinuously better at staying on task; here is an interview with its creators discussing it.

What does this mean for the profession of Software Engineering?

I do not think anyone understands the full ramifications of this change. It is too new and too fast. I use the metaphor of going to sleep one night as a blacksmith knowing only hammers and bellows, and waking up the next morning employed at a modern metal shop with hydraulic presses, CNC machines, laser cutters, advanced welding equipment, and even additive manufacturing. The change happened so suddenly that it is hard to express how shocking it is to those not used to the pre-agentic methods.

I am personally optimistic for the profession of Software Engineering. Jevons Paradox describes how increases in efficiency can result in more consumption of a resource, not less. Jevons observed that as steam engines became more efficient in their use of coal, total coal consumption increased rather than decreased — because the efficiency made coal-powered applications economically viable in far more contexts. We may already see this in how AI affects Radiologists.

Most software I use personally is pretty awful. It is buggy, has a UI that was clearly developed without any UX expertise, is isolated instead of integrated with other systems, and has huge security holes. Yes, “slop” could produce more of that. But fixing these problems is largely the application of expertise that can be encoded into context. This expertise can be applied by software engineers who, without AI tooling, would have neither deep specialty skills nor the time to improve any of them.

I fully expect that Marc Andreessen’s observation that Software is Eating the World will accelerate, driven by the efficiency from agentic tools. This will lead to a new era of demand for solid engineers using those tools.

That transition is not happening yet and may still take a few years. I do not mean to minimize the real pain in the industry right now: many companies did over-hire in 2020-2022, and we really did promise too many college students that studying CS was a golden ticket. I have personally been laid off multiple times, and early in my career I spent half a year finding a job that came with a 50% pay cut. It hurts, and I do not mean to imply that the last few years, or the next few, have not been or will not be painful for many.

What does this mean for software engineers?

In my experience, this shift is already well underway at many companies. Most are not large enough to build their own agentic coding stacks like Google, Meta, Amazon, or Microsoft, but are investing in commercial tools, training, and internal interest groups. Some are even adjusting their performance evaluation criteria to reward adoption of agentic coding.

A quarter century into my career, I feel like an old dog trying to learn new tricks. But, I am personally grateful that my employer provides access to these tools. I am too risk averse to want to pay hundreds of dollars speculatively for access to tools just to train myself. Having access through work eliminated that barrier and got me started.

The focus on re-training will not last forever. Companies are not paying for it altruistically. The investment needs to translate into real capability, not just a line on a resume.

The goal is increased productivity, which is notoriously difficult to measure directly. As proxies, we should look for more prototyping, improved experiment velocity, lower maintenance costs, and higher quality (more and better tests, more rigorous standards implemented more consistently, etc.). Early results are promising, and more clearly, the constraints, risks, and barriers are becoming visible — which allows us to focus on overcoming them.

Here is what I have seen work, and what I believe engineers at every level need to think about.

A shift in perspective

All of us need to shift our identity from a focus on coding, to a focus on solving problems with software. This is a huge request – almost a shift in identity, not just thought.

I have introduced myself for most of my career as “mostly a Java guy.” Yes, I have significant professional experience in several other languages. But if I were really honest with myself, I thought of myself as a coder who wrote and read Java as a first language, and a dozen or two others as second languages with various levels of competence.

Agentic coding has revealed that this way of speaking was always an idiom. No one who buys software really cares that I know Java. I was never paid for that. I was paid to solve problems with software, and for a large part of the last 25 years, Java just happened to be a relatively good tool.

Very deliberately, I have to think of myself as a problem solver, who uses code.

What to change

Details of how this change in perspective will be worked out vary based on role.

Individual contributors – roughly senior engineer and below – who traditionally coded most of the time should focus on learning key skills that we have long expected our senior engineers to master:

  • Work Decomposition: Breaking large tasks into tasks small enough for a single context window is one of the core skills of Context Engineering. This breakdown was previously done mostly by tech leads and managers. With agentic coding, it must be learned almost immediately since agents can do hours of typing in seconds.
  • Rapid Code Review: The ability to read code quickly is critical, with less focus on minutiae and more focus on the core of the change, overall style, good patterns, etc. Soon, agents will likely make this easier, but it is important to be able to do so directly today.
  • Technical Writing: Models use human languages, most commonly English at the moment. Improve your writing skills. Learn to use Grammarly, Copilot, or Gemini (or coding tools!) to improve your style. Have agents assess your writing, and ask them to interview you to help find ways to communicate more effectively, both improving style and filling gaps.
  • Clean Code: Emphasize specification. Build minimal solutions; ask for agentic review (before code reviews) regarding patterns, alignment with standards, style, etc.; be willing to start over. Write great tests: if you need to make a big change, and they are missing, ask agents to write deliberately over-specified tests before making the change, then use test breakage as a signal that the change is what you intend.

Engineering leaders – staff-and-above individual contributors, and technical managers – need to:

  • Reconsider the Cost of Software: Leaders learn – often the hard way – that code is a liability. Many think in terms of “lines of code spent,” a limited resource because of maintenance costs, rather than “lines of code written.” This is now less true. Software 2.0 means a clear specification can be translated into code or rewritten in another language with far less effort, cost, and risk.
  • Use Agents to Understand Your Codebase: Ask agents to explain your codebase. Study the output and look for things you know to be right or wrong. Ask for reviews or critiques. Try with different instructions, focusing on different aspects or different personas.
  • Build Again: Get your development environment working again. Fix simple bugs. Do work that is less interesting, like migrations. Learn to use low-code automation platforms and AI assistants to automate things. Or build entirely new projects yourself, particularly personal or internal tools. The roles you hold demand the ability to decompose work, review code, and write about technical topics, so you are extremely well equipped to build with agentic tools. Doing so can help you lead, coach, and mentor others. One of my key learnings has been that I need to use the tools myself to understand how deeply different it is to develop software with them.

And the implications extend well beyond engineering. Product managers, business analysts, and others outside R&D are finding that low-code automation tools and AI assistants allow them to automate repetitive work, build prototypes that communicate requirements better than any document, and even verify outputs against specifications. The bar for who can build useful software is dropping fast, and non-engineers who adapt will have an outsized impact on their organizations.

How to change

Change is happening so rapidly that this list will probably seem incomplete, irrelevant, or perhaps even wrong in weeks. But, today, here is my recommendation:

  • Learn the tools. Become a constant user of at least one agentic coding tool – whether CLI-based or IDE-integrated. The landscape is evolving fast, but the current leaders include Claude Code, OpenAI Codex, Gemini CLI, Amp, Cursor, and Windsurf. Pick one and commit to using it daily.
    • For every task beyond a few lines or clicks, not just implementation, spend a few minutes trying to get AI to do it.
    • Join or create internal communities for sharing AI development techniques; ask for help and help others.
    • Blog on big wins; help your teammates. Observe something; do it; teach someone else – “See one, do one, teach one” (SODOTO).
  • Focus on building agentically: Avoid typing code or copying & pasting from chat. Let the agents make the changes, build, observe outputs, and iterate. Remember that agents need context, constraints, and success criteria, not instructions on what to do.
  • Learn Context Engineering: Some of the more common complaints are that agents make assumptions and hallucinate. Much of that is caused by gaps in what they are presented. They have been trained on nearly every text document that can be legally presented to them, so there are a lot of differing decisions and even bad practices built in. Add context with good examples, standards, etc. Sometimes, this is a prompt or an AGENTS.md, but just as often, it is Skills, Hooks, MCP, or Commands that encode fixed behavior.
  • Pay attention to risks. Learn about things like the Lethal Trifecta, sandboxing, and prompt injection. Learn how to allow-list tool use and assess risk. Keep your tools updated.
  • Build your skills incrementally; roughly in order, that is:
    • Start with agentic coding and deliberate Context Engineering.
    • Take opportunities to figure out how to use agents to break down and build smaller tasks. Use them for planning and research.
    • Experiment with Spec Driven Development (SDD) to use multiple context windows for a single task to produce more consistent results with fewer interruptions.
    • Figure out how to run a simple Ralph Loop – a scripted loop that repeatedly invokes an agent across many context windows to make changes too large for any single session.
    • Experiment with multiple, parallel, agentic sessions or even Agent Orchestrators. Learn from research on scaling agents.
  • Follow your organization’s AI coding policies. Use the provided tools and ask for permission to explore new opportunities.
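
For readers who have not seen one, the Ralph Loop mentioned in the list above can be sketched in a few lines. This is my own minimal illustration, not a reference implementation: `run_agent` stands in for however you launch a fresh agent session (typically a command-line invocation), `is_done` for whatever completion check you trust (tests passing, an empty task file), and the stub agent below only pretends to do work.

```python
def ralph_loop(run_agent, is_done, prompt, max_iterations=20):
    """Invoke a fresh agent session with the same prompt until the work is done.

    Returns the number of iterations used, or None if the cap was reached.
    """
    for iteration in range(1, max_iterations + 1):
        run_agent(prompt)  # each call is a new session with a new context window
        if is_done():
            return iteration
    return None


# Stub agent for illustration. Progress lives outside the agent (here a list;
# in practice usually files in the repository), so each session can resume.
remaining_tasks = ["migrate module a", "migrate module b", "update docs"]


def stub_agent(prompt):
    if remaining_tasks:
        remaining_tasks.pop(0)  # pretend one task is completed per session


iterations_used = ralph_loop(
    run_agent=stub_agent,
    is_done=lambda: not remaining_tasks,
    prompt="Pick the next unfinished task, complete it, and record progress.",
)
```

The two details that matter are the iteration cap, which keeps a confused agent from running forever, and the externally stored progress, which is what lets a change span many context windows.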

What will happen in the next few years?

My personal speculation is that the key change will be that the documentation, processes, and common knowledge that collectively helped growing groups of humans work together will be encoded into context and software that manages agents. This will not happen all at once, and right now, initial efforts can best be described as chaotic. Things that I am almost sure will happen, in some form, are:

  • TODOs will rapidly evolve into a hierarchy of tasks. Agents will gain the ability to identify tasks that are too large and break them down, as well as to discover new ones. LLMs that now ignore missing context and make assumptions will get better at identifying these gaps and finding ways to fill them, which will often include asking humans but also involve better automated context seeking, and collective memories (per user, per team, per company, etc.).
  • Orchestration will become the norm, not something that seems novel (e.g. Gas Town).
  • Sandboxing, techniques using adversarial agents, context isolation (strip the why of a requested action out and consider only if it is reasonable – should cut off most prompt injection), and less intrusive permission requests, will mature to the point that agents will run almost continuously to improve code.
  • IDEs will disappear in their current form, but their capabilities – refactoring, debuggers, profilers, structural search & replace, etc. – will become tools agents use to reduce the number of tokens consumed to accomplish the same tasks, paralleling how humans benefit from those tools.
  • Traditional languages, built for humans, will be replaced with languages built for agents that do not care about the amount of typing, are willing to accept required testing, can be indexed easily, have strong type constraints, etc. Things like Foreign Function Interfaces (FFI, as opposed to system calling conventions) will become less important since the complexity of lower-level interfaces does not seem to be a problem for LLMs.
  • Code review will evolve into change review: humans will, for the most part, stop reading the code but still need to be able to reason about how the system evolves. Change review will describe changes with clear prose and diagrams, not present line-by-line diffs, and allow conversational exploration of the change.

The collective result of this will be an increase in software scale at least equivalent to the jump from computers that ran programs directly constructed in machine code, to computers running an operating system running software written in “high level” languages.

Conclusion

Two quotes come to mind as I consider this paradigm shift (forgive the over-corporate term – it is literally appropriate here).

The first is the quote often attributed to Thomas J. Watson, then IBM Chairman: “I think there is a world market for maybe five computers.” Whether or not he actually said it, I think the sentiment was correct. There really was a market for only a few computers when all software was written directly in machine code, with no operating system, on enormously expensive hardware. The past 80 years have taken computers from things around which buildings were built to things so cheap that they are thrown away in common single-use disposable medical devices. I very much suspect that LLMs are the next step in this change. Whether history will see them as an extension or a second revolution is something for later generations to decide, but I am certain that the change is happening far more rapidly now.

The second is more alarming: Upton Sinclair wrote in his memoir, “It is difficult to get a man to understand something, when his salary depends upon his not understanding it.” I find it extremely challenging to think about the consequences of agentic gains. So much of my career has been focused on the skills required to do things that LLMs can now do trivially that it very much feels like I am being replaced.

I have to remind myself constantly that the core skills of engineering do not go away with better tools. CAD didn’t eliminate civil engineering or architecture; it eliminated pencil skills for drafting. Word Processing didn’t eliminate writing; it made typing vastly easier. Agentic Coding will not eliminate Software Engineering but it will very likely eliminate coding. The blacksmith who woke up in a modern metal shop still needs to know metallurgy, tolerances, and what the customer actually needs built. The tools changed. The craft did not.

Knowledge of code is not what our value depends on. Knowledge of how to build software is the skill that has always been, and is now very clearly, the most critical.


Michael Werle is a Technical Fellow in Core Infrastructure at Indeed, where he serves as tech lead across the organization’s platform engineering and SRE teams. He can be reached on LinkedIn.

This article was written by hand, with agentic tools used for feedback and editing.

Bringing Lighthouse to the App: Building Performance Metrics for React Native

At Indeed we’ve open sourced a new React Native repository which makes it simple to measure Lighthouse scores in your mobile apps. We think it will help other organizations better measure their app performance, especially for companies similar to Indeed who are transitioning from a web-first to an app-first approach.

You can check out the code here, and read on for more details.

The Challenge

Indeed had traditionally been a web company. Site speed wasn’t just a nice-to-have; it was fundamental to how we built systems. We believed good software was always fast, and for many years we had relied on Lighthouse to keep us honest. We’ve written in depth on this topic in the past, but as we transitioned to a mobile-app-first company, we needed a way to bring the same performance rigor to our native code.

As React Native proliferated across our most critical pages—ViewJob, SERP, Homepage—we found ourselves flying blind. We had no standardized way to measure whether our mobile performance was improving, degrading, or holding steady. We needed answers to fundamental questions: How fast did our screens load? When could users actually interact with them? Were we maintaining the performance standards that Indeed was known for?

The Solution: Core Web Vitals for React Native

Rather than reinvent the wheel, we looked to the industry standards that had proven effective on the web: Core Web Vitals. These metrics — designed by Google to capture the essence of user experience — translated remarkably well to mobile apps. We just needed to adapt them for React Native’s unique threading model and lifecycle.

The Metrics That Matter

  1. Time to First Frame (TTFF) — When users saw content
    Our analog to Largest Contentful Paint (LCP). It measures how quickly users see meaningful content after a component starts mounting. In a native app context, this needs to be fast — there’s no network request to fetch the document, no HTML parsing, no CSS cascade. Code is pre-bundled. Users expect instant visual feedback.
    Threshold: < 300ms is good, > 800ms is poor.
  2. Time to Interactive (TTI) — When users could actually do something
    The most critical metric for mobile apps. We measure when a component transitions from “loading” to “ready for interaction.” Unlike the web, where TTI was algorithmically determined, we let components self-report when they’re truly interactive — when data is loaded, UI is rendered, and touch handlers are ready. While not ideal in every case, we’ve found algorithmic TTI (e.g., TTI Polyfill) can also be inaccurate.
    Threshold: < 500ms is good, > 1500ms is poor.
  3. First Input Delay (FID) — How responsive the app felt
    Captures the delay between a user’s first touch and when the app responds. On mobile, touch interactions should feel instantaneous. Any perceptible lag breaks the illusion of direct manipulation that makes mobile apps feel native.
    Threshold: < 50ms is good, > 150ms is poor.
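
To make the three thresholds concrete, here is a minimal sketch of how a measured value could be bucketed. The thresholds come from the list above; the `rateMetric` helper and the "needs-improvement" label for the middle band are assumptions borrowed from Core Web Vitals terminology, not names from the open-sourced repository.

```typescript
// Thresholds in milliseconds, taken from the metric definitions above.
const THRESHOLDS = {
  TTFF: { good: 300, poor: 800 },
  TTI: { good: 500, poor: 1500 },
  FID: { good: 50, poor: 150 },
} as const;

type MetricName = keyof typeof THRESHOLDS;

// The middle band's name is an assumption; the article only names "good" and "poor".
type Rating = 'good' | 'needs-improvement' | 'poor';

// Hypothetical helper: classify one measurement against its thresholds.
function rateMetric(metric: MetricName, valueMs: number): Rating {
  const { good, poor } = THRESHOLDS[metric];
  if (valueMs < good) return 'good';
  if (valueMs > poor) return 'poor';
  return 'needs-improvement';
}
```

Under this sketch, a TTFF of 347ms falls between the two thresholds, while a TTI of 172ms rates as good.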

Why These Thresholds?

Our thresholds are significantly stricter than Core Web Vitals (roughly 40% tighter). This was intentional. Native apps need to be faster than web apps:

  • ✅ No network requests for initial render
  • ✅ Code is pre-bundled in the app
  • ✅ No HTML/CSS/JS parsing overhead
  • ✅ Users expect native app speed

For context, Core Web Vitals treat LCP under 2.5s as “good” and over 4.0s as “poor.” Our TTFF thresholds of 300ms and 800ms are roughly 8× and 5× stricter, respectively. Mobile users have different expectations, and our thresholds reflect that reality.

Integration: Dead Simple

The entire system is packaged as a single React hook. Integration takes three steps:

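import React, { useEffect } from 'react';
import { View } from 'react-native';
// usePerformanceMeasurement comes from the open-sourced package (import path not shown here).

function MyComponent(): JSX.Element {
  // 1. Add the hook
  const { markInteractive, panResponder } = usePerformanceMeasurement({
    provider: 'myapp',
    componentName: 'MyComponent' as const,
  });

  // 2. Mark when interactive
  //    (dataLoaded stands in for your component's own loading state)
  useEffect(() => {
    if (dataLoaded) {
      markInteractive();
    }
  }, [dataLoaded]);

  // 3. Attach pan responder to root view
  return (
    <View {...panResponder.panHandlers}>
      {/* Your component */}
    </View>
  );
}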

That’s it. No configuration files, no complex setup, virtually no performance overhead in production. The hook handles everything: timing, measurement, logging, and cleanup.

Technical Implementation

Architecture Overview

The measurement system follows a component’s lifecycle from mount to interaction:

Component Mount → TTFF Measurement → TTI Marking → FID Capture → Logging

1. Measuring Time to First Frame

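React Native’s InteractionManager is key. Its runAfterInteractions method defers a callback until any in-progress interactions and animations have completed, which we use as a proxy for the first frame having rendered. That makes it a convenient hook for measuring TTFF:

import { useEffect } from 'react';
import { InteractionManager } from 'react-native';

// mountStartTime: timestamp (ms) recorded when the component started mounting
useEffect(() => {
  const handle = InteractionManager.runAfterInteractions(() => {
    const ttff = Date.now() - mountStartTime;
    // TTFF captured after first frame renders
  });
  // Cancel the pending callback if the component unmounts first
  return () => handle.cancel();
}, []);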

2. Marking Time to Interactive

Components know best when they’re truly interactive. Rather than trying to algorithmically determine this (as Lighthouse does for web), we provide a markInteractive() callback that components call when they’re ready:

const { markInteractive } = usePerformanceMeasurement({
  provider: 'viewjob',
  componentName: 'ViewJobMainContent'
});

useEffect(() => {
  if (dataLoaded && uiReady) {
    markInteractive(); // Component decides when it's interactive
  }
}, [dataLoaded, uiReady]);

3. Capturing First Input Delay

React Native’s PanResponder gives us comprehensive input capture across all touch types. We measure the delay between touch start and when the JavaScript thread can process it:

import { PanResponder } from 'react-native';

const panResponder = PanResponder.create({
  onStartShouldSetPanResponder: () => {
    const inputTime = Date.now();
    // Defer to the next turn of the JavaScript event loop
    setImmediate(() => {
      const processingTime = Date.now();
      const fid = processingTime - inputTime; // JavaScript thread delay
    });
    return false; // Don't capture the gesture
  }
});

The setImmediate is crucial. It ensures we measure how long the JavaScript thread takes to become free to process the input, not just the touch handler execution time.

4. Smart Logging Strategy

  • Wait for FID: Delay logging until first user interaction
  • Timeout fallback: Log after 5 seconds even without interaction
  • Single event: All metrics logged together for easier analysis

This approach gives complete performance profiles while avoiding metric fragmentation.
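
The three rules above can be sketched as a small buffer that holds metrics until the first input arrives or a timeout fires. This is an illustration only: `MetricBuffer`, its method names, and the `Logger` callback are hypothetical and are not taken from the repository. Only the 5-second fallback and the single combined event come from the list above.

```typescript
interface PerformanceEvent {
  TTFF_ms: number | null;
  TTI_ms: number | null;
  FID_ms: number | null; // stays null if the timeout fires before any input
}

type Logger = (event: PerformanceEvent) => void;

// Hypothetical helper: collects metrics and emits exactly one event.
class MetricBuffer {
  private event: PerformanceEvent = { TTFF_ms: null, TTI_ms: null, FID_ms: null };
  private flushed = false;
  private timer: ReturnType<typeof setTimeout>;

  constructor(private readonly log: Logger, timeoutMs = 5000) {
    // Timeout fallback: log whatever has been collected if no input arrives.
    this.timer = setTimeout(() => this.flush(), timeoutMs);
  }

  recordTTFF(ms: number): void {
    this.event.TTFF_ms = ms;
  }

  recordTTI(ms: number): void {
    this.event.TTI_ms = ms;
  }

  // Wait for FID: the first input completes the profile, so log right away.
  recordFID(ms: number): void {
    if (this.flushed) return;
    this.event.FID_ms = ms;
    this.flush();
  }

  // Cleanup (for example on unmount): cancel the timer and suppress logging.
  dispose(): void {
    clearTimeout(this.timer);
    this.flushed = true;
  }

  // Single event: all metrics are logged together, at most once.
  private flush(): void {
    if (this.flushed) return;
    this.flushed = true;
    clearTimeout(this.timer);
    this.log({ ...this.event });
  }
}
```

One consequence of this design is that an input arriving before TTI has been marked would be logged with a null TTI. Whether to hold the event a little longer in that case is a choice the sketch leaves open.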

Real-World Results

We first integrated this system into ViewJob, one of Indeed’s highest-traffic pages. Here’s what we learned:

Console Output (Development)

[Performance-Debug] TTI marked: 172ms
[Performance-Debug] TTFF captured: 347ms
[Performance] viewjob/ViewJobMainContent: {
  TTFF_ms: 347,
  TTI_ms: 172,
  FID_ms: 0,
  FID_type: "touch"
}

The Lighthouse Score Equivalent

To make performance actionable, we created a composite score (0–100) that mirrors Lighthouse scoring:

const PERFORMANCE_WEIGHTS = {
  TTFF: 0.25, // Visual loading
  TTI: 0.45,  // Interactivity (most critical)
  FID: 0.30   // Responsiveness
};

TTI gets the highest weight (45%) because mobile users expect immediate interactivity. Visual loading and responsiveness are important, but nothing frustrates users more than tapping a button that doesn’t respond.
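
As a rough illustration of how these weights could combine into a 0–100 score, the sketch below maps each metric to a per-metric score and takes the weighted sum. The weights and thresholds come from this article. The linear interpolation between the "good" and "poor" thresholds is purely an assumption (Lighthouse itself uses log-normal scoring curves), so this will not reproduce the scores reported here.

```typescript
const PERFORMANCE_WEIGHTS = { TTFF: 0.25, TTI: 0.45, FID: 0.30 } as const;

// Thresholds in milliseconds, from the metric definitions earlier in the article.
const THRESHOLDS_MS = {
  TTFF: { good: 300, poor: 800 },
  TTI: { good: 500, poor: 1500 },
  FID: { good: 50, poor: 150 },
} as const;

type Metric = keyof typeof PERFORMANCE_WEIGHTS;

// Assumed curve: 100 at or below "good", 0 at or above "poor", linear in between.
function metricScore(metric: Metric, valueMs: number): number {
  const { good, poor } = THRESHOLDS_MS[metric];
  const fraction = (poor - valueMs) / (poor - good);
  return 100 * Math.min(1, Math.max(0, fraction));
}

// Weighted sum of the per-metric scores, rounded to a whole number.
function compositeScore(values: Record<Metric, number>): number {
  let total = 0;
  for (const metric of Object.keys(PERFORMANCE_WEIGHTS) as Metric[]) {
    total += PERFORMANCE_WEIGHTS[metric] * metricScore(metric, values[metric]);
  }
  return Math.round(total);
}
```

Because TTI carries 45% of the weight, a screen that paints quickly but becomes interactive slowly loses more points than one with the opposite profile.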

ViewJob Performance:
  • Average score: 81 (Good)
  • P75 score: 95 (Excellent)

These scores give us a single number to track over time, making it easy to spot regressions and measure improvements.

What We Learned

1. Native Apps Should Be Faster

Our initial thresholds were too lenient — we started with web-based Core Web Vitals and quickly realized native apps should perform better. The absence of network latency and parsing overhead means users rightfully expect faster experiences.

2. Components Know Best

Letting components self-report interactivity (markInteractive()) proved more accurate than algorithmic detection. Components understand their own loading states, data dependencies, and UI readiness in ways that external observers cannot.

3. Complete Profiles Matter

Waiting to log all metrics together (rather than logging each individually) made analysis significantly easier. It’s much simpler to query for “sessions with TTI > 500ms” than to join three separate metric events.
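
To show why a single combined event simplifies analysis, here is a small sketch of the kind of query described above, written against an in-memory array. The `sessionId` field and both helper functions are hypothetical; a real pipeline would more likely run the equivalent filter in a log or analytics store.

```typescript
interface LoggedPerformanceEvent {
  sessionId: string; // hypothetical identifier, not a field named in the article
  TTFF_ms: number;
  TTI_ms: number;
  FID_ms: number;
}

// "Sessions with TTI > 500ms" is a single filter over one event stream.
function slowToInteract(
  events: LoggedPerformanceEvent[],
  ttiLimitMs = 500,
): LoggedPerformanceEvent[] {
  return events.filter((e) => e.TTI_ms > ttiLimitMs);
}

// Cross-metric questions need no join, because each event carries every metric.
function slowAndUnresponsive(
  events: LoggedPerformanceEvent[],
  ttiLimitMs = 500,
  fidLimitMs = 50,
): LoggedPerformanceEvent[] {
  return events.filter((e) => e.TTI_ms > ttiLimitMs && e.FID_ms > fidLimitMs);
}
```

With three separate metric events, the second query would first require matching events by session before any comparison could be made.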

Looking Forward

This measurement system is now our foundation for mobile performance at Indeed. We’re expanding it beyond ViewJob to SERP, Homepage, and other React Native surfaces. Each integration gives us more data, more insights, and more confidence that we’re maintaining the performance standards Indeed is known for.

But measurement is just the beginning. The real value comes from what we do with the data:

  • Automated alerts when performance degrades
  • Performance budgets enforced in CI/CD
  • A/B testing to validate that optimizations actually improve user experience
  • Correlation analysis between performance and business metrics

We’re no longer flying blind in the mobile world. We have the metrics, the thresholds, and the tooling to ensure that as Indeed becomes app-first, we remain performance-first.

Get Involved

At Indeed we’ve open sourced this repository because we think it will help other organizations better measure their app performance, especially for companies similar to Indeed who are transitioning from a web-first to an app-first approach. To contribute, please see the details in our contribution guidelines: CONTRIBUTING.md.