Building a Large-Scale Machine Learning Pipeline for Job Recommendations

(Editor’s note: This post was originally published on oreilly.com.)

With 200 million unique visitors every month, Indeed relies on a recommendation engine that processes billions of input signals every day — resulting in more external online hires than any other source.

To create this recommendation engine, we started with a minimum viable product (MVP) built with Apache Mahout and evolved to a hybrid, offline + online pipeline. Along the way, our changes affected product metrics, and we addressed challenges with incremental modifications to algorithms, system architecture, and model format. The lessons we learned in designing this system can apply to any high-traffic machine learning application.

From search engine to recommendation

Indeed’s production applications run in many data centers around the world. Clickstream data and other application events from every data center are replicated into a central HDFS repository in our Austin data center. We compute analytics and build our machine learning models from this repository.

Our job search engine is simple and intuitive, with two inputs: keywords and location. The search results page displays a list of matching jobs, ranked by relevance. The search engine is the primary way our users find jobs on Indeed.

Our decision to go beyond search and add job recommendations as a new mode of interaction was driven by several factors:

  • 25% of all searches on Indeed specify only a location and no search keywords. Many job seekers aren’t sure what keywords to use in their search.
  • When we deliver targeted recommendations, the job seeker’s experience is personalized.
  • Recommendations can help even the most sophisticated user discover additional jobs that their searches would not have uncovered.
  • With recommendations driving 35% of Amazon sales and 75% of Netflix content views, it’s clear they provide added value.

Recommending jobs is significantly different from recommending products or movies. Here are just a few of the things we took into careful consideration as we built our engine:

Rapid Inventory Growth. We aggregate millions of new jobs on Indeed every day. The set of recommendable jobs is changing constantly.

New Users. Millions of new job seekers visit Indeed every day and begin their job search. We want to be able to generate recommendations with very limited user data.

Churn. The average lifespan of a job on Indeed is around 30 days. Content freshness matters a lot, because the older the job, the more likely it is to have been filled.

Limited Supply. One job posting is usually meant to hire one individual. This is different from books or movies, which can be recommended to many users at the same time as long as there is inventory. If we over-recommend a job, we could bombard an employer with thousands of applications.

How to approach recommendation algorithms

Recommendations are a matching problem. Given a set of users and a set of items, we want to match users to their preferred items. There are two high-level approaches to this: content-based and behavior-based. They each have pros and cons, and there are also ways to combine these approaches to take advantage of both techniques.

Content-based approaches use data, such as user preferences and features of the items being recommended, to determine the best matches. For recommending jobs, using keywords of the job description to match keywords in a user’s resume is one content-based approach (note that users can upload their resume to the Indeed site). Using keywords in a job to find other similar jobs is another way to implement content-based recommendations.
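As a toy illustration of the content-based idea (the tokenizer and scoring below are simplified assumptions, not our production matcher), jobs could be ranked by how many resume keywords appear in each job description:

import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into simple word tokens.
    return re.findall(r"[a-z]+", text.lower())

def keyword_overlap_score(resume_text, job_description):
    # Count how often resume keywords appear in the job description.
    resume_terms = set(tokenize(resume_text))
    job_terms = Counter(tokenize(job_description))
    return sum(count for term, count in job_terms.items() if term in resume_terms)

def rank_jobs_for_resume(resume_text, jobs):
    # jobs: list of dicts with a "description" field; highest overlap first.
    return sorted(jobs,
                  key=lambda job: keyword_overlap_score(resume_text, job["description"]),
                  reverse=True)

A real matcher would weight terms rather than count raw occurrences, but the shape of the approach is the same.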

Behavior-based approaches leverage user behavior to generate recommendations. These approaches are domain-agnostic, meaning the same algorithms that work on music or movies can be applied to the jobs domain. Behavior-based approaches do suffer from a cold start problem. If you have little user activity, it is much harder to generate good quality recommendations.

Mahout collaborative filtering

We started by building a behavior-based recommendation engine because we wanted to leverage our existing job seeker traffic and click activity. Collaborative filtering algorithms are well understood in this space.

Our first attempt at personalized recommendations was based on Apache Mahout’s user-to-user collaborative filtering implementation. We fed clickstream data into a Mahout builder that ran in our Hadoop cluster, and produced a map of users to recommended jobs. We built a new service to provide access to this model at runtime, and multiple client applications accessed this service to recommend jobs.

MVP results and roadblocks

As an MVP, the behavior-based recommendation engine showed us that it is important to start small and iterate. Building this system quickly and getting it in front of users demonstrated that these recommendations were useful to job seekers. However, we ran into several immediate problems using Mahout on our traffic:

  • The builder took around 18 hours on Indeed’s 2013 clickstream, which was about one-third the size of today’s.
  • We could only run the builder once a day, which meant that millions of new users joining Indeed every day wouldn’t see recommendations until the next day.
  • Millions of new jobs aggregated on Indeed were not visible as recommendations until the builder ran again.
  • The model we produced was a large map of around 50GB that took several hours to copy over a WAN from the data center where it was built to our data centers around the globe.
  • Mahout’s implementation exposed only a few tunable parameters, such as similarity thresholds. We could adjust those parameters, but we wanted the flexibility to test entirely different algorithms.

Implementing MinHash for recommendations

We addressed the most important problem first: the builder was too slow. We found that user-to-user similarity in Mahout is implemented by comparing every user to every other user, which takes O(n²) time. For Indeed’s U.S. traffic alone (50 million unique visitors), that amounts to roughly 2.5 × 10¹⁵ comparisons, which is intractable. The calculation is also batch in nature: adding a new user or a new click event requires recalculating all similarities.
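To make the scaling problem concrete, here is a minimal sketch of naive all-pairs user similarity (simplified Python, not Mahout’s actual implementation). Every user’s set of clicked jobs is compared against every other user’s, so the work grows quadratically with the number of users:

from itertools import combinations

def jaccard(a, b):
    # Jaccard similarity of two sets of clicked job IDs.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def all_pairs_similarity(user_clicks):
    # Naive approach: compare every user to every other user.
    # user_clicks maps user_id -> set of clicked job IDs.
    # With tens of millions of users this is on the order of 10^15 comparisons.
    similarities = {}
    for (u, clicks_u), (v, clicks_v) in combinations(user_clicks.items(), 2):
        similarities[(u, v)] = jaccard(clicks_u, clicks_v)
    return similarities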

We realized that recommendations were an inexact problem. We were looking for ways to find the closest users to a given user, but we didn’t need 100% accuracy. We looked for ways to approximate similarity without having to calculate it exactly.

Principal contributor Dave Griffith came across MinHash in an academic paper about Google News personalization. MinHash, or min-wise independent permutations, lets us approximate Jaccard similarity. Applying this measure to the jobs that two users clicked on at Indeed, the more jobs those two users have in common, the higher their Jaccard similarity. Calculating Jaccard similarity exactly for all pairs of users is O(n²); with MinHash, we can reduce this to O(n).

The MinHash of a set of items, given a hash function h, is the minimum of the hash values of all items in the set under h. A single hash function is not sufficient to approximate Jaccard similarity because the variance of the estimate is too high. We have to use a family of hash functions to approximate Jaccard similarity reasonably well. With a family of hash functions, MinHash can be used to implement personalized recommendations with tweakable Jaccard similarity thresholds.
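Here is a minimal, self-contained sketch of the idea (the hash construction and the number of hash functions are illustrative choices, not our production values). Each function in the family maps a user’s clicked jobs to a single minimum value, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets:

import hashlib

def make_hash_family(h):
    # Build h hash functions by salting one base hash with an index.
    def make_hash(seed):
        def hash_fn(item):
            data = f"{seed}:{item}".encode("utf-8")
            return int.from_bytes(hashlib.md5(data).digest()[:8], "big")
        return hash_fn
    return [make_hash(i) for i in range(h)]

def minhash_signature(items, hash_family):
    # The signature is the minimum hash value of the set under each hash function.
    return [min(fn(x) for x in items) for fn in hash_family]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching positions approximates Jaccard similarity.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

family = make_hash_family(128)
user_a = {"job1", "job2", "job3", "job4"}
user_b = {"job2", "job3", "job4", "job5"}
print(estimated_jaccard(minhash_signature(user_a, family),
                        minhash_signature(user_b, family)))  # ~0.6; true Jaccard is 3/5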

Mining Massive Datasets, a recent Coursera course from Stanford professors Leskovec, Rajaraman, and Ullman, explains MinHash in great detail. Chapter 3 (PDF) of their book, “Mining Massive Datasets,” explains the mathematical proof behind MinHash. 

Our implementation of MinHash for recommendations involved the following three phases:

Phase 1: Signature calculation/cluster assignment

Every job seeker is mapped to a set of h clusters, corresponding to a family of hash functions H. The following pseudocode shows this: 

H = {h1, h2, ..., h20}
for user in Users:
    for hash in H:
        minhash[user][hash] = min{ hash(x) : x ∈ Items(user) }

where H is a family of h hash functions (here, h = 20). At the end of this step, each user is represented by a signature of h MinHash values, and each value identifies one of the h clusters that user belongs to.

Phase 2: Cluster expansion

Users that share a signature value are assigned to the same cluster and considered similar, so their jobs are cross-recommended to each other. We expand each cluster with all the jobs from every user in that cluster.

Phase 3: Recommendation generation

To generate recommendations for a given user, we union all jobs from the h clusters that the user is in. We remove any jobs that this user has already visited to get the final set of recommended jobs.
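Putting the three phases together, a simplified sketch might look like the following (the data structures are illustrative, and it assumes a hash family like the one sketched earlier):

from collections import defaultdict

def build_clusters(user_clicks, hash_family):
    # Phases 1 and 2: compute each user's MinHash signature, treat each
    # (hash index, MinHash value) pair as a cluster ID, and expand each
    # cluster with all jobs clicked by its members.
    user_clusters = {}               # user_id -> list of cluster IDs
    cluster_jobs = defaultdict(set)  # cluster ID -> all jobs from users in that cluster
    for user, jobs in user_clicks.items():
        signature = [(i, min(fn(x) for x in jobs)) for i, fn in enumerate(hash_family)]
        user_clusters[user] = signature
        for cluster_id in signature:
            cluster_jobs[cluster_id].update(jobs)
    return user_clusters, cluster_jobs

def recommend(user, user_clicks, user_clusters, cluster_jobs):
    # Phase 3: union the jobs from the user's clusters, minus jobs already visited.
    candidates = set()
    for cluster_id in user_clusters[user]:
        candidates |= cluster_jobs[cluster_id]
    return candidates - user_clicks[user]

Identifying a cluster by the pair (hash index, MinHash value) is just one convenient choice; it keeps clusters produced by different hash functions from colliding.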

Recommending jobs to new users

MinHash’s mathematical properties allow us to address recommending jobs to new users and recommending new jobs to all users. We update the MinHash signature for users incrementally as new clicks come in. We also maintain a map in memory of new jobs and their MinHash clusters. By keeping these two pieces of data in memory, we are able to recommend jobs to new users after they click on a few jobs. As soon as any new jobs posted throughout the day receive clicks, they are recommended to users.
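The incremental update works because adding an item to a set can only lower, never raise, each MinHash value. A sketch of the per-click update (the in-memory representation is an assumption for illustration):

def update_signature_on_click(signature, job_id, hash_family):
    # Fold one newly clicked job into an existing MinHash signature.
    # min(S ∪ {x}) == min(min(S), x), so each position needs only one comparison.
    # Returns the positions whose cluster assignment changed.
    changed = []
    for i, fn in enumerate(hash_family):
        value = fn(job_id)
        if value < signature[i]:
            signature[i] = value
            changed.append(i)
    return changed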

After transitioning to MinHash, we had a hybrid recommendation model: an offline component, built daily in Hadoop, and an online component, implemented in memcache, that covers the current day’s click activity. Both models are combined to compute the final set of recommendations for each user. The recommendations became more dynamic after this change, because they update as users click on jobs that interest them.
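At request time, the two components can be merged along these lines (the store layout and merge rule below are simplified assumptions, not our exact production logic):

def hybrid_recommendations(user_id, offline_recs, online_job_clusters, user_clusters, visited):
    # offline_recs:        user_id -> jobs from the daily Hadoop build
    # online_job_clusters: cluster ID -> new jobs seen today, keyed by their clusters
    # user_clusters:       user_id -> the user's current cluster IDs, updated as clicks arrive
    # visited:             user_id -> jobs the user has already clicked
    recommendations = set(offline_recs.get(user_id, set()))
    for cluster_id in user_clusters.get(user_id, []):
        recommendations |= online_job_clusters.get(cluster_id, set())
    return recommendations - visited.get(user_id, set())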

With these changes, we learned that we could trade off a little bit of accuracy for a lot of performance. We also learned to complement a slower offline model with an online model for fresher results.

Engineering infrastructure improvements

The recommendation model that contained a map from each user to their recommendations was a large monolithic file. Because jobs are local to each country, we first attempted to shard our data into zones based on approximate geographic boundaries. Instead of running one builder for the entire world, we ran one builder per zone. Each zone consisted of multiple countries. As an example, the East Asian zone contained recommendations for China, Japan, Korea, Hong Kong, Taiwan, and India.

Even after sharding, some of our zones produced data files that were too big and took hours to copy from our Austin Hadoop cluster over a WAN to a remote data center in Europe. To address this, we decided to ship recommendation data incrementally rather than once per day. To implement this, we reused sequential write-ahead logs and log-structured merge trees, an approach already validated in other large production applications at Indeed, such as our document service.

Instead of producing one large model file, we modified our builder to write small segments of recommendation data. Each segment file is written using sequential I/O and optimized for fast replication. These segments are reassembled into a log-structured merge tree by the recommendation services running in remote data centers.
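As a rough illustration of the segment idea (the file format and merge policy here are simplified assumptions, not the actual implementation we share with our document service), the builder writes small, sorted segment files that can ship independently, and the serving side answers each lookup from the newest segment that contains the user:

import json

def write_segment(path, records):
    # Write one sorted segment of user -> recommendations, one JSON record per line.
    with open(path, "w") as f:
        for user_id in sorted(records):
            f.write(json.dumps({"user": user_id, "recs": sorted(records[user_id])}) + "\n")

class SegmentedRecommendationStore:
    # Read path over segments added in chronological order, consulted newest first.
    def __init__(self):
        self.segments = []  # newest segment at index 0

    def add_segment(self, path):
        segment = {}
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                segment[record["user"]] = record["recs"]
        self.segments.insert(0, segment)

    def lookup(self, user_id):
        # Return the newest recommendations available for a user.
        for segment in self.segments:
            if user_id in segment:
                return segment[user_id]
        return []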

With this infrastructure change, users in remote data centers saw their new recommendations hours sooner. In our A/B test of the change, we saw a 30% increase in clicks because users received newer recommendations faster.

This improvement demonstrated that engineering infrastructure improvements can make as much of an impact on metrics as algorithm improvements.

A/B testing velocity

Building out the pipeline to compute and update recommendations was only the beginning. To improve the coverage and quality of recommendations, we needed to increase our A/B testing velocity.

We were making many decisions in the builder to tune the final set of recommendations. These decisions included similarity thresholds, the number of jobs to include in an individual’s recommendations, and different ways to filter out poor quality recommendations. We wanted to tweak and optimize every aspect of computing recommendations, but to do so would require building and shipping a new model per algorithm tweak. Testing multiple improvement ideas meant many times more disk and memory usage on the servers that handled requests from users.

We began to improve our A/B testing velocity by shipping the individual components of the recommendation calculation rather than the final results. We changed the recommendation service to perform the final calculation by combining these pieces, instead of simply reading the model and returning results. The critical subcomponents of recommendations are the cluster assignments for each user, the mapping from each cluster to the jobs in that cluster, and a per-user blacklist of jobs that should not be recommended to that user. We modified our builder to produce these components and modified our service to put them together at request time into the final list of recommendations.

By implementing this architectural change, we only shipped subcomponents that changed per A/B test. For example, if the test only tweaked what jobs got removed from a user’s recommendations, we would only ship the blacklist for the test group.
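Conceptually, the request-time assembly looks something like this sketch (component and parameter names are illustrative); only the subcomponent that differs between test groups has to vary:

def assemble_recommendations(user_id, cluster_assignments, cluster_to_jobs, blacklists, test_group):
    # cluster_assignments: user_id -> cluster IDs            (shipped per build)
    # cluster_to_jobs:     cluster ID -> jobs in the cluster (shipped per build)
    # blacklists:          test group -> user_id -> jobs to exclude; for a
    #                      blacklist-only A/B test, only this component is reshipped.
    candidates = set()
    for cluster_id in cluster_assignments.get(user_id, []):
        candidates |= cluster_to_jobs.get(cluster_id, set())
    excluded = blacklists.get(test_group, {}).get(user_id, set())
    return candidates - excluded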

This change improved A/B testing velocity by orders of magnitude. We were able to test and validate several ideas that improved the quality and coverage of recommendations in a short period of time. Previously, we averaged testing one improvement in the model every quarter because of the overhead in setting up our experiments.

Our experience shows that A/B testing velocity should be considered when designing large machine learning systems. The faster you can get your ideas in front of users, the faster you can iterate on them.

This post summarizes a series of algorithmic and architectural changes we made as we built our recommendation engine. We build software iteratively at Indeed — we start with an MVP, learn from it, and improve it over time. As a result of these changes, job recommendations grew from a small MVP to contributing 14% of all clicks on Indeed, which is up from 3% in early 2014.

Architecture diagram for Indeed’s recommendation engine

Conclusion

Moving forward, we continue to refine our recommendation engine. We are prototyping a model using Apache Spark. We are building an ensemble of models, and we are refining our optimization criteria to combat popularity bias.


A Bounty of Security

“Do what’s best for the job seeker.” This has been Indeed’s guiding principle since the beginning. One way we put the job seeker first is by keeping their information safe and secure. We always consider the security of our systems as we develop the services that millions of people use every day. But someone will outsmart us. Hackers are always trying out new ways of bypassing security and gaining access to systems and information. Our challenge: to bring these security experts over to our side and benefit from their findings.

Image by stockarch – stockarch.com (Licensed by Creative Commons)

Our answer to this challenge is, well, money. Actually, money and fame. Indeed offers security testers a legitimate route for reporting their findings, and we reward them for their time with cold, hard cash and recognition. Through our bug bounty program, we have rewarded over 300 submissions in the past year and a half, with payouts as high as $5,000 for the most severe bugs. Our most successful participants (looking at you, Angrylogic, Avlidienbrunn, and Mongo) have earned cash while building their reputations as highly regarded testers for Indeed.

 

Reward amounts per submission in the last 18 months

Criticality    Reward amount    Relative submission count
CRITICAL       Up to $5,000     0.7%
HIGH           Up to $1,800     4%
MEDIUM         Up to $600       31%
LOW            Up to $100       64%

Why create this program?

Prior to our bug bounty program, we occasionally received messages that sounded like blackmail. An anonymous person would contact us, insisting that we pay them, or they would publicly release the details of an unspecified, but totally serious, security bug. These individuals expected payment up front, with no guarantee that they even had a bug to expose. While we’re happy to compensate researchers for helping us improve our services, we didn’t want to encourage this coercive behavior. It felt wrong.

To resolve the mutual distrust, we started using Bugcrowd.com as an impartial arbiter. On Bugcrowd, security researchers are more willing to provide evidence up front, giving us the chance to fairly assess a bug’s severity. Indeed can now provide rewards without the coercion, and everyone lives happily ever after…

Theory vs practice

“Happily ever after…” is more difficult in practice. Since the program started, we have received almost 2,500 submissions, and each issue can take hours to validate. Every time we advertise our bounty program or raise our payouts, we see a large spike in submissions. To an outsider, it might look like we’re dragging our feet, but in reality, it’s all hands on deck to reply to these submissions. This blog post alone will generate several more hours’ worth of bug validation thanks to the increased visibility of the program.

We initially struggled to quickly respond to testers’ submissions, creating a backlog. This backlog grew because we received more submissions than we had time to process. We ended up doubling down on our efforts over a painful couple of weeks and then implementing a new standard for response time. Since then, response times have been under control.

Sum of open Ticket Days over time

Note: Ticket Days is the sum, over every ticket still open on a particular date, of the number of days each ticket has been open. For example, on a given date, one ticket open for 3 days + one ticket open for 2 days = 5 Ticket Days.
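For the curious, a tiny sketch of how that metric can be computed from ticket open and close dates (the field names are illustrative):

from datetime import date

def open_ticket_days(tickets, as_of):
    # Sum of days each still-open ticket has been open as of a given date.
    # tickets: list of dicts with an 'opened' date and an optional 'closed' date.
    total = 0
    for ticket in tickets:
        if ticket.get("closed") is None or ticket["closed"] > as_of:
            total += (as_of - ticket["opened"]).days
    return total

# One ticket open 3 days + one ticket open 2 days = 5 Ticket Days.
tickets = [{"opened": date(2015, 6, 1)}, {"opened": date(2015, 6, 2)}]
print(open_ticket_days(tickets, date(2015, 6, 4)))  # 5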

Communicating clearly with the researchers is also important, so that they don’t think we are trying to take advantage of them. We keep in mind that they don’t have as much visibility into the process as we do. One common issue is handling duplicates. Paying for an issue the first time we hear about it makes sense, but how should we handle a duplicate submission from another researcher? The second submission doesn’t add any additional value, but from the tester’s point of view, they found a real bug. Clearly communicating why we are marking a ticket as a duplicate, and quickly fixing identified issues, helps minimize this concern. In some cases, we decide to pay for a duplicate if it has great reproduction steps and a proof of concept.

Finally, we’re working on balancing the time we spend finding new bugs and fixing known bugs. Building and managing a popular bounty program leads to lots of good submissions, but that all falls to pieces if we don’t also spend the time fixing the bugs. At Indeed, the benefits of investing time improving our bug bounty program can’t be overstated.

Our successes so far

It seems we’re doing something right. Bugcrowd recently asked their security researchers which company’s program was their favorite, and you’ll never guess who won!

…Tesla won (we blame those fabulous Teslas). But we took runner up, with 8% of all votes, racing against over 35 other programs. Many of the specific responses for our program referenced our fair payout practices, great communication, and permissive scope. While we know that we can still rev up the experience, we are happy for the validation that we are headed down the right road.


Forget Methodology — Focus on What Matters

At Indeed, we tackle interesting and challenging problems, at scale. We move from idea to implementation as fast as possible. We ship incremental changes, pushing code to production frequently. Our small releases reduce risk and increase quality.

But before we work on any solution, we ask: how will we measure success? This question keeps our solutions focused on what matters — measurable results.

Our approach to software development might be called “measure-learn-evolve.” Our teams employ techniques from various software development methodologies, but no single published methodology rules. We collaborate, we iterate, and we measure. Sometimes we succeed, sometimes we fail, but we are always learning.


We don’t view process implementation and improvement as success. Process is a means to an end. Process doesn’t deliver a successful product. (People do.) Process doesn’t provide talent and passion. (People do.) But the right process and tools can help people do those things and provide predictable mechanisms for:

  • planning what we need to do and setting relative priorities
  • communicating what we are doing or might do
  • remembering what we’ve done
  • managing our risk

We use Atlassian’s JIRA to implement these mechanisms. In JIRA, we propose ideas, define requirements, and plan projects. We document dependencies, track work, and manage releases. We describe experiments and record results. Customizing JIRA to our needs has helped us collaborate on success metrics and maintain our engineering velocity.

It wasn’t always this way. We started simple. We were a startup and we focused on getting stuff done, quickly.

As we grew, we didn’t want to lose this focus on getting things done quickly and with quality. But our ad hoc process was neither repeatable nor predictable. Inconsistencies abounded and we were not creating a memory for the future. So we began to model our development process in JIRA.

Customizing JIRA

We have our own JIRA issue types, workflows, fields, and roles. These customizations allow us to plan, communicate, and deliver our software in the way we want.

Linking custom project types

We use two types of JIRA projects for product development: a “planning project” that corresponds to the product, and an “engineering project” that corresponds to a deployable application or service.

Our planning projects contain Initiative and Experiment issues. We use the Initiative type to capture goals, plans, and success metrics for a product change. We plan product initiatives each quarter, and we iterate on them throughout the quarter. As part of that iteration, we use the Experiment type to describe specific ideas we want to test to optimize our products.

The engineering projects include issues that detail the implementation necessary for the initiatives and experiments. Each deployable application or service has a corresponding engineering project. Issue links connect related issues to one another. JIRA provides multiple types of bi-directional links. The following table gives examples of how we use them.

Link type                          How we use it
incorporates / incorporated by     Product initiatives incorporate engineering project issues.
depends upon / depended on by      Issues can depend upon other issues, for example to model feature development or deploy-order dependencies.
references / referenced by         An issue for a functional regression references the project issue that introduced the bug.

Issue types and workflows

We use JIRA’s standard issue types: Bug, Improvement, New Feature. The workflow for these standard issue types is a slight modification of a typical JIRA workflow:

  1. We create an issue and assign it to a project lead. The issue transitions to a Pending Triage state.
  2. If we can target work to a near-term release, we triage the issue, setting its Fix Version and assigning it to a developer. The issue then moves to Pending Acceptance. We move other issues to On Backlog.
  3. The developer accepts the issue, moving it to Accepted when they make a plan to start work.
  4. When the code is complete, the developer resolves the issue, moving it to Pending Review.
  5. After code review, we transition the issue to Pending Merge.
  6. When we’re ready to create a release candidate, we merge changes into the release branch and deploy to the QA environment, transitioning the issue to Pending Verification.
  7. The QA analyst tests the work and either reopens the issue or verifies it, transitioning it to Pending Closure.
  8. After we verify all issues in a targeted release, we can release the build to production and move all the issues to Closed.

We also use custom issue types to model our process. In a previous post, we described the ProTest issue type (short for Proctor Test). We use this custom issue type to request new Proctor A/B tests or to change test allocations.

We have another custom issue type and associated workflow for localization. As we continue to grow internationally, we need a localization process that doesn’t slow us down. Coordinating with many translators can be a challenge, so we model our translation process in JIRA. Our Explosion issue type incorporates an issue for each target translation language. The workflow follows:

  1. We create an issue with English strings that require translation.
  2. We triage the issue and submit it for review.
  3. When the strings are ready to be translated, an automated step creates one Translation issue for each target language and links them all to the Explosion issue.
  4. Each “exploded” issue follows its own workflow: Accept, Resolve, Verify and Close.
  5. When all Translation issues are closed, we verify and close the Explosion issue.

The Explosion and Translation custom issue types and workflows help streamline a process with many participants. Because we triage by language and feature, translation issues do not block the release of an entire feature. Using JIRA also allows us to integrate machine translation and outside translation services.

Team triage

Many of our development teams use dashboards and agile boards in JIRA for easy access to issues associated with a product. During routine triage meetings, product development teams use these tools to prioritize and distribute development work.

Closing the memory loop

Each code commit in Git is traceable to a corresponding issue in JIRA. Further, if the referenced JIRA links to the initiative, the trail leads all the way to the initiative. This means that an engineer can review any code commit and follow the trail in JIRA to understand all related implementation details, requirements, and business motivation.

Production deploys

Deploying code to production requires clear communication and coordination, and our Deploy issue type helps us track this process. Using JIRA to track deploys results in smooth handoffs and transparency for all stakeholders.

A deploy ticket is associated with each Fix Version and has a unique workflow that facilitates communication for moving artifacts through the build and release process. We use issue links to document all sysadmin tasks necessary for a successful deployment. The deploy ticket has the same fix version as the other issues in the release.

Most teams plan their work weekly but deliver to production as they complete the work. On some regular cadence – semi-weekly, daily, or more often – the release manager creates a release candidate from all open merge requests. We developed an internal webapp that coordinates across Git (branch management), JIRA (code changes and deploys), Crucible (code review), and Jenkins (build). Status changes to the deploy ticket trigger issue reassignments, promoting smooth handoffs.

This approach provides our teams with the information they need to assess and manage risk for their production releases. The QA analyst can better understand potential regressions that a change may cause. The release manager can have a holistic view of what’s changing and quickly react when issues arise. And small releases make bug investigation more straightforward.

Working in the open

JIRA enables effective, efficient collaboration for our software development and deployment process. We use it to clarify requirements, discuss implementation choices, verify changes, and deploy to production.

Across teams and up and down the organization, our use of JIRA provides transparency into the work that is getting done. By working in the open, we can achieve a shared understanding of plans, progress, and challenges for hundreds of active projects and initiatives.

Do what makes sense for you

Methodology and process only help when they provide repeatable and predictable mechanisms for planning, communication, and delivery. JIRA has helped us establish these mechanisms.

Try to avoid taking a methodology “off the shelf” and implementing it. And don’t depend on tools to solve your problems. Instead, think about how your team needs to plan, communicate, and deliver. Then, define the best process and tools that serve your needs. Iterate on your process as needed. And stay focused on what really matters: success.


Adapted from Jack Humphrey’s presentation at Keep Austin Agile 2014.
