Building Resume Instant Search

[Editor’s Note: This post is a companion piece to a recent @IndeedEng talk.]

Our resume instant search feature allows employers to see search results as they type, helping them connect with job seekers efficiently. Google found that its Instant feature saved users 2-5 seconds per search. To avoid showing undesired results for partially-typed queries, we also had to implement an auto-complete feature that predicts a user’s query before they finish typing it. Both of these features must be lightning-fast in order to be effective.

The combination of the instant and auto-complete features allows us to deliver a streamlined search to our users. We know from testing that the faster we deliver results to users, the more time they spend doing what matters — reading resumes and contacting applicants.

Building Instant Search

One of our goals for the architecture of Instant Search was to build a clean, modular JavaScript application. Our use of the Google Closure Tools helped us achieve this goal. Closure includes a dependency manager, a feature-rich JavaScript library, a JavaScript compiler, and templates that compile to both Java and JavaScript. We used these tools to implement an event-driven architecture, resulting in a modular, easy-to-maintain application.

In event-driven models, program flow is based on events such as a mouse click or a key press. A common example in web development is the onclick handler, which allows developers to capture the click event on a DOM element. The element being clicked has no knowledge of the code that runs as a result of the click. It fires an event to signal it has been clicked, and the browser dispatches that event to all components that have registered to receive it.

Instant Search has four components:

  • AutoComplete – offers search query completions based on what the user has typed.
  • Instant – retrieves and renders resume search results.
  • Preview – retrieves and renders a preview of the resume on mouseover.
  • Result Page – the main dispatcher; the only component aware of the other components.

Each component fires events as they occur without knowing anything about the other components on the page. The Result Page component listens for those events and dispatches them to the registered components. For example, when the search completion changes, AutoComplete fires an event, which the Result Page routes to the Instant component so that it can retrieve results.

Here is example code that shows the Result Page component listening for a set of events from the AutoComplete component:

// Attach events for Query AutoComplete
goog.events.listen(this.queryAutocomplete_,
  indeed.AutoComplete.EventType.QUERY_TYPE,
  this.handleQueryChange_, false, this);
goog.events.listen(this.queryAutocomplete_,
  indeed.AutoComplete.EventType.QUERY_CHANGE,
  this.handleQueryChange_, false, this);
goog.events.listen(this.queryAutocomplete_,
  indeed.AutoComplete.EventType.ENTER,
  this.handleQueryEnter_, false, this);
goog.events.listen(this.queryAutocomplete_,
  indeed.AutoComplete.EventType.TAB,
  this.handleQueryTab_, false, this);
goog.events.listen(this.queryAutocomplete_,
  [indeed.AutoComplete.EventType.SELECT,
    indeed.AutoComplete.EventType.NEW_SUGGESTIONS],
  this.handleQuerySelect_, false, this);
goog.events.listen(this.queryAutocomplete_,
  indeed.AutoComplete.EventType.FOCUSIN,
  this.handleQueryFocusIn_, false, this);
goog.events.listen(this.queryAutocomplete_,
  indeed.AutoComplete.EventType.FOCUSOUT,
  this.handleQueryFocusOut_, false, this);

The function handleQueryChange_ passes the text of the query to the Instant component:

/**
 * Handles changes to the query by notifying Instant
 * @param {indeed.events.TextEvent} e Change event.
 * @private
 */
indeed.Search.prototype.handleQueryChange_ = function(e) {
  if (this.instant_) {
    this.instant_.newQuery(e.text);
  }
};

All other components on the page act in a similar manner — Instant fires an event when it has new results, and Preview fires an event when it displays a resume. The architecture of the application ends up looking like this:

Figure 1: Result Page is composed of high-level components, including AutoComplete and Instant, which in turn are composed of lower-level handlers and renderers.

Long Description of Figure 1
The flow chart follows the following structure:
  • Result Page: Controls the whole page
    • AutoComplete: A single autocompleter
      • InputHandler: Handles the autocomplete text box
      • Renderer: Handles the AutoComplete rendering
      • RPCHandler
    • Instant: Handles all instant functionality
      • InputHandler: Handles the instant text box
      • FormHandler: Handles form actions
      • Renderer: Handles instant rendering
      • RPCHandler

Figure 2: How AutoComplete handles events from its sub-components

Long Description of Figure 2

The AutoComplete process is as follows, using “jav” as an example input:

  1. User types “jav” and the InputHandler sends the input to AutoComplete
  2. AutoComplete asks RPCHandler for suggestions for “jav”
  3. RPCHandler sends AutoComplete new suggestions for “jav”
  4. AutoComplete renders suggestions for “jav” using Renderer
  5. The user clicks on the option “java developer,” displayed by AutoComplete
  6. The InputHandler sets input text to “java developer”

This event-driven approach allows us to add new components to the search page without having to modify the components that were already on the page. Since the components have no knowledge of the Result Page, we have been able to package them into a library for use in other projects.

Navigation and Search Engines

If we just used JavaScript-triggered requests to populate search results and did not change the URL, our users would never be able to save or share specific search result pages, and search engines like Google could not crawl these pages for indexing.

In our initial implementation of Instant Search, we updated the URL fragment on every new search. If a user came to http://www.indeed.com/resumes and did a search for “java” in Austin, we updated the URL to http://www.indeed.com/resumes#!q=java&l=Austin. For comparison, before we implemented Instant Search, our search URLs used a more traditional query string approach (still supported): http://www.indeed.com/resumes?q=java&l=Austin.

We used #! instead of simply # in order to work with Google’s Making AJAX Applications Crawlable guidelines, which specify a mapping from #! URL fragments to query string parameters so that crawlers can make requests with information stored in the URL fragment.
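
Concretely, a crawler that encounters http://www.indeed.com/resumes#!q=java&l=Austin requests http://www.indeed.com/resumes?_escaped_fragment_=q=java%26l=Austin instead, and the server returns fully rendered results for that state. Below is a minimal, hypothetical sketch of translating the escaped fragment back into ordinary search parameters on the server; it is illustrative, not our production code.

import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative sketch: turn the state a crawler sends in _escaped_fragment_
 * (e.g. "q=java&l=Austin", already URL-decoded by the servlet container)
 * back into the parameters a normal ?q=java&l=Austin request would carry,
 * so the server can render the same results page.
 */
public final class EscapedFragmentParser {

  public static Map<String, String> parse(String fragment) {
    Map<String, String> params = new LinkedHashMap<String, String>();
    if (fragment == null || fragment.isEmpty()) {
      return params;
    }
    for (String pair : fragment.split("&")) {
      int eq = pair.indexOf('=');
      if (eq > 0) {
        params.put(pair.substring(0, eq), pair.substring(eq + 1));
      }
    }
    return params; // e.g. {q=java, l=Austin}, handled like a normal search
  }
}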

A drawback of this approach is initial page load performance. Since the data in URL fragments is not sent to the server, the browser has to do a full round-trip (request/response) to the server before processing the fragment, then an additional round-trip to get the search results. In contrast, using the traditional query string approach allows results to come back in a single round-trip.

To improve performance for users coming directly to a search results page from an external site, we used the HTML5 history API supported by most modern browsers. This API lets JavaScript update the URL path and query string, not just the URL fragment. We now use it to update the browser URL to include the search terms in the path (e.g. http://www.indeed.com/resumes/java/in-Austin). These URLs can be shared and published widely, and users clicking on the resulting links receive search results in a single round-trip.

Templates and Rendering

Dynamically updating search results in the browser with JavaScript provides a superior user experience, but we also need to be able to render identical results on the server. We could have developed two versions of our templates, one for the server side and one for the client side, but that approach leads to maintainability challenges: it would be tedious and error-prone to keep two versions, in two different template languages, in perfect sync. We could have generated HTML only on the server side and sent it back in response to the Instant request, but that results in larger responses and adds server load. Instead, we opted to use the same templates on both the client and the server. Technologies like Node.js, Mustache, and Google Closure Templates are making this technique viable and more common.

We chose Google Closure Templates to enable a single template that works both on the server (Java-based) and the client (JavaScript-based). Here is an example template:

/**
 * Renders a single block of refinements on the result page
 * @param title Title of the refinement
 * @param refinements List of refinements
 *
 * @private
 */
{template .refinementBlock}
  {if length($refinements) > 0}
    <p class="refine_title">{$title}</p>
    <ul class="refine_by">
    {foreach $refinement in $refinements}
      <li>
        <a class="instl" href="?{$refinement['addParams']}">
          {$refinement['text']}
        </a>
        <span class="refine_count">{$refinement['count']}</span>
      </li>
    {/foreach}
    </ul>
  {/if}
{/template}

The Closure template compiler generates JavaScript that integrates well into the client-side rendering of our search result pages. We also use the templates in our server-side Java code, delivering results quickly for initial requests and enabling subsequent searches for users and crawlers that are not using JavaScript.
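
As a rough sketch of the server side, here is how a compiled template like the one above can be rendered from Java with the Closure Templates (Soy) library. The .soy file name, the template namespace, and the data values are assumptions for illustration, and the exact Soy API has varied across library versions.

import com.google.template.soy.SoyFileSet;
import com.google.template.soy.data.SoyListData;
import com.google.template.soy.data.SoyMapData;
import com.google.template.soy.tofu.SoyTofu;

import java.io.File;

public final class RefinementRenderer {

  public static void main(String[] args) {
    // Compile the same .soy source that the JavaScript compiler consumes.
    // The file name and the "indeed.search" namespace are illustrative.
    SoyFileSet sfs = SoyFileSet.builder().add(new File("refinements.soy")).build();
    SoyTofu tofu = sfs.compileToTofu();

    SoyListData refinements = new SoyListData(
        new SoyMapData("text", "Amazon", "count", 120, "addParams", "rb=Amazon"),
        new SoyMapData("text", "Dell", "count", 85, "addParams", "rb=Dell"));

    // The real refinementBlock template is private and is normally called from
    // an enclosing page template; it is rendered directly here for illustration.
    String html = tofu.newRenderer("indeed.search.refinementBlock")
        .setData(new SoyMapData("title", "Company", "refinements", refinements))
        .render();

    System.out.println(html);
  }
}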

No Compromises

Our event-driven architecture has allowed us to build a maintainable library of components that can be used in multiple projects. We have used URL fragments and the HTML5 history API to keep navigation, link sharing, and search engine discovery simple. Closure Templates have helped us avoid unnecessary complexity and duplication. We have built a great resume search experience for our users without compromising on maintainability, reuse, and performance.

At Indeed, we focus on making the job seeker and employer experiences as fast and simple as possible. If these kinds of challenges sound interesting to you, check out our open positions.

Indeed Coding Duel – UIUC vs UT Austin

On February 19, we hosted the annual Indeed Coding Duel between UT Austin and UIUC. Sixty-six contestants, mostly Computer Science and Computer Engineering undergrads, battled it out online for four hours. As in previous years, the challenges were developed by Indeed software engineers, including recent UT and UIUC graduates as well as past contestants in the renowned ACM-ICPC World Finals.

We presented the contestants with seven logic and mathematical problems, and challenged them to solve them with code. The winning programmers solved the most problems in the least amount of time. Three of the top five contestants were from UT, while UIUC was the overall winner with five of the top nine entrants. On top of the bragging rights, the highest-scoring contestants from each school took home cool prizes including Apple iPads, Amazon Kindle Fires, and gift cards.

Room filled with coders.


From 1 to 1 Billion: Evolution of a Document Serving System

[Editor’s note: This post is part 1 of a two-part companion piece to our first @IndeedEng talk. Slides and video are available.]

Indeed.com launched in November of 2004 in the US. Today, Indeed is in more than 50 countries and has 100 million unique visitors performing over 3 billion job searches each month. We have more traffic than any other job site in the world.

If we had tried to support this scale on day one, we might never have launched, since building a job search system to support our eventual level of traffic would have required time and knowledge we didn’t have. Likewise, if we hadn’t evolved our systems as growth demanded more of us, we wouldn’t be where we are today. The journey to 100 million unique visitors took us 8 years.

We believe in building products that are fast, simple, comprehensive, and relevant. We keep fast and simple in mind when building solutions, too — we put new features in front of our users and iterate quickly to continuously improve our products. This post provides a brief history of the iterations we made as we scaled our system to billions of searches per month.

2004: The simplest solution that works

To get the first version of job search implemented back in 2004, we turned to Lucene, a Java-based open source project that provides indexing and search. Once you’ve built an index of your documents — jobs, in our case — Lucene can efficiently and quickly turn a search query into a set of matching document IDs. Our search webapp used the Lucene index to get the document IDs that matched a user’s search, but we also needed the corresponding job data to display. Turning those IDs into displayable job data is what we call “document serving.”

The watchwords for document serving are comprehensive and fast. We are constantly aggregating job postings from all over the world, and we need to be able to serve data for all jobs for all time. We need to make new jobs and job updates available on the site as quickly as possible, and we need job document serving to be extremely fast.

We show 10 jobs on a search results page. For each of these results, we display attributes like title, company, location, job age, and a snippet based on the job description.

Figure 1: sample search result highlighting document attributes (title, company, location, snippet, age)

For our first iteration, we kept the job document data in the Lucene index using stored fields. A commonly-used feature of Lucene, stored fields allow you to persist the document data with the index while it is being built. The data flow of this first iteration is shown in Figure 2.

The index builder ran every few minutes, read jobs from the database, and built the Lucene index. This index was then copied to the web servers with rsync and used by the search webapp to both search and serve the documents.

Figure 2: serving documents from stored fields
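
To make the stored fields approach concrete, here is a compact sketch using a recent Lucene API (our 2004 code used a much earlier version, and the field names are illustrative). The display data is written into the index alongside the searchable fields, so a search returns both the matching documents and the attributes needed to render them.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public final class StoredFieldsSketch {

  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory(); // in-memory index for the sketch

    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document job = new Document();
      job.add(new StringField("jobId", "12345", Field.Store.YES));
      // Searchable and stored: the display data travels with the index.
      job.add(new TextField("title", "Java Developer", Field.Store.YES));
      job.add(new StoredField("company", "Example Corp"));
      job.add(new StoredField("location", "Austin, TX"));
      writer.addDocument(job);
    }

    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query query = new QueryParser("title", new StandardAnalyzer()).parse("java");
      for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
        Document found = searcher.doc(hit.doc); // stored fields come back with the hit
        System.out.println(found.get("jobId") + ": " + found.get("title")
            + " at " + found.get("company") + " (" + found.get("location") + ")");
      }
    }
  }
}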

In our first month we aggregated 2 million jobs, and on our busiest day we served 160,000 job documents, with peaks of 7 documents/second. Our system worked; search was fast, and job results were fresh.

2005: Size matters

One year later, in November 2005, we were aggregating almost 4 million new jobs per month. We were serving up to 5 million documents per day with peaks of 100 documents/second. We had nearly doubled the number of searchable jobs, causing the index to grow to between 6GB and 7GB. The stored fields accounted for most of that file size. Lucene is extremely fast when the entire index can fit in the disk cache, but our production servers only had 4GB of RAM. To execute a search, the server had to page in data from disk, resulting in slower job search performance.

The index determines which jobs we show on a search result page, so we don’t want to keep jobs in it that are old or that have been removed from the original posting source. Even if we no longer show a particular job on a search result page, a user might come back to its details page at any time from a bookmark or email. Therefore, even when a job is not in our index, we still need to be able to serve it. Since it would not have been feasible to keep all jobs in our Lucene index forever, we had to take another approach.

The evolution begins

We needed to revise our architecture in order to optimize search performance, which is a key part of the job seeker experience. We decided to serve job data directly from a MySQL database that was already the primary storage location of job data. MySQL provides options for aggressive caching, and the main job data table already had a primary key index on the job ID, making lookup fast. The database was hosted on a separate, more powerful machine. Querying it directly, instead of requiring all that data to be available on the machine hosting the search webapp, would allow us to reduce resource utilization on those web servers.

Retrieving jobs directly from the database allowed us to quit using stored fields, drastically reducing the size of the Lucene index. We still indexed the same jobs, and the search webapp still used that index for searching. The difference was that now Lucene would only return the matching job IDs, and the search webapp would retrieve the data for those jobs from the database (see Figure 3). This change was a success. The index was smaller, document lookups were faster, and job search performance improved.

Figure 3: serving documents from MySQL database
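
The revised read path is simple to sketch: Lucene returns only matching job IDs, and the webapp fetches display data for those IDs from MySQL in a single primary-key lookup. The table and column names below are illustrative, not our actual schema.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

/** Illustrative sketch: fetch display data for the job IDs Lucene returned. */
public final class JobLookup {

  public static List<String> fetchDisplayRows(Connection db, List<Long> jobIds) throws Exception {
    List<String> rows = new ArrayList<String>();
    if (jobIds.isEmpty()) {
      return rows;
    }

    // Build "?, ?, ?" for the IN clause, one placeholder per matching job ID.
    StringJoiner placeholders = new StringJoiner(", ");
    for (int i = 0; i < jobIds.size(); i++) {
      placeholders.add("?");
    }
    String sql = "SELECT id, title, company, location FROM jobs WHERE id IN (" + placeholders + ")";

    try (PreparedStatement stmt = db.prepareStatement(sql)) {
      for (int i = 0; i < jobIds.size(); i++) {
        stmt.setLong(i + 1, jobIds.get(i));
      }
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          rows.add(rs.getString("title") + " - " + rs.getString("company")
              + " (" + rs.getString("location") + ")");
        }
      }
    }
    return rows;
  }
}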

2006: Too much contention

We used the direct database access approach for over a year. In November of 2006, we were aggregating about 5.2 million new jobs each month. We were serving up to 21 million documents per day, with peaks of 500 documents/second, a 5x increase from the previous year. Though our average job search time measurements were still quite fast, we noticed slower performance at certain times of day.

The cause of this slowdown was write contention on the main jobs table, which used the MyISAM storage engine. Writes from our aggregation processes were locking out reads from search, because MyISAM uses table-level locking. Conversely, reads from the search side were blocking those writes. Executing user searches and getting new job data into our database were getting in each other’s way.

One option was to switch to MySQL’s InnoDB storage engine, which uses row-level locking, but that migration would have meant increased hardware costs and far-reaching code changes. We eventually did migrate, but we needed a solution we could deploy sooner.

Another solution we considered was to replicate the primary database and serve job documents from a replica. This approach works for systems with a read-heavy workload. We had too many writes happening for that to help us. Since MySQL replication is single-threaded, it would fall behind the master as the large number of reads from the search webapp locked out the replication stream writes from that single writer thread. If our search webapp was reading from a replica and replication fell behind, the index might return job IDs to the webapp that corresponded to jobs not yet in the replica.

When the database can’t keep up, cache it!

To address these problems, we chose a classic approach: application-level caching. We used Memcached, an open source, high performance, in-memory object caching system.

At this same time, we were evolving other aspects of our search webapp. We started transitioning from a single webapp that did everything to a service-oriented architecture. One of the first services we carved out of the search webapp was docservice, which handled the work of document serving. The search webapp made a request to the docservice for job document data. Multiple search webapp instances could be supported by a single docservice, as shown in Figure 4. For the first iteration, the search webapp communicated with the docservice over HTTP. We later changed the implementation of that communication to use our own efficient service infrastructure, called boxcar.

Figure 4: serving documents from memcache

To serve a document request, the docservice first looked in memcached and only queried the database if the job was not cached. A memcached lookup took between 1 and 5 ms, while a database read could take 10 ms or longer, depending on lock contention. To maximize the cache hit rate and avoid database access, each docservice had a background thread that watched for new job data and loaded those jobs into the cache.
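
The lookup logic is the classic cache-aside pattern. Here is a minimal sketch of that read path; JobCache and JobDatabase stand in for the memcached client and the MySQL access code, and the key format and TTL are illustrative rather than our actual docservice implementation.

import java.util.concurrent.TimeUnit;

/** Cache-aside sketch of the docservice read path; interfaces are illustrative. */
public final class DocService {

  /** Minimal interface over memcached: get by key, set with a TTL. */
  interface JobCache {
    Job get(String key);
    void set(String key, Job job, long ttl, TimeUnit unit);
  }

  /** Minimal interface over the MySQL lookup. */
  interface JobDatabase {
    Job load(long jobId);
  }

  static final class Job {
    final long id;
    final String title;
    Job(long id, String title) { this.id = id; this.title = title; }
  }

  private final JobCache cache;
  private final JobDatabase database;

  DocService(JobCache cache, JobDatabase database) {
    this.cache = cache;
    this.database = database;
  }

  Job serve(long jobId) {
    String key = "job:" + jobId;
    Job job = cache.get(key);        // ~1-5 ms
    if (job == null) {
      job = database.load(jobId);    // ~10 ms or more under contention
      if (job != null) {
        cache.set(key, job, 1, TimeUnit.DAYS); // TTL is illustrative
      }
    }
    return job;
  }
}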

To avoid a new docservice with an empty memcache sending too many requests to the database, we “primed” the memcache. Before putting a new memcached instance into production, we would load data for as many recent jobs as possible into the cache. This technique allowed us to ensure that most job documents would be served out of the cache, even for a brand new docservice instance.

The introduction of memcached made job document serving reliably fast throughout the day. It resulted in far less database load and contention with aggregation processes, and it enabled continued rapid growth.

2009: Worldwide

This approach to document serving worked well for more than 2 years, through our early international rollout. By November 2008, Indeed was in 6 countries, and we were aggregating 7.1 million jobs per month. We were serving up to 150 million documents each day with peaks of 3,000 documents/second.

To support this scale, we added more production servers and brought new docservice instances online. We could not aggressively prime the memcache using the database, since that would lock out writes on the MySQL master for minutes at a time. Instead, we had to limit the rate at which the priming process queried the database. We projected that it would soon take 36 hours of rate-limited priming to warm up a new memcache daemon before it could serve live traffic.
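
A rate-limited priming loop can be sketched as below, here using Guava’s RateLimiter for throttling; the interfaces, key format, and throughput figure are illustrative, not our actual priming process.

import com.google.common.util.concurrent.RateLimiter;

import java.util.List;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch: warm an empty memcache from the database, gently. */
public final class CachePrimer {

  interface JobSource {
    List<Long> recentJobIds(int limit);   // newest jobs first
    byte[] loadJob(long jobId);           // serialized job document
  }

  interface JobCache {
    void set(String key, byte[] value, long ttl, TimeUnit unit);
  }

  static void prime(JobSource source, JobCache cache, double readsPerSecond, int maxJobs) {
    RateLimiter limiter = RateLimiter.create(readsPerSecond);
    for (long jobId : source.recentJobIds(maxJobs)) {
      limiter.acquire();                                   // throttle database reads
      byte[] job = source.loadJob(jobId);
      if (job != null) {
        cache.set("job:" + jobId, job, 7, TimeUnit.DAYS);  // TTL is illustrative
      }
    }
  }
}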

At the same time, we faced a limitation with our data pipeline. We were extracting more data from jobs after our initial aggregation, and these attributes were stored in other tables and even other databases. Example attributes included job language, job type (full-time, part-time, etc.), and salary. The only options for making this information available for job display in our architecture at the time were table joins or multiple database queries per search. Both were unacceptable because they would slow down job document serving and increase database load. We could also have gone back to using stored fields, but with our index sizes steadily increasing, that was a non-starter.

We also needed to host our site in data centers around the world to deliver the best experience for users outside the US. Even with memcached, we still needed the database sometimes. Having data centers around the world depend on a single database would add an unacceptable amount of latency, and we had already learned that replication of the database wasn’t workable for our use case.

It was time to implement a single data store that would contain all job attributes, replicate easily to remote data centers, and provide fast access.

A new data store

Our new solution, called docstore, was a denormalized, read-only, file-based store for job data. A docstore builder process serialized each job’s data using Thrift and stored it in a compressed segment file containing up to 100 jobs (in job ID order). A simple naming convention made it easy to look up a job given its ID, as shown below.

Figure: docstore segment file naming example

When a request came in for a document, we would locate the segment file by a simple directory lookup, read in the segment file, and retrieve the job data by its offset in the file.
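
As a rough illustration of that lookup, the job ID alone can determine both the segment file and the job’s slot within it. In the sketch below, the directory scheme, the length-prefixed record format, and the assumption of dense job IDs within a segment are all illustrative, not the real docstore format.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;

/**
 * Illustrative sketch of a docstore lookup. The real store serialized jobs
 * with Thrift and used its own naming convention and record layout.
 */
public final class DocstoreReader {

  private final File root;

  public DocstoreReader(File root) {
    this.root = root;
  }

  /** Returns the serialized (Thrift) bytes for one job, or null if absent. */
  public byte[] readJob(long jobId) throws Exception {
    long segment = jobId / 100;        // 100 jobs per segment file, in job ID order
    long bucket = segment / 1000;      // keep directories a manageable size
    File segmentFile = new File(root, bucket + "/" + segment + ".gz");
    if (!segmentFile.exists()) {
      return null;                     // fall back to the database (pre-2009)
    }

    int slot = (int) (jobId % 100);    // the job's position within the segment
    try (DataInputStream in =
             new DataInputStream(new GZIPInputStream(new FileInputStream(segmentFile)))) {
      for (int i = 0; i < slot; i++) {
        in.readFully(new byte[in.readInt()]); // skip the records before ours
      }
      byte[] job = new byte[in.readInt()];
      in.readFully(job);
      return job;                      // deserialize with Thrift to get the job fields
    }
  }
}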

The docservice still used memcache as the preferred source of data, but it would fall back to the docstore when a job was not in the cache, and only make a database request if a job was absent in both. The docstore never deleted data, so a job would only be absent if the docstore building or replication had fallen behind the latest production index.

We later updated our search service interface to exclude results that were not yet in the docstore, so we could remove that database dependency entirely. This allowed us to launch in remote data centers, where it would have been too expensive to query a distant database, even infrequently. It worked — serving jobs and priming the cache to bring a new docservice online were fast worldwide.

With this new approach, the docstore builder became the only direct consumer of the job databases, dramatically reducing database contention. We again used rsync to replicate the resulting docstore to remote docservice servers (see Figure 5), and the docstore was the source of all job data for serving search results.

Launching the docstore enabled us to expand our international presence in 2009 to 23 countries. By November 2009, we were aggregating 22.5 million jobs per month worldwide. We were serving up to 312 million documents each day with peaks of 6,000 documents/second.

Figure 5: serving from docstore

2011: Great performance worldwide, until…

The docstore solution worked well for us into 2011. Breaking the runtime database dependency in document serving enabled rolling out to worldwide data centers while maintaining our high standards for performance and availability. By the end of 2011, we were aggregating 41 million new jobs per month and serving up to 850 million job documents each day, at a peak rate of 14,000 documents/second.

Figure: growth of Indeed traffic through 2011

Basing our docstore directory structure on job ID made it easy to find jobs. However, it also meant that when our aggregation system discovered an update to a job, the docstore builder needed to change the contents of an existing segment file to stay up to date. We know from our data that the average number of updates per job is 1.5 — close to half of the jobs in our index will be updated at least once, and many are updated more than once. For each update, the docstore had to first find the segment file containing the job, decompress and load the data for all 100 jobs in the segment, update the job, then recompress and write the whole segment file back to disk.
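
Spelled out in code, a single job update cost a full read-modify-write of its segment. This sketch reuses the illustrative length-prefixed format from the reader sketch above; it is not the real docstore builder code.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative sketch: updating one job rewrites the entire ~100-job segment. */
public final class DocstoreUpdater {

  public static void updateJob(File segmentFile, int slot, byte[] newJobBytes) throws Exception {
    // 1. Read and decompress every record in the segment.
    List<byte[]> records = new ArrayList<byte[]>();
    try (DataInputStream in =
             new DataInputStream(new GZIPInputStream(new FileInputStream(segmentFile)))) {
      while (true) {
        byte[] record = new byte[in.readInt()];
        in.readFully(record);
        records.add(record);
      }
    } catch (EOFException endOfSegment) {
      // reached the end of the segment
    }

    // 2. Replace the one job that actually changed.
    records.set(slot, newJobBytes);

    // 3. Recompress and rewrite the whole segment back to disk.
    try (DataOutputStream out =
             new DataOutputStream(new GZIPOutputStream(new FileOutputStream(segmentFile)))) {
      for (byte[] record : records) {
        out.writeInt(record.length);
        out.write(record);
      }
    }
  }
}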

The job update stream processed by the docstore builder was ordered by time, not by job ID. Therefore, the writes to the docstore did not benefit from the job ID locality of our directory scheme. This led to the first bottleneck we encountered with the docstore: write volume. There were so many updates happening essentially at random across the disk that a single drive had trouble keeping up. This also slowed down replication for the same reason: updating the remote docstores required the exact same pattern of random writes that was slowing the primary docstore builder.

Another side effect of updates being written in place was that a single run of the docstore builder could affect disparate segment files across the docstore. This made it very hard to detect corruption of segment data due to unexpected issues. Though corruptions were rare, they required a lot of manual work to determine which jobs were being updated at the time of the corruption, locate all of the affected segment files, and restore them. Fundamentally, the docstore was built around a single copy of every job, kept current as the job was updated, with a strict requirement for where each file lived on disk. The resulting write patterns were hitting the physical limits of our drives, and there was no clear way to improve that with the existing docstore.

By late 2011, it was time for something new. We built a completely new, highly scalable docstore, and it has been in production for over a year now with no scalability limits in sight. In our next blog post, we will describe the key implementation details of docstore v2.