From 1 to 1 Billion: Evolution of a Document Serving System

[Editor’s note: This post is part 1 of a two-part companion piece to our first @IndeedEng talk. Slides and video are available.]

Indeed.com launched in November of 2004 in the US. Today, Indeed is in more than 50 countries and has 100 million unique visitors performing over 3 billion job searches each month. We have more traffic than any other job site in the world.

If we had tried to support this scale on day one, we might never have launched, since building a job search system to support our eventual level of traffic would have required time and knowledge we didn’t have. Likewise, if we hadn’t evolved our systems as growth demanded more of us, we wouldn’t be where we are today. The journey to 100 million unique visitors took us 8 years.

We believe in building products that are fast, simple, comprehensive, and relevant. We keep fast and simple in mind when building solutions, too — we put new features in front of our users and iterate quickly to continuously improve our products. This post provides a brief history of the iterations we made as we scaled our system to billions of searches per month.

2004: The simplest solution that works

To get the first version of job search implemented back in 2004, we turned to Lucene, a Java-based open source project that provides indexing and search. Once you’ve built an index of your documents — jobs, in our case — Lucene can efficiently and quickly turn a search query into a set of matching document IDs. Our search webapp used the Lucene index to get the document IDs that matched a user’s search, but we also needed the corresponding job data to display. Turning those IDs into displayable job data is what we call “document serving.”

The watch words for document serving are comprehensive and fast. We are constantly aggregating job postings from all over the world, and we need to be able to serve data for all jobs for all time. We need to make new jobs and job updates available on the site as quickly as possible, and we need job document serving to be extremely fast.

We show 10 jobs on a search results page. For each of these results, we display attributes like title, company, location, job age, and a snippet based on the job description.

Figure 1: sample search result highlighting document attributes

For our first iteration, we kept the job document data in the Lucene index using stored fields. A commonly-used feature of Lucene, stored fields allow you to persist the document data with the index while it is being built. The data flow of this first iteration is shown in Figure 2.

The index builder ran every few minutes, read jobs from the database, and built the Lucene index. This index was then copied to the web servers with rsync and used by the search webapp to both search and serve the documents.

Figure 2: serving documents from stored fields

In our first month we aggregated 2 million jobs, and on our busiest day we served 160,000 job documents, with peaks of 7 documents/second. Our system worked; search was fast, and job results were fresh.

2005: Size matters

One year later, in November 2005, we were aggregating almost 4 million new jobs per month. We were serving up to 5 million documents per day with peaks of 100 documents/second. We had nearly doubled the number of searchable jobs, causing the index to grow to between 6GB and 7GB. The stored fields accounted for most of that file size. Lucene is extremely fast when the entire index can fit in the disk cache, but our production servers only had 4GB of RAM. To execute a search, the server had to page in data from disk, resulting in slower job search performance.

The index determines which jobs we show on a search result page, so we don’t want to keep jobs in it that are old, or have been removed from the original posting source. Even if we no longer showed a particular job on a search result page, a user might come back to a job details page at any time from a bookmark or email. Therefore, even when a job is not in our index, we still need to be able to serve it. Since it would have not been feasible to keep all jobs in our Lucene index forever, we had to use another approach.

The evolution begins

We needed to revise our architecture in order to optimize search performance, which is a key part of the job seeker experience. We decided to serve job data directly from a MySQL database that was already the primary storage location of job data. MySQL provides options for aggressive caching, and the main job data table already had a primary key index on the job ID, making lookup fast. The database was hosted on a separate, more powerful machine. Querying it directly, instead of requiring all that data to be available on the machine hosting the search webapp, would allow us to reduce resource utilization on those web servers.

Retrieving jobs directly from the database allowed us to quit using stored fields, drastically reducing the size of the Lucene index. We still indexed the same jobs, and the search webapp still used that index for searching. The difference was that now Lucene would only return the matching job IDs, and the search webapp would retrieve the data for those jobs from the database (see Figure 3). This change was a success. The index was smaller, document lookups were faster, and job search performance improved.

Figure 3: serving documents from MySQL database

2006: Too much contention

We used the direct database access approach for over a year. In November of 2006, we were aggregating about 5.2 million new jobs each month. We were serving up to 21 million documents per day, with peaks of 500 documents/second, a 5x increase from the previous year. Though our average job search time measurements were still quite fast, we noticed slower performance at certain times of day.

The cause of this slowdown was write contention on the main jobs table, which used the MyISAM storage engine. Writes from our aggregation processes were locking out reads from search, because MyISAM uses table-level locking. Conversely, reads from the search side were blocking those writes. Executing user searches and getting new job data into our database were getting in each other’s way.

One option was to switch to MySQL’s InnoDB storage engine, which does row-level locking, but that migration required some tricky changes in terms of increased hardware cost and far-reaching code changes. We eventually did migrate, but we needed a solution we could deploy sooner.

Another solution we considered was to replicate the primary database and serve job documents from a replica. This approach works for systems with a read-heavy workload. We had too many writes happening for that to help us. Since MySQL replication is single-threaded, it would fall behind the master as the large number of reads from the search webapp locked out the replication stream writes from that single writer thread. If our search webapp was reading from a replica and replication fell behind, the index might return job IDs to the webapp that corresponded to jobs not yet in the replica.

When the database can’t keep up, cache it!

To address these problems, we chose a classic approach: application-level caching. We used Memcached, an open source, high performance, in-memory object caching system.

At this same time, we were evolving other aspects of our search webapp. We started transitioning from a single webapp that did everything to a service-oriented architecture. One of the first services we carved out of the search webapp was docservice, which handled the work of document serving. The search webapp made a request to the docservice for job document data. Multiple search webapp instances could be supported by a single docservice, as shown in Figure 4. For the first iteration, the search webapp communicated with the docservice over HTTP. We later changed the implementation of that communication to use our own efficient service infrastructure, called boxcar.

Figure 4: serving documents from memcache

To serve a document request, the docservice first looked in the memcached and only queried the database if the job was not cached. A memcached lookup took between 1 and 5 ms, and a database read could take 10ms or longer, depending on lock contention. To maximize the cache hit rate and avoid database access, each docservice had a background thread that would watch for new job data and load those jobs into the cache.

To avoid a new docservice with an empty memcache sending too many requests to the database, we “primed” the memcache. Before putting a new memcached instance into production, we would load data for as many recent jobs as possible into the cache. This technique allowed us to ensure that most job documents would be served out of the cache, even for a brand new docservice instance.

The introduction of memcached made job document serving reliably fast throughout the day. It resulted in far less database load and contention with aggregation processes, and it enabled continued rapid growth.

2009: Worldwide

This approach to document serving served us well for more than 2 years, through our early international rollout. By November 2008, Indeed was in 6 countries, and we were aggregating 7.1 million jobs per month. We were serving up to 150 million documents each day with peaks of 3,000 documents/second.

To support this scale, we added more production servers and brought new docservice instances online. We could not aggressively prime the memcache using the database, since that would lock out writes on the MySQL master for minutes at a time. Instead, we had to limit the rate at which the priming process queried the database. We projected that it would soon take 36 hours of rate-limited priming to warm up a new memcache daemon before it could serve live traffic.

At the same time, we faced a limitation with our data pipeline. We were extracting more data from jobs after our initial aggregation, and these attributes were stored in other tables and even other databases. Some example attributes were job language, job type (full-time, part-time, etc) and salary. The only options for making this information available for job display in our current architecture were table joins or multiple database queries per search. Both were unacceptable because they would slow down job document serving and increase database load. We could have also gone back to using stored fields but with our index sizes steadily increasing, that was a non-starter.

We also needed to host our site in data centers around the world to deliver the best experience for users outside the US. Even with memcached, we still needed the database sometimes. Having data centers around the world depend on a single database would add an unacceptable amount of latency, and we had already learned that replication of the database wasn’t workable for our use case.

It was time to implement a single data store that would contain all job attributes, replicate easily to remote data centers, and provide fast access.

A new data store

Our new solution, called docstore, was a denormalized, read-only, file-based store for job data. A docstore builder process wrote a job’s data, serialized using Thrift, and stored it in a compressed segment file of up to 100 jobs (in job ID order). A simple naming convention made it easy to look up a job given its ID, as shown below.

When a request came in for a document, we would locate the segment file by a simple directory lookup, read in the segment file, and retrieve the job data by its offset in the file.

The docservice still used memcache as the preferred source of data, but it would fall back to the docstore when a job was not in the cache, and only make a database request if a job was absent in both. The docstore never deleted data, so a job would only be absent if the docstore building or replication had fallen behind the latest production index.

We later updated our search service interface to exclude results that were not yet in the docstore, so we could remove that database dependency entirely. This allowed us to launch in remote data centers, where it would have been too expensive to query a distant database, even infrequently. It worked — serving jobs and priming the cache to bring a new docservice online were fast worldwide.

With this new approach, the docstore builder became the only direct consumer of the job databases, dramatically reducing database contention. We again used rsync to replicate the resulting docstore to remote docservice servers (see Figure 5), and the docstore was the source of all job data for serving search results.

Launching the docstore enabled us to expand our international presence in 2009 to 23 countries. By November 2009, we were aggregating 22.5 million jobs per month worldwide. We were serving up to 312 million documents each day with peaks of 6,000 documents/second.

Figure 5: serving from docstore

2011: Great performance worldwide, until…

The docstore solution worked well for us into 2011. Breaking the runtime database dependency in document serving enabled rolling out to worldwide data centers while maintaining our high standards for performance and availability. By the end of 2011, we were aggregating 41 million new jobs per month and serving up to 850 million job documents each day, at a peak rate of 14,000 documents/second.

Basing our docstore directory structure on job ID made it easy to find jobs. However, it also meant that when our aggregation system discovered an update to a job, the docstore builder needed to change the contents of an existing segment file to stay up to date. We know from our data that the average number of updates per job is 1.5 — close to half of the jobs in our index will be updated at least once, and many are updated more than once. For each update, the docstore had to first find the segment file containing the job, decompress and load the data for all 100 jobs in the segment, update the job, then recompress and write the whole segment file back to disk.

The job update stream processed by the docstore builder was based on time and not jobID. Therefore, the writes to the docstore did not benefit from the jobID locality we had in our directory scheme. This lead to the first bottleneck we encountered with the docstore: write volume. There were so many updates happening essentially randomly across the disk that a single drive had trouble keeping up. This also slowed down replication for the same reasons. Updating the remote docstores required the exact same pattern of random updates that were slowing the primary docstore builder.

Another side effect of updates being written in place is that a single run of the docstore builder could affect disparate segment files across the docstore. This made it very hard to detect corruption of segment data due to unexpected issues. Though corruptions were rare, they required a lot of manual work to determine what jobs were being updated at the time of the corruption, locate all of the segment files affected, and restore them. Fundamentally, the docstore was based on having a single copy of every job, which was kept current as the job was updated, and had a strict requirement for where the file lived on disk. But the resulting write patterns of this paradigm were hitting the physical limits of our drives, and there was no clear way to improve that with the existing docstore.

By late 2011, it was time for something new. We built a completely new, highly scalable docstore, and it has been in production for over a year now with no scalability limits in sight. In our next blog post, we will describe the key implementation details of docstore v2.