Luck, Latitude, or Lemons? How Indeed Locates for Low Latency

Indeed likes being fast. Similar to published studies (here and here), our internal numbers validate the benefits of speed. It makes sense: a snappy site allows job seekers to achieve their goals with less frustration and wasted time.

Application processing time, however, is only part of the story, and in many cases it is not even the most significant delay. The network time – getting the request from the browser and then the data back again – is often the biggest time sink.

How do you minimize network time?

Engineers use all sorts of tricks and libraries to compress content and load things asynchronously. At some point, however, the laws of physics sneak in, and you just need to get your data center and your users communicating faster.

Sometimes, your product runs in a single data center, and the physical proximity of that data center is valuable. In this case, moving is not an option. Perhaps you can do some caching or use a CDN for static resources. For those who are less tied to a physical location, or, like Indeed, run their site out of multiple data centers, a different data center location may be the key. But how do you choose where to go? The straightforward methods are:

Word of Mouth. The price is good and you’ve talked to other customers from the data center. They seem satisfied. The list of Internet carriers the data center provide seems comprehensive. It’s probably a good fit for your users … if you’re lucky.

Location. You have a lot of American users on the East Coast. Getting a data center close to them, say in the New York area, should help make things faster for the East Coast.

Prepare to be disappointed.

These aren’t bad reasons to pick a data center, but the Internet isn’t based on geography – it’s based on peering points, politics, and price. If it’s cheaper for your customer’s ISP to route New York through New Jersey because they have dedicated fiber to a facility they own, they’re probably going to do that, regardless of how physically close your data center is to the person accessing your site. The Internet’s “series of tubes” don’t always connect where you’d think.

What we did

In October of 2012, Indeed faced a similar quandary. We had a few data centers spread out across the U.S., but the West Coast facility was almost full, and the provider warned that they were going to have a hard time with our predicted growth. The Operations team was eager to look at alternate data centers, but we also didn’t want to make things slower for the West Coast users. So we set up test servers in a few data centers. We pinged the test servers from as many places as we could, comparing the results to the ping times of the original data center. This wasn’t a terrible approach, but it also didn’t mimic the job seeker’s experience.

Meanwhile, other departments were thinking about the problem too. A casual hallway conversation with an engineering manager snowballed into the method we use today. It was important to use real user requests to test possible new locations. After all, what better measure would there be to how users perceive a data center than those same users?

After a few rounds of discussion, and some Dev and Ops time, we came up with the Fruits Test, named for the fruit-based hostnames of our test servers. Utilizing this technique, we estimated that the proposed new data center would shave an average of 30 milliseconds off of the response time for most of our West Coast job seekers. We validated this number once we migrated our entire footprint to the new facilities.

How it works

First, we assess a potential data center for eligibility. It doesn’t make sense to run a test against an environment that’s unsuitable because of space or cost. After clearing that hurdle, we set up a lightweight Linux system with a web server. This web server has a single virtual host named after a fruit, such as We set up the virtual host to serve ten static JavaScript files, named 0.js, 1.js, etc., up to 9.js.

Once the server is ready, we set up a test matrix in Proctor, our open-sourced A/B testing framework. We assign a fruit and a percentage to each test bucket. Then, each request to the site is randomly assigned to one of the test buckets based on the percentages. Each fruit corresponds to a data center being tested (whether new or existing). We publish the test matrix to Production, and then the fun begins!


Figure 1: Fruits test requests, responses, and logging


  1. The site instructs the client to perform the fruits test.
  2. The 0.js request and response call dcDNSCallback.
  3. dcDNSCallback sends the latency of the 0.js request to the site.
  4. The [1-9].js request and response call dcPingCallback.
  5. dcPingCallback sends the latency of the [1-9].js request to the site.

Requests in the test bucket receive JavaScript instructing their browser to start a timer and load the 0.js file from their selected fruit site. This file includes a blank comment and an instruction to call the dcDNSCallback function. On, it passes in "l" to indicate the test fruit:



dcDnsCallback then stops the previous timer, and sends a request to, which triggers a log event with the recorded request latency.

The dcDnsCallback function serves two purposes. Since the user’s system may not have the fruit hostname’s IP address in its DNS cache, we can get an idea of how long it takes to do a DNS lookup and a single request round trip. Then, subsequent requests to that fruit host within this session won’t have DNS lookup time as a significant variable, making those timing results more precise.

After the dcDnsCallback invocation, the test selects one of the 9 static JavaScript files at random and repeats the same process: start timer, get the file, run function in the file. These files look a little bit like:


These 9 files (1.js through 9.js) are basically the same as 0.js, but call a dcPingCallback function instead, and contain a comment whose length makes the overall response bulk up to a predefined size. The smallest, 1.js is just 26 bytes, and 9.js comes in at a hefty 50 kilobytes. Having different sized files helps us suss out areas where latency may be low, but available bandwidth is limited enough that getting larger files takes a disproportionately long time. It also can identify areas where bandwidth is plentiful enough that the initial TCP connection setup is the most time-consuming aspect of the transaction.

Once the dcPingCallback function is executed, the timer is stopped and the information about which fruit, which JavaScript file, and how long the operation took is sent to Indeed to be logged. These requests are all placed at the end of the browser’s page rendering and executed asynchronously to minimize the impact of the test on the user’s experience.

On, the logging endpoint receives this data and records it, along with the source IP address and the site the user is on. We then write the information to a specially formatted logstore that Indeed calls the LogRepo – mysterious name, I know.

After collecting the LogRepo logs, we build indexes from them using Imhotep, which allows for easy querying and graphing. Depending on the nature of the test, we usually let the fruits test run for a couple of weeks, collecting hundreds of thousands or even millions of samples from real job seekers that we can use to make a more informed decision. When the test has run its course, we just turn off the Proctor test and shut down the fruit test server. That’s it! No additional infrastructure changes needed.

One of the nice things about this approach is that it is flexible for other types of tests. Sure, we mainly use it for testing new data center locations, but when you boil it down to its essentials (fruit jam!), all the test does is download a set amount of data from a random sampling of users and tell you how long it took. Interpreting the results is up to the test designer.

Rather than testing data centers, you could test two different caching technologies, or the performance difference between different versions of web or app servers, or the geographic distribution of an Anycast/BGP IP (we’ve done that last one before). As long as the sample size is large enough to be statistically diverse, it makes for a valid comparison, and from the perspective of the best people to ask: your users.

That’s nice, but why “Fruits Test”?

When we were discussing unique names to represent potential and current data center locations, we wanted names that were:

  • easily identifiable to Operations
  • a little bit obscure to users, but not too mysterious
  • not meaningful for the business

As a placeholder while designing things, we used fruits since it was fairly easy to come up with different fruits for letters of the alphabet. Over the course of the design the names became endearing and they stuck. Now I relish opening up tickets to enable tests for jujube, quince (my favorite), and elderberry!


Now what?

Now that we have a pile of data, we graph the heck out of it! But more about that in Part 2 of the Fruits Test series.

Tweet about this on TwitterShare on FacebookShare on LinkedInShare on Google+Share on RedditEmail this to someone

Status: A Java Library For Robust System Status Health Checks

We are excited to highlight the open source availability of Status, a Java library that can report a system’s status in a readable format. The Status library enables dynamic health checks and monitoring of system dependencies. In this post, we will show how to add health checks to your applications.

Why use system status health checks?

Health checks play an important role at Indeed. We set up and run large-scale services and applications every day. Health checks allow us to see the problematic components at an endpoint, rather than combing through logs.

In production, a health check can let us know when a service is unreachable, a file is missing, or the system cannot talk with the database. Additionally, these health checks provide a controlled way for developers to communicate issues to system administrators. In any of these situations, the application can evaluate its own health check and gracefully degrade behavior, rather than taking the entire system offline.

The Status library will capture stack traces from dependencies and return the results in a single location. This feature makes it easy to resolve issues as they arise in any environment. Typical dependencies include MySQL tables, MongoDB collections, RabbitMQ message queues, and API statuses.

System states

When dependencies fail, they affect the condition of the system. System states include:

  • OUTAGE – the system is unable to process requests;
  • MAJOR – the system can service some requests, but may fail for the majority;
  • MINOR – the system can service the majority of requests, but not all;
  • OK – the system should be able to process all requests.

Get started with Status

Follow these instructions to start using the Status library:

Extend the AbstractDependencyManager. The dependency manager will keep track of all your dependencies.

public class MyDependencyManager extends AbstractDependencyManager {
  public MyDependencyManager() {

Extend PingableDependency for each component that your application requires to run.

public class MyDependency extends PingableDependency {
  public void ping() throws Exception {
    // Throw exception if considered unhealthy or unavailable

Extending the PingableDependency class is the simplest way to incorporate a dependency into your application. Alternatively, you can extend AbstractDependency or ComparableDependency to get more control over the state of a dependency. You can control how your dependency affects the system’s condition by providing an Urgency level.

Add your new dependencies to your dependency manager.


For web-based applications and services, create an AbstractDaemonCheckReportServlet that will report the status of your application.

public class StatusServlet extends AbstractDaemonCheckReportServlet {
  private final AbstractDependencyManager manager;

  public StatusServlet(AbstractDependencyManager manager) {
    this.manager = manager;

  protected AbstractDependencyManager newManager(ServletConfig config) {
    return manager;

Once this process is complete and your application is running, you should be able to access the servlet to read a JSON representation of your application status.

Below is a sample response returned by the servlet. If the application is in an OUTAGE condition, the servlet returns a 500 status code. Associating the health check outcome with an HTTP status code enables integration with systems (like Consul) that make routing decisions based on application health. Otherwise, the servlet returns a 200 since it can still process requests. In this case, the application may gracefully degrade less-critical functionality that depends on unhealthy code paths.

  "hostname": "pitz.local",
  "duration": 19,
  "condition": "OUTAGE",
  "dcStatus": "FAILOVER",
  "appname": "crm.api",
  "catalinaBase": "/var/local/tomcat",
  "leastRecentlyExecutedDate": "2015-02-24T22:48:37.782-0600",
  "leastRecentlyExecutedTimestamp": 1424839717782,
  "results": {
    "OUTAGE": [{
      "status": "OUTAGE",
      "description": "mysql",
      "errorMessage": "Exception thrown during ping",
      "timestamp": 1424839717782,
      "duration": 18,
      "lastKnownGoodTimestamp": 0,
      "period": 0,
      "id": "mysql",
      "urgency": "Required: Failure of this dependency would result in complete system outage",
      "documentationUrl": "",
      "thrown": {
        "exception": "RuntimeException",
        "message": "Failed to communicate with the following tables:
          user_authorities, oauth_code, oauth_approvals, oauth_client_token,
          oauth_refresh_token, oauth_client_details, oauth_access_token",
        "stack": [
      "date": "2015-02-24T22:48:37.782-0600"
    "OK": [{
      "status": "OK",
      "description": "mongo",
      "errorMessage": "ok",
      "timestamp": 1424839717782,
      "duration": 0,
      "lastKnownGoodTimestamp": 0,
      "period": 0,
      "id": "mongo",
      "urgency": "Required: Failure of this dependency would result in complete system outage",
      "documentationUrl": "",
      "date": "2015-02-24T22:48:37.782-0600"

This report includes these key fields to help you evaluate the health of a system and the health of a dependency:

condition Identifies the current health of the system as a whole.
leastRecentlyExecutedDate The last date and time that the report was updated.

Use these fields to inspect individual dependencies:

status Identifies the health of the current dependency.
thrown The exception that caused the dependency to fail.
duration The length of time it took to evaluate the dependency’s health. Because the system caches the result of a dependency’s evaluation, this value can be 0.
urgency The urgency of the dependency. Dependencies with a WEAK urgency may not need to be fixed immediately. Dependencies with a REQUIRED urgency must be fixed as soon as possible.

Learn more about Status

Stay tuned for a future post about using the Status library, in which we’ll show how to gracefully degrade unhealthy applications. To get started, read our quick start guide and take a look at the samples. If you need help, you can reach out to us on GitHub or Twitter.

Tweet about this on TwitterShare on FacebookShare on LinkedInShare on Google+Share on RedditEmail this to someone

Finding Great (and Profitable) Ideas in the Computer Science Literature

I spend quite a bit of time trawling through recent computer science papers, looking for anything algorithmic that might improve my team’s product and Help People Get Jobs. It’s been a mixed bag so far, often turning up a bunch of pretty math that won’t scale at Indeed. But looking through the computer science literature can pay off big, and more of us should use the research to up our game as software developers.


Word cloud generated by WordItOut

Why read a computer science paper

The first question you might ask is why? Most working developers, after all, simply never read any computer science papers. Many smart developers look at me blankly when I even suggest that they do a literature search. “You mean look on StackOverflow?”

The short answer: to get an edge on your problem (and occasionally on your competition or your peers).

Some academic is looking into some deep generalization of whatever problem you are facing. They are hungry (sometimes literally, on academic salaries) to solve problems, and they give away the solutions. They are publishing papers at a ferocious pace, because otherwise their tenure committees will invite them to explore exciting opportunities elsewhere. Academics think up good, implementable approaches and give them away for free. And hardly anyone notices or cares, which is madness. But a smart developer can sometimes leverage this madness for big payouts. The key is knowing how to find and read what academics write.

Finding computer science papers

Thousands of computer science papers are published each year. How do you find a computer science paper worth reading? As with so many questions in this new century, the answer is Google, specifically Google Scholar.

As near as I can tell, Google Scholar includes almost all the academic papers ever written, for free. Almost every computer science paper since Alan Turing is accessible there. With Scholar, Google is providing one of the most amazing resources anyone has ever given away. Some links point to papers behind paywalls, but almost all those have extra links to copies that aren’t. I’ve read hundreds of papers and never paid for one.

Google doesn’t even attempt to monetize it. Nobody in the general public has heard about More surprisingly: according to my Google contacts, not many Googlers have heard about it either.

With Google Scholar, you’ve solved the problem of finding interesting papers.

Filtering computer science papers

Next, the problem is filtering and prioritizing the interesting papers you find.

Google Scholar search algorithms are powerful, but they aren’t magic. Even your best search skills will net you too many papers to read and understand. The chance that you are reading the one that will most help your work is small.

Here’s my basic strategy for quickly finding the best ones.

First, figure out the paper’s publication date. This seems like an obvious bit of metadata, but you’ll rarely find the date on the paper itself. Instead, look for clues in Google Scholar. You can also assume that it’s two years after the latest paper listed in the citations. This seems sloppy, but it’s effective. Computer science papers older than fifteen years are unlikely to contain anything of value beyond historical interest.

Next, read the first paragraph of the paper. This paragraph covers the problem the researchers are trying to solve, and why it’s important. If that problem sounds like yours, score! Otherwise, unless the authors have hooked you on the intrinsic interest of their results, dump it and move on to the next paper.

If things still seem promising, read the second paragraph. This paragraph covers what the authors did, describes some constraints, and lets you know the results (in broad strokes). If you can replicate what they did in your environment, accept the constraints, and the results are positive, awesome. You’ve determined the paper is worth reading!

How to read a computer science paper

The biggest trick to reading an academic paper is to know what to read and what not to read. Academic papers follow a structure only slightly more flexible than that of a sonnet. Some portions that look like they would help you understand will likely only confuse. Others that look pointless or opaque can hold the secrets to interpreting the paper’s deeper meanings.

Here’s how I like to do it.

Don’t read the abstract. The abstract conveys the gist of the paper to other researchers in the field. These are folks who’ve spent the last decade thinking about similar problems. You’re not there yet. The abstract will likely confuse you and possibly frighten you, but won’t help you understand the topic.

Don’t read the keywords. Adding keywords to papers was a bad idea that nonetheless seems to have stuck. Keywords tend to mislead and won’t add anything you wouldn’t get otherwise. Skip ’em, they’re not worth their feed.

Read the body of the paper closely. Do you remember the research techniques your teachers tried to drum into you in eighth grade? You’ll need them all. You’re trying to reverse engineer just what the researchers did and how they did it. This can be tricky. Papers tend to leave out many shared assumptions behind the research, as well as many details and small missteps. Read every word. Look up phrases or words you don’t know — Wikipedia is usually fine for this. Write down questions. Try to figure out not just what the researchers did, but what they didn’t do, and why.

Don’t read the code. This is counterintuitive, because the clearest way software developers communicate is through code — ideally with documentation, revision history, cross-references, test cases, and review comments.

It doesn’t work that way with academics. To a first approximation, code in academic papers is worthless. The skills necessary to code well are either orthogonal to or actively opposed to the skills necessary for interesting academic research. It’s a minor scandal that most code used in academics is unreviewed, not version-controlled, lacks any test cases, and is debugged only to the point of “it didn’t crash, mostly, today.” That’s the good stuff. The bad stuff is simply unavailable, and quite probably long-deleted by the time the paper got published. Yes, that’s atrocious. Yes, even in computer science.

Read the equations. Academics get mathematics, so their equations have all the virtues that software developers associate with the best software: precision, correctness, conciseness, evocativeness. Teams of smart people trying to find flaws offer painstaking reviews of the equations. In contrast, a bored grad student writes the code, which nobody reads.

Don’t read the conclusions section. It adds nothing.

Leveraging a computer science paper for further search

Academic papers offers a bounty of contextual data in references to other papers. Google Scholar excels at finding papers, but there’s no substitute for actually following the papers that researchers used to inform their work.

Follow the citations in the related work. Authors put evocative descriptions of the work that matters to them in “Related Work.” This provides an interesting contrast for interpreting their work. In some ways, this section memorializes the most important social aspects of academic work.

Follow the citations in the references. Long before HTML popularized hypertext, academic papers formed a dense thicket of cross-references, reified as citations. For even the best papers, half of the value is the contents, half is the links. Citations in papers aren’t clickable (yet), but following them is not hard with Google Scholar.

Repeated citations of older papers? There’s a good chance those are important in the field and useful for context. Repeated citations of new papers? Those papers give insight into the trajectory of the subject. Odd sounding papers with unclear connections to the subject? They are great for getting the sort of mental distance that can be useful in hypothesis generation.

Once you’ve done all that…

It’s just a simple matter of coding. Get to it!

Dave Griffith is a software engineer at Indeed and has been building software systems for over 20 years.

Tweet about this on TwitterShare on FacebookShare on LinkedInShare on Google+Share on RedditEmail this to someone