IndeedEng San Francisco: 600% Growth in 2015

Indeed San Francisco is growing! It’s been a great year and we’ve enjoyed a lot of growth — starting the year with 5 people and ending with over 30!

Seated group in restaurant.

A mixer with the San Francisco and San Mateo offices

The Indeed San Francisco office is primarily focused on product development in these areas:

  • My Jobs provides job seekers with tools to more effectively organize and manage their search for a new job. The My Jobs team uses modern web technologies like ReactJS and ES6 to create an engaging user experience across desktop and mobile.
  • The Candidate Quality team builds models to predict how good of a fit a job seeker is for an opening. These models help employers quickly assess candidates who apply to their jobs. We also use these models to ensure that our job search products are helping people find jobs that they’re qualified for.
  • The Notifications team works on Indeed’s Job Alerts as well as our internal systems and infrastructure for high-volume email and mobile push notifications. These systems generate and deliver almost a billion emails to job seekers every week.
  • New product teams use rapid prototyping and iteration to investigate product ideas. The teams identify opportunities for new products, build and launch working implementations, and collect and analyze data to validate their ideas while iterating on new features. We’re always looking for new ways to connect job seekers with their ideal jobs.

Just like all teams at Indeed, the teams in San Francisco are small, work full-stack, deploy multiple times per week, and use our powerful experimentation and analytics tools — Proctor and Imhotep — to continuously test ideas and improve our products.

This year we moved to a larger space on Market St, had two hackathons (one at a cabin in the Sierra Nevadas), took a trip to the beach in Santa Cruz, and had a holiday party on the Embarcadero.

View from Market Street in San Francisco

View from our office

Small group on steps in front of woodsy cabin.

The site of our recent hackathon in the Sierra Nevadas

As we grow, Indeed San Francisco will take on more projects to help people get jobs. If you like that mission and want to help us tackle some really interesting problems, check out our open positions!

Luck, Latitude, or Lemons? How Indeed Locates for Low Latency

Indeed likes being fast. Similar to published studies (Speed Matters and Velocity and the Bottom Line), our internal numbers validate the benefits of speed. It makes sense: a snappy site allows job seekers to achieve their goals with less frustration and wasted time.

Application processing time, however, is only part of the story, and in many cases it is not even the most significant delay. The network time – getting the request from the browser and then the data back again – is often the biggest time sink.

How do you minimize network time?

Engineers use all sorts of tricks and libraries to compress content and load things asynchronously. At some point, however, the laws of physics sneak in, and you just need to get your data center and your users communicating faster.

Sometimes, your product runs in a single data center, and the physical proximity of that data center is valuable. In this case, moving is not an option. Perhaps you can do some caching or use a CDN for static resources. For those who are less tied to a physical location, or, like Indeed, run their site out of multiple data centers, a different data center location may be the key. But how do you choose where to go? The straightforward methods are:

Word of Mouth. The price is good and you’ve talked to other customers from the data center. They seem satisfied. The list of Internet carriers the data center provide seems comprehensive. It’s probably a good fit for your users … if you’re lucky.

Location. You have a lot of American users on the East Coast. Getting a data center close to them, say in the New York area, should help make things faster for the East Coast.

Prepare to be disappointed.

These aren’t bad reasons to pick a data center, but the Internet isn’t based on geography – it’s based on peering points, politics, and price. If it’s cheaper for your customer’s ISP to route New York through New Jersey because they have dedicated fiber to a facility they own, they’re probably going to do that, regardless of how physically close your data center is to the person accessing your site. The Internet’s “series of tubes” don’t always connect where you’d think.

What we did

In October of 2012, Indeed faced a similar quandary. We had a few data centers spread out across the U.S., but the West Coast facility was almost full, and the provider warned that they were going to have a hard time with our predicted growth. The Operations team was eager to look at alternate data centers, but we also didn’t want to make things slower for the West Coast users. So we set up test servers in a few data centers. We pinged the test servers from as many places as we could, comparing the results to the ping times of the original data center. This wasn’t a terrible approach, but it also didn’t mimic the job seeker’s experience.

Meanwhile, other departments were thinking about the problem too. A casual hallway conversation with an engineering manager snowballed into the method we use today. It was important to use real user requests to test possible new locations. After all, what better measure would there be to how users perceive a data center than those same users?

After a few rounds of discussion, and some Dev and Ops time, we came up with the Fruits Test, named for the fruit-based hostnames of our test servers. Utilizing this technique, we estimated that the proposed new data center would shave an average of 30 milliseconds off of the response time for most of our West Coast job seekers. We validated this number once we migrated our entire footprint to the new facilities.

How it works

First, we assess a potential data center for eligibility. It doesn’t make sense to run a test against an environment that’s unsuitable because of space or cost. After clearing that hurdle, we set up a lightweight Linux system with a web server. This web server has a single virtual host named after a fruit, such as lemon.indeed.com. We set up the virtual host to serve ten static JavaScript files, named 0.js, 1.js, etc., up to 9.js.

Once the server is ready, we set up a test matrix in Proctor, our open-sourced A/B testing framework. We assign a fruit and a percentage to each test bucket. Then, each request to the site is randomly assigned to one of the test buckets based on the percentages. Each fruit corresponds to a data center being tested (whether new or existing). We publish the test matrix to Production, and then the fun begins!

Flow between indeed.com and the client (steps 1, 3, and 5); and between the client and the fruit server (steps 2 and 4).

Figure 1: Fruits test requests, responses, and logging

Legend

  1. The site instructs the client to perform the fruits test.
  2. The 0.js request and response call dcDNSCallback.
  3. dcDNSCallback sends the latency of the 0.js request to the site.
  4. The [1-9].js request and response call dcPingCallback.
  5. dcPingCallback sends the latency of the [1-9].js request to the site.

Requests in the test bucket receive JavaScript instructing their browser to start a timer and load the 0.js file from their selected fruit site. This file includes a blank comment and an instruction to call the dcDNSCallback function. On lemon.indeed.com, it passes in "l" to indicate the test fruit:

/*

*/
dcDnsCallback("l");

dcDnsCallback then stops the previous timer, and sends a request to indeed.com, which triggers a log event with the recorded request latency.

The dcDnsCallback function serves two purposes. Since the user’s system may not have the fruit hostname’s IP address in its DNS cache, we can get an idea of how long it takes to do a DNS lookup and a single request round trip. Then, subsequent requests to that fruit host within this session won’t have DNS lookup time as a significant variable, making those timing results more precise.

After the dcDnsCallback invocation, the test selects one of the 9 static JavaScript files at random and repeats the same process: start timer, get the file, run function in the file. These files look a little bit like:

/*
3firaei1udgihufif5ly7zbsqyz59ghisb13u1j26tkffr7h67ppywg12lfkg7ortt5t3xoq5
*/
dcPingCallback("l");

These 9 files (1.js through 9.js) are basically the same as 0.js, but call a dcPingCallback function instead, and contain a comment whose length makes the overall response bulk up to a predefined size. The smallest, 1.js is just 26 bytes, and 9.js comes in at a hefty 50 kilobytes. Having different sized files helps us suss out areas where latency may be low, but available bandwidth is limited enough that getting larger files takes a disproportionately long time. It also can identify areas where bandwidth is plentiful enough that the initial TCP connection setup is the most time-consuming aspect of the transaction.

Once the dcPingCallback function is executed, the timer is stopped and the information about which fruit, which JavaScript file, and how long the operation took is sent to Indeed to be logged. These requests are all placed at the end of the browser’s page rendering and executed asynchronously to minimize the impact of the test on the user’s experience.

On indeed.com, the logging endpoint receives this data and records it, along with the source IP address and the site the user is on. We then write the information to a specially formatted logstore that Indeed calls the LogRepo – mysterious name, I know.

After collecting the LogRepo logs, we build indexes from them using Imhotep, which allows for easy querying and graphing. Depending on the nature of the test, we usually let the fruits test run for a couple of weeks, collecting hundreds of thousands or even millions of samples from real job seekers that we can use to make a more informed decision. When the test has run its course, we just turn off the Proctor test and shut down the fruit test server. That’s it! No additional infrastructure changes needed.

One of the nice things about this approach is that it is flexible for other types of tests. Sure, we mainly use it for testing new data center locations, but when you boil it down to its essentials (fruit jam!), all the test does is download a set amount of data from a random sampling of users and tell you how long it took. Interpreting the results is up to the test designer.

Rather than testing data centers, you could test two different caching technologies, or the performance difference between different versions of web or app servers, or the geographic distribution of an Anycast/BGP IP (we’ve done that last one before). As long as the sample size is large enough to be statistically diverse, it makes for a valid comparison, and from the perspective of the best people to ask: your users.

That’s nice, but why “Fruits Test”?

When we were discussing unique names to represent potential and current data center locations, we wanted names that were:

  • easily identifiable to Operations
  • a little bit obscure to users, but not too mysterious
  • not meaningful for the business

As a placeholder while designing things, we used fruits since it was fairly easy to come up with different fruits for letters of the alphabet. Over the course of the design the names became endearing and they stuck. Now I relish opening up tickets to enable tests for jujube, quince (my favorite), and elderberry!

Now what?

Now that we have a pile of data, we graph the heck out of it! But more about that in Part 2 of the Fruits Test series.

Status: A Java Library For Robust System Status Health Checks

We are excited to highlight the open source availability of Status, a Java library that can report a system’s status in a readable format. The Status library enables dynamic health checks and monitoring of system dependencies. In this post, we will show how to add health checks to your applications.

Why use system status health checks?

Health checks play an important role at Indeed. We set up and run large-scale services and applications every day. Health checks allow us to see the problematic components at an endpoint, rather than combing through logs.

In production, a health check can let us know when a service is unreachable, a file is missing, or the system cannot talk with the database. Additionally, these health checks provide a controlled way for developers to communicate issues to system administrators. In any of these situations, the application can evaluate its own health check and gracefully degrade behavior, rather than taking the entire system offline.

The Status library will capture stack traces from dependencies and return the results in a single location. This feature makes it easy to resolve issues as they arise in any environment. Typical dependencies include MySQL tables, MongoDB collections, RabbitMQ message queues, and API statuses.

System states

When dependencies fail, they affect the condition of the system. System states include:

  • OUTAGE – the system is unable to process requests;
  • MAJOR – the system can service some requests, but may fail for the majority;
  • MINOR – the system can service the majority of requests, but not all;
  • OK – the system should be able to process all requests.

Get started with Status

Follow these instructions to start using the Status library:

Extend the AbstractDependencyManager. The dependency manager will keep track of all your dependencies.

public class MyDependencyManager extends AbstractDependencyManager {
  public MyDependencyManager() {
    super("MyApplication");
  }
}

Extend PingableDependency for each component that your application requires to run.

public class MyDependency extends PingableDependency {
  @Override
  public void ping() throws Exception {
    // Throw exception if considered unhealthy or unavailable
  }
}

Extending the PingableDependency class is the simplest way to incorporate a dependency into your application. Alternatively, you can extend AbstractDependency or ComparableDependency to get more control over the state of a dependency. You can control how your dependency affects the system’s condition by providing an Urgency level.

Add your new dependencies to your dependency manager.

dependencyManager.addDependency(myDependency);
...

For web-based applications and services, create an AbstractDaemonCheckReportServlet that will report the status of your application.

public class StatusServlet extends AbstractDaemonCheckReportServlet {
  private final AbstractDependencyManager manager;

  public StatusServlet(AbstractDependencyManager manager) {
    this.manager = manager;
  }

  @Override
  protected AbstractDependencyManager newManager(ServletConfig config) {
    return manager;
  }
}

Once this process is complete and your application is running, you should be able to access the servlet to read a JSON representation of your application status.

Below is a sample response returned by the servlet. If the application is in an OUTAGE condition, the servlet returns a 500 status code. Associating the health check outcome with an HTTP status code enables integration with systems (like Consul) that make routing decisions based on application health. Otherwise, the servlet returns a 200 since it can still process requests. In this case, the application may gracefully degrade less-critical functionality that depends on unhealthy code paths.

{
  "hostname": "pitz.local",
  "duration": 19,
  "condition": "OUTAGE",
  "dcStatus": "FAILOVER",
  "appname": "crm.api",
  "catalinaBase": "/var/local/tomcat",
  "leastRecentlyExecutedDate": "2015-02-24T22:48:37.782-0600",
  "leastRecentlyExecutedTimestamp": 1424839717782,
  "results": {
    "OUTAGE": [{
      "status": "OUTAGE",
      "description": "mysql",
      "errorMessage": "Exception thrown during ping",
      "timestamp": 1424839717782,
      "duration": 18,
      "lastKnownGoodTimestamp": 0,
      "period": 0,
      "id": "mysql",
      "urgency": "Required: Failure of this dependency would result in complete system outage",
      "documentationUrl": "http://www.mysql.com/",
      "thrown": {
        "exception": "RuntimeException",
        "message": "Failed to communicate with the following tables:
          user_authorities, oauth_code, oauth_approvals, oauth_client_token,
          oauth_refresh_token, oauth_client_details, oauth_access_token",
        "stack": [
          "io.github.jpitz.example.MySQLDependency.ping(MySQLDependency.java:68)",
          "com.indeed.status.core.PingableDependency.call(PingableDependency.java:59)",
          "com.indeed.status.core.PingableDependency.call(PingableDependency.java:15)",
          "java.util.concurrent.FutureTask.run(FutureTask.java:262)",
          "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)",
          "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)",
          "java.lang.Thread.run(Thread.java:745)"
        ]
      },
      "date": "2015-02-24T22:48:37.782-0600"
    }],
    "OK": [{
      "status": "OK",
      "description": "mongo",
      "errorMessage": "ok",
      "timestamp": 1424839717782,
      "duration": 0,
      "lastKnownGoodTimestamp": 0,
      "period": 0,
      "id": "mongo",
      "urgency": "Required: Failure of this dependency would result in complete system outage",
      "documentationUrl": "http://www.mongodb.org/",
      "date": "2015-02-24T22:48:37.782-0600"
    }]
  }
}

This report includes these key fields to help you evaluate the health of a system and the health of a dependency:

condition Identifies the current health of the system as a whole.
leastRecentlyExecutedDate The last date and time that the report was updated.

Use these fields to inspect individual dependencies:

status Identifies the health of the current dependency.
thrown The exception that caused the dependency to fail.
duration The length of time it took to evaluate the dependency’s health. Because the system caches the result of a dependency’s evaluation, this value can be 0.
urgency The urgency of the dependency. Dependencies with a WEAK urgency may not need to be fixed immediately. Dependencies with a REQUIRED urgency must be fixed as soon as possible.

Learn more about Status

Stay tuned for a future post about using the Status library, in which we’ll show how to gracefully degrade unhealthy applications. To get started, read our quick start guide and take a look at the samples. If you need help, you can reach out to us on GitHub or Twitter.