Delaying Asynchronous Message Processing

At Indeed, we always consider what’s best for the job seeker. When a job seeker applies for a job, we want them to have every opportunity to be hired. It is unacceptable for a job seeker to miss an employment opportunity because their application was waiting to be processed while the employer makes a hire. The team responsible for handling applies to jobs posted on Indeed maintains service level objectives (SLOs) for application processing time. We constantly consider better solutions for processing applications and scaling this system.

Indeed first adopted RabbitMQ within our aggregation engine to handle the volume of jobs we process daily. With this success, we integrated RabbitMQ into other systems, such as our job seeker application processing pipeline. Today, this pipeline is responsible for processing more than 1.5 million applications a day. Over time, the team needed to implement several resilience patterns around this integration, including:

  • Tracing messages from production to consumption
  • Delaying message processing when expected errors occur
  • Sending messages that cannot be processed completely to a dead letter queue

Implementing a delay queue

A delay queue prolongs message processing by setting a message aside for a set amount of time. To understand why we implemented this pattern, consider several key behaviors of most messaging systems. RabbitMQ:

  • Guarantees at-least-once delivery (some messages may be delivered more than once)
  • Allows acknowledgement (ack), negative acknowledgement (nack), or requeue of messages
  • Requeues messages to the head of the queue, not the end

The team implemented a delay queue primarily to deal with the third item. Since RabbitMQ requeues messages to the head of the queue, the next message your consumer will likely process is the one that just failed. Although this is a non-issue for a small volume of messages, critical problems occur as the number of unprocessable messages exceeds the number of consumer threads. Since consumers can’t get past the group of unprocessable messages at the beginning of the queue, messages back up within the cluster.
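This starvation effect can be demonstrated with a toy model in plain Java, with no RabbitMQ involved. The sketch below simulates a consumer that always pulls from the head of a queue and requeues failed messages back to the head, as RabbitMQ does:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RequeueToHead {
    /**
     * Processes up to maxAttempts messages. Messages prefixed with "bad"
     * always fail and are requeued to the HEAD of the queue, so they block
     * every message behind them.
     */
    public static int processed(Deque<String> queue, int maxAttempts) {
        int processedCount = 0;
        for (int i = 0; i < maxAttempts && !queue.isEmpty(); i++) {
            String msg = queue.pollFirst();
            if (msg.startsWith("bad")) {
                queue.addFirst(msg);   // RabbitMQ-style requeue to the head
            } else {
                processedCount++;      // successfully acked
            }
        }
        return processedCount;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>();
        queue.add("bad-1");
        queue.add("ok-1");
        queue.add("ok-2");
        // The bad message is retried on every attempt; nothing behind it runs.
        System.out.println(processed(queue, 10)); // prints 0
    }
}
```

If failed messages were instead requeued to the tail, the consumer would eventually reach ok-1 and ok-2. With head requeueing, one unprocessable message per consumer thread is enough to stall the entire queue.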

Figure 1. Message backup within the cluster (queue size plotted over a 24-hour period)

How it works

While mechanisms such as a dead letter queue allowed us to delay message processing, they often required manual intervention to return a system to a healthy state. The delay queue pattern allows our systems to continue processing. Additionally, it requires less work from our first responders (engineers who are “on call” during business hours to handle production issues), Site Reliability Engineers (SREs), and our operations team. The following diagram shows the options for a consumer process that encounters an error:

Figure 2. Asynchronous message consuming system

When a consumer encounters an error and cannot process a message, engineers must choose whether the consumer should requeue the message, place it into the delay queue, or deliver it to the dead letter queue. They can make this decision by considering the following questions:

Was the error unexpected?

If your system encounters an unexpected error that is unlikely to happen again, requeue the message. This gives your system a second chance to process the message. Requeuing the message is useful when you encounter:

  • Network blips in service communication
  • A database operation failure caused by a transaction rollback or the inability to obtain a lock

Does the dependent system need time to catch up?

If your system encounters an expected error that may require a little time before reprocessing, delay the message. This allows downstream systems to catch up so the next time you try to process the message, it’s more likely to succeed. Delaying the message is useful for handling:

  • Database replication lag issues
  • Consistency issues when working with eventually consistent systems

Would you consider the message unprocessable?

If a message is unprocessable, send it to your dead letter queue. An engineer can then inspect the message and investigate before dropping or manually requeueing it. A dead letter queue is useful when your system:

  • Expects a message to contain information that is missing
  • Requires manual inspection of dependent resources before trying to reprocess the message

Escalation policy

To further increase your system’s resilience, you might establish an escalation policy among the three options. If a system requests a message to be requeued n times, start to delay the message. If the message is delayed another m times, send it to your dead letter queue. That’s what we have done.
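The escalation decision above can be captured in a small policy function. The sketch below is illustrative: the class name, thresholds, and counters are assumptions for the example, not Indeed's actual values.

```java
public class EscalationPolicy {
    public enum Action { REQUEUE, DELAY, DEAD_LETTER }

    private final int maxRequeues; // n: requeues before we start delaying
    private final int maxDelays;   // m: delays before we give up

    public EscalationPolicy(int maxRequeues, int maxDelays) {
        this.maxRequeues = maxRequeues;
        this.maxDelays = maxDelays;
    }

    /**
     * Decide what to do with a failed message, based on how many times it
     * has already been requeued and delayed.
     */
    public Action onFailure(int requeueCount, int delayCount) {
        if (requeueCount < maxRequeues) {
            return Action.REQUEUE;     // transient error: try again soon
        }
        if (delayCount < maxDelays) {
            return Action.DELAY;       // let downstream systems catch up
        }
        return Action.DEAD_LETTER;     // unprocessable: hand off for inspection
    }

    public static void main(String[] args) {
        EscalationPolicy policy = new EscalationPolicy(3, 2);
        System.out.println(policy.onFailure(0, 0)); // REQUEUE
        System.out.println(policy.onFailure(3, 1)); // DELAY
        System.out.println(policy.onFailure(3, 2)); // DEAD_LETTER
    }
}
```

In practice the retry counters would travel with the message, for example in message headers, so that any consumer instance can apply the same policy.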

This type of policy has reduced the work for our first responders, SREs, and operations team. We have been able to scale our application processing system as we process more and more candidate applications every day.


Automating Indeed’s Release Process

Indeed’s rapid growth has presented us with many challenges, especially to our release process. Our largely manual process did not scale and became a bottleneck. We decided to develop a custom solution. The lessons we learned in automating our process can be applied to any rapidly growing organization that wants to maintain software quality and developer goodwill.

How did we end up here?

Our software release process has four main goals:

  • Understand which features are being released
  • Understand cross-product and cross-team dependencies
  • Quickly fix bugs in release candidates
  • Record release details for tracking, analysis, and repeatability

Our process grew into a long sequence of manual issue-tracking, code review, and Git steps.

This process was comprehensive but required a lot of work. To put it in perspective, a software release with 4 new features required over 100 clicks and Git actions. Each new feature added about 13 actions to the process.

We identified four primary problems:

  • Release management took a lot of time.
  • It was hard to understand what exactly was in a release.
  • There was a lot of potential for error through so many manual steps.
  • Only senior engineers knew enough to handle a release.

We came to a realization: we needed more automation.

But wait — why not just simplify?

Of course, rather than automating our process, we could just simplify it. However, our process provided secondary benefits that we did not want to lose:

Data. Our process provided us with a lot of data and metrics, which allowed us to make continual improvements.

History. Our process allowed us to keep track of what was released and when it was released.

Transparency. Our process, while complicated, allowed us to examine each step.

Automating our way out

We realized that we could automate much of our process and reduce our overhead. To do so, we would need to integrate better with the solutions we already had in place — and be smart about it.

Our process uses multiple systems:

  • Atlassian JIRA: issue management and tracking
  • Atlassian Crucible: code reviews
  • Jenkins: release candidate builds and deploys
  • GitLab: source control
  • Various build and dependency management tools

Rather than replace these tools, we decided to create a unified release system that could communicate with each of them. We called this unified release system Control Tower.

Integration with dependency management tools allows release managers (RMs) to track new code coming in through library updates. RMs can quickly assess code interdependencies and see the progress of changes in a release. Finally, when an RM has checked everything, they can trigger a build through Jenkins.

The Control Tower main view allows RMs to see details from all the relevant systems. Changes are organized by JIRA issue key, and each change item includes links to Crucible code review information and Git repo locations.

By automating, we significantly reduced the amount of human interaction necessary in our release process. In the following image, every grey box represents a manual step that was eliminated.

After automating, we reduced the number of required clicks and Git actions from over 100 to fewer than 15. And new features now add no extra work, instead of requiring 13 extra actions.

To learn even more about Control Tower, see our Indeed Engineering tech talk. We talk about Control Tower starting at 32:45.

Lessons learned

In the process of creating our unified release system, we learned some valuable lessons.

Lesson 1: Automate the process you have, not the one you want

When we first set out to automate our release process, we did what engineers naturally do in such a situation — we studied the process to understand it as best as we could before starting. Then, we did what engineers also naturally do — we tried to improve it.

While it seemed obvious to “fix” the process while we were automating it, we learned that a tested, working process — even one with problems — is preferable to an untested one, no matter how slick. Our initial attempts at automation met with resistance because developers were unfamiliar with the new way.

Lesson 2: Automation can mean more than you think

When most people think of “automating” a process, they assume it means removing decisions from human actors — “set it and forget it.” But sometimes you can’t remove human interaction from a process. It might be too difficult technically, or you might want a human eye on a process to assure a correct outcome. Even in these situations, automation can come into play.

Sometimes automation means collecting and displaying data to help humans make decisions faster. We found that, even when we needed a human to make a choice, we were able to provide better data to help them make a more informed choice.

Deciding on the proper balance between human and machine action is key to automating. We see future opportunities for improvement by applying machine learning techniques to help humans make decisions even faster.

Lesson 3: Transparency, transparency, transparency

Engineers might not like inefficiency, but they also don’t like mystery. We wanted to avoid a “black box” process that does everything without giving insight as to how and why.

We provide abundant transparency through logging and messaging whenever we can. Allowing developers to examine what the process had done — and why — helped them to trust and adopt the automation solution. Logging also helps should anything go wrong.

Where do we go from here?

Even with our new system in place, we know that we can improve it. We are already working behind the scenes on the next steps.

We are developing algorithms that can monitor issue statuses, completed code reviews, build/test statuses, and other external factors. We can develop systems capable of programmatically understanding when a feature is ready for release. We can then automatically make the proper merge requests and set the release process in motion. This further reduces the time between creating and shipping a feature.

We can use machine learning techniques to take in vast amounts of data for use in our decision-making process. This can point out risky deploys and let us know if we need to spend extra effort testing or if we can deploy with minimal oversight.

Our release management system is an important step toward increasing our software output while maintaining the quality our customers expect. This system is a step, not the final goal. By continually improving our process, by learning as we go, we work toward our ultimate goal — helping even more people get jobs.


Gracefully Degrading Functionality Using Status

In a previous blog post, we described how to use our Status library to create a robust health check for your applications. In this follow-up, we show how you can check and degrade your application during an outage by:

  • short-circuiting code paths of your application
  • removing a single application instance from a data center load balancer
  • removing an entire data center from rotation at the DNS level

Evaluating application health

The Status library allows you to perform two different types of checks on a system — a single dependency check and a system-wide evaluation. A dependency is a system or service that your system requires in order to function.

During a single dependency check, the DependencyManager uses an evaluate method that takes the dependency’s ID and returns a CheckResult.

A CheckResult includes:

  • the health of the dependency
  • some basic information about the dependency
  • the time it took to evaluate the health of the dependency

The health is represented by CheckStatus, a Java enum whose values are OK, MINOR, MAJOR, and OUTAGE. The OUTAGE status indicates that the dependency is not usable.

final CheckResult checkResult = dependencyManager.evaluate("dependencyId");
final CheckStatus status = checkResult.getStatus();

The second approach to evaluating an application’s health is to look at the system as a whole. This gives you a high-level overview of how the entire system is performing. When a system is in OUTAGE, this indicates that the instance of an application is not usable.

final CheckResultSet checkResultSet = dependencyManager.evaluate();
final CheckStatus systemStatus = checkResultSet.getSystemStatus();

If a system is unhealthy, it’s often best to short-circuit requests made to the system and return an HTTP status code 500 (“Internal Server Error”). In the example below, we use an interceptor in Spring to capture the request, evaluate the system’s health, and respond with an error in the event that the application is in an outage.

public class SystemHealthInterceptor extends HandlerInterceptorAdapter {
    private final DependencyManager dependencyManager;

    public SystemHealthInterceptor(final DependencyManager dependencyManager) {
        this.dependencyManager = dependencyManager;
    }

    @Override
    public boolean preHandle(
            final HttpServletRequest request,
            final HttpServletResponse response,
            final Object handler
    ) throws Exception {
        final CheckResultSet checkResultSet = dependencyManager.evaluate();
        final CheckStatus systemStatus = checkResultSet.getSystemStatus();
        switch (systemStatus) {
            case OUTAGE:
                // Short-circuit the request while the system is down.
                response.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
                return false;
            default:
                return true;
        }
    }
}
Comparing the health of dependencies

CheckResultSet and CheckResult have methods for returning the current status of the system or the dependency, respectively. Once you have CheckStatus, there are a couple of methods that allow you to compare the results.

isBetterThan() determines if the current status is better than the provided status. This is an exclusive comparison.

CheckStatus.OK.isBetterThan(CheckStatus.OK)              // evaluates to false
CheckStatus.OK.isBetterThan(/* any other CheckStatus */) // evaluates to true

isWorseThan() determines if the current status is worse than the provided status. Again, this operation is exclusive.

CheckStatus.OUTAGE.isWorseThan(CheckStatus.OUTAGE)          // evaluates to false
CheckStatus.OUTAGE.isWorseThan(/* any other CheckStatus */) // evaluates to true

The isBetterThan() and isWorseThan() methods are great tools for checking whether an evaluated dependency is in a desired state. Unfortunately, these boolean comparisons alone do not offer enough control for graceful degradation: they can only tell you whether a status crosses a threshold, not which status to act on. To better control the graceful degradation of our system, two additional methods were needed.

noBetterThan() returns the unhealthier of the two statuses.

CheckStatus.MINOR.noBetterThan(CheckStatus.MAJOR) // returns CheckStatus.MAJOR
CheckStatus.MINOR.noBetterThan(CheckStatus.OK)    // returns CheckStatus.MINOR

noWorseThan() returns the healthier of the two statuses.

CheckStatus.MINOR.noWorseThan(CheckStatus.MAJOR) // returns CheckStatus.MINOR
CheckStatus.MINOR.noWorseThan(CheckStatus.OK)    // returns CheckStatus.OK
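Taken together, the four methods can be modeled on a plain Java enum whose constants are declared from healthiest to unhealthiest, so that ordinal order doubles as a severity order. This is a sketch of the semantics described above, not the library’s actual source:

```java
public enum Status {
    // Declared healthiest first, so a larger ordinal means a worse status.
    OK, MINOR, MAJOR, OUTAGE;

    public boolean isBetterThan(Status other) {
        return this.ordinal() < other.ordinal(); // exclusive: equal is not better
    }

    public boolean isWorseThan(Status other) {
        return this.ordinal() > other.ordinal(); // exclusive: equal is not worse
    }

    /** Returns the unhealthier of the two statuses. */
    public Status noBetterThan(Status other) {
        return this.ordinal() >= other.ordinal() ? this : other;
    }

    /** Returns the healthier of the two statuses. */
    public Status noWorseThan(Status other) {
        return this.ordinal() <= other.ordinal() ? this : other;
    }
}
```

Under this model, every example above holds: OK.isBetterThan(OK) is false, MINOR.noBetterThan(MAJOR) is MAJOR, and MINOR.noWorseThan(OK) is OK.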

During the complete system evaluation, we use a combination of these methods and the Urgency#downgradeWith() methods to gracefully degrade our application’s health.

By having the ability to inspect the outage state, engineers can dynamically toggle feature visibility based on the health of its corresponding dependency. Suppose that our service that provides company information was unable to reach its database. The service’s health check would change its state to MAJOR or OUTAGE. Our job search product would then omit the company widget from the right rail on the search results page. The core functionality that helps people find jobs would be unaffected.
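A toggle of that kind might look like the following sketch. The class, method, and threshold here are hypothetical, assuming a status enum ordered from healthiest to unhealthiest:

```java
public class CompanyWidget {
    // Ordered healthiest to unhealthiest, mirroring the library's statuses.
    public enum Health { OK, MINOR, MAJOR, OUTAGE }

    /**
     * Hypothetical toggle: render the widget only while its backing
     * dependency is no worse than MINOR.
     */
    public static boolean shouldRender(Health companyInfoHealth) {
        return companyInfoHealth.ordinal() <= Health.MINOR.ordinal();
    }

    public static void main(String[] args) {
        System.out.println(shouldRender(Health.MINOR)); // true: still shown
        System.out.println(shouldRender(Health.MAJOR)); // false: widget omitted
    }
}
```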



Status offers more than just the ability to control features based on a service’s health. We also use it to control access to instances of our front end web applications. When an instance is unable to service requests, we remove it from the load balancer until it is healthy again.

Instance level failovers

Generally, running multiple instances of your application in production is highly recommended. This helps keep your system resilient by allowing it to continue to handle requests even if a single instance of your application crashes. These instances can live on a single machine, across multiple machines, or even across multiple data centers.

The Status library lets you configure your load balancer to remove an instance if it becomes unhealthy. Consider the following basic example within a single data center.

  • When all of the application instances within a single data center are healthy, the load balancer distributes requests among them evenly. To determine whether an instance is healthy, the load balancer sends a request to its health check endpoint and evaluates the response code.
  • When an instance becomes unhealthy, its health check endpoint returns a non-200 status code, indicating that it should no longer receive traffic. The load balancer then removes the unhealthy instance from rotation, preventing it from receiving requests.
  • When instance 1 is removed from rotation, the other instances within the data center start to receive instance 1’s traffic. Within each data center, we provision enough instances to handle traffic even if some of them go down.
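The load balancer contract boils down to mapping the evaluated system status to an HTTP response code. A minimal sketch, assuming 200 for a servable instance and 503 (Service Unavailable) otherwise; the real endpoint and code choice may differ:

```java
public class HealthCheck {
    public enum SystemStatus { OK, MINOR, MAJOR, OUTAGE }

    /**
     * Map system health to the HTTP code the load balancer sees.
     * Anything non-200 takes the instance out of rotation.
     */
    public static int httpStatusFor(SystemStatus status) {
        switch (status) {
            case OUTAGE:
                return 503; // not usable: remove from rotation
            default:
                return 200; // possibly degraded, but still able to serve traffic
        }
    }
}
```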

Data center level failovers

Before a request is even sent to a data center, our domain is resolved to an IP address using DNS. We use a Global Server Load Balancer (GSLB), which allows us to geographically distribute traffic across our data centers. After the GSLB resolves the domain to the IP address of the nearest available data center, that data center’s load balancer routes and fails over traffic as described above.

What if an entire data center can no longer service requests? Similar to the single instance approach, GSLB constantly checks each of our data centers for their health (using the same health check endpoint). When GSLB detects that a single data center can no longer service requests, it fails requests over to another data center and removes the unhealthy data center from rotation. Again, this helps keep the site available by ensuring that requests can be processed, even during an outage.

As long as a single data center remains healthy, the site can continue to service requests. For users that hit unhealthy data centers, this just looks like a slower web page load. While not ideal, the experience is better than an unprocessed request.

The last scenario is a complete system outage. This occurs when every data center becomes unhealthy and can no longer service requests. Engineers try to avoid this situation like the plague.

When Indeed encounters complete system outages, we reroute traffic to every data center and every instance. This policy, known as “failing open,” allows for graceful degradation of our system. While every instance may report an unhealthy state, it is possible that an application can perform some work. And being able to perform some work is better than performing no work.
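Failing open can be sketched as a routing rule: prefer healthy targets, but if none exist, route to all of them rather than to none. The names below are illustrative:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FailOpenRouter {
    public static final class Target {
        final String name;
        final boolean healthy;

        public Target(String name, boolean healthy) {
            this.name = name;
            this.healthy = healthy;
        }
    }

    /**
     * Route to healthy targets; if every target is unhealthy, fail open and
     * route to all of them, since some may still do partial work.
     */
    public static List<Target> routable(List<Target> targets) {
        List<Target> healthy = targets.stream()
                .filter(t -> t.healthy)
                .collect(Collectors.toList());
        return healthy.isEmpty() ? targets : healthy;
    }

    public static void main(String[] args) {
        List<Target> all = List.of(new Target("dc-1", false), new Target("dc-2", false));
        System.out.println(routable(all).size()); // prints 2: failing open
    }
}
```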

Status works for Indeed and can work for you

The Status library is an integral part of the systems that we develop and run at Indeed. We use Status to:

  • quickly fail over application instances and data centers
  • detect when a deploy is going to fail before the code reaches a high traffic data center
  • keep our applications fast by failing requests quickly, rather than doing work we know will fail
  • keep our sites available by ensuring that only healthy instances of our applications service requests

To get started with Status, read our quick start guide and take a look at the samples. If you need help, you can reach out to us on GitHub or Twitter.
