Gracefully Degrading Functionality Using Status

In a previous blog post, we described how to use our Status library to create a robust health check for your applications. In this follow-up, we show how you can check and degrade your application during an outage by:

  • short-circuiting code paths of your application
  • removing a single application instance from a data center load balancer
  • removing an entire data center from rotation at the DNS level

Evaluating application health

The Status library allows you to perform two different types of checks on a system — a single dependency check and a system-wide evaluation. A dependency is a system or service that your system requires in order to function.

During a single dependency check, the DependencyManager’s evaluate method takes the dependency’s ID and returns a CheckResult.

A CheckResult includes:

  • the health of the dependency
  • some basic information about the dependency
  • the time it took to evaluate the health of the dependency

The health of the dependency is reported as a CheckStatus, a Java enum that is one of OK, MINOR, MAJOR, or OUTAGE. The OUTAGE status indicates that the dependency is not usable.

final CheckResult checkResult = dependencyManager.evaluate("dependencyId");
final CheckStatus status = checkResult.getStatus();

The second approach to evaluating an application’s health is to look at the system as a whole. This gives you a high-level overview of how the entire system is performing. A system status of OUTAGE indicates that this instance of the application is not usable.

final CheckResultSet checkResultSet = dependencyManager.evaluate();
final CheckStatus systemStatus = checkResultSet.getSystemStatus();

If a system is unhealthy, it’s often best to short-circuit requests made to the system and return HTTP status code 500 (“Internal Server Error”). In the example below, we use an interceptor in Spring to capture the request, evaluate the system’s health, and respond with an error if the application is in an outage.

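import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.springframework.http.HttpStatus;
import org.springframework.web.servlet.handler.HandlerInterceptorAdapter;

// Status library types; the package name is assumed here.
import com.indeed.status.core.CheckResultSet;
import com.indeed.status.core.CheckStatus;
import com.indeed.status.core.DependencyManager;

public class SystemHealthInterceptor extends HandlerInterceptorAdapter {
    private final DependencyManager dependencyManager;

    // The final field must be assigned, so the manager is injected here.
    public SystemHealthInterceptor(final DependencyManager dependencyManager) {
        this.dependencyManager = dependencyManager;
    }

    @Override
    public boolean preHandle(
            final HttpServletRequest request,
            final HttpServletResponse response,
            final Object handler
    ) throws Exception {
        // Evaluate every dependency and derive the overall system status.
        final CheckResultSet checkResultSet = dependencyManager.evaluate();
        final CheckStatus systemStatus = checkResultSet.getSystemStatus();

        switch (systemStatus) {
            case OUTAGE:
                // Short-circuit: respond with a 500 and skip the handler.
                response.setStatus(HttpStatus.INTERNAL_SERVER_ERROR.value());
                return false;
            default:
                break;
        }

        // Any status better than OUTAGE lets the request proceed.
        return true;
    }
}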

Comparing the health of dependencies

CheckResultSet and CheckResult have methods for returning the current status of the system or the dependency, respectively. Once you have a CheckStatus, a couple of methods allow you to compare it with another status.

isBetterThan() determines if the current status is better than the provided status. This is an exclusive comparison.

CheckStatus.OK.isBetterThan(CheckStatus.OK)              // evaluates to false
CheckStatus.OK.isBetterThan(/* any other CheckStatus */) // evaluates to true

isWorseThan() determines if the current status is worse than the provided status. Again, this operation is exclusive.

CheckStatus.OUTAGE.isWorseThan(CheckStatus.OUTAGE)          // evaluates to false
CheckStatus.OUTAGE.isWorseThan(/* any other CheckStatus */) // evaluates to true

The isBetterThan() and isWorseThan() methods are great tools to check for a desired state of an evaluated dependency. Unfortunately, these methods alone do not offer enough control to produce a graceful degradation: with only yes-or-no comparisons, the system was either healthy or in an outage. To better control the graceful degradation of our system, we needed two additional methods.

noBetterThan() returns the unhealthier of the two statuses.

CheckStatus.MINOR.noBetterThan(CheckStatus.MAJOR) // returns CheckStatus.MAJOR
CheckStatus.MINOR.noBetterThan(CheckStatus.OK)    // returns CheckStatus.MINOR

noWorseThan() returns the healthier of the two statuses.

CheckStatus.MINOR.noWorseThan(CheckStatus.MAJOR) // returns CheckStatus.MINOR
CheckStatus.MINOR.noWorseThan(CheckStatus.OK)    // returns CheckStatus.OK

During the complete system evaluation, we use a combination of these methods and the Urgency#downgradeWith() methods to gracefully degrade our application’s health.
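
To illustrate how these pieces could fit together, here is a minimal, self-contained sketch. It does not use the Status library itself: SketchStatus re-implements the comparison semantics described above, and the SketchUrgency levels (REQUIRED, STRONG, WEAK, NONE) with their caps are assumptions made for illustration, as are the class and dependency names.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for CheckStatus, ordered from healthiest to unhealthiest.
// Illustrative only; this is not the Status library's source code.
enum SketchStatus {
    OK, MINOR, MAJOR, OUTAGE;

    // Both comparisons are exclusive: a status is never better or worse than itself.
    boolean isBetterThan(SketchStatus other) { return ordinal() < other.ordinal(); }
    boolean isWorseThan(SketchStatus other) { return ordinal() > other.ordinal(); }

    // Returns the unhealthier of the two statuses.
    SketchStatus noBetterThan(SketchStatus other) { return isWorseThan(other) ? this : other; }

    // Returns the healthier of the two statuses.
    SketchStatus noWorseThan(SketchStatus other) { return isBetterThan(other) ? this : other; }
}

// Hypothetical urgency levels: each caps how far one dependency can drag the system down.
enum SketchUrgency {
    REQUIRED(SketchStatus.OUTAGE), // a required dependency can take the system fully down
    STRONG(SketchStatus.MAJOR),
    WEAK(SketchStatus.MINOR),
    NONE(SketchStatus.OK);         // purely informational dependency

    private final SketchStatus worstAllowed;

    SketchUrgency(SketchStatus worstAllowed) { this.worstAllowed = worstAllowed; }

    // The dependency's status, limited to the worst status this urgency permits.
    SketchStatus downgradeWith(SketchStatus dependencyStatus) {
        return dependencyStatus.noWorseThan(worstAllowed);
    }
}

final class SketchDependency {
    final String id;
    final SketchUrgency urgency;
    final SketchStatus status;

    SketchDependency(String id, SketchUrgency urgency, SketchStatus status) {
        this.id = id;
        this.urgency = urgency;
        this.status = status;
    }
}

final class SystemStatusSketch {
    // The system is as unhealthy as its worst dependency, after urgency caps are applied.
    static SketchStatus systemStatus(List<SketchDependency> dependencies) {
        SketchStatus system = SketchStatus.OK;
        for (SketchDependency dependency : dependencies) {
            system = system.noBetterThan(dependency.urgency.downgradeWith(dependency.status));
        }
        return system;
    }

    public static void main(String[] args) {
        List<SketchDependency> dependencies = Arrays.asList(
                new SketchDependency("database", SketchUrgency.REQUIRED, SketchStatus.OK),
                new SketchDependency("company-service", SketchUrgency.WEAK, SketchStatus.OUTAGE));
        System.out.println(systemStatus(dependencies)); // prints MINOR
    }
}
```

In this sketch, an outage in a weakly required dependency degrades the system to MINOR instead of taking it out of service, which is the kind of graceful degradation the comparison methods make possible.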

Because they can inspect the outage state, engineers can dynamically toggle a feature’s visibility based on the health of its corresponding dependency. Suppose that our service that provides company information was unable to reach its database. The service’s health check would change its state to MAJOR or OUTAGE. Our job search product would then omit the company widget from the right rail on the search results page. The core functionality that helps people find jobs would be unaffected.

Healthy

Unhealthy (Gracefully)
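
The company widget scenario above can be sketched as a simple feature toggle. This is a hedged illustration rather than our production code: CompanyServiceStatus stands in for CheckStatus, and the class, method, and widget names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for CheckStatus, ordered from healthiest to unhealthiest.
enum CompanyServiceStatus { OK, MINOR, MAJOR, OUTAGE }

final class SearchPageSketch {
    // Builds the list of right-rail widgets for the search results page.
    // The company widget is omitted when its dependency reports MAJOR or OUTAGE.
    static List<String> rightRailWidgets(CompanyServiceStatus companyServiceStatus) {
        List<String> widgets = new ArrayList<>();
        widgets.add("job-alerts"); // hypothetical widget with no unhealthy dependency

        // Enum constants compare in declaration order, so "< MAJOR" means OK or MINOR.
        boolean companyServiceUsable =
                companyServiceStatus.compareTo(CompanyServiceStatus.MAJOR) < 0;
        if (companyServiceUsable) {
            widgets.add("company-info");
        }
        return widgets;
    }

    public static void main(String[] args) {
        System.out.println(rightRailWidgets(CompanyServiceStatus.OK));
        System.out.println(rightRailWidgets(CompanyServiceStatus.OUTAGE));
    }
}
```

The search results themselves never consult the company service’s status, so the core job search path is untouched by the toggle.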

Status offers more than just the ability to control features based on a service’s health. We also use it to control access to instances of our front end web applications. When an instance is unable to service requests, we remove it from the load balancer until it is healthy again.

Instance level failovers

Generally, running multiple instances of your application in production is highly recommended. This helps keep your system resilient by allowing it to continue to handle requests even if a single instance of your application crashes. These instances of your application can live on a single machine, multiple machines, and even in multiple data centers.

The Status library lets you configure your load balancer to remove an instance if it becomes unhealthy. Consider the following basic example within a single data center.

When all of the application instances within a single data center are healthy, the load balancer distributes requests among them evenly. To determine if an instance is healthy, the load balancer sends a request to the health check endpoint and evaluates the response code.

When an instance becomes unhealthy, the health check endpoint returns a non-200 status code, indicating that it should no longer receive traffic. The load balancer then removes the unhealthy instance from rotation, preventing it from receiving requests.

When an instance (say, instance 1) is removed from rotation, the other instances within the data center start to receive its traffic. Within each data center, we provision enough instances so that we can handle traffic even if some of the instances go down.
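
As a rough sketch of this handshake, the code below maps a system status to a health check response code and keeps only the instances that answer 200. It assumes that only OUTAGE produces a non-200 response, which this post does not specify; InstanceStatus stands in for CheckStatus, and all class, method, and instance names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for CheckStatus.
enum InstanceStatus { OK, MINOR, MAJOR, OUTAGE }

final class HealthCheckSketch {
    // What a health check endpoint might return for a given system status.
    // Assumption: only OUTAGE takes the instance out of rotation.
    static int httpCodeFor(InstanceStatus systemStatus) {
        return systemStatus == InstanceStatus.OUTAGE ? 500 : 200;
    }

    // What a load balancer does with the probe results: keep instances answering 200.
    static List<String> instancesInRotation(Map<String, InstanceStatus> probes) {
        List<String> inRotation = new ArrayList<>();
        for (Map.Entry<String, InstanceStatus> probe : probes.entrySet()) {
            if (httpCodeFor(probe.getValue()) == 200) {
                inRotation.add(probe.getKey());
            }
        }
        return inRotation;
    }

    public static void main(String[] args) {
        Map<String, InstanceStatus> probes = new LinkedHashMap<>();
        probes.put("instance-1", InstanceStatus.OUTAGE);
        probes.put("instance-2", InstanceStatus.OK);
        probes.put("instance-3", InstanceStatus.MINOR);
        System.out.println(instancesInRotation(probes)); // prints [instance-2, instance-3]
    }
}
```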

Data center level failovers

Before a request is even sent to a data center, our domain (e.g. www.indeed.com) is resolved to an IP address using DNS. We use a Global Server Load Balancer (GSLB), which allows us to geographically distribute traffic across our data centers. After the GSLB resolves the domain to the IP address of the nearest available data center, the data center load balancer then routes and fails over traffic as described above.

What if an entire data center can no longer service requests? Similar to the single instance approach, the GSLB constantly checks the health of each of our data centers (using the same health check endpoint). When the GSLB detects that a data center can no longer service requests, it fails requests over to another data center and removes the unhealthy data center from rotation. Again, this helps keep the site available by ensuring that requests can be processed, even during an outage.

As long as a single data center remains healthy, the site can continue to service requests. For users that hit unhealthy data centers, this just looks like a slower web page load. While not ideal, the experience is better than an unprocessed request.

The last scenario is a complete system outage. This occurs when every data center becomes unhealthy and can no longer service requests. Engineers try to avoid this situation like the plague.

When Indeed encounters complete system outages, we reroute traffic to every data center and every instance. This policy, known as “failing open,” allows for graceful degradation of our system. While every instance may report an unhealthy state, it is possible that an application can perform some work. And being able to perform some work is better than performing no work.
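
The failing open policy can be sketched in a few lines. This is an illustration under assumptions, not our GSLB or load balancer configuration: health is reduced to a boolean per target, and the class, method, and target names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FailOpenSketch {
    // Returns the targets (data centers or instances) that should receive traffic.
    // Normally only healthy targets qualify. If none are healthy, fail open and
    // route to every target, since doing some work beats doing none.
    static List<String> routableTargets(Map<String, Boolean> healthByTarget) {
        List<String> healthy = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : healthByTarget.entrySet()) {
            if (Boolean.TRUE.equals(entry.getValue())) {
                healthy.add(entry.getKey());
            }
        }
        if (healthy.isEmpty()) {
            return new ArrayList<>(healthByTarget.keySet()); // complete outage: fail open
        }
        return healthy;
    }

    public static void main(String[] args) {
        Map<String, Boolean> health = new LinkedHashMap<>();
        health.put("dc-a", false);
        health.put("dc-b", false);
        System.out.println(routableTargets(health)); // prints [dc-a, dc-b]
    }
}
```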

Status works for Indeed and can work for you

The Status library is an integral part of the systems that we develop and run at Indeed. We use Status to:

  • quickly fail over application instances and data centers
  • detect when a deploy is going to fail before the code reaches a high traffic data center
  • keep our applications fast by failing requests quickly, rather than doing work we know will fail
  • keep our sites available by ensuring that only healthy instances of our applications service requests

To get started with Status, read our quick start guide and take a look at the samples. If you need help, you can reach out to us on GitHub or Twitter.

New Eng Manager at Indeed? First: Write Some Code

I joined Indeed in March 2016 as an “industry hire” manager for software engineers. At Indeed, engineering managers act as individual contributors (ICs) before taking on more responsibilities. Working with my team as an IC prepared me to be a more effective manager.

Before my first day, I talked with a few engineering managers about what to expect. They advised that I would spend about 3-6 months contributing as an individual developer. I would write unit tests and code, commit changes, do code reviews, fix bugs, write documentation, and more.

I was excited to hear about this approach, because in my recent years as an engineering manager, I had grudgingly stopped contributing at the code level. Instead, I lived vicariously through others by doing code reviews, participating in technical design reviews, and creating utilities and tools that boosted team productivity.

When new managers start in the Indeed engineering organization as an IC, they can rotate through several different teams or stay with a single team for about a quarter. I was in the latter camp and joined a team that works on revenue management.

Onboarding as an individual contributor

My manager helped to onboard me and directed me to self-guided coursework on our wiki. I was impressed by the amount of content provided to familiarize new hires with the tools and technologies we use at Indeed. In my experience, most companies don’t invest enough in creating and maintaining useful documentation. Equally valuable, fellow Indeedians gladly answered my questions and helped me to get unblocked when I encountered technical hurdles. I really appreciated that support as a new employee.

During my time as an IC, I had no management responsibilities. That was a change for me, and it was wonderful! I focused on code. I built technical competence and knocked the rust off mental processes that I hadn’t needed to use for a while. I observed practices and processes used by the team to learn how I could become equally productive. I had a chance to dive deeper into Git usage. I wrote unit and DAO tests to increase code coverage. I learned how to deploy code into the production environment. For the first time in a long while, I wrote production code for new features in a product.

To accelerate my exposure to the 20 different projects owned by my team, I asked to be included on every code review. I knew I wouldn’t be able to contribute to all of the projects, but I wanted to be exposed to as many as possible. Prior to my request, the developer typically selected a few people to do a code review and nominated one to be the “primary” reviewer. Because I was included in every review, I saw code changes and the comments left by team members on how to improve the code. I won’t claim I understood everything I read in every code review, but I did gain an appreciation for the types of changes. I recommend this approach to every new member of a team, not just managers.

Other activities helped me integrate with people outside of my team. For example, I scheduled lunch meetings with everyone who had interviewed me. These were mostly other engineering managers, but I also met with folks in program management and technical writing. Everyone I contacted was open to meeting me. These lunch meetings allowed me to get a feel for different roles; how they planned and prioritized work; their thoughts on going from IC to manager; and challenges that they had faced during their tenure at Indeed. On-site lunches (with great food, by the way) allowed me to meet both engineering veterans and people in other departments.

Transitioning into a managerial role

By the time I was close to the end of my first full quarter, I had contributed to several projects. I had been exposed to some of the important systems owned by my team. Around this time, my manager and I discussed my transition into a managerial role. We agreed that I had established enough of a foundation to build on. I took over 1-on-1 meetings, quarterly reviews, team meetings, and career growth discussions.

Maintaining a technical focus

Many software engineers who take on management roles struggle with the idea of giving up writing code. But in a leadership position, what matters more is engaging the team on a technical level. This engagement can take a variety of forms. Engineering managers at Indeed coach their teams on abstract skills and technical decisions. When managers have a deeper understanding of the technology, they can be more effective in their role.

I am glad that I had a chance to start as an IC so that I could earn my team’s trust and respect. A special shout out to the members of the Money team: Akbar, Ben, Cheng, Erica, Kevin, Li, and Richard.

Finding Anomalies in User Behavior with Python

In the course of helping over 200 million unique visitors every month find jobs, we end up with a lot of data. The data we collect can tell us a lot about the behavior of our users, and for the most part we observe predictable patterns in that behavior. But unexpected changes could be evidence of failures in our system or an actual shift in user behavior. When the data shows something strange, we want to understand.

Identifying and acting on anomalies in user behavior is a complex problem. To help detect these anomalies, we take advantage of several open-source libraries, chief among them Twitter’s AnomalyDetection library.

Observing anomalies in user behavior

Detecting anomalies in a large set of data can be easy if that data follows a regular, predictable pattern over time. If we saw the same range of pageviews for our job listings every day, as simulated in Figure 1, it would be easy to identify outliers.

Figure 1. Single outlier

But most of the data we collect is determined by user behavior, and such data does not follow patterns that we can easily observe. A variety of factors influence user behavior. For example, each of the following factors might affect what we consider a “normal” range of pageviews, depending on the geographic location of the user:

  • What day of the week is it?
  • What time of day is it?
  • Is it a holiday?

We might understand some factors in advance. We might understand others after analyzing the data. We might never fully understand some.

Our anomaly detection should account for as many variations as possible, but still be precise enough to identify statistically significant outliers. Simply saying “traffic is normally higher on Monday morning” is too fuzzy: How much higher? For how long?

Figure 2 shows a variable range of expected data, while Figure 3 shows a range of actual data. Anomalies within the actual data are not immediately visible.

Figure 2. Expected data

Figure 3. Actual data

Figure 4 shows the actual and expected data overlaid. Figure 5 shows the difference between the two at any point in time. Viewing the data in this way highlights the significant anomaly in the final data point of the sequence.

Figure 4. Expected and actual overlaid

Figure 5. Difference between actual and expected data

We needed a sophisticated method to quickly identify and report on anomalies such as these. This method had to be able to analyze existing data to predict expected future data. And it had to be accurate enough so that we wouldn’t miss anomalies requiring action.

Step one: Solving the statistical problem

The hard mathematical part was actually easy, because Twitter already solved the problem and open sourced their AnomalyDetection library.

From the project description:

AnomalyDetection is an open-source R package to detect anomalies which is robust, from a statistical standpoint, in the presence of seasonality and an underlying trend. The AnomalyDetection package can be used in wide variety of contexts. For example, detecting anomalies in system metrics after a new software release, user engagement post an A/B test, or for problems in econometrics, financial engineering, political and social sciences.

To create this library, Twitter started with an extreme studentized deviate (ESD) test, also known as Grubbs’ test, and improved it to handle user behavior data. Originally, the test used the mean value for a set of data to identify outliers. Twitter’s developers realized that using the median value was more precise for web use, as user behavior can be volatile over time.
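
To make the difference concrete, the classic ESD statistic scores the most extreme point by its distance from the sample mean, in units of the sample standard deviation. A median-based variant is sketched next to it, under the assumption (not stated in this post) that the spread is measured with the median absolute deviation (MAD). The symbols are ours, chosen for illustration.

```latex
% Classic ESD statistic: sample mean \bar{x} and sample standard deviation s
R_{\text{mean}} = \max_i \frac{\lvert x_i - \bar{x} \rvert}{s}

% Median-based variant: median \tilde{x} and median absolute deviation
R_{\text{median}} = \max_i \frac{\lvert x_i - \tilde{x} \rvert}{\mathrm{MAD}},
\qquad
\mathrm{MAD} = \operatorname{median}_i \, \lvert x_i - \tilde{x} \rvert
```

A single extreme value shifts the mean and inflates the standard deviation, which can hide the very outlier being tested. The median and MAD barely move in that case, which is why a median-based score tends to hold up better on volatile data.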

The result is a resource that makes it easy to quickly identify anomalous results in time-based datasets with variable trends. Twitter’s data scientists use the library to perform their own internal analysis and reporting. For example, they report on tweets per second and the CPU utilization of their internal servers.

Twitter’s library allowed us to use historical data to estimate a highly complicated set of user behavior and quickly identify anomalous behavior. There was only one problem: Twitter wrote the library using R, while our internal alerting systems are implemented in Python.

We decided to port Twitter’s library to Python so that it would work directly with our code.

Step two: Porting the library

Of course, porting code from one language to another always involves some refactoring and problem solving. Most of our work in porting the AnomalyDetection library dealt with differences between how math is supported in R and in Python.

Twitter’s code relies on several math functions that are not natively supported in Python. The most important is a seasonal decomposition algorithm called seasonal and trend decomposition using loess (STL).

We were able to incorporate STL from the open-source pyloess library. We found many of the other math functions that Twitter used in the numpy and scipy libraries. This left us with only a few unsupported math functions, which we ported to our library directly by reading the R code and replicating the functionality in Python.

Taking advantage of the excellent work done by our neighbors in the open-source community allowed us to greatly reduce the effort required in porting the code. By using the pyloess, numpy, and scipy libraries to replicate the R math functions Twitter used, one developer completed most of the work in about a week.

Open-source Python AnomalyDetection

We participate in the open source community because, as engineers, we recognize the value in adapting and learning from the work of others. We are happy to make our Python port of AnomalyDetection available as open source. Download it, try it out, and reach out to us on GitHub or Twitter if you need any help.
