Status: A Java Library For Robust System Status Health Checks

We are excited to highlight the open-source availability of Status, a Java library that reports a system’s status in a readable format. The Status library enables dynamic health checks and monitoring of system dependencies. In this post, we will show how to add health checks to your applications.

Why use system status health checks?

Health checks play an important role at Indeed. We set up and run large-scale services and applications every day. Health checks allow us to see the problematic components at an endpoint, rather than combing through logs.

In production, a health check can let us know when a service is unreachable, a file is missing, or the system cannot talk with the database. Additionally, these health checks provide a controlled way for developers to communicate issues to system administrators. In any of these situations, the application can evaluate its own health check and gracefully degrade behavior, rather than taking the entire system offline.

The Status library will capture stack traces from dependencies and return the results in a single location. This feature makes it easy to resolve issues as they arise in any environment. Typical dependencies include MySQL tables, MongoDB collections, RabbitMQ message queues, and API statuses.

System states

When dependencies fail, they affect the condition of the system. System states include:

  • OUTAGE – the system is unable to process requests;
  • MAJOR – the system can service some requests, but may fail for the majority;
  • MINOR – the system can service the majority of requests, but not all;
  • OK – the system should be able to process all requests.
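One way to picture how dependency failures roll up into these states is to order the states by severity and report the worst one observed. The sketch below assumes exactly that rule. The `ConditionRollup` class, its `Condition` enum, and the `worstOf` helper are hypothetical names for illustration, not the Status library's API, and the library's own roll-up may differ.

```java
import java.util.Arrays;
import java.util.List;

final class ConditionRollup {
    // Hypothetical mirror of the four system states, declared from healthiest
    // to least healthy so that ordinal() grows with severity.
    enum Condition { OK, MINOR, MAJOR, OUTAGE }

    // Assumption: the overall condition is the most severe one observed.
    static Condition worstOf(List<Condition> conditions) {
        Condition worst = Condition.OK; // no dependencies means nothing is failing
        for (Condition c : conditions) {
            if (c.ordinal() > worst.ordinal()) {
                worst = c;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        List<Condition> observed =
                Arrays.asList(Condition.OK, Condition.MINOR, Condition.OK);
        System.out.println(worstOf(observed)); // prints "MINOR"
    }
}
```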

Get started with Status

Follow these instructions to start using the Status library:

Extend the AbstractDependencyManager. The dependency manager will keep track of all your dependencies.

public class MyDependencyManager extends AbstractDependencyManager {
  public MyDependencyManager() {
    super("MyApplication");
  }
}

Extend PingableDependency for each component that your application requires to run.

import com.indeed.status.core.PingableDependency;

public class MyDependency extends PingableDependency {
  @Override
  public void ping() throws Exception {
    // Throw an exception if the component is unhealthy or unavailable
  }
}

Extending the PingableDependency class is the simplest way to incorporate a dependency into your application. Alternatively, you can extend AbstractDependency or ComparableDependency to get more control over the state of a dependency. You can control how your dependency affects the system’s condition by providing an Urgency level.
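As an illustration of what a ping() body might contain, the sketch below checks that a required file exists and is readable, and throws otherwise. To keep it runnable without the library on the classpath, it uses a hypothetical `Pingable` interface as a stand-in for `PingableDependency`. The class name `RequiredFileCheck` and the file path are also assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class RequiredFileCheck {
    // Hypothetical stand-in for the ping() contract, so the sketch compiles on its own.
    interface Pingable {
        void ping() throws Exception;
    }

    // Returns a check that throws when the file is missing or unreadable.
    static Pingable forFile(final Path file) {
        return () -> {
            if (!Files.isReadable(file)) {
                throw new IOException("Required file is missing or unreadable: " + file);
            }
        };
    }

    public static void main(String[] args) {
        Pingable check = forFile(Paths.get("config/app.properties")); // hypothetical path
        try {
            check.ping();
            System.out.println("healthy");
        } catch (Exception e) {
            System.out.println("unhealthy: " + e.getMessage());
        }
    }
}
```

In a real application the same check would live inside the ping() method of a class that extends PingableDependency.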

Add your new dependencies to your dependency manager.

dependencyManager.addDependency(myDependency);
...

For web-based applications and services, extend AbstractDaemonCheckReportServlet to create a servlet that reports the status of your application.

public class StatusServlet extends AbstractDaemonCheckReportServlet {
  private final AbstractDependencyManager manager;

  public StatusServlet(AbstractDependencyManager manager) {
    this.manager = manager;
  }

  @Override
  protected AbstractDependencyManager newManager(ServletConfig config) {
    return manager;
  }
}

Once this process is complete and your application is running, you should be able to access the servlet to read a JSON representation of your application status.

If the application is in an OUTAGE condition, the servlet returns a 500 status code. Otherwise, the servlet returns a 200, since the application can still process requests; in this case, the application may gracefully degrade less-critical functionality that depends on unhealthy code paths. Associating the health check outcome with an HTTP status code enables integration with systems (like Consul) that make routing decisions based on application health. Below is a sample response returned by the servlet.

{
  "hostname": "pitz.local",
  "duration": 19,
  "condition": "OUTAGE",
  "dcStatus": "FAILOVER",
  "appname": "crm.api",
  "catalinaBase": "/var/local/tomcat",
  "leastRecentlyExecutedDate": "2015-02-24T22:48:37.782-0600",
  "leastRecentlyExecutedTimestamp": 1424839717782,
  "results": {
    "OUTAGE": [{
      "status": "OUTAGE",
      "description": "mysql",
      "errorMessage": "Exception thrown during ping",
      "timestamp": 1424839717782,
      "duration": 18,
      "lastKnownGoodTimestamp": 0,
      "period": 0,
      "id": "mysql",
      "urgency": "Required: Failure of this dependency would result in complete system outage",
      "documentationUrl": "http://www.mysql.com/",
      "thrown": {
        "exception": "RuntimeException",
        "message": "Failed to communicate with the following tables: user_authorities, oauth_code, oauth_approvals, oauth_client_token, oauth_refresh_token, oauth_client_details, oauth_access_token",
        "stack": [
          "io.github.jpitz.example.MySQLDependency.ping(MySQLDependency.java:68)",
          "com.indeed.status.core.PingableDependency.call(PingableDependency.java:59)",
          "com.indeed.status.core.PingableDependency.call(PingableDependency.java:15)",
          "java.util.concurrent.FutureTask.run(FutureTask.java:262)",
          "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)",
          "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)",
          "java.lang.Thread.run(Thread.java:745)"
        ]
      },
      "date": "2015-02-24T22:48:37.782-0600"
    }],
    "OK": [{
      "status": "OK",
      "description": "mongo",
      "errorMessage": "ok",
      "timestamp": 1424839717782,
      "duration": 0,
      "lastKnownGoodTimestamp": 0,
      "period": 0,
      "id": "mongo",
      "urgency": "Required: Failure of this dependency would result in complete system outage",
      "documentationUrl": "http://www.mongodb.org/",
      "date": "2015-02-24T22:48:37.782-0600"
    }]
  }
}

This report includes these key fields to help you evaluate the health of a system and the health of a dependency:

  • condition – Identifies the current health of the system as a whole.
  • leastRecentlyExecutedDate – The last date and time that the report was updated.

Use these fields to inspect individual dependencies:

  • status – Identifies the health of the current dependency.
  • thrown – The exception that caused the dependency to fail.
  • duration – The length of time it took to evaluate the dependency’s health. Because the system caches the result of a dependency’s evaluation, this value can be 0.
  • urgency – The urgency of the dependency. Dependencies with a WEAK urgency may not need to be fixed immediately. Dependencies with a REQUIRED urgency must be fixed as soon as possible.
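The status-code rule described earlier (500 for OUTAGE, 200 for every other condition) is simple enough to sketch directly. The `StatusCodeSketch` class, its `Condition` enum, and the `httpStatusFor` method are hypothetical names used for illustration; they are not taken from the Status library.

```java
final class StatusCodeSketch {
    // Hypothetical mirror of the four system states; not the library's own type.
    enum Condition { OK, MINOR, MAJOR, OUTAGE }

    // Only a full outage is reported as a server error; degraded states still answer 200.
    static int httpStatusFor(Condition condition) {
        return condition == Condition.OUTAGE ? 500 : 200;
    }

    public static void main(String[] args) {
        for (Condition c : Condition.values()) {
            System.out.println(c + " -> " + httpStatusFor(c));
        }
    }
}
```

Under this rule, a system that polls the endpoint and routes on the HTTP status code would stop sending traffic only during a full outage.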

Learn more about Status

Stay tuned for a future post about using the Status library, in which we’ll show how to gracefully degrade unhealthy applications. To get started, read our quick start guide and take a look at the samples. If you need help, you can reach out to us on GitHub or Twitter.

Finding Great (and Profitable) Ideas in the Computer Science Literature

I spend quite a bit of time trawling through recent computer science papers, looking for anything algorithmic that might improve my team’s product and Help People Get Jobs. It’s been a mixed bag so far, often turning up a bunch of pretty math that won’t scale at Indeed. But looking through the computer science literature can pay off big, and more of us should use the research to up our game as software developers.

[Figure: word cloud generated by WordItOut from the term 'inverted index']

Why read a computer science paper

The first question you might ask is why? Most working developers, after all, simply never read any computer science papers. Many smart developers look at me blankly when I even suggest that they do a literature search. “You mean look on StackOverflow?”

The short answer: to get an edge on your problem (and occasionally on your competition or your peers).

Some academic is looking into some deep generalization of whatever problem you are facing. They are hungry (sometimes literally, on academic salaries) to solve problems, and they give away the solutions. They are publishing papers at a ferocious pace, because otherwise their tenure committees will invite them to explore exciting opportunities elsewhere. Academics think up good, implementable approaches and give them away for free. And hardly anyone notices or cares, which is madness. But a smart developer can sometimes leverage this madness for big payouts. The key is knowing how to find and read what academics write.

Finding computer science papers

Thousands of computer science papers are published each year. How do you find a computer science paper worth reading? As with so many questions in this new century, the answer is Google, specifically Google Scholar.

As near as I can tell, Google Scholar includes almost all the academic papers ever written, for free. Almost every computer science paper since Alan Turing is accessible there. With Scholar, Google is providing one of the most amazing resources anyone has ever given away. Some links point to papers behind paywalls, but almost all those have extra links to copies that aren’t. I’ve read hundreds of papers and never paid for one.

Google doesn’t even attempt to monetize it. Nobody in the general public has heard about scholar.google.com. More surprisingly: according to my Google contacts, not many Googlers have heard about it either.

With Google Scholar, you’ve solved the problem of finding interesting papers.

Filtering computer science papers

Next, the problem is filtering and prioritizing the interesting papers you find.

Google Scholar search algorithms are powerful, but they aren’t magic. Even your best search skills will net you too many papers to read and understand. The chance that you are reading the one that will most help your work is small.

Here’s my basic strategy for quickly finding the best ones.

First, figure out the paper’s publication date. This seems like an obvious bit of metadata, but you’ll rarely find the date on the paper itself. Instead, look for clues in Google Scholar. You can also assume that it’s two years after the latest paper listed in the citations. This seems sloppy, but it’s effective. Computer science papers older than fifteen years are unlikely to contain anything of value beyond historical interest.

Next, read the first paragraph of the paper. This paragraph covers the problem the researchers are trying to solve, and why it’s important. If that problem sounds like yours, score! Otherwise, unless the authors have hooked you on the intrinsic interest of their results, dump it and move on to the next paper.

If things still seem promising, read the second paragraph. This paragraph covers what the authors did, describes some constraints, and lets you know the results (in broad strokes). If you can replicate what they did in your environment, accept the constraints, and the results are positive, awesome. You’ve determined the paper is worth reading!

How to read a computer science paper

The biggest trick to reading an academic paper is to know what to read and what not to read. Academic papers follow a structure only slightly more flexible than that of a sonnet. Some portions that look like they would help you understand will likely only confuse. Others that look pointless or opaque can hold the secrets to interpreting the paper’s deeper meanings.

Here’s how I like to do it.

Don’t read the abstract. The abstract conveys the gist of the paper to other researchers in the field. These are folks who’ve spent the last decade thinking about similar problems. You’re not there yet. The abstract will likely confuse you and possibly frighten you, but won’t help you understand the topic.

Don’t read the keywords. Adding keywords to papers was a bad idea that nonetheless seems to have stuck. Keywords tend to mislead and won’t add anything you wouldn’t get otherwise. Skip ’em, they’re not worth their feed.

Read the body of the paper closely. Do you remember the research techniques your teachers tried to drum into you in eighth grade? You’ll need them all. You’re trying to reverse engineer just what the researchers did and how they did it. This can be tricky. Papers tend to leave out many shared assumptions behind the research, as well as many details and small missteps. Read every word. Look up phrases or words you don’t know — Wikipedia is usually fine for this. Write down questions. Try to figure out not just what the researchers did, but what they didn’t do, and why.

Don’t read the code. This is counterintuitive, because the clearest way software developers communicate is through code — ideally with documentation, revision history, cross-references, test cases, and review comments.

It doesn’t work that way with academics. To a first approximation, code in academic papers is worthless. The skills necessary to code well are either orthogonal to or actively opposed to the skills necessary for interesting academic research. It’s a minor scandal that most code used in academia is unreviewed, not version-controlled, lacks any test cases, and is debugged only to the point of “it didn’t crash, mostly, today.” That’s the good stuff. The bad stuff is simply unavailable, and quite probably long-deleted by the time the paper got published. Yes, that’s atrocious. Yes, even in computer science.

Read the equations. Academics get mathematics, so their equations have all the virtues that software developers associate with the best software: precision, correctness, conciseness, evocativeness. Teams of smart people trying to find flaws offer painstaking reviews of the equations. In contrast, a bored grad student writes the code, which nobody reads.

Don’t read the conclusions section. It adds nothing.

Leveraging a computer science paper for further search

Academic papers offer a bounty of contextual data in references to other papers. Google Scholar excels at finding papers, but there’s no substitute for actually following the papers that researchers used to inform their work.

Follow the citations in the related work. Authors put evocative descriptions of the work that matters to them in “Related Work.” This provides an interesting contrast for interpreting their work. In some ways, this section memorializes the most important social aspects of academic work.

Follow the citations in the references. Long before HTML popularized hypertext, academic papers formed a dense thicket of cross-references, reified as citations. For even the best papers, half of the value is the contents, half is the links. Citations in papers aren’t clickable (yet), but following them is not hard with Google Scholar.

Repeated citations of older papers? There’s a good chance those are important in the field and useful for context. Repeated citations of new papers? Those papers give insight into the trajectory of the subject. Odd-sounding papers with unclear connections to the subject? They are great for getting the sort of mental distance that can be useful in hypothesis generation.

Once you’ve done all that…

It’s just a simple matter of coding. Get to it!

Dave Griffith has been building software systems for over 20 years.

Vectorized VByte Decoding: High Performance Vector Instructions

Data-driven organizations like Indeed need great tools. We built Imhotep, our interactive data analytics platform (released last year), to manage the parallel execution of queries. To balance memory efficiency and performance in Imhotep, we developed a technique called vectorized variable-byte (VByte) decoding.

VByte with differential encoding

Many applications use VByte and differential encoding to compress sorted sequences of integers. The most common compression method for inverted indexes uses this style of encoding. Differential encoding stores the successive differences between integers instead of the integers themselves, which keeps the stored values small. VByte then uses fewer bytes for smaller integers at the cost of using more bytes for larger integers.
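To make the encoding concrete, here is a minimal sketch of differential encoding followed by VByte. It follows the byte layout given in the paper's abstract quoted later in this post (the low 7 bits carry data, and the high bit is 1 on every byte except the last). Writing the least significant 7-bit group first is an assumption, as is non-negative, sorted input. The class and method names are hypothetical and are not Imhotep's API.

```java
import java.io.ByteArrayOutputStream;

final class DeltaVByteEncoder {
    // Appends one non-negative value as VByte: 7 data bits per byte,
    // with the high bit set on all but the last byte.
    static void writeVByte(int value, ByteArrayOutputStream out) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte: high bit is 0
    }

    // Encodes a sorted sequence as successive differences, each written as a VByte.
    static byte[] encodeSorted(int[] sorted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int current : sorted) {
            writeVByte(current - previous, out);
            previous = current;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] docIds = {1000, 1003, 1010, 1300};
        byte[] encoded = encodeSorted(docIds);
        // The deltas are 1000, 3, 7, 290, which need 2, 1, 1, and 2 bytes.
        System.out.println(encoded.length + " bytes instead of " + (4 * docIds.length));
        // prints "6 bytes instead of 16"
    }
}
```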

A conventional VByte decoder examines only one byte at a time, which limits throughput. Each input byte also requires one branch, and because the lengths of the encoded integers are unpredictable, those branches are frequently mispredicted.
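The per-byte branch is easy to see in a conventional scalar decoder. The sketch below is one such decoder, not the masked VByte decoder from the paper. It follows the layout from the paper's abstract quoted later in this post, assumes the least significant 7-bit group comes first, and reverses the differential step with a running sum. The names are illustrative only.

```java
final class ScalarVByteDecoder {
    // Decodes `count` delta-encoded VByte integers from `in`
    // back into the original sorted values.
    static int[] decodeSorted(byte[] in, int count) {
        int[] out = new int[count];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = in[pos++];
                delta |= (b & 0x7F) << shift; // place the next 7 data bits
                shift += 7;
            } while ((b & 0x80) != 0); // data-dependent branch, evaluated once per input byte
            previous += delta; // undo the differential step
            out[i] = previous;
        }
        return out;
    }

    public static void main(String[] args) {
        // 0xAC 0x02 encodes 300; 0x05 encodes a delta of 5, giving 305.
        byte[] encoded = {(byte) 0xAC, 0x02, 0x05};
        int[] decoded = decodeSorted(encoded, 2);
        System.out.println(decoded[0] + ", " + decoded[1]); // prints "300, 305"
    }
}
```

Whether the loop continues depends on the high bit of each byte as it is read, so the processor cannot know ahead of time where one integer ends and the next begins.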

Vectorized VByte decoding

Our masked VByte decoder processes larger chunks of input data (12 bytes) at one time, which is much faster than decoding one byte at a time. This is important for Indeed because Imhotep spends about 40% of its CPU time decoding variable-byte integers. We described this approach in a tech talk last year: Large Scale Analytics and Machine Learning at Indeed.

Jeff Plaisance (Indeed), Nathan Kurz (Verse Communications), and Daniel Lemire (LICEF, Université du Québec) discuss the masked VByte decoder in detail in Vectorized VByte Decoding. The paper’s abstract follows:

We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (MASKED VBYTE) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.

Vectorized VByte Decoding has been accepted to the International Symposium on Web Algorithms (iSWAG), which takes place on June 2-3, 2015. iSWAG promotes academic and industrial research on all topics related to web algorithms.

Large-scale interactive tools

To learn more about Imhotep, check out these tech talks and slides: Scaling Decision Trees and Large-Scale Analytics with Imhotep. You can find the source and documentation for Imhotep on GitHub.