Friendly Machine Learning

At Indeed, machine learning is key to our mission of helping people get jobs. Machine learning lets us collect, sort, and analyze millions of job postings a day. In this post, we’ll describe our open-source Java wrapper for a particularly useful machine learning library, and we’ll explain how you can benefit from our work.

Challenges of machine learning

It’s not easy to build a machine learning system. A good system needs to do several things right:

  • Feature engineering. For example, converting text to a feature vector requires you to precalculate statistics about words. This process can be challenging.
  • Model quality. Most algorithms require hyperparameter tuning, which is usually done through grid search. This process can take hours, making it hard to iterate quickly on ideas.
  • Model training for large datasets. The implementations of most algorithms assume that the entire dataset fits in memory in a single process. Extremely large datasets, like those we work with at Indeed, are harder to train on.

Wabbit to the rescue

Fortunately for us, an excellent machine learning system that meets those needs already exists. John Langford, a computer science researcher from Microsoft, possesses a rare combination of excellence in machine learning theory and programming. His command line tool, Vowpal Wabbit (VW), implements state-of-the-art techniques for building generalized linear models and includes useful features such as a flexible input data format. VW has garnered a lot of attention in the machine learning community and enjoys success in industry.

Benefits of Vowpal Wabbit

At Indeed, we use VW to build models that help discover new job sites, improve quality of search results, and accurately measure performance of our products. VW is convenient for a number of reasons.

Benefit 1: An input format that makes your life easier

To feed data to VW, you first need to convert it to a special format. While this format might seem strange, it has many benefits. It lets you split features into namespaces, put a weight on a whole namespace, name features, pass categorical features as-is, and even pass text as a feature. With VW, you can pass raw text with almost zero preparation and train a decent model on it!

The data format is also less error prone. At prediction time, you convert features into this same textual format rather than into numerical vectors, which removes a whole class of indexing mistakes.
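As a brief illustration (the syntax follows VW’s documented input format, but the labels, namespaces, and feature names below are made up):

```text
1 'doc_1 |title:2.0 senior java developer |category tech
0 'doc_2 |title junior accountant |category finance
```

Here, 1 and 0 are labels, 'doc_1 and 'doc_2 are tags, title and category are namespaces, title carries a namespace-wide weight of 2.0, and the raw words themselves serve as features.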

Benefit 2: Powerful feature engineering techniques out-of-the-box

Another strength of Vowpal Wabbit is its built-in feature engineering techniques. These range from simple, such as quadratic interactions and n-grams, to more complex, such as low-rank quadratic approximation (also known as factorization machines). You can access all of these techniques just by changing program options.
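For instance, given a training file with namespaces whose names start with t and c, these documented VW options switch techniques on (the file names here are illustrative):

```
vw train.vw -f model.vw -q tc        # quadratic interactions between namespaces t and c
vw train.vw -f model.vw --ngram 2    # add bigrams over text features
vw train.vw -f model.vw --lrq tc4    # low-rank quadratic approximation, rank 4
```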

Benefit 3: Excellent speed

Vowpal Wabbit is written in optimized C++ and can take advantage of multiple processor cores. In our experience, VW is two to three times faster than comparable R code if you count only training time, and about ten times faster if you also count preparation time, such as computing tf-idf.

Benefit 4: No bottleneck on data size

Most machine learning algorithms require you to read an entire dataset into the memory of one process. VW takes a different approach, called online learning: it reads the training set example by example and updates the model with each one. Thanks to the hashing trick, it doesn’t need to keep a mapping from each feature to a weight index in memory. All it needs to store in memory is the weight vector.

This means you can train a model on a dataset of any size on a single machine — tens of gigabytes of data is not an issue.
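The core of this idea can be sketched in a few lines of Java. This is illustrative only: VW itself uses murmurhash, adaptive learning rates, and many other refinements, so the class below is a toy stand-in, not VW’s implementation.

```java
// Minimal sketch of the hashing trick plus online SGD learning.
// The only state is a fixed-size weight vector: feature names hash
// straight to an index, so no feature-to-index dictionary is kept.
final class HashingModel {
    private final int mask;
    private final float[] weights; // the only state the model keeps

    HashingModel(int bits) {
        this.mask = (1 << bits) - 1;       // e.g. bits = 18 gives a 2^18-entry table
        this.weights = new float[1 << bits];
    }

    // A feature name hashes straight to a weight index.
    private int indexOf(String feature) {
        return feature.hashCode() & mask;
    }

    float predict(String[] features) {
        float dot = 0f;
        for (String f : features) {
            dot += weights[indexOf(f)];
        }
        return dot;
    }

    // One online update for squared loss: read an example, adjust weights, move on.
    void learn(String[] features, float label, float learningRate) {
        float error = predict(features) - label;
        for (String f : features) {
            weights[indexOf(f)] -= learningRate * error;
        }
    }
}
```

Because each example is consumed and discarded, memory use stays constant no matter how large the training set grows.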

Improving the VW API

Vowpal Wabbit is inspired by good old UNIX command line tools, such as find. At Indeed, however, most of our infrastructure is in Java. We wanted to invoke VW from Java code, but we encountered two issues with its default Java wrapper:

  • The wrapper requires Boost to be installed on every server where it is used.
  • Its API is very low-level, requiring you to operate with strings instead of providing a more convenient domain abstraction.

To address these issues, we built our own open source JNI wrapper for VW.

Adding vw-wrapper to your project

Add a dependency on vw-wrapper using Maven as follows. No additional software is necessary.
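The original snippet is not reproduced here; a typical declaration would look like the following (the coordinates and version are illustrative — check the project’s GitHub page for the current ones):

```xml
<!-- illustrative coordinates; see the vw-wrapper GitHub page for the real ones -->
<dependency>
    <groupId>com.indeed</groupId>
    <artifactId>vw-wrapper</artifactId>
    <version>1.0.0</version>
</dependency>
```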


Deploying the model to production

You can deploy the model to production in three ways:

  • Train the model via the command line and deploy it to production by copying the model file to your servers or putting it in Git with the sources
  • Build one Java component that trains the model, stores it in a file, and replicates it to make predictions in a different component
  • Train and make predictions in the same Java process: this can be useful if you want to make an online learning system (a system that continuously updates the model as new data becomes available)

We’ve tested the library in the three main environments we use: CentOS, Ubuntu and macOS. We include shared libraries that are statically linked to VW in the distributed jar file.

Examples of usage

We reproduced each deployment model in integration tests, which also demonstrate using the Java API.

  • The “MovieLens dataset” test illustrates how to use VW for user rating prediction. This test uses the lrqfa option to get a signal from latent (user, movie) interactions, as described in the factorization machines paper.
  • The “Twitter sentiment analysis” test illustrates how to use VW for NLP. It demonstrates using raw text as features, the n-gram and skip-n-gram feature engineering techniques, and how to perform feature selection using the featureMask option.

What about the name: Vowpal Wabbit?

Vowpal Wabbit is Elmer Fudd’s pronunciation of “vorpal rabbit.” Vorpal is a nonsense word from Lewis Carroll’s poem Jabberwocky; in this context, it means quick.

One, two! One, two! And through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

A Vorpal Rabbit is very quick.

Get started with Vowpal Wabbit and Vowpal Wabbit Java

Learn more about VW with Langford’s VW documentation. It explains VW features and includes tutorials and links to research describing how VW works under the hood.

Check out our Vowpal Wabbit Java wrapper on Github. To learn how to use the wrapper, refer to our integration tests and Java API documentation, including information about useful parameters.


Delaying Asynchronous Message Processing

At Indeed, we always consider what’s best for the job seeker. When a job seeker applies for a job, we want them to have every opportunity to be hired. It is unacceptable for a job seeker to miss an employment opportunity because their application was waiting to be processed while the employer made a hire. The team responsible for handling applies to jobs posted on Indeed maintains service level objectives (SLOs) for application processing time. We constantly consider better solutions for processing applications and scaling this system.

Indeed first adopted RabbitMQ within our aggregation engine to handle the volume of jobs we process daily. Following this success, we integrated RabbitMQ into other systems, such as our job seeker application processing pipeline. Today, this pipeline is responsible for processing more than 1.5 million applications a day. Over time, the team needed to implement several resilience patterns around this integration, including:

  • Tracing messages from production to consumption
  • Delaying message processing when expected errors occur
  • Sending messages that cannot be processed completely to a dead letter queue

Implementing a delay queue

A delay queue defers message processing by setting a message aside for a fixed amount of time. To understand why we implemented this pattern, consider several key behaviors of most messaging systems. RabbitMQ:

  • Guarantees at least once delivery (some messages can be delivered multiple times)
  • Allows acknowledgement (ack), negative acknowledgement (nack), or requeue of messages
  • Requeues messages to the head of the queue, not the end

The team implemented a delay queue primarily to deal with the third item. Since RabbitMQ requeues messages to the head of the queue, the next message your consumer will likely process is the one that just failed. Although this is a non-issue for a small volume of messages, critical problems occur as the number of unprocessable messages exceeds the number of consumer threads. Since consumers can’t get past the group of unprocessable messages at the beginning of the queue, messages back up within the cluster.
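The set-aside-until-ready behavior can be illustrated in-process with java.util.concurrent.DelayQueue. Note that Indeed’s delay queue is built on RabbitMQ, not on this class; the sketch below only demonstrates the core semantics, and the message type is made up.

```java
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// A message that only becomes visible to consumers after its delay elapses.
// java.util.concurrent.DelayQueue will not hand it out before then.
final class DelayedMessage implements Delayed {
    final String body;
    private final long readyAtNanos;

    DelayedMessage(String body, long delayMillis) {
        this.body = body;
        this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                            other.getDelay(TimeUnit.NANOSECONDS));
    }
}
```

While a failed message waits out its delay, consumers keep draining the rest of the queue, so a burst of unprocessable messages no longer blocks everything behind it.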

Figure 1. Message backup within the cluster (queue size over a 24-hour period)

How it works

While mechanisms such as a dead letter queue allowed us to delay message processing, they often required manual intervention to return a system to a healthy state. The delay queue pattern allows our systems to continue processing. Additionally, it requires less work from our first responders (engineers who are “on call” during business hours to handle production issues), Site Reliability Engineers (SREs), and our operations team. The following diagram shows the options for a consumer process that encounters an error:

Figure 2. Asynchronous message consuming system

When a consumer encounters an error and cannot process a message, engineers must choose whether the consumer should requeue the message, place it in the delay queue, or deliver it to the dead letter queue. They can make this decision by considering the following questions:

Was the error unexpected?

If your system encounters an unexpected error that is unlikely to happen again, requeue the message. This gives your system a second chance to process the message. Requeuing the message is useful when you encounter:

  • Network blips in service communication
  • A database operation failure caused by a transaction rollback or the inability to obtain a lock

Does the dependent system need time to catch up?

If your system encounters an expected error that may require a little time before reprocessing, delay the message. This allows downstream systems to catch up so the next time you try to process the message, it’s more likely to succeed. Delaying the message is useful for handling:

  • Database replication lag issues
  • Consistency issues when working with eventually consistent systems

Would you consider the message unprocessable?

If a message is unprocessable, send it to your dead letter queue. An engineer can then inspect the message and investigate before dropping or manually requeueing it. A dead letter queue is useful when your system:

  • Expects a message to contain information that is missing
  • Requires manual inspection of dependent resources before trying to reprocess the message

Escalation policy

To further increase your system’s resilience, you might establish an escalation policy across the three options. If a system requests that a message be requeued n times, start delaying the message instead. If the message is then delayed another m times, send it to your dead letter queue. That’s what we have done.
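The escalation policy above can be sketched as a small decision function. The thresholds and names here are illustrative; the post does not show Indeed’s actual implementation.

```java
// Where a failed message should go next, based on how many times
// it has already been requeued and delayed.
enum Action { REQUEUE, DELAY, DEAD_LETTER }

final class EscalationPolicy {
    private final int maxRequeues; // n: requeue attempts before we start delaying
    private final int maxDelays;   // m: delay attempts before dead-lettering

    EscalationPolicy(int maxRequeues, int maxDelays) {
        this.maxRequeues = maxRequeues;
        this.maxDelays = maxDelays;
    }

    Action decide(int requeueCount, int delayCount) {
        if (requeueCount < maxRequeues) {
            return Action.REQUEUE;     // unexpected, likely transient: try again soon
        }
        if (delayCount < maxDelays) {
            return Action.DELAY;       // give downstream systems time to catch up
        }
        return Action.DEAD_LETTER;     // unprocessable: hand off for manual inspection
    }
}
```

Each failed delivery consults the policy with the message’s current counts, so a message marches through requeue, then delay, then the dead letter queue without any manual intervention.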

This type of policy has reduced the work for our first responders, SREs, and operations team. We have been able to scale our application processing system as we process more and more candidate applications every day.


Automating Indeed’s Release Process

Indeed’s rapid growth has presented us with many challenges, especially to our release process. Our largely manual process did not scale and became a bottleneck. We decided to develop a custom solution. The lessons we learned in automating our process can be applied to any rapidly growing organization that wants to maintain software quality and developer goodwill.

How did we end up here?

Our software release process has four main goals:

  • Understand which features are being released
  • Understand cross-product and cross-team dependencies
  • Quickly fix bugs in release candidates
  • Record release details for tracking, analysis, and repeatability

Our process ended up looking like this:

This process was comprehensive but required a lot of work. To put it in perspective, a software release with 4 new features required over 100 clicks and Git actions. Each new feature added about 13 actions to the process.

We identified four primary problems:

  • Release management took a lot of time.
  • It was hard to understand what exactly was in a release.
  • There was a lot of potential for error through so many manual steps.
  • Only senior engineers knew enough to handle a release.

We came to a realization: we needed more automation.

But wait — why not just simplify?

Of course, rather than automating our process, we could just simplify it. However, our process provided secondary benefits that we did not want to lose:

Data. Our process provided us with a lot of data and metrics, which allowed us to make continual improvements.

History. Our process allowed us to keep track of what was released and when it was released.

Transparency. Our process, while complicated, allowed us to examine each step.

Automating our way out

We realized that we could automate much of our process and reduce our overhead. To do so, we would need to integrate better with the solutions we already had in place — and be smart about it.

Our process uses multiple systems:

  • Atlassian JIRA: issue management and tracking
  • Atlassian Crucible: code reviews
  • Jenkins: release candidate builds and deploys
  • Gitlab: source control
  • Various build and dependency management tools

Rather than replace these tools, we decided to create a unified release system that could communicate with each of them. We called this unified release system Control Tower.

Integration with dependency management tools allows release managers (RMs) to track new code coming in through library updates. RMs can quickly assess code interdependencies and see the progress of changes in a release. Finally, when an RM has checked everything, they can trigger a build through Jenkins.

The Control Tower main view allows RMs to see details from all the relevant systems. Changes are organized by JIRA issue key, and each change item includes links to Crucible code review information and Git repo locations.

By automating, we significantly reduced the amount of human interaction necessary in our release process. In the following image, every grey box represents a manual step that was eliminated.

After automating, we reduced the number of required clicks and Git actions from over 100 to fewer than 15. And new features now add no extra work, instead of requiring 13 extra actions.

To learn even more about Control Tower, see our Indeed Engineering tech talk. We talk about Control Tower starting at 32:45.

Lessons learned

In the process of creating our unified release system, we learned some valuable lessons.

Lesson 1: Automate the process you have, not the one you want

When we first set out to automate our release process, we did what engineers naturally do in such a situation — we studied the process to understand it as best as we could before starting. Then, we did what engineers also naturally do — we tried to improve it.

While it seemed obvious to “fix” the process while we were automating it, we learned that a tested, working process — even one with problems — is preferable to an untested one, no matter how slick. Our initial attempts at automation met with resistance because developers were unfamiliar with the new way.

Lesson 2: Automation can mean more than you think

When most people think of “automating” a process, they assume it means removing decisions from human actors — “set it and forget it.” But sometimes you can’t remove human interaction from a process. It might be too difficult technically, or you might want a human eye on a process to assure a correct outcome. Even in these situations, automation can come into play.

Sometimes automation means collecting and displaying data to help humans make decisions faster. We found that, even when we needed a human to make a choice, we were able to provide better data to help them make a more informed choice.

Deciding on the proper balance between human and machine action is key to automating. We see future opportunities for improvement by applying machine learning techniques to help humans make decisions even faster.

Lesson 3: Transparency, transparency, transparency

Engineers might not like inefficiency, but they also don’t like mystery. We wanted to avoid a “black box” process that does everything without giving insight as to how and why.

We provide abundant transparency through logging and messaging whenever we can. Allowing developers to examine what the process had done — and why — helped them to trust and adopt the automation solution. Logging also helps should anything go wrong.

Where do we go from here?

Even with our new system in place, we know that we can improve it. We are already working behind the scenes on the next steps.

We are developing algorithms that can monitor issue statuses, completed code reviews, build/test statuses, and other external factors. We can develop systems capable of programmatically understanding when a feature is ready for release. We can then automatically make the proper merge requests and set the release process in motion. This further reduces the time between creating and shipping a feature.

We can use machine learning techniques to take in vast amounts of data for use in our decision-making process. This can point out risky deploys and let us know if we need to spend extra effort testing or if we can deploy with minimal oversight.

Our release management system is an important step toward increasing our software output while maintaining the quality our customers expect. This system is a step, not the final goal. By continually improving our process, by learning as we go, we work toward our ultimate goal — helping even more people get jobs.
