At Indeed, machine learning is key to our mission of helping people get jobs. Machine learning lets us collect, sort, and analyze millions of job postings a day. In this post, we’ll describe our open-source Java wrapper for a particularly useful machine learning library, and we’ll explain how you can benefit from our work.
Challenges of machine learning
It’s not easy to build a machine learning system. A good system needs to do several things right:
- Feature engineering. For example, converting text to a feature vector requires you to precalculate statistics about words. This process can be challenging.
- Model quality. Most algorithms require hyperparameter tuning, which is usually done through grid search. This process can take hours, making it hard to iterate quickly on ideas.
- Model training for large datasets. The implementations of most algorithms assume that the entire dataset fits in memory in a single process. Extremely large datasets, like those we work with at Indeed, are harder to train on.
Wabbit to the rescue
Fortunately for us, an excellent machine learning system that meets those needs already exists. John Langford, a computer science researcher from Microsoft, possesses a rare combination of excellence in machine learning theory and programming. His command line tool, Vowpal Wabbit (VW), implements state-of-the-art techniques for building generalized linear models, including feature hashing, adaptive bound optimization, normalized online learning, and online importance weight aware updates. It also includes useful features such as a flexible input data format. VW has garnered a lot of attention in the machine learning community and enjoys success in the industry.
Benefits of Vowpal Wabbit
At Indeed, we use VW to build models that help discover new job sites, improve the quality of search results, and accurately measure the performance of our products. VW is convenient for a number of reasons.
Benefit 1: An input format that makes your life easier
To feed VW with data, you need to convert that data to a special format first. While this format might seem strange, it has many benefits. It allows you to split features into namespaces, put weight on a whole namespace, name features, pass categorical features as-is, and even pass text as a feature. With VW, you can pass raw text with almost zero prep and train a decent model on it!
The data format is also less error prone. During the prediction phase, you only need to convert prediction features into this format and not into numerical vectors.
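As a rough illustration of the format, the sketch below renders one example as a VW input line: a label, then namespaces introduced by `|`, each followed by space-separated features with an optional `:value`. The class `VwExampleFormatter`, the namespace names `text` and `meta`, and the `name=value` convention for categorical features are assumptions made for this example. They are not part of VW or of vw-wrapper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of VW or vw-wrapper) that renders one example as a VW input line. */
final class VwExampleFormatter {

    /** Whitespace, '|' and ':' have special meaning in the VW format, so replace them in tokens. */
    static String sanitize(String token) {
        return token.replaceAll("[\\s|:]", "_");
    }

    /**
     * Produces a line of the form: label |text word word ... |meta name=value ... name:number ...
     * Raw text goes in as-is (one feature per word), categorical features become name=value
     * tokens, and numeric features use the name:value syntax.
     */
    static String format(int label, String text,
                         Map<String, String> categorical, Map<String, Double> numeric) {
        StringBuilder line = new StringBuilder();
        line.append(label);

        line.append(" |text");
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                line.append(' ').append(sanitize(word));
            }
        }

        line.append(" |meta");
        for (Map.Entry<String, String> e : categorical.entrySet()) {
            line.append(' ').append(sanitize(e.getKey())).append('=').append(sanitize(e.getValue()));
        }
        for (Map.Entry<String, Double> e : numeric.entrySet()) {
            line.append(' ').append(sanitize(e.getKey())).append(':').append(e.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, String> categorical = new LinkedHashMap<>();
        categorical.put("city", "Austin");
        Map<String, Double> numeric = new LinkedHashMap<>();
        numeric.put("years", 5.0);
        System.out.println(format(1, "Senior Java developer", categorical, numeric));
    }
}
```

The same formatter can be used at training time and at prediction time, which is what keeps the two code paths from drifting apart.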
Benefit 2: Powerful feature engineering techniques out-of-the-box
Another strength of Vowpal Wabbit is its built-in feature engineering techniques, described in this wiki page. These techniques range from the less complex, such as quadratic interactions and n-grams, to the more complex, such as low rank quadratic approximation (also known as factorization machines). You can access all of these feature engineering techniques just by changing program options.
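To make "just by changing program options" concrete, here is a small sketch that assembles a VW argument string from Java. The flags `--ngram`, `--skips`, `-q`, and `--lrq` are VW command line options. The class `VwFeatureOptions` and the namespace letters `u` and `m` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical builder (not part of VW or vw-wrapper) for feature engineering flags. */
final class VwFeatureOptions {
    private final List<String> args = new ArrayList<>();

    /** Generate n-grams of the given size from adjacent features. */
    VwFeatureOptions ngram(int n) {
        args.add("--ngram");
        args.add(Integer.toString(n));
        return this;
    }

    /** Allow up to k skipped tokens inside each n-gram. */
    VwFeatureOptions skips(int k) {
        args.add("--skips");
        args.add(Integer.toString(k));
        return this;
    }

    /** Quadratic interactions between two namespaces, identified by their first letters. */
    VwFeatureOptions quadratic(char nsA, char nsB) {
        args.add("-q");
        args.add("" + nsA + nsB);
        return this;
    }

    /** Low rank quadratic approximation between two namespaces with the given rank. */
    VwFeatureOptions lowRankQuadratic(char nsA, char nsB, int rank) {
        args.add("--lrq");
        args.add("" + nsA + nsB + rank);
        return this;
    }

    String build() {
        return String.join(" ", args);
    }

    public static void main(String[] unused) {
        String options = new VwFeatureOptions()
                .ngram(2)
                .skips(1)
                .quadratic('u', 'm')
                .lowRankQuadratic('u', 'm', 7)
                .build();
        System.out.println(options);
    }
}
```

Switching from plain quadratic interactions to a low rank approximation is then a one-line change in how the options are built, with no change to the data preparation code.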
Benefit 3: Excellent speed
Vowpal Wabbit is written in optimized C++ and it can take advantage of multiple processor cores. VW is 2-3 times faster than R if you count only train time, and ten times faster than R if you count preparation time, such as computing tf-idf.
Benefit 4: No bottleneck on data size
Most machine learning algorithms require you to read an entire dataset into the memory of one process. VW uses a different approach called online learning: it reads a training set, example by example, and updates the model with each example. It doesn't need to keep a mapping from each word to its weight index in memory, because it uses the hashing trick. All it needs to store in memory is a weight vector.
This means you can train a model on a dataset of any size on a single machine -- tens of gigabytes of data is not an issue.
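The idea behind online learning with the hashing trick can be sketched in a few lines of Java. This is a simplified illustration, not VW's implementation. It assumes plain stochastic gradient descent on logistic loss and uses `String.hashCode` to pick a weight slot, whereas VW uses its own hash function and more sophisticated update rules. The class name `HashedOnlineLearner` is hypothetical.

```java
/** Minimal sketch of an online logistic regression learner with feature hashing. */
final class HashedOnlineLearner {
    private final double[] weights; // the only state kept in memory
    private final int mask;
    private final double learningRate;

    HashedOnlineLearner(int bits, double learningRate) {
        this.weights = new double[1 << bits];
        this.mask = (1 << bits) - 1;
        this.learningRate = learningRate;
    }

    /** Maps a feature name straight to a weight slot, so no word-to-index dictionary is needed. */
    private int index(String feature) {
        return feature.hashCode() & mask;
    }

    /** Returns the predicted probability that the label is 1. */
    double predict(String[] features) {
        double score = 0.0;
        for (String f : features) {
            score += weights[index(f)];
        }
        return 1.0 / (1.0 + Math.exp(-score));
    }

    /** Updates the model from a single example (label is 0 or 1), then the example can be discarded. */
    void learn(String[] features, int label) {
        double error = label - predict(features);
        for (String f : features) {
            weights[index(f)] += learningRate * error;
        }
    }

    int weightCount() {
        return weights.length;
    }

    public static void main(String[] args) {
        HashedOnlineLearner learner = new HashedOnlineLearner(18, 0.5);
        for (int pass = 0; pass < 100; pass++) {
            learner.learn(new String[] {"great", "movie"}, 1);
            learner.learn(new String[] {"awful", "movie"}, 0);
        }
        System.out.println(learner.predict(new String[] {"great"}));
        System.out.println(learner.predict(new String[] {"awful"}));
    }
}
```

Memory use depends only on the number of hash bits, not on the number of examples or distinct words. The trade-off is that two different features can collide in the same slot.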
Improving the VW API
Vowpal Wabbit is inspired by good old UNIX command line tools, such as find. At Indeed, however, most of our infrastructure is in Java. We wanted to invoke VW from Java code, but we encountered two issues with its default Java wrapper:
- The wrapper requires boost to be installed on every server where it is used.
- Its API is very low-level, requiring you to operate with strings instead of providing a more convenient domain abstraction.
To address these issues, we built our own open source JNI wrapper for VW.
Adding vw-wrapper to your project
Add a dependency on vw-wrapper using Maven as follows. No additional software is necessary.
    <dependency>
        <groupId>com.indeed</groupId>
        <artifactId>vw-wrapper</artifactId>
        <version>1.0.0</version>
    </dependency>
Deploying the model to production
You can deploy the model to production in three ways:
- Train the model via command line and deploy it to production by replicating the file with the model or putting it in Git with the sources
- Build one Java component that trains the model, stores it in a file, and replicates it to make predictions in a different component
- Train and make predictions in the same Java process: this can be useful if you want to make an online learning system (a system that continuously updates the model as new data becomes available)
We’ve tested the library in the three main environments we use: CentOS, Ubuntu and macOS. We include shared libraries that are statically linked to VW in the distributed jar file.
Examples of usage
We reproduced each deployment model in integration tests, which also demonstrate using the Java API.
- The "Movies lens dataset" test illustrates how to use VW for user rating prediction. This test uses the lrqfa option to get a signal from latent (user, movie) interactions, as described in this factorization machines paper. See the test here.
- The "Twitter sentiment analysis" test illustrates how to use VW for NLP. This test demonstrates using raw text as features, the n-grams and skip-n-grams feature engineering techniques, and how to perform feature selection using the featureMask option. See the test here.
What about the name: Vowpal Wabbit?
One, two! One, two! And through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.
A Vorpal Rabbit is very quick.