Indeed at PyData Seattle 2017

Indeed was a proud sponsor of PyData Seattle 2017, an international conference promoting the use of open source data analysis tools for the Python community, such as Pandas, Matplotlib, IPython and Project Jupyter.

Indeed Data Scientists presented two tutorials at the conference:

  • Using Pandas for analyzing structured time series data
  • Using open source natural language processing (NLP) libraries for analyzing unstructured text

This post introduces their presentations and includes links to videos and tutorials for you to try the exercises yourself.

Joe McCarthy illustrated how to use tools in the Pandas data analysis library to investigate unevenly spaced time series data in The Simpsons. This type of data analysis tends to focus more on the intervals between events rather than the frequency of events occurring within regularly spaced intervals. At Indeed, one example of such a task is estimating how long it takes a recruiter to review a resume (or profile), based on the gaps in timestamps of initial profile disposition events.
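As a small sketch of that style of analysis (with made-up timestamps, not Indeed's real data), Pandas makes it easy to compute the gaps between consecutive events rather than binning events into fixed intervals:

```python
import pandas as pd

# Hypothetical timestamps of profile disposition events (invented data),
# unevenly spaced in time.
events = pd.Series(
    pd.to_datetime([
        "2017-07-01 09:00", "2017-07-01 09:03",
        "2017-07-01 09:45", "2017-07-02 08:30",
    ])
)

# For unevenly spaced series, the quantity of interest is the gap
# between consecutive events, not counts per fixed interval.
gaps = events.diff().dropna()
print(gaps.median())  # a robust "typical time between events"
```

The same `diff()` idiom applies whether the events are resume reviews or lines of dialogue in a Simpsons script.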

Joe’s tutorial focused on a collection of data about episodes, characters, locations and scripts from The Simpsons. This collection is one of many data sets available at data.world.

Video 1. D’oh! Unevenly spaced time series analysis of The Simpsons in Pandas

Alex Thomas demonstrated how to use open source NLP tools such as the Natural Language Toolkit (NLTK) and word_cloud for vocabulary analysis of job descriptions. His tutorial covered basic NLP techniques such as tokenization, stemming and lemmatization in the context of analyzing job descriptions posted on Indeed. Other techniques include the use of stop words, multi-word phrases (n-grams) and the TF-IDF statistic for estimating the relevance of documents.
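As a rough sketch of the TF-IDF statistic mentioned above (using invented job-description snippets and a minimal pure-Python implementation rather than the libraries from the tutorial):

```python
import math
from collections import Counter

# Toy job-description snippets, invented for illustration.
docs = [
    "senior python developer python experience required".split(),
    "java developer with spring experience".split(),
    "barista needed for busy coffee shop".split(),
]

def tf_idf(term, doc, docs):
    # Term frequency: how often the term appears in this document.
    tf = Counter(doc)[term] / len(doc)
    # Inverse document frequency: terms rare across the corpus score higher.
    df = sum(term in d for d in docs)
    idf = math.log(len(docs) / df)
    return tf * idf

# "python" is distinctive to the first posting; "developer" appears in two,
# so it scores lower even though both occur in docs[0].
print(tf_idf("python", docs[0], docs))
print(tf_idf("developer", docs[0], docs))
```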

Alex highlighted challenges in processing text and some interesting and often-unanticipated problems in interpreting the results of applying each of these techniques.

Video 2. Vocabulary Analysis of Job Descriptions

Exercises and Jupyter Notebooks for both Indeed tutorials are on GitHub at pydata-simpsons and pydata-vocab-analysis. For more PyData conference presentations, check out their YouTube channel.


Indeed at Litmus Live 2017: How to Run a Successful Email Workshop

Lindsay Brothers

Indeed is proud to announce that Lindsay Brothers will be speaking at Litmus Live in Boston on August 3, 2017. Lindsay is a Product Manager at Indeed. Her team sends billions of job alert emails every month to job seekers around the world.

As the world’s #1 job site, Indeed communicates with job seekers around the globe. A unified email strategy allows us to effectively understand how, when, and why we should email job seekers. To develop this strategy, we built an email planning workshop to share ideas and come to consensus quickly. During this workshop, we created a job seeker’s Bill of Rights and brought users onsite for feedback and validation. Lindsay’s session will cover workshop details and offer takeaways for anyone developing a similar email strategy.

Litmus Live brings together email marketers for two days of real-world advice, best practices, and key takeaways. Free from product pitches and hype, Litmus Live is all about content: Teaching designers, developers, marketers, and strategists how to create emails that look great, perform well, and engage audiences.

If you’re at Litmus Live Boston this year, join Lindsay to learn more about Indeed!

Litmus Live: The Email Design Conference

Indeed is hiring talented Sales, Product, Marketing and Engineering minds from Toronto to Tokyo and beyond. Find out more about opportunities to work at one of our 24 global offices.


Friendly Machine Learning

At Indeed, machine learning is key to our mission of helping people get jobs. Machine learning lets us collect, sort, and analyze millions of job postings a day. In this post, we’ll describe our open-source Java wrapper for a particularly useful machine learning library, and we’ll explain how you can benefit from our work.

Challenges of machine learning

It’s not easy to build a machine learning system. A good system needs to do several things right:

  • Feature engineering. For example, converting text to a feature vector requires you to precalculate statistics about words. This process can be challenging.
  • Model quality. Most algorithms require hyperparameter tuning, which is usually done through grid search. This process can take hours, making it hard to iterate quickly on ideas.
  • Model training for large datasets. The implementations for most algorithms assume that the entire dataset fits in memory in a single process. Extremely large datasets, like those we work with at Indeed, are harder to train.
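To see why grid search in the second point gets expensive, consider a rough illustration (the parameter names and values here are invented): even a modest grid multiplies out into many full training runs.

```python
import itertools

# A small, hypothetical hyperparameter grid.
grid = {
    "learning_rate": [0.01, 0.1, 0.5],
    "l2": [0.0, 1e-6, 1e-4],
    "passes": [1, 5, 10],
}

# Grid search trains one model per combination.
combos = list(itertools.product(*grid.values()))
print(len(combos))  # 27 full training runs for a 3x3x3 grid
```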

Wabbit to the rescue

Fortunately for us, an excellent machine learning system that meets those needs already exists. John Langford, a computer science researcher at Microsoft, possesses a rare combination of excellence in machine learning theory and programming. His command line tool, Vowpal Wabbit (VW), implements state-of-the-art techniques for building generalized linear models and includes useful features such as a flexible input data format. VW has garnered a lot of attention in the machine learning community and enjoys success in industry.

Benefits of Vowpal Wabbit

At Indeed, we use VW to build models that help discover new job sites, improve quality of search results, and accurately measure performance of our products. VW is convenient for a number of reasons.

Benefit 1: An input format that makes your life easier

To feed data to VW, you first need to convert it to a special format. While this format might seem strange, it has many benefits. It lets you split features into namespaces, put a weight on a whole namespace, name features, pass categorical features as-is, and even pass text as a feature. With VW, you can pass raw text with almost zero prep and train a decent model on it!
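For example, a single VW training line (with made-up data) might carry a label, an importance weight, a tag, a weighted namespace of named features, and a namespace of raw text:

```
1 1.0 'job123 |title:2.0 software engineer |description we are hiring a senior python developer
```

Here `1` is the label, `1.0` the importance weight, and `'job123` a tag; the `:2.0` after `title` scales every feature in that namespace, and each token of the raw text in `description` becomes a feature automatically.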

The data format is also less error-prone: during the prediction phase, you only need to convert features into this same format, not into numerical vectors.

Benefit 2: Powerful feature engineering techniques out-of-the-box

Another strength of Vowpal Wabbit is its built-in feature engineering techniques. These range from less complex, such as quadratic interactions and n-grams, to more complex, such as low-rank quadratic approximation (also known as factorization machines). You can access all of these techniques just by changing program options.

Benefit 3: Excellent speed

Vowpal Wabbit is written in optimized C++ and can take advantage of multiple processor cores. VW is 2-3 times faster than R if you count only training time, and ten times faster if you also count preparation time, such as computing tf-idf.

Benefit 4: No bottleneck on data size

Most machine learning algorithms require you to read the entire dataset into the memory of one process. VW uses a different approach called online learning: it reads the training set example by example and updates the model with each one. Because it uses the hashing trick, it doesn't need to keep a mapping from each feature to a weight index in memory. All it needs to store in memory is the weight vector.

This means you can train a model on a dataset of any size on a single machine — tens of gigabytes of data is not an issue.
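A minimal pure-Python sketch of this idea follows (the feature names, vector size, and learning rate are invented, and VW's actual implementation differs in many details): each feature name is hashed directly to a slot in a fixed-size weight vector, and the model is updated one example at a time.

```python
import zlib

D = 2 ** 18           # fixed-size weight vector; no word -> index map needed
weights = [0.0] * D

def feature_index(name):
    # The hashing trick: hash the feature name straight to a weight slot.
    return zlib.crc32(name.encode()) % D

def predict(features):
    # Linear model over (feature name, value) pairs.
    return sum(weights[feature_index(f)] * v for f, v in features)

def update(features, label, lr=0.1):
    # One gradient step per example -- the dataset never sits in memory.
    err = predict(features) - label
    for f, v in features:
        weights[feature_index(f)] -= lr * err * v

# Stream the "dataset" example by example (here, the same toy example).
for _ in range(50):
    update([("word=python", 1.0), ("bias", 1.0)], label=1.0)
print(predict([("word=python", 1.0), ("bias", 1.0)]))
```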

Improving the VW API

Vowpal Wabbit is inspired by good old UNIX command line tools, such as find. At Indeed, however, most of our infrastructure is in Java. We wanted to invoke VW from Java code, but we encountered two issues with its default Java wrapper:

  • The wrapper requires Boost to be installed on every server where it is used.
  • Its API is very low-level, requiring you to operate with strings instead of providing a more convenient domain abstraction.

To address these issues, we built our own open source JNI wrapper for VW.

Adding vw-wrapper to your project

Add a dependency on vw-wrapper using Maven as follows. No additional software is necessary.

<dependency>
   <groupId>com.indeed</groupId>
   <artifactId>vw-wrapper</artifactId>
   <version>1.0.0</version>
</dependency>

Deploying the model to production

You can deploy the model to production in three ways:

  • Train the model via command line and deploy it to production by replicating the file with the model or putting it in Git with the sources
  • Train the model in one Java component, store it in a file, and replicate the file to a different component that makes predictions
  • Train and make predictions in the same Java process: this can be useful if you want to make an online learning system (a system that continuously updates the model as new data becomes available)

We’ve tested the library in the three main environments we use: CentOS, Ubuntu and macOS. We include shared libraries that are statically linked to VW in the distributed jar file.

Examples of usage

We reproduced each deployment model in integration tests, which also demonstrate using the Java API.

  • The “MovieLens dataset” test illustrates how to use VW for user rating prediction. This test uses the lrqfa option to get a signal from latent (user, movie) interactions, as described in this factorization machines paper.
  • The “Twitter sentiment analysis” test illustrates how to use VW for NLP. This test demonstrates using raw text as features, the n-grams and skip-n-grams feature engineering techniques, and how to perform feature selection using the featureMask option.

What about the name: Vowpal Wabbit?

Vowpal Wabbit is Elmer Fudd’s pronunciation of “vorpal rabbit.” Vorpal is a nonsense word from Lewis Carroll’s poem Jabberwocky; in this context, it means quick.

One, two! One, two! And through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

A Vorpal Rabbit is very quick.

Get started with Vowpal Wabbit and Vowpal Wabbit Java

Learn more about VW with Langford’s VW documentation. It explains VW features and includes tutorials and links to research describing how VW works under the hood.

Check out our Vowpal Wabbit Java wrapper on GitHub. To learn how to use the wrapper, refer to our integration tests and Java API documentation, including information about useful parameters.
