Normalizing Resume Text in the Age of Ninjas, Rockstars, and Wizards

Left to right: Ninja by Mwangi Gatheca, Rockstar by Austin Neill and Magic by Pierrick Van Troost

At Indeed we help people get jobs, which means understanding resumes and making them discoverable by the right employers. Understanding massive amounts of text is a tricky problem by itself. With source text as varied as resumes, the problem is even more challenging.

Everyone writes their resume differently, and there are some wild job titles out there. If we want to correctly label resumes for software engineers, we have to consider that developer wizard, java engineer, software engineer, and software eng. may all be the same job title. In fact, there may be thousands of ways to describe a job title in our more than 150 million resumes. Human labeling of all of those resumes—as well as new ones created every day—is an impossible task. 

So what is your job, really?

To better understand what a job actually is, we apply a process called normalization to the job title. Normalization is the process of finding synonyms (or equivalence classes) for terms. It allows us to classify resumes in a meaningful way so that employers can find job seekers with relevant experience for their job listings. 

For example, if we determine that software engineer and software developer are equivalent titles, then we can show employers searching for software engineers additional resumes with the title software developer. This is particularly useful in regions with fewer resumes for a job title the employer wants to fill.

Normalizing job titles, certifications, company names, etc. also helps us use resume information in machine learning models as features and labels. We want to know if biology on a resume has the same meaning as bio or even a common misspelling like boilogy. If we want to predict whether a job seeker has a nursing license, we have to correctly label resumes with RN and registered nurse.

How do we normalize text?

There are many ways to normalize text. For a quick initial model, we can measure how similar strings are to one another. We apply two common string distance measures: Levenshtein distance over characters (to capture misspellings) and Jaccard distance over words (so we can group cell biology major and cell biology together).

Step 1: Preprocessing

As with most text-related models, we must first clean the text data. This preprocessing step removes punctuation from terms, replaces known acronyms and abbreviations with full names, replaces synonyms with more common variants, and stems the words, e.g., removing suffixes such as ing from verbs.
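This preprocessing step might be sketched roughly as follows. The lookup table and the crude suffix stripper are hypothetical stand-ins for the much larger, curated resources a production system would use:

```python
import re

# Hypothetical lookup table; the production versions are far larger and
# curated per field (job titles, certifications, company names, ...).
ABBREVIATIONS = {"sr": "senior", "eng": "engineer", "rn": "registered nurse"}

def stem(word):
    # Crude suffix stripping for illustration only; a real system would use
    # a proper stemmer such as Porter or Snowball.
    for suffix in ("ing", "er", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(title):
    title = re.sub(r"[^\w\s]", " ", title.lower())            # strip punctuation
    words = [ABBREVIATIONS.get(w, w) for w in title.split()]  # expand known short forms
    return " ".join(stem(w) for w in words)

print(preprocess("Sr. Java Developer"))  # senior java develop
```

Note how sr. and developer collapse toward common forms, so later distance comparisons operate on cleaner strings.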

Step 2: Term frequency

After that, we define a term frequency threshold. If a string falls below this threshold, we do not consider it as a potential normalized value. 
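A minimal sketch of this thresholding, assuming the terms arrive as a flat list of preprocessed strings:

```python
from collections import Counter

def candidate_terms(terms, threshold=1000):
    # Only strings seen at least `threshold` times may serve as normalized
    # values; rarer strings can still be mapped onto one of them later.
    counts = Counter(terms)
    return {term for term, count in counts.items() if count >= threshold}

corpus = ["java developer"] * 3 + ["rockstar java developer"]
print(candidate_terms(corpus, threshold=2))  # {'java developer'}
```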

Step 3: Minhash

Once we remove low count strings, we have to classify the terms into groups. The most common technique for this kind of grouping involves determining the distance between terms. How different is boilogy from biology?

To prepare, we need to address a computational power problem. We often have millions of unique strings coming from resumes for each field, e.g., for company names. Finding the distances between all pairs of strings is slow and inefficient, since the number of comparisons needed is as follows:

n(n − 1) / 2, where n is the number of values. For one million different strings, we would need about 500 billion comparisons. We have to reduce the number of pairwise comparisons to make string distance computation feasible.
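As a quick sanity check on that scale, the pairwise comparison count for one million strings can be computed directly:

```python
import math

n = 1_000_000
pairs = math.comb(n, 2)  # n * (n - 1) / 2 unordered pairs
print(f"{pairs:,}")      # 499,999,500,000 -- about 500 billion
```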

To address this challenge, we use locality sensitive hashing. This set of algorithms hashes similar items into the same buckets and can approximate string distance. In particular, the minhash algorithm approximates Jaccard distance: one minus the size of the intersection of two sets divided by the size of their union.

Approximating Jaccard distance with minhash is an easy way to measure string distances defined by the words they contain. Using minhash vastly reduces the number of comparisons that we need by only comparing the strings that are in the same minhash bucket. 
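A from-scratch sketch of minhash over word sets follows; a production system would use a tuned library implementation, and the MD5-based hash scheme and band count here are illustrative only:

```python
import hashlib

NUM_HASHES = 64

def _hash(token, seed):
    # Deterministic per-seed hash of a word; a real implementation would
    # use faster universal hash functions rather than MD5.
    digest = hashlib.md5(f"{seed}:{token}".encode()).hexdigest()
    return int(digest, 16)

def minhash_signature(text):
    words = set(text.split())
    return tuple(min(_hash(w, seed) for w in words) for seed in range(NUM_HASHES))

def estimated_jaccard_similarity(sig_a, sig_b):
    # The fraction of slots where the minimums agree approximates the
    # Jaccard similarity of the underlying word sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

def lsh_buckets(signature, bands=16):
    # Banding for locality sensitive hashing: strings sharing any band key
    # fall into the same bucket and become candidate pairs.
    rows = len(signature) // bands
    return [signature[i * rows:(i + 1) * rows] for i in range(bands)]

sig_a = minhash_signature("cell biology")
sig_b = minhash_signature("cell biology major")
print(estimated_jaccard_similarity(sig_a, sig_b))  # close to the true similarity of 2/3
```

Only strings that collide in at least one band need a full pairwise comparison, which is where the bulk of the savings comes from.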

Once we carry out minhash and remove a large number of the comparisons we have to make, we calculate a normalized version of the Levenshtein distance to get a character-based distance metric. 


Step 4: Levenshtein distance

We then remove pairs with very high Levenshtein distances. Ultimately we are left with groups of pairs that are quite similar, like cell biology and cell biology major.
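A minimal implementation of this character-level metric, scaled by the longer string's length so the result falls in [0, 1]:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein(a, b):
    # Normalize by the longer string so the distance lies in [0, 1].
    return levenshtein(a, b) / max(len(a), len(b), 1)

print(levenshtein("biology", "boilogy"))                       # 2
print(round(normalized_levenshtein("biology", "boilogy"), 3))  # 0.286
```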

Step 5: L2 norm

If similar strings are grouped together, it makes sense to choose the normalized value from within that group. But which one? Which value in a group should we designate as the standard (normalized) value?

To determine this without outside information such as labels, we look at the frequency of strings in our corpus of resumes. Frequently occurring strings are likely to be the more standard values.

However, we do not want to rely solely on frequency to choose our normalized value. The most frequent value could be a good standard for most strings in that group, but not all of them. A group could have pairs that contain French, French language, and French language and economics. In this case, we might want to normalize the first two strings together, but not the third. 

To address this problem, we create a vector of features for each pair. This vector contains the two distance measures and the weighted inverse of the frequency of the more common term (w/f, where w is the weight and f is the frequency of the term in the corpus). We use an inverse so that the output is lower for higher-frequency strings, consistent with string distances being lower when similarity is higher.

We then normalize strings to the term with the lowest vector magnitude (L2 norm) based on those three features. This results in better normalization accuracy as determined by human labelers.

A worked example

Here is how this normalization works in practice. Below is a table of job titles we will consider, as well as their distances from the first job title, Java developer II.

We apply the following steps:

In step 1, during preprocessing, we remove extraneous words such as rockstar and stem the remaining words, removing endings like er.

In step 2 we determine which job titles occur often enough to be potential normalized job titles, based on a threshold of 1,000. Rockstar java developer does not make the cut.

In step 3 we use the minhash algorithm to group the titles by Jaccard distance, and discard any job titles from the group with a distance of > 0.7. Barista and Night shift janitor are discarded from the group. 

In step 4 we calculate the Levenshtein ratio, and discard job titles from the group with a ratio of > 0.3. Developer is discarded. 

And lastly, in step 5 we select the standard value: the string whose feature vector of Jaccard distance, Levenshtein ratio, and w/count has the smallest L2 norm. Since this is a group of two strings, the distances are the same and only the counts feature is different. Here we use a weight of 50. The vectors are:

  • Java developer II [0.33,0.15,0.005]
  • Rockstar java developer [0.33,0.15,0.5]

The normalized value becomes Java developer II since the L2 norm of the first vector is 0.36, less than that of the second vector 0.62.
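These norms are easy to verify directly from the feature vectors above (the third feature is the w/count term with w = 50):

```python
import math

# Feature vectors from the worked example:
# [Jaccard distance, Levenshtein ratio, w / count], with weight w = 50.
vectors = {
    "Java developer II": (0.33, 0.15, 0.005),
    "Rockstar java developer": (0.33, 0.15, 0.5),
}

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

normalized_value = min(vectors, key=lambda title: l2_norm(vectors[title]))
print(normalized_value)                                       # Java developer II
print(round(l2_norm(vectors["Java developer II"]), 2))        # 0.36
print(round(l2_norm(vectors["Rockstar java developer"]), 2))  # 0.62
```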

Is this the best way to approach normalization?

Many other techniques can normalize text and take into account distant synonyms by considering context around the terms of interest. In fact, we are currently working on including phrase embeddings in this framework. In the meantime, our current approach works for us by greatly reducing the amount of time needed to come up with a new normalization for any field in structured text. With a little tuning, this model can work well for many of the 28 languages found in Indeed resumes.

This method also works for different types of data sets. It can apply to job descriptions and even Indeed Questions—the questions that employers use to screen applicants. Normalization does not circumvent the need for expert human judgment, but it helps scale that expertise across a large international product.

Normalization is the bread and butter of understanding text. It might not be as exciting as text generation or deep learning classifiers, but it is just as important. Normalization helps search engines by finding synonyms. It aids in creating features and labels for machine learning models, and makes analysis of data many times easier. Models like the one described here can speed up the normalization process so we can expand to new countries without years of work. These models can also adapt to new data easily so we can update our normalization to a changing lexicon. 

With mathematical models for normalizing text, Indeed can better understand job seekers and employers and adapt to changes, ultimately helping us help people get the jobs they want.

Cross-posted on Medium.


IndeedEng: Proud Supporters of the Open Source Community

At Indeed, open source is at the core of everything we do. Our collaboration with the open source community allows us to develop solutions that help people get jobs.

As active participants in the community, we believe it is important to give back. This is why we are dedicated to making meaningful contributions to the open source ecosystem.

We’re proud to announce our continuing support by renewing our sponsorship for these foundations and organizations.


The ASF thanks Indeed for their continued generosity as an Apache Software Foundation Sponsor at the Gold level.

In addition, Indeed has expanded on their support by providing our awesome ASF Infrastructure team the opportunity to leverage job listing and advertising resources. This helped us bring on new hires to ensure Apache Infrastructure services continue to run 24x7x365 at near 100% uptime.

We are grateful for their involvement, which, in turn, benefits the greater Apache community.

— Daniel Ruggeri, VP Fundraising, Apache Software Foundation


Cloud Native Computing Foundation

CNCF is thrilled to have Indeed as a member of the Foundation. They have been a great addition to our growing end-user community. Indeed’s participation in this vibrant ecosystem helps in driving adoption of cloud native computing across industries. We’re looking forward to working with them to help continue to grow our community.

— Dan Kohn, Executive Director, Cloud Native Computing Foundation


Indeed’s active engagement with open source communities highlights that open source software is now fundamental, not only for businesses, but developers as well.

Like most companies today, Indeed is a user of and contributor to open source software, and interestingly, Indeed’s research of resumes shows developers are too—as job seekers highlight open source skills and experience to win today’s most sought after jobs across technology.

— Patrick Masson, General Manager at the OSI


Outreachy

We’re so happy that Indeed continues to join our sponsors—making it possible for us to provide critical opportunities to people who are impacted by systemic bias, underrepresentation and discrimination—and helping them get introduced to free and open source software.

— Karen Sandler, Executive Director, Software Freedom Conservancy


Python Software Foundation

Participation in the PSF Sponsorship Plan shows Indeed’s support of our mission to promote the development of the Python programming language and the growth of its international community.

Sponsorships, like Indeed’s, fund programs that help provide opportunities for underrepresented groups in technology and shows support for open source and the Python community.

— Betsy Waliszewski, Python Software Foundation


We’re committed

Our open source initiatives involve partnerships, sponsorships and memberships that support open source projects we rely on. We work to ensure that Indeed’s own open source projects thrive. And we involve all Indeedians. This year we began a FOSS Contributor Fund to support the open source community. Anyone in the company can nominate an open source project to receive funds that we award each month.

We’re committed to open source. Learn more about how we do it.



Jobs Filter: Improving the Job Seeker Experience

As Indeed continues to grow, we’re finding more ways to help people get jobs. We’re also offering more ways job seekers can see those jobs. Job seekers can search directly on Indeed, receive recommendations, view sponsored jobs or Indeed Targeted Ads, or receive invitations to apply — to name a few. While each option presents jobs in a slightly different way, our goal for each is the same: showing the right jobs to the right job seekers.

If we miss the mark with the jobs we present, you may lose trust in our ability to connect you with your next opportunity. Our mission is to help people get jobs, not waste their time.

Some of the ways we’d consider a job to be wrong for a job seeker are if it:

  • Pays less than their expected salary range
  • Requires special licensure they do not have
  • Is located outside their preferred geographic area
  • Is in a related field but mismatched, such as nurses and doctors being offered the same jobs

To mitigate this issue, we built a jobs filter to remove jobs that are obviously mismatched to the job seeker. Our solution uses a combination of rules and machine learning technologies, and our analysis shows it to be very effective.

System architecture

The jobs filter product consists of the following components, as shown in the preceding diagram:

  1. Jobs Filter Service. A high throughput, low latency application service that evaluates potential match-ups of jobs to users, identified by ID. If the service determines that the job is appropriate for the user ID, it returns an ALLOW decision; otherwise it returns a VETO. This service is horizontally scalable so it can serve many real-time Indeed applications.
  2. Job Profile. A data storage service that provides high throughput, low latency performance. It retrieves job attributes such as estimated salary, job titles, and job locations at serving time, using Indeed NLP libraries and machine learning technologies to extract or aggregate those attributes.
  3. User Profile. Like the job profile, a high throughput, low latency data storage service, but one that provides attributes about the job seeker rather than the job. It retrieves job seeker attributes such as expected salary, current job title, and preferred job locations at serving time, again using Indeed NLP libraries and machine learning technologies to extract or aggregate them.
  4. Offline Evaluation Platform. Consumes historic data to evaluate rule effectiveness without actually integrating with the upstream applications. It is also heavily used for fine-tuning existing rules, identifying new rules, and validating new models.
  5. Offline Model Training. Component that consists of our offline training algorithms, with which we train models that can be used in the jobs filter rules at serving time for evaluation.

Filter rules to improve job matches

The jobs filter uses a set of rules to improve the quality of jobs displayed to any given job seeker. Rules can be simple: “Do not show a job requiring professional licenses to job seekers who don’t possess such licenses,” or “Do not show jobs to a job seeker if they come with a significant pay cut.” They can also be complex: “Do not show jobs to the job seeker if we are confident the job seeker will not be interested in the job titles,” or “Do not show jobs to the job seeker if our complex predictive models suggest the job seeker will not be interested in them.”

All rules are compiled into a decision engine library. We share this library in our online service and offline evaluation platform.
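A toy sketch of such a decision engine is below. The rule names, profile fields, and the 20% pay-cut threshold are hypothetical illustrations, not Indeed's actual rules:

```python
from dataclasses import dataclass
from typing import Callable, List

ALLOW, VETO = "ALLOW", "VETO"

@dataclass
class JobProfile:
    estimated_salary: int
    requires_license: bool

@dataclass
class UserProfile:
    expected_salary: int
    has_license: bool

# A rule returns True when the user/job pairing should be vetoed.
Rule = Callable[[UserProfile, JobProfile], bool]

def significant_pay_cut(user: UserProfile, job: JobProfile) -> bool:
    # Hypothetical threshold: veto if the job pays under 80% of expectations.
    return job.estimated_salary < 0.8 * user.expected_salary

def missing_license(user: UserProfile, job: JobProfile) -> bool:
    return job.requires_license and not user.has_license

def evaluate(user: UserProfile, job: JobProfile, rules: List[Rule]) -> str:
    # Any triggered rule vetoes the match; otherwise the job is allowed.
    return VETO if any(rule(user, job) for rule in rules) else ALLOW

user = UserProfile(expected_salary=100_000, has_license=False)
print(evaluate(user, JobProfile(50_000, False), [significant_pay_cut, missing_license]))  # VETO
print(evaluate(user, JobProfile(95_000, False), [significant_pay_cut, missing_license]))  # ALLOW
```

Keeping each rule as a small pure function is what makes it practical to share one compiled library between the online service and the offline evaluation platform.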

Although the underlying data for building jobs filter rules might be complex to acquire, most of the heuristic rules themselves are straightforward to design and implement. For example, in one rule we use a user response prediction model to filter out jobs that the job seeker is less likely to be interested in. An Indeed proprietary metric helps us evaluate our performance by measuring the match quality of the job seeker and the given jobs.

Ads ranking and recommender systems commonly rely on user response prediction models, such as click prediction and conversion prediction, to generate a score. They then set a threshold to filter out everything with low scores. This filtering is possible because the models predict positive reactions from users, and low scores indicate poor match quality.

We adopted similar technologies in our jobs filter product, but we used negative matching models when designing our machine learning based rules. We build models to predict negative responses from users, using TensorFlow to build the Wide and Deep model. This facilitates future experimentation with more complex models such as factorization machines or deeper neural networks. The features we use cover major user attributes and job data.

After we train a model that performs well, we export it using the TensorFlow SimpleSave API. We load the exported model into our online systems and serve requests using the TensorFlow Java API. Besides traditional classifier metrics such as AUC, precision, and recall, we also load our model into our offline evaluation platform to validate its performance.

Putting it all to work

We apply our jobs filter in several applications within Indeed. One application is Job2Job, which recommends similar jobs to the job seeker based on the jobs they have clicked or applied for. Using the Job2Job service, we saw a greater than 20% increase in job match quality. When we applied the service to other applications, we observed similar, if not greater, improvements.

Rule-based engines work well for solving corner cases. However, the number of rules can easily spiral out of control. Our design’s hierarchy of rules and machine learning technologies effectively solves this challenge and keeps our system manageable. In the future, we aim to add more features to the model so that it becomes even more effective.

