Jobs Filter: Improving the Job Seeker Experience

As Indeed continues to grow, we’re finding more ways to help people get jobs. We’re also offering job seekers more ways to see those jobs. Job seekers can search directly on Indeed.com, receive recommendations, view sponsored jobs or Indeed Targeted Ads, or receive invitations to apply — to name a few. While each option presents jobs in a slightly different way, our goal for each is the same: showing the right jobs to the right job seekers.

If we miss the mark with the jobs we present, you may lose trust in our ability to connect you with your next opportunity. Our mission is to help people get jobs, not waste their time.

We consider a job to be wrong for a job seeker if, for example, it:

  • Pays less than their expected salary range
  • Requires special licensure they do not have
  • Is located outside their preferred geographic area
  • Is in a related but mismatched field, such as nurses and doctors being offered the same jobs

To mitigate this issue, we built a jobs filter to remove jobs that are obviously mismatched to the job seeker. Our solution uses a combination of rules and machine learning technologies, and our analysis shows it to be very effective.

System architecture

The jobs filter product consists of the following components:

  1. Jobs Filter Service. A high throughput, low latency application service that evaluates potential match-ups of jobs to users, identified by ID. If the service determines that the job is appropriate for the user ID, it returns an ALLOW decision; otherwise it returns a VETO. This service is horizontally scalable so it can serve many real-time Indeed applications.
  2. Job Profile. A data storage service that provides high throughput, low latency performance. It retrieves job attributes such as estimated salary, job titles, and job locations at serving time. The job profile uses Indeed NLP libraries and machine learning technologies to extract or aggregate job attributes.
  3. User Profile. Similar to the Job Profile, but it provides attributes about the job seeker rather than the job. It is the same kind of high throughput, low latency data storage service, retrieving job seeker attributes such as expected salary, current job title, and preferred job locations at serving time. It also uses Indeed NLP libraries and machine learning technologies to extract or aggregate user attributes.
  4. Offline Evaluation Platform. Consumes historic data to evaluate rule effectiveness without actually integrating with the upstream applications. It is also heavily used for fine-tuning existing rules, identifying new rules, and validating new models.
  5. Offline Model Training. Component that consists of our offline training algorithms, with which we train models that can be used in the jobs filter rules at serving time for evaluation.

Filter rules to improve job matches

The jobs filter uses a set of rules to improve the quality of jobs displayed to any given job seeker. Rules can be simple: “Do not show a job requiring professional licenses to job seekers who don’t possess such licenses,” or “Do not show jobs to a job seeker if they come with a significant pay cut.” They can also be complex: “Do not show jobs to the job seeker if we are confident the job seeker will not be interested in the job titles,” or “Do not show jobs to the job seeker if our complex predictive models suggest the job seeker will not be interested in them.”
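To make this concrete, here is a minimal sketch, in Python, of how simple rules might combine into an ALLOW or VETO decision. The attribute names and the 20% pay-cut threshold are made up for illustration; this is not Indeed's actual rule set or engine:

    from dataclasses import dataclass, field
    from typing import Callable, List, Set

    @dataclass
    class JobSeeker:
        expected_salary: float
        licenses: Set[str] = field(default_factory=set)

    @dataclass
    class Job:
        estimated_salary: float
        required_licenses: Set[str] = field(default_factory=set)

    # A rule returns True when the job should be vetoed for this job seeker.
    Rule = Callable[[JobSeeker, Job], bool]

    def significant_pay_cut(seeker: JobSeeker, job: Job) -> bool:
        return job.estimated_salary < 0.8 * seeker.expected_salary

    def missing_license(seeker: JobSeeker, job: Job) -> bool:
        return not job.required_licenses <= seeker.licenses

    RULES: List[Rule] = [significant_pay_cut, missing_license]

    def decide(seeker: JobSeeker, job: Job) -> str:
        return "VETO" if any(rule(seeker, job) for rule in RULES) else "ALLOW"

    decide(JobSeeker(expected_salary=80000, licenses={"RN"}),
           Job(estimated_salary=55000, required_licenses={"RN"}))  # "VETO"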

All rules are compiled into a decision engine library. We share this library in our online service and offline evaluation platform.

Although the underlying data for building jobs filter rules might be complex to acquire, most of the heuristic rules themselves are straightforward to design and implement. For example, in one rule we use a user response prediction model to filter out jobs that the job seeker is less likely to be interested in. An Indeed proprietary metric helps us evaluate our performance by measuring the match quality of the job seeker and the given jobs.

Ads ranking and recommender systems commonly rely on user response prediction models, such as click prediction and conversion prediction, to generate a score. They then set a threshold to filter out everything with low scores. This filtering is possible because the models predict positive reactions from users, and low scores indicate poor match quality.

We adopted similar technologies in our jobs filter product, but we used negative matching models when designing our machine learning based rules. We build models to predict negative responses from users. We use TensorFlow to build a Wide & Deep model. This facilitates future experimentation with more complex models such as factorization machines or other neural networks. The features we use cover major user attributes and job data.
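As a rough illustration, here is a minimal sketch of a Wide & Deep classifier using the TensorFlow 1.x Estimator API. The feature columns, bucket sizes, and hidden-layer widths are assumptions for the example rather than our production configuration:

    import tensorflow as tf  # TensorFlow 1.x

    # Hypothetical features; the real features cover major user attributes and job data.
    job_title = tf.feature_column.categorical_column_with_hash_bucket(
        "job_title", hash_bucket_size=10000)
    expected_salary = tf.feature_column.numeric_column("expected_salary")
    estimated_salary = tf.feature_column.numeric_column("estimated_salary")

    wide_columns = [job_title, expected_salary, estimated_salary]
    deep_columns = [
        tf.feature_column.embedding_column(job_title, dimension=32),
        expected_salary,
        estimated_salary,
    ]

    # Trained to predict a *negative* response: label = 1 means the job seeker
    # is unlikely to be interested in the job.
    model = tf.estimator.DNNLinearCombinedClassifier(
        model_dir="/tmp/jobs_filter_wide_deep",
        linear_feature_columns=wide_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[128, 64, 32],
    )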

After we train a model that performs well, we export it using the TensorFlow SimpleSave API. We load the exported model into our online systems and serve requests using the TensorFlow Java API. Besides traditional classifier metrics such as AUC, precision, and recall, we also load the model into our offline evaluation platform to validate its performance.
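As an illustration of the export step, here is a small, hypothetical example using tf.saved_model.simple_save with a toy TensorFlow 1.x graph. The tensor names and paths are placeholders, not our real model:

    import tensorflow as tf  # TensorFlow 1.x

    features = tf.placeholder(tf.float32, shape=[None, 128], name="features")
    logits = tf.layers.dense(features, 1, name="logits")
    score = tf.sigmoid(logits, name="negative_response_score")

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # Write a SavedModel that the online service can load via the
        # TensorFlow Java API (org.tensorflow.SavedModelBundle).
        tf.saved_model.simple_save(
            sess,
            export_dir="/tmp/jobs_filter_model/1",
            inputs={"features": features},
            outputs={"score": score},
        )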

Putting it all to work

We apply our jobs filter in several applications within Indeed. One application is Job2Job, which recommends similar jobs to the job seeker based on the jobs they have clicked or applied for. Using the Job2Job service, we saw a greater than 20% increase in job match quality. When we applied the service to other applications, we observed similar, if not greater, improvements.

Rule-based engines work well for solving corner cases. However, the number of rules can easily spiral out of control. Our design’s hierarchy of rules and machine learning technologies effectively solves this challenge and keeps our system manageable. In the future, we aim to add more features to the model so that it can become even more effective.

Cross-posted on Medium.


Time-Tested: 7 Ways to Improve Velocity When A/B Testing a New UX

A/B testing holistic redesigns can be tough. Here at Indeed, we learned this firsthand when we tried to take our UX from this to this:


The Indeed mobile Search Engine Results Page (SERP) circa mid-2017 (left) and circa mid-2018 (right)

Things didn’t go so hot. We’d spent months and months coming up with a beautiful design vision, and then months and months trying to test all of those changes at once. A bunch of metrics moved (mostly down), and it was super confusing because we couldn’t figure out which UI changes caused which effects.

So, we took a new approach. In the middle of 2018, we founded the Job Search UI Lab, a cross-functional team with one goal: to scientifically test as many individual UI elements as we could to understand the levers on our job search experience. In just the last 12 months, our team ran over 52 tests with over 502 groups. We’ve since used our learnings to successfully overhaul the Job Search UX on both desktop and mobile browsers.

In this blog post, we share some of the A/B test accelerating approaches we incorporated in the JSUI Lab — approaches that garnered us the 2018 Indeed Engineering Innovation Award. Whether you’re interested in doing a UX overhaul or just trying out a new feature, we think you can incorporate each of these tips into your A/B testing, too!

#1: Have a healthy backlog

No one should ever be waiting around to start development on a new test. Having a healthy backlog of prioritized A/B tests for developers helps you roll out A/B tests one after the other.

One way that the JSUI Lab creates a backlog is by gathering all of our teammates — regardless of their role — at the beginning of each quarter to brainstorm tests. We pull up the current UX on mobile and desktop and ask questions about how each element or feature works. Each question or idea ends up on its own sticky note. We end up with over 40 test ideas, which we then prioritize based on how each test might address a job seeker pain point or improve the design system while minimizing effort. And while we may not get to every test in our backlog, we never have to worry about not having tests lined up.

#2: Write down hypotheses ahead of time

In the hustle and bustle of product development, sometimes experimenters don’t take the time to specify their hypotheses for a given A/B test ahead of time. Sure, not writing hypotheses may save you 10–30 minutes up front. But this can come back to bite you once the test is completed, when your team is looking at dozens of metrics and trying to make a decision about what to do next.

Not only is it confusing to see some metrics go up while others go down, chances are you’re also seeing some false positives (also known as Type I error) the more metrics you look at. You may even catch yourself looking at metrics that couldn’t feasibly be affected by your test (e.g., “How does changing this UI element from orange to yellow affect whether or not a job seeker gets a call back from an employer?!”).

So do yourself a solid. Pick 3–4 metrics that your test could reasonably be expected to move, conduct a power analysis for each one, and write down your hypotheses for the test ahead of time.

#3: Test UI elements one at a time

This one’s a little counterintuitive. It might seem like testing each and every UI element separately would increase the amount of time a holistic UX redesign takes. But by testing elements one at a time, we drew sounder conclusions from our tests. Why? Because we could more clearly establish causality.

Consequently, we were able to take all of the learnings from our tests and roll them into one big test that we were fairly confident would perform well. Rather than see metrics tank like the first time we did a holistic design test, we actually saw some of Indeed’s biggest user engagement wins for 2018, in less than half the time of the first attempt.

By running tests on UI elements one at a time, we were able to iterate on our design vision in a data-driven way and set up our holistic test for success.


Indeed’s mobile SERP circa mid-2019

So, what do these tests look like in practice? Below are a few examples of some of the groups we ran. You’ll notice that the only real difference between the treatments is a minor change, like font size or spacing.

Example test groups, each varying a single UI element such as font size or spacing

#4: Consider multivariate tests

Multivariate tests (sometimes referred to as “factorial tests”) test all possible combinations of each of the factors of interest in your A/B test. So, in a way, they’re more like an A/B/C/D/E/… test! What’s cool about multivariate tests is that you end up with winning combinations that you would have missed had you tested each factor one at a time.

An example from the JSUI Lab illustrates this benefit. We knew from UX research that our job seekers really cared about salary when making the decision to learn more about a job. In 2018, this was how we displayed salary on each result:

An Account Coordinator search result, showing how salary appeared in 2018

We wanted to see if increasing the salary’s visual prominence using color, font size, and bolding would increase job seeker engagement with search results. So, we developed four font size variants, four color variants, and two weight variants (bolded or unbolded). That’s 4 × 4 × 2 = 32 total groups, including the control.

Salary display variants tested on the Account Coordinator listing
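For a sense of scale, here is a quick sketch of enumerating those combinations. The specific font sizes and colors below are illustrative (only the 16pt, green, bold variant is described here), but the arithmetic is the point:

    from itertools import product

    font_sizes = ["12pt", "13pt", "14pt", "16pt"]   # hypothetical values
    colors = ["default", "gray", "blue", "green"]   # hypothetical values
    weights = ["regular", "bold"]

    groups = list(product(font_sizes, colors, weights))
    print(len(groups))  # 32 groups, the control being one of them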

While multivariate tests can speed up how you draw conclusions about different UI elements, they’re not without their drawbacks. First and foremost, you’ll need to weigh the tradeoffs to statistical power, or the likelihood that you’ll detect a given effect if one actually exists. Without sufficient statistical power, you risk missing a real effect of your test (a false negative, also known as Type II error).

A power calculation is a closed-form equation that requires your product team to make tradeoffs among your chosen α-level and β-level, your sample size (n), and the effect size you care about your treatment having (p1). At Indeed, we have the benefit of over 220M unique users each month. That level of traffic may not be available to you and your team. So, to have sufficient statistical power, you’ll potentially need to run your experiment for longer, run groups at higher allocations, cut some groups, or be willing to introduce more Type I error, depending on how small of an effect you’d like to confidently detect.

The closed-form calculation is a power test between two proportions
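If you’d rather not plug numbers into the formula by hand, a statistics library can solve it for you. Here is a minimal sketch using statsmodels, with made-up baseline and treatment proportions:

    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    # Hypothetical numbers: a 5.0% baseline click-through rate, and we want to
    # detect a lift to 5.2% with alpha = 0.05 and 80% power.
    effect_size = proportion_effectsize(0.052, 0.050)
    n_per_group = NormalIndPower().solve_power(
        effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided")
    print(round(n_per_group))  # users needed in each group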

A typical A/B test is usually straightforward to analyze with a t-test. Multivariate tests, however, benefit from multivariate regression models, which allow you to suss out the effects of particular variables and their interaction effects. Here’s a simplified regression equation:

Simplified regression equation

And an example of a regression equation for one of the tests we ran that modified both font size and the spacing on the job card:

Regression equation for the job card font size and spacing test
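Here is a minimal sketch of fitting such a model in Python with statsmodels. The data is made up; a real analysis would use the logged group assignments and outcomes from your experiment:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical per-user data: whether the user clicked a job card, plus the
    # font size (pt) and spacing (px) variant they were assigned to.
    df = pd.DataFrame({
        "clicked":   [1, 0, 1, 1, 0, 0, 1, 0],
        "font_size": [12, 12, 14, 14, 12, 14, 12, 14],
        "spacing":   [4, 8, 4, 8, 8, 4, 8, 4],
    })

    # Main effects for font size and spacing, plus their interaction.
    model = smf.ols("clicked ~ font_size * spacing", data=df).fit()
    print(model.summary())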

Another caveat of multivariate tests is that they can quickly become infeasible. If we had 10 factors with 2 levels each, we’d have a 2^10 multivariate test, with a whopping 1,024 test groups. In cases like these, running what’s called a fractional factorial experiment might make more sense.

Finally, multivariate tests may sometimes yield some zany combinations. In our salary example above, our UX Design team was mildly mortified when we introduced the salary variant with 16pt, green, and bolded font. We lovingly referred to this variant as “the Hulk.” In some cases, it may not be feasible to run a variant at all due to accessibility concerns. In the JSUI Lab, we determine on a case-by-case basis whether the statistical rigor is worth a temporarily poor user experience.

A software engineer listing with the 16pt, green, bolded salary variant (“the Hulk”)

#5: Deploy CSS and JavaScript changes differently

Sometimes a typical deploy cycle can get in the way of testing new features quickly. At Indeed, we developed a tool called CrashTest, which allows us to sidestep the deploy cycle. CrashTest relies on a separate code base of CSS and JavaScript files that are injected into “hooks” in our main code base. While installing CrashTest hooks follows the standard deploy, once hooks are set up, we can inject new CSS and JavaScript treatments and see the changes reflected in our product in just a few minutes.

In the JSUI Lab, we rely on our design technologist Christina to quickly develop CSS and JavaScript treatments for dozens of groups at a time. With CrashTest, Christina can develop her features and get them QAed by Cory. We can push them into production that same day using our open source experimentation management platform Proctor. Had we relied on the typical deploy cycle, it would have taken Christina’s work several more days to be seen by job seekers, and that much more time until we had results from our A/B tests.

#6: Have a democratized experimentation platform

Combing through logs and tables to figure out how your tests performed is not the best use of time. Instead, consider building or buying an experimentation platform for your team. As a data-driven company, Indeed has an internal tool for this called TestStats. The tool displays how each test group performed on key metrics and whether the test has enough statistical power to draw meaningful conclusions at the predetermined effect size. This makes it easy to share and discuss results with others.

#7: Level up everyone’s skills through cross-training

On the JSUI team, we firmly believe that allowing everyone to contribute to team decisions equally helps our team function better. Our teammates include product managers, UX designers, QA engineers, data scientists, program managers, and design technologists. Each of us brings a unique background to the team. Teaching each other the skills we use in our day-to-day jobs helps increase velocity for our A/B tests because we’re able to talk one another’s language more readily.

For instance, I’m a product scientist, and I led a training on A/B testing. This allowed all of the other members of the JSUI Lab to feel more empowered to make test design decisions without my direct guidance every time. Our UX designer Katie shadowed our product managers CJ and Kevin as they turned on tests. Katie now turns on tests herself. Not only does this kind of cross-training reduce the “bus factor” on your team, it can also be a great way of helping your teammates master their subject and improve their confidence in their own expertise.

Now it’s time to test!

Whether you take only one or two tips or all seven, they can be a great way of improving your velocity when running A/B tests. The Job Search UI Lab has already started sharing these simple steps with other teams at Indeed. We think they’re more broadly applicable to other companies and hope you’ll give them a try, too.

And if you’re passionate about A/B testing methods, Indeed’s hiring!


Cross-posted on Medium.


The Evolving Language of Data Science 

…or Grokking the Bokeh of Scarse Meaning Increasement

“You keep using that word. I do not think it means what you think it means.” — Dr. Inigo Montoya


I’m a technical writer at Indeed. One of the many great things about my job is that I get to work with smart people every day. A fair amount of that work involves translating between them. They will all be speaking English, but still might not understand each other. This is a natural consequence of how knowledge advances in general, and how English develops in particular. 

As disciplines evolve, alternate meanings and new words develop to match. That can extend to creating new phrases to name the disciplines themselves (for example, what is a data scientist?). English’s adoption of such new words and meanings has always been pragmatic. Other Western languages have more formal approval processes, such as French’s Académie française and German’s reliance on a single prestigious dictionary. The closest to formal authorities for correct English are popular dictionaries such as the Oxford English Dictionary, the American Heritage Dictionary, and Merriam-Webster. None of them reign supreme.

This informal adoption of new words and meanings can lead to entire conversations in which people don’t realize they’re discussing different things. For example, consider another recently adopted word: “bokeh.” It started as a term in the dialect of professional photography for the aesthetically pleasing blurred look that a shallow depth of field can give a picture. “Bokeh” is also the name of a specific Python data visualization package. So “bokeh” may already be headed for a new meaning within the realm of data science.

As a further example of the fluid nature of English, “bokeh” comes from the Japanese word boke (暈け or ボケ). In its original form it meant “intentional blurring,” as well as sometimes “mental haze,” i.e., confusion.

 

Bokeh of flowers (photo by Sergei Akulich on Unsplash)

Data science Bokeh (https://bokeh.pydata.org/)

The clouded meaning of “data”

A data scientist told me that when she hears “the data” she tends to think of a large amount of information, a set large enough to be comprehensive. She was surprised to see another team’s presentation of  “the data” turn out to be a small table inside a spreadsheet that listed a few numbers. 

This term can also cause confusion between technical fields. Data scientists often interpret “data” as quantitative, while UX researchers interpret “data” as qualitative.

Exploring evolving language with Ngram Viewer

A product science colleague introduced me to the Google Books Ngram Viewer. It’s a search engine that shows how often a word or phrase occurs in the mass of print books Google has scanned. Google’s collection contains most books published in English from AD 1500 to 2008.

I entered some new words that I had come across, and screened out occurrences that weren’t relevant, such as place or person names and abbreviations. I also set the search to start from 1800. Medieval data science could be interesting, but I expect it to be “scarse.” (That’s not a typo.)

Features

When I first came across this newer meaning of “features,” I wasn’t even aware that it had changed. From previous work with software development and UX, I took “features” to mean “aspects of a product that a user will hopefully find useful.” But in data science, a “feature” refers to a covariate in a model: in less technical English, a measurable property or characteristic of the phenomenon being observed.

This dual meaning led me to a fair amount of head-scratching when I was documenting an internal data science application. The application had software features for defining and manipulating data features. 

The following graph indicates this emerging meaning for “feature” by tracking the emergence of a related phrase, “model feature.” 

Ngram graph for the phrase "model feature"

Diving into Ngram’s specific citations, the earliest mention I can find that’s near this meaning is in 1954. Interestingly, it’s from a book on management science:

Screenshot from Google Books summary of "Management Science"

The next use that seems exact turns up in 1969, in the Digest Record from the Association for Computing Machinery, the Society for Industrial and Applied Mathematics, and the Institute of Electrical and Electronics Engineers. Leaving aside the intervening comma, the example is so dead-on that I wonder if we’re looking at nearly the exact moment this new meaning was fully born:

Screenshot of Google Books summary for "The Digest Record"

To grok

“Grok” is an example of English going so far as to steal words from languages that don’t even exist. Robert A. Heinlein coined the word in his 1961 science fiction classic Stranger in a Strange Land. In the novel, the Martian word “grok” literally means “drink” and metaphorically means “understanding something so completely that you and it are one.”

Ngram graph for the word "grok"

Like many other aspects of science fiction and fantasy, computer programming culture absorbed the term. The Jargon File from 1983 shares an early defined example:

GROK (grahk) verb.
  To understand, usually in a global sense. Especially, to understand
all the implications and consequences of making a change. Example:
“JONL is the only one who groks the MACLISP compiler.”

Since then, computer jargon has absorbed “grok” and applied it in many different ways. One immediate example is the source code search and cross-reference engine OpenGrok. It’s intended to let users “grok (profoundly understand) source code and is developed in the open.”

Salt

Salt is an example of a common word that has gone through two steps of technical change. First it gained a meaning relating to information security, and then an additional one in data science. 

As a verb and noun, “salt” originally meant what it sounds like – adding the substance chemically known as NaCl to food for flavoring and preservation. It gained what is perhaps its better-known technical meaning in information security. Adding “salt” to password hashing makes encrypted passwords more difficult to crack. In the word’s further and more recent permutations in data science, “salt” and “resalt” mean to partly randomize the results of an experiment by shuffling them. The following ngram graph tracks the association of “salt” and “resalt” over time. 

This was hard to parse out, and required diving deeply into Ngram’s options. I ended up graphing the different times “salt” modifies the words “food,” “password,” or “data.” Google stopped scanning in new books in 2008 – you can see the barest beginning of this new usage in 2007.

Ngram graph for the word "salt"
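To make the experimentation sense of the word concrete, here is a small, hypothetical sketch of salted bucketing: hash a user ID together with an experiment salt to pick a test group, so that changing the salt (“resalting”) reshuffles every user into new groups:

    import hashlib

    def assign_bucket(user_id: str, salt: str, num_buckets: int = 2) -> int:
        """Deterministically map a user to a bucket, given an experiment salt."""
        digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % num_buckets

    assign_bucket("user-123", salt="jsui-test-v1")
    assign_bucket("user-123", salt="jsui-test-v2")  # resalting may move the user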

Pickling

Traditionally, “pickling” refers to another way to treat food, this one almost entirely for preservation. In Python, it refers to the object serialization method made possible by the pickle module. Data scientists have found increasing use for this term, in ways too recent to find on Ngram.
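For readers new to the Python sense of the word, here is a minimal example of pickling and unpickling an object. The “model” is just a stand-in dictionary:

    import pickle

    model = {"weights": [0.1, 0.2], "bias": 0.05}  # stand-in for a trained model

    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)       # serialize ("pickle") the object to disk

    with open("model.pkl", "rb") as f:
        restored = pickle.load(f)   # read it back later, e.g., at serving time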

The bleeding edge of language?

Here are some words that may just be in the sprouting stage of wider usage.

Scarse

This came from an accidental jumble of words in a meeting, and has remained in use since. It describes situations where data is both scarce (there’s not a lot of it) and sparse (even when there is some, it’s pretty thin). 

This meaning for “scarse” doesn’t appear in the Ngram graph. So it appears we’re seeing mutation and evolution in word form in the wild. Will it take root and prosper, continuing to evolve? Only time will tell.

Increasement

“We should look for the source of that error message increasement.”

I’ve observed this word once in the wild–from me. “Increasement” came to me in a meeting, as a word for the amount of an increase over time. I had never used the word before. It just seemed like a word that could exist. It had meaning similar to other words, and fit those other words’ rules of word construction.

In the context where I used it, its meaning isn’t exactly the same as “increment.” Increment refers to a specific numeric increase. One wouldn’t refer, for example, to an increasing number of users as an increment. You might, however, refer to it as an increasement.

Searching for increasement revealed that this word previously existed but fell out of common usage, as shown on the following graph.

Ngram graph for the word "increasement"

Previous examples:

The Fathers of the English Church

Paul was, that he should return again to these Philippians, and abide, and continue amongst them, and that to their profit; both to the increasement of their faith


The Harleian miscellany; or, A collection of … pamphlets and tracts … in the late earl of Oxford’s library

….when she saw the man grown settled and staid, gave him an assistance, and advanced him to the treasurership, where he made amends to his house, for his mis-spent time, both in the increasement of his estate and honour…

Perhaps it’s time for “increasement” to be rebooted into common use?

Bottom line

Language is likely to continue evolving as long as we use language. Words in general, English words in particular, and words in English technical dialects above all are in a constant state of flux. Just like the many fields of knowledge they discuss.

So if you’re in a technical discussion and others’ responses aren’t quite what you expect, consider re-examining the technical phrases you’re using. 

The people you’re talking with might grok those words quite differently.

 

Cross-posted on Medium.
