Recognize Class Imbalance with Baselines and Better Metrics

In my first machine learning course as an undergrad, I built a recommender system. Using a dataset from a social music website, I created a model to predict whether a given user would like a given artist. I was thrilled when initial experiments showed that for 99% of the points in my dataset, I gave the correct rating – I was wrong only 1% of the time!

When I proudly shared the results with my professor, he revealed that I wasn’t, in fact, a machine learning prodigy. I’d made a mistake called the base rate fallacy. The dataset I used exhibited a high degree of class imbalance. In other words, for 99% of the pairs between user and artist, the user did not like the artist. This makes sense: there are many, many musicians in the world, and it’s unlikely that one person has even heard of half of them (let alone actually enjoys them).


When we’re unprepared for it, class imbalance introduces problems by producing misleading metrics. The undergrad version of me ran face-first into this problem: accuracy alone tells us almost nothing. A trivial model that predicts that no users like any artists can achieve 99% accuracy, but it’s completely worthless. Using accuracy as a metric assumes that all errors are equally costly; this is frequently not the case.

Consider a medical example. If we incorrectly classify a tumor as malignant and request further screening, the cost of that error is worry for the patient and time for the hospital workers. By contrast, if we incorrectly state that a tumor is benign when it is in fact malignant, the patient may die.

Examine the distribution of classes

Moving beyond accuracy, there are a number of metrics to think about in an imbalanced problem. Knowing the distribution of classes is the first line of defense. As a rule of thumb, Prati, Batista, and Silva find that class imbalance doesn’t significantly harm performance in cases where the minority class makes up 10% or more of the dataset. If you find that your dataset is imbalanced more than this, pay special attention.

I recommend starting with an incredibly simple model: pick the most frequent class. scikit-learn implements this in the DummyClassifier. Had I done this with my music recommendation project, I would quickly have noticed that my fancy model wasn’t really learning anything.
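
A minimal sketch of that baseline with scikit-learn, using a synthetic imbalanced dataset in place of the original music data:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# A made-up dataset where roughly 99% of user/artist pairs are negatives.
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Always predict the most frequent class: "no user likes any artist."
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))  # roughly 0.99
```

If your real model can't beat this number by a comfortable margin, it probably isn't learning anything useful.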

Evaluate the cost

In an ideal world, we could calculate the exact costs of a false negative and a false positive. When evaluating our models, we could multiply those costs by the false negative and false positive rates to come up with a number that describes the cost of our model. Unfortunately, these costs are often unknown in the real world, and improving the false positive rate usually harms the true positive rate.
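
A rough sketch of that bookkeeping; the labels, predictions, and per-error costs below are invented for illustration:

```python
from sklearn.metrics import confusion_matrix

# Invented ground truth and predictions for a tiny screening example.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

COST_FALSE_POSITIVE = 1    # unnecessary follow-up: worry and staff time
COST_FALSE_NEGATIVE = 100  # missed malignancy: potentially catastrophic

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
total_cost = fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE
print(f"FP: {fp}, FN: {fn}, total cost: {total_cost}")
```

Two models with identical accuracy can produce very different totals here, which is exactly why accuracy alone misleads.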

To visualize this tradeoff, we can use an ROC curve. Most classifiers can output probability of membership in a certain class. If we choose a threshold (50%, for example), we can declare that all points with probability over the threshold are members of the positive class. Varying the threshold from a low percentage to a high percentage produces different ways of classifying points that have different true positive and false positive rates. Plotting the false positive rate on the x-axis and the true positive rate on the y-axis, we get an ROC curve.
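
A sketch of those mechanics with scikit-learn; the dataset is synthetic and this is not the code behind the figure below:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Each threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))
```

Plotting fpr on the x-axis against tpr on the y-axis traces out the curve; each point corresponds to one choice of threshold.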

As an example, I trained a classifier on the yeast3 dataset from KEEL and created an ROC curve:

[Figure: ROC curve for a LogisticRegression model on the yeast3 dataset]

While we could certainly write the code to draw an ROC curve, the yellowbrick library has this capability built in (and it’s compatible with scikit-learn models). These curves can suggest where to set the threshold for our model. Further, we can use the area under them to compare multiple models (though there are times when this isn’t a good metric).
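
For reference, a sketch of the yellowbrick version, again on synthetic data rather than yeast3 (and assuming a recent yellowbrick release, where show() displays the figure):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

viz = ROCAUC(LogisticRegression(max_iter=1000))
viz.fit(X_train, y_train)   # fit the underlying model
viz.score(X_test, y_test)   # compute the curve on held-out data
viz.show()                  # draw the plot
```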

The next time you’re working on a machine learning problem, consider the distribution of the target variable. A huge first step towards solving class imbalance is recognizing the problem. By using better metrics and visualizations, we can start to talk about imbalanced problems much more clearly.

More on class imbalance

In my upcoming talk at ODSC West, I’ll dive deeper into the causes of class imbalance. I’ll also explore different ways to address it. I hope to see you in October!


Cross-posted on Medium.


You Probably Have Missing Data

Here’s a Guide on When to Care


Strategies to address missing data¹

At Indeed, our mission is to help people get jobs. Searching for a job can be stressful, which is one reason why Indeed is always looking for ways to make the process easier and our products better. Surveys provide us with an ongoing measure of people’s feelings about Indeed’s products and the job search experience.

We realize that when someone is looking for a job (or has just landed one), answering a survey is the last thing they want to do. This means that a lot of the survey data that Indeed collects ends up with missing data. To properly analyze user satisfaction and similar surveys, we need to account for potential missing patterns to ensure we draw correct conclusions.

I’d like to discuss identifying and handling missing data. I’m inspired by my training in the University of Michigan’s Program in Survey Methods. I’ve also wanted to apply the theories about data sets that I learned in academia to Indeed’s terabyte-sized data.

I recently worked on a project that dealt with missing data, and I learned a lot from the analysis. Walking through that process shows how Indeed collects survey data, illustrates the difference between non-response rate and non-response bias, and provides examples of why “randomness” in non-response bias is a good thing.

One quick note: While the examples in this blog post reference Indeed, all data in this blog post are entirely bogus and made up by the author (aka me).

Measuring hires at Indeed


If you have ever canceled a job alert from Indeed, you might have seen this survey:

[Image: the job alert cancellation survey]

The purpose of this survey is to determine whether a job seeker is canceling their job alert because they found a job. This information helps us improve our products and enables us to celebrate the success stories of job seekers.

One challenge with this survey is that only a subset of job seekers completes it. From a user perspective this makes sense — people who unsubscribe from an email notification probably don’t want to spend time answering a survey. This means that we end up with a substantial amount of missing data, especially regarding a key question: did they unsubscribe because they got a job?

Non-response rate vs non-response bias


When discussing missing data, people often conflate response rate with non-response bias. When that misunderstanding gets tangled up with questions of data quality, people tend to assume that a higher response rate means higher-quality survey responses. This is not necessarily the case.

For the following job alert cancellation survey results, you’ll note that 13.8% did not respond.

[Chart: response distribution for the job alert cancellation survey]

Does a non-response rate of 13.8% say something about the quality of responses in the survey?

The short answer is no. While this might initially sound counterintuitive, stay with me! Imagine that Indeed revised the job alert cancellation survey to include a “prefer not to say” option.

[Images: the current survey and the revised survey, which adds a “prefer not to say” option]

After collecting data for a few weeks, we would then see that only 5.8% of job seekers didn’t respond to the revised survey.

[Charts: response distributions for the current and revised surveys]

Does this mean an increase in useful data? Before you start celebrating an 8-percentage-point drop in non-response, take a closer look at the response distribution. You’ll notice that a whopping 57% of job seekers selected “prefer not to say”!

Typically, we treat the response option of “prefer not to say” as missing data. We don’t know whether job seekers selected “prefer not to say” because they are in the process of finalizing an offer, or for some other reason, such as concern that their current employer might find out they have a competing offer. If the latter, there is potential for response bias.

Response bias refers to bias towards selecting a certain response (e.g., “prefer not to say”) due to social desirability or another pressure. Non-response bias, also known as participation bias, refers to declining to respond to the survey at all because of social desirability or another pressure.

The example above shows response bias, because respondents may have selected “prefer not to say” due to the sensitive nature of the question. If the respondents hadn’t completed the survey at all due to the nature of the survey, we would have non-response bias.

This illustrates that non-response rate alone (i.e., the percentage of people who did not respond to your survey) is not the sole indicator of data quality.

For example, if your most recent survey has a 61% response rate while past surveys had a response rate of 80–90%, there’s probably enough rationale to look into potential problems associated with non-response rate. However, if your recent survey has a 4% response rate and past surveys had a response rate of 3–5%, it’s unlikely that there’s a non-response issue with your specific survey. Instead, perhaps your team’s strategy in how surveys are sent (e.g., collecting survey data by landlines versus mobile phones) or how participants are identified for your study (e.g., using outdated and/or incorrect contact information) is leading to low response rates overall.

Whether your response rate is 3% or 61%, a good response rate is not synonymous with low response bias. As we saw with the revised survey, even when the response rate was high, over half of the respondents still selected “prefer not to say,” a response that isn’t usable for data analysis.

In addition to paying attention to the number of people who responded to your survey, you also need to check the distribution of responses to each question. Simple frequency statistics are a great way to notice oddities and potential biases in your data.
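
As a minimal sketch of that kind of frequency check with pandas (the question name and response options here are hypothetical, not Indeed’s actual survey schema):

```python
import pandas as pd

# Hypothetical responses; None marks people who skipped the question entirely.
responses = pd.Series(
    ["Yes", "No", "Prefer not to say", None, "Prefer not to say", "Yes", None],
    name="found_job",
)

# Include missing values in the frequency table so they aren't silently dropped.
print(responses.value_counts(dropna=False, normalize=True))
```

A table like this makes a 57% “prefer not to say” spike hard to miss.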

Missing at random vs missing not at random


Non-response bias comes down to whether the missing data are missing at random or not. Each possibility has different implications for analysis.

Worst-case scenario: MNAR

The worst-case scenario for missing data is if it’s missing not at random (MNAR). In these cases, the missingness is correlated with at least one variable, and it is likely driven by the survey question itself, for example because the question is sensitive. This indicates potential problems with the survey design.

For example, let’s say we ran a chi-square test on the job alert cancellation survey to examine the relationship between survey response (no response vs. responded) and current employment status (employed vs. unemployed). We might see the following findings:

[Tables: observed and expected counts of survey response by employment status]

The above findings show a statistically significant relationship between responding to the job alert cancellation survey and the job seeker’s current employment status, with χ²(1) = 9.70 and p = 0.0018. In other words, significantly more unemployed job seekers responded to the survey than we would expect at random.
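
For reference, here’s roughly how such a test can be run with SciPy; the counts below are made up for illustration and are not the numbers behind the tables above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: no response / responded; columns: employed / unemployed (invented counts).
observed = np.array([
    [420, 180],
    [610, 390],
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
print("Expected counts under independence:\n", expected)
```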

This is an example of “missing not at random” because the survey question itself might have influenced how people chose to respond. Job seekers who are currently unemployed might be more inclined to respond to the job alert cancellation survey, because finding a job after a period of joblessness is a huge deal.

Best-case scenario: MAR

The best-case scenario for missing data is if it’s missing at random (MAR). In these cases, missingness can be correlated with at least one variable, and the missingness is not due to the survey question itself.

You might be thinking that I’m intentionally using jargon to confuse you…and I am! Just kidding, MNAR and MAR are commonly used among survey methodologists when discussing missing data. MNAR and MAR live in the same world of jargon as Type 1 and Type 2 error and mediation and moderation.

For example, let’s imagine that we ran a chi-square test on the job alert cancellation survey results to examine the relationship between survey response (no response vs. responded) and the job seeker’s device type (desktop vs. mobile). This gives us χ²(1) = 75.57 and p < .0001. We might see the following findings:

[Tables: observed and expected counts of survey response by device type]

The above findings show a statistically significant relationship between job seekers responding to the job alert cancellation survey and those job seekers’ devices. Significantly fewer job seekers on desktop computers responded to the job alert cancellation survey than expected.

However, we might also know from previous experience that more job seekers search for jobs on mobile devices than desktops. In that case, the missingness is likely attributable to device popularity and not to the survey question itself.

Additional scenarios for missing data

Additional scenarios for missing data include cases where the data are missing completely at random, and cases where the data are missing by design.

Missing completely at random (MCAR) refers to cases where the missingness is uncorrelated with all other variables. This type of missingness is typically impossible to validate for large and complex data sets like those found in web analytics. With large data sets, especially rapidly growing ones, the chance of finding some spurious but significant correlation is almost 100%.

Missing by design refers to cases where the missingness is intentional. For example, imagine a product change where job seekers are only presented with the job alert cancellation survey if they applied for a job on Indeed in the past 30 days. In this scenario, job seekers who haven’t applied for jobs in the past 30 days will never see the survey, so their data is missing by design, based on their recent application activity.

The challenge of addressing missing data

A core challenge of missing data is determining why it’s missing and which mechanism is at work: MNAR or MAR. While it’s fairly easy to check for significant differences in the distribution of missing data, a p-value and confidence interval will not tell you why the data are missing.

Determining whether data are MNAR or MAR is a daunting task and relies heavily on assumptions. In the MAR example above, we assumed that the missingness was because users were more inclined to use the mobile version of Indeed than the desktop version. However, we only know this pattern exists because we’ve talked with people who noticed similar patterns in users preferring mobile over desktop. Without that knowledge we could very easily have misinterpreted the pattern.

Thankfully, there are strategies you can use to diagnose whether your data are MAR or MNAR.

To start, you can ask yourself:

Does the question ask people to reveal sensitive or socially undesirable behavior?

If it does, be aware that asking people to reveal sensitive information makes it more likely that your data will be MNAR rather than MAR. It might be possible to reduce this effect by assuring respondents of confidentiality and using other strategies to gain their trust.

If the question does not ask people to reveal sensitive information but you’re still concerned the missing data might be MNAR (the bad one), you can try other strategies. If you have longitudinal data from the respondents, you can check whether the non-response pattern you observe is consistent with previous responses at other time points. If the pattern replicates, you can at least say that your observations are not unusual.
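
One way to run that longitudinal check with pandas, sketched on made-up data:

```python
import pandas as pd

# Hypothetical data: one row per survey invitation, with the month it was sent,
# a segment of interest, and whether the person responded.
surveys = pd.DataFrame({
    "month": ["2019-01", "2019-01", "2019-02", "2019-02", "2019-03", "2019-03"],
    "device": ["mobile", "desktop", "mobile", "desktop", "mobile", "desktop"],
    "responded": [1, 0, 1, 1, 0, 1],
})

# Non-response rate per month and segment; a pattern that is stable across
# months suggests the current wave's missingness is not unusual.
non_response = 1 - surveys.groupby(["month", "device"])["responded"].mean()
print(non_response.unstack())
```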

Of course, just because the non-response pattern replicates doesn’t mean you’re in the clear to declare your data are MAR rather than MNAR. If, for example, you’re asking people to report socially undesirable behavior, you’d likely see the same MNAR pattern over time.

If you don’t have access to longitudinal data, a second solution is to talk with people on your team or in your organization, or to look at papers from related research, to see if anyone else has observed similar patterns of non-response. Another “Research 2.0” solution is to crowdsource by reaching out to colleagues on Slack and other social media. There you might discover whether the non-response pattern you’re observing is typical or atypical.

This relatively simple yes/no logic isn’t perfect, but using the strategies above is still better than a head-in-the-sand “missing data never matters” approach.

Missing data isn’t always the end of the world

Not all missing data is inherently tied to response bias. It can be missing by design, missing completely at random (MCAR), missing not at random (MNAR), or missing at random (MAR). In the job alert cancellation survey, we saw how the survey design might lead to different scenarios of missingness.

Are you a data scientist or data aficionado who is also a critical thinker? If so, remember to take a deep dive into your missing data.

Suggested reading

De Leeuw, E. D. (2001). Reducing missing data in surveys: An overview of methods. Quality and Quantity, 35(2), 147–160. A concise article on missing data and response bias.

Kish, L. (1997). Survey sampling. Although this book is a bit dense, it’s a go-to resource for learning more about sampling bias.

Tourangeau, R., & Yan, T. (2007). Sensitive questions in surveys. Psychological Bulletin, 133(5), 859.

About the author

For a bit of background about myself, I’m a University of Michigan Ph.D. graduate (Go Blue!) who recently transitioned to industry as a Quantitative UX Researcher at Indeed.

Feel free to message me if you want to chat about my transition from academia to industry or if you just want to muse about missing data 😉

Interested in joining Indeed? Check out our available opportunities.


[1] It’s worth acknowledging that the topic of non-response bias is an enormous field. Several textbooks and many dissertations are available on this topic. For a deeper understanding of the field, check out my suggested reading section above. This is designed to be an easy resource you can reference when you are dealing with missing data.


Cross-posted on Medium.


The FOSS Contributor Fund: Six Months In

Early this year we began an internal FOSS Contributor Fund and invited everyone at Indeed to participate. Why? We wanted to support the open source community in a meaningful way. So we decided to democratize the decision-making process and encourage individual contributions.


FOSS: Free or Open Source Software

Our unique approach aligns with the ideals that have helped sustain the open source community. These ideals also drive our own company culture. The FOSS Contributor Fund has now reached the 6-month mark. We want to talk here about how it works and reflect on what we’ve learned. 

How does the FOSS Contributor Fund work?

Each month we select an open source project that will receive a $10,000 USD donation. Anyone in the company can nominate a project, as long as it meets four criteria. The project:

  • must be in use by the company or one of its subsidiaries
  • must use an OSI-approved license
  • must have some mechanism for receiving funds
  • cannot be employee-owned

Any Indeed employee who makes an open source contribution during the monthly voting cycle can vote on a nominated project. This means that those who are active with open source projects also make the decision. 

Once the votes are cast for the cycle, we count them and declare the winner. Then, we contact the receiving project and make sure that it is prepared to receive the funds. Recipients decide how to use the funds to best suit their project’s needs.

Results to date

Since launching the program, employees from groups like Engineering, QA, Site Reliability Engineering, and Technical Content Management have participated. We’ve also seen strong employee engagement with open source since announcing the fund: over 3,000 contributions from Indeed employees. Indeed has distributed funds to five open source projects: Django, Git, Homebrew, pandas, and pytest. We’ve also selected ESLint to receive future funds.

 

The Git project relies completely on donations for its funds. Git has a few core contributors and a very long tail of developers who only work part-time on the project. These funds help us keep them involved, as they pay for travel to conferences, including our yearly Contributor Summit. We also use project funds for the Outreachy program, increasing the diversity of our contributor base.
— Jeff King, Git


Funds will be mainly used to pay for future Homebrew maintainers to meet in person and may be used to pay for contractors to do some repeatedly postponed tasks around updating our CI setup.
— Mike McQuaid, Homebrew

 

Takeaways

After launching the fund, we learned that Indeedians across the company wanted to participate. Here are some takeaways for encouraging that wider participation.

Open up the nominating process to all employees, not just those who’ve registered their GitHub IDs or self-identified their work in open source. If an employee believes a project is important enough to your company that you should support it—and they nominate it—that is enough. 

Be aware of the parameters that you set for project nominations. You want to ensure a smooth donation process on the backend. You don’t want to create roadblocks and discourage participation. Essential: the nominated project must have a mechanism for receiving funds. A small number of our nominations required a support contract or a subscription. Because the fund operates within a fixed budget, recurring payments were not possible. Yet, these nominations did provide visibility into projects that we would not have known about otherwise. 

Continue asking which projects would benefit most from a one-time injection of funds. Our initiative has highlighted dependencies about which people have strong feelings. Yet, the initiative does not necessarily show which projects are in most need of financial support. To date, the projects we support are reasonably well known. They generally have a community of contributors and supporters. These projects need support, but there are many other lesser-known projects that need support too. We continue to dig deeper as we work to identify and highlight projects that will benefit the most from these donations. 

Use nominated projects as a starting point for getting more involved in open source. Figuring out where to start is often the biggest barrier to making that first open source contribution. Picking from the list of projects nominated during the initiative narrows the choices and lets us point new contributors toward projects that someone believes Indeed should be supporting.

 

The pandas community is grateful for the funds donated by Indeed’s FOSS Contributor Fund. We will put these to good use. We intend to modernize our documentation, improve our performance metric and benchmarking tools, and help fund annual core developer sprints, where we can really work on pandas in depth. This donation helps unlock more resources to continually improve pandas and make it the best data science library toolkit in any language.
—  Jeff Reback, pandas


Pytest will use the donation to fund overall maintenance and future gatherings of pytest developers, possibly including another development sprint sometime in the next year.
— Bruno Oliveira, pytest


Donations from companies like Indeed allow ESLint, a project that is currently run entirely by volunteers, to pursue improvements that wouldn’t otherwise be possible. In the short term, funds are being allocated for translating our documentation into different languages; in the long term, funds will be used to support individual developers who contribute regularly to ESLint to ensure that development and maintenance can continue.
— Nicholas C. Zakas, ESLint TSC Member

 

We’re committed

Indeed’s Open Source Program Office is committed to helping sustain the projects we depend on. We’re committed financially and by encouraging our internal community to contribute to the projects they use. The FOSS Contributor Fund is a great way to marry the two. We gave our open source contributors a voice in the process and are enjoying these benefits: broad contribution activity, increased visibility into a wider range of projects, and a great list of projects we can use to onboard new contributors. 

Learn more about Indeed’s open source program.  


Cross-posted on Medium.
