The Evolving Language of Data Science 

…or Grokking the Bokeh of Scarse Meaning Increasement

“You keep using that word. I do not think it means what you think it means.” — Dr. Inigo Montoya


I’m a technical writer at Indeed. One of the many great things about my job is that I get to work with smart people every day. A fair amount of that work involves translating between them. They will all be speaking English, but still might not understand each other. This is a natural consequence of how knowledge advances in general, and how English develops in particular. 

As disciplines evolve, alternate meanings and new words develop to match. That can extend to creating new phrases to name the disciplines themselves (for example, what is a data scientist?). English’s adoption of such new words and meanings has always been pragmatic. Other Western languages have more formal approval processes, such as French’s Académie française and German’s reliance on a single prestigious dictionary. The closest to formal authorities for correct English are popular dictionaries such as the Oxford English Dictionary, the American Heritage Dictionary, and Merriam-Webster. None of them reign supreme.

This informal adoption of new words and meanings can lead to entire conversations in which people don’t realize they’re discussing different things. For example, consider another recently adopted word: “bokeh.” This started as a term in the dialect of professional photography for the aesthetically pleasing blurred look that a shallow depth of field can give a picture. “Bokeh” is also the name of a specific Python data visualization package. So “bokeh” may already be headed for a new meaning within the realm of data science.

As a further example of the fluid nature of English, “bokeh” comes from the Japanese word boke (暈け or ボケ). In its original form it meant “intentional blurring,” as well as sometimes “mental haze,” i.e., confusion.

 

Two rows of flowers that become blurry in the distance; a representation of bokeh in photography.

Bokeh of flowers by Sergei Akulich on Unsplash

 

A montage of various visualizations related to the data science product Bokeh.

Data science bokeh — https://bokeh.pydata.org/

The clouded meaning of “data”

A data scientist told me that when she hears “the data” she tends to think of a large amount of information, a set large enough to be comprehensive. She was surprised to see another team’s presentation of  “the data” turn out to be a small table inside a spreadsheet that listed a few numbers. 

This term can also cause confusion between technical fields. Data scientists often interpret “data” as quantitative, while UX researchers interpret “data” as qualitative.

Exploring evolving language with Ngram Viewer

A product science colleague introduced me to the Google Books Ngram Viewer. It’s a search engine that shows how often a word or phrase occurs in the mass of print books Google has scanned. Google’s collection contains most books published in English from AD 1500 to 2008.

I entered some new words that I had come across, and screened out occurrences that weren’t relevant, such as place or person names and abbreviations. I also set the search to start from 1800. Medieval data science could be interesting, but I expect it to be “scarse.” (That’s not a typo.)

Features

When I first came across this newer meaning of “features,” I wasn’t even aware that it had changed. From previous work with software development and UX, I took “features” to mean “aspects of a product that a user will hopefully find useful.” But in data science, a “feature” corresponds to a covariate in a model. In less technical English, it’s a measurable property or characteristic of the phenomenon being observed.

This dual meaning led me to a fair amount of head-scratching when I was documenting an internal data science application. The application had software features for defining and manipulating data features. 
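To make the data science sense concrete, here’s a minimal, invented sketch in Python. Each column in the small table below is a “feature” in the modeling sense (a measurable property of each observation), which is a very different thing from a feature in a product release note. The column names are purely illustrative and not from any real dataset.

import pandas as pd

# Each row is an observation; each column is a "feature" in the data
# science sense: a measurable property a model can learn from.
job_postings = pd.DataFrame({
    "title_length":    [42, 17, 63, 28],  # characters in the job title
    "salary_listed":   [1, 0, 1, 0],      # whether a salary is shown
    "days_since_post": [3, 30, 1, 12],    # age of the posting in days
})

# A model consumes this feature matrix (often called X), while a
# "feature" of the application itself might be the button that lets
# a user define these columns in the first place.
X = job_postings.values
print(X.shape)  # (4, 3): 4 observations, 3 features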

The following Ngram graph indicates this emerging meaning for “feature” by tracking the emergence of a related phrase, “model feature.” 

 

A line chart showing the rise in the use of the term "model feature" from 1800 to 2000.

The usage of the term “model feature” peaks sometime in the 1990s.

 

Diving into Ngram’s specific citations, the earliest mention I can find that’s near this meaning is in 1954. Interestingly, it’s from a book on management science:

Screenshot from Google Books summary of "Management Science"

The next use that seems exact turns up in 1969, in the Digest Record from the Association for Computing Machinery, the Society for Industrial and Applied Mathematics, and the Institute of Electrical and Electronics Engineers. Leaving aside the intervening comma, the example is so dead-on that I wonder if we’re looking at nearly the exact moment this new meaning was fully born:

Screenshot of Google Books summary for "The Digest Record"

To grok

“Grok” is an example of English going so far as to steal words from languages that don’t even exist. Robert A. Heinlein coined the word in his 1961 science fiction classic Stranger in a Strange Land. In the novel, the Martian word “grok” literally means “drink” and metaphorically means “to understand something so completely that you and it are one.”

 

A line chart showing the trend for the use of the term "grok" from 1800 to 2000.

Usage of the term “grok” increases starting in the 1960s.

 

As with many other bits of science fiction and fantasy, computer programming culture absorbed the term. The Jargon File from 1983 shares an early definition and example:

GROK (grahk) verb.
  To understand, usually in a global sense. Especially, to understand
all the implications and consequences of making a change. Example:
“JONL is the only one who groks the MACLISP compiler.”

Since then, computer jargon has absorbed “grok” and applied it in many different ways. One immediate example is the source code search and cross-reference engine OpenGrok, which, according to its own description, helps users “grok (profoundly understand) source code” and “is developed in the open.”

Salt

Salt is an example of a common word that has gone through two steps of technical change. First it gained a meaning relating to information security, and then an additional one in data science. 

As a verb and a noun, “salt” originally meant what it sounds like: adding the substance chemically known as NaCl to food for flavoring and preservation. It gained what is perhaps its better-known technical meaning in information security, where adding “salt” to password hashing makes the stored password hashes more difficult to crack. In the word’s further and more recent permutation in data science, “salt” and “resalt” mean to partly randomize the results of an experiment by shuffling them. The following Ngram graph tracks the evolving associations of “salt” over time.

This was hard to parse out, and required diving deeply into Ngram’s options. I ended up graphing how often “salt” modifies the words “food,” “password,” or “data.” Google stopped scanning in new books in 2008; you can see the barest beginning of this new usage in 2007.

 

A line chart representing the use of salt with regard to food, passwords, and data.

From 2000 to 2008, salt in the context of food is most used, followed by salt in the information security sense.
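To make the two technical senses of “salt” concrete, here’s a minimal Python sketch using only the standard library. The password half shows the information security meaning; the experiment half is one common interpretation of salting and resalting bucket assignments, with function names I’ve made up for illustration.

import hashlib
import os

# Information security sense: a random salt makes identical passwords
# hash to different values, which defeats precomputed lookup tables.
def hash_password(password):
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

# Experimentation sense (one common interpretation): hashing a user ID
# together with a salt assigns users to buckets deterministically.
# Changing the salt ("resalting") reshuffles those assignments.
def assign_bucket(user_id, salt, n_buckets=2):
    h = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(h, 16) % n_buckets

print(assign_bucket("user-123", salt="experiment-v1"))
print(assign_bucket("user-123", salt="experiment-v2"))  # may change after a resalt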

 

Pickling

Traditionally, “pickling” refers to another way to treat food, this one almost entirely for preservation. In Python, “pickling” refers to object serialization via the pickle module. Data scientists have found increasing use for this term, in ways too recent to show up in Ngram.
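As a minimal illustration of the Python sense of the word, here’s how an object (a small dictionary standing in for a trained model) can be pickled to disk and loaded back. The filename and contents are arbitrary examples.

import pickle

# Any Python object (here, a stand-in for a trained model) can be
# serialized ("pickled") to bytes and written to disk...
model = {"weights": [0.2, 0.8], "trained_on": "2019-06-01"}

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and later deserialized ("unpickled") back into an equivalent object.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

assert restored == model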

The bleeding edge of language?

Here are some words that may just be in the sprouting stage of wider usage.

Scarse

This came from an accidental jumble of words in a meeting, and has remained in use since. It describes situations where data is both scarce (there’s not a lot of it) and sparse (even when there is some, it’s pretty thin). 

This meaning for “scarse” doesn’t appear in the Ngram graph. So it appears we’re seeing mutation and evolution in word form in the wild. Will it take root and prosper, continuing to evolve? Only time will tell.

Increasement

“We should look for the source of that error message increasement.”

I’ve observed this word once in the wild: from me. “Increasement” came to me in a meeting, as a word for the amount of an increase over time. I had never used the word before. It just seemed like a word that could exist. It had a meaning similar to other words, and fit those other words’ rules of word construction.

In the context where I used it, its meaning isn’t exactly the same as “increment.” An increment refers to a specific numeric increase. One wouldn’t refer, for example, to a growing number of users as an increment. You might, however, refer to it as an increasement.

Searching for increasement in Ngram revealed that this word previously existed but fell out of common usage, as shown on the following graph.

 

A line chart representing the usage of the term "increasement" from 1800 to 2000.

The word “increasement” tapers in usage, with its peak in the 1810s, followed by a gradual decline.

 

Previous examples:

Book: The Fathers of the English Church

Paul was, that he should return again to these Philippians, and abide, and continue amongst them, and that to their profit; both to the increasement of their faith


Book: The Harleian miscellany; or, A collection of … pamphlets and tracts … in the late earl of Oxford’s library

….when she saw the man grown settled and staid, gave him an assistance, and advanced him to the treasurership, where he made amends to his house, for his mis-spent time, both in the increasement of his estate and honour…

Perhaps it’s time for “increasement” to be rebooted into common use?

Bottom line

Language will keep evolving for as long as we keep using it. Words in general, English words in particular, and words in English technical dialects above all are in a constant state of flux, just like the many fields of knowledge they describe.

So if you’re in a technical discussion and others’ responses aren’t quite what you expect, consider re-examining the technical phrases you’re using. 

The people you’re talking with might grok those words quite differently.

 

The Evolving Language of Data Science—cross-posted on Medium.

Recognize Class Imbalance with Baselines and Better Metrics

In my first machine learning course as an undergrad, I built a recommender system. Using a dataset from a social music website, I created a model to predict whether a given user would like a given artist. I was thrilled when initial experiments showed that for 99% of the points in my dataset, I gave the correct rating – I was wrong only 1% of the time!

When I proudly shared the results with my professor, he revealed that I wasn’t, in fact, a machine learning prodigy. I’d made a mistake called the base rate fallacy. The dataset I used exhibited a high degree of class imbalance. In other words, for 99% of the pairs between user and artist, the user did not like the artist. This makes sense: there are many, many musicians in the world, and it’s unlikely that one person has even heard of half of them (let alone actually enjoys them).

A flock of birds on telephone wires. While most of the birds are grouped together on the right, one is perched alone on the left.

When we’re unprepared for it, class imbalance introduces problems by producing misleading metrics. The undergrad version of me ran face-first into this problem: accuracy alone tells us almost nothing. A trivial model that predicts that no users like any artists can achieve 99% accuracy, but it’s completely worthless. Using accuracy as a metric assumes that all errors are equally costly; this is frequently not the case.

Consider a medical example. If we incorrectly classify a tumor as malignant and request further screening, the cost of that error is worry for the patient and time for the hospital workers. By contrast, if we incorrectly state that a tumor is benign when it is in fact malignant, the patient may die.

Examine the distribution of classes

Moving beyond accuracy, there are a number of metrics to consider for an imbalanced problem. Knowing the distribution of classes is the first line of defense. As a rule of thumb, Prati, Batista, and Silva find that class imbalance doesn’t significantly harm performance when the minority class makes up 10% or more of the dataset. If your dataset is more imbalanced than that, pay special attention.

I recommend starting with an incredibly simple model: pick the most frequent class. scikit-learn implements this in the DummyClassifier. Had I done this with my music recommendation project, I would quickly have noticed that my fancy model wasn’t really learning anything.
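Here’s a minimal sketch of that baseline check, using a synthetic imbalanced dataset rather than my original music data. The point is the comparison: if a real model barely beats the dummy, accuracy isn’t telling us much.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced problem: roughly 99% negative class.
X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))  # about 0.99
print("model accuracy:   ", model.score(X_test, y_test))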

Evaluate the cost

In an ideal world, we could calculate the exact costs of a false negative and a false positive. When evaluating our models, we could multiply those costs by the false negative and false positive rates to come up with a number that describes the cost of our model. Unfortunately, these costs are often unknown in the real world, and improving the false positive rate usually harms the true positive rate.
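If we did know (or could reasonably estimate) those costs, the bookkeeping itself is simple. The numbers below are hypothetical, just to show the shape of the calculation:

# Hypothetical per-error costs: a missed malignant tumor (false negative)
# is far more costly than an unnecessary follow-up screening (false positive).
cost_fp = 50      # cost of one false positive
cost_fn = 5_000   # cost of one false negative

# Hypothetical error rates for some evaluated model.
fp_rate = 0.08
fn_rate = 0.02

# A single number summarizing how costly the model's mistakes are.
# Tightening one rate usually loosens the other, which is the tradeoff
# the ROC curve below visualizes.
model_cost = cost_fp * fp_rate + cost_fn * fn_rate
print(model_cost)  # 104.0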

To visualize this tradeoff, we can use an ROC curve. Most classifiers can output a probability of membership in a given class. If we choose a threshold (50%, for example), we can declare that all points with probability above the threshold belong to the positive class. Varying the threshold from low to high produces different classifications of the points, each with its own true positive and false positive rates. Plotting the false positive rate on the x-axis and the true positive rate on the y-axis gives us an ROC curve.

As an example, I trained a classifier on the yeast3 dataset from KEEL and created an ROC curve:

An ROC curve for LogisticRegression on yeast3. The curve hugs both the leftmost and upper borders of the graph.

After training the model, the ROC curve hugs the upper-left corner of the plot, indicating a high true positive rate at a low false positive rate.

While we could certainly write the code to draw an ROC curve, the yellowbrick library has this capability built in (and it’s compatible with scikit-learn models). These curves can suggest where to set the threshold for our model. Further, we can use the area under them to compare multiple models (though there are times when this isn’t a good metric).
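Here’s a minimal sketch of that approach with yellowbrick. I’m substituting a synthetic imbalanced dataset for yeast3, so the exact curve will differ from the one above.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC

# Synthetic imbalanced dataset standing in for yeast3.
X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# ROCAUC wraps a scikit-learn estimator and draws ROC curves from the
# fitted model's predicted probabilities.
visualizer = ROCAUC(LogisticRegression(max_iter=1000))
visualizer.fit(X_train, y_train)   # fit the wrapped model
visualizer.score(X_test, y_test)   # compute ROC/AUC on the test split
visualizer.show()                  # render the plot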

The next time you’re working on a machine learning problem, consider the distribution of the target variable. A huge first step towards solving class imbalance is recognizing the problem. By using better metrics and visualizations, we can start to talk about imbalanced problems much more clearly.

More on class imbalance

In my upcoming talk at ODSC West, I’ll dive deeper into the causes of class imbalance. I’ll also explore different ways to address the problem. I hope to see you in October!


Class Imbalance—cross-posted on Medium.

You Probably Have Missing Data

Here’s a Guide on When to Care

Scrabble tiles seemingly spelling out the word "MISSING," though the two "I" tiles are missing.

Strategies to address missing data¹

At Indeed, our mission is to help people get jobs. Searching for a job can be stressful, which is one reason why Indeed is always looking for ways to make the process easier and our products better. Surveys provide us with an ongoing measure of people’s feelings about Indeed’s products and the job search experience.

We realize that when someone is looking for a job (or has just landed one), answering a survey is the last thing they want to do. This means that a lot of the survey data that Indeed collects ends up incomplete. To properly analyze user satisfaction and similar surveys, we need to account for potential patterns of missingness to ensure we draw correct conclusions.

I’d like to discuss identifying and handling missing data. I’m inspired by my training in the University of Michigan’s Program in Survey Methods. I’ve also wanted to apply the theories about data sets that I learned in academia to Indeed’s terabyte-sized data.

I recently worked on a project that dealt with missing data. I learned a lot from the analysis. Walking through this process can show how Indeed collects survey data, illustrate the difference between non-response rate and non-response bias, and provide examples of why “randomness” in non-response bias is a good thing.

One quick note: While the examples in this blog post reference Indeed, all data in this blog post are entirely bogus and made up by the author (aka me).

Measuring hires at Indeed

If you have ever canceled a job alert from Indeed, you might have seen this survey:

Survey asking users why they cancelled their job alerts. The options available are: "I found a job on Indeed," "I found a job elsewhere," or "Other."

The purpose of this survey is to determine whether a job seeker is canceling their job alert because they found a job. This information helps us improve our products and enables us to celebrate the success stories of job seekers.

One challenge with this survey is that only a subset of job seekers completes it. From a user perspective this makes sense — people who unsubscribe from an email notification probably don’t want to spend time answering a survey. This means that we end up with a substantial amount of missing data, especially regarding a key question: did they unsubscribe because they got a job?

Non-response rate vs non-response bias

When discussing missing data, people often conflate response rate with non-response bias. When response rate is further conflated with data quality, people might assume that a higher response rate means higher-quality survey responses. This is not necessarily the case.

For the following job alert cancellation survey results, you’ll note that 13.8% did not respond.

Survey results with a "no response" rate of 13.84%.

Detailed description of chart.

This chart shows survey response results, including the total count of responses and each option’s percentage of the whole. According to this chart:

  • 49,109 survey participants (33.99%) selected “I found a job on Indeed”
  • 43,256 survey participants (29.94%) selected “I found a job elsewhere”
  • 32,105 survey participants (22.22%) selected “Other”
  • 20,000 users (13.84%) did not respond to the survey

 

Does a non-response rate of 13.8% say something about the quality of responses in the survey?

The short answer is no. While this might initially sound counterintuitive, stay with me! Imagine that Indeed revised the job alert cancellation survey to include a “prefer not to say” option.

 

A comparison of the original and revised surveys. The only difference is the inclusion of the “Prefer not to say” option.

 

After collecting data for a few weeks, we would then see that only 5.8% of job seekers didn’t respond to the revised survey.

A comparison of the original survey results and revised survey results. Originally, the “no response” rate was 13.84%.

Detailed description of charts.

In the original survey response results:

  • 49,109 survey participants (33.99%) selected “I found a job on Indeed”
  • 43,256 survey participants (29.94%) selected “I found a job elsewhere”
  • 32,105 survey participants (22.22%) selected “Other”
  • 20,000 users (13.84%) did not respond to the survey

With the addition of the new “Prefer not to say” survey option, the results change to the following:

  • 49,109 survey participants (14.31%) selected “I found a job on Indeed”
  • 43,256 survey participants (12.60%) selected “I found a job elsewhere”
  • 32,105 survey participants (9.35%) selected “Other”
  • 20,000 users (5.83%) did not respond to the survey
  • 198,772 survey participants (57.91%) selected “Prefer not to say”

 

Does this mean an increase in useful data? Before you start celebrating an 8-percentage-point decrease in non-response, take a closer look at the response distribution. You’ll notice that a whopping 58% of job seekers selected “prefer not to say”!

Typically, we treat the response option of “prefer not to say” as missing data. We don’t know if job seekers selected “prefer not to say” because they are in the process of finalizing an offer, or for some other reason, such as concern that their current employer might find out they have a competing offer. Either way, there is potential for response bias.

Response bias refers to a tendency to select a certain response (e.g., “prefer not to say”) due to social desirability or another pressure. Non-response bias, also known as participation bias, arises when people decline to respond to the survey at all because of social desirability or another pressure.

The example above shows response bias, because respondents may have selected “prefer not to say” due to the sensitive nature of the question. If the respondents hadn’t completed the survey at all due to the nature of the survey, we would have non-response bias.

This illustrates that the non-response rate (i.e., the percentage of people who didn’t respond to your survey) is not, by itself, an indicator of data quality.

For example, if your most recent survey has a 61% response rate while past surveys had a response rate of 80–90%, there’s probably enough rationale to look into potential problems associated with non-response rate. However, if your recent survey has a 4% response rate and past surveys had a response rate of 3–5%, it’s unlikely that there’s a non-response issue with your specific survey. Instead, perhaps your team’s strategy in how surveys are sent (e.g., collecting survey data by landlines versus mobile phones) or how participants are identified for your study (e.g., using outdated and/or incorrect contact information) is leading to low response rates overall.

Whether you have a 3% or a 61% response rate, a high response rate is not synonymous with low response bias. As we saw with the revised survey, even when the response rate was high, over half of the respondents still selected “prefer not to say,” a response that isn’t usable for data analysis.

In addition to paying attention to the number of people who responded to your survey, you also need to check the distribution of responses to each question in your survey. Simple frequency statistics are a great way to notice oddities and potential biases in your data.
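As a minimal sketch (with an invented DataFrame and column name), a one-line frequency table is often all it takes to spot this kind of skew:

import pandas as pd

# Invented survey responses; None marks people who never answered at all.
responses = pd.DataFrame({
    "cancel_reason": [
        "I found a job on Indeed", "Prefer not to say", "Prefer not to say",
        "I found a job elsewhere", None, "Other", "Prefer not to say",
    ]
})

# dropna=False keeps true non-response visible alongside "Prefer not to say".
print(responses["cancel_reason"].value_counts(dropna=False, normalize=True))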

Missing at random vs missing not at random

Non-response bias can be summarized as whether the missing data are random or non-random. Each possibility has different implications for analysis.

Worst-case scenario: MNAR

The worst-case scenario for missing data is if it’s missing not at random (MNAR). In these cases, the missingness can be correlated with at least one variable, and the missingness is likely due to the survey question itself (for example, because the question is sensitive). This indicates potential problems with the survey design.

For example, let’s say we ran a chi-square test on the job alert cancellation survey to examine the relationship between survey response (no response vs. responded) and current employment status (employed vs. unemployed). We might see the following findings:

A comparison of observed and expected survey results for employment status.

Detailed description of charts.

In the observed survey response results:

  • 9,623 users who did not respond were unemployed
  • 2,572 users who did not respond were employed
  • 46,731 users who did respond were unemployed
  • 13,474 users who did respond were employed

In the expected survey response results:

  • 9,856.69 users who did not respond were unemployed
  • 3,118.31 users who did not respond were employed
  • 46,009.31 users who did respond were unemployed
  • 14,555.69 users who did respond were employed

 

The above findings show a statistically significant relationship between responding to the job alert cancellation survey and the job seeker’s current employment status, with a test statistic and p-value of χ²(1) = 9.70 and p = 0.0018, respectively. Thus, significantly more unemployed job seekers responded to the job alert cancellation survey than would be expected at random.

This is an example of “missing not at random” because the survey question itself might have influenced how people chose to respond. Job seekers who are currently unemployed might be more inclined to respond to the job alert cancellation survey, because finding a job after a period of joblessness is a huge deal.
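For readers who want to run this kind of check themselves, here’s a minimal sketch using SciPy and the observed counts from the list above. With the default continuity correction for a 2×2 table, it reproduces a statistic of roughly χ²(1) = 9.70 and p ≈ 0.0018, and it also returns the expected counts under independence.

from scipy.stats import chi2_contingency

# Observed counts from the job alert cancellation example above:
# rows = (did not respond, responded), columns = (unemployed, employed).
observed = [
    [9_623, 2_572],
    [46_731, 13_474],
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")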

Best-case scenario: MAR

The best-case scenario for missing data is if it’s missing at random (MAR). In these cases, missingness can be correlated with at least one variable, and the missingness is not due to the survey question itself.

You might be thinking that I’m intentionally using jargon to confuse you…and I am! Just kidding: MNAR and MAR are commonly used among survey methodologists when discussing missing data. They live in the same world of jargon as Type I and Type II errors, and mediation and moderation.

For example, let’s imagine that we ran a chi-square test on the job alert cancellation survey results to examine the relationship between survey response (no response vs. responded) and the job seeker’s device type (desktop vs. mobile). This gives us a test statistic and p-value of χ²(1) = 75.57 and p < 0.0001, respectively. We might see the following findings:

A comparison of observed and expected results for device usage. The total number of desktop participants is 22,353 vs. 37,218.57, respectively.

Detailed description of charts.

In the observed survey response results:

  • 6,980 users who did not respond were mobile users
  • 5,995 users who did not respond were desktop users
  • 38,212 users who responded were mobile users
  • 22,353 users who responded were desktop users

In the expected survey response results:

  • 5,001.57 users who did not respond were mobile users
  • 7,973.43 users who did not respond were desktop users
  • 23,346.43 users who responded were mobile users
  • 37,218.57 users who responded were desktop users

 

 

The above findings show a statistically significant relationship between job seekers responding to the job alert cancellation survey and those job seekers’ devices. Significantly fewer job seekers on desktop computers responded to the job alert cancellation survey than expected.

However, we might also know from previous experience that more job seekers search for jobs on mobile devices than desktops. In that case, the missingness is likely attributable to device popularity and not to the survey question itself.

Additional scenarios for missing data

Additional scenarios for missing data include cases where the data are missing completely at random, and cases where the data are missing by design.

Missing completely at random (MCAR) refers to cases where the missingness is uncorrelated with all other variables. This type of missingness is typically impossible to validate for large and complex data sets like those found in web analytics. With large data sets, especially rapidly growing ones, the chance of finding some spurious but significant correlation is almost 100%.

Missing by design refers to cases where the missingness is intentional. For example, imagine a product change where job seekers are only presented with the job alert cancellation survey if they applied for a job on Indeed in the past 30 days. In this scenario, job seekers who haven’t applied for jobs in the past 30 days will never see the survey. Data will thus be missing by design based on the number of applies.

The challenge of addressing missing data

A core challenge of missing data is determining why it’s missing, and in particular which missingness mechanism you’re dealing with: MNAR or MAR. While it’s fairly easy to check for significant differences in the distribution of missing data, a p-value and confidence interval will not tell you why the data is missing.

Determining whether data are MNAR or MAR is a daunting task and relies heavily on assumptions. In the MAR example above, we assumed that the missingness was because users were more inclined to use the mobile version of Indeed than the desktop version. However, we only know this pattern exists because we’ve talked with people who noticed similar patterns in users preferring mobile over desktop. Without that knowledge we could very easily have misinterpreted the pattern.

Thankfully, there are strategies you can use to diagnose whether your data are MAR or MNAR.

To start, you can ask yourself:

Does the question ask people to reveal sensitive or socially undesirable behavior?

If it does, be aware that asking people to reveal sensitive information is more likely to cause your data to be MNAR rather than MAR. It might be possible to reduce the impact of such survey design by reassuring confidentiality and using other strategies to gain the trust of respondents.

If the question does not ask people to reveal sensitive information but you’re still concerned the missing data might be MNAR (the bad one), you can try other strategies. If you have longitudinal data from the respondents, you can check whether the non-response pattern you observe is consistent with previous responses at other time points. If the pattern replicates, you can at least say that your observations are not unusual.

Of course, just because the non-response pattern replicates doesn’t mean you’re in the clear to declare your data MAR rather than MNAR. If, for example, you’re asking people to report socially undesirable behavior, you’d likely see the same MNAR pattern over time.

If you don’t have access to longitudinal data, a second solution is to talk with people on your team or in your organization, or to look at papers from related research, to see if anyone else has observed similar patterns of non-response. Another Research 2.0 solution might be to crowdsource by reaching out to colleagues on Slack and other social media. There you might discover whether the non-response pattern you’re observing is typical or atypical.

This relatively simple yes/no logic isn’t perfect, but using the strategies above is still better than a head-in-the-sand “missing data never matters” approach.

Missing data isn’t always the end of the world

Not all missing data is inherently tied to response bias. It can be missing by design, missing completely at random (MCAR), missing not at random (MNAR), or missing at random (MAR). In the job alert cancellation survey, we saw how the survey design might lead to different scenarios of missingness.

Are you a data scientist or data aficionado who is also a critical thinker? If so, remember to take a deep dive into your missing data.

Suggested reading

De Leeuw, E. D. (2001). Reducing missing data in surveys: An overview of methods. Quality and Quantity, 35(2), 147–160. A concise article on missing data and response bias.

Kish, L. (1997). Survey sampling. Although this book is a bit dense, it’s a go-to resource for learning more about sampling bias.

Tourangeau, R., & Yan, T. (2007). Sensitive questions in surveys. Psychological bulletin, 133(5), 859.

About the author

For a bit of background about myself, I’m a University of Michigan Ph.D. graduate (Go Blue!) who recently transitioned to industry as a Quantitative UX Researcher at Indeed.

Feel free to message me if you want to chat about my transition from academia to industry or if you just want to muse about missing data 😉

Interested in joining Indeed? Check out our available opportunities.


[1] It’s worth acknowledging that the topic of non-response bias is an enormous field. Several textbooks and many dissertations are available on this topic. For a deeper understanding of the field, check out my suggested reading section above. This is designed to be an easy resource you can reference when you are dealing with missing data.


Missing Data—cross-posted on Medium.