The Evolving Language of Data Science 

…or Grokking the Bokeh of Scarse Meaning Increasement

“You keep using that word. I do not think it means what you think it means.” — Dr. Inigo Montoya


I’m a technical writer at Indeed. One of the many great things about my job is that I get to work with smart people every day. A fair amount of that work involves translating between them. They will all be speaking English, but still might not understand each other. This is a natural consequence of how knowledge advances in general, and how English develops in particular. 

As disciplines evolve, alternate meanings and new words develop to match. That can extend to creating new phrases to name the disciplines themselves (for example, what is a data scientist?). English’s adoption of such new words and meanings has always been pragmatic. Other Western languages have more formal approval processes, such as French’s Académie française and German’s reliance on a single prestigious dictionary. The closest things to formal authorities for correct English are popular dictionaries such as the Oxford English Dictionary, the American Heritage Dictionary, and Merriam-Webster. None of them reigns supreme.

This informal adoption of new words and meanings can lead to entire conversations in which people don’t realize they’re discussing different things. For example, consider another recently adopted word: “bokeh.” This started as a term in the dialect of professional photography, for the aesthetically pleasing blurred look that a shallow depth of field can give the out-of-focus parts of a picture. “Bokeh” is also the name of a specific Python data visualization package. So “bokeh” may already be headed for a new meaning within the realm of data science.

As a further example of the fluid nature of English, “bokeh” comes from the Japanese word boke (暈け or ボケ). In its original form it meant “intentional blurring,” as well as sometimes “mental haze,” i.e., confusion.

 

Two rows of flowers that become blurry in the distance; a representation of bokeh in photography.

Bokeh of flowers by Sergei Akulich on Unsplash

 

A montage of various visualizations related to the data science product Bokeh.

Data science bokeh — https://bokeh.pydata.org/

The clouded meaning of “data”

A data scientist told me that when she hears “the data” she tends to think of a large amount of information, a set large enough to be comprehensive. She was surprised to see another team’s presentation of “the data” turn out to be a small table inside a spreadsheet that listed a few numbers.

This term can also cause confusion between technical fields. Data scientists often interpret “data” as quantitative, while UX researchers interpret “data” as qualitative.

Exploring evolving language with Ngram Viewer

A product science colleague introduced me to the Google Books Ngram Viewer. It’s a search engine that shows how often a word or phrase occurs in the mass of print books Google has scanned. The English collection I searched spans books published from AD 1500 to 2008.

I entered some new words that I had come across, and screened out occurrences that weren’t relevant, such as place or person names and abbreviations. I also set the search to start from 1800. Medieval data science could be interesting, but I expect it to be “scarse.” (That’s not a typo.)

Features

When I first came across this newer meaning of “features,” I wasn’t even aware that it had changed. From previous work with software development and UX, I took “features” to mean “aspects of a product that a user will hopefully find useful.” But in data science, a “feature” is a covariate in a model. In less technical English, it is a measurable property or characteristic of a phenomenon being observed.

This dual meaning led me to a fair amount of head-scratching when I was documenting an internal data science application. The application had software features for defining and manipulating data features. 
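To make the data science sense of “feature” concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the job-posting observations, the field names, and the helper `to_feature_matrix` are assumptions, not part of any application mentioned here. Each observation has several measurable properties (the features) and one value a model would try to predict (the target).

```python
# Hypothetical observations: all names and values are made up for illustration.
observations = [
    {"title_length": 24, "has_salary": True, "days_online": 3, "clicks": 41},
    {"title_length": 57, "has_salary": False, "days_online": 10, "clicks": 12},
    {"title_length": 31, "has_salary": True, "days_online": 1, "clicks": 8},
]

# The "features" are the measurable properties a model takes as input.
FEATURE_NAMES = ["title_length", "has_salary", "days_online"]
# The "target" is the value the model would try to predict.
TARGET_NAME = "clicks"


def to_feature_matrix(rows, feature_names):
    """Turn a list of dict observations into rows of numeric feature values."""
    return [[float(row[name]) for name in feature_names] for row in rows]


X = to_feature_matrix(observations, FEATURE_NAMES)  # one row per observation
y = [row[TARGET_NAME] for row in observations]      # one target per observation
```

In this vocabulary, a software feature might be a button that lets a user add or remove entries in `FEATURE_NAMES`, which is exactly the kind of sentence that caused the head-scratching.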

The following Ngram graph indicates this emerging meaning for “feature” by tracking the emergence of a related phrase, “model feature.” 

 

A line chart showing the rise in the use of the term "model feature" from 1800 to 2000.

The usage of the term “model feature” peaks sometime in the 1990s.

 

Diving into Ngram’s specific citations, the earliest mention I can find that’s near this meaning is in 1954. Interestingly, it’s from a book on management science:

Screenshot from Google Books summary of "Management Science"

The next use that seems exact turns up in 1969, in the Digest Record from Association for Computing Machinery, Society for Industrial and Applied Mathematics, Institute of Electrical and Electronics Engineers. Leaving aside the intervening comma, the example is so dead-on that I wonder if we’re looking at near the exact moment this new meaning was fully born:

Screenshot of Google Books summary for "The Digest Record"

To grok

“Grok” is an example of English going so far as to steal words from languages that don’t even exist. Robert A. Heinlein coined the word in his 1961 science fiction classic Stranger in a Strange Land. In the novel, the Martian word “grok” literally means “drink” and metaphorically means “understanding something so completely that you and it are one.”

 

A line chart showing the trend for the use of the term "grok" from 1800 to 2000.

Usage of the term “grok” increases starting in the 1960s.

 

Like many other aspects of science fiction and fantasy, computer programming culture absorbed the term. The Jargon File from 1983 shares an early defined example:

GROK (grahk) verb.
  To understand, usually in a global sense especially, to understand
all the implications and consequences of making a change. Example:
“JONL is the only one who groks the MACLISP compiler.”

Since then, computer jargon has absorbed “grok” and applied it in many different ways. One immediate example is the source code search and cross-reference engine OpenGrok. It’s intended to let users “grok (profoundly understand) source code and is developed in the open.”

Salt

Salt is an example of a common word that has gone through two steps of technical change. First it gained a meaning relating to information security, and then an additional one in data science. 

As a verb and noun, “salt” originally meant what it sounds like – adding the substance chemically known as NaCl to food for flavoring and preservation. It gained what is perhaps its better-known technical meaning in information security. Adding “salt,” random data mixed into each password before it is hashed, makes the stored password hashes more difficult to crack. In the word’s further and more recent permutations in data science, “salt” and “resalt” mean to partly randomize the results of an experiment by shuffling them. The following Ngram graph tracks the association of “salt” and “resalt” over time.

This was hard to parse out, and required diving deeply into Ngram’s options. I ended up graphing the different times “salt” modifies the words “food,” “password,” or “data.” The collection I searched ends in 2008 – you can see the barest beginning of this new usage in 2007.

 

A line chart representing the use of salt with regard to food, passwords, and data.

From 2000 to 2008, salt in the context of food is most used, followed by salt in the information security sense.
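The two technical senses of “salt” can be sketched in a few lines of Python using only the standard library. The function names, the iteration count, and the experiment names below are assumptions chosen for illustration. In particular, `assign_bucket` is only one plausible reading of “salting” an experiment (hashing a user ID together with a salt string to pick a test group), not a description of any specific system.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Security sense: mix a random salt into the password before hashing."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt for each password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


def assign_bucket(user_id: str, salt: str, buckets: int = 2) -> int:
    """Data science sense (assumed): the salt decides how users are shuffled
    into test groups. The same salt always gives a user the same bucket."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


# The same password hashed twice gets two different salts and two different hashes.
salt_a, hash_a = hash_password("correct horse")
salt_b, hash_b = hash_password("correct horse")

# "Resalting" an experiment: a new salt reshuffles who lands in which group.
before = [assign_bucket(f"user-{i}", "experiment-v1") for i in range(200)]
after = [assign_bucket(f"user-{i}", "experiment-v2") for i in range(200)]
```

In both senses the salt does the same job: it is extra input that changes the output of a hash without changing the thing being hashed.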

 

Pickling

Traditionally “pickling” refers to another way to treat food, this one almost entirely for preservation. In Python, it refers to serializing objects with the standard library’s pickle module, which preserves them as bytes so they can be restored later. Data scientists have found increasing use for this term, in ways too recent to find on Ngram.
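As a small illustration of the Python sense, the sketch below pickles an object and restores it. The `model_summary` dictionary is a made-up stand-in for whatever a data scientist might want to preserve, such as a trained model.

```python
import pickle

# A hypothetical object worth preserving; the contents are invented for illustration.
model_summary = {
    "name": "click_model",
    "coefficients": [0.4, -1.2, 3.0],
    "trained": True,
}

# "Pickling": serialize the object into bytes that can be saved or sent elsewhere.
payload = pickle.dumps(model_summary)

# "Unpickling": rebuild an equal, but separate, object from those bytes.
restored = pickle.loads(payload)
```

As with food, it pays to know where a pickle came from: the Python documentation warns against unpickling data from untrusted sources.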

The bleeding edge of language?

Here are some words that may just be in the sprouting stage of wider usage.

Scarse

This came from an accidental jumble of words in a meeting, and has remained in use since. It describes situations where data is both scarce (there’s not a lot of it) and sparse (even when there is some, it’s pretty thin). 

This meaning for “scarse” doesn’t appear in the Ngram graph. So it appears we’re seeing mutation and evolution in word form in the wild. Will it take root and prosper, continuing to evolve? Only time will tell.

Increasement

“We should look for the source of that error message increasement.”

I’ve observed this word once in the wild–from me. “Increasement” came to me in a meeting, as a word for the amount of an increase over time. I had never used the word before. It just seemed like a word that could exist. It had meaning similar to other words, and fit those other words’ rules of word construction.

In the context I used, its meaning isn’t exactly the same as “increment.” Increment refers to a specific numeric increase. One wouldn’t refer, for example, to an increasing number of users as an increment. You might, however, refer to it as an increasement.

Searching for increasement in Ngram revealed that this word previously existed but fell out of common usage, as shown on the following graph.

 

A line chart representing the usage of the term "increasement" from 1800 to 2000.

The word “increasement” tapers in usage, with its peak in the 1810s, followed by a gradual decline.

 

Previous examples:

Book: The Fathers of the English Church

Paul was, that he should return again to these Philippians, and abide, and continue amongst them, and that to their profit; both to the increasement of their faith


Book: The Harleian miscellany; or, A collection of … pamphlets and tracts … in the late earl of Oxford’s library

….when she saw the man grown settled and staid, gave him an assistance, and advanced him to the treasurership, where he made amends to his house, for his mis-spent time, both in the increasement of his estate and honour…

Perhaps it’s time for “increasement” to be rebooted into common use?

Bottom line

Language is likely to continue evolving as long as we use language. Words in general, and English words in particular, and words in English technical dialects above all, are in a constant state of flux. Just like the many fields of knowledge they discuss.

So if you’re in a technical discussion and others’ responses aren’t quite what you expect, consider re-examining the technical phrases you’re using. 

The people you’re talking with might grok those words quite differently.

 

The Evolving Language of Data Science—cross-posted on Medium.