There’s No Such Thing as a Data Scientist

The Inconsistent Definitions of Data Science and More Descriptive Titles

Images from Left to Bottom: 1) Link, By smoothgroover22, License, cropped. 2) Link, By NazWeb, License. 3) Link, By BalticServers.com, License. 4) Link, By Wallpoper. 5) Link, By The Opte Project, License, cropped


What do you really do?

There’s a memorable scene in the movie Office Space where consultants determining employee productivity start by asking, “What would you say… you do here?”

That scene and the “What I Do” images are funny because we empathize with the struggle to describe our jobs. It’s not funny, however, when the same misunderstanding occurs during the job search. It’s important to understand what a job posting means. It’s important for prospective employers to understand our skills and abilities. We’ve all viewed job postings with the same title, but with totally different descriptions.

How can the same title mean such vastly different things from one company to another?

This phenomenon is becoming increasingly common in the field of data science. The discipline has risen dramatically in popularity over the past few years. And while the number of data science jobs has increased, clarity around the role has declined. This post uses Indeed’s large store of behavioral data to describe trends in the field and to propose more specific definitions for data science roles.

The growing popularity of data science

Jobs matching “data scientist” have risen from 0.03% of jobs to about 0.15% (+400%) in a 4-year span.

Even earlier, in 2012, a much-ballyhooed article called data scientist the “Sexiest Job of the 21st Century.” If the title alone isn’t enough, maybe folks are interested for monetary reasons. According to Indeed’s salary data, a data scientist makes an average of $130k per year.

 

OK. Got it. Data science has taken off like discounted Nutella in a European supermarket. With this rise, we’ve also seen the refinement of more specific roles within the discipline. Our colleague Trey Causey wrote about the convergence between product managers and data scientists in the “Rise of the Data Product Manager.”

Many of us at Indeed also felt that the title “data scientist” has recently become more of a catch-all for many different sets of responsibilities. We wanted to dig deeper and test our intuition. Could we find natural delineations of roles within the job market? Could we use data to understand the differences within these titles and better classify them for clarity and consistency?

Spoiler Alert: We can.

Overlapping careers in data science

For this analysis of job titles, we looked at all site visitors who entered the search query “data scientist” on Indeed during January 2018. Next, we looked at the other searches these same users performed. We created one matrix of users by their searches and another of searches by users. We calculated the cartesian product of these matrices to show how often any pair of search terms appeared together.
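The pairwise frequencies described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the "product" amounts to counting, for each pair of queries, how many users searched both (the off-diagonal entries of a user-by-search matrix multiplied by its transpose). The function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(searches_by_user):
    """Count, for each pair of queries, how many users searched both."""
    counts = Counter()
    for queries in searches_by_user.values():
        # Deduplicate and sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(queries)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical toy data; the real input was one month of Indeed searches.
searches_by_user = {
    "u1": ["data scientist", "statistician", "biostatistician"],
    "u2": ["data scientist", "machine learning engineer"],
    "u3": ["data scientist", "statistician", "machine learning engineer"],
}
pairs = cooccurrence_counts(searches_by_user)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

Counting pairs directly avoids building a dense matrix, which matters when most query pairs never occur together.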

Next, we removed “data scientist” from the data, as this search was present for all users. We used an R package called “igraph” to do the clustering and visualization. According to the igraph documentation, “this function implements the fast greedy modularity optimization algorithm for finding community structure.” While researching this algorithm, we learned that it was designed to quickly create communities from large data sets that have sparse regions. Hmm, that sounds exactly like the data we are using!
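To make the clustering step concrete, here is a deliberately naive Python sketch of greedy modularity agglomeration in the spirit of the Clauset, Newman, and Moore paper cited in the footnotes. It is not igraph's implementation and not the authors' R code: it rescans every connected pair on each pass instead of using the paper's fast data structures, and the function name, edge weights, and query names are hypothetical.

```python
from collections import defaultdict


def greedy_modularity(edges):
    """Repeatedly merge the two linked communities with the largest modularity gain.

    edges: dict mapping (u, v) -> positive weight (undirected, no self-loops).
    Returns a list of sets of nodes.
    """
    two_m = 2.0 * sum(edges.values())
    links = defaultdict(dict)   # links[i][j]: e_ij, half the weight fraction between i and j
    a = defaultdict(float)      # a[i]: fraction of edge ends attached to community i
    for (u, v), w in edges.items():
        links[u][v] = links[u].get(v, 0.0) + w / two_m
        links[v][u] = links[v].get(u, 0.0) + w / two_m
        a[u] += w / two_m
        a[v] += w / two_m
    members = {n: {n} for n in a}   # every node starts in its own community

    while True:
        best_gain, best_pair = 0.0, None
        for i in links:
            for j, e_ij in links[i].items():
                gain = 2.0 * (e_ij - a[i] * a[j])   # modularity change if i and j merge
                if gain > best_gain + 1e-12:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is None:       # no merge improves modularity
            break
        i, j = best_pair
        for k, e_jk in links.pop(j).items():   # fold community j into community i
            del links[k][j]
            if k != i:
                links[i][k] = links[i].get(k, 0.0) + e_jk
                links[k][i] = links[i][k]
        a[i] += a.pop(j)
        members[i] |= members.pop(j)
    return list(members.values())


# Hypothetical co-occurrence weights, including the seed query.
raw = {
    ("data scientist", "statistician"): 3.0,
    ("data scientist", "machine learning engineer"): 3.0,
    ("statistician", "biostatistician"): 1.0,
    ("statistician", "statistical analyst"): 1.0,
    ("biostatistician", "statistical analyst"): 1.0,
    ("machine learning engineer", "deep learning"): 1.0,
    ("machine learning engineer", "computer vision"): 1.0,
    ("deep learning", "computer vision"): 1.0,
    ("statistical analyst", "machine learning engineer"): 1.0,
}
# Drop the seed query, since every user in the sample searched for it.
edges = {pair: w for pair, w in raw.items() if "data scientist" not in pair}
communities = greedy_modularity(edges)
```

On this toy graph the two tightly linked triples end up in separate communities, because merging across the single bridging edge would lower modularity.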

Here’s a great obligatory equation we can add for how this works. You’ll have to read that paper to understand what it means.
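For reference, the quantity that the cited paper's algorithm greedily maximizes is modularity. The form below is the standard one from that paper and is offered only as a pointer; it is not necessarily the exact equation the authors had in mind. Here A is the adjacency matrix, k_v the degree of vertex v, m the number of edges, c_v the community of v, e_ij half the fraction of edges joining communities i and j, and a_i the fraction of edge ends attached to community i.

```latex
Q = \frac{1}{2m} \sum_{v,w} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w)
  = \sum_i \left( e_{ii} - a_i^2 \right),
\qquad
\Delta Q_{ij} = 2 \left( e_{ij} - a_i a_j \right)
```

The second expression is the change in modularity from merging communities i and j, which is what the algorithm evaluates at each step.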

 

Next, we wrote a function with a pruning parameter to choose the minimum number of vertices in each cluster. This parameter is best set by “guess and check,” as higher numbers don’t necessarily mean more total groups and vice versa. We tried various numbers from 3–20 and checked to see if the groups made sense. We didn’t care about really small clusters and we wanted the queries to fit together. More on this later.

By choosing five as the pruning threshold, four clusters formed. We subsequently labeled these clusters “business intelligence”, “statistician”, “machine learning engineer”, and “natural scientist”.
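The simplest reading of the pruning step is a filter that drops clusters smaller than a threshold, swept over the 3–20 range mentioned above. The sketch below assumes exactly that, with hypothetical cluster sizes. Under this reading the number of surviving clusters can only shrink as the threshold rises, so the authors' function, whose output did not behave that way, probably did more than this (for example, re-clustering after pruning); that part is a guess.

```python
def prune_clusters(clusters, min_size):
    """Keep only clusters with at least min_size vertices."""
    return [c for c in clusters if len(c) >= min_size]


# Hypothetical clusters with 12, 7, 5, 2, and 1 members.
clusters = [
    set(range(0, 12)),
    set(range(12, 19)),
    set(range(19, 24)),
    set(range(24, 26)),
    {26},
]

# "Guess and check": print the surviving cluster sizes for each threshold.
for min_size in range(3, 21):
    kept = prune_clusters(clusters, min_size)
    print(min_size, [len(c) for c in kept])
```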

Here are the queries that make up each group:

See the Pen “Job Title Network Graph” by Erik Oberg (@obergew) on CodePen.

Thanks to Erik Oberg for the CodePen viz

And here’s how the clustering turned out:

[Figure: clustering results]

Thanks to Zhuying Xu for the Plotly viz

From the preceding chart, we see a few interesting things.

First, there is clear demarcation between statistician and machine learning engineer. Since we don’t see many searches that cross over between these roles, this suggests two distinct career paths.

Second, business intelligence doesn’t form a clean grouping; it is dispersed broadly across the other roles. This contrasts with natural scientist searches, which overlap mostly with statistician searches. The spread suggests that job seekers who search for business intelligence may be looking at a wide variety of other jobs within the data science realm. It could also mean that business intelligence positions are more often being labeled data science now. By contrast, job seekers who search for machine learning engineer or statistician rarely search in both categories.

Finally, we see that some natural scientists are perhaps getting into data science through the statistician end of the data science spectrum.
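One way to put a number on "cross over" between two roles is the share of co-occurrence weight that links the two clusters, relative to all the weight inside or between them. The sketch below is an illustrative measure under that assumption, not the one behind the chart; the function name, labels, and counts are hypothetical.

```python
def crossover_share(pair_counts, label_of, a, b):
    """Fraction of co-occurrence weight among clusters a and b that spans both.

    Assumes a and b are two different cluster labels.
    """
    between = within = 0
    for (q1, q2), n in pair_counts.items():
        labels = {label_of.get(q1), label_of.get(q2)}
        if labels == {a, b}:
            between += n        # one query from each cluster
        elif labels == {a} or labels == {b}:
            within += n         # both queries from the same cluster
    total = between + within
    return between / total if total else 0.0


# Hypothetical counts of users who ran both queries in each pair.
pair_counts = {
    ("biostatistician", "statistician"): 40,
    ("deep learning", "machine learning engineer"): 50,
    ("machine learning engineer", "statistician"): 10,
}
label_of = {
    "statistician": "statistician",
    "biostatistician": "statistician",
    "machine learning engineer": "machine learning engineer",
    "deep learning": "machine learning engineer",
}
share = crossover_share(
    pair_counts, label_of, "statistician", "machine learning engineer"
)
```

A share near zero would correspond to the clear demarcation described above, while a share near one would mean the two labels are effectively a single group.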

More descriptive roles in data science

From these findings, we would posit that there is no single type of data scientist. Rather, there are many types! No single description covers them all, so the title alone doesn’t give us enough information. In practice, “data scientist” could translate to a variety of different roles.

Taken together, it’s important to gather more information to understand what it means to be a data scientist at a given company. We believe it would be helpful for employers to think in terms of the roles identified in our clustering. This will help them find the candidates they need and enable job seekers to apply for the jobs they want.

At Indeed, we have a few “data” roles: data engineer, BI developer, BI analyst, product scientist, and data scientist. It looks something like this:

[Figure: Data Science Job Strengths]

Thanks to Ron Chipman for helping put this together

It’s easy to see how confusing this can become. Given the search patterns we’ve observed, if someone were to say, “I want to be a data scientist at Indeed,” it would be unclear which team or title is the best fit. Each team has its own interview process and contributes in different ways, so it’s really important to apply to the right one.

This is the first blog post in a series diving more deeply into data science insights from Indeed. In upcoming posts, we’ll explore the skills associated with data science jobs. We’ll showcase trends and the overlap from each of these more specific job titles. We’ll also describe what skills you should gain if you are interested in a particular career path. We’ll provide employers with tips to interview better for the specific needs of their organization. Finally, we’ll describe “Job Title Supernovae” — jobs that grow quickly and fade away.

Will the title “data scientist” die away like “webmaster” did in the 90s? Subscribe or tune in to future posts for that prediction and more!

At Indeed, We Help People Get Jobs and we hope to help you too. If any of these roles have excited you, please check out www.indeed.jobs and apply today!


Footnotes

A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” http://www.arxiv.org/abs/cond-mat/0408187


Cross-posted on Medium.