Where Do Data Scientists Come From?

Our previous article in this series on Data Science Titles made the case that there’s no such thing as a data scientist — instead, the phrase “data scientist” has come to represent a number of distinct roles. So in addition to their different skills and job duties, we’d like to know who data scientists are and what backgrounds they come from.

In this article we dig into the resume data of practicing data scientists, and discover that data scientists come from a wide variety of fields of study, levels of education, and prior jobs. We also explore what this data can tell us about the similarities and differences in the roles of data scientists, analysts, engineers, and software and machine learning engineers.

Who are data scientists?

If you ask every data scientist around you what they did before DS, they’re each likely to give you a different answer. Many come from Masters and PhD programs, in fields ranging from astrophysics to zoology. Others come from the many new data science graduate programs that universities now offer. And still others came from other technology roles, such as software engineering or data analysis.

At Indeed, we help people get jobs. One way we do this is by letting job seekers submit resumes so employers can find a perfect match. There are tens of thousands of resumes in our dataset from current and former data scientists. We can use this resume data to gain some insight into where data scientists come from.

Does educational background matter?

Highest degree achieved

First, we took a look at the highest degree achieved by those who hold the title of “data scientist” or a related field¹.

We’ve chosen the job titles of data engineer, data analyst, software engineer, machine learning engineer, and data scientist², as these reflect some of the distinct roles we found in our previous articles.

Data Scientists

Data scientists have the highest average education level of any of the job titles we examined.

  • Data scientists have more PhDs than any of the other job titles. However, a PhD is not required for becoming a data scientist; only 20% of data scientists have them.
  • Advanced degrees (MA or PhD) are held by 75% of data scientists.
  • Fewer than 5% of data scientists have only a high school diploma or associate’s degree.

Machine Learning, Data, and Software Engineers

Software and data engineers have more bachelor’s degrees than advanced degrees, while machine learning engineers are more likely to hold advanced degrees.

  • Machine learning engineers have a similar distribution of education levels to data scientists, but are about 30% less likely to hold a PhD. These results seem roughly in line with a similar study by Stitch Data.
  • Engineering-focused roles tend to favor bachelor’s degrees with some masters, but very few (<5%) PhDs.
  • 1 in 4 data engineers have a high school diploma or an associate’s degree as their highest level of education.

Data Analysts

Data analysts have a very different distribution of degrees than data scientists, and more closely resemble software engineers in their levels of academic achievement³.

  • Data scientists have PhDs at almost 10 times the rate of data analysts, and are twice as likely to hold a graduate degree.
  • As we’ll see later, this may be due in part to an emerging pattern of software engineers transitioning into data analysis.
  • This could also mean that PhDs are being treated as relevant work experience by employers, who may be seeing data scientists as having more senior roles. Or perhaps the training one receives in a masters or PhD program uniquely prepares individuals for research-oriented data science work.

Field of study

Looking at the distribution of fields of study between job titles reveals some intriguing results.


The “data scientist” job title exhibits the most diversity in field of study of any of the titles we looked at, and no one field seems to dominate. We can quantify the diversity by calculating the Gini impurity of each job title: the probability that two people picked at random (with replacement) from that title studied different fields.

Gini Impurity (Larger means more diverse fields of study)

  • Data Scientist — 85%
  • Machine Learning Engineer — 73%
  • Software Engineer — 53%
  • Data Analyst — 78%
  • Data Engineer — 79%
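As a rough illustration of how such a number is computed, here is a minimal Python sketch. The function name `gini_impurity` and the toy field labels are assumptions made for this example; they are not taken from our resume data.

```python
from collections import Counter


def gini_impurity(fields):
    """Return 1 - sum(p_i^2), where p_i is the share of people in field i."""
    if not fields:
        raise ValueError("need at least one field-of-study label")
    counts = Counter(fields)
    total = len(fields)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Toy labels, made up for illustration only.
concentrated = ["computer science"] * 9 + ["mathematics"]
spread_out = ["computer science", "mathematics", "physics", "economics", "biology"] * 2

print(round(gini_impurity(concentrated), 2))  # 0.18
print(round(gini_impurity(spread_out), 2))    # 0.8
```

With k equally common fields the impurity is 1 − 1/k, so a value of 85% is only reachable when at least seven fields are represented and none dominates.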

Data Scientists

Data scientists clearly have the most diverse fields-of-study in the job titles we’ve looked at, while software engineers have the least diverse educational backgrounds. While the social sciences are somewhat under-represented in the data science population, they still make up about 5% of data scientists. Data science majors make up a slightly larger portion of data scientists (9%), which is somewhat surprising given how new most university data science programs are.

Machine Learning Engineers

Our data also shows a pronounced distinction between data scientists and machine learning engineers. Over 60% of machine learning engineers come from a computer science or engineering background, and they are almost twice as likely to come from these backgrounds as someone holding the title of “data scientist.” There were effectively no social scientists with the title of “machine learning engineer” in our sample.

Software Engineers

Software engineers are — unsurprisingly — even more heavily focused on computer science and engineering majors. It’s been proposed that machine learning engineers are a merger between software engineers and data scientists. Our data appears to support this assertion.

Data Analysts

Like data scientists, data analysts seem to come from a diverse educational background. They differ from data scientists in that they are more often business, economics, and social science majors, and less often have mathematics, statistics, and natural science degrees. It’s also interesting to note that those with data science degrees represent more of the data scientist population than the analyst population.

Data Engineers

Data engineers show a field of study distribution that is somewhere between data scientists and machine learning engineers. However, as noted above, many data engineers don’t have any degree beyond a high school diploma!

Which jobs do data scientists hold prior to data science?

Unsurprisingly, many individuals (approximately 25% of our sample) held the same title in their previous role as their current.

This is especially true of software engineers, who are very likely (71%) to have held a software engineering role previously. This is probably due to the relative maturity of the field of software engineering as opposed to data science, which didn’t even have its own title until fairly recently.

“Academic” here means actually being employed by a university, or as a researcher in an academic environment. Graduate students in particular are likely to have held such positions, and we see that the most graduate-degree heavy fields (data science, machine learning engineer, data analyst) have the most transitions from academia.

Perhaps a more interesting question is: what was the last different job title that data scientists held?
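To make “last different job title” concrete, here is a small Python sketch of one way to read it. The function names, the oldest-to-newest ordering of each history, and the sample resumes are all assumptions for illustration, not a description of our actual pipeline.

```python
from collections import Counter


def last_different_title(history):
    """Return the most recent earlier title that differs from the current one.

    `history` lists bucketed job titles from oldest to newest.
    Returns None when there is no earlier, different title.
    """
    if not history:
        return None
    current = history[-1]
    for title in reversed(history[:-1]):
        if title != current:
            return title
    return None


def count_transitions(histories):
    """Count (previous title, current title) pairs across many resumes."""
    pairs = Counter()
    for history in histories:
        previous = last_different_title(history)
        if previous is not None:
            pairs[(previous, history[-1])] += 1
    return pairs


# Hypothetical resumes, invented for illustration.
resumes = [
    ["academic", "data scientist", "data scientist"],
    ["software engineer", "data analyst", "data scientist"],
    ["software engineer", "software engineer"],
]
transitions = count_transitions(resumes)
```

Counts of this shape (from-title, to-title) are the kind of input a chord diagram of role transitions needs.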

Here we see some interesting patterns: data scientists, machine learning engineers, and software engineers are more likely to start straight out of academia. Many of the “other” previous jobs are unrelated, such as catering, tutoring, store clerks, and other positions people can often hold while completing their degrees.

Many roles transition into data scientists or machine learning engineers, but rarely do we see data scientists and machine learning engineers transitioning into any of the other roles. This is likely due in part to the relative sizes of the fields, the infancy of the “data scientist” and “machine learning engineer” titles, and the recent growth in popularity of those titles. However, I believe we are also observing an interesting phenomenon that speaks to how individuals are moving between and progressing⁴ through each role.

This chord diagram illustrates the main transitions we see between these roles. The color of the chord indicates which role people are transitioning from.

Software engineers make up a big slice of the pie. Many transition to analyst roles, while others hop straight to data science.

Data science is equally fed by academia, analysts, and software engineers. Software engineers are far more likely to hop into a data analyst role, although this is in part due to the larger number of analyst roles than data scientist roles.

Again, we see few individuals leaving data science at this moment. It’s unclear if this pattern will change in the future. The key takeaway here is that the data science field is fed by a wide variety of backgrounds, and it is relatively common to see software engineers become data analysts, and data analysts to become data scientists. This may represent a viable path for anyone looking to transition out of a software engineering role.

Transitions into data engineering come almost exclusively from software engineering⁵.

Conclusion

Where do data scientists come from? Everywhere! Although the field is predominantly populated by individuals with MAs and PhDs, there are still plenty of individuals with bachelor’s degrees (26%) in the role. No field of study seems to dominate data science at this time; instead, we see a great diversity in backgrounds for data scientists, especially compared to fields like software engineering. In addition, we see a large number of individuals moving from other tech roles, such as software engineering and data analytics, into data science.

While machine learning engineers reflect data scientists in their levels of academic achievement, they seem to be more heavily focused in engineering backgrounds, and are more likely to have transitioned from a software engineer role. Data engineers also have more of an engineering focus, but tend to have lower levels of degree achievement when compared to the other roles in this study.

What does this mean for data science job seekers?

Graduate school is still the dominant way data scientists get into the field. Data science degrees have a growing presence, and now appear to be a somewhat common way to get entry into the field. Any field of study seems viable if one has obtained an advanced degree. If you’re in a graduate program now, there’s almost certainly someone in your field of study working in data science. I suggest you reach out to them and find out how they made the leap!

Software engineers and data analysts seem to transition into data science roles quite regularly, and represent substantial portions of new data scientists. Future job seekers should consider these routes as well.

What does this mean for employers looking for data scientists?

If you’re looking for a generalist data scientist, don’t throw out a resume just because the field or degree isn’t what you expect. Data scientists are diverse in their education and background. Although most have an advanced degree in some field, there is no one field that dominates the job market.

If you’re having difficulty hiring experienced data scientists or scientists out of academia, consider bringing in individuals from software engineering or data analyst roles, as that is clearly a common pathway to data science.

Also, as we’ll discuss in a later article, make sure you know the role you’re actually hiring for. Do you think you need a data scientist, but feel your role is heavier on engineering? Consider introducing a “machine learning engineer” role. Do you think you need a data scientist, but with more focus on a business background? Consider hiring an analyst. Do you need someone with a focus on database and infrastructure skills? Consider a data engineer, and don’t focus as much on their educational background.

Finally, if you think you do need some sort of generalist data scientists for your team, consider looking for a variety of educational backgrounds. At Indeed, the members of our data science and product science teams span a wide range of fields, including astronomy, sociology, biology, mathematics, economics, and business. Having a diverse data science team — both in demographics and in field of study — is essential for doing great work⁶ ⁷.


Footnotes

¹Note that there is almost certainly a bias here, in that we’re looking at the resumes of job seekers that have already added “data scientist” to their resume. This means we’re going to be looking at individuals who have likely already been in the field for several years, and may not be entirely representative of more recent trends.

²For each job title, we’ve bucketed related job titles as well, e.g. “Senior Data Scientist” will be in the Data Scientist category, and “C++ Programmer” will be in the Software Engineer category.

³This article by Paula Leonova has a good, data-driven discussion of the difference between data science and data analyst roles.

⁴To be absolutely clear, I do not mean to imply a hierarchy of roles. Many software engineering roles, for example, are far more senior than many data scientist roles. I am simply referring to the directional pattern that seems to be emerging.

⁵Stitch Data did a nice breakdown of data engineering roles, and also noted the major overlap with software engineering.

⁶See also https://press.princeton.edu/titles/8757.html, https://www.mckinsey.com/business-functions/organization/our-insights/why-diversity-matters, and http://www.chabris.com/Woolley2010a.pdf for more information on the importance of diversity in the workplace.

⁷It is not my intention to conflate “diversity in field of study” with broader diversity topics. I strongly believe diversity in all dimensions is essential for doing great work and creating a better society, and it will take far more than focusing on degree of study to overcome the overwhelming lack of diversity in tech workers in the US right now. This article from Stitch argues that Data Science does not appear to be doing any better than engineering roles in many aspects of diversity.


Cross-posted on Medium.


Qualitative + Quantitative: How Qualitative Methods Support Better Data Science

Have you ever been embarrassed by the first iteration of one of your machine learning projects, where you didn’t include obvious and important features? In the practical hustle and bustle of trying to build models, we can often forget about the observation step in the scientific method and jump straight to hypothesis testing.

Data scientists and their models can benefit greatly from qualitative methods. Without doing qualitative research, data scientists risk making assumptions about how users behave. These assumptions could lead to:

  • neglecting critical parameters,
  • missing a vital opportunity to empathize with those using our products, or
  • misinterpreting data.

In this post, we’ll explore how qualitative methods can help all data scientists build better models, using a case study of Indeed’s new lead routing machine learning model which ultimately generated several million dollars in revenue.

What are qualitative methods and how are they different from quantitative methods?

Few data scientists are formally trained in qualitative methods. They’re more deeply familiar with quantitative methods like A/B testing, surveys, and regressions. Quantitative methods are great for answering questions like “How much does the average small business spend on a job posting?”, “What are the skills that make someone a data scientist?”, or even “How many licks does it take to get to the center of a Tootsie roll pop?” (The answer is 3. Three licks.)

But there are some questions that quantitative methods can’t answer, such as “Why do account executives reach out to this lead instead of that lead?” or “How do small businesses make the decision to sponsor or not sponsor a job?” Or the truly deep question: “Why do you want to get to the center of the Tootsie roll pop?”

To answer these questions, qualitative researchers rely on methods like in-depth interviews, participant observation, content analysis and usability studies. These methods involve more direct contact with who and what you’re studying. They allow you to better understand how and why people do what they do, and what kinds of meaning they ascribe to different behaviors.

Put another way, quantitative methods can tell you the “what”, the “how much”, or “how often”; qualitative methods can tell you the “why” or the “how”.

Cartoon created by Indeed UX Research Manager Dave Yeats using cmx.io

Why should you use qualitative methods? A case study in Lead Generation

Our Lead Generation team recently benefited greatly from the use of qualitative methods. When an employer posts a job, it represents a revenue opportunity for Indeed. We route that employer to an account executive, who then reaches out and helps the employer set an advertising budget to sponsor their job. This increases the job’s visibility and therefore the velocity at which they make a successful hire. Employers who have not yet spent with us on Indeed are referred to as “leads”.

Some leads are better than others, however. We wanted to be able to give leads a score on a scale from one to five stars that would indicate our best estimate for whether or not they would spend. Our Product Science team decided to build a machine learning model that would score leads and route them more effectively. But where to start? Prior to this project, we had little experience with lead scoring and little intuition about what a good lead would look like. How could we even know what features should be in our model?

To answer that question, we turned to people with the most hands-on experience with leads: account executives themselves. Not only are they experts on what makes a good lead, they would also be the beneficiaries of our efforts. We took a three-pronged qualitative approach:

Observation. To learn about the day-to-day sales experience, each member of our team shadowed different reps and listened to them on sales calls. We observed how they would select which lead to call, how they would decide what to talk about on the call, and how they actually made deals.

Interviews. We sat down with several sales managers and representatives across the company and asked them questions about leads they had previously decided to call or drop, like “How do you pick which leads to call first?” or “Why did you decide to drop this lead?”.

Content Analysis. We combed through thousands of open-ended responses to a company-wide survey of account executives to better understand their pain points with regard to leads.

We learned a lot! Just by doing three simple qualitative studies for a few hours, we collected a long list of potential features. Had we not sat down next to members of the sales team and observed as they worked, we would have never obtained these insights. Our next step was to start digging into the data and validating how generalizable the findings from reps were.

With the intuition we gained from our qualitative studies on account executives’ behaviors and thought processes, we ultimately built a machine learning model that generated millions of dollars in annual incremental revenue. And we didn’t stop there: we kept interviewing and shadowing reps to get their feedback on the model. We built a new version that generated additional annual incremental revenue. And we made sure to market our new model so people knew about it.

In short, these qualitative studies kept us grounded and built empathy with our end users. Without qualitative studies, the models we built would have been out of touch with reality and made it harder for us to address our users’ needs. With qualitative methods, we infused our models with intuition and working hypotheses that we could later verify with quantitative data.

Where to start learning the basics for qualitative methods

In the case study above, our end users were our co-workers here at Indeed. It’s worth noting that it’s not always as simple to conduct qualitative studies with external users. Here at Indeed, we have a fantastic UX Qualitative Research team to turn to for these kinds of studies. We encourage you to reach out to such teams at your own companies, and if they don’t exist yet, create them. Work with them. Shadow them. Buy them a beer. They are wonderful!

But don’t just stop there. Below are some of our favorite readings and resources on qualitative methods, recommended by former academics here at Indeed.

“When to Use Which User-Experience Research Methods” — a great article by the Nielsen Norman group on identifying the right method for the research question at hand.

Learning from Strangers — a classic guide on how to ask questions in an in-depth interview.

“How to Conduct User Interviews” — a shorter guide geared toward industry and product development.

“5 Steps to Create Good User Interview Questions” — a great Medium post on avoiding biased or leading questions in in-depth interviews.

Writing Ethnographic Field Notes — the seminal work on how to collect details during observational studies. Geared toward anthropological ethnographies, but with a lot of great tips for being more aware of details in day-to-day interactions as well.

Salsa Dancing in the Social Sciences — while arguably one of the weirdest book titles, this is an enjoyable and approachable overview of the benefits of qualitative methods.

Don’t Make Me Think — Steve Krug focuses primarily on usability, but his book offers good tips for observing how people interact with websites.

If you are passionate about methods and data science, check out product science and data science jobs at Indeed!


Cross-posted on Medium.


Does Your Job Title Matter?

The Importance of Picking the Right Job Title for Your Job

Job titles are often the first interaction between job seekers and employers. As a job seeker searches, they click on relevant titles before getting to know the role more deeply through its job description. Calling a job “software engineer” versus “programmer” will likely lead to a different number of applicants and proportion of those meeting the minimum qualifications, but just how different? Surprisingly, after a single word change in nearly identical job titles, we observed more qualified candidates and more total candidates as well. We will describe our initial research and how we can improve on this in the future.

Data & Product Science at Indeed

There are two main roles in Indeed’s Data Science organization — data scientists and product scientists. Currently, Indeed has data/product scientists in five offices: Austin, San Francisco, Seattle, Singapore, and Tokyo, working on a wide variety of product and engineering teams.

Both roles employ advanced statistical and machine learning techniques to help people get jobs. Data science has a higher emphasis on machine learning and software engineering, while product science focuses on experiments, analysis, and simpler models that can improve the product. In short, data scientists are closer to software engineering than product management, and vice versa for product scientists.

You can view the differences in the job descriptions here: (Product Scientist/Data Scientist). Despite their differences, the ultimate requirements for data and product scientists are essentially the same: a deep understanding of, and experience in, mathematics and computer science, along with domain expertise.

Palmer, Shelly. Data Science for the C-Suite. New York: Digital Living Press, 2015. Print.

Conway, Drew. The Data Science Venn Diagram. http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Sequential Test: Changing the job title

In order to find out how job titles affect the hiring process, we conducted an experiment and changed the Product Scientist title to “Data Scientist: Product” in Seattle and “Product Scientist: Data Science” in San Francisco on March 15th, while keeping the job title unchanged for Austin. Job descriptions remained the same for all three cities.

Because an A/B test would have required engineering work, we chose to run the test sequentially, comparing results before and after the change. We conducted a statistical power analysis to determine the sample size ahead of time. We first compared the click-through rate (defined as clicks/impressions) and the number of applies for the three cities before and after March 15th. The following two charts show that both the number of applies and the click-through rate jumped after March 15th for Seattle and SF. We performed t-tests, which show that applies and click-through rates are significantly higher for Seattle and SF than for Austin starting from March 15th.

[Chart: Application Growth Rates in Austin, SF and Seattle]

[Chart: ClickThrough Growth Rates in Austin, SF and Seattle]
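For readers who want to try this kind of comparison, here is a minimal sketch in Python. It assumes Welch's unequal-variance form of the t-test, which is our choice for the example rather than a statement of the exact test used, and the daily click and impression counts are made up.

```python
import math
from statistics import mean, variance


def click_through_rate(clicks, impressions):
    """Click-through rate as defined in the text: clicks / impressions."""
    return clicks / impressions


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    se_a = variance(sample_a) / na  # squared standard error of mean A
    se_b = variance(sample_b) / nb  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df


# Made-up daily (clicks, impressions) pairs after the title change.
seattle_daily = [(50, 1000), (60, 1000), (70, 1000)]
austin_daily = [(20, 1000), (30, 1000), (40, 1000)]

seattle_ctr = [click_through_rate(c, i) for c, i in seattle_daily]
austin_ctr = [click_through_rate(c, i) for c, i in austin_daily]
t_stat, dof = welch_t(seattle_ctr, austin_ctr)
```

Comparing `t_stat` against a t distribution with `dof` degrees of freedom gives the p-value; with SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test in one call.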

However, changing the job titles might affect the job search ranking, and we know the top and bottom ranked jobs on a page usually have a higher probability of being clicked. To account for this position bias, we fit a logistic regression that predicts clicks from the page, the position on the search engine results page (SERP), the city (Austin, Seattle or SF), and whether we changed the job title. We also included the interaction term between city and whether we changed the job title, to test the hypothesis that the log-odds ratios for the cities are different after changing titles than before.
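The sketch below shows one way the predictors for such a model could be encoded. The column names, the use of Austin as the reference city, and the reading of “changed the job title” as a before/after period flag are assumptions for illustration; the resulting rows could be fed to any logistic regression routine.

```python
def design_row(page, position, city, after_change):
    """Encode one impression as a feature row with city x period interactions.

    Austin is the reference city, so it gets no dummy column.
    `after_change` is 1 for impressions after the title change, else 0.
    """
    if city not in ("Austin", "Seattle", "SF"):
        raise ValueError(f"unknown city: {city}")
    seattle = 1 if city == "Seattle" else 0
    sf = 1 if city == "SF" else 0
    return {
        "intercept": 1,
        "page": page,
        "position": position,
        "seattle": seattle,
        "sf": sf,
        "after_change": after_change,
        # Interaction terms: nonzero only for that city after the change.
        "seattle_x_after": seattle * after_change,
        "sf_x_after": sf * after_change,
    }


example = design_row(page=1, position=3, city="Seattle", after_change=1)
```

The interaction columns are what let the city effect differ between the two periods; without them the model would force the same Seattle-versus-Austin gap before and after the change.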

The regression equation was estimated¹ as follows:

 

The non-parallel lines in the interaction plot below suggest that there are significant interaction effects, which the significant p-values for the interaction terms confirm.

Before changing titles, the equation is simply:

 

Switching from Austin to Seattle yields a change in log odds of -0.18 and to SF yields a change in log odds of -0.09.

After changing titles, the equation is:

 

Switching from Austin to Seattle yields a change in log odds of -0.18 + 0.60 = 0.42, and to SF yields a change in log odds of -0.09 + 0.71 = 0.62.
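One way to write the before/after comparison above in a single expression is sketched below. The coefficients -0.18, -0.09, 0.60 and 0.71 are the ones quoted in the text; the symbols for the intercept and for the page, position and title-change coefficients are placeholders whose values are not restated here, and the indicator names are assumed notation.

```latex
\begin{align*}
\operatorname{logit}\,P(\text{click})
  &= \beta_0 + \beta_{\text{page}}\,\text{page}
   + \beta_{\text{pos}}\,\text{position}
   + \beta_{\text{chg}}\,\text{changed} \\
  &\quad - 0.18\,\text{Seattle} - 0.09\,\text{SF}
   + 0.60\,(\text{Seattle}\times\text{changed})
   + 0.71\,(\text{SF}\times\text{changed}) \\[6pt]
\text{Before } (\text{changed}=0):\quad
  &\Delta_{\text{Seattle}} = -0.18, \qquad \Delta_{\text{SF}} = -0.09 \\
\text{After } (\text{changed}=1):\quad
  &\Delta_{\text{Seattle}} = -0.18 + 0.60 = 0.42, \qquad
   \Delta_{\text{SF}} = -0.09 + 0.71 = 0.62
\end{align*}
```

Here Δ is the change in log odds relative to Austin with page and position held fixed. The title-change term cancels in each comparison because both cities are compared within the same period.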

The graph below also confirms that the log-odds ratios for Seattle and SF are much higher after changing titles than before. To sum up, we see significantly more applicants in cities with changed titles.

[Chart: log odds by city, before and after changing titles]
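Log odds can be hard to read at a glance, so a short Python sketch that converts the changes quoted above into odds ratios may help. The dictionary layout and variable names are assumptions for illustration; the four numbers come from the text.

```python
import math

# Changes in log odds relative to Austin, as reported in the text.
log_odds_vs_austin = {
    ("Seattle", "before"): -0.18,
    ("SF", "before"): -0.09,
    ("Seattle", "after"): -0.18 + 0.60,
    ("SF", "after"): -0.09 + 0.71,
}

# exp(change in log odds) is the odds ratio relative to Austin.
odds_ratio_vs_austin = {
    key: math.exp(value) for key, value in log_odds_vs_austin.items()
}

for (city, period), ratio in odds_ratio_vs_austin.items():
    print(f"{city:8s} {period:6s} odds ratio vs Austin: {ratio:.2f}")
```

Read this way, the odds of a click in Seattle and SF were somewhat below Austin's before the change (ratios of about 0.84 and 0.91) and well above it afterwards (about 1.52 and 1.86), holding page and position fixed.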

Qualified application model

We see more applicants after changing titles, but is this pool of applicants more suitable for the role? A team at Indeed has developed a model that scores the likelihood of a resume containing skills and experiences that meet the requirements in a job description.

We applied this model to all candidates who applied for “Product Scientist” (before changing titles) from February 1 to March 14 and got a score² for each candidate. The mean scores for Austin, Seattle and SF were 0.489, 0.498 and 0.471 respectively. The plot below shows the Kernel Density Estimation (KDE) of the scores for Austin, Seattle, and SF, and the chart shows the (insignificant) p-values for t-tests and Kolmogorov-Smirnov (KS) tests. The KS test checks whether two samples are drawn from the same distribution. It is nonparametric and makes no assumption about the data distribution. Both tests indicate that our applicant qualification rate was at the same level for all 3 locations before changing titles.

[Chart: KDE of applicant scores for Austin, Seattle and SF before changing titles]
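To show what the KS test measures, here is a small self-contained Python sketch of the two-sample statistic and its large-sample p-value approximation. The function names and the sample scores are invented for illustration, and the asymptotic p-value is only rough for samples this small.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (the KS statistic D)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the gap can only change at observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_p_value(d, n_a, n_b, terms=100):
    """Asymptotic two-sided p-value for the two-sample KS statistic."""
    if d <= 0.0:
        return 1.0
    lam = d * math.sqrt(n_a * n_b / (n_a + n_b))
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Made-up qualification scores for two cities.
city_a = [0.41, 0.45, 0.52, 0.58]
city_b = [0.43, 0.47, 0.50, 0.61]
d_stat = ks_statistic(city_a, city_b)
p_val = ks_p_value(d_stat, len(city_a), len(city_b))
```

A small statistic with a large p-value, as in this toy case, is consistent with both samples coming from the same distribution. In practice a library routine such as `scipy.stats.ks_2samp` is the more robust choice.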

When we applied the model to all applicants after changing titles, the mean scores for Austin, Seattle and SF were 0.466, 0.516 and 0.528 respectively. We observed a small decrease in the mean score for Austin, accompanied by increases in Seattle and SF. The plot below shows the score distributions for Austin, Seattle and SF. After adjusting the p-values to control the False Discovery Rate, both tests indicate that applicant qualification rates with changed titles (Seattle and SF) are significantly higher than those with the original title (Austin), while there is no significant difference between the two changed titles (Data Scientist: Product and Product Scientist: Data Science).

[Chart: KDE of applicant scores for Austin, Seattle and SF after changing titles]
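The text above does not name the procedure used to control the False Discovery Rate. A common choice is Benjamini-Hochberg, sketched here in Python as an assumption; the function name and the raw p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for four pairwise comparisons.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adjusted]
```

In this toy case only the first comparison stays below 0.05 after adjustment, which shows why correcting for several city-by-city tests matters. With statsmodels installed, `multipletests(pvals, method="fdr_bh")` from `statsmodels.stats.multitest` applies the same correction.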

Are you surprised by these findings? Our pilot research shows that simply making small changes to job titles led to more and better-qualified candidates for Indeed. Job titles matter more than you might think: they catch a job seeker’s attention and deserve as much care as the job description. So choose your job titles carefully, and pick one that job seekers will notice.

For further reading, more rigorous approaches to establishing causal effect include structural equation models³ and the Neyman–Rubin causal model with matching methods⁴.

If you are interested in using the scientific method to improve or develop products and help people get jobs, check out our open Product Scientist and Data Scientist positions at Indeed!

This is the second article in our ongoing series about Data Science from Indeed. The first article is There’s No Such Thing as a Data Scientist from our colleague, Clint Chegin.


Footnotes:

1. P-value for the hypothesis test for which the Z value is the test statistic. It tells you the probability of a test statistic being at least as unusual as the one you obtained, if the null hypothesis were true (the coefficient is zero). If this probability is low, it suggests that it would be rare to get a result as unusual as this if the coefficient were really zero. The Signif. code associated with each estimate is only intended to flag levels of significance. The more asterisks, the more significant the p-value. For example, three asterisks represent a highly significant p-value (less than 0.001).

2. These model scores are non-standardized and not probabilities. An application score of 0.8 represents a higher likelihood relative to an application with a score of 0.4 (but doesn’t mean twice as likely).

3. Bollen, K.A.; Pearl, J. (2013). “Eight Myths about Causality and Structural Equation Models”. In Morgan, S.L. Handbook of Causal Analysis for Social Research. Dordrecht: Springer. pp. 301–328.

4. Sekhon, Jasjeet (2007). “The Neyman–Rubin Model of Causal Inference and Estimation via Matching Methods”. The Oxford Handbook of Political Methodology.


Cross-posted on Medium.
