There’s No Such Thing as a Data Scientist

Posted on December 10, 2018 by Clint Chegin

The Inconsistent Definitions of Data Science and More Descriptive Titles

Image credits, from left to bottom: 1) smoothgroover22 (cropped). 2) NazWeb. 3) BalticServers.com. 4) Wallpoper. 5) The Opte Project (cropped).

What do you really do?

There’s a memorable scene in the movie Office Space where consultants determining employee productivity start by asking, “What would you say… you do here?”

That scene and the “What I Do” images are funny because we empathize with the struggle to describe our jobs. It’s not funny, however, when the same misunderstanding occurs during the job search. It’s important to understand what a job posting means. It’s important for prospective employers to understand our skills and abilities. We’ve all viewed job postings with the same title, but with totally different descriptions.

How can the same title mean such vastly different things from one company to another?

This phenomenon is becoming increasingly common in the field of data science. The discipline has dramatically risen in popularity over the past few years. And while the number of data science jobs has increased, clarity around the role has declined. This post takes advantage of Indeed’s tremendous amounts of behavioral data to describe trends in the field and more specific definitions for data science roles.

The growing popularity of data science

Jobs matching “data scientist” have risen from 0.03% of jobs to about 0.15% (+400%) in a 4-year span.

Even earlier, in 2012, a much-ballyhooed article called Data Scientist the “Sexiest Job of the 21st Century.” If the title alone isn’t enough, maybe folks are interested for monetary reasons. According to Indeed’s salary data, a data scientist makes an average of $130k per year.

OK. Got it. Data science has taken off like discounted Nutella in a European supermarket. With this rise, we’ve also seen the refinement of more specific roles within the discipline. Our colleague Trey Causey wrote about the convergence between product managers and data scientists in the “Rise of the Data Product Manager.”

Many of us at Indeed also felt that the title “data scientist” has recently become more of a catch-all for many different sets of responsibilities. We wanted to dig deeper and test our intuition. Could we find natural delineations of roles within the job market? Could we use data to understand the differences within these titles and better classify them for clarity and consistency?

Spoiler Alert: We can.

Overlapping careers in data science

For this analysis of job titles, we looked at all site visitors who entered the search query “data scientist” on Indeed during January 2018. Next, we looked at the other searches these same users performed. We created a matrix of users by searches and its transpose, a matrix of searches by users. We multiplied these matrices to get the co-occurrence frequency for every pair of search terms, that is, how many users searched for both.
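The pair-frequency step can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the user IDs, queries, and the function name `cooccurrence_matrix` are all made up, and it assumes the matrices are simple 0/1 indicators of whether a user ran a query.

```python
# Hypothetical search log: user id -> set of queries (made-up data).
searches_by_user = {
    "u1": {"data scientist", "statistician", "biostatistician"},
    "u2": {"data scientist", "statistician", "machine learning engineer"},
    "u3": {"data scientist", "machine learning engineer"},
}


def cooccurrence_matrix(searches_by_user):
    """Return (terms, co) where co[i][j] counts users who ran both terms[i] and terms[j]."""
    terms = sorted(set().union(*searches_by_user.values()))
    # Users-by-searches indicator matrix: 1 if the user ran the query, else 0.
    m = [[1 if t in queries else 0 for t in terms]
         for queries in searches_by_user.values()]
    # (searches-by-users) times (users-by-searches), written out explicitly.
    n = len(terms)
    co = [[sum(row[i] * row[j] for row in m) for j in range(n)]
          for i in range(n)]
    return terms, co


terms, co = cooccurrence_matrix(searches_by_user)
```

The diagonal of `co` holds the number of users who ran each query, and the off-diagonal entries hold the pair counts that feed the clustering.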

Next, we removed “data scientist” from the data, as this search was present for all users. We used an R package called “igraph” to do the clustering and visualization. According to the igraph documentation, “this function implements the fast greedy modularity optimization algorithm for finding community structure.” While researching this algorithm, we learned that it was designed to quickly create communities from large data sets that have sparse regions. Hmm, that sounds exactly like the data we are using!
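To give a feel for what greedy modularity optimization does, here is a small pure-Python sketch. It applies the same greedy rule (repeatedly merge the pair of communities that increases modularity the most, and stop when no merge helps), but it is a naive version without the data structures that make the real algorithm fast, and it is not igraph’s implementation. The function name `greedy_modularity`, the queries, and the edge weights are invented for illustration; the sketch assumes an undirected graph with sortable vertex labels and no self-loops.

```python
def greedy_modularity(edges):
    """Naive greedy modularity clustering.

    edges: dict mapping (u, v) pairs to positive weights (undirected, no self-loops).
    Returns a list of sets of vertices.
    """
    two_m = 2.0 * sum(edges.values())
    members, degree, between = {}, {}, {}
    # Start with every vertex in its own community.
    for (u, v), w in edges.items():
        for x in (u, v):
            members.setdefault(x, {x})
            degree[x] = degree.get(x, 0.0) + w
        key = tuple(sorted((u, v)))
        between[key] = between.get(key, 0.0) + w

    while True:
        # Find the merge with the largest positive modularity gain.
        best_gain, best_pair = 0.0, None
        for (i, j), w in between.items():
            gain = 2.0 * (w / two_m - degree[i] * degree[j] / two_m ** 2)
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            break  # no merge improves modularity
        i, j = best_pair
        members[i] |= members.pop(j)
        degree[i] += degree.pop(j)
        # Re-label j as i in the between-community weights.
        merged = {}
        for (a, b), w in between.items():
            a, b = (i if a == j else a), (i if b == j else b)
            if a != b:  # edges now inside the merged community drop out
                key = tuple(sorted((a, b)))
                merged[key] = merged.get(key, 0.0) + w
        between = merged
    return list(members.values())


# Made-up co-search graph: two tight groups joined by a single edge.
edges = {
    ("statistician", "biostatistician"): 1,
    ("statistician", "quantitative analyst"): 1,
    ("biostatistician", "quantitative analyst"): 1,
    ("machine learning engineer", "deep learning"): 1,
    ("machine learning engineer", "computer vision"): 1,
    ("deep learning", "computer vision"): 1,
    ("quantitative analyst", "machine learning engineer"): 1,
}
communities = greedy_modularity(edges)
```

On this toy graph the two tightly connected groups end up as separate communities, because merging across the single bridging edge would lower modularity.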

Here’s a great obligatory equation we can add for how this works. You’ll have to read that paper to understand what it means.
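For reference, the quantity that the cited paper’s algorithm greedily maximizes is modularity. The form below is the standard definition and may not be the exact equation originally shown here. It assumes the usual notation: A is the adjacency matrix, k_v is the degree of vertex v, m is the number of edges, c_v is the community of vertex v, and δ is 1 when its arguments are equal and 0 otherwise. The second line is the change in modularity from merging communities i and j, where e_ij is the fraction of edge ends that connect i to j and a_i is the fraction of edge ends attached to i.

```latex
Q = \frac{1}{2m} \sum_{v,w} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w)

\Delta Q_{ij} = 2 \left( e_{ij} - a_i a_j \right)
```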

Next, we wrote a function with a pruning parameter to choose the minimum number of vertices in each cluster. This parameter is best set by “guess and check,” as higher numbers don’t necessarily mean more total groups and vice versa. We tried various numbers from 3–20 and checked to see if the groups made sense. We didn’t care about really small clusters and we wanted the queries to fit together. More on this later.
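One plausible reading of the pruning step is a filter that drops clusters below a minimum size, sketched below. The function name `prune_clusters` and the sample clusters are hypothetical. The function described above may have worked differently (for example, by pruning before re-clustering), which would explain why the number of groups did not change monotonically with the parameter; a plain filter like this one can only keep the same number of clusters or fewer as the threshold rises.

```python
def prune_clusters(clusters, min_size):
    """Keep only the clusters that have at least min_size vertices."""
    return [c for c in clusters if len(c) >= min_size]


# Made-up clusters of search queries.
clusters = [
    {"statistician", "biostatistician", "quantitative analyst"},
    {"machine learning engineer", "deep learning"},
    {"physicist"},
]

# "Guess and check": try several thresholds and inspect what survives.
for min_size in range(1, 4):
    kept = prune_clusters(clusters, min_size)
    print(min_size, len(kept), [sorted(c) for c in kept])
```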

By choosing five as the pruning threshold, four clusters formed. We subsequently labeled these clusters “business intelligence”, “statistician”, “machine learning engineer”, and “natural scientist”.

Here are the queries that make up each group:

[Interactive figure: “Job Title Network Graph” by Erik Oberg (@obergew), on CodePen]

Thanks to Erik Oberg for the CodePen viz

And here’s how the clustering turned out:

Thanks to Zhuying Xu for the Plotly viz

From the preceding chart, we see a few interesting things.

First, there is clear demarcation between statistician and machine learning engineer. Since we don’t see many searches that cross over between these roles, this suggests two distinct career paths.

Second, business intelligence doesn’t seem to have a clean grouping. It is dispersed broadly across the other roles. This contrasts with natural scientist searches, which seem to overlap more with statistician searches. This tells us that job seekers who search for business intelligence might be looking at a wide variety of other jobs within the data science realm. It could also mean that business intelligence positions are being called data science more often now. Further, it seems job seekers who search for machine learning engineer or statistician don’t search for jobs in both categories.

Finally, we see that some natural scientists are perhaps getting into data science through the statistician end of the data science spectrum.

More descriptive roles in data science

From these findings, we would posit that there is no single type of data scientist. Rather, there are many types! There is no single description of a data scientist and thus this title alone doesn’t give us enough information. Data science as a title could translate to a variety of different roles in practice.

Taken together, it’s important to gather more information to understand what it means to be a data scientist at a given company. We believe it would be helpful for employers to think in terms of the roles identified in our clustering. This will help them find the candidates they need and enable job seekers to apply for the jobs they want.

At Indeed, we have a few “data” roles: data engineer, BI developer, BI analyst, product scientist, and data scientist. It looks something like this:

Data Science Job Strengths

Thanks to Ron Chipman for helping put this together

It’s easy to see how confusing this can become. Given the search patterns we’ve observed, if someone were to say, “I want to be a data scientist at Indeed,” it could be unclear which team or title would be the best fit. Each team has a different interview process and contributes in different ways, so it’s really important to apply to the right one.

This is the first blog post in a series diving more deeply into data science insights from Indeed. In upcoming posts, we’ll explore the skills associated with data science jobs. We’ll showcase trends and the overlap from each of these more specific job titles. We’ll also describe what skills you should gain if you are interested in a particular career path. We’ll provide employers with tips to interview better for the specific needs of their organization. Finally, we’ll describe “Job Title Supernovae” — jobs that grow quickly and fade away.

Will the title “data scientist” die away like “webmaster” did in the 90s? Subscribe or tune in to future posts for that prediction and more!

At Indeed, We Help People Get Jobs and we hope to help you too. If any of these roles have excited you, please check out www.indeed.jobs and apply today!

Footnotes

A. Clauset, M.E.J. Newman, C. Moore: Finding community structure in very large networks, http://www.arxiv.org/abs/cond-mat/0408187

Cross-posted on Medium.

Market Your Data Science Like a Product

Posted on December 8, 2018 by Erik Oberg

A 7-Step ‘Go-to-Market’ Plan for Your Next Data Product

Why do internal tools need marketing?

Have you ever developed a great solution that never gets used? Accuracy, statistical significance, model type: none of these matter if your data product is not put into action. Positively impacting your organization as a data scientist means developing high quality data products and successfully launching those data products.

As a product scientist at Indeed (product science is a team in data science), I think about launching both business products and internal data products. This has helped me see that marketing techniques for launching goods and services can also be applied to launching data products internally. With this perspective, I’ve helped the tools I developed become among the top 10% most used at Indeed.

I have broken down what I do into seven steps:

1. Naming/branding
2. Documentation
3. Champion identification
4. Timing
5. Outreach
6. Demoing
7. Tracking

1. Get an MBA name

Your product needs a name that’s MBA: Memorable, Brandable, and Available.

Indeed runs over 500 IPython notebook web applications for internal reporting each day. We’ve developed and deployed over 12,000 IPython notebook web applications. In this rich reporting environment, data products need a way to distinguish themselves from one another. It’s hard to summarize the months you have spent exploring data, developing a model, and validating output in just a few words, but you shortchange your work if you go with “The model” or “The revenue/job seeker behavior/sales thing I have been making!”

Identify your high-quality data products in ways that signal your past and future investment in the work.

Memorable

Apple and Starbucks are two of the most valuable brands in the world. Still, in a study by Signs.com, only 20% of people could draw the Apple logo perfectly, and only 6% could draw the Starbucks logo perfectly. This points to the power of the name. People do not need to remember exactly how a logo or your data product looks and works, but they need to be able to recall it by name.

Memorable names are often:

Pronounceable. They start with a sharp sound and roll off the tongue. Research on English speakers suggests names with initial plosive consonants (p, t, k) are more memorable, but also see research on word symbolism.

Plain. They frequently repurpose common words (e.g., Apple or Indeed), which helps you connect rich mental images to your product. Be aware that discoverability through search may be limited when using common words. Slightly modifying the word can help overcome this (Lyft) as long as it’s memorable.

Produced. They can even be entirely new. Making up a new word is also a strategy (Google, Intel, Sony, or Garmin), but this requires substantially more initial seeding to establish the name. This may not be in line with the audience and timeframe of an internal data product launch.

Brandable

You want your name to consistently represent the identity of the data product and reflect an overall positive attitude towards it. This way it can be incorporated seamlessly into the tool and documentation.

Available

Make sure no one else has called their data product the same thing!

Once you have picked the name, you can dress it up with a logo. The logo can simply be your MBA name that’s been stylized following the same MBA principles. A shortcut like Font Meme Text Generator can quickly create a sufficient design.


2. Document the product

You know what your code does. But what if you’re not around to answer questions or give a demo when the CEO or a curious new intern wonders, “What does this thing do?”

Documentation is not only good practice for a data scientist or developer, but also an opportunity for your work to be found. When one business wants to know if another business has the products and services it needs, 71% start with a simple Google search. Similarly, in addition to being valuable for your user group, wiki documentation and code comments create searchable content that helps your work get discovered.

When writing your documentation, identify:

the main problem your data product is solving
key features and how they solve the problem
key definitions
key technical aspects that need to be explained

Documenting your product’s journey can also help build trust in the product. Use consistent messaging by including your MBA name and logo within the documentation to further establish your brand.

3. Identify champions

Who else “gets” the problem you are trying to solve and how the data product delivers a solution?

Seek out people who are affected by that problem, and share your work with them. Also, look to your own team members who have participated in the build or know your work. These champions can recommend your work to others who would also appreciate the solution.

Identifying champions is analogous to customer advocacy in consumer business. Word-of-mouth is a leading influencer across continents and generations for ~83% of consumers (according to a study by Nielsen) when making a purchase decision. Your data product champions will be your top sales reps, lending credibility to the tool and answering questions when you are not around.

4. Timing is everything

Before each launch, consider the current business environment, and time your launch accordingly. The moment you have finished working on your data product is not necessarily the best time to launch it. For example, a product team may be in the middle of fixing a major bug and not ready for a new idea. Conversely, an upcoming related communication activity (e.g., blog post) could be an opportune time for a release with cross promotion.

Look at other recent data products: When were they released and how were they received? Stakeholders can feel inundated with too many new dashboards and models and this may even contribute to “analysis paralysis.”

5. Know your audience

If your champions are not happy, your product can lose its luster in a Snap. Developing positive working relationships with your champions and users is important for the early and long-term success of your data product.

Identify and reach your audience — those who will be using what you’ve made and can benefit from it. With this target audience in mind, comment on tickets, post on Slack, chat, send emails to relevant groups, or go directly to talk to your audience.

Use your audience’s preferred channels to communicate development progress, releases, and feedback. Establishing this communication will build early confidence in your data product. As iteration requests come in, you will have the opportunity to build this confidence with thoughtful acknowledgement of requests.

In 2017, Indeed’s Data Science Platform team — software engineers who built a machine learning deployment framework — went on a roadshow to Indeed’s multiple tech offices to share the data science platform framework. This was a great example of engaging with an audience across offices.

6. Go live!

Only you can see the picture in your mind of how something works. Demoing is a powerful way to communicate what your new data product does. A great way to do this is by getting a minimum viable data product, a prototype, out early to your champions.

Examples include creating a working application with minimal data, sketching a mockup of a dashboard, or taking screenshots. Forbes has more examples of consumer products. As a demo to explain a sales lead qualification machine learning model to the Sales organization, the product science team built a simple interactive web app that returned the model results when a user changed the value of the model features with sliders.

7. Own the results

“It’s not that I’m so smart, it’s just that I stay with problems longer.” — Albert Einstein

You may love the theoretical foundation and implementation of your data product, but ultimately the success of a data product comes down to the user. Long-term marketing and user retention depend on how reliable your product is. Reliability is key to building your data product’s brand, your reputation, and your technical credibility. This affects the marketing for your other current and future data products as well. It’s worth noting that this doesn’t mean perfection. It often just means dealing with problems quickly, fully, and transparently.

Monitor key metrics of your data product to see how it’s working and what its impact is. Actively seek and be responsive to feedback. Evaluate if your data product is achieving its intended objectives and determine if features can be improved to better suit your audience.
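As one example of a usage metric, the sketch below counts distinct users per week from a usage log. Everything here is hypothetical: the log format, the user names, and the function name `weekly_active_users` are assumptions for illustration, and your own tooling may already expose similar numbers.

```python
from collections import defaultdict
from datetime import date

# Hypothetical usage log of (day, user) pairs; made-up data.
usage_log = [
    (date(2018, 11, 5), "ana"),
    (date(2018, 11, 6), "ana"),
    (date(2018, 11, 7), "raj"),
    (date(2018, 11, 13), "ana"),
]


def weekly_active_users(log):
    """Map each ISO (year, week) to the number of distinct users seen that week."""
    users = defaultdict(set)
    for day, user in log:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        users[(iso[0], iso[1])].add(user)
    return {week: len(names) for week, names in sorted(users.items())}


weekly = weekly_active_users(usage_log)
```

A falling weekly count is a prompt to talk to users, not a verdict on the product.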

If you are not achieving impact or the tool is not being used, revisit your initial assumptions about the problem you thought you were solving. Then, talk to your users (and non-users) about what might not be working. Be willing to destroy and start again, and create something even better with a new perspective. The initiative to iterate and improve your data product tools requires persistence but will raise the quality of your data products and enhance the rest of your marketing efforts.

Final thoughts

Teams outside the analytics community depend on your marketing efforts to learn about your data products that can make them and the company more effective. You don’t have to wait until the product is finished to start letting other teams know about the product. The marketing can start with documentation, champion identification, and outreach as soon as initial requirements are being gathered.

That being said, creating a quality data product takes priority over marketing it, so choose carefully what you market. Your credibility as a data scientist is essential for people to trust your data-driven recommendations and act on them. Ensure that you’re investing it wisely.

If you are passionate about both developing great data products and making sure your data products have impact, check out product science and data science at Indeed!

Open Source at Indeed: Sponsoring Outreachy

Posted on November 21, 2018 by Duane O'Brien

Indeed is committed to supporting the open source community. That’s why we’re excited to announce our sponsorship of Outreachy!

What is Outreachy?

Outreachy supports diversity and inclusion across the whole open source community. By providing paid internships to people from underrepresented groups, Outreachy creates meaningful opportunities for individuals to make real contributions to open source while helping to improve inclusion in the community. Open source benefits from diverse participation, and Outreachy is making a difference. Outreachy accepted 46 interns for the December 2018 to March 2019 round of internships. Find more information about their projects on the Outreachy Alums page.

Marina Zhurakhinskaya, Outreachy co-organizer, says: “Outreachy is excited to welcome Indeed as a sponsor and is grateful for the commitment from Indeed to support diversity in free and open source software. With the help from Indeed, we are able to support more Outreachy applicants making their first contributions to free and open source software and more interns gaining in-depth experience.”

Indeed and the Community

As we continue to take a more active role in the open source community, Indeed will seek out additional partnerships, sponsorships, and memberships. In addition to sponsoring Outreachy, this year Indeed joined the Cloud Native Computing Foundation and began sponsoring the Python Software Foundation, the Apache Software Foundation, the Open Source Initiative, and Webpack.

For updates on Indeed’s open source projects, visit our open source site. If you’re interested in open source roles at Indeed, visit our hiring page.