Open Source at Indeed: Sponsoring Outreachy

Indeed is committed to supporting the open source community. That’s why we’re excited to announce our sponsorship of Outreachy!

What is Outreachy?

Outreachy supports diversity and inclusion across the whole open source community. By providing paid internships to people from underrepresented groups, Outreachy creates meaningful opportunities for individuals to make real contributions to open source while helping to improve inclusion in the community. Open source benefits from diverse participation, and Outreachy is making a difference. Outreachy accepted 46 interns for the December 2018 to March 2019 round of internships. Find more information about their projects on the Outreachy Alums page.

Marina Zhurakhinskaya, Outreachy co-organizer, says: “Outreachy is excited to welcome Indeed as a sponsor and is grateful for the commitment from Indeed to support diversity in free and open source software. With the help from Indeed, we are able to support more Outreachy applicants making their first contributions to free and open source software and more interns gaining in-depth experience.”

Indeed and the Community

As we continue to take a more active role in the open source community, Indeed will seek out additional partnerships, sponsorships, and memberships. In addition to sponsoring Outreachy, this year Indeed joined the Cloud Native Computing Foundation and began sponsoring the Python Software Foundation, the Apache Software Foundation, the Open Source Initiative, and Webpack.


For updates on Indeed’s open source projects, visit our open source site. If you’re interested in open source roles at Indeed, visit our hiring page.

Cross-posted on Medium.

The Benefits of Hindsight: A Metrics-Driven Approach to Coaching

In a previous post, I described using a measure-question-learn-improve cycle to drive improvements to development processes. In this post, I assert that this cycle can also help people understand their own opportunities for improvement and growth. It can be a powerful coaching tool — when used correctly.

At Indeed, we’ve developed an internal web app called Hindsight that rolls up measurements of work done by individuals. This tool makes contributions more transparent for that person and their manager.

Screenshot of Hindsight app showing an example user's measurements of work over several quarters, including number of Jira issues resolved, reported, commented, reopened; number of deploys; number of protests; and edits to the wiki

Each individual has a Hindsight card that shows their activity over time, quarter by quarter. Many of the numbers come from Jira, such as issues resolved, reported, and commented on. Others come from other SDLC tools. All numbers are clickable so that you can drill down into the details.
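As a rough illustration of a quarter-by-quarter rollup, here is a minimal sketch. Hindsight is an internal tool, so this is not its implementation: the event layout and the names `quarter_label` and `rollup_by_quarter` are assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import date


def quarter_label(day):
    """Return a label such as '2018-Q3' for a calendar date."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def rollup_by_quarter(events):
    """Count events per person, per quarter, per activity type.

    `events` is an iterable of (person, day, activity) tuples. This layout is
    hypothetical; it stands in for data pulled from Jira and other SDLC tools.
    """
    cards = defaultdict(lambda: defaultdict(Counter))
    for person, day, activity in events:
        cards[person][quarter_label(day)][activity] += 1
    return cards


# Made-up sample events, for illustration only.
events = [
    ("jack", date(2018, 7, 3), "resolved"),
    ("jack", date(2018, 8, 14), "resolved"),
    ("jack", date(2018, 8, 20), "reopened"),
    ("jack", date(2018, 10, 2), "commented"),
]
cards = rollup_by_quarter(events)
print(dict(cards["jack"]["2018-Q3"]))
```

The sketch produces counts only. It deliberately computes no averages or targets, since the numbers are meant to start a conversation rather than settle one.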

When we introduced Hindsight, we worried about the Number Six Principle and Goodhart’s Law (explained in the earlier post). To protect against these negative effects, we constantly emphasize two guidelines:

  • Hindsight is a starting point for discussion. It can’t tell the whole story, but it can surface trends and phenomena that are worth digging into.
  • There are no targets. There’s no notion of a “reasonable number” for a given role and level, because that would quickly become a target. We even avoid analyzing medians/averages for the metrics included.

Hindsight in action: How’s your quality?

To see how Hindsight fits into the measure-question-learn-improve cycle, consider this example: Suppose my card shows that for last quarter I resolved 100 issues and had 30 issues reopened during testing. As my manager, you might be tempted to say, “Jack is really productive, but he tries to ship a lot of buggy code and should pay more attention to quality.”

But remember — the metrics are only a starting point for discussion. You need to ask questions and dig into the data. When you read through the 30 reopened issues, you discover that only 10 of them were actual bugs, and all of those bugs were relatively minor. Now the story is changing. In fact, your investigation might drive insight into how the team can improve their communication around testing.

Measure, question, learn, improve

In this five-part series, I’ve explored how metrics help us improve how we work at Indeed. Every engineering organization can and should use data to drive important conversations. Whether you use Imhotep, spreadsheets, or other tools, it’s worth doing. Start by measuring everything you can. Then question your measurements, learn, and repeat. You’ll soon find yourself in a rewarding cycle of continuous improvement.


Cross-posted on Medium.

What’s Up, ASF? Using Imhotep to Understand Project Activity

As I described in an earlier post, we built Imhotep as a data analytics platform for the rapid exploration and analysis of large time-series datasets. In the previous post, I showed how an Imhotep dataset based on Atlassian Jira can drive improvements to the development process.

We’re continually searching for new ways to collect metrics. Examining actions in Jira, the tool we use for tracking our development process, seemed like a natural fit for gaining process insights. We decided to find a way to convert Jira issue history for a large set of projects into an Imhotep dataset of actions, organized by time.

The open source Jira Actions Imhotep Builder transforms issue activity in a Jira instance into an Imhotep dataset. Each document in the resulting dataset corresponds to a single action on a Jira issue, such as creation, edit, transition, or comment.

The builder queries the Jira REST API for each Jira issue in the specified time range, then deconstructs the issue into a series of actions. The actions are written to a series of .TSV (tab-separated values) files, which are uploaded to an Imhotep dataset.
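The deconstruction step can be sketched roughly as follows. This is not the builder's actual code. It assumes an issue dictionary shaped like the Jira REST API issue resource with the changelog expanded, and the output columns (`issuekey`, `time`, `actor`, `action`) are illustrative rather than the builder's real schema. The HTTP request and the upload to Imhotep are left out.

```python
import csv
import io

# Illustrative column set; the real dataset has more fields.
COLUMNS = ["issuekey", "time", "actor", "action"]


def issue_to_actions(issue):
    """Flatten one Jira issue into a time-ordered list of action rows."""
    key = issue["key"]
    fields = issue["fields"]
    rows = [{
        "issuekey": key,
        "time": fields["created"],
        "actor": fields["reporter"]["displayName"],
        "action": "create",
    }]
    # Each changelog history entry is one edit; a status change is a transition.
    for history in issue.get("changelog", {}).get("histories", []):
        changed = {item["field"] for item in history["items"]}
        rows.append({
            "issuekey": key,
            "time": history["created"],
            "actor": history["author"]["displayName"],
            "action": "transition" if "status" in changed else "edit",
        })
    for comment in fields.get("comment", {}).get("comments", []):
        rows.append({
            "issuekey": key,
            "time": comment["created"],
            "actor": comment["author"]["displayName"],
            "action": "comment",
        })
    # Timestamps that share a UTC offset sort correctly as plain strings.
    return sorted(rows, key=lambda row: row["time"])


def actions_to_tsv(rows):
    """Serialize action rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample issue, for illustration only.
sample_issue = {
    "key": "DEMO-1",
    "fields": {
        "created": "2018-07-01T09:00:00.000+0000",
        "reporter": {"displayName": "Alice"},
        "comment": {"comments": [
            {"created": "2018-07-02T10:00:00.000+0000",
             "author": {"displayName": "Bob"}},
        ]},
    },
    "changelog": {"histories": [
        {"created": "2018-07-03T11:00:00.000+0000",
         "author": {"displayName": "Alice"},
         "items": [{"field": "status", "fromString": "Open",
                    "toString": "Resolved"}]},
    ]},
}
print(actions_to_tsv(issue_to_actions(sample_issue)))
```

One issue becomes several rows, one per action, which is what lets the dataset be queried by time and by actor.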

Using that builder, we created a dataset of activity on projects in the Apache Software Foundation (from their Jira instance). We hope Apache projects take advantage of the dataset to gain insights about ways they can improve processes for their developer and user communities.

 

Diving into the ASF Jira data

We created an Imhotep dataset of ASF Jira data from January 1, 2016 through the present. As of October 17, 2018, the apachejira dataset:

  • contains nearly 3.4 million Jira actions, including 230,298 issue creations, 1.8 million edits, and 1.3 million comments
  • requires only 274MB on disk, or about 81 bytes per action
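The bytes-per-action figure can be checked with a little arithmetic. This sketch assumes decimal megabytes and rounds the action count to 3.4 million, so it is only an approximation of the real numbers; `bytes_per_action` is a name made up for the example.

```python
def bytes_per_action(disk_megabytes, action_count):
    """Average on-disk size per action, assuming 1 MB = 1,000,000 bytes."""
    return disk_megabytes * 1_000_000 / action_count


# 274 MB spread over roughly 3.4 million actions.
print(round(bytes_per_action(274, 3_400_000)))  # 81
```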

Using the apachejira dataset, we can answer many questions about what’s happening in ASF projects, such as the following examples.

Who reported the most bugs in ASF projects from July through September 2018?

from apachejira 2018-07-01 2018-10-01
   where action="create" issuetype="Bug"
   group by actor

The top reporter is Beam JIRA Bot, with Sebb (presumably an actual person) in the #2 position:

Screenshot of top ten query results for ASF project query, with 7,675 bugs total; #1 is Beam JIRA Bot with 74 issues; #2 is Sebb with 56 issues
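For readers new to IQL, the query above amounts to a filter followed by a grouped count. The Python sketch below shows the same idea over a handful of in-memory records. The records are made up, the function name `top_bug_reporters` is hypothetical, and the date range from the `from` clause is omitted for brevity.

```python
from collections import Counter


def top_bug_reporters(actions, limit=10):
    """Count bug-creation actions per actor and return the most frequent."""
    created_bugs = (
        a["actor"] for a in actions
        if a["action"] == "create" and a["issuetype"] == "Bug"
    )
    return Counter(created_bugs).most_common(limit)


# Made-up sample records using field names that appear in the query.
actions = [
    {"actor": "bot", "action": "create", "issuetype": "Bug"},
    {"actor": "bot", "action": "create", "issuetype": "Bug"},
    {"actor": "ann", "action": "create", "issuetype": "Bug"},
    {"actor": "ann", "action": "comment", "issuetype": "Bug"},
    {"actor": "raj", "action": "create", "issuetype": "Task"},
]
print(top_bug_reporters(actions))  # [('bot', 2), ('ann', 1)]
```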

Which projects had the most bugs reported from July through September 2018?

from apachejira 2018-07-01 2018-10-01
   where action="create" issuetype="Bug" 
   group by project

Ignite edges out Ambari for the top spot, with 401 bugs reported.

Screenshot of top 10 query results for projects with most bugs, with 7,675 bugs total; #1 is Ignite with 401 bugs; #2 is Ambari with 349 bugs

The next two questions explore some differences in project workflows.

How many distinct status values exist in the most active projects?

from apachejira 2018-01-01 today
   group by project[10] 
   select count(), distinct(status)

Five of the top ten projects have 6 distinct statuses, and the other five have 5 distinct statuses. For example, Apache Beam has 5, and Apache Hive has 6.

Screenshot of top 10 query results for most active ASF projects, showing the distinct status values for each project.

How do the statuses used by Apache Beam and Apache Hive compare to one another?

from apachejira 2018-01-01 today
   where project in (Beam, Hive)
   group by status
   select project='Beam',project='Hive'

Hive uses the Patch Available state; Beam doesn’t. It turns out that about 11% of the Apache JIRA projects take advantage of this state.

Screenshot of query results comparing Beam and Hive projects, listing the issues in Open, In Progress, Resolved, Reopened, Closed, and for Hive only, Patch Available status
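The comparison query groups by status and then counts each project in its own column. A rough Python equivalent, using made-up records and a hypothetical helper named `status_counts_by_project`, might look like this:

```python
from collections import defaultdict


def status_counts_by_project(actions, projects):
    """Build a status -> {project: count} table for the chosen projects."""
    # Every status row starts with a zero for each project being compared.
    table = defaultdict(lambda: dict.fromkeys(projects, 0))
    for a in actions:
        if a["project"] in projects:
            table[a["status"]][a["project"]] += 1
    return dict(table)


# Made-up sample records, for illustration only.
actions = [
    {"project": "Beam", "status": "Open"},
    {"project": "Hive", "status": "Open"},
    {"project": "Hive", "status": "Patch Available"},
    {"project": "Kafka", "status": "Open"},
]
result = status_counts_by_project(actions, ("Beam", "Hive"))
print(result["Patch Available"])  # {'Beam': 0, 'Hive': 1}
```

A zero in one column next to a nonzero count in the other is what signals that a project does not use a given status.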

Which projects had the most contributors changing issue status to Patch Available in 2018?

from apachejira 2018-01-01 2019-01-01
   where fieldschangedtok='status' 
      status='Patch Available'
   group by project[10]
   select distinct(actor)

Hadoop ecosystem projects (Hive, HDFS, Hadoop Common, YARN, HBase, and Hadoop Distributed Data Store) claim six of the top 10 spots.

Screenshot of results for top 10 projects where contributors changed issue status to Patch Available in 2018; led by Hive, Hadoop HDFS, Ignite

Who contributed to (set status to Patch Available in) Apache Hive in 2018?

from apachejira 2018-01-01 2019-01-01
   where fieldschangedtok='status' 
      status='Patch Available' project = 'Hive'
   group by actor[10]
   select count(), distinct(issuekey)

The top 10 contributors contributed to 578 issues in 2018.

Screenshot of results for top 10 contributors who set issue status to Patch Available in Apache Hive in 2018, including count per contributor

How long does it take to get a patch accepted in the 20 most active projects?

from apachejira 2018-01-01 2019-01-01
   where prevstatus="Patch Available" 
      status="Resolved" 
      fieldschangedtok="status"
   group by project[20]
   select count(), timeinstate\3600/count() 
      /* hours in state */

Hadoop Distributed Data Store is the fastest, with an average of 102 hours between the Patch Available and Resolved states.

Screenshot of query results of top 20 ASF projects with fastest time between Patch Available and Resolved states, ranked by lowest hours in state
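The hours-in-state metric is total time in state divided by the number of matching actions. The sketch below reproduces that calculation in Python under two assumptions: that `timeinstate` is recorded in seconds (suggested by the division by 3600), and that each record is a plain dictionary. The function name `average_hours_in_state` and the sample records are made up.

```python
def average_hours_in_state(actions, key="project"):
    """Average hours spent in Patch Available before Resolved, per group."""
    totals = {}
    for a in actions:
        # Keep only transitions from Patch Available to Resolved.
        if a["prevstatus"] != "Patch Available" or a["status"] != "Resolved":
            continue
        seconds, count = totals.get(a[key], (0, 0))
        totals[a[key]] = (seconds + a["timeinstate"], count + 1)
    return {group: s / 3600 / n for group, (s, n) in totals.items()}


# Made-up sample records, for illustration only.
sample = [
    {"project": "X", "resolution": "Fixed", "prevstatus": "Patch Available",
     "status": "Resolved", "timeinstate": 7200},
    {"project": "X", "resolution": "Fixed", "prevstatus": "Patch Available",
     "status": "Resolved", "timeinstate": 3600},
    {"project": "Y", "resolution": "Duplicate", "prevstatus": "Patch Available",
     "status": "Resolved", "timeinstate": 36000},
    {"project": "Y", "resolution": "Fixed", "prevstatus": "Open",
     "status": "Resolved", "timeinstate": 999},
]
print(average_hours_in_state(sample))  # {'X': 1.5, 'Y': 10.0}
```

Passing `key="resolution"` groups the same metric by resolution instead, the kind of split that can expose a few slow outliers behind a high average.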

The average for Kafka is really high, but it turns out that about 28 outliers with resolutions of Not A Problem, Auto Closed, Duplicate, Won’t Fix, and Won’t Do contributed to the high average.

from apachejira 2018-01-01 2019-01-01
   where prevstatus="Patch Available" 
      status="Resolved" 
      fieldschangedtok="status" project = Kafka
   group by resolution
   select count(), timeinstate\3600/count() 
      /* hours in state */

Screenshot of query results of Kafka project issues by resolution, ranked from high to low hours in state; led by Not A Problem and Auto Closed.

That might be a bad thing or an okay thing for the community. Either way, digging into numbers like these can raise interesting questions.

This is just a small sample of the questions we could explore in this dataset.

Creating and analyzing your own Jira datasets

We’ve made the Jira Actions Imhotep Builder available as open source. We hope you will use it to build your own Jira-based Imhotep datasets. This builder is the first one we’ve published, and we’ve also listed it in a new Imhotep Builder Directory.

If you have an idea for a new builder, or need help getting started with Imhotep, open an issue in the GitHub repository or reach out on Twitter.

In the next post in this series, I describe Hindsight, an internal tool we use to make internal contributor work visible and drive coaching insights.


Cross-posted on Medium.