As I described in an earlier post, we built Imhotep as a data analytics platform for the rapid exploration and analysis of large time-series datasets. In the previous post, I showed how an Imhotep dataset based on Atlassian Jira can drive improvements to the development process.
We’re continually searching for new ways to collect metrics. Examining actions in Jira, the tool we use for tracking our development process, seemed like a natural fit for gaining process insights. We decided to find a way to convert Jira issue history for a large set of projects into an Imhotep dataset of actions, organized by time.
The open source Jira Actions Imhotep Builder transforms issue activity in a Jira instance into an Imhotep dataset. Each document in the resulting dataset corresponds to a single action on a Jira issue, such as creation, edit, transition, or comment.
The builder queries the Jira REST API for each Jira issue in the specified time range, then deconstructs the issue into a series of actions. The actions are written to a set of TSV (tab-separated values) files, which are then uploaded to an Imhotep dataset.
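To make the deconstruction step concrete, here is a minimal Python sketch. It is not the builder's actual code. It assumes the issue JSON has the shape returned by the Jira REST API issue endpoint with `expand=changelog`, and it skips the HTTP fetch entirely. The function names, the column set, and the action labels other than `create` are hypothetical choices for illustration; the real builder's schema is richer.

```python
import csv
import io

# Hypothetical column set, loosely modeled on field names used in the queries in this post.
COLUMNS = ["time", "issuekey", "action", "actor", "fieldschanged"]

def issue_to_actions(issue):
    """Flatten one Jira issue (REST JSON, changelog expanded) into action rows."""
    key = issue["key"]
    fields = issue["fields"]
    # The issue itself yields one "create" action attributed to the reporter.
    actions = [{
        "time": fields["created"], "issuekey": key, "action": "create",
        "actor": fields["reporter"]["displayName"], "fieldschanged": "",
    }]
    # Each changelog entry becomes one edit or transition action.
    for history in issue.get("changelog", {}).get("histories", []):
        changed = [item["field"] for item in history["items"]]
        actions.append({
            "time": history["created"], "issuekey": key,
            # Assumption: any entry that touches status counts as a transition.
            "action": "transition" if "status" in changed else "edit",
            "actor": history["author"]["displayName"],
            "fieldschanged": " ".join(changed),
        })
    # Each comment becomes one comment action.
    for comment in fields.get("comment", {}).get("comments", []):
        actions.append({
            "time": comment["created"], "issuekey": key, "action": "comment",
            "actor": comment["author"]["displayName"], "fieldschanged": "",
        })
    # Timestamps that share a UTC offset sort correctly as plain strings.
    actions.sort(key=lambda a: a["time"])
    return actions

def write_tsv(actions, out):
    """Write action rows as tab-separated values with a header line."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(actions)

# Made-up issue for illustration only.
sample_issue = {
    "key": "DEMO-1",
    "fields": {
        "created": "2018-07-01T09:00:00.000+0000",
        "reporter": {"displayName": "Alice"},
        "comment": {"comments": [
            {"created": "2018-07-02T12:00:00.000+0000", "author": {"displayName": "Bob"}},
        ]},
    },
    "changelog": {"histories": [
        {"created": "2018-07-03T08:30:00.000+0000", "author": {"displayName": "Bob"},
         "items": [{"field": "status", "fromString": "Open", "toString": "Patch Available"}]},
    ]},
}

buffer = io.StringIO()
write_tsv(issue_to_actions(sample_issue), buffer)
print(buffer.getvalue())
```

One issue produces several rows, one per action, which is what lets the queries below filter and group by action, actor, or changed field.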
Using that builder, we created a dataset of activity on projects in the Apache Software Foundation (from their Jira instance). We hope Apache projects take advantage of the dataset to gain insights about ways they can improve processes for their developer and user communities.
Diving into the ASF Jira data
We created an Imhotep dataset of ASF Jira data from January 1, 2016 through the present. As of October 17, 2018, the apachejira dataset:
- contains nearly 3.4 million Jira actions, including 230,298 issue creations, 1.8 million edits, and 1.3 million comments
- requires only 274MB on disk, or about 81 bytes per action
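As a rough check on the per-action figure, the arithmetic can be reproduced in a few lines of Python. This assumes decimal megabytes and uses 3.4 million as an upper bound on the action count, so the true value is slightly higher than the result printed here.

```python
dataset_bytes = 274 * 10**6   # 274 MB, assuming 1 MB = 10^6 bytes
action_count = 3_400_000      # upper bound: the dataset has "nearly" 3.4 million actions

bytes_per_action = dataset_bytes / action_count
print(round(bytes_per_action, 1))  # 80.6
```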
Using the apachejira dataset, we can answer many questions about what’s happening in ASF projects, such as the following examples.
Who reported the most bugs in ASF projects from July-September?
from apachejira 2018-07-01 2018-10-01
where action="create" issuetype="Bug"
group by actor
Beam JIRA Bot tops the list, with Sebb (presumably an actual person) in the #2 position.
Which projects have the most bugs reported from July-September?
from apachejira 2018-07-01 2018-10-01
where action="create" issuetype="Bug"
group by project
Ignite edges out Ambari for the top spot, with 401 bugs reported.
The next two questions explore some differences in project workflows.
How many distinct status values exist in the most active projects?
from apachejira 2018-01-01 today
group by project[10]
select count(), distinct(status)
Five of the top ten projects have 6 distinct statuses, and the other five have 5 distinct statuses. For example, Apache Beam has 5, and Apache Hive has 6.
How do the statuses used by Apache Beam and Apache Hive compare to one another?
from apachejira 2018-01-01 today
where project in (Beam, Hive)
group by status
select project='Beam',project='Hive'
Hive uses the Patch Available state; Beam doesn’t. It turns out that about 11% of the Apache JIRA projects take advantage of this state.
Which projects had the most contributors changing issue status to Patch Available in 2018?
from apachejira 2018-01-01 2019-01-01
where fieldschangedtok='status'
status='Patch Available'
group by project[10]
select distinct(actor)
Hadoop ecosystem projects (Hive, HDFS, Hadoop Common, YARN, HBase, and Hadoop Distributed Data Store) claim six of the top 10 spots.
Who contributed to (set status to Patch Available in) Apache Hive in 2018?
from apachejira 2018-01-01 2019-01-01
where fieldschangedtok='status'
status='Patch Available' project = 'Hive'
group by actor[10]
select count(), distinct(issuekey)
The top 10 contributors contributed to 578 issues in 2018.
How long does it take to get a patch accepted in the 10 most active projects?
from apachejira 2018-01-01 2019-01-01
where prevstatus="Patch Available"
status="Resolved"
fieldschangedtok="status"
group by project[10]
select count(), timeinstate\3600/count()
/* hours in state */
Hadoop Distributed Data Store is the fastest, with an average of 102 hours between the Patch Available and Resolved states.
The average for Kafka is really high, but it turns out that about 28 outliers with resolutions of Not A Problem, Auto Closed, Duplicate, Won’t Fix, and Won’t Do contributed to the high average.
from apachejira 2018-01-01 2019-01-01
where prevstatus="Patch Available"
status="Resolved"
fieldschangedtok="status" project = Kafka
group by resolution
select count(), timeinstate\3600/count()
/* hours in state */
That might be a bad thing or an okay thing for the community. Either way, digging into numbers like these can raise interesting questions.
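The hours-in-state metric in the last two queries, and the way a few outliers can inflate it, can be sketched in Python. This is an illustration under stated assumptions: that timeinstate is recorded in seconds, that `\` divides each document's value (as an integer) before summing, and that `/` divides the aggregated sum by the count. The function name and the sample durations are made up.

```python
def hours_in_state(seconds_in_state):
    """Mirror timeinstate\\3600/count(): whole hours per document, then the mean."""
    if not seconds_in_state:
        return 0.0
    return sum(t // 3600 for t in seconds_in_state) / len(seconds_in_state)

# Hypothetical data: ten patches resolved after two days, one resolved after a year.
typical = [48 * 3600] * 10
outlier = [365 * 24 * 3600]

print(hours_in_state(typical))            # 48.0
print(hours_in_state(typical + outlier))  # 840.0
```

A single stale issue moves the average from 48 hours to 840 hours in this example, which is why grouping by resolution, as in the Kafka query, is a useful follow-up.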
This is just a small sample of the questions we could explore in this dataset.
Creating and analyzing your own Jira datasets
We’ve made the Jira Actions Imhotep Builder available as open source. We hope you will use it to build your own Jira-based Imhotep datasets. This builder is the first one we’ve published, and we’ve also listed it in a new Imhotep Builder Directory.
If you have an idea for a new builder, or need help getting started with Imhotep, open an issue in the GitHub repository or reach out on Twitter.
In the next post in this series, I describe Hindsight, an internal tool we use to make internal contributor work visible and drive coaching insights.
Read the full series of blog posts:
- Imhotep: Scalable, Efficient, and Fast
- Using Metrics to Improve the Development Process (and Coach People)
- Metrics-Driven Process Improvement: A Case Study
- What’s Up, ASF? Using Imhotep to Understand Project Activity
- The Benefits of Hindsight: A Metrics-Driven Approach to Coaching
Cross-posted on Medium.