Interactive Analytics with Imhotep
This talk was held on Wednesday, November 05, 2014 at 7:00pm
We are excited to announce the open source availability of Imhotep, the interactive data analytics platform that powers data-driven decision making at Indeed.
In a previous talk, we explained how we developed Imhotep, a distributed system for building decision trees for machine learning. We went on to describe how we build large scale interactive analytics tools using the same platform. Next we showed how our engineering and product organizations use Imhotep to focus on key metrics at scale. During this session, Product Manager Tom Bergman provided examples of valuable insights that can be gained by using Imhotep. After the presentation, attendees explored their own data in Imhotep. Product engineers were on hand to answer questions.
[CROWD CLAPPING] TOM BERGMAN: Today's talk is on large scale interactive analytics with Imhotep. I'm Tom Bergman. I'm a product manager here at Indeed, and I worked on many of the tools that you'll see here in this talk. And with me is Zak Cocos, who is the manager of our marketing sciences team. And together we help people get jobs.
Indeed is the largest job search site in the world. Here's an example of our job search product. People can come, type in the type of job they're interested in and the location, and then we show them some results looking like this.
So today's talk is about Imhotep, so I figured I'd start off telling you: what is Imhotep? Imhotep is a highly scalable analytics architecture for querying faceted datasets. And we're very happy to announce that coming very soon, it'll be our OPEN SOURCE highly scalable analytics architecture for querying faceted datasets.
So as Keith had mentioned, in previous talks we've talked about data: how we get all the user data from our remote servers back to a central cluster, where we can use it for analysis. We've talked about the system, Imhotep, which we use to build our decision trees, and some of the engineering aspects behind how we use it for analytics.
We talked about how people can take the output of these tools, and use them to make better decisions about products. And today, we're going to talk about the tools that connect all of this. How the data gets to the people to actually make things better.
So before I get to that, let me go over a brief history of how we've done analytics at Indeed. First off, our general philosophy for everything we do here is, what's best for the job seeker? In order to figure that out, we need to test and measure everything. So we need this data to be able to make good decisions and we need those decisions to help people get jobs.
So what type of data do we use? We need good input to have good output. The data that we use from our job search product, for example, looks like this. We have a lot of information on this page. A query: here it's Indeed software engineer. We also have the location the job is in, in this case, Austin. And then we have a bunch of jobs that we show on the screen. We call each one of these an impression, and there's a lot of information in each of these.
So for every impression we show on Indeed, we log it. And we store the log information in something that looks like this: a log entry. So here we have all the information about this impression. What is the title? The position? Was it clicked or not? What country was it in? And a bunch of other things.
So we take all of these logs and we store them in logrepo. And when we first started doing analytics at Indeed, we would actually do analytics on these raw logs themselves. So we'd use tools like netcat and egrep to identify which logs matched our query, and then we could add up the metrics around them.
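To make that old workflow concrete, here is a minimal Python sketch of the idea: scan raw logs, keep the entries that match a filter, and add up a metric. The path and the key=value log format are invented for illustration; the real process was shell tools run against logrepo.

```python
import gzip
from glob import glob

# Sketch of the old workflow: scan raw log files, keep entries that
# match a filter, and add up a metric. Paths and the key=value log
# format here are made up for illustration.
total_clicks = 0
for path in glob("/var/logs/jobsearch/*.log.gz"):  # hypothetical path
    with gzip.open(path, "rt") as f:
        for line in f:
            entry = dict(kv.split("=", 1) for kv in line.strip().split("\t"))
            if entry.get("country") == "AU":  # the egrep-style filter
                total_clicks += int(entry.get("clicked", 0))  # the aggregation
print(total_clicks)
```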
Now this let us answer a lot of complicated questions, but it took a lot of effort for every single request that we made. And there was a lot of lead time.
Eventually we moved from doing this on the command line to doing this through Java. But it was still really slow and expensive. So the next big improvement we made in analytics at Indeed was a program called Ramses. This was created by our CTO, Andrew Hudson. And it was named for his love of Egyptian culture, and for the fantastic amount of RAM that it used when we first made it in 2010.
So Indeed's a search company, and search is one of our core strengths. So of course, we approached analytics like a search problem. Ramses was, at its heart, a search engine for logs. We'd build an inverted index out of all our logs, we'd search through them, we'd extract metrics from the matches, and then we'd graph those aggregated metrics.
The way we'd get information out of it was we'd put in a query and a metric we wanted, and then it would output aggregated metrics by bucket. So for example, let's say we wanted to know how many organic clicks we have in Australia. We'd put in the query country Australia, ask for the metric organic clicks, and it would return a result something like this.
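Ramses itself was an inverted index over logs, not a DataFrame, but the query, metric, and time-bucket behavior can be sketched in a few lines of pandas. All of the fields and values below are made up.

```python
import pandas as pd

# Toy stand-in for indexed log entries (fields and values are made up).
logs = pd.DataFrame({
    "time": pd.to_datetime(["2014-01-01", "2014-01-01", "2014-01-02"]),
    "country": ["AU", "US", "AU"],
    "organic_clicks": [3, 5, 4],
})

# Ramses-style request: filter by the query (country Australia), then
# aggregate the requested metric into time buckets (one per day here).
matches = logs[logs["country"] == "AU"]
buckets = matches.set_index("time")["organic_clicks"].resample("D").sum()
print(buckets)
```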
Let's say we wanted to know which test group, A or B, has more revenue. Well, we log every test group on every page we show, so we'd put in the query test group A or test group B, and the metric revenue. It would find all the logs that have each test group, and the revenue for them, and it would graph it like so.
We can also answer questions like, how has traffic from Yahoo changed over time in Great Britain, Germany, and Japan? So likewise, we'd put in the query from Yahoo and country Great Britain, Germany, or Japan, with the metric visits. And it can output a graph that looks like this.
So we used Ramses extensively, and actually exclusively, for two years to manage all our tests and our monitoring. And it was really good at doing a lot of things. But there were some things it just wasn't designed to do. For example, Ramses couldn't answer questions like, how many unique queries do we have in the US? Or what are the top 50 queries in the US? Or how many clicks did each of those queries receive? So in order to answer these questions, we built a new tool called Imhotep. And he is the guy on the far right there.
So the origins of Imhotep, it began as a distributed iteration and group-by engine for building our prediction models. If you saw Andrew's talk a little while ago, we built our prediction models through a decision tree method.
So we have a decision tree builder, and it iterates over each of the nodes, level by level, building it breadth first like this. So we'll go through the first level, split into two groups, go to the second level, and so forth.
So we found that this was really, really useful for building decision trees. But we also realized that we could then leverage this ability to do these massive group bys and aggregates, to make a very powerful real time analytics engine.
So in addition to the very simple queries I mentioned earlier, it can also answer much more complicated queries. For example, how many Android app users with accounts older than 30 days saved at least one job in the past week? Or what titles had the highest clickthrough rate for the query architecture in the US? Or what about the lowest clickthrough rate? Or for job seekers who click on Google jobs in Ireland, what other companies' jobs do they click on?
So we could write a program that could answer these questions for us, but what's really, really powerful about Imhotep is that we can answer all of these questions trivially with a few clicks in a web app. And we don't have to do any sort of expensive ETL, or anything like that. So here to talk to you about exactly how we do this is Zak Cocos. ZAK COCOS: Is this working? Cool. Hey. I'm Zak Cocos, I manage the marketing sciences team here at Indeed. And I also help people get jobs. So marketing sciences is a centralized research, analysis, and automation team. And we support marketing initiatives.
In order to do this, we use data pretty much exclusively, and Imhotep extensively. So I'd like to reiterate what Tom said and mention that we are open sourcing Imhotep, and I'm super excited about this. And I'm excited about this for three main reasons, and three main use cases that we have for Imhotep.
The first is for ad hoc exploration. So if you don't know anything about your dataset, and you just want to take a look through it and kind of explore, Imhotep is great for that.
The next is for specific analysis. So if you do know things about your dataset, and you have a specific question that you'd like to ask, Imhotep allows you to do that as well. And it allows you to do it in a very, very fast manner.
Finally, it's got an extensible infrastructure. So we're able to build tools on top of Imhotep to answer these questions, and to automate these tools as well. So I'm going to talk a little bit about ad hoc exploration. And before I do so, I'm going to go ahead and upload a dataset to Imhotep live right now, so that we can explore it shortly.
So we see that in this directory, we have crunchbase.tsv. This is a file that I downloaded from Crunchbase's public website. I'll show you real quick. It's just tabular data. We have company permalink as a field; this is going to tell us where on Crunchbase the company actually lives. We have the company name, the category code, so just tabular data.
I'm going to go ahead and upload this to Hadoop. So just a copy from local; I'm going to put it into a special directory inside Hadoop that Imhotep knows to index.
So, I'll explain a little bit more about this data. As I mentioned, it's public data that we downloaded from Crunchbase's website. And here's an example of some rows.
Now, each row in this dataset is an investment that's occurred for a company. So we see at the top there, HomeAway, in the category of travel, raised a Series C+ round of funding for $250 million. HomeAway again, with a $160 million Series B round of funding, and we see some other Austin companies there as well.
Now, I should mention that the dataset is not just Austin, it's got 48,000 investment events, and we're going to look at some of the Austin companies though, just so it hits a little closer to home.
So henceforth, I'm going to refer to these rows as documents. We took this terminology from search documents. The columns in this dataset I'm just going to refer to as fields; those are the text categories that we have.
Finally, numeric valued fields I'm going to refer to as metrics. Imhotep treats metrics specially, which is why I should call this out. So we're going to look at this data inside of what's called Imhotep Data Explorer. You can think of Imhotep Data Explorer as an interactive tool for exploring Imhotep data. Or just a badass hyperlinked pivot table.
The hyperlinks, I should mention, are on the fields and on the values. When you click on a field, it's going to do an interactive group-by, or a pivot, on that field. And when you click on a value, it's going to do a filter. So enough talk. Let's go ahead and dive into Imhotep.
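For readers who think in pandas terms, here is a rough sketch of what those two kinds of hyperlinks do; the miniature dataset below is made up.

```python
import pandas as pd

# A miniature stand-in for the Crunchbase investments dataset
# (rows are made up).
df = pd.DataFrame({
    "company_name": ["HomeAway", "HomeAway", "Adometry", "Illumitex"],
    "company_city": ["Austin", "Austin", "Austin", "Austin"],
    "funded_year": [2010, 2008, 2013, 2012],
    "raised_amount_usd": [250e6, 160e6, 8e6, 10e6],
})

# Clicking a field link does an interactive group-by (a pivot):
print(df.groupby("funded_year").size())

# Clicking a value link applies a filter:
print(df[df["company_city"] == "Austin"])
```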
So as I mentioned, this is Crunchbase data. Each document is an investment that's occurred for a company. And you see now on the left that I have funded year selected. So I'm going to go ahead and toggle off the fields so we can look at these years.
So we see the year 2013, and then over on the right, we see count. This means that there are 9,737 documents whose funded year was 2013. Or in other words, in 2013 there were 9,737 investments in this dataset. We see 2012, 2011, and other years as we scroll down.
We can also pivot very easily in Imhotep Data Explorer. So we can look at, for instance, company category code. And we see that the top category by rounds of investment is software; we then see biotech, then mobile with just over 3,000 rounds of investment. So very easily, we can use Imhotep Data Explorer to pivot on our data.
I'm going to do one more pivot, on company city now, and we can look at the top cities that have raised rounds of investment in the dataset. So as you might expect, we see San Francisco on top, then New York, London, Seattle, a couple more Bay Area cities, and then down there at number eight, with 772 investment rounds, we see Austin.
So as I mentioned before, we can click on the values in this table in order to filter on these values. So I'm going to go ahead and click on Austin and you see the interface change a little bit.
Now up in the header we have company city of Austin. And what this has done is it's filtered our documents from the original 48,000 down to just the 772 investments that are in Austin. So we can do something pretty cool here, which is still pivot on this data. We'll go back to funded year, which we were looking at initially, and look at just the funded years for companies that received investments in Austin. We see 2013 is still on top, 2012 after that, and we see that 2010 and 2011 have in fact switched for Austin, which is interesting to me at least. So let's see what other fields we can pivot on. Company category code was another one that we did before, so let's look at that. Software was the top when we looked at the dataset in its entirety, and it remains on top for Austin. But we see enterprise is now in second, and then biotech, mobile, cleantech, and so on.
Now let's pivot on one more field, which is the company name. So let's look at those companies in Austin that are receiving rounds of investment. Adometry tops the list here with nine distinct rounds of investment. We see Uplogix with eight, then Illumitex, KLD Energy Technologies, and a few more.
So this has been cool. We've filtered, we've pivoted. But I mentioned that Imhotep treats metrics specially. So we can do something really impressive here, which is called add metric. What add metric does is it allows us to look at any other metrics that are in our dataset and add those to the table below. So I'm going to click on raised amount USD and add that as a metric.
Now, next to the number of rounds of funding, we can see the total amount of funding that was raised. So for Adometry, over its nine rounds of funding, it's raised $44.6 million; Uplogix with $45 million; Illumitex; and so on. Now I'm going to go ahead and just click on raised amount, which will sort the data based on who's raised the most. We see now HomeAway is at the top, raising just over half a billion in its five rounds of investment. RetailMeNot comes next, then HelioVolt, SolarWinds, definitely some popular Austin companies here. But Imhotep Data Explorer can go even further than that.
We can actually apply math to the metrics that we add. So I'll go back to the options, click add metric, and select raised amount USD. But I'm going to divide that by count. What this is going to do is give us the average amount that was raised per investment round for these companies. I'm going to actually label this metric average raised, because raised amount USD divided by count is sort of verbose. So I'll add that metric, and we see now that HomeAway, over its five rounds of investment, raised just over half a billion, which was on average $100 million per round.
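In pandas terms, "add metric" with math is just a computed column over grouped data. A small sketch, with made-up numbers that only loosely echo the talk:

```python
import pandas as pd

# Tiny made-up stand-in for the investments dataset.
df = pd.DataFrame({
    "company_name": ["HomeAway"] * 5 + ["RetailMeNot"] * 4,
    "raised_amount_usd": [50e6, 100e6, 160e6, 250e6, 4e6,
                          30e6, 60e6, 90e6, 60e6],
})

by_company = df.groupby("company_name").agg(
    count=("raised_amount_usd", "size"),
    raised_amount_usd=("raised_amount_usd", "sum"),
)
# The labeled, computed metric: raised_amount_usd / count.
by_company["average_raised"] = (
    by_company["raised_amount_usd"] / by_company["count"]
)
print(by_company.sort_values("average_raised", ascending=False))
```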
RetailMeNot, on average, $59.9 million, and we see HelioVolt there with $42.1 million. And we don't quite round in Imhotep Data Explorer, but that'll be fixed by the time we open source it.
So I'm going to go ahead and sort by average raised, so we can see which companies have, on average, raised the most money. HomeAway is still on top here, but we see that SolarWinds has actually switched positions, and over its three rounds of funding it raised $217 million, which averages to $72.5 million per round.
So this was really interesting and we can very easily add metrics to our dataset. But we can actually retain these metrics as we pivot to other fields. So I'm going to go ahead and toggle the fields back on and pivot on company category code.
So before, we saw that software was up at the top when we sorted based on count. Now we're sorting based on average raised, and we see that travel, on average, raises $84 million per round. As you can imagine, HomeAway might skew that just a little bit.
Cleantech is next; we see analytics on there, and network_hosting. Let's go ahead and sort just based on count again. Right, this was software, enterprise, biotech, mobile. And then finally, we can sort based on raised amount USD. And we see that software is at the top, in terms of having companies that raised, in total, $1.1 billion over this dataset. We then see cleantech, biotech, enterprise, which I personally didn't expect from this dataset.
So this was cool. I mean, we knew nothing about this dataset when I uploaded it, and now we can explore the data and learn a lot more about it interactively through Imhotep Data Explorer. But this dataset is just 48,000 documents. That's pretty small in terms of Imhotep, which is, at its core, a large scale interactive analytics tool. So I'd like to talk a little bit more about that.
The total size of the data that we have Imhotep living on top of is 125 terabytes. The largest index that we have is our job search index, which is 30 terabytes. And this is over 48 billion documents. Now that's a million times the size of the dataset that we were just dealing with. And in Imhotep, you can interact with this data very simply, just like we did before.
Now, we looked into some commercial data warehousing solutions that could handle this type of data, and they ranged up to the tune of $20 million. So I'd just like to take this moment to reiterate that we are open sourcing Imhotep. So let's go back to the job search that Tom showed you before.
So this is a job search on our site for Indeed software engineer in Austin. And as you can imagine, there are a lot of things that we would want to log here, and a lot of things that we would want to look at inside of Imhotep.
The first might be the query. So what is the actual job search that somebody did on our site? The next is the location. So where were the job seekers searching? And finally, the impressions. So this is an organic impression; it's the first organic impression on the page.
Now, an organic impression is just a job that was displayed as the result of a search. Let's take a closer look at that impression. We see that it's for Front End Software Engineer for a new product at Indeed in Austin. And for these impressions, we actually have an index that I'll show you shortly.
Some things that we might want to look at in this index are, firstly, the title. So what were the job titles of the impressions that our job seeker saw? The company information. So what companies showed up? In this case Indeed, with its 63 reviews and 4 1/2 stars. The description. So what was the snippet of information that we showed the job seeker? And finally, the job age. So how old were the jobs that we showed our job seekers?
Now, we actually want to log a lot more than just this. This is just kind of a snippet of the breadth of things that we log on a single organic impression. But I'll simplify that a little bit, and just show you a basic organic impression document.
So here, for that job that we just saw, we see the title is Front End Software Engineer. The position of one: that was the first ranked organic job that we had. Whether it was clicked or not: we didn't click on that job, so that'll be a zero. The country that we showed this impression in, so the United States. The query that was searched, which led to that impression, so Indeed software engineer. The location of Austin, and the timestamp.
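Written out as a record, that simplified document looks roughly like this; the field names and the timestamp value are illustrative rather than the exact production schema.

```python
# The simplified organic impression document, written out as a record.
# Field names and the timestamp value are illustrative, not the exact
# production schema.
impression = {
    "title": "Front End Software Engineer",
    "position": 1,        # first ranked organic job on the page
    "clicked": 0,         # this job was not clicked
    "country": "US",
    "query": "indeed software engineer",
    "location": "Austin, TX",
    "timestamp": "2013-12-05T19:00:00Z",  # made-up value
}
```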
Now, looking at just one impression might not be something that we want to do on a regular basis. We actually want to aggregate this information over tons and tons of impressions. So we store this into what's called the Organic Impressions Index. Now I'd like to show you the materialized view of this index in Imhotep Data Explorer.
So here we have the organic index selected, from the 5th of December, 2013, to the 10th of December, 2013. And Indeed is an international company, with over 50 websites and 26 different languages. So we've selected the country of ie, so that we can look at an interesting case study for one particular country.
So we see, and I'll toggle options real quick, that for the country of ie, it has a count of just over 3 million. What this means is that we showed just about 3 million impressions to job seekers doing searches in Ireland. Let's pivot on some more fields and see what else we can find out.
So first, something that's interesting is job language. What were the languages of the jobs that we showed in Ireland? We can see that there were 14 distinct languages for the jobs that we showed. That really tells me that Ireland is an international country; there are a lot of different job languages, a lot of different people searching in Ireland. Let's pivot on something else. How about job age?
So what were the ages of the jobs that we showed? We see that topping the list are jobs that were just one day old: of the 3 million impressions, we showed 280,000 impressions for jobs that were posted on our site within one day. We then see two days, three days, four days, and so on.
We can also look at page. This is going to tell us what page people were on when they saw these impressions. So page one, for instance, had 1.3 million impressions of the 3 million. What this means is that over half of our impressions were shown to job seekers who paginated, who went on to pages two, three, four, et cetera.
Finally, I'd like to do one more pivot, on clicked. Now clicked, as I mentioned, is going to tell us whether or not a job seeker clicked on an impression. So we see here that 106,000 times, of the 3 million impressions that we showed, job seekers actually clicked on the jobs. This is a great way of measuring job seeker engagement: if somebody is interested in a job, if they want to learn more, they're likely going to click on it. So I'm going to go ahead and filter on clicked, and just look at those impressions which received clicks. Which, again, is 106,000.
We can now pivot on any of the fields that we have in this index, and see, for those impressions that were clicked, what the values for those fields are. Let's go ahead and pivot on company first, and look at which companies received the most clicks. And we have a lot of fields in this index.
So, I'll go ahead. And we see CPL Jobs; that's a huge recruitment agency in Ireland. We see Tesco, which is a large grocery retailer in Ireland. And down there at number seven, with 717 clicks, is a company that we should all be familiar with: Google. So let's dive in and explore some of the people who clicked on Google.
So I filtered by Google. Now, from the 3 million impressions that we have in Ireland, we took the 106,000 of those that got clicks, and now 717 of them is what we're currently looking at, for the company of Google. Let's look at the jobs that Google posted, and see if we can find anything interesting. So I'll scroll down and click on title.
So we see that at the top is administrative assistant in sales, with 99 clicks. Now this makes sense, because Google has a massive sales operation in Dublin. We also see university program specialist, business intelligence analyst, and down there at number seven, software engineer, PhD university graduate, with 19 clicks. Now, I know there are some software engineers in the audience, so let's go ahead and filter on software engineer and check out this job. From the initial set of documents that we had, we're now looking at just 19: just those 19 documents that received clicks for this software engineer, PhD university graduate title that Google posted.
So we can do something super cool here, which is perform yet another pivot. But we're actually going to pivot on query now. What this is going to do is tell us, for the clicks that were made on this job, what the initial searches were that led to those clicks.
So we see machine learning, software developer, Java developer, graduate Java. I just want to take a second and pause, and talk about how cool this is. I mean, we took 3 million impressions, we went down to just 19 of those that got clicks, and then we were able to do just one pivot, with one click, and say, what were the queries that led to those impressions?
Now, Imhotep Data Explorer just made this super, super easy. But I want to pivot on just one more field here, which is called CTK, or cookie tracking. This is a unique, anonymous cookie that we give to our job seekers when they do a job search on our site. And it allows us to track them over time and see what else they've done.
So I'm going to go ahead and open up fields again, and scroll down to CTK. Get there eventually. And look at the distinct cookies, or the distinct job seekers, who performed these clicks. And we see that there are 19 distinct job seekers who were clicking on these jobs. Now that's interesting, but Imhotep Data Explorer impresses us yet again, and gives us a functionality called filter all.
So I'm going to go ahead and click this button. What it's going to do is build what's called the CTK query. This basically just took those 19 individual CTKs and applied them as a filter to this dataset. So we now have the CTK query, for country ie, clicked yes, company Google, and title of software engineer, PhD university graduate.
So now we can do something kind of cool. We can actually remove the other filters that we've applied, and just get those job seekers that we're interested in. So I'm going to click x on Google. And I'm actually going to click this little minus button next to the title. What that's going to do is negate the title for me.
So now I'm looking at just impressions that were clicked on by these job seekers in Ireland. But where the click was not that particular software engineer title. So this is going to tell us every other click that these job seekers did.
So let's go ahead and first pivot on query. What queries did these job seekers do that led to their other clicks? We can see graduate, which we saw before, received 25 clicks; then .NET developer, Java developer, embedded SQL. So that's some pretty interesting information. Let's pivot on some other fields.
We can move down to title, and see what other titles these job seekers clicked on, besides that Google software engineer one. We see that there were hundreds of distinct other titles that they clicked on, with the top ones getting four clicks from this group of people. We see Software Engineer, Graduate, Dublin. We see software developer for a trading team. We see EY, which stands for Ernst and Young. Software developer, Java developer, so lots of developer jobs, as one might expect.
Finally, I'm going to pivot back on company. What this is going to tell us is what other companies these job seekers were interested in. At the top we see companies with six clicks each, including IBM. We see Deloitte, Full Tilt Poker, and we see Google down there as well, with three more clicks. So what that means is that these job seekers actually found another Google job and clicked on it three times.
So, again, I want to pause here and just talk about this. We have these 19 distinct job seekers that we were able to filter on. And then we were able to basically build a click co-occurrence model, right in front of you, just by pointing and clicking. Normally, these types of things are reserved for data scientists, or people with math backgrounds who want to program this type of stuff. But we could do this in a matter of, like, five clicks. So Imhotep Data Explorer, needless to say, is a super powerful tool, which is why we're super excited to open source it and to talk about it today.
But there are some things that we aren't able to do with Imhotep Data Explorer. For instance, it can't combine results from multiple datasets. We used the Crunchbase data initially; we used the organic impressions data next. But if we wanted to, say, look at the companies receiving the most clicks who raised a particular amount of money, by joining these two datasets, we couldn't do that.
It also doesn't allow us to easily automate things. It is a front end web app, so it's hard to automate things just based on that, and it doesn't really give us hooks into the data itself. Because of this, we created the Imhotep Query Language, or IQL.
IQL does allow us to do these things. We can combine results from multiple datasets and it allows for the easy automation of tools, which my team has done regularly. So let's talk a little bit more about IQL.
There are three requirements for an IQL query. An index: where you want to get the data from. A date range: when you want to get the data from. And metrics; as I mentioned before, metrics are just numeric valued fields, so what type of numbers do you want to select from this data?
We also have two optional parts. Filters: do you want to filter your data? For instance, we filtered before on country of Ireland and clicked being yes. And group by: what groups do you want to see in this data?
And before, when we were looking through Imhotep Data Explorer, we were basically just doing interactive group bys when we selected those fields on the left. So for instance, when we clicked on company, it did a group by for us and gave us all the groups of companies that were there.
So this is what an IQL query looks like. First we select count, so select the metrics that you'd like; count is a special metric which counts the number of documents in the dataset. We then have the index, so from organic: where are we going to get this data from? Organic is the organic impressions index. We then have a date range, December 5th, 2013 through December 10th, 2013. We've got filters, where country is Ireland and clicked equals 1. And finally, the groups: grouping by company ID.
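Putting those clauses together, the query might read roughly like this. The syntax below is approximated from the description above, not authoritative IQL grammar.

```python
# The query described above, written out in approximate IQL (the clause
# syntax is an approximation, not authoritative IQL grammar):
query = """
from organic "2013-12-05" "2013-12-10"
where country = "ie" and clicked = 1
group by companyId
select count()
"""
```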
Now this is basically just going to give us the first grouping-- the first pivot that we did in the organic dataset, where we looked at the companies who received clicks in Ireland. So with IQL, we can now answer some really cool questions. Questions like, do companies that have raised more than $10 million in Austin get more clicks on average than those that have raised less than $10 million?
So to answer this question, we have three steps. First, go to the organic index and select companies in the US that received organic clicks. We can then go to the Crunchbase index and select companies, and the amount of funding, for those companies that received investments in Austin. And finally, we can just join these two up, segment based on that $10 million number, and do the math.
So let's go ahead and do this. We're actually going to do this using what's called ish. ish is an interactive IQL interpreter that we built on top of Python. It pulls in the pandas library, which is a great numerical and data analysis library, so the syntax is going to be mostly pandas.
So the first thing we want to do, as I mentioned, is say clicks is select count from organic. We're going to do this over the past seven days, so from seven days ago to today, where country is US and clicked is 1. And then we're going to go ahead and group by company ID.
So I'll run that. And we see it very quickly spits out an answer. We're assigning this to the variable of clicks and we see that right now it's sorted on company ID. But for every company ID, we have a count, and this is the number of clicks that company has received.
We can then, in another variable called company funding, say select raised amount USD, so the amount that was raised in US dollars, from crunchbase, over the date range that we have this dataset, where company city is Austin. And we're going to group by company ID, so that we can do the join with the other dataset, and company name, so that we can associate a name with the data. So again, we see it's very snappy, and we have company ID, company name, and raised amount USD.
Finally, we can do a join. So I'm going to assign to a variable called joined the join of clicks dot company ID on company funding dot company ID. And now we have a data frame with all the data that we need, so we can do the segmenting that I mentioned.
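Here is a runnable pandas sketch of that ish session. The iql() helper is a hypothetical stand-in: the real shell sends the query to Imhotep, while this version returns tiny made-up results so the example executes, and the IQL text is likewise approximated.

```python
import pandas as pd

def iql(query: str) -> pd.DataFrame:
    """Hypothetical stand-in for ish: the real shell runs the IQL query
    against Imhotep; this version returns tiny made-up results so the
    sketch actually executes."""
    if "from organic" in query:
        return pd.DataFrame({"companyId": [1, 2, 3],
                             "count": [2200, 1200, 1000]})
    return pd.DataFrame({"companyId": [1, 3],
                         "company_name": ["Favor", "Bazaarvoice"],
                         "raised_amount_usd": [9.4e5, 1.3e8]})

# Clicks per company over the past seven days (IQL text approximated):
clicks = iql('from organic 7d today where country="us" clicked=1 '
             'group by companyId select count()')

# Funding per Austin company from the Crunchbase index:
company_funding = iql('from crunchbase where company_city="Austin" '
                      'group by companyId, company_name '
                      'select raised_amount_usd')

# The join on company ID, as in the ish session:
joined = clicks.merge(company_funding, on="companyId")
print(joined)
```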
So first, I'm going to create a variable called less than 10 mill. I'll show you the syntax: it's going to be joined, where raised amount USD is less than $10 million. And then I'm going to sort this based on clicks and look at the top 10 values.
So we see that Favor, who raised just under $1 million, has had 2,200 clicks over the past seven days. Indeed, we raised $5 million, and we had 1,200 clicks, and so on. So let's go ahead and do this for the companies that raised more than $10 million.
So we see it was very easy to segment that data using pandas. And then I'll sort it based on clicks, and we see that at the top, Bazaarvoice received 1,000 clicks, and they raised about $130 million in funding. We see MapMyFitness, HomeAway, SolarWinds, and so on.
So lastly, we just have to do a bit of math on this. So basically, just find the average number of clicks in each of these datasets, and see if there's a difference.
So again, pandas makes this very easy. We can simply say less than 10 mill, look at the count field, and then do a describe. This is going to give us summary statistics: the mean, median, standard deviation, and so on. We see the mean here is 161 clicks, and the median is 41 clicks. We can do the same thing for more than 10 mill, and we see the mean here is 188 clicks, and the median is 64 clicks.
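The segmentation step looks roughly like this in pandas; the joined frame below is a self-contained toy whose numbers only loosely echo the talk.

```python
import pandas as pd

# Self-contained toy version of the joined clicks-and-funding frame;
# the numbers loosely echo the talk and are otherwise made up.
joined = pd.DataFrame({
    "company_name": ["Favor", "Indeed", "Bazaarvoice", "HomeAway"],
    "count": [2200, 1200, 1000, 800],
    "raised_amount_usd": [9.4e5, 5e6, 1.3e8, 5.05e8],
})

# Segment on the $10 million mark and compare the click distributions.
less_than_10_mill = joined[joined["raised_amount_usd"] < 10e6]
more_than_10_mill = joined[joined["raised_amount_usd"] >= 10e6]
print(less_than_10_mill["count"].describe())  # count, mean, std, quartiles
print(more_than_10_mill["count"].describe())
```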
So IQL allows us to very, very simply combine our datasets. And you can imagine, it's very easy to automate tools off of this. Now, I should mention, this is clearly correlation and not causation. But it's still some interesting analysis that we can do.
Imhotep has been wonderful in the marketing department, and we love using it. But it actually has many more users than just us. So Tom is going to come back up here and talk to you about the different people who are using Imhotep, and he's going to walk you through a real world example that we solved using Imhotep.
[CROWD CLAPPING]
TOM BERGMAN: So I'm Tom Bergman again, and I still help people get jobs. Like Zak said, I want to talk about a real world example of how we used Imhotep to improve the product here. Before we get to that, I just want to go back and iterate through what we were talking about.
We call Imhotep our large scale interactive analytics platform. So, when we say large scale, we're talking about the amount of information in it. We have, at Indeed, 123 unique indexes. These are all different datasets that we have put in there.
We have the largest index of 30 terabytes. And the total size of the indexes all together is about 125 terabytes. And we store this duplicated for redundancy, so the total footprint is about double that. When we say interactive, we're talking about speed. So when you ask a question, you get an answer back very quickly.
IQL, which Zak was showing off, is largely programmatic access here at Indeed. We get about 76,000 queries per day, and the average time to execute those is 0.67 seconds. So very, very fast, and very big volume.
Ramses, which I showed earlier, is still around. It's largely human driven, although the back end is now powered by Imhotep. For Ramses' usage, we get about 3,400 queries a day, and the average time to execute those is about 4.4 seconds, so still very, very fast for the amount of data we're talking about. The other thing about interactive is that we have a lot of users. At Indeed, we had 198 users in the past month who used it. We did 25,622 unique queries done by humans, and that was an average of 53 queries per user per day. So it's been able to let all of us get at the data very quickly. And because we can ask questions so fast, we can iterate through it and do different variations, ask a bunch of questions.
We say it's an analytics platform, so it's not just a tool. We call it a platform because we can build a lot of tools off it, like some of the things that Zak showed. We have 40 internal clients that use it. We have six analytics web apps, like Imhotep Data Explorer or Ramses. We have five dashboards that pull data from it and display it all day long.
We have 10 programming or scripting shells like ish; we have an R one, we have a Python one. We have six monitoring apps that check the performance of our production software using it, and more.
What's really, really powerful about this for us as an analytics platform is that we get one toolset for all the data. So we can put in data about our website usage, operational monitoring like from Nagios, financial reporting, Google Analytics, internal web app usage, and external reports, and we can put this all together in one place and use it to solve some real problems.
So going into that, our job at Indeed is to provide the best results. And that means we're going to show jobs to users that are most interesting to them. Like Zak said earlier, clicks are a very good indicator of interest. More clicks, generally more relevant; fewer clicks, less relevant. For one user, they'll click if they like it. And across a bunch of users, we can judge how valuable a job is as a whole.
So one particularly hard query we've had to deal with is architecture. One of the reasons it's very hard to deal with is that most of the architecture terminology has been co-opted by technology. There are a lot of words that are common to both, which makes it very hard to figure out which is which: blueprint, design, infrastructure, modeling. Architects even have to do code reviews; they're a little bit different, but still. And here are some of the different titles we'd see. An architect who works on buildings might have a title like architect, or CAD designer, or project manager. Whereas one who works in software might have a title like software architect, or UI designer, or project manager.
So it's been very hard for us to figure out how to get the best results to users when there's so much crossover between these. And one of the ways we've done that is using Imhotep. So I'm going to go back to that organic index that Zak showed. Now I have it filtered to queries for architecture in the US, over about a month. So this is maybe three or four times the data that Zak was looking at.
So we have here the titles that people who searched for architecture saw at Indeed, sorted by volume. The top one was project manager, followed by architect, architectural intern, et cetera.
So this tells us what jobs showed up, but it doesn't really tell us how users reacted to them. To get that, I'm going to go in and add another metric here. The metric I'm going to add is CTR, which we call clickthrough rate. It's going to be clicks, whether or not the user clicked on it, divided by count, which is the number of times we showed it to people. So clickthrough rate is going to be the average chance that someone clicked on it when we showed it to them. I'll hide that. And this is going to load up in a little bit, and we'll see, in that time period, what the chances were that someone clicked on the job with that title when it was shown to them.
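As a sketch, computing that CTR metric over impression-level data looks like this in pandas; the titles and clicks below are made up.

```python
import pandas as pd

# CTR as a computed metric: clicks divided by impressions, per title.
# One made-up row per impression:
impressions = pd.DataFrame({
    "title": ["Project Manager", "Project Manager",
              "Architect", "Architect"],
    "clicked": [1, 0, 1, 1],
})

by_title = impressions.groupby("title").agg(
    clicks=("clicked", "sum"),   # times the job was clicked
    count=("clicked", "size"),   # times the job was shown
)
by_title["ctr"] = by_title["clicks"] / by_title["count"]
print(by_title.sort_values("ctr", ascending=False))
```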
So you'll see project manager, with a CTR of 5.3%. So when people saw that job, there was a 5.3% chance they clicked on it. Likewise architect, and so on. So what we can do that's interesting is figure out what the most interesting titles were to people during this time period. We can do that simply by sorting by CTR.
So now we'll see, at the very top, the highest CTR title was architect with zero to three years experience, with a 17% CTR. So there was a 17% chance that people clicked on it, followed by designer I architecture, architecture intern, architectural drafter, a design professor, et cetera, et cetera.
So with this, we can tell these titles are all about making buildings. People obviously can tell by the title, and they click on it, so they're interested. Likewise, we can sort it the other way and see jobs that had the lowest CTR. We'll see WebSphere operation decision management, web developer, Java architect, Systems Engineer II. What you'll notice is very different about these is that they all have a CTR of 0. So there was not a single time that someone searching for architecture clicked on these jobs when they were presented to them.
Of course, these volumes are very low. You see, we have 28,000 titles during this period, so maybe there were other titles in the long tail that we can't see here in the results. But we can actually do a little something to address that.
We have a field called title words, which is a derived field. We take the title and we tokenize it, so we break it down into its component words, and then we index all of those words for each document. And then we can basically say, did a job have this word in the title?
So when this loads up, it's going to show us the top words that occurred in any of the titles for the query architecture in the US during this time period. We'll see the top word was architects, not surprising, followed by manager, project, engineer, intern, et cetera.
We can then sort these words by CTR and see what the correlation is between the words in the title and whether or not the job is clicked on. So we'll see again, at the top, architectural, drafter, CAD, followed by entry, intern, junior, and then architecture, summer, planning.
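The title words idea, tokenizing titles and then computing a per-word CTR, can be sketched like this; the titles and clicks below are made up.

```python
import pandas as pd

# Sketch of the title words idea: tokenize each title, then compute a
# per-word CTR. Titles and clicks here are made up.
impressions = pd.DataFrame({
    "title": ["Java Architect", "Architectural Drafter", "CAD Architect"],
    "clicked": [0, 1, 1],
})

# Tokenize: break each title into its component words.
impressions["title_words"] = impressions["title"].str.lower().str.split()
words = impressions.explode("title_words")

by_word = words.groupby("title_words").agg(
    clicks=("clicked", "sum"),
    count=("clicked", "size"),
)
by_word["ctr"] = by_word["clicks"] / by_word["count"]
print(by_word.sort_values("ctr", ascending=False))
```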
So these are mostly words relating to building architecture, but we also see a lot of entry level things. And this is largely a seasonality thing: around the beginning of the year, lots of people start looking for entry level jobs as they're getting ready to get out of school.
So let's sort the other way, so we'll see the lowest CTR words for people searching for architecture. And we'll see, very stark again: assurance, network, Java, software, Oracle, intelligence, programmer. So all very technology related words.
So now that we've figured this out, what do we do? This method of segmenting by a field, doing a group by, and then looking at how it changes the metrics is one of the ways that our CTR model works. So we actually will try to find the biggest splits and choose things by that.
The CTR model is going to affect ranking, not necessarily matching. So in order to deal with matching, we use something called query management. We use Imhotep to improve matching through query management. The way we do that is, first, we start looking at things manually, and identify words, jobs, et cetera that shouldn't show up in the matches.
Then we can actually have automatic programs, going through IQL and Imhotep, to determine which results should be added to or removed from queries, and do that automatically. As a result of this, we've added 26,790 rules across all the countries, based on some of these patterns that we originally found through humans in Imhotep and then eventually automated.
So that's one example of how we use this to make our search results better for our users. I thought it was pretty cool.
So at this point, I want to close, and just say: Imhotep is going to be open source. We're working very hard to get this done as quickly as possible. We're shooting for an August 1st, 2014 ETA, and I think Darren is upstairs right now working hard on getting this done. And Vlad is here in the audience, not working hard, for some reason.
So we have data online. Please follow along at the blog, which is engineering.indeed.com, and you can see all the updates there. We also have a mailing list where you can get all the updates sent to you directly: go.indeed.com/imhotep-announce.
[CROWD CLAPPING]