Indeed MPH: Fast and Compact Immutable Key-Value Stores

When you need to scale an application with a lot of data, how do you decide on a storage solution? How can you both safely store and efficiently interact with large data sets? This often turns into a debate about whether to use SQL or NoSQL. Each comes with strengths and drawbacks.

But what if there was a third option that sidestepped database concerns altogether?

Consumers might need updates no more often than every few minutes. In this case, being able to load a data set into memory dramatically speeds access and allows for tremendous scale. This is why, for many projects at Indeed, we distribute a full copy of the needed data to each consumer and render the SQL or NoSQL debate unnecessary. To make this possible, we manage data size using a new key-value store based on minimal perfect hash functions. We’ve implemented this store as a Java library called Indeed MPH.

The problems with distributing data

Usually, distributing a full copy of your data to consumers is impractical for two reasons. First, the data must be mostly read-only. Second, the data cannot be so large as to prevent distribution or processing.

You can overcome the first problem by implementing batch updates. Redistribute the data every day, every hour, or even more frequently, and you keep the data read-only while ensuring that it remains useful. However, how do you keep the size down? As it turns out, the same batch update strategy that addresses the first problem also addresses the second.

Minimal perfect hash functions

Generating read-only data sets means that we know all the keys for our data set in advance. This lets us apply optimizations to the data set to reduce its size. In the case of hash tables, this knowledge allows us to compute a perfect hash function.

A perfect hash function maps every element to a distinct integer with no collisions — in mathematical terms it’s an injective function. A minimal perfect hash (mph) function maps n keys to the n consecutive integers [0, n-1]. With such a structure we can guarantee a single disk seek while maintaining 100% load in the table, but as we’ll demonstrate later, there are even more advantages.
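As a toy illustration (in Python, not the Java library itself), we can brute-force a seed for a seeded hash until it maps a small key set bijectively onto [0, n-1]. This only works for tiny key sets; real MPH libraries use far more scalable constructions. The SHA-256-based hash and the color keys are illustrative assumptions:

```python
import hashlib
import itertools

def h(seed: int, key: bytes, n: int) -> int:
    # A seeded hash built from SHA-256; any good universal hash would do.
    digest = hashlib.sha256(seed.to_bytes(4, "little") + key).digest()
    return int.from_bytes(digest[:4], "little") % n

def find_mph_seed(keys):
    """Brute-force search for a seed under which h maps the keys
    bijectively onto [0, n-1]. Feasible only for tiny key sets."""
    n = len(keys)
    for seed in itertools.count():
        if len({h(seed, k, n) for k in keys}) == n:  # no collisions
            return seed

keys = [b"red", b"green", b"blue", b"yellow"]
seed = find_mph_seed(keys)
# The four keys now map to exactly the integers 0, 1, 2, 3.
assert sorted(h(seed, k, len(keys)) for k in keys) == [0, 1, 2, 3]
```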

Finding such hash functions can be difficult. In fact, the Birthday Paradox, described in this Wikipedia article, tells us that zero collisions is unlikely even if the table size m is several times larger than the number of keys n. However, recent advances have yielded efficient techniques (such as those described in this abstract) and quality libraries for generating minimal perfect hashes. These techniques allow our method to succeed with only a small downside: the resulting hash function itself requires a lookup table of ~2.2 bits per key, so it does not occupy constant size. However, this lookup table is a fraction of the size of a bloom filter and requires only about 26MB for 100 million entries. In practice, the size of the lookup table is negligible compared to the size of the data being stored.

Practicalities

An mph function gives us a mapping from the keys to [0, n-1], but we still need to implement the table. In memory, we’d typically just store the data in an array. To translate this to offline storage we need to make something explicit that programming language arrays make implicit: memory offsets. Our representation is a directory with three files:

  • data.bin: the raw serialized key/value pairs
  • offsets.bin: a fixed size array where the ith value is the byte offset in data.bin for the entry whose hash is i
  • meta.bin: the serialized hash function itself, along with any other desired metadata

The following figure shows this representation with an index mapping colors to animals:

A representation of explicitly mapping memory offsets using three files.
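The lookup path through these three files can be sketched as follows. The in-memory byte strings, the length-prefixed serialization, and the dict standing in for the serialized hash function in meta.bin are all illustrative assumptions:

```python
import struct

# Stand-ins for the three files, built for a toy color -> animal index.
entries = [(b"red", b"fox"), (b"green", b"iguana"), (b"blue", b"whale")]
mph = {key: i for i, (key, _) in enumerate(entries)}  # plays the role of meta.bin

data_bin = b""
offsets = []
for key, value in entries:
    offsets.append(len(data_bin))
    # Length-prefixed key and value: a simple serialization scheme.
    data_bin += struct.pack("B", len(key)) + key
    data_bin += struct.pack("B", len(value)) + value
offsets_bin = struct.pack(f"<{len(offsets)}i", *offsets)  # fixed-size array

def lookup(key: bytes) -> bytes:
    i = mph[key]                                          # one hash evaluation
    (off,) = struct.unpack_from("<i", offsets_bin, 4 * i) # one offset read
    klen = data_bin[off]
    vlen = data_bin[off + 1 + klen]
    start = off + 1 + klen + 1
    return data_bin[start:start + vlen]                   # one "seek" into data.bin

assert lookup(b"green") == b"iguana"
```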

The file containing the serialized hash requires roughly 2 bits per key, while the file containing the offsets requires 4 bytes per key. The raw serialized key/value pairs require 46 bytes in total.

One convenience of this representation is that data.bin is raw data available for iterating, independent of the hash function. In addition, we might have multiple hash functions indexing the same data.bin by different keys as long as those keys are always unique.

The disadvantage of this representation arises with many small entries. In this case, the size of the offsets can be on par with, or even larger than, the data itself. For an extreme example, if the data contains 750 million mappings from int to half-precision float, we require 7.5e8 * 6 = ~4.2G for data.bin, but 7.5e8 * 8 = ~5.6G for offsets.bin! In this particular example we can take advantage of every entry having a fixed size: we don’t need to store the offsets at all, because we can index directly into data.bin. But we want to optimize the variable-sized case as well.

This high cost arises specifically when we’re dealing with small entries; if the entries are large, the size of the offsets is negligible by comparison. And precisely because small entries mean data.bin itself stays small, we can represent the offsets with a rank (bitcount) data structure, costing on the order of bits per entry instead of bytes. For a little more overhead, we can compute the inverse of rank, called select, in constant time on this same data structure. Why is that useful? If a bit is set at the start byte of every entry in data.bin, then finding the offset of the ith entry is just computing select(i).

The structure then becomes the following. Note that we now have to sort data.bin by hash value.

A representation of explicitly mapping memory offsets with optimization in the offsets file.

In the optimized structure, the file containing the offsets requires only a bit more than one bit per byte of data.bin: here, 1.x * 46 bits in total.
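The select contract can be sketched with a naive bitvector. The entry lengths below are made up, and a real implementation answers select in constant time using a small auxiliary index rather than a scan:

```python
# Bitvector over data.bin: bit b is set iff an entry starts at byte b.
lengths = [8, 13, 21, 5]   # entry byte-lengths in hash order for a toy table
total = sum(lengths)

bits = [0] * total
pos = 0
for length in lengths:
    bits[pos] = 1          # mark the start byte of each entry
    pos += length

def select(i: int) -> int:
    """Position of the (i+1)-th set bit == byte offset of entry i.
    A naive scan; real rank/select structures do this in O(1)."""
    seen = -1
    for position, bit in enumerate(bits):
        seen += bit
        if seen == i:
            return position
    raise IndexError(i)

assert [select(i) for i in range(4)] == [0, 8, 21, 42]
```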

A further optimization comes from the often very regular structure in our data. For instance, to map from a long to a list of longs (using a single byte for the count), the size of entry i (including key and value) is 8*x_i + 9 bytes, where x_i is the length of its list. The offset of the kth entry is then the sum over the preceding entries: sum(8*x_i + 9) = 9k + 8*sum(x_i) = 9k + 8X, where X is the running element count. In other words, all we have to store is X, which is much smaller than the actual offset, yielding a big savings with the select approach. In this case, with 10 million entries averaging 2 elements in the list, the size of data.bin is 10000000 * (9 + 16) = ~250M, but the compressed select method uses only ~2.5M.
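A sketch of this compression, with made-up list lengths; only the running element counts are stored, and the byte offset is recovered from the formula:

```python
# Entries map a long key to a list of longs, stored as
# 8-byte key + 1-byte count + 8 bytes per element, so entry i
# occupies 8*x_i + 9 bytes when its list has x_i elements.
list_lengths = [2, 0, 3, 1]          # x_i for a toy table, in hash order

# All we store per entry is X_k = sum of x_i for i < k, which grows
# far more slowly than the byte offset itself.
prefix_counts = [0]
for x in list_lengths:
    prefix_counts.append(prefix_counts[-1] + x)

def offset(k: int) -> int:
    # offset(k) = sum over i < k of (8*x_i + 9) = 9*k + 8*X_k
    return 9 * k + 8 * prefix_counts[k]

# Entry sizes are 25, 9, 33, 17 bytes, so the offsets are:
assert [offset(k) for k in range(4)] == [0, 25, 34, 67]
```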

Our implementation computes what the size would be with both an array of offsets and the select approach and chooses accordingly.
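That choice might be sketched like this; the 1.3 bits-per-byte cost assumed for the select structure is illustrative, not a measured constant from the library:

```python
def choose_offset_format(n_entries: int, data_bytes: int,
                         bits_per_byte_select: float = 1.3) -> str:
    """Pick whichever offset representation would be smaller, mirroring
    the choice described in the text (assumed cost model)."""
    array_bytes = 4 * n_entries                        # 4-byte offset per entry
    select_bytes = data_bytes * bits_per_byte_select / 8
    return "array" if array_bytes <= select_bytes else "select"

# Many small entries: the bitvector wins.
assert choose_offset_format(n_entries=750_000_000, data_bytes=4_500_000_000) == "select"
# Few large entries: the plain offset array wins.
assert choose_offset_format(n_entries=1_000, data_bytes=1_000_000_000) == "array"
```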

More optimizations

Throw away the keys!

If we have no collisions and know that the key we are looking up is in the table, we don’t need to verify it. Alternatively, we may be able to validate the key from the value, or from other sources given the value. Finally, if we can accept a small probability of false-positive lookups, we can always probabilistically verify membership by using a bloom filter.

We have a better alternative. Since we’ve already hashed to a value that has no collisions in the original key set, we can store a k-bit signature (the low bits of any universal hash will do) of the key. Any key not in the original set that hashes to the same value will only match the signature with probability 2^-k. This is a much better error rate than with a bloom filter.
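A sketch of signature-based verification, using SHA-256 as a stand-in universal hash and a brute-forced toy perfect hash; the key set is made up:

```python
import hashlib
import itertools

def h(seed: int, key: bytes, n: int) -> int:
    # Seeded stand-in hash; maps any key (in the set or not) to [0, n-1].
    d = hashlib.sha256(seed.to_bytes(4, "little") + key).digest()
    return int.from_bytes(d[:4], "little") % n

keys = [b"red", b"green", b"blue", b"yellow"]
n = len(keys)
# Brute-force a collision-free seed for this tiny set.
seed = next(s for s in itertools.count()
            if len({h(s, k, n) for k in keys}) == n)

def sig(key: bytes) -> int:
    # 8-bit signature: low byte of a second, independent hash.
    return hashlib.sha256(b"sig" + key).digest()[0]

# At build time, store only the signature per slot, not the key.
signatures = [0] * n
for k in keys:
    signatures[h(seed, k, n)] = sig(k)

def probably_in_table(key: bytes) -> bool:
    return signatures[h(seed, key, n)] == sig(key)

assert all(probably_in_table(k) for k in keys)
# A key outside the set passes this check only with probability 2**-8.
```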

In all of these cases, we can omit the keys entirely from the table. When the values are small, the keys may be a significant fraction, or even the majority, of the table.

Link by hashes, not keys

It’s common to want to link multiple tables together — the values of one table either are, or contain foreign keys into, another table.

Another benefit of using an mph function is that it compresses keys to a compact integer range. We can store the hash value instead of the key, which for most practical purposes will fit in a 4-byte integer, as opposed to 8-byte longs or larger for many other foreign keys. Not only is this more compact, but it’s also faster to read because we don’t need to re-compute the hash for the second lookup.
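A minimal sketch of linking by hash value; the table contents and the dict standing in for the mph are hypothetical:

```python
import struct

# Toy "item -> hash" table whose values point into a "hash -> items"
# table by perfect-hash index rather than by the original 8-byte key.
hash_keys = [b"h1", b"h2", b"h3"]
hash_mph = {k: i for i, k in enumerate(hash_keys)}   # stand-in for the mph

# Instead of storing the 8-byte (or larger) foreign key, store its
# 4-byte perfect-hash value.
item_to_hash = {b"itemA": struct.pack("<i", hash_mph[b"h2"])}

def resolve(item: bytes) -> bytes:
    (index,) = struct.unpack("<i", item_to_hash[item])
    return hash_keys[index]  # direct index, no second hash computation

assert resolve(b"itemA") == b"h2"
```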

Code and benchmarks

We’re excited to announce that we’ve open-sourced our mph code. For benchmarks we compare with several open source alternatives:

  • SQLite3: a single-file RDBMS suitable for replication
  • LevelDB: Google’s LSM tree key-value store
  • lsm: our LSM tree key-value store
  • discodb: an immutable key-value store also based on minimal perfect hashing
  • TinyCDB: an immutable key-value store using non-perfect hashing

Note that these options do not have equivalent feature sets. SQLite provides full relational operations, lsm and LevelDB provide range queries, and all three of these are mutable. Here we are concerned only with the common functionality of these options.

We look at the results for two conceptually linked stores based on our production data. This allows us to demonstrate the link by hashing optimization described above, and also enables a more natural representation in SQLite. The first store is a mapping from 50 million 64-bit hashes to a small cluster of “items,” which are 80-bit integers represented as 16-digit base-32 strings. The second store is the reverse mapping from each item in the cluster to its associated hash.

An important aspect of our own key-value stores is their use of arbitrary serialization via a plugin infrastructure. So for a fair comparison, we report sizes for our LSM trees and MPH tables both as opaque string-to-string mappings and with two optimizations: we encode keys as fixed 8-byte long values, and we encode values as short lists of 80-bit integers. For SQLite we encode the keys as integers.

A bar graph comparing data size in MB for string mappings between several storage methods.

In the general case of string mappings, LevelDB, lsm, and mph are all of comparable size, and notably smaller than the alternative solutions. If we apply more efficient serialization, lsm and mph become much smaller, and mph is increasingly able to take advantage of the regularity in size to become the smallest option.

A bar graph comparing data size in MB for serializations between several storage methods.

Here we consider only the best serialization for lsm and mph. For mph we also show the size with the optimizations discussed above, first not storing the keys and then also linking to the above table via the perfect hash value. For SQLite we include the size when indexing on both columns, which allows us to remove the previous table altogether. In this case, the SQLite index is smaller than the combined sizes for discodb and TinyCDB, and comparable to LevelDB. However, mph remains smaller than everything, and in the ideal case with all optimizations applied is in fact an order of magnitude smaller.

A bar graph comparing random lookup latency speeds between several storage options.

The hash-based solutions tend to be faster, as one might expect given that they require fewer seeks. Mph is fastest overall by a large margin, completing hash-to-item lookups in 2 milliseconds and item-to-hash lookups in 3 milliseconds.

A bar graph comparing write performance speeds between several storage options.

For write performance, lsm and mph are strongest. TinyCDB is excluded at this point due to data corruption: it has a fixed 4GB storage limit by design, but even at 3GB we were unable to retrieve 30% of the keys.

Try it yourself

We’ve made Indeed MPH available as open source on GitHub.

If you’re already shipping a key-value store to your servers, our mph tables may be able to offer smaller size and faster reads. If you’re not already doing so, consider this as an alternative to centralized databases. With a compact representation, you can store more data in the same space, or reduce data size that you previously thought too large.


 Indeed MPH: Fast and Compact Immutable Key-Value Stores cross-posted on Medium.

Bringing Job Spotter to iOS: Indeed’s First React Native App

In 2013, Facebook released ReactJS and changed the way a lot of people (including us) think about web development. In 2015, Facebook released React Native, allowing developers to create powerful native mobile experiences using JavaScript and React. When we planned to bring the Job Spotter app to iOS, we were hopeful that React Native would live up to its promise.

Spoiler alert: it did.

Intro to React and React Native

React is a different way of thinking about web development. Instead of separating code by function, React encourages separating code by components. Each component contains all the layout, styling and business logic it needs – often in the same file.

React’s other important feature, described in this conceptual document, is one-way data flow. Compared to the two-way flow implied by other frameworks, it is much easier to figure out how state changes affect the app in React. One-way data flow also enables powerful libraries like Redux.

React Native lets you write JavaScript for Android or iOS apps with React that deliver native performance. Web developers can write powerful native apps, without needing to spend time learning unfamiliar native mobile development technologies.

Wins

React Native helped us speed up our mobile development cycle. Using out-of-the-box tools, we can run our React Native app locally in the iOS simulator. That allows us to refresh the app to test changes without recompilation, just like we would on the web. Even better, application state is maintained across these “hot reloads.” These development features of React Native translate into big savings in time and effort, allowing us to focus on developing new features.

Since we could leverage our previous React knowledge, we didn’t have to learn a new language and could get straight into coding. This was especially great for our team, since none of us had used Objective-C or Swift. Another bonus was that, except for some custom event logging, we hardly had to write any native code.

Another interesting aspect of React Native is that it requires using inline-styles, in which CSS “classes” are actually JavaScript objects. This seemed strange at first, but we’ve grown to love it. Since we were dealing with objects, we could create powerful, composable, and extensible CSS classes and avoid a lot of style duplication seen in traditional webapps. We could develop different looks and functions as needed, rather than having to repurpose the same styles for different needs.

React Native makes heavy use of Flexbox, a newer CSS module that aims to make styling more consistent and straightforward. This removes the need to know many of the traditional quirks of CSS. After overcoming the learning curve, the more declarative approach of Flexbox allowed us to style new components more rapidly than in traditional CSS.

Debugging was also easier than we expected. We just connected to the debugger in the Chrome Developer Tools and debugged the same way we would on the web. React Native has several additional tools – the element inspector determines the styling applied to a component, and the network inspector analyzes the XHR requests your application makes. Being able to reuse familiar tools helped us track down and fix issues as they arose.

Challenges

React Native enabled us to develop more rapidly, but React Native itself developed just as fast. In the two months that we worked on the Job Spotter iOS app, React Native had six releases. After updating to each new release, we would have to do a full regression test of our app.

Another major and related challenge: React Native’s newness. A lot of capabilities that would be better as default features still required third-party libraries, such as accessing the camera and file system and handling permissions. The community works hard to pick up the slack of React Native core. However, these third-party libraries sometimes lacked the quality we expected from the core library. In fact, we had to patch more than one such library.

The components available for navigation and screen transitions were also inconsistent. We ran into an issue with the Navigator component in which every transition needed some kind of animation, with no option for an empty transition. This initially made some of our scene transitions look awkward. We tried using the NavigatorIOS component, but its interface was less friendly and straightforward, and lacked many of Navigator’s features.

Similarly, the TabBarIOS component works quite well, but its transitions were completely independent of whatever Navigator we used. We wanted a single navigator handling everything, but instead had to maintain separate code paths for navigation.

It’s worth it

Despite these challenges, we continue using React Native. It fits well in our toolbox. We can use our web development expertise to build rich mobile experiences without having to become Android or iOS experts. And as React Native matures, we expect many of these current issues to be resolved.

Our next step is porting the React Native version of Job Spotter to Android to replace our existing native Android app.


Bringing Job Spotter to iOS: Indeed’s First React Native App cross-posted on Medium.

Transitioning from Academia to Industry

Perspectives from Indeed’s Data Scientists

I still remember the moment I told my advisor that I was considering leaving academia. The stress. The fear. Saying the words, “I don’t think I want this for myself” out loud. And afterwards, the relief.

At the time, I was juggling work on my dissertation proposal, multiple publications, teaching responsibilities, and a research assistantship. Meanwhile, I had begun researching data science, a field where I could use the skills I had learned in my doctoral program, but in a setting that better suited how I wanted to live and work. So, I made the decision to leave academia, pivot my skillset, and look for data science work in the private sector.

Of course, my experience was not unique. Tenure is often touted as the ultimate form of job security, but the number of tenure track jobs available has declined while the number of graduating Ph.D.’s has increased. Furthermore, pay has not risen with inflation. Many of the data scientists I interviewed who had left academia said that “the road to professorship was long and uncertain” and often full of “soul-crushing” levels of anxiety. By contrast, as companies look to leverage their vast stores of data, job posts on Indeed for data scientist positions have exploded since 2013.

Data science jobs on Indeed.com have quadrupled over the last four years.

A line graph showing an increase in data science job postings on Indeed.com

We looked at a 7-day rolling mean of all Indeed job posts that featured “data science” or “data scientist” in the title across the world from January 1, 2014 to November 16, 2017. In that time frame, those posts went from slightly more than 0.02% of all job posts to over 0.08%. The data was pulled using Indeed’s open source analytics platform Imhotep.

Due to the sequestered nature of the academy, those who might want to leave academia often don’t know what things are like in “the real world.” Since I left the academy two years ago, some of the more common questions I receive are: “Why did you leave academia?” “What should I do to make myself more hireable?” And the most existential question of all:

“Will everything be OK if I leave?”

To answer these questions, I enlisted the help of eleven other Data Scientists and Product Scientists here at Indeed who left academia. For simplicity, I refer to them as “data scientists” throughout this post.

The data scientists I interviewed come from a variety of backgrounds, some from Liberal Arts and others from STEM. For some, Indeed was their first job after leaving academia, while others have been in industry for nearly 10 years. They left at various points in their academic careers, some while still ABD (“all but dissertation”), others while serving as assistant professors. Some went on the academic job market first, others “didn’t even bother,” still others never intended to. Some of our data scientists discussed that a major life event, like getting married or having a child, spurred them to rethink staying in academia.

It’s my hope that this blog post offers some guidance and reassurance for academics considering making the leap to industry.

Why move to industry?

Simply put, moving to industry allows for more freedom for how you spend your time, energy, and money; where you’re able to live; and what kinds of statistical methods you can use.

Suddenly, you have more autonomy to decide where to live or work. For one data scientist, moving to industry after a series of post-doctoral fellowships allowed for “a more predictable career path where there were plenty of job opportunities.”

One of the biggest adjustments former academics made when moving to industry was how much more free time they had and how much more relaxed they felt. Our data scientists described “getting weekends back” and being able to “go home at 5pm and not needing to do any more work.” Some of the former academics discovered new hobbies with their newfound time and money, like cooking, cycling, photography, and sports analytics.

Working in the private sector also provides another form of security: a better salary. One data scientist commented that her “pay doubled and the amount of work required halved” and another about “how good it feels to not be struggling financially!” The average data scientist salary on Indeed is well above the United States median income and the median salary for academics in postdocs or professor positions. I vividly remember the first time I bought a new pair of shoes on my new salary and realized that I did not have to worry about the effects on my budget.

Apart from the positive effects on their personal lives, data scientists who left academia also noted benefits for their work. Some noted that they were “afforded more methodological freedom outside of the academy,” due to the amount of data readily available and fewer concerns about whether their work would be publishable. Nearly all of the data scientists I spoke with mentioned how much they enjoyed “how much impact you can have, and how quickly.”

Speed and collaboration in the private sector

Non-academic jobs do have some notable differences that can surprise newcomers, particularly around speed and collaboration.

Nearly all of those I interviewed commented on the shorter project cycles and faster pace in the private sector, adding that “there is a constant focus on moving fast and iterating.” The tendency for academics to “spend forever on a project until it is exactly right is not a habit that will transfer well.” Another data scientist noted that “in academia you stay with one topic or research area for many years. In industry you need to be OK with working on something new every quarter or so.”

Perhaps the biggest difference is that you no longer work in isolation. “Academia was very isolating and highly values individual contributions you could claim as entirely your own,” discussed one data scientist. “However, in industry the most successful folks are the ones that collaborate well.”

The opportunity to collaborate can be especially meaningful for academics transitioning to industry. “If you’re struggling on something,” one data scientist mentioned, “you don’t have to do it alone.” Teams also expect you to check in with them periodically and not “disappear to work on a project for weeks or months at a time.” While all of the data scientists I spoke with had experience coding for their academic research, it was usually done “in an environment where the only person I would affect was myself. Having to share a codebase with 5 other people and not trip all over each other was a skill I had to learn quickly.”

Transferable skills between research and industry

It’s sometimes hard to remember in the hustle of research, but academics have a multitude of transferable skills beyond expertise in their field. In a way, “academics are trained to be startups with a single employee. They need to gather information, find funding and support, allocate scarce resources, and acquire the technical and non-technical skills needed to get off the ground,” one data scientist noted.

For some, these transferable skills include coding, statistics, and model building. For others, skills can include a variety of non-technical abilities that are so familiar to academics that they overlook their value to the private sector. As one example, because of all their independent work, academics usually pick up valuable time and project management skills. The importance of being able to do independent research and teach oneself cannot be overstated. “Lots of data science questions are complicated and no one has all the required skills,” said one data scientist. “It’s therefore super important to teach yourself.”

Academics also learn valuable communication skills. One data scientist said, “I had so much practice giving talks to an audience of experts trying to pick holes in my argument, that presenting results to clients was a breeze.” Context switching between teaching an intro class and presenting to fellow experts also “provides plenty of practice in pitching your presentation to the appropriate audience.” After writing theses, dissertations, books, and publications, it’s a lot less daunting to write documentation, project updates, and blog posts.

Finally, the value of critical thinking is immense. Academics are trained to be skeptical and to take on “big, complex, and messy problems.” They’re well-equipped to “build a project in an iron-clad way and understand all of the caveats of [their] work.”

How to get hired in industry

The data scientists I spoke with took several different approaches to redefining themselves and their skillset. These included reading books, attending onsite and online courses, working on internships over the summer, and meeting with an on-campus career counselor. Many universities offer these resources for a small fee or even for free.

Several former academics suggested starting a personal blog to write about side projects while learning data science skills. Organizations such as DataKind and Data for Democracy offer volunteer opportunities where budding data scientists can hone their skills. A couple of them “asked around for data projects to do pro bono so I could practice my skills and have some projects to share in my portfolio on my website.” At least two of the eleven data scientists I talked to attributed getting their first industry job in part to these blogs.

Some data scientists who went into industry “deliberately switched” from languages like Matlab or Stata to languages like Python and R for their research projects while still in academia. Others tried to “guide [their] academic work toward slightly more practical technological aims.” One of the most common data skills not usually learned in academia is SQL. Luckily, SQL is relatively easy to learn with abundant resources in print and online. Our data scientists recommend SQL in 10 Minutes by Ben Forta and Learning SQL by Alan Beaulieu.

Finding other academics interested in leaving academia can also be helpful. Some data scientists I spoke with formed study groups, and even practiced mock whiteboarding interview questions together. Others attended local meetups or conducted informational interviews to learn more about the field of data science.

Indeed is looking for data scientists!

Indeed’s Data and Product Science teams look for people who know statistics and how to program. Our Data Science team looks for “full-stack” engineers with strong machine learning backgrounds. Our Product Science team emphasizes prioritizing and solving business problems. Equally important to both teams are soft skills: curiosity; a desire to grow, learn, and improve; being humble and self-aware of one’s limitations; and “a true passion for helping people get jobs.”

Interested in applying? Indeed is hiring Data Scientists and Product Scientists at all levels of seniority and experience, and we love former academics! Find all of our open job listings at this link.

Methodology

This was a qualitative and self-reported study with an admittedly small sample size. We’d love to hear if this was representative (or not) of your experiences. Did you leave academia to pursue a data science role? How has your experience been similar or different? Leave us a comment below!


Special thanks to Christo du Plessis, John Jardel, Evan Koh, Eric Lawrence, Chris Lindner, Donal McMahon, Elias Ponvert, Ke Sang, Annette Taberner-Miller and Wenzhao Xu for taking the time to share their thoughts about transitioning from academia to industry. Trey Causey gave fabulous feedback on an early draft of this post, and James Beach and Leah Pasch helped edit the final draft. Thanks to all academic advisors out there who are understanding and supportive when their students leave academia (thank you, Pam!).


Transitioning from Academia to Industry cross-posted on Medium.