Indeed MPH: Fast and Compact Immutable Key-Value Stores

When you need to scale an application with a lot of data, how do you decide on a storage solution? How can you both safely store and efficiently interact with large data sets? This often turns into a debate about whether to use SQL or NoSQL. Each comes with strengths and drawbacks.

But what if there was a third option that sidestepped database concerns altogether?

Consumers might need updates no more often than every few minutes. In this case, being able to load a data set into memory dramatically speeds access and allows for tremendous scale. This is why, for many projects at Indeed, we distribute a full copy of the needed data to each consumer and render the SQL or NoSQL debate unnecessary. To make this possible, we manage data size using a new key-value store based on minimal perfect hash functions. We’ve implemented this store as a Java library called Indeed MPH.

The problems with distributing data

Usually, distributing a full copy of your data to consumers is impractical for two reasons. First, the data must be mostly read-only. Second, the data cannot be so large as to prevent distribution or processing.

You can overcome the first problem by implementing batch updates. Redistribute the data every day, every hour, or even more frequently, and you keep the data read-only while ensuring that it remains useful. However, how do you keep the size down? As it turns out, the same batch update strategy that addresses the first problem also addresses the second.

Minimal perfect hash functions

Generating read-only data sets means that we know all the keys for our data set in advance. This lets us apply optimizations to the data set to reduce its size. In the case of hash tables, this knowledge allows us to compute a perfect hash function.

A perfect hash function maps every element to a distinct integer with no collisions — in mathematical terms it’s an injective function. A minimal perfect hash (mph) function maps n keys to the n consecutive integers [0, n-1]. With such a structure we can guarantee a single disk seek while maintaining 100% load in the table, but as we’ll demonstrate later, there are even more advantages.

Finding such hash functions can be difficult. In fact, the Birthday Paradox tells us that zero collisions is unlikely even if the number of hash buckets is several times larger than the number of keys. However, recent advances have yielded efficient techniques and quality libraries for generating minimal perfect hashes. These techniques allow our method to succeed with only a small downside: the resulting hash function itself requires a lookup table of ~2.2 bits per key, so it does not occupy a constant size. However, this lookup table is a fraction of the size of a bloom filter and only requires about 26MB for 100 million entries. In practice, the size of the lookup table is negligible compared to the size of the data being stored.

Practicalities

An mph function gives us a mapping from the keys to [0, n-1], but we still need to implement the table. In memory, we’d typically just store the data in an array. To translate this to offline storage we need to make something explicit that programming language arrays make implicit: memory offsets. Our representation is a directory with three files:

  • data.bin: the raw serialized key/value pairs
  • offsets.bin: a fixed size array where the ith value is the byte offset in data.bin for the entry whose hash is i
  • meta.bin: the serialized hash function itself, along with any other desired metadata

The following figure shows this representation with an index mapping colors to animals:


One convenience of this representation is that data.bin is raw data available for iterating, independent of the hash function. In addition, we might have multiple hash functions indexing the same data.bin by different keys as long as those keys are always unique.
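To make the layout concrete, here is a minimal Java sketch of a lookup against these three files. The MphFunction interface and the length-prefixed entry format are illustrative assumptions, not the actual Indeed MPH API:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Minimal sketch of a lookup against the three-file layout described above.
// MphFunction stands in for the hash function serialized in meta.bin, and the
// length-prefixed entry format is an assumption; the actual Indeed MPH API differs.
public class MphLookup {
    private final MphFunction hash;          // loaded from meta.bin
    private final RandomAccessFile offsets;  // offsets.bin: one 8-byte offset per entry
    private final RandomAccessFile data;     // data.bin: raw serialized key/value pairs

    public MphLookup(MphFunction hash, RandomAccessFile offsets, RandomAccessFile data) {
        this.hash = hash;
        this.offsets = offsets;
        this.data = data;
    }

    /** Returns the serialized entry for a key that is known to be in the table. */
    public byte[] get(byte[] key) throws IOException {
        long index = hash.hash(key);   // value in [0, n-1]
        offsets.seek(index * 8L);      // fixed-width array: the i-th long is entry i's offset
        long offset = offsets.readLong();
        data.seek(offset);             // the single seek into the raw entries
        int length = data.readInt();   // assumes entries are length-prefixed
        byte[] entry = new byte[length];
        data.readFully(entry);
        return entry;
    }

    /** Hypothetical interface for the minimal perfect hash loaded from meta.bin. */
    public interface MphFunction {
        long hash(byte[] key);
    }
}
```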

The disadvantage of this representation shows up with many small entries, where the size of the offsets can be on par with, or even larger than, the data itself. For an extreme example, if the data contains 750 million mappings from int to half-precision float, we require 7.5e8 * 6 = ~4.2G for data.bin, but 7.5e8 * 8 = ~5.6G for offsets.bin! In this example, we can take advantage of every entry having a fixed size. We don’t need to store the offsets — we can index directly into data.bin. But we want to optimize the variable-sized case as well.

This high cost arises specifically when we’re dealing with small entries; if the entries are large, the size of the offsets is negligible by comparison. And because small entries mean data.bin itself stays small, we can afford to replace the explicit offsets with a rank (bitcount) data structure over a bitmap with one bit per byte of data.bin, costing on the order of bits per entry instead of bytes. For a little more overhead, we can compute the inverse of rank, called select, in constant time on this same data structure. Why is that useful? If a bit is set at the start byte of every entry in data.bin, then finding the offset of the ith entry is just computing select(i).
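As an illustration of how select recovers offsets, here is a deliberately naive Java sketch. A real rank/select structure answers select(i) in constant time using precomputed block counts; this linear scan only demonstrates the semantics:

```java
// Naive illustration of select over an entry-start bitmap: bit j is set iff
// byte j of data.bin begins an entry. A real rank/select structure answers
// select(i) in constant time using precomputed block counts; this linear scan
// only demonstrates the semantics.
public final class EntryOffsets {
    private final long[] bitmap; // one bit per byte of data.bin

    public EntryOffsets(long[] bitmap) {
        this.bitmap = bitmap;
    }

    /** Byte offset in data.bin of the entry whose hash value is i (0-based). */
    public long select(long i) {
        long seen = 0; // set bits counted so far
        for (int word = 0; word < bitmap.length; word++) {
            long bits = bitmap[word];
            int ones = Long.bitCount(bits);
            if (seen + ones <= i) {
                seen += ones; // the i-th set bit is further along; skip this word
                continue;
            }
            for (int bit = 0; bit < 64; bit++) {
                if ((bits & (1L << bit)) != 0) {
                    if (seen == i) {
                        return word * 64L + bit; // start byte of entry i
                    }
                    seen++;
                }
            }
        }
        throw new IllegalArgumentException("table has fewer than " + (i + 1) + " entries");
    }
}
```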

The structure becomes the following. In this case we now have to sort data.bin by hash value.


A further optimization comes from the often very regular structure in our data. For instance, to map from a long to a list of longs (using a single byte for the count), the size of entry i (including key and value) is 8x_i + 9 bytes, where x_i is the length of its list. The offset of the kth entry is then the sum over the preceding entries, ∑(8x_i + 9) = 9k + 8∑x_i. In other words, all we have to store is the running element count ∑x_i, which is much smaller than the actual byte offset, and a big savings with the select approach. In this case, with 10 million entries averaging 2 elements in the list, the size of data.bin is 10000000 * (9 + 16) = ~250M, but the compressed select method uses only ~2.5M.
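A small Java sketch of that arithmetic (names are illustrative): given the running element count, the byte offset of the kth entry falls out directly, and the data.bin size above checks out:

```java
// Worked example of the "regular structure" optimization for long -> list-of-long
// entries: 8-byte key + 1-byte count + 8 bytes per element = 8*x_i + 9 bytes.
// Instead of byte offsets we only need the running element count.
public final class RegularOffsets {
    /**
     * Byte offset of the k-th entry in data.bin, given the total number of
     * list elements stored in entries 0..k-1.
     */
    public static long offsetOf(long k, long elementsBeforeK) {
        return 9L * k + 8L * elementsBeforeK;
    }

    public static void main(String[] args) {
        // 10 million entries averaging 2 elements each, as in the text:
        long entries = 10_000_000L;
        long dataBytes = entries * (9L + 8L * 2L); // 250,000,000 bytes, i.e. ~250M
        System.out.println("data.bin ~= " + dataBytes + " bytes");
        // Offset of (say) entry 1,000,000, assuming 2,000,000 elements precede it:
        System.out.println(offsetOf(1_000_000L, 2_000_000L));
    }
}
```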

Our implementation computes what the size would be with both an array of offsets and the select approach and chooses accordingly.

More optimizations

Throw away the keys!

If we have no collisions and know that the key we are looking up is in the table, we don’t need to verify it. Alternately, we may be able to validate the key from the value, or from other sources given the value. Finally, if we can accept some small probability of false positive lookups, we can always probabilistically verify whether keys are in the table by using a bloom filter.

We have a better alternative. Since we’ve already hashed to a value that has no collisions in the original key set, we can store a k-bit signature of the key (the low bits of any universal hash will do). Any key not in the original set that hashes to the same value will only match the signature with probability 2^-k. This is a much better error rate than a bloom filter of the same size.

In all of these cases, we can omit the keys entirely from the table. When the values are small, the keys may be a significant fraction, or even the majority of the table.
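Here is a minimal Java sketch of the signature idea. The universal hash shown is a simple stand-in; any decent universal hash’s low k bits would do, and the library’s actual scheme may differ:

```java
// Sketch of key-omitting lookups guarded by a k-bit signature (k <= 32 here).
// We store only the low k bits of a universal hash of each key alongside its
// value; a key outside the original set that lands on the same slot matches
// with probability 2^-k. The hash below is a simple stand-in, not the library's.
public final class SignatureCheck {
    private final int k;
    private final int[] signatures; // one k-bit signature per hash slot

    public SignatureCheck(int k, int[] signatures) {
        this.k = k;
        this.signatures = signatures;
    }

    public int signatureOf(byte[] key) {
        return (int) (universalHash(key) & ((1L << k) - 1)); // keep the low k bits
    }

    /** True if the key's signature matches the one stored for its slot. */
    public boolean probablyContains(int slot, byte[] key) {
        return signatures[slot] == signatureOf(key);
    }

    // Stand-in for any reasonable universal hash over the key bytes.
    private static long universalHash(byte[] key) {
        long h = 1125899906842597L;
        for (byte b : key) {
            h = 31 * h + b;
        }
        return h;
    }
}
```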

Link by hashes, not keys

It’s common to want to link multiple tables together — the values of one table either are, or contain foreign keys into, another table.

Another benefit of using an mph function is that it compresses keys to a compact integer range. We can store the hash value instead of the key, which for most practical purposes will fit in a 4-byte integer, as opposed to 8-byte longs or larger for many other foreign keys. Not only is this more compact, but it’s also faster to read because we don’t need to re-compute the hash for the second lookup.
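A sketch of what this looks like, with illustrative names rather than the library’s API: at build time the foreign key is compressed to its slot in the other table, and at read time that slot indexes the other table directly:

```java
// Sketch of linking two stores by hash value instead of by key. At build time
// the foreign key is replaced with its slot in the other table (a 4-byte int
// for any table under ~2 billion entries); at read time that slot indexes the
// other table's offsets directly, with no second hash computation. Names here
// are illustrative, not the library's API.
public final class HashLink {
    /** Minimal stand-in for the other table's minimal perfect hash function. */
    public interface Mph {
        long hash(byte[] key);
    }

    /** Build time: compress a foreign key into the linked table's slot number. */
    public static int encodeLink(Mph otherTable, byte[] foreignKey) {
        return (int) otherTable.hash(foreignKey);
    }

    /** Read time: the stored int is already the slot; fetch its offset directly. */
    public static long offsetInOtherTable(int slot, long[] otherTableOffsets) {
        return otherTableOffsets[slot];
    }
}
```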

Code and benchmarks

We’re excited to announce that we’ve open-sourced our mph code. For benchmarks we compare with several open source alternatives:

  • SQLite3: a single-file RDBMS suitable for replication
  • LevelDB: Google’s LSM tree key-value store
  • lsm: our LSM tree key-value store
  • discodb: an immutable key-value store also based on minimal perfect hashing
  • TinyCDB: an immutable key-value store using non-perfect hashing

Note these options do not have equivalent feature sets. SQLite provides full relational operations, lsm and LevelDB provide range queries, and all three of these are mutable. Here we are concerned only with the common functionality of these options.

We look at the results for two conceptually linked stores based on our production data. This allows us to demonstrate the link by hashing optimization described above, and also enables a more natural representation in SQLite. The first store is a mapping from 50 million 64-bit hashes to a small cluster of “items,” which are 80-bit integers represented as 16-digit base-32 strings. The second store is the reverse mapping from each item in the cluster to its associated hash.

An important aspect of our own key-value stores is their use of arbitrary serialization via a plugin infrastructure. So for a fair comparison, we report sizes for our LSM trees and MPH tables both with opaque string-to-string mappings and with two optimizations: encoding the keys as fixed 8-byte longs, and encoding the values as short lists of 80-bit integers. For SQLite we encode the keys as integers.

In the general case of string mappings, LevelDB, lsm, and mph are all of comparable size, and notably smaller than the alternative solutions. If we apply more efficient serialization, lsm and mph become much smaller, and mph is increasingly able to take advantage of the regularity in size to become the smallest option.

Here we consider only the best serialization for lsm and mph. For mph we also show the size with the optimizations discussed above, first not storing the keys and then also linking to the above table via the perfect hash value. For SQLite we include the size when indexing on both columns, which allows us to remove the previous table altogether. In this case, the SQLite index is smaller than the combined sizes for discodb and TinyCDB, and comparable to LevelDB. However, mph remains smaller than everything, and in the ideal case with all optimizations applied is in fact an order of magnitude smaller.

The hash-based solutions tend to be faster, as one might expect due to their requiring fewer seeks. Mph is fastest overall by a large margin.

For write performance, lsm and mph are the strongest. TinyCDB is excluded at this point due to data corruption — it has a fixed 4GB storage limit by design, but even at 3GB we were unable to retrieve 30% of the keys.

Try it yourself

We’ve made Indeed MPH available as open source on GitHub.

If you’re already shipping a key-value store to your servers, our mph tables may offer smaller size and faster reads. If you’re not already doing so, consider this as an alternative to centralized databases. With a compact representation, you can store more data in the same space, or ship data sets you previously thought too large to distribute.


Cross-posted on Medium.


Bringing Job Spotter to iOS: Indeed’s First React Native App

In 2013, Facebook released ReactJS and changed the way a lot of people (including us) think about web development. In 2015, Facebook released React Native, allowing developers to create powerful native mobile experiences using JavaScript and React. When we planned to bring the Job Spotter app to iOS, we were hopeful that React Native would live up to its promise.

Spoiler alert: it did.

Intro to React and React Native

React is a different way of thinking about web development. Instead of separating code by function, React encourages separating code by components. Each component contains all the layout, styling and business logic it needs – often in the same file.

React’s other important feature is one-way data flow. Compared to the two-way flow implied by other frameworks, it is much easier to figure out how state changes affect the app in React. One-way data flow also enables powerful libraries like Redux.

React Native lets you use JavaScript and React to write Android and iOS apps that deliver native performance. Web developers can write powerful native apps without needing to spend time learning unfamiliar native mobile development technologies.

Wins

React Native helped us speed up our mobile development cycle. Using out-of-the-box tools, we can run our React Native app locally in the iOS simulator. That allows us to refresh the app to test changes without recompilation, just like we would on the web. Even better, application state is maintained across these “hot reloads.” These development features of React Native translate into big savings in time and effort, allowing us to focus on developing new features.

Since we could leverage our previous React knowledge, we didn’t have to learn a new language and could get straight into coding. This was especially great for our team, since none of us had used Objective-C or Swift. Another bonus was that, except for some custom event logging, we hardly had to write any native code.

Another interesting aspect of React Native is that it requires using inline-styles, in which CSS “classes” are actually JavaScript objects. This seemed strange at first, but we’ve grown to love it. Since we were dealing with objects, we could create powerful, composable, and extensible CSS classes and avoid a lot of style duplication seen in traditional webapps. We could develop different looks and functions as needed, rather than having to repurpose the same styles for different needs.

React Native makes heavy use of Flexbox, a newer CSS module that aims to make styling more consistent and straightforward. This removes the need to know many of the traditional quirks of CSS. After overcoming the learning curve, the more declarative approach of Flexbox allowed us to style new components more rapidly than in traditional CSS.

Debugging was also easier than we expected. We just connected to the debugger in the Chrome Developer Tools and debugged the same way we would on the web. React Native has several additional tools – the element inspector determines the styling applied to a component, and the network inspector analyzes the XHR requests your application makes. Being able to reuse familiar tools helped us track down and fix issues as they arose.

Challenges

React Native enabled us to develop more rapidly, but React Native itself developed just as fast. In the two months that we worked on the Job Spotter iOS app, React Native had six releases. After updating to each new release, we would have to do a full regression test of our app.

Another major and related challenge: React Native’s newness. A lot of capabilities that would be better as default features still required third-party libraries, such as accessing the camera and file system and handling permissions. The community works hard to pick up the slack of React Native core. However, these third-party libraries sometimes lacked the quality we expected from the core library. In fact, we had to patch more than one such library.

The components available for navigation and screen transitions were also inconsistent. We ran into an issue with the Navigator component in which every transition needed some kind of animation, with no option for an empty transition. This initially made some of our scene transitions look awkward. We tried using the NavigatorIOS component, but its interface was less friendly and straightforward, and lacked many of Navigator’s features.

Similarly, the TabBarIOS component works quite well, but its transitions were completely independent of whatever Navigator we used. We wanted a single navigator handling everything, but instead had to maintain different code paths for navigation.

It’s worth it

Despite these challenges, we continue using React Native. It fits well in our toolbox. We can use our web development expertise to build rich mobile experiences without having to become Android or iOS experts. And as React Native matures, we expect many of these current issues to be resolved.

Our next step is porting the React Native version of Job Spotter to Android to replace our existing native Android app.


Cross-posted on Medium.


IndeedEng Culture: Be Ambitious, Stay Humble

In my previous posts on Indeed’s engineering culture, I wrote about keeping teams independent, making better mistakes, dealing with consolidation debt, encouraging autonomy and initiative, and the importance of not stack ranking individuals. These are my closing thoughts on how we work here at Indeed.

Perhaps the Indeed trait that has struck me most is how individuals, including top engineers and managers, have stayed humble even though they built this company from the ground up. Bragging and boasting are not behaviors I have seen at Indeed. As I met my colleagues, it seemed that being right or wrong was overrated. Instead, we are all focused on doing the right thing. I was surprised by how much time and patience the veterans had. At first I wondered what I had done to deserve it. They were curious to hear my perspectives even though I had not yet demonstrated anything.

It took me some time to understand why. Then I figured it out. Being data-driven and expecting initiative means there is always a practical and immediate way to settle a burning argument. If you really believe that feature X is what the business needs, you don’t need to convince an army of managers and directors. You don’t need to be resentful for failing to do so, or not even trying. Your credentials (almost) do not matter. The only thing you need to do is to design the experiment that will prove or disprove your theory, and ship it. And if you are not doing so, why are you talking?


Editor’s note: This post concludes our 5-year anniversary series highlighting some of the most important aspects of Indeed’s engineering culture. We want to thank James Dingle for sharing his perspective and reminding us why we love coming to work here every day. Want to join us? Learn more at http://indeed.jobs.


Cross-posted on Medium.
