When you need to scale an application with a lot of data, how do you decide on a storage solution? How can you both safely store and efficiently interact with large data sets? This often turns into a debate about whether to use SQL or NoSQL. Each comes with strengths and drawbacks.
But what if there was a third option that sidestepped database concerns altogether?
Consumers might need updates no more often than every few minutes. In this case, being able to load a data set into memory dramatically speeds access and allows for tremendous scale. This is why, for many projects at Indeed, we distribute a full copy of the needed data to each consumer and render the SQL or NoSQL debate unnecessary. To make this possible, we manage data size using a new key-value store based on minimal perfect hash functions. We’ve implemented this store as a Java library called Indeed MPH.
The problems with distributing data
Usually, distributing a full copy of your data to consumers is impractical for two reasons. First, the data must be mostly read-only. Second, the data cannot be so large as to prevent distribution or processing.
You can overcome the first problem by implementing batch updates. Redistribute the data every day, every hour, or even more frequently, and you keep the data read-only while ensuring that it remains useful. However, how do you keep the size down? As it turns out, the same batch update strategy that addresses the first problem also addresses the second.
Minimal perfect hash functions
Generating read-only data sets means that we know all the keys for our data set in advance. This lets us apply optimizations to the data set to reduce its size. In the case of hash tables, this knowledge allows us to compute a perfect hash function.
A perfect hash function maps every element to a distinct integer with no collisions — in mathematical terms it’s an injective function. A minimal perfect hash (mph) function maps n keys to the n consecutive integers [0, n-1]. With such a structure we can guarantee a single disk seek while maintaining 100% load in the table, but as we’ll demonstrate later, there are even more advantages.
Finding such hash functions can be difficult. In fact the Birthday Paradox tells us that zero collisions is unlikely even if m is several times larger than n. However, recent advances have yielded efficient techniques and quality libraries for generating minimal perfect hashes. These techniques allow our method to succeed with only a small downside: the resulting hash function itself requires a lookup table of ~2.2 bits per key, so it does not occupy a constant size. However, this lookup table is a fraction of the size of a bloom filter and only requires about 26MB for 100 million entries. In practice, the size of the lookup table is negligible compared to the size of the data being stored.
An mph function gives us a mapping from the keys to [0, n-1], but we still need to implement the table. In memory, we’d typically just store the data in an array. To translate this to offline storage we need to make something explicit that programming language arrays make implicit: memory offsets. Our representation is a directory with three files:
- data.bin: the raw serialized key/value pairs
- offsets.bin: a fixed size array where the ith value is the byte offset in data.bin for the entry whose hash is i
- meta.bin: the serialized hash function itself, along with any other desired metadata
The following figure shows this representation with an index mapping colors to animals:
One convenience of this representation is that data.bin is raw data available for iterating, independent of the hash function. In addition, we might have multiple hash functions indexing the same data.bin by different keys as long as those keys are always unique.
The disadvantage of this representation is with many small entries. In this case, the size of the offsets could be on par with or even larger than the data itself. For an extreme example, if the data contains 750 million mappings from int to half-precision float, we require 7.5e8 * 6 = ~4.2G for data.bin, but 7.5e8 * 8 = ~5.6G for offsets.bin! In this example, we can take advantage of every entry having a fixed size. We don’t need to store the offsets — we can index directly into data.bin. But we want to optimize the variable-sized case as well.
This high cost comes specifically when we’re dealing with small entries. If the entries are large, the size of the offsets are negligible by comparison. Because the total size is small, we can represent it with a rank (bitcount) data-structure, on the order of bits per entry instead of bytes. For a little more overhead, we can compute the inverse of rank, called select, in constant time on this same data structure. Why is that useful? If the bits are set for the start byte of every entry in data.bin, then finding the offset of the ith entry is just computing select(i).
The structure becomes the following. In this case we now have to sort data.bin by hash value.
A further optimization comes from the often very regular structure in our data. For instance, to map from a long to list of longs (using a single byte for the count), the size of every individual entry i (including key and value) can be represented as 8xi + 9, assuming the length of the list is xi. The offset of the kth entry becomes ∑(8xi + 9) = 9k + ∑8xi = 9k + 8x. In other words, all we have to store is x, which is much smaller than the actual offset, and a big savings with the select approach. In this case, with 10 million entries averaging 2 elements in the list, the size of data.bin is 10000000 * (9 + 16) = ~250M, but the compressed select method uses only ~2.5M.
Our implementation computes what the size would be with both an array of offsets and the select approach and chooses accordingly.
Throw away the keys!
If we have no collisions and know that the key we are looking up is in the table, we don’t need to verify it. Alternately, we may be able to validate the key from the value, or from other sources given the value. Finally, if we can accept some small probability of false positive lookups, we can always probabilistically verify whether keys are in the table by using a bloom filter.
We have a better alternative. Since we’ve already hashed to a value that has no collisions in the original key set, we can store a k-bit signature (the low bits of any universal hash will do) of the key. Any key not in the original set that hashes to the same value will only match the signature with probability 2-k. This is a much better error rate than with a bloom filter.
In all of these cases, we can omit the keys entirely from the table. When the values are small, the keys may be a significant fraction, or even the majority of the table.
Link by hashes, not keys
It’s common to want to link multiple tables together — the values of one table either are, or contain foreign keys into, another table.
Another benefit of using an mph function is that it compresses keys to a compact integer range. We can store the hash value instead of the key, which for most practical purposes will fit in a 4-byte integer, as opposed to 8-byte longs or larger for many other foreign keys. Not only is this more compact, but it’s also faster to read because we don’t need to re-compute the hash for the second lookup.
Code and benchmarks
We’re excited to announce that we’ve open-sourced our mph code. For benchmarks we compare with several open source alternatives:
- SQLite3: a single-file RDBMS suitable for replication
- LevelDB: Google’s LSM tree key-value store
- lsm: our LSM tree key-value store
- discodb: an immutable key-value store also based on minimal perfect hashing
- TinyCDB: an immutable key-value store using non-perfect hashing
Note these options do not have equivalent feature sets. SQLite provides full relational operations, lsm and LevelDB provide range queries, and all three of these are mutable. Here we are concerned only with the common functionality of these options.
We look at the results for two conceptually linked stores based on our production data. This allows us to demonstrate the link by hashing optimization described above, and also enables a more natural representation in SQLite. The first store is a mapping from 50 million 64-bit hashes to a small cluster of “items,” which are 80-bit integers represented as 16-digit base-32 strings. The second store is the reverse mapping from each item in the cluster to its associated hash.
An important aspect of our own key-value stores is their use of arbitrary serialization via a plugin infrastructure. So for a fair comparison, in the sizes for our LSM trees and MPH tables we include both the sizes for opaque string to string mappings, as well as two optimizations. We encode keys as fixed 8-byte long values, and encode the values as short lists of 80-bit integers. For SQLite we encode the keys as integers.
In the general case of string mappings, LevelDB, lsm and mph are all of comparable size, and notably smaller than the alternative solutions. If we apply more efficient serialization lsm and mph become much smaller, and mph is increasingly able to take advantage of the regularity in size to become the smallest option.
Here we consider only the best serialization for lsm and mph. For mph we also show the size with the optimizations discussed above, first not storing the keys and then also linking to the above table via the perfect hash value. For SQLite we include the size when indexing on both columns, which allows us to remove the previous table altogether. In this case, the SQLite index is smaller than the combined sizes for discodb and TinyCDB, and comparable to LevelDB. However, mph remains smaller than everything, and in the ideal case with all optimizations applied is in fact an order of magnitude smaller.
Write performance shows lsm and mph have the strongest performance. TinyCDB is excluded at this point due to data corruption — it has a fixed 4GB storage limit by design, but even at 3GB we were unable to retrieve 30% of the keys.
Try it yourself
We’ve made Indeed MPH available and open-source on Git.
If you’re already shipping a key-value store to your servers, our mph tables may be able to offer smaller size and faster reads. If you’re not already doing so, consider this as an alternative to centralized databases. With a compact representation, you can store more data in the same space, or reduce data size that you previously thought too large.
Cross-posted on Medium.