Indeed at FOSSASIA 2018

Indeed is a proud sponsor of the FOSSASIA Open Tech Summit, held March 22-25 in Singapore. FOSSASIA is Asia's premier developer event, with tracks covering AI, cloud, blockchain, science and tech, the web, and more.

On March 24, Indeed speakers will present the following topics. We encourage you to attend.

How to Overcome Obstacles and Take Control of Your Career in Tech

Speaker: Doug Gray

Doug Gray is Senior Vice President of Engineering at Indeed. As a member of Indeed’s senior leadership team, Doug defines the organizational and development processes for the engineering team. His areas of expertise include product revitalization, process change and team development. Previously, Doug was a senior leader at Versata Software. Prior to that, he held a number of senior positions at Trilogy, where he worked for over 12 years. He holds a Bachelor of Science degree in Computer Science from Stanford University.

Finding the perfect job can be difficult. Figuring out how to stand out from the competition and then navigating human biases, archaic hiring processes, and other challenges can make landing that dream job seem like a long shot.

Learn how you can take control of your tech career path. Gain key insights from new research on what drives top tech talent's career decisions, and learn how technology can help you overcome obstacles in the hiring process so that you stay in control of finding your dream job.

Indeed MPH: Fast and Compact Immutable Key-Value Stores

Speaker: Alex Shinn

Alex Shinn is an engineer working on Indeed's job recommendation system. A former Google search engineer, he is a generalist with experience in natural language processing, building scalable architectures, and various AI technologies from expert systems to modern machine learning. In his free time, he writes Scheme compilers and libraries and contributes to the Scheme standardization process.

When you need to scale an application with a lot of data, how do you decide on a storage solution? How can you both safely store and efficiently interact with large data sets? This usually boils down to a choice between SQL or NoSQL — but what if there was a third option?

In this session, developer Alex Shinn will discuss Indeed’s MPH-Table: an open source storage solution that uses minimal perfect hash functions to generate a fast and compact key/value store. Alex will discuss how we use MPH-Table at Indeed, what situations are ideal for it, additional optimizations it allows, and how it compares to alternate solutions. Attendees should enter with a fundamental understanding of existing scalable storage solutions, and will leave with a basic understanding of MPH-Table, when they might want to use it, and how other solutions compare.


Find the full schedule for FOSSASIA, along with more information about Doug's and Alex's presentations, here.

We’re hiring! We have openings in Singapore for Engineering Manager, Full-Stack Software Engineer, Quality Assurance Automation Engineer and Quality Assurance Engineer. Tokyo openings include Manager, Data Science and Manager, Site Reliability Engineering. Hyderabad openings include Product Manager and Senior User Experience Designer.


Cross-posted on Medium.


Indeed Expands its Commitment to Open Source

At Indeed, we’re committed to an active role in open source communities. We’re proud to announce that we’ve joined the Cloud Native Computing Foundation (CNCF), an open source software foundation dedicated to making cloud-native computing universal and sustainable.


The CNCF, part of The Linux Foundation, is a vendor-neutral home for fast-growing projects. It promotes collaboration among the industry’s top developers, vendors, and end users. CNCF members include many leading public cloud providers and private cloud companies.

Deciding to join the CNCF was easy for us. We rely on open source technologies like OpenTracing to build and deliver best-in-class products. We believe we can serve job seekers and employers best by ensuring that open source software projects continue to grow.

Open source is now the industry model for practical software development. No longer the province of Linux companies like Red Hat, open source has seen its commercial success accelerate thanks to adoption by large technology companies such as IBM and Oracle. More recently, companies like Walmart and Verizon have come to rely on open source programs and host their own open source projects. Microsoft has also become a major open source contributor.

Our open source projects include Proctor, an A/B testing framework that enables data-driven product design; LSM Tree, a fast key-value store for high-volume random access reads and writes; and our newest release, MPH, an immutable key-value store with efficient space utilization and fast reads.

Cloud strategies that leverage open source projects play an increasingly important role in business success. As a result, the demand for employees with open source development skills is also growing. Indeed research indicates skills in the following areas are in high demand for open source development roles: Java, Python, Git, JavaScript, Node.js, Docker, AngularJS, Jenkins, Amazon Web Services, and Agile.

Our strategic open source initiative involves partnerships, sponsorships and memberships … like our new CNCF role. In the next few months, we plan to announce sponsorships that support critical open source projects and help us engage targeted audiences. We’ll also continue to expand our open source team and become more involved in open source communities.

For updates on Indeed’s open source projects, visit our open source site. If you’re interested in open source roles at Indeed, visit our hiring page.

Doug Gray is the Senior VP of Engineering at Indeed.


Indeed MPH: Fast and Compact Immutable Key-Value Stores

When you need to scale an application with a lot of data, how do you decide on a storage solution? How can you both safely store and efficiently interact with large data sets? This often turns into a debate about whether to use SQL or NoSQL. Each comes with strengths and drawbacks.

But what if there was a third option that sidestepped database concerns altogether?

Many consumers of data need updates no more often than every few minutes. In such cases, being able to load a data set into memory dramatically speeds access and allows for tremendous scale. This is why, for many projects at Indeed, we distribute a full copy of the needed data to each consumer and render the SQL-or-NoSQL debate unnecessary. To make this possible, we manage data size using a new key-value store based on minimal perfect hash functions. We've implemented this store as a Java library called Indeed MPH.

The problems with distributing data

Usually, distributing a full copy of your data to consumers is impractical for two reasons. First, the data must be mostly read-only. Second, the data cannot be so large as to prevent distribution or processing.

You can overcome the first problem by implementing batch updates. Redistribute the data every day, every hour, or even more frequently, and you keep the data read-only while ensuring that it remains useful. But how do you keep the size down? As it turns out, the same batch update strategy that addresses the first problem also addresses the second.

Minimal perfect hash functions

Generating read-only data sets means that we know all the keys for our data set in advance. This lets us apply optimizations to the data set to reduce its size. In the case of hash tables, this knowledge allows us to compute a perfect hash function.

A perfect hash function maps every element to a distinct integer with no collisions — in mathematical terms it’s an injective function. A minimal perfect hash (mph) function maps n keys to the n consecutive integers [0, n-1]. With such a structure we can guarantee a single disk seek while maintaining 100% load in the table, but as we’ll demonstrate later, there are even more advantages.

Finding such hash functions can be difficult. In fact, the Birthday Paradox tells us that a hash into m buckets is unlikely to be collision-free even when m is several times larger than the number of keys n. However, recent advances have yielded efficient techniques and quality libraries for generating minimal perfect hashes. These techniques allow our method to succeed with only a small downside: the resulting hash function itself requires a lookup table of ~2.2 bits per key, so it does not occupy a constant size. However, this lookup table is a fraction of the size of a Bloom filter and requires only about 26MB for 100 million entries. In practice, the size of the lookup table is negligible compared to the size of the data being stored.
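To make the idea concrete, here is a toy brute-force illustration. This is not how real mph libraries work (they use scalable constructions such as hash-and-displace); for a small, fixed key set, though, we can simply search for a seed that makes an ordinary seeded hash land each key on a distinct slot in [0, n-1]. All names here are our own, for illustration only.

    import java.util.HashSet;
    import java.util.Set;

    // Toy demonstration of a minimal perfect hash: brute-force search for a
    // seed under which a seeded string hash, taken mod n, is a bijection
    // from the n keys onto [0, n-1]. Feasible only for tiny key sets.
    public class ToyMph {
        static int seededHash(String key, int seed) {
            int h = seed;
            for (int i = 0; i < key.length(); i++) {
                h = h * 31 + key.charAt(i);
            }
            return h;
        }

        static int findSeed(String[] keys) {
            final int n = keys.length;
            for (int seed = 0; seed < Integer.MAX_VALUE; seed++) {
                final Set<Integer> slots = new HashSet<>();
                for (String key : keys) {
                    slots.add(Math.floorMod(seededHash(key, seed), n));
                }
                if (slots.size() == n) { // no collisions: minimal and perfect
                    return seed;
                }
            }
            throw new IllegalStateException("no perfect seed found");
        }

        public static void main(String[] args) {
            final String[] keys = {"red", "green", "blue", "black", "white"};
            final int seed = findSeed(keys);
            for (String key : keys) {
                System.out.println(key + " -> "
                        + Math.floorMod(seededHash(key, seed), keys.length));
            }
        }
    }

The Birthday Paradox is exactly why this brute force only works for tiny sets: the probability that a random function on n keys is a bijection is n!/n^n, which vanishes rapidly as n grows.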

Practicalities

An mph function gives us a mapping from the keys to [0, n-1], but we still need to implement the table. In memory, we'd typically just store the data in an array. To translate this to offline storage, we need to make explicit something that programming language arrays make implicit: memory offsets. Our representation is a directory with three files (a lookup sketch follows the list):

  • data.bin: the raw serialized key/value pairs
  • offsets.bin: a fixed-size array in which the i-th value is the byte offset in data.bin of the entry whose hash is i
  • meta.bin: the serialized hash function itself, along with any other desired metadata
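Here is a minimal lookup sketch against this layout, with serialization details elided. The 8-byte big-endian offsets and the assumption that data.bin is written in hash order (so entry i ends where entry i+1 begins) are illustrative choices, not the library's actual file format:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class MphLookup {
        // Reads the serialized entry whose minimal-perfect-hash value is
        // `hash`. Assumes offsets.bin stores 8-byte big-endian offsets and
        // that data.bin is written in hash order; in general the entry
        // length is determined by the serializer instead.
        static byte[] readEntry(RandomAccessFile offsets, RandomAccessFile data,
                                long hash, long entryCount) throws IOException {
            offsets.seek(hash * 8);
            final long start = offsets.readLong();
            // The entry ends where the next one begins (or at EOF for the last).
            final long end = (hash + 1 < entryCount) ? offsets.readLong()
                                                     : data.length();
            final byte[] entry = new byte[(int) (end - start)];
            data.seek(start);
            data.readFully(entry);
            return entry; // still serialized; decode with your key/value serializers
        }
    }

Note the guarantee the mph function buys us: one seek into offsets.bin at a computed position, one seek into data.bin, and we are done.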

The following figure shows this representation with an index mapping colors to animals:

[Figure: the three-file layout (data.bin, offsets.bin, meta.bin) for a colors-to-animals index]

One convenience of this representation is that data.bin is raw data available for iterating, independent of the hash function. In addition, we might have multiple hash functions indexing the same data.bin by different keys as long as those keys are always unique.

The disadvantage of this representation appears when there are many small entries. In that case, the size of the offsets could be on par with or even larger than the data itself. For an extreme example, if the data contains 750 million mappings from an int to a half-precision float, we require 7.5e8 * 6 = ~4.2GB for data.bin, but 7.5e8 * 8 = ~5.6GB for offsets.bin! In this example, we can take advantage of every entry having a fixed size: we don't need to store the offsets at all, because we can index directly into data.bin. But we want to optimize the variable-sized case as well.
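For completeness, here is what the fixed-size shortcut looks like before we turn to the variable-sized case; a minimal sketch, with the 6-byte entry size taken from the int-to-half-float example above:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    final class FixedSizeLookup {
        // With fixed-size entries there is no offsets.bin at all: the
        // offset is just a multiplication by the entry size.
        static byte[] readEntry(RandomAccessFile data, long hash, int entrySize)
                throws IOException {
            final byte[] entry = new byte[entrySize]; // e.g. 6 = int + half float
            data.seek(hash * entrySize);
            data.readFully(entry);
            return entry;
        }
    }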

The high cost of offsets.bin arises precisely because the entries are small; if the entries were large, the size of the offsets would be negligible by comparison. But small entries also mean data.bin itself is small relative to the number of entries, so we can afford to represent the offsets with a rank (bitcount) data structure over the bytes of data.bin, on the order of bits per entry instead of bytes. For a little more overhead, we can compute the inverse of rank, called select, in constant time on this same data structure. Why is that useful? If a bit is set at the start byte of every entry in data.bin, then finding the offset of the i-th entry is just computing select(i).
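Here is an illustrative, deliberately simplified select over a bitmap of entry start positions. Real implementations add sampled rank counts so that select runs in constant time rather than the linear scan shown here:

    import java.util.BitSet;

    // Bit b is set iff an entry starts at byte b of data.bin; with data.bin
    // sorted by hash value, the offset of entry i is select(i).
    final class EntryStarts {
        private final long[] words;

        EntryStarts(BitSet starts) {
            this.words = starts.toLongArray();
        }

        // Position of the i-th (0-based) set bit. Linear scan for clarity;
        // production rank/select structures answer this in O(1).
        long select(long i) {
            long seen = 0;
            for (int w = 0; w < words.length; w++) {
                final int count = Long.bitCount(words[w]);
                if (seen + count > i) { // the i-th set bit lies in this word
                    for (int bit = 0; bit < 64; bit++) {
                        if ((words[w] & (1L << bit)) != 0 && seen++ == i) {
                            return 64L * w + bit;
                        }
                    }
                }
                seen += count;
            }
            throw new IllegalArgumentException("fewer than " + (i + 1) + " entries");
        }
    }

The bitmap costs one bit per byte of data.bin plus the rank-sampling overhead, which for small entries is far cheaper than 8 bytes of offset per entry.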

The structure becomes the following; note that in this case we have to sort data.bin by hash value.

[Figure: select-based layout with a bitmap of entry starts and data.bin sorted by hash value]

A further optimization comes from the often very regular structure in our data. For instance, to map from a long to a list of longs (using a single byte for the count), entry i (key plus value) occupies exactly 8x_i + 9 bytes, where x_i is the length of the i-th list. The offset of the k-th entry is then the sum over i < k of (8x_i + 9), which equals 9k + 8X_k, where X_k is the running total of list lengths before entry k. In other words, all we have to store is X_k, which is much smaller than the actual byte offset, and a big savings with the select approach. In this case, with 10 million entries averaging 2 elements per list, the size of data.bin is 10000000 * (9 + 16) = ~250MB, but the compressed select method uses only ~2.5MB.
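In code, the reconstruction is pure arithmetic; a sketch for the long-to-list-of-longs layout just described (the names are ours, for illustration):

    final class RegularLayout {
        // Entry k occupies 8 * x_k + 9 bytes (8-byte key, 1-byte count,
        // x_k 8-byte longs). Given the running element total
        // X_k = x_0 + ... + x_{k-1}, the byte offset of entry k in data.bin
        // is recovered without storing any offsets at all.
        static long offsetOfEntry(long k, long runningElementTotal) {
            return 9 * k + 8 * runningElementTotal;
        }
    }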

Our implementation computes what the size would be with both an array of offsets and the select approach and chooses accordingly.

More optimizations

Throw away the keys!

If we have no collisions and know that the key we are looking up is in the table, we don't need to verify it. Alternatively, we may be able to validate the key from the value, or from other sources given the value. Finally, if we can accept some small probability of false positive lookups, we can probabilistically verify membership using a Bloom filter.

But we have a better alternative. Since we've already hashed to a value that has no collisions in the original key set, we can store a k-bit signature of the key (the low bits of any universal hash will do). Any key not in the original set that hashes to the same value will match the signature only with probability 2^-k. This is a much better error rate than a Bloom filter of comparable size.

In all of these cases, we can omit the keys entirely from the table. When the values are small, the keys may be a significant fraction, or even the majority, of the table size.
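A sketch of the signature check with k = 8, i.e. one byte per entry. The use of hashCode here is illustrative; in practice the signature should come from a universal hash independent of the mph function itself:

    final class Signatures {
        // `signatures` holds one byte per slot, indexed by mph value and
        // built when the table is written.
        static boolean probablyContains(byte[] signatures, int mphValue, String key) {
            final byte expected = (byte) key.hashCode(); // low 8 bits of the key's hash
            // A key outside the original set that lands on this slot matches
            // with probability 2^-8; keys from the original set always match.
            return signatures[mphValue] == expected;
        }
    }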

Link by hashes, not keys

It’s common to want to link multiple tables together: the values of one table either are, or contain, foreign keys into another table.

Another benefit of using an mph function is that it compresses keys to a compact integer range. We can store the hash value instead of the key, which for most practical purposes will fit in a 4-byte integer, as opposed to 8-byte longs or larger for many other foreign keys. Not only is this more compact, but it’s also faster to read because we don’t need to re-compute the hash for the second lookup.
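As a sketch, when writing table B we replace each foreign key with table A's compact mph value, and resolve links directly against A's files. The names below are hypothetical stand-ins (mphOfA for table A's hash function, readEntry from the lookup sketch earlier):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.function.ToIntFunction;

    final class HashLinks {
        // Writing table B: store table A's mph value (an int for tables
        // under 2^31 entries) in place of the 8-byte-or-larger foreign key.
        static int encodeLink(ToIntFunction<String> mphOfA, String foreignKey) {
            return mphOfA.applyAsInt(foreignKey);
        }

        // Reading: the stored link is already the index into A's offset
        // array, so resolving it requires no re-hashing at all.
        static byte[] resolveLink(RandomAccessFile offsetsA, RandomAccessFile dataA,
                                  int link, long entryCountA) throws IOException {
            return MphLookup.readEntry(offsetsA, dataA, link, entryCountA);
        }
    }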

Code and benchmarks

We’re excited to announce that we’ve open-sourced our mph code. For benchmarks we compare with several open source alternatives:

  • SQLite3: a single-file RDBMS suitable for replication
  • LevelDB: Google’s LSM tree key-value store
  • lsm: our LSM tree key-value store
  • discodb: an immutable key-value store also based on minimal perfect hashing
  • TinyCDB: an immutable key-value store using non-perfect hashing

Note that these options do not have equivalent feature sets. SQLite provides full relational operations, lsm and LevelDB provide range queries, and all three of those are mutable. Here we are concerned only with the functionality these options have in common.

We look at the results for two conceptually linked stores based on our production data. This allows us to demonstrate the link by hashing optimization described above, and also enables a more natural representation in SQLite. The first store is a mapping from 50 million 64-bit hashes to a small cluster of “items,” which are 80-bit integers represented as 16-digit base-32 strings. The second store is the reverse mapping from each item in the cluster to its associated hash.

An important aspect of our own key-value stores is their use of arbitrary serialization via a plugin infrastructure. So for a fair comparison, the sizes for our LSM trees and MPH tables include both opaque string-to-string mappings and two optimizations: encoding the keys as fixed 8-byte longs, and encoding the values as short lists of 80-bit integers. For SQLite we encode the keys as integers.

In the general case of string mappings, LevelDB, lsm, and mph are all of comparable size, and notably smaller than the alternative solutions. If we apply more efficient serialization, lsm and mph become much smaller still, and mph is increasingly able to take advantage of the regularity in entry size to become the smallest option.

Here we consider only the best serialization for lsm and mph. For mph we also show the size with the optimizations discussed above, first not storing the keys and then also linking to the above table via the perfect hash value. For SQLite we include the size when indexing on both columns, which allows us to remove the previous table altogether. In this case, the SQLite index is smaller than the combined sizes for discodb and TinyCDB, and comparable to LevelDB. However, mph remains smaller than everything, and in the ideal case with all optimizations applied is in fact an order of magnitude smaller.

The hash-based solutions tend to be faster, as one might expect due to their requiring fewer seeks. Mph is fastest overall by a large margin.

On write performance, lsm and mph are the strongest. TinyCDB is excluded at this point due to data corruption: it has a fixed 4GB storage limit by design, but even at 3GB we were unable to retrieve 30% of the keys.

Try it yourself

We’ve made Indeed MPH available as open source on GitHub.

If you’re already shipping a key-value store to your servers, our mph tables may offer smaller sizes and faster reads. If you’re not, consider this approach as an alternative to centralized databases. With a compact representation, you can store more data in the same space, or handle data sets you previously thought too large to distribute.
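To give a feel for the library, usage is roughly shaped like the sketch below. Every class and method name here (TableConfig, TableWriter.write, TableReader.open, SmartLongSerializer, Pair) is an assumption about the API rather than a verified signature; the repository README is the authoritative reference.

    import java.io.File;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical usage sketch; consult the mph-table README for the
    // actual API, serializers, and builder options.
    import com.indeed.mph.TableConfig;
    import com.indeed.mph.TableReader;
    import com.indeed.mph.TableWriter;
    import com.indeed.mph.serializers.SmartLongSerializer;
    import com.indeed.util.core.Pair;

    public class MphExample {
        public static void main(String[] args) throws Exception {
            final TableConfig<Long, Long> config = new TableConfig<Long, Long>()
                    .withKeySerializer(new SmartLongSerializer())
                    .withValueSerializer(new SmartLongSerializer());
            final List<Pair<Long, Long>> entries = Arrays.asList(
                    new Pair<>(1L, 100L), new Pair<>(2L, 200L));
            TableWriter.write(new File("/tmp/example-table"), config, entries);
            try (final TableReader<Long, Long> reader =
                     TableReader.open("/tmp/example-table")) {
                System.out.println(reader.get(1L)); // prints 100
            }
        }
    }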


Cross-posted on Medium.
