Vectorized VByte Decoding: High Performance Vector Instructions

Data-driven organizations like Indeed need great tools. We built Imhotep, our interactive data analytics platform (released last year), to manage the parallel execution of queries. To balance memory efficiency and performance in Imhotep, we developed a technique called vectorized variable-byte (VByte) decoding.

VByte with differential decoding

Many applications use VByte and differential encoding to compress sorted sequences of integers. The most common compression method for inverted indexes uses this style of encoding. This approach encodes successive differences between integers instead of the integers themselves, using fewer bytes for smaller integers at the cost of using more bytes for larger integers.

A conventional VByte decoder examines only one byte at a time, which limits throughput. Also, each input byte requires one branch, leading to mispredicted branches.

Vectorized VByte decoding

Our masked VByte decoder processes larger chunks of input data — 12 bytes — at one time, which is much faster than decoding one byte at a time. This is important for Indeed because Imhotep spends ~40% of its CPU time decoding variable-byte integers. We described this approach in a tech talk last year: Large Scale Analytics and Machine Learning at Indeed.

Jeff Plaisance (Indeed), Nathan Kurz (Verse Communications), and Daniel Lemire (LICEF, Université du Québec) discuss the masked VByte decoder in detail in Vectorized VByte Decoding. The paper’s abstract follows:

We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (MASKED VBYTE) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.

Vectorized VByte Decoding has been accepted to the International Symposium on Web Algorithms (iSWAG) on June 2-3, 2015. iSWAG promotes academic and industrial research on all topics related to web algorithms.

Large-scale interactive tools

To learn more about Imhotep, check out these tech talks and slides: Scaling Decision Trees and Large-Scale Analytics with Imhotep. You can find the source and documentation for Imhotep on GitHub.