Vectorized VByte Decoding: High Performance Vector Instructions

Data-driven organizations like Indeed need great tools. We built Imhotep, our interactive data analytics platform (released last year), to manage the parallel execution of queries. To balance memory efficiency and performance in Imhotep, we developed a technique called vectorized variable-byte (VByte) decoding.

VByte with differential decoding

Many applications use VByte and differential encoding to compress sorted sequences of integers. The most common compression method for inverted indexes uses this style of encoding. This approach encodes successive differences between integers instead of the integers themselves, using fewer bytes for smaller integers at the cost of using more bytes for larger integers.
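As a rough sketch (my own illustration, not Imhotep's actual implementation), delta-then-VByte encoding of a sorted sequence might look like this in Java. Each delta is stored low 7 bits first, with the high bit set on every byte except the last byte of each value:

```java
import java.io.ByteArrayOutputStream;

public class VByteEncoder {
    // Encodes a sorted sequence of non-negative longs as successive
    // differences (deltas), each delta stored as a VByte: 7 payload bits
    // per byte, continuation bit set on all bytes except the last.
    public static byte[] encode(long[] sorted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long value : sorted) {
            long delta = value - previous;   // small for dense sequences
            previous = value;
            while (delta >= 0x80) {
                out.write((int) (delta & 0x7F) | 0x80); // continuation bit set
                delta >>>= 7;
            }
            out.write((int) delta);          // final byte: high bit clear
        }
        return out.toByteArray();
    }
}
```

Encoding {1000, 1003, 1025} stores the deltas 1000, 3, and 22, so the three values take only four bytes instead of twelve or twenty-four.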

A conventional VByte decoder examines only one byte at a time, which limits throughput. Each input byte also requires a branch, and the unpredictable lengths of the encoded integers lead to frequent branch mispredictions.
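A minimal scalar decoder (a sketch, not the production code) makes the bottleneck explicit: the inner loop must branch on every input byte, because it cannot know an integer is complete until it sees a byte with the high bit clear:

```java
public class VByteDecoder {
    // Decodes VByte-encoded deltas back into absolute values.
    public static long[] decode(byte[] in, int count) {
        long[] result = new long[count];
        long previous = 0;
        int pos = 0;
        for (int i = 0; i < count; i++) {
            long delta = 0;
            int shift = 0;
            while (true) {
                byte b = in[pos++];
                delta |= (long) (b & 0x7F) << shift;
                if ((b & 0x80) == 0) {       // one branch per input byte
                    break;
                }
                shift += 7;
            }
            previous += delta;               // undo differential encoding
            result[i] = previous;
        }
        return result;
    }
}
```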

Vectorized VByte decoding

Our masked VByte decoder processes larger chunks of input data — 12 bytes — at one time, which is much faster than decoding one byte at a time. This is important for Indeed because Imhotep spends ~40% of its CPU time decoding variable-byte integers. We described this approach in a tech talk last year: Large Scale Analytics and Machine Learning at Indeed.

Jeff Plaisance (Indeed), Nathan Kurz (Verse Communications), and Daniel Lemire (LICEF, Université du Québec) discuss the masked VByte decoder in detail in Vectorized VByte Decoding. The paper’s abstract follows:

We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (MASKED VBYTE) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.

Vectorized VByte Decoding has been accepted to the International Symposium on Web Algorithms (iSWAG) on June 2-3, 2015. iSWAG promotes academic and industrial research on all topics related to web algorithms.

Large-scale interactive tools

To learn more about Imhotep, check out these tech talks and slides: Scaling Decision Trees and Large-Scale Analytics with Imhotep. You can find the source and documentation for Imhotep on GitHub.

Memory Mapping with util-mmap

We are excited to highlight the open-source availability of util-mmap, a memory mapping library for Java. It provides an efficient mechanism for accessing large files. Our analytics platform Imhotep (released last year) uses it for managing data access.

Why use memory mapping?

Our backend services handle large data sets, like LSM trees and Lucene indexes. The util-mmap library provides safe memory mapping of these kinds of large files. It also overcomes known limitations of MappedByteBuffer in the JDK.

Memory mapping is the process of bringing part of a file into a virtual memory segment. Applications can then treat the mapped part like primary memory. We use memory mapping in latency-sensitive production applications that have particularly large files. By doing so, we prevent expensive I/O operations.

Limitations with MappedByteBuffer

The JDK provides MappedByteBuffer in the java.nio package for doing memory mapping. This library has three main problems:

Unable to safely unmap
The only way to request unmapping with MappedByteBuffer is to call System.gc(), which does not guarantee unmapping; this is a known JDK bug. Because you must unmap a memory-mapped file before you can delete it, the bug can cause disk space problems when mapping large, frequently updated files.

Unable to map files larger than 2GB
MappedByteBuffer uses integers for all indexes. That means you must use multiple buffers to manage files that are larger than 2GB. Managing multiple buffers can lead to complicated, error-prone code.
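For illustration (a hypothetical helper, not part of any library), addressing a file larger than 2GB through MappedByteBuffer forces index arithmetic like this:

```java
import java.nio.MappedByteBuffer;

public class ChunkedLongReader {
    // Hypothetical wrapper: each MappedByteBuffer covers at most
    // CHUNK_SIZE bytes, so a long file offset must be split into a
    // buffer index plus an int offset within that buffer.
    private static final long CHUNK_SIZE = 1L << 30; // 1GB per mapping
    private final MappedByteBuffer[] chunks;

    public ChunkedLongReader(MappedByteBuffer[] chunks) {
        this.chunks = chunks;
    }

    public long getLong(long byteOffset) {
        int chunk = (int) (byteOffset / CHUNK_SIZE);
        int offset = (int) (byteOffset % CHUNK_SIZE);
        // Still breaks if an 8-byte value straddles a chunk boundary;
        // real code has to handle that case as well.
        return chunks[chunk].getLong(offset);
    }
}
```

Every read pays for the split, and edge cases such as values spanning two buffers make this approach easy to get wrong.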

Thread safety
ByteBuffer maintains internal state to track the position and limit. Reading with relative methods such as get() therefore requires giving each thread its own buffer via duplicate(). Example:

public class ByteBufferThreadLocal extends ThreadLocal<ByteBuffer>
{
    private final ByteBuffer src;
    public ByteBufferThreadLocal(ByteBuffer src)
    {
        this.src = src; // assign to the field, not back to the parameter
    }

    @Override
    protected synchronized ByteBuffer initialValue()
    {
        return src.duplicate();
    }
}

Memory mapping with util-mmap

util-mmap addresses all of these issues:

  • implements unmapping so that you can delete unused files immediately;
  • uses long pointers, so it is capable of memory mapping files larger than 2GB;
  • works well with our AtomicSharedReference for safe, simple access from multiple threads.

Example: memory mapping a large long[] array

Use Guava’s LittleEndianDataOutputStream to write out a binary file:

try (LittleEndianDataOutputStream out =
        new LittleEndianDataOutputStream(new FileOutputStream(filePath))) {
    for (long value : contents) {
        out.writeLong(value);
    }
}

Use MMapBuffer to memory map this file:

final MMapBuffer buffer = new MMapBuffer(
       filePath,
       FileChannel.MapMode.READ_ONLY,
       ByteOrder.LITTLE_ENDIAN);
final LongArray longArray =
    buffer.memory().longArray(0, buffer.memory().length() / 8);

Why not use Java serialization?
Java serializes data in big-endian form, while Indeed's production systems run on little-endian Intel processors. Also, the actual data for a serialized long array starts 17 bytes into the file, after the object header.

To properly memory map a natively serialized Java array, you would have to write code that handles that offset correctly, and you would also have to flip the bytes around, which is expensive. Writing the data in little-endian form results in much more straightforward memory mapping code.

Thread Safety

For safe access from multiple threads, use AtomicSharedReference. This class wraps the Java object that’s using the memory mapped file. For example:

final AtomicSharedReference<LongArray> objRef =
    AtomicSharedReference.create(longArray);

The objRef variable is a mutable reference to the underlying SharedReference, a ref-counted object. When using the array, you must call getCopy() and then close the reference.

try(final SharedReference<LongArray> myData = objRef.getCopy())  {
    LongArray obj = myData.get();
    // … do something …
}

SharedReference keeps track of references and unmaps the file when none are still open.

Reloads

Use the setQuietly method to swap in a newer copy of the file after reloading it from disk.

final MyObject newMyObj = reloadMyObjectFromDisk();
objRef.setQuietly(newMyObj);

Close

Use closeQuietly upon application shutdown to unmap the file.

objRef.closeQuietly();

Get started with util-mmap

At Indeed, we use util-mmap in several production services. We are using it to access files that are up to 15 GB and updated every few minutes. If you need to memory map your large files, visit us on GitHub and give util-mmap a try.

Making History with Code Festival 2014

What do you get when you combine a coding contest, a music and arts festival, and games? Do those things even belong together?

Code Festival 2014

For two days in November 2014, two hundred collegiate coders came together in Tokyo to participate in Code Festival 2014, put on by Recruit and Indeed Tokyo. The event included five distinct coding challenges, as well as fun non-coding activities. After the initial coding challenge – the main round – participants could brush up on their skills with tutoring tailored to that challenge. Not your average coding contest.

So, what does a room full of 200 coders look like?

Large group of participants in Code Festival 2014

Why make history?

Organizers wanted to capitalize on a love of coding and competition and bring many talented coders together to work hard, have fun, make friends, and learn something new. Traditionally, programming contests are limited to a few top competitors, which can discourage those who don't make the cut.

Ayaka Matsuo, the event’s project lead, decided to break free from that tradition. The structure of the festival allowed many more participants to take advantage of the events. Another history-making facet of the event was Matsuo’s event team: 16 new college hires who will join Indeed Tokyo after graduation in April 2015. They provided ideas, helped run the event, and generated a lot of enthusiasm.

Expanding on a tradition

Indeed and Recruit held coding duels in Fall 2013, December 2013, and February 2014. (Read about them here.) They were a warm-up to the November 2014 festival in Tokyo, providing a lot of valuable insights into how to expand the types of coding challenges. Code Festival 2014 was so successful that plans are already forming for the next event. Could we host even more competitors?

The coding challenges

The event included five separate coding challenges. Two of the challenges – the main round and the morning programming contest on the second day – were standard programming contests, but the remaining three were not at all traditional.

Main round

All 200 participants worked through 10 questions (including debugging) in 3 hours.

Participants seated in individual cubbies in large room.

Participants during the main round portion of the contest

The participant who solved the most problems in the least amount of time won the round. The top five participants from the main round advanced to the exhibition challenge.

Participants posing with their awards.

Top five winners with Indeed’s Senior VP of Engineering, Doug Gray

The winner took home 294,409 yen, an amount that looks like a prime number but isn't. Last year's Tokyo coding duel awarded prize amounts that were prime numbers. This year, to mix things up, the organizers chose amounts that are strong pseudoprimes: composite numbers that nonetheless pass common primality tests. In Japanese, these numbers are called Kyogi sosu (強擬素数), a clever choice, since "Kyogi" can also mean strongly doubted or competition. Check out the prize amounts for the top 20 contestants here.
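As a quick sanity check (my own illustration, not from the event), 294,409 does fool the classic Fermat primality test: it is a Carmichael number, 37 × 73 × 109, so 2^(n-1) ≡ 1 (mod n) even though n is composite:

```java
import java.math.BigInteger;

public class PseudoprimeCheck {
    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(294409);
        // A prime p satisfies 2^(p-1) ≡ 1 (mod p); so does this number.
        BigInteger fermat = BigInteger.valueOf(2).modPow(n.subtract(BigInteger.ONE), n);
        System.out.println("2^(n-1) mod n = " + fermat);     // prints 1
        // ...and yet n is composite: 294409 = 37 * 73 * 109.
        System.out.println("n mod 37 = " + n.mod(BigInteger.valueOf(37))); // prints 0
    }
}
```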

Exhibition

In the evening of the first day, the top five finalists from the main round moved to a separate room for the exhibition challenge.

Seated participants being filmed as they work on a coding challenge.

Participants were filmed during the exhibition challenge.

This room was far from private, however, as live video from each of the five computers was streamed into the main hall, allowing everyone to follow along with the competitors’ progress in solving the problem. Audio commentary added to the excitement.

Group of smiling and clapping participants.

Onlookers during the exhibition challenge

Were the challengers aware that their every move was being evaluated in the next room? Yes! And being watched only made the competition more lively.

Morning programming contest

All 200 participants were invited to return the next morning for another programming contest. To change it up, participants joined one of three groups, determined by skill level, and competed individually against others in the same group.

AI challenge

The AI programming contest required participants to write code that manipulates virtual players in a computer game. Fifty participants who had registered in advance of the festival participated in a preliminary challenge, with the top 16 progressing to the final. Those 16 were divided into four groups of 4, competing tournament style.

An exhibition match followed, pitting Naohiro Takahashi (president of AtCoder and a competitive programmer) and Colun (a competitive programmer) against the first- and second-place winners of the AI challenge.

Team relay

During the last challenge on day 2, the 200 participants were divided into 20 teams of 10 members each. Each team needed to solve 10 questions, one at a time, within 1 hour 30 minutes. Live video aired, along with commentator play-by-play.

While one teammate worked at the computer, the rest of the team huddled apart from the contest area. If the teammate with the "baton" had a question, they stepped away from the computer to collaborate with the others.

Group of participants huddled over papers.

Team huddle during the relay

Other festival activities

Event organizers sought to ensure that all participants had a chance to learn, play, and connect with their peers. Non-coding activities included calligraphy with code-related content, board games, Taiko-no Tatsujin (drum masters), and DDR (Dance Dance Revolution).

Four participants kneeling over calligraphy.

Calligraphy coding

Participants also had the opportunity to take private lessons with coding competition experts and attend panels with industry professionals covering these topics:

  • The future of programming contests
  • Are coding competitions effective for learning programming?
  • How to become a "red coder" (the top rating tier on TopCoder)
  • How to handle increasing speed in coding competitions

Want to know more?

Gizmodo Japan wrote about the Code Festival. To review the participants' submissions, navigate to the AtCoder standings page and click the magnifying glass beside each user's name. To brush up on your own skills, participate in TopCoder contests and challenge yourself with past problems from the ACM-ICPC World Finals.