Memory Mapping with util-mmap

We are excited to highlight the open-source availability of util-mmap, a memory mapping library for Java. It provides an efficient mechanism for accessing large files. Our analytics platform Imhotep (released last year) uses it for managing data access.

Why use memory mapping?

Our backend services handle large data sets, like LSM trees and Lucene indexes. The util-mmap library provides safe memory mapping of these kinds of large files. It also overcomes known limitations of MappedByteBuffer in the JDK.

Memory mapping is the process of mapping part of a file into a segment of virtual memory. Applications can then treat the mapped part like primary memory, and the operating system pages the data in on demand. We use memory mapping in latency-sensitive production applications that have particularly large files. By doing so, we avoid expensive explicit I/O operations.
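As a minimal sketch of the idea, the following uses only the JDK's FileChannel.map (not util-mmap) to map a file and read one byte from it as if it were memory. The class and method names are hypothetical and chosen for illustration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedReadExample {
    /** Maps the whole file read-only and returns the byte at the given offset. */
    static byte readByteAt(Path file, int offset) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single JDK mapping cannot exceed Integer.MAX_VALUE bytes.
            // The mapping stays valid after the channel is closed.
            MappedByteBuffer mapped =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // No explicit read() call: the access looks like a memory access.
            return mapped.get(offset);
        }
    }
}
```

Mapping the file on every call is only for brevity; a real service would map once and reuse the buffer.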

Limitations with MappedByteBuffer

The JDK provides MappedByteBuffer in the java.nio package for doing memory mapping. This class has three main problems:

Unable to safely unmap
MappedByteBuffer has no public method for unmapping. The mapping is released only when the buffer is garbage collected, so the only way to request unmapping is to call System.gc(). This approach doesn’t guarantee unmapping, and the limitation is a known JDK bug. You must unmap a memory mapped file before you can delete it, so this bug causes disk space problems when mapping large, frequently-updated files.

Unable to map files larger than 2GB
MappedByteBuffer uses integers for all indexes. That means you must use multiple buffers to manage files that are larger than 2GB. Managing multiple buffers can lead to complicated, error-prone code.
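To illustrate the bookkeeping this requires, here is a rough sketch of a multi-buffer wrapper built only on JDK classes. ChunkedMappedFile and its methods are hypothetical names, and the chunk size is a parameter so the logic can be exercised on small files.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChunkedMappedFile {
    private final MappedByteBuffer[] chunks;
    private final long chunkSize;

    ChunkedMappedFile(Path file, long chunkSize) throws IOException {
        if (chunkSize <= 0 || chunkSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("chunkSize must fit in a positive int");
        }
        this.chunkSize = chunkSize;
        this.chunks = mapChunks(file, chunkSize);
    }

    /** Maps the file as consecutive read-only chunks of at most chunkSize bytes. */
    private static MappedByteBuffer[] mapChunks(Path file, long chunkSize) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            int count = (int) ((size + chunkSize - 1) / chunkSize);
            MappedByteBuffer[] result = new MappedByteBuffer[count];
            for (int i = 0; i < count; i++) {
                long start = i * chunkSize;
                long length = Math.min(chunkSize, size - start);
                result[i] = channel.map(FileChannel.MapMode.READ_ONLY, start, length);
            }
            return result;
        }
    }

    /** Translates a long file offset into a (chunk, int offset) pair. */
    byte get(long offset) {
        int chunk = (int) (offset / chunkSize);
        int within = (int) (offset % chunkSize);
        return chunks[chunk].get(within);
    }
}
```

Even this sketch handles only single bytes. Reading a long or an int that straddles a chunk boundary needs extra code, which is where this approach tends to become error-prone.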

Thread safety
ByteBuffer maintains internal state to track the position and limit, so a single instance is not safe to share between threads. Reading using relative methods like get() requires a unique buffer per thread via duplicate(). Example:

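import java.nio.ByteBuffer;

// Gives each thread its own view of a shared buffer.
public class ByteBufferThreadLocal extends ThreadLocal<ByteBuffer>
{
    private final ByteBuffer src;

    public ByteBufferThreadLocal(ByteBuffer src)
    {
        this.src = src;
    }

    @Override
    protected synchronized ByteBuffer initialValue()
    {
        // synchronized so that threads never call duplicate() on src concurrently;
        // the duplicate shares the content but has its own position and limit
        return src.duplicate();
    }
}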
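The reason duplicate() helps can be seen in a small sketch (the class name is hypothetical): a duplicate shares the underlying bytes but tracks its own position, so relative reads on one view do not disturb another.

```java
import java.nio.ByteBuffer;

final class DuplicateExample {
    /** Returns {position of the original, position of the duplicate} after two reads. */
    static int[] positionsAfterRead() {
        ByteBuffer shared = ByteBuffer.wrap(new byte[] {10, 20, 30, 40});
        ByteBuffer view = shared.duplicate();
        // Relative get() advances only the position of the view it is called on.
        view.get();
        view.get();
        return new int[] {shared.position(), view.position()};
    }
}
```

Here the original stays at position 0 while the duplicate advances to 2.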

util-mmap addresses all of these issues:

implements unmapping so that you can delete unused files immediately;
uses long pointers, so it is capable of memory mapping files larger than 2GB;
works well with our AtomicSharedReference for safe, simple access from multiple threads.

Example: memory mapping a large long[] array

Use Guava’s LittleEndianDataOutputStream to write out a binary file:

try (LittleEndianDataOutputStream out =
        new LittleEndianDataOutputStream(new FileOutputStream(filePath))) {
    for (long value : contents) {
        out.writeLong(value);
    }
}

Use MMapBuffer to memory map this file:

final MMapBuffer buffer = new MMapBuffer(
        filePath,
        FileChannel.MapMode.READ_ONLY,
        ByteOrder.LITTLE_ENDIAN);
final LongArray longArray =
        buffer.memory().longArray(0, buffer.memory().length() / 8);

Why not use Java serialization?
Java serialization writes data in big-endian form. Indeed’s production systems run on Intel processors that are little endian. Also, the actual data for a long array starts at 17 bytes into the file, after the object header.

To properly memory map a native Java serialized array, you would have to write code to manage that offset correctly. You would also have to flip the bytes around on every read, which is expensive. Writing data in little endian results in more straightforward memory mapping code.
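The byte-order mismatch can be shown with a short sketch that uses only java.nio (the class and method names are hypothetical). Decoding little-endian bytes as big-endian yields the byte-reversed value, which is the flip a reader of mismatched data would have to undo.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class EndianExample {
    /** Encodes a long as 8 bytes in the given byte order. */
    static byte[] encode(long value, ByteOrder order) {
        return ByteBuffer.allocate(Long.BYTES).order(order).putLong(value).array();
    }

    /** Decodes 8 bytes as a long, interpreting them in the given byte order. */
    static long decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getLong();
    }
}
```

For example, decode(encode(1L, ByteOrder.LITTLE_ENDIAN), ByteOrder.BIG_ENDIAN) equals Long.reverseBytes(1L), not 1L.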

Thread Safety

For safe access from multiple threads, use AtomicSharedReference. This class wraps the Java object that’s using the memory mapped file. For example:

final AtomicSharedReference<LongArray> objRef =
    AtomicSharedReference.create(longArray);

The objRef variable is a mutable reference to the underlying SharedReference, a ref-counted object. When using the array, you must call getCopy() and then close the reference.

try (final SharedReference<LongArray> myData = objRef.getCopy()) {
    LongArray obj = myData.get();
    // … do something …
}

SharedReference keeps track of references and unmaps the file when none are still open.
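The general reference-counting pattern can be sketched as follows. This is an illustration of the concept only, not util-mmap's implementation: RefCounted, retain, and release are hypothetical names, and the cleanup action stands in for unmapping.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RefCounted<T> {
    private final T value;
    private final Runnable onRelease;
    // Starts at 1: the creator holds the first reference.
    private final AtomicInteger refs = new AtomicInteger(1);

    RefCounted(T value, Runnable onRelease) {
        this.value = value;
        this.onRelease = onRelease;
    }

    T get() {
        return value;
    }

    /** Takes an additional reference; fails if the value was already released. */
    RefCounted<T> retain() {
        while (true) {
            int n = refs.get();
            if (n == 0) {
                throw new IllegalStateException("already released");
            }
            if (refs.compareAndSet(n, n + 1)) {
                return this;
            }
        }
    }

    /** Drops one reference and runs the cleanup when the last one is dropped. */
    void release() {
        if (refs.decrementAndGet() == 0) {
            onRelease.run();
        }
    }
}
```

With this shape, a reader that still holds a reference keeps the resource alive even after the owner drops its own, and the cleanup runs exactly once when the count reaches zero.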

Reloads

Use the setQuietly method to replace the current copy of the file with a newer one.

final MyObject newMyObj = reloadMyObjectFromDisk();
objRef.setQuietly(newMyObj);

Close

Use closeQuietly upon application shutdown to unmap the file.

objRef.closeQuietly();

Get started with util-mmap

At Indeed, we use util-mmap in several production services. We are using it to access files that are up to 15 GB and updated every few minutes. If you need to memory map your large files, visit us on GitHub and give util-mmap a try.
