A Rust Spot: mmap

Posted on January 9, 2016 by Tommy McGuire

Labels: rust, digital humanities, unix, tip

Since I have recently been fiddling with natural language processing, I decided I ought to learn a bit about the topic. The textbook I picked is Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schütze. (The alternative, Speech and Language Processing, by Daniel Jurafsky and James H. Martin, while it is more popular as a textbook and more recent, did not seem to be as well written from the excerpts I’ve read.¹) So, I’ve been working my way through the book slowly, writing the example programs in Rust.

I chose Rust because I enjoy the language, and because I am trying to write code with better performance than is available in Java or Python, two other popular languages in this arena that I also like very much. But I’m an old systems guy, so my biases are, well, biased.

To that end, I would like to be able to read a text file into memory without copying the data. That’s where mmap enters the picture.

About mmap

The Unix system call mmap provides more-or-less general access to the operating system’s memory management system. You can use it to read and write files, to allocate a block of memory, to share memory between processes, and probably some other stuff that I’m forgetting. It cures pellagra and asthma, and kills dust-mites. It’s not quite a dessert topping and a floor wax (for that, see ioctl), but it can be pretty useful.

For this task, I intend to use it solely for reading file contents into memory, for which it is pretty easy to use. One note: writing to the file is also pretty trivial, but extending the file requires quite a few more dance steps. While mmap is providing an interface to the system’s file system memory buffers and most file systems write additional data to the end of the file by slapping additional pages on the end of the buffer mapping and scribbling on them, this is not guaranteed to work by the mmap documentation and can cause bad things to happen on some systems. (Like deadlocking the file system into a giant tar-baby, causing any process that tries to do anything to it to get stuck and requiring a reboot to clean up. That’s NFS on AIX 3.2.5, by the way.)

Anyway, the basic process is this: get an open, readable file descriptor for the file, call mmap to make the contents of the file look like a block of memory, have your way with them, then munmap the memory and close the file descriptor.

mmap in Rust: part one

An interface to mmap is part of the Rust libc library, which is currently unstable-ish in my copy of Rust, so the first step is to add a dependency on libc to the Cargo.toml file. The next step is to provide access to a file descriptor:

// A file descriptor, open for reading.
struct FileDescriptor(libc::c_int);

impl FileDescriptor {
    unsafe fn open(filename: &str) -> Result<FileDescriptor,String> {
        if let Ok(file) = std::ffi::CString::new(filename) {
            let fd = libc::open(file.as_ptr(), libc::O_RDONLY, 0);
            if fd >= 0 {
                Ok( FileDescriptor(fd) )
            } else {
                Err( format!("failure in open({}): {}",
                    filename,
                    std::io::Error::last_os_error()) )
            }
        } else {
            Err( format!("failure getting CString: {}", filename) )
        }
    }
}

A file descriptor is an integer that references system structures associated with the process. Most of the complexity here is in the interface to the C functions: converting the file name string to a CString and calling libc::open with assorted necessary error handling.
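The CString step can fail because a C string cannot contain an interior NUL byte, which is why open has that outer error branch. Here is a minimal sketch of just that conversion, using only the standard library; the helper name to_c_path is made up for illustration and is not part of the code above.

```rust
use std::ffi::CString;

/// Hypothetical helper: convert a Rust file name into a NUL-terminated C string.
/// Fails if the name contains an interior NUL byte, which C cannot represent.
fn to_c_path(filename: &str) -> Result<CString, String> {
    CString::new(filename)
        .map_err(|_| format!("failure getting CString: {}", filename))
}
```

The CString has to stay alive for as long as the pointer from as_ptr is in use, which is why open above binds it to file before calling libc::open.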

Closing the file descriptor when it is no longer needed is handled safely by an implementation of the Drop trait:

impl Drop for FileDescriptor {
    fn drop(&mut self) {
        let FileDescriptor(fd) = *self;
        unsafe {
            libc::close(fd);
        }
    }
}

In order to map the contents of the file, the program needs to know how big the file is, so the following code uses libc::fstat:

impl FileDescriptor {
    unsafe fn get_size(&self) -> Result<libc::size_t,String> {
        let FileDescriptor(fd) = *self;
        let mut stat: libc::stat = std::mem::zeroed();
        if libc::fstat(fd, &mut stat) < 0 {
            Err( format!("failure in fstat(): {}", std::io::Error::last_os_error()) )
        } else {
            Ok( stat.st_size as libc::size_t )
        }
    }

    fn get_fd(&self) -> &libc::c_int { &self.0 }
}

The one interesting part of get_size is the call to std::mem::zeroed to allocate the libc::stat structure, which is both large (as in, it has many elements) and very system dependent. The call to zeroed happily gives me the structure in a good state very simply.
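One caveat: std::mem::zeroed is only sound for types where an all-zero bit pattern is a valid value. That holds for a plain C structure full of integers, but not for something like a reference. The sketch below shows the idea on a made-up FakeStat structure (a stand-in, not the real libc::stat layout).

```rust
use std::mem;

/// Hypothetical stand-in for a C struct such as `libc::stat`: plain integers only.
#[repr(C)]
#[derive(Debug, PartialEq)]
struct FakeStat {
    st_dev: u64,
    st_mode: u32,
    st_size: i64,
}

/// Build a `FakeStat` with every byte set to zero.
fn zeroed_stat() -> FakeStat {
    // Sound only because all-zero bits are a valid value for every field.
    unsafe { mem::zeroed() }
}
```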

Much of this code is declared unsafe, because it’s using the C API. However, the only really unsafe part is the management of the file descriptor, which is an integer and can easily leak outside of the code where it is open and in a good state. As a nod to this danger, get_fd returns a borrowed pointer to the integer rather than the integer itself, a pointer which is only valid for the duration of the lifetime of the FileDescriptor.

mmap in Rust: part two

The safe, useful external interface to an mmap-ed region of memory is MappedRegion, a structure containing the file descriptor for the file (which this design keeps open while the region is valid and closes immediately after, although the mapping itself would survive closing the descriptor earlier), the size of the region (which will be needed when the region is unmapped), and a pointer to the memory that contains the file contents (more precisely, that is backed by the file).

pub struct MappedRegion {
    _fd: FileDescriptor,
    sz: libc::size_t,
    ptr: *mut u8,
}

impl MappedRegion {

    pub fn mmap(filename: &str) -> Result<MappedRegion,String> {
        unsafe {
            match FileDescriptor::open(filename) {
                Ok(fd) => map(fd),
                Err(e) => Err(e)
            }
        }
    }
    ...
}

Actually mapping the file, once the file descriptor has been opened, is the purpose of the unsafe map function:

unsafe fn map(fd: FileDescriptor) -> Result<MappedRegion,String> {
    match fd.get_size() {
        Ok(sz) => {
            let address = libc::mmap(0 as *mut libc::c_void,
                                     sz,
                                     libc::PROT_READ,
                                     libc::MAP_PRIVATE,
                                     *fd.get_fd(),
                                     0);
            // mmap reports failure by returning MAP_FAILED, not a negative value.
            if address == libc::MAP_FAILED {
                Err( format!("failure in mmap(): {}",
                             std::io::Error::last_os_error()) )
            } else {
                Ok( MappedRegion {
                    _fd: fd,
                    ptr: address as *mut u8,
                    sz: sz,
                })
            }
        }
        Err(e) => { Err(e) }
    }
}

The function calls libc::mmap with the following arguments:

A 0. This is a hint to the operating system about where in the process’s address space the mapped region should be located; a zero says the program does not care.
The size of the file. It’s possible to map a smaller segment of the file but not a larger, by the way.
A protection flag of PROT_READ. This indicates that the program is interested in reading the contents of the file, but not in writing or executing them. (Yeah, you can dynamically load code this way.)
A sharing flag of MAP_PRIVATE. This indicates that changes made through the mapping will not be visible to other processes mapping the same file and will not be written to the file. (Such changes should not be possible with PROT_READ anyway, but something needs to be specified here.) This argument also possibly takes other flags to request other magic from the operating system, none of which are needed for this case.
The file descriptor. Yeah, it had to be in here somewhere.
An offset of 0. If you were mapping a smaller segment of the file, this would specify where in the file the segment starts; it must be a multiple of the system page size.

Simple? Clear? Easy? (Questions? Comments? Drop slips? (Hi, Mittens!))
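One detail that is easy to get wrong is the failure check. mmap does not return a negative number on failure; it returns the constant MAP_FAILED, which on Linux and the BSDs is (void *) -1, the all-ones address. Raw pointers in Rust compare as unsigned addresses, so a "less than null" test can never be true. The sketch below illustrates this with the standard library only; the constant and helper names are mine, and the all-ones value is an assumption about the platform (the libc crate exposes the real thing as libc::MAP_FAILED).

```rust
/// Assumed address of MAP_FAILED: (void *) -1, i.e. every bit set.
const MAP_FAILED_ADDR: usize = usize::MAX;

/// The check that works: compare against the failure sentinel.
fn mmap_failed(address: *mut u8) -> bool {
    address as usize == MAP_FAILED_ADDR
}

/// The check that cannot work: no address is below the null pointer.
fn below_null(address: *mut u8) -> bool {
    address < std::ptr::null_mut()
}
```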

Unmapping the file when the program is done is the task of another implementation of the Drop trait:

impl Drop for MappedRegion {
    fn drop(&mut self) {
        unsafe {
            if libc::munmap(self.ptr as *mut libc::c_void, self.sz) < 0 {
                panic!("munmap: {}", std::io::Error::last_os_error());
            }
        }
    }
}

The arguments are the pointer of the mapped region and the size. Since Drop doesn’t have a way to propagate errors, this code panics. Normally, my practice would be to swallow the error (What, exactly, am I supposed to do about a failure in munmap?), but here I suspect such errors will be sufficiently rare (and important, or at least interesting) to report somehow.
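The order of cleanup falls out of Rust's drop rules: a structure's own Drop::drop runs first, and only then are its fields dropped, in declaration order. For MappedRegion that means munmap happens before the FileDescriptor field closes the descriptor. A small sketch with stand-in types (FakeFd and FakeRegion are hypothetical and only record what they would do) shows the ordering:

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for `FileDescriptor`; logs "close" when dropped.
struct FakeFd<'a> {
    log: &'a RefCell<Vec<&'static str>>,
}

impl<'a> Drop for FakeFd<'a> {
    fn drop(&mut self) {
        self.log.borrow_mut().push("close");
    }
}

/// Hypothetical stand-in for `MappedRegion`; logs "munmap" when dropped.
struct FakeRegion<'a> {
    _fd: FakeFd<'a>,
    log: &'a RefCell<Vec<&'static str>>,
}

impl<'a> Drop for FakeRegion<'a> {
    fn drop(&mut self) {
        self.log.borrow_mut().push("munmap");
    }
}

/// Create and drop a region, returning the order in which cleanup ran.
fn drop_order() -> Vec<&'static str> {
    let log = RefCell::new(Vec::new());
    {
        let _region = FakeRegion { _fd: FakeFd { log: &log }, log: &log };
    } // `_region` is dropped here: its own drop first, then its fields.
    log.into_inner()
}
```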

The final two methods in MappedRegion (contained in the impl above) provide safe access to the contents of the file. The first, get_slice, returns it as a slice of bytes, spelled here u8.

    pub fn get_slice(&self) -> &[u8] {
        unsafe {
            std::slice::from_raw_parts(self.ptr, self.sz as usize)
        }
    }

The second method returns the contents as a reference to a str, using a potentially failing UTF-8 conversion:

    pub fn get_str<'s>(&'s self) -> Result<&'s str,String> {
        std::str::from_utf8(self.get_slice()).map_err(|e| { format!("{}", e) })
    }

Both of the objects returned by these methods are references with the same lifetime as the MappedRegion, so it should be impossible for pointers into the file contents to leak outside that lifetime. If the program needs some of the contents longer than that, it is up to the program to copy them.
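It is worth checking that the UTF-8 conversion really does not copy anything: std::str::from_utf8 validates the bytes and hands back a str that points at the same memory. Here is a minimal sketch on an ordinary in-memory buffer rather than a mapped file; as_text is a hypothetical free-function version of get_str.

```rust
use std::str;

/// Borrow a byte buffer as text without copying; fails on invalid UTF-8.
fn as_text(bytes: &[u8]) -> Result<&str, String> {
    str::from_utf8(bytes).map_err(|e| format!("{}", e))
}
```

The returned str starts at the same address as the input slice, so any later slicing of the text is still just pointer arithmetic on the original bytes.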

In use

With MappedRegion as a safe interface, using mmap to read files in Rust is relatively easy, modulo the Results:

MappedRegion::mmap(file).and_then(|contents| {
    contents.get_str().map(|text| {
        do_something(text)
    })
}).expect(&format!("cannot read {}", file))

In this chunk, contents is a MappedRegion, representing a pointer-like-thing to the file contents and text is a string reference to the contents themselves. If this chunk is the body of a function, the return value will be the value of do_something (which presumably does something with the file contents) or a panic due to the expect.
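To see the shape of that chain without a real file, here is a sketch that swaps MappedRegion for a made-up InMemoryRegion backed by a Vec (so it does copy, unlike mmap) and uses a word count as the stand-in for do_something. All of the names here are hypothetical; only the and_then/map structure is the point.

```rust
/// Hypothetical stand-in for `MappedRegion`, backed by a Vec instead of mmap.
struct InMemoryRegion {
    bytes: Vec<u8>,
}

impl InMemoryRegion {
    /// Pretend to "map" some bytes; an empty input plays the role of a failed mmap.
    fn load(bytes: &[u8]) -> Result<InMemoryRegion, String> {
        if bytes.is_empty() {
            Err("empty input".to_string())
        } else {
            Ok(InMemoryRegion { bytes: bytes.to_vec() })
        }
    }

    /// Same signature idea as `MappedRegion::get_str`.
    fn get_str(&self) -> Result<&str, String> {
        std::str::from_utf8(&self.bytes).map_err(|e| format!("{}", e))
    }
}

/// Chain the two fallible steps; the result must not borrow from `contents`,
/// because the region is dropped when the closure returns.
fn word_count(bytes: &[u8]) -> Result<usize, String> {
    InMemoryRegion::load(bytes).and_then(|contents| {
        contents.get_str().map(|text| text.split_whitespace().count())
    })
}
```

The count is a plain usize, so it can outlive the region. Returning a slice of text from the closure instead would be rejected by the borrow checker, which is exactly the protection the lifetimes are there to provide.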

Next time: given the file contents, can we diddle-about with word-like tokens in it without copying strings around?

¹ How about this quote from Foundations of Statistical Natural Language Processing:
While it is much better to refer to such a curve as a ‘normal distribution’ than as a ‘bell curve’, if you really want to fit into the Statistical NLP or pattern recognition communities, you should instead learn to refer to these functions as Gaussians, and to remark things like, ‘Maybe we could model that using 3 Gaussians’ at appropriate moments.