# mmap and the Rust Foreign Function Interface

Posted on June 12, 2013 by Tommy McGuire

Calling C code from Rust is (almost) incredibly painless. Rust uses a source-level marker similar to C++'s extern "C" to provide function declarations that link against existing C interfaces. As a result, the only real work is in creating those declarations, along with any needed data structure definitions and special values, and in calling the functions from unsafe Rust code. The Rust Foreign Function Interface Tutorial is a clear and concise guide to (most of) the process.

In this post, I want to provide an example of using that FFI to provide a safe interface to mmap, to map a file into a Rust vector of bytes so that the contents of the file can be accessed without calling read and without copying the data.

mmap creates a new mapping in the process's virtual address space, in this case with the memory in the mapped area backed by the contents of the file. mmap returns the starting address of the area; accessing memory within the mapped area causes the corresponding contents of the file to be paged in, so that to the program the mapped area looks like a buffer into which the file has already been read. (It is possible to modify the contents of the file through the mapping, but I am only interested in reading those contents at the moment.)

Calling mmap requires an open, readable file descriptor on the file and the size of the file, plus some flags which are not especially interesting in this case. The first step is to get a file descriptor, which is possible with Rust's standard library libc module.

    struct FileDescriptor(libc::c_int);

    impl Drop for FileDescriptor {
        fn finalize(&self) { unsafe { libc::close(**self); } }
    }

    unsafe fn open(filename : &str) -> FileDescriptor {
        let fd = do str::as_c_str(filename) |cs| {
            libc::open(cs, libc::O_RDONLY as libc::c_int, 0)
        };
        if fd < 0 {
            fail!(fmt!("failure in open(%s): %s", filename, os::last_os_error()));
        }
        return FileDescriptor(fd);
    }

The std::libc module provides access to many standard C library and system interfaces, types, and values. In this code, libc::open is being called with a C string acquired from a Rust string, and the value libc::O_RDONLY. An error result causes the task to fail! with a (hopefully useful) error message; success returns a FileDescriptor built from the C integer file descriptor. The FileDescriptor structure (containing only an anonymous integer) is introduced as a type to allow the Drop trait to be associated with the file descriptor. This trait implements RAII (Resource Acquisition Is Initialization) resource management: the file will be closed when the file descriptor value goes out of scope or is freed.
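For readers on a current Rust toolchain, where the syntax above no longer compiles, the same RAII idea might look roughly like the sketch below. It is not the code from this post. The hand-written `open` and `close` declarations, the `open_readonly` name, and the choice to return `io::Result` instead of failing are all assumptions made for illustration; in practice the `libc` crate supplies the declarations.

```rust
use std::ffi::CString;
use std::io;
use std::os::raw::{c_char, c_int};

// Hand-written declarations of the C library's open(2) and close(2).
// These are assumptions for this sketch; the `libc` crate provides
// maintained equivalents.
unsafe extern "C" {
    fn open(path: *const c_char, flags: c_int, ...) -> c_int;
    fn close(fd: c_int) -> c_int;
}

// O_RDONLY is 0 on Linux and on the BSD family, including macOS.
const O_RDONLY: c_int = 0;

/// Owns a raw file descriptor and closes it when the value is dropped.
pub struct FileDescriptor(c_int);

impl FileDescriptor {
    /// Returns the underlying descriptor without giving up ownership.
    pub fn raw(&self) -> c_int {
        self.0
    }
}

impl Drop for FileDescriptor {
    fn drop(&mut self) {
        // The result of close is ignored: drop has no way to report it.
        unsafe {
            close(self.0);
        }
    }
}

/// Opens `filename` read-only (hypothetical helper name).
pub fn open_readonly(filename: &str) -> io::Result<FileDescriptor> {
    // A Rust string may contain interior NUL bytes; a C string may not.
    let path = CString::new(filename)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))?;
    let fd = unsafe { open(path.as_ptr(), O_RDONLY) };
    if fd < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(FileDescriptor(fd))
}
```

The descriptor is closed when the `FileDescriptor` goes out of scope, just as with the `finalize` method above; only the trait method's name and signature differ.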

The libc::open and libc::close functions, and this open function, are marked unsafe. The Rust Reference Manual describes unsafety as:

Unsafe operations are those that potentially violate the memory-safety guarantees of Rust's static semantics. Specifically, the following operations are considered unsafe:

• Dereferencing a raw pointer.
• Casting a raw pointer to a safe pointer type.
• Calling an unsafe function.

Unsafe operations can only be invoked inside an unsafe block, as seen in the finalize implementation for the Drop trait.

A block of code can also be prefixed with the unsafe keyword, to permit a sequence of unsafe operations in an otherwise-safe function. This facility exists because the static semantics of Rust are a necessary approximation of the dynamic semantics. When a programmer has sufficient conviction that a sequence of unsafe operations is actually safe, they can encapsulate that sequence (taken as a whole) within an unsafe block. The compiler will consider uses of such code "safe", to the surrounding context.

(I believe the open function to be safe to use, so the unsafe marker on the function could be replaced by an unsafe block inside the function. However, since I am not exposing it in the interface to the mmap module, I do not feel the need to do that at this point.)
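As a small illustration of that encapsulation in current Rust, the sketch below wraps one unsafe foreign call in a safe function. The `strlen` declaration is written by hand and `c_string_length` is a hypothetical name; neither comes from the mmap module.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

unsafe extern "C" {
    // The C library's strlen. Calling it is unsafe because the compiler
    // cannot check that the pointer refers to a NUL-terminated string.
    fn strlen(s: *const c_char) -> usize;
}

/// Safe wrapper: a `&CStr` is always valid and NUL-terminated, so the
/// unsafe call is sound for every argument a caller can construct.
pub fn c_string_length(s: &CStr) -> usize {
    unsafe { strlen(s.as_ptr()) }
}
```

The argument type carries the invariant that the C function needs. That is what justifies hiding the unsafe block from callers.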

The next step is to request the size of the file, using libc::fstat. The fstat function requires a file descriptor and a pointer to an existing, modifiable libc::stat structure. This is the pain point of working with Rust's FFI:

    unsafe fn fstat(fd : libc::c_int) -> libc::stat {
        /* target_arch = "x86_64", target_os = "linux" or target_os = "android" */
        let mut s = libc::stat {
            st_dev        : 0,
            st_ino        : 0,
            st_mode       : 0,
            st_nlink      : 0,
            st_uid        : 0,
            st_gid        : 0,
            st_rdev       : 0,
            st_size       : 0,
            st_blksize    : 0,
            st_blocks     : 0,
            st_atime      : 0,
            st_atime_nsec : 0,
            st_mtime      : 0,
            st_mtime_nsec : 0,
            st_ctime      : 0,
            st_ctime_nsec : 0,
            __pad0        : 0,
            __unused      : [0,0,0]
        };
        if libc::fstat(fd, &mut s) < 0 {
            fail!(fmt!("failure in fstat(): %s", os::last_os_error()));
        }
        return s;
    }
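In more recent Rust, the field-by-field zeroing above can be avoided with `std::mem::MaybeUninit::zeroed`, which produces an all-zero value of any type. The sketch below shows the technique on a made-up structure. `ExampleStat` and its fields are purely illustrative and do not match the `stat` layout of any real platform.

```rust
use std::mem::MaybeUninit;
use std::os::raw::{c_int, c_long};

// A made-up C-compatible structure standing in for `stat`. The fields
// are illustrative only and do NOT match any real platform's layout.
#[repr(C)]
pub struct ExampleStat {
    pub mode: c_int,
    pub size: c_long,
    pub unused: [c_long; 3],
}

/// Returns an all-zero `ExampleStat` without naming each field.
pub fn zeroed_example_stat() -> ExampleStat {
    // Sound here because every field is a plain integer, for which the
    // all-zero bit pattern is a valid value. It would not be sound for
    // a structure containing references, for example.
    unsafe { MaybeUninit::<ExampleStat>::zeroed().assume_init() }
}
```

This removes the need to list every field, including padding. It does not remove the need for a structure definition that matches the target platform.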

The Rust language supports multiple operating systems and multiple architectures, each of which at least potentially lays out data structures differently, even data structures with the same fields. The variable s in my Rust fstat wrapper is a zero-filled, empty stat structure for x86_64 Linux, including padding (the __pad0 field, which actually lies between the st_gid and st_rdev fields) and unused bytes reserved for future use. Mac OS on a 32-bit x86 architecture uses the following, different, structure (from libc.rs):

    pub struct stat {
        st_dev: dev_t,
        st_mode: mode_t,
        st_nlink: nlink_t,
        st_ino: ino_t,
        st_uid: uid_t,
        st_gid: gid_t,
        st_rdev: dev_t,
        st_atime: time_t,
        st_atime_nsec: c_long,
        st_mtime: time_t,
        st_mtime_nsec: c_long,
        st_ctime: time_t,
        st_ctime_nsec: c_long,
        st_birthtime: time_t,
        st_birthtime_nsec: c_long,
        st_size: off_t,
        st_blocks: blkcnt_t,
        st_blksize: blksize_t,
        st_flags: uint32_t,
        st_gen: uint32_t,
        st_lspare: int32_t,
        st_qspare: [int64_t, ..2],
    }

The standardized fields are present, along with additional fields and no padding, in the Mac OS version. The result here, as it stands, is that my mmap code will only work on x86_64 Linux. It would be possible to provide a Rust function which tests the architecture and operating system, in the same way as the libc module, and returns the appropriate, empty stat structure. I would hope that would be added to the standard library at some point. Additionally, it will probably also be necessary for the Rust compiler's build process to read the target operating system's header files to build the libc.rs module, the way GCC's system header files are built on proprietary operating systems; it appears to be built by hand currently.
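In later versions of Rust, the standard library exposes file metadata directly, so the size can be obtained without building a `stat` structure by hand. This is a different route from the one taken in this post, sketched here only to show the portable alternative; `file_size` is a hypothetical helper name.

```rust
use std::fs::File;
use std::io;

/// Returns the size in bytes of the file at `path`, leaving the choice
/// of platform-specific `stat` layout to the standard library.
pub fn file_size(path: &str) -> io::Result<u64> {
    let file = File::open(path)?;
    Ok(file.metadata()?.len())
}
```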

One further thing is needed before writing the code to wrap the mmap interface: declarations of the system's C mmap function and the necessary values:

    pub mod raw {
        pub extern {
            unsafe fn mmap(addr : *libc::c_char, length : libc::size_t,
                           prot : libc::c_int,   flags  : libc::c_int,
                           fd   : libc::c_int,   offset : libc::off_t) -> *u8;
            unsafe fn munmap(addr : *u8, length : libc::size_t) -> libc::c_int;
        }

        /* From /usr/include/asm-generic/mman-common.h on Linux */

        /* prot values */
        pub static PROT_NONE   : libc::c_int = 0x0;
        pub static PROT_READ   : libc::c_int = 0x1;
        pub static PROT_WRITE  : libc::c_int = 0x2;
        pub static PROT_EXEC   : libc::c_int = 0x4;
        // ...

        /* flags */
        pub static MAP_SHARED  : libc::c_int = 0x1;
        pub static MAP_PRIVATE : libc::c_int = 0x2;
        // ...
    }

This code defines a sub-module with the unsafe mmap and munmap declarations and values for the prot and flags arguments.

The contents of libc.rs are very similar to this, by the way.

Given those declarations, a simple, unsafe wrapper for mmap can be built using the same RAII techniques used for the file descriptor above:

    struct MappedRegion {
        reg : *u8,
        siz : libc::size_t
    }

    impl Drop for MappedRegion {
        fn finalize(&self) {
            unsafe {
                if raw::munmap(self.reg, self.siz) < 0 {
                    fail!(fmt!("munmap(): %s", os::last_os_error()));
                }
            }
        }
    }

    unsafe fn mmap(fd : libc::c_int, size : libc::size_t) -> MappedRegion {
        let buf = raw::mmap(0 as *libc::c_char, size, raw::PROT_READ, raw::MAP_SHARED, fd, 0);
        if buf == -1 as *u8 {
            fail!(fmt!("mmap(): %s", os::last_os_error()));
        }
        return MappedRegion { reg : buf, siz : size };
    }

This mmap function calls raw::mmap (i.e. the function supplied by the system's standard library) with the arguments necessary to create a mapped region containing the readable contents of the file. The mapped region will be unmapped when the MappedRegion value goes out of scope or is freed.
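A rough equivalent of this wrapper in current Rust is sketched below. It is not the module's code. The declarations are written by hand and assume a 64-bit Linux or macOS target, where `off_t` is 64 bits wide and `PROT_READ` and `MAP_SHARED` have the values shown. The names `map_readonly` and `as_slice` are hypothetical, and taking a `std::fs::File` instead of a raw descriptor is a choice made for the sketch.

```rust
use std::fs::File;
use std::io;
use std::os::raw::{c_int, c_void};
use std::os::unix::io::AsRawFd;
use std::slice;

// Hand-written declarations (assumption: off_t is i64, as on 64-bit
// Linux and macOS). The `libc` crate provides maintained versions.
unsafe extern "C" {
    fn mmap(addr: *mut c_void, length: usize, prot: c_int, flags: c_int,
            fd: c_int, offset: i64) -> *mut c_void;
    fn munmap(addr: *mut c_void, length: usize) -> c_int;
}

const PROT_READ: c_int = 0x1;
const MAP_SHARED: c_int = 0x1;

/// A read-only mapping that is unmapped when the value is dropped.
pub struct MappedRegion {
    addr: *mut c_void,
    len: usize,
}

impl MappedRegion {
    /// Maps the whole of `file` read-only.
    pub fn map_readonly(file: &File) -> io::Result<MappedRegion> {
        let len = file.metadata()?.len() as usize;
        if len == 0 {
            // mmap rejects zero-length mappings, so report that up front.
            return Err(io::Error::new(io::ErrorKind::InvalidInput,
                                      "cannot map an empty file"));
        }
        let addr = unsafe {
            mmap(std::ptr::null_mut(), len, PROT_READ, MAP_SHARED,
                 file.as_raw_fd(), 0)
        };
        // MAP_FAILED is (void *) -1.
        if addr as isize == -1 {
            return Err(io::Error::last_os_error());
        }
        Ok(MappedRegion { addr, len })
    }

    /// Borrows the mapped bytes; the borrow cannot outlive the region.
    /// Assumes no other process truncates the file while it is mapped.
    pub fn as_slice(&self) -> &[u8] {
        unsafe { slice::from_raw_parts(self.addr as *const u8, self.len) }
    }
}

impl Drop for MappedRegion {
    fn drop(&mut self) {
        // Errors from munmap are ignored in this sketch.
        unsafe {
            munmap(self.addr, self.len);
        }
    }
}
```

Because `as_slice` ties the slice's lifetime to `&self`, the borrow checker rejects any use of the bytes after the region has been dropped.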

However, this mmap function is not for public consumption, because the MappedRegion is not associated with the file descriptor. According to the Linux mmap man page, "closing the file descriptor does not unmap the region," which I interpret to mean that closing the descriptor would leave the region valid. On the other hand, I have no idea if that behavior is standard, at all reliable, or even a valid assumption. It would be safer to link the two, so that the lifetimes of the mapped region and the file descriptor were safely nested. Further, a MappedRegion contains an unsafe, raw C pointer, which is not acceptable as a good Rust interface. To polish off both problems, the public interface is:

    pub fn with_mmap_file_contents<U>(filename : &str, f : &fn(v : &[u8]) -> U) -> U {
        unsafe {
            let fd = open(filename);
            let st = fstat(*fd);
            let buf = mmap(*fd, st.st_size as libc::size_t);
            return vec::raw::buf_as_slice(buf.reg, buf.siz as uint, f);
        }
    }

The function with_mmap_file_contents accepts a file name string and a function which in turn accepts a reference to a vector of u8 bytes. The function argument returns a value of type U; this value is in turn returned by with_mmap_file_contents. The file is safely unmapped and closed as with_mmap_file_contents returns. Most of the heavy lifting is provided by the library function, vec::raw::buf_as_slice, which also provided the inspiration for with_mmap_file_contents' interface.
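The nesting of the two lifetimes depends on the order in which local values are destroyed. In current Rust, locals are dropped in reverse order of declaration, so a region declared after the descriptor is released first. The sketch below demonstrates that ordering with stand-in values instead of real system resources; `Tracked` and `with_two_resources` are hypothetical names.

```rust
use std::cell::RefCell;

/// Stand-in for a resource such as a file descriptor or mapped region.
/// It records its name in a shared log when it is dropped.
pub struct Tracked<'a> {
    name: &'static str,
    log: &'a RefCell<Vec<&'static str>>,
}

impl<'a> Drop for Tracked<'a> {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Mirrors the shape of `with_mmap_file_contents`: acquire two
/// resources, lend a borrowed byte slice to the closure, and release
/// both resources on return.
pub fn with_two_resources<U>(
    log: &RefCell<Vec<&'static str>>,
    f: impl FnOnce(&[u8]) -> U,
) -> U {
    let _fd = Tracked { name: "fd", log };
    let _buf = Tracked { name: "buf", log };
    // The closure's result is computed first; then `_buf` is dropped,
    // and finally `_fd`, in reverse order of declaration.
    f(&b"contents"[..])
}
```

After a call, the log holds "buf" followed by "fd": the stand-in for the mapping is released before the stand-in for the descriptor.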

An example of using with_mmap_file_contents is the following simple implementation of the Unix cat utility:

    extern mod mmap;

    fn main() {
        let args = os::args();
        do mmap::with_mmap_file_contents(args[1]) |b| {
            print(str::from_bytes(b));
        }
    }

    $ rustc --lib ./mmap.rs
    $ rustc -L. mmap-test.rs
    $ ./mmap-test /etc/motd
    Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-45-generic x86_64)
     * Documentation: https://help.ubuntu.com/
    $ ./mmap-test /etc/mot
    rust: task failed at 'failure in open(/etc/mot): No such file or directory', mmap.rs:29

The complete module can be found on GitHub. For the use of with_mmap_file_contents, see my next post.

Out of curiosity, why not set the return type of open() to Option, instead of having it fail if fd < 0?

'2013-06-15T21:19:29.582-05:00'

Apparently my post got HTML-escaped. Of course, I meant to type Option<FileDescriptor>.

'2013-06-15T21:21:52.863-05:00'

That's a good idea, although I would probably want to use Result<FileDescriptor,~str>, which can contain an error message (the ~str) if fd < 0. Many of Rust's io functions use Result, such as the corresponding file_reader that returns a Reader, although Reader.read fails if reading returns an actual error.

In this case, I didn't use Result, because I wanted to keep the final interface simple and because failure suits my use case. If I did, the with_mmap_file_contents type would be:

pub fn with_mmap_file_contents<U>(filename : &str, f : &fn(v : &[u8]) -> U) -> Result<U,~str>

Tommy McGuire
'2013-06-16T17:31:01.519-05:00'

Yep, Result <> is actually better, as you pointed out. It occurred to me after I posted my comment :)