Wednesday, January 27, 2016

Letterpress cheating in Rust 1.6: How long has it been?!?

Has it been over a year since I ran these crappy benchmarks? Really?

I have upgraded all of the assorted toy Rust programs, my ports of Jeff Knupp's Creating and Optimizing a Letterpress Cheating Program in Python, to Rust 1.6.0. I also re-executed the notoriously crappy benchmarks.

Language Program Rust 0.6
Duration
(seconds)
Rust 0.8
Duration
(seconds)
Rust 0.9
Duration
(seconds)
Rust 0.11
Duration
(seconds)
Rust 1.6.0
Duration
(seconds)
Python alternatives/presser_one.py 49.1 47.8 39.0 59.4 41.4
Python alternatives/presser_two.py 12.8 12.3 11.6 17.2 13.0
Nim alternatives/nimrod_anagrams 12.3 18.0 11.6
Rust anagrams-hashmap-wide 9.3 12.1 19.6 15.7 8.0
Rust anagrams-vectors 8.0 11.9 8.1 11.0 6.5
Rust anagrams-vectors-wide 11.8 12.2 16.8 12.4 6.1
C alternatives/anagrams-vectors 8.0 5.8 6.0 9.6 5.7
Python alternatives/presser_three.py 6.0 6.0 5.8 8.3 5.2
Rust anagrams-vectors-tasks 27.1 4.2 4.6 7.7 3.5
Rust anagrams-hashmap 6.0 7.2 7.0 9.3 3.2
Rust anagrams-djbhash-tasks 6.2 5.5 2.8
Rust anagrams-djbhashmap 2.8 2.5 2.2
Rust anagrams-hashmap-mmap 4.8 7.3 6.3 2.9 2.1
C alternatives/anagrams-hash 0.9 1.0 0.9 1.4 1.1

The programming languages and versions for this run are:

  • Python: Python 2.7.9, with Python 2.7.3 through 2.7.6 for previous versions.
  • C: gcc 4.9.2, with 4.6.3 through 4.8.2 for the prior runs, all with -O3.
  • Nim: Nim 0.13.0, with Nimrod 0.9.2 and 0.9.4 on prior runs, compiled with -d:release.
  • Rust: Rust 1.6.0, compiled with -O.

Sunday, January 24, 2016

unwrap!

Speaking of macros, there's a problem with Option::unwrap. Now, I only use unwrap in a couple of instances: when I'm slapping something together to try something out and don't want to worry about the error cases, and when I'm absolutely, positively sure it won't fail. The problem is when unwrap does fail, it panics with an error message somewhere in option.rs, which is slightly unhelpful. When that happens while I'm slapping something together, I now have to replace all the unwrap calls with something that'll tell me what just failed. (When it happens otherwise, I have bigger problems.)

So, I've come to a conclusion. unwrap should be a macro, so it can create an error message that points to what is failing.

macro_rules! unwrap {
    ($e:expr) => {
        match $e {
            Some(v) => v,
            None => panic!(concat!(stringify!($e), ": unwrap! produced None"))
        }
    }
}

(Unfortunately, I'll need a different macro to handle Result.)

We now return you to whatever you were doing before. Thanks for your attention.

Thursday, January 21, 2016

Another Rust spot: delegation

Yeah, I know: I lied. In the last post, I wrote, "Next time: given the file contents, can we diddle-about with word-like tokens in it without copying strings around?" and yet I'm not writing about that now. Sorry. Maybe next time. This seemed more important.

Anyway, here's the deal: suppose you have a type Basic which has a few methods and perhaps an implementation of a trait:

struct Basic {
    some_field: usize,
}

impl Basic {
    pub fn new() -> Basic { Basic { some_field: 4} }
    pub fn foo(&self) -> usize { self.some_field - 1 }
    pub fn bar(&self, i: usize) -> f64 { self.some_field as f64 / i as f64 }
    pub fn baz(&mut self) -> usize { self.some_field }
}

trait A {
    fn quux(&self) -> usize;
}

impl A for Basic {
    fn quux(&self) -> usize { self.some_field }
}

And you have a type, Extended, that has a field of type Basic:

pub struct Extended {
    inner: Basic,
    _red_herring: usize,
}

impl Extended {
    pub fn new() -> Extended {
        Extended { inner: Basic::new(), _red_herring: 12 } }
}

What you want is for Extended to support the same set of methods (and the trait) as Basic, with the methods being delegated to the Basic field.

There are, currently, three ways to deal with this:

  • Write a bunch of trivial methods for Extended like,

    pub fn foo(&self) -> { self.inner.foo() }

    But that is rather tedious.

  • Wait for a compiler extension or change to the Rust language to support a "delegate" attribute or something. Like maybe RFC 292: allow delegating some methods from an trait impl to a field of a struct #292.

    Now, I'm lazy, so that's not a bad plan, but still, I would like to get some work done now.

  • Drag out some macro-fu, polish it up bright and shiny, and lay waste to every village and small township between here and the sea! Remember: pillage, then burn.

    There are undoubtedly some weaknesses here, and it's going to have to be a bit more verbose than the second option (and I won't get a nap), but it's what I'm doing. Otherwise, this wouldn't be much of a blog post, would it?

The macro

Without further ado, here's the macro:

#[macro_export]
macro_rules! delegate {
    ( $fld:ident : ) => {
    };

    ( $fld:ident : $fcn:ident ( $( $a:ident : $at:ty ),* ) -> $r:ty ) => {
        fn $fcn ( &self, $( $a : $at ),* ) -> $r {
            (self.$fld).$fcn( $( $a ),* )
        }
    };

    ( $fld:ident : $fcn:ident ( $( $a:ident : $at:ty ),* ) -> $r:ty, $($rest:tt)* ) => {
        fn $fcn ( &self, $( $a : $at ),* ) -> $r {
            (self.$fld).$fcn( $( $a ),* )
        }
        delegate!($fld : $($rest)*);
    };

    ( $fld:ident : pub $fcn:ident ( $( $a:ident : $at:ty ),* ) -> $r:ty ) => {
        pub fn $fcn ( &self, $( $a : $at ),* ) -> $r {
            (self.$fld).$fcn( $( $a ),* )
        }
    };

    ( $fld:ident : pub $fcn:ident ( $( $a:ident : $at:ty ),* ) -> $r:ty, $($rest:tt)* ) => {
        pub fn $fcn ( &self, $( $a : $at ),* ) -> $r {
            (self.$fld).$fcn( $( $a ),* )
        }
        delegate!($fld : $($rest)*);
    };

    ( $fld:ident : mut $fcn:ident ( $( $a:ident : $at:ty ),* ) -> $r:ty ) => {
        fn $fcn ( &mut self, $( $a : $at ),* ) -> $r {
            (self.$fld).$fcn( $( $a ),* )
        }
    };

    ( $fld:ident : mut $fcn:ident ( $( $a:ident : $at:ty ),* ) -> $r:ty, $($rest:tt)* ) => {
        fn $fcn ( &mut self, $( $a : $at ),* ) -> $r {
            (self.$fld).$fcn( $( $a ),* )
        }
        delegate!($fld : $($rest)*);
    };

    ( $fld:ident : pub mut $fcn:ident ( $( $a:ident : $at:ty ),* ) -> $r:ty ) => {
        pub fn $fcn ( &mut self, $( $a : $at ),* ) -> $r {
            (self.$fld).$fcn( $( $a ),* )
        }
    };

    ( $fld:ident : pub mut $fcn:ident ( $( $a:ident : $at:ty ),* ) -> $r:ty, $($rest:tt)* ) => {
        pub fn $fcn ( &mut self, $( $a : $at ),* ) -> $r {
            (self.$fld).$fcn( $( $a ),* )
        }
        delegate!($fld : $($rest)*);
    };

}

Whew. I haven't written that many funny characters since the last time I did Perl professionally.

This is actually a pretty simple recursive macro with a great number of repeated elements. A basic invocation would be:

delegate!{
    inner:
    foo() -> usize
}

This little guy produces:

fn foo(&self) -> usize { inner.foo() }

What the macro does is to take the name of a field as a first argument and a list of simplified method declarations as the remaining arguments, and then produces a sequence of method definitions that delegate the matching methods to the field.

The minimum information required to generate the definitions is:

  • The field that is the destination of the delegation (with this macro, it has to be a named field; structures with a single anonymous ".0" name aren't supported),

  • The name of the method,

  • The types (and names) of the method's arguments and return value, and

  • A couple of markers that I'll get into below.

Since this is a simple syntax extension, all of those have to be present in the call to delegate. Let me break down the macro.

  • The first rule is a simple default which expands to nothing if the body of the invocation is empty (or commented out).

  • The remainder of the rules are in four pairs:

    • One with no extra markers, as in the example above,

    • One with a pub marker, producing a method with pub visibility,

    • One with a mut marker, producing a method where the reference to self is mutable, and

    • One with both pub and mut markers.

  • Each of the pairs consists of a base case, producing a single method as an expansion, and a recursive case, producing a method and then re-invoking the macro on the next method type specification.

  • In each rule, $fld is an identifier, the field specifier (which only needs to be given once; it's passed along to the recursive calls). $fcn is another identifier and is the function name. $a (an identifier) and $at (a type) are the argument names and types of any method arguments (other than self). $r is the return type of the method. I wonder if you can specify () as the return type to get a procedure that does not return anything. Possibly another weakness of this monstrosity.

Finally, in the recursive rules, $rest is the remainder of the macro invocation. To tell the truth, I'm not entirely sure what a tt is or why you have to have a sequence of them for the recursive calls; I copied it from the example in the book. Yay, cargo-culting!

Finally, here are some examples of it in use.

impl Extended {
    delegate!{
        inner:
        pub foo() -> usize,
        pub bar(i:usize) -> f64,
        pub mut baz() -> usize
    }
}

This provides delegate methods for Extended, for methods foo, bar, and baz; the last of which needs a mutable reference to self.

impl A for Extended {
    delegate!{
        inner:
        quux() -> usize
    }
}

This call delegates quux and provides an implementation of the A trait, as well as demonstrating a case where you don't want pub: it gets an error in a trait implementation.

Saturday, January 9, 2016

A Rust Spot: mmap

Since I have recently been fiddling with natural language programming, I decided I ought to learn a bit about the topic. The textbook I picked is Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schütze. (The alternative, Speech and Language Processing, by Daniel Jurafsky and James H. Martin, while it is more popular as a textbook and more recent, did not seem to be as well written from the excerpts I’ve read.1) So, I’ve been working my way through the book slowly, writing the example programs in Rust.

I chose Rust because I enjoy the language, and because I am trying to write code with better performance than is available in Java or Python, two other popular languages in this arena that I also like very much. But I’m an old systems guy, so my biases are, well, biased.

To that end, I would like to be able to read a text file into memory without copying the data. That’s where mmap enters the picture.

About mmap

The Unix system call mmap provides more-or-less general access to the operating system’s memory memory management system. You can use it, to read and write files, to allocate a block of memory, to share memory between processes, and probably some other stuff that I’m forgetting. It cures pellagra, asthma, and kills dust-mites. It’s not quite a dessert topping and a floor wax (for that, see ioctl), but it can be pretty useful.

For this task, I intend to use it solely for reading file contents into memory, for which it is pretty easy to use. One note: writing to the file is also pretty trivial, but extending the file requires quite a few more dance steps. While mmap is providing an interface to the system’s file system memory buffers and most file systems write additional data to the end of the file by slapping additional pages on the end of the buffer mapping and scribbling on them, this is not guaranteed to work by the mmap documentation and can cause bad things to happen on some systems. (Like deadlocking the file system into a giant tar-baby, causing any process that tries to do anything to it to get stuck and requiring a reboot to clean up. That’s NFS on AIX 3.2.5, by the way.)

Anyway, the basic process is this: get an open, readable file descriptor for the file, call mmap to make the contents of the file look like a block of memory, have your way with them, then munmap the memory and close the file descriptor.

mmap in Rust: part one

An interface to mmap is part of the Rust libc library, which is currently unstable-ish in my copy of Rust, so the first step is to add a dependency on libc to the Cargo.toml file. The next step is to provide access to a file descriptor:

// A file descriptor, open for reading.
struct FileDescriptor(libc::c_int);

impl FileDescriptor {
    unsafe fn open(filename: &str) -> Result<FileDescriptor,String> {
        if let Ok(file) = std::ffi::CString::new(filename) {
            let fd = libc::open(file.as_ptr(), libc::O_RDONLY, 0);
            if fd >= 0 {
                Ok( FileDescriptor(fd) )
            } else {
                Err( format!("failure in open({}): {}",
                    filename,
                    std::io::Error::last_os_error()) )
            }
        } else {
            Err( format!("failure getting CString: {}", filename) )
        }
    }
}

A file descriptor is an integer that references system structures associated with the process. Most of the complexity here is in the interface to the C functions: converting the file name string to a CString and calling libc::open with assorted necessary error handling.

Closing the file descriptor when it is no longer needed is safely handled with an implementation of the Drop trait which closes the file descriptor.

impl Drop for FileDescriptor {
    fn drop(&mut self) {
        let FileDescriptor(fd) = *self;
        unsafe {
            libc::close(fd);
        }
    }
}

In order to map the contents of the file, the program needs to know how big the file is, so the following code uses libc::stat:

impl FileDescriptor {
    unsafe fn get_size(&self) -> Result<libc::size_t,String> {
        let FileDescriptor(fd) = *self;
        let mut stat: libc::stat = std::mem::zeroed();
        if libc::fstat(fd, &mut stat) < 0 {
            Err( format!("failure in fstat(): {}", std::io::Error::last_os_error()) )
        } else {
            Ok( stat.st_size as libc::size_t )
        }
    }

    fn get_fd(&self) -> &libc::c_int { &self.0 }
}

The one interesting part of get_size is the call to std::mem::zeroed to allocate the libc::stat structure, which is both large (as in, it has many elements) and very system dependent. The call to zeroed happily gives me the structure in a good state very simply.

Much of this code is declared unsafe, because it’s using the C api. However, the only really unsafe part is the management of the file descriptor, which is an integer and can easily leak outside of the code where it is open and in a good state. As a nod to this danger, get_fd returns a borrowed pointer to the integer rather than the integer itself, a pointer which is only valid for the duration of the lifetime of the FileDescriptor.

mmap in Rust: part two

The safe, useful external interface to an mmap-ed region of memory is MappedRegion, a structure containing the file descriptor for the file (which must remain open while the region is valid and be closed immediately after), the size of the region (which will be needed when the region is unmapped), and a pointer to the memory that contains the file contents (more precisely, that is backed by the file).

pub struct MappedRegion {
    _fd: FileDescriptor,
    sz: libc::size_t,
    ptr: *mut u8,
}

impl MappedRegion {

    pub fn mmap(filename: &str) -> Result<MappedRegion,String> {
        unsafe {
            match FileDescriptor::open(filename) {
                Ok(fd) => map(fd),
                Err(e) => Err(e)
            }
        }
    }
    ...
}

Actually mapping the file, once the file descriptor has been opened, is the purpose of the unsafe map function:

unsafe fn map(fd: FileDescriptor) -> Result<MappedRegion,String> {
    match fd.get_size() {
        Ok(sz) => {
            let address = libc::mmap(0 as *mut libc::c_void,
                                     sz as u64,
                                     libc::PROT_READ,
                                     libc::MAP_PRIVATE,
                                     *fd.get_fd(),
                                     0);
            if address < 0 as *mut libc::c_void {
                Err( format!("failure in mmap(): {}",
                             std::io::Error::last_os_error()) )
            } else {
                Ok( MappedRegion {
                    _fd: fd,
                    ptr: address as *mut u8,
                    sz: sz,
                })
            }
        }
        Err(e) => { Err(e) }
    }
}

The function calls libc::mmap with the following arguments:

  1. A 0. This is a hint to the operating system about where in the processes’ address space the mapped region should be located; a zero says the program does not care.

  2. The size of the file. It’s possible to map a smaller segment of the file but not a larger, by the way.

  3. A protection flag of PROT_READ. This indicates that the program is interested in reading the contents of the file, but not in writing or executing them. (Yeah, you can dynamically load code this way.)

  4. A sharing flag of MAP_PRIVATE. This indicates that changes to the file will not be visible to other processes mapping the same file or be written to the file. (Such changes should not be possible with PROT_READ anyway, but something needs to be specified here.) This argument also possibly takes other flags to request other magic from the operating system, none of which are needed for this case.

  5. The file descriptor. Yeah, it had to be in here somewhere.

  6. An offset of 0. If you were mapping a smaller segment of the file, this would allow you to specify where the offset started in the file.

Simple? Clear? Easy? (Questions? Comments? Drop slips? (Hi, Mittens!))

Unmapping the file when the program is done is the task of another implementation of the Drop trait:

impl Drop for MappedRegion {
    fn drop(&mut self) {
        unsafe {
            if libc::munmap(self.ptr as *mut libc::c_void, self.sz) < 0 {
                panic!("munmap: {}", std::io::Error::last_os_error());
            }
        }
    }
}

The arguments are the pointer of the mapped region and the size. Since Drop doesn’t have a way to propagate errors, this code panics. Normally, my practice would be to swallow the error (What, exactly, am I supposed to do about a failure in munmap?), but here I suspect such errors will be sufficiently rare (and important, or at least interesting) to report somehow.

The final two methods in MappedRegion (contained in the impl above) provide safe access to the contents of the file. The first, get_slice, returns it as a slice of bytes, spelled here u8.

    pub fn get_slice(&self) -> &[u8] {
        unsafe {
            std::slice::from_raw_parts(self.ptr, self.sz as usize)
        }
    }

The second method returns the contents as a reference to a str, using a potentially failing UTF-8 conversion:

    pub fn get_str<'s>(&'s self) -> Result<&'s str,String> {
        std::str::from_utf8(self.get_slice()).map_err(|e| { format!("{}", e) })
    }

Both of the objects returned by these methods are references with the same lifetime as the MappedRegion, so it should be impossible for pointers into the file contents to leak outside that lifetime. If the program needs some of the contents longer than that, it is up to the program to copy them.

In use

With MappedRegion as a safe interface, using mmap to read files in Rust is relatively easy, modulo the Results:

MappedRegion::mmap(file).and_then(|contents| {
    contents.get_str().map(|text| {
        do_something(text)
}).expect(&format!("cannot read {}", file))

In this chunk, contents is a MappedRegion, representing a pointer-like-thing to the file contents and text is a string reference to the contents themselves. If this chunk is the body of a function, the return value will be the value of do_something (which presumably does something with the file contents) or a panic due to the expect.

Next time: given the file contents, can we diddle-about with word-like tokens in it without copying strings around?


  1. How about this quote from Foundations of Statistical Natural Language Processing:

    While it is much better to refer to such a curve as a ‘normal distribution’ than as a ‘bell curve’, if you really want to fit into the Statistical NLP or pattern recognition communities, you should instead learn to refer to these functions as Gaussians, and to remark things like, ‘Maybe we could model that using 3 Gaussians’ at appropriate moments.

Monday, December 7, 2015

An ode to black humor

Last night I came to a realization:

Black humor embodies the highest virtue of the human spirit. Standing on the gallows mocking the hangman because he cannot tie his shoes demonstrates a fine disinclination to be defeated. Given that the forces arrayed against a person are many and cannot be overcome, a bleak joke is as formidable a response as can be mounted and one as likely to succeed as any.

In other news, I suspect there is a race condition in the Stanford Natural Language Processing Java library. Also, I'm not a poet.