Letterpress cheating in Rust: mmap, the dictionary, and borrowed pointers

Posted on June 17, 2013 by Tommy McGuire
Labels: rust, c, programming language

In some recent posts, I have been playing with an anagram program to explore the Rust programming language. One issue that consistently has bothered me is the speed difference between the hashmap-based Rust version and the hashing-based C version. In my most recent timing check, the Rust version is over 6 times slower than the C version:

LanguageProgramTime (sec)
Rustanagrams-hashmap5.98
Calternatives/anagrams-hash0.94

That is not the whole story, though. The Rust version is about 60 relatively straightforward lines, while the C version is about 150 lines of hairier code. Further, there are some differences in the way they work.

The Rust version reads the dictionary file into a hashmap made up of owned vectors of bytes and owned vectors of strings. The C version mmap's the file into memory and builds a custom hashtable structure around it (which is one of the reasons it is significantly longer). The question I am attempting to answer in this post is whether reading the file, allocating space for the hashmap and associated keys and values, and copying the file data is responsible for, or at least a major component of, the speed difference.

With that in mind, I slapped together a wrapper for mmap, then begged for help from Niko Matsakis to get a working hashmap based on the byte-buffer. The result is actually pretty nice: not grossly bigger or more complex, although it does have a bit more type-hackery.

If you go back to the post on mmap, the public interface to it is:


do mmap::with_mmap_file_contents("anadict-rust.txt") |buf| {
...
}

The type of buf in that block is &[u8], a borrowed pointer to a vector of bytes. The key to this approach is to build a HashMap whose keys and values are vector slices backed by that buf vector. The function which builds the HashMap is:


fn line_map<'b>(buffer : &'b [u8]) -> HashMap<&'b [u8],&'b [u8]> {
let length = buffer.len();
let mut map = HashMap::new();
let mut i = 0;
while i < length {
let mut j = i;
while j < length && buffer[j] != ' ' as u8 { j += 1; }
let mut k = j+1;
while k < length && buffer[k] != '\n' as u8 { k += 1; }
map.insert(vec::slice(buffer, i, j), vec::slice(buffer, j+1, k));
i = k + 1;
}
return map;
}

The first thing to notice about this function is the type. It accepts an argument of a reference to a vector with a lifetime 'b and returns a HashMap whose keys and values are vectors with the lifetime 'b. This is exactly what we want: the keys and values cannot live longer than the buffer that they are actually pointing into, using some trickery from the vec module.

Now, the format of the dictionary file is a series of lines:


key space word space word ... newline

With that in mind, at the point of the call to map.insert(), i is the index of the first character on a line, j is the index of the first space on the line, and k is the index of the newline at the end of the line. vec::slice() creates "a slice that points into another slice", so the arguments to map.insert() are the key at the beginning of the line, up to the first space, and the remainder of the line, from the character following the space up to the newline. And no copies of the data from the file are made.

The actual work of the program, checking each combination of letters, is handled by the search function:


fn search<'b>(letters : &[u8], dictionary : &'b HashMap<&'b [u8],&'b [u8]>) -> HashSet<&'b [u8]>
{
let mut set = HashSet::new();
for uint::range(2, letters.len() + 1) |i| {
let mut key = MapKey(vec::from_elem(i, 0));
for combinations::each_combination(letters,i) |combo| {
for combo.eachi |j,&ch| { key[j] = ch; }
dictionary.find_equiv(&key).map(|&v| set.insert(*v));
}
}
return set;
}

The search function is simple enough, with the exception of the final line of the inner block. There is a major difficulty in using this HashMap: the type of the keys includes the lifetime parameter. As Niko wrote in his response, "the key [of the HashMap] is `&'b [u8]` but you can't supply precisely that type, because the key you are using to do the lookup has a shorter lifetime." In fact, the key only exists during the uint::range block. As a result, the find method, which has the type fn find<'a>(&'a self, k: &K) -> Option<&'a V> for a HashMap<K,V>, is not applicable. Niko suggested the find_equiv method (which I'd missed entirely) as well as the MapKey workaround because the equivalence I need is not currently supported.

The find_equiv method has the following type:


fn find_equiv<'a, Q:Hash + Equiv<K>>(&'a self, k: &Q) -> Option<&'a V> {

The argument, a Q, should be equivalent to the type of the keys, as indicated by Equiv<K>, and should hash consistently with that equivalence. (Niko adds the following advice: "Ah, one thought, you may want to be more careful with the def'n of IterBytes on MapKey to be sure that the hashcodes are the same," which does not appear to be a problem at the moment.) The remainder of the solution is to introduce the MapKey type to hang the Equiv trait on.


#[deriving(IterBytes)]
struct MapKey(~[u8]);

impl<'self> Equiv<&'self [u8]> for MapKey {
fn equiv(&self, other: & &'self [u8]) -> bool {
let slice1: &[u8] = **self;
let slice2: &[u8] = *other;
slice1 == slice2
}
}

So the punchline is, when faced with a type problem, to add more types.

In any case, the main function incorporating with_mmap_file_contents, line_map, and search is:


fn main() {
let args = os::args();
if args.len() < 2 {
fail!(~"Usage: anagrams letters");
}
let letters = get_letters(args[1]);
do mmap::with_mmap_file_contents("anadict-rust.txt") |buf| {
let map = line_map(buf);
let set = search(letters, &map);
let mut count = 0;
for set.each |ln| {
count += 1 + vec::count(*ln, &(' ' as u8));
}
println(fmt!("%?", count));
}
}

(Ok, so rather than assembling a sorted result set of words, this code cheats and just counts them. So does the C version, for that matter, and anyway, manipulating the final results does not seem to be a major component of the time taken by the program.)

While from the standpoint of learning Rust, this adventure was a terrific success, from a performance standpoint, the results of this experiment are slightly disappointing.

LanguageProgramTime (sec)
Rustanagrams-hashmap5.21
Rustanagrams-hashmap-mmap4.75
Calternatives/anagrams-hash0.93

Those numbers are a current run with a (slightly outdated) Rust incoming. The difference with mmap is significant, but not overwhelming.

I wonder if Rust's rather heavyweight hashing could be a bottleneck?

The code for anagrams-hashmap-mmap.rs, along with all of the rest of this stuff, are available on github.

active directory applied formal logic ashurbanipal authentication books c c++ comics conference continuations coq data structure digital humanities Dijkstra eclipse virgo electronics emacs goodreads haskell http java job Knuth ldap link linux lisp math naming nimrod notation OpenAM osgi parsing pony programming language protocols python quote R random REST ruby rust SAML scala scheme shell software development system administration theory tip toy problems unix vmware yeti
Member of The Internet Defense League
Site proudly generated by Hakyll.