A faster hashmap in Rust

Posted on October 12, 2013 by Tommy McGuire
Labels: software development, rust, programming language

If you take a look at the most recent Letterpress cheating in Rust results, you may notice something odd: Python, an interpreted, "scripting" language, is faster (or at least not slower) than Rust, a "systems" language that compiles to raw, executable, machine code. The head-to-head comparison is between alternatives/presser_three.py, taking 6 seconds, and anagrams-hashmap, currently taking 7.2 seconds. There are several reasons for this.

First, most of the Python code is actually optimized C. The implementation of the combination generator is in C (check around line 2291), and the dictionary (/hashmap/hashtable; this data structure has a lot of names) is a very important part of Python and is heavily optimized C. (dictobject.c, dictobject.h, dictnotes.txt)

Second, the Rust standard library's hash module implements SipHash 2-4, a cryptographically strong keyed hash.

Why is SipHash good (and coincidentally, why is Rust using it)? If you used, say, a non-cryptographically-strong hash as part of a hashmap to store HTTP request headers or parameters from a HTTP POST in an HTTP server, a denial of service attack can be very effective, simply by inserting a bunch of headers or parameters which hash to the same bucket and thereby make your hashmap behave like a very expensive linked list. A strong keyed hash avoids that, by making it difficult to predict what is going to hash to the same bucket. And, SipHash is a good strong hash function.

(Some notes on that link: "pyrhash" is reportedly "Python's new randomized hash, with a 128-bit key and 64-bit output", which may-or-may-not be the same hash function that is enabled by Python's -R argument (see the bug discussion). The weird part is that, according to my measurements, Python with -R is as fast as Python without -R, which may-or-may-not mean that the measured implementation of SipHash is faster than Python's default hash function.)

So, why is SipHash bad? If you ignore that last paragraph (which seems entirely ignorable for the moment), SipHash is fairly slow in comparison to a normal, weak hash function. According to my quick and dirty benchmark, it is about 2.6 times slower than my favorite DJB hash function, in fact. The DJB hash function has a long history and is both simple and fairly decent. Neatly, D. J. Bernstein is the "DJB" in djbhash, and also is a co-author of SipHash.

Fair enough. What happens if we replace SipHash with djbhash in Rust's hashmap? Well, unfortunately, it is not that simple. Replacing the default, SipHash, implementation of the Hash trait seems to make the Rust compiler unhappy. The default implementation is based on another trait, IterBytes, which gives access to the raw bytes of the to-be-hashed data; anything that implements IterBytes also gets the Hash trait. If you provide an implementation of Hash for your new data type without implementing IterBytes, you receive a complaint about not implementing Iterbytes. If you do implement IterBytes as well, you get a complaint about multiple implementations of Hash. As a result, replacing SipHash means replacing Rust's HashMap implementation as well.

A DJB hash

The djbhash function itself is tiny, but the translation into Rustic expands it a bit. This code is also a bit more flexible than the bare implementation, because it uses the IterBytes trait itself, to simplify the interface.


struct DJBState { hash : u64 }

impl DJBState {
fn new() -> DJBState { DJBState { hash : 5381u64 } }

#[inline]
fn djbhash<T:IterBytes>(t : &T) -> u64 {
let mut state = DJBState::new();
t.iter_bytes(true, |b| { state.write(b); true });
state.flush();
return state.hash();
}

fn hash(&self) -> u64 { self.hash }
}

impl Writer for DJBState {
#[inline]
fn write(&mut self, buf : &[u8]) {
let len = buf.len();
let mut i = 0;
while i < len { self.hash = (33u64 * self.hash) ^ buf[i] as u64; i += 1; }
}
fn flush(&mut self) { }
}

The values 5381 and 33 are magic, experimentally determined numbers in djbhash; you are not expected to understand them. The write method also uses a non-idiomatic style of iteration simply because it is slightly faster, according to my measurements.

The basic interface here is:


let hash = DJBState::djbhash(str);

The Writer implementation and the new and hash methods would be useful to hash multiple values but for the simple cases I need can be entirely encapsulated by djbhash.

A faster HashMap

For the hashmap itself, I decided to go more or less all out, using the Python dictionary implementation as inspiration. Aside from the code linked above, my references were a StackOverflow question, a blog post by Laurent Luce, and a pure Python implementation. You may want to read those before going further, particularly if you want to flame me for not knowing what I'm doing.

The HashMap struct itself has four main entries:


pub struct HashMap<K,V> {
table : ~[Entry<K,V>],
capacity : uint,
mask : u64,
length : uint,
ghosts : uint,
}

table is the primary vector of entries containing the contents of the map. capacity is the number of elements in table; one invariant of this structure is that capacity, and the size of the vector, are powers of 2. mask is a bitmask used to turn a hash value into an index into table; because of the power-of-two invariant, it is simply capacity - 1. Both capacity and mask can easily be calculated, but for the moment I have kept them because they are used frequently. length is the number of key/value pairs stored in the HashMap. ghosts is the number of slots that have had entries removed but have not been reused; it is used to ensure that table has some empty slots.

The Python dictionary and this HashMap use open addressing; all of the key/value pairs are stored in the table itself, rather than being linked off of a bucket. As a result, probing for a key in the map may involve multiple references into table, if multiple keys hash to the same slot index. In that case, a new index is calculated and the probe moves to that entry, a process called "chaining".

The Entry structure is important to understanding the implementation. Originally, I simply used Option, but that becomes problematic when key/value pairs can be removed; if you simply replace a removed pair with None, you lose the information that this slot is part of a chain with multiple keys hashing to the same index.

Table entries


enum Entry<K,V> {
Empty,
Full(K,V),
Ghost(K),
}

An Empty slot has never had an occupant. A Full slot has a current occupant. A Ghost slot (referred to in the Python implementation as a "dummy") is a key/value that has been removed; the key is kept to make probing faster in the case of probing for a value which was in the HashMap.

The primary method associated with Entry is matches:


impl<K : Eq, V> Entry<K,V> {
fn matches(&self, key : &K) -> bool {
match *self {
Empty => true,
Full(ref k,_) | Ghost(ref k) => k == key,
}
}
}

A key matches an Entry if the key could be inserted into the map in this Entry. An Empty Entry could be replaced by any key/value pair, so matches any key. A Full Entry matches its specific key. And a Ghost matches if the key that was in the slot is equal to the probed key.

Which brings me to probing....

Probing


impl HashMap {

fn probe(&self, key : &K) -> uint {
let mut free = None;
let mut hash = DJBState::djbhash(key);
let mut i = hash & self.mask;
while !self.table[i].matches(key) {
if free.is_none() && self.table[i].is_ghost() { free = Some(i); }
i = ((5 * i) + 1 + hash) & self.mask;
hash = hash >> PERTURB_SHIFT;
}
if self.table[i].is_full() || free.is_none() {
i as uint
} else {
free.unwrap() as uint
}
}

Probing a hashmap is the process of identifying which slot a given key is, or should be, located in. Probing is a three-step process:

  1. Compute an initial slot, i. That is done by the two lines:


    let mut hash = DJBState::djbhash(key);
    let mut i = hash & self.mask;
  2. If the slot does not match the key, compute the next slot i in the chain. If a Ghost entry is seen during this process, remember it (as free). Repeat as necessary until a slot matches. This computation is done by the two lines:


    i = ((5 * i) + 1 + hash) & self.mask;
    hash = hash >> PERTURB_SHIFT;

    Note three things about this code: First, the use of hash (and the right-shift of hash) allows all of the bits of the hash value to enter the chain computation. Second, hash will eventually be zero, following the right shift, and \((5i + 1) \mod\ 2^n\) by itself will cycle through all of the values between 0 and \(2^n\), so an empty slot is guaranteed to be found, if one exists. Third, the order of slots searched will always be the same for a given key, but two keys that happen to hash to the same slot will likely search a completely different chain of slots.

    PERTURB_SHIFT is an experimentally-determined 5, by the way.

  3. There are three possibilities when a matching slot is found:

    • Slot i is Full (and its key is equal to the probed key): return i.
    • Slot i is Empty and no Ghost slot was seen: return i.
    • Slot i is Empty and a Ghost slot was seen: return the free Ghost slot.

Containerization

The next steps in writing HashMap are to implement some traits. Container (which supports len() returning length) is trivial; Mutable (providing clear to empty a map) only a little less so. Oddly enough, Map<K,V> is also pretty simple:


impl<K : Eq + IterBytes,V> Map<K,V> for HashMap<K,V> {
fn find<'a>(&'a self, key: &K) -> Option<&'a V> {
let i = self.probe(key);
match self.table[i] {
Empty | Ghost(_) => None,
Full(_, ref val) => Some(val),
}
}
}

The only non-trivial part is that the Full pattern has to match val by reference, so it can be returned as a reference with the right lifetime.

MutableMap<K,V> is the only remaining trait, which requires three methods: swap, pop, and find_mut. find_mut is identical to find, with the exception that it matches the value by mutable reference: Full(_, ref mut val).

Until this point, this implementation in Rust has been, well, normal. I have treated Rust as sort of a combination of C and ML or OCaml, casually, and with relative impunity. But pop is where things start to get interesting. The pop method is used to remove an entry from the HashMap, which takes some finagling.


fn pop(&mut self, key: &K) -> Option<V> {
self.expand();
let i = self.probe(key);
let (result,replacement) = match util::replace(&mut self.table[i], Empty) {
Empty => (None,Empty),
Ghost(k) => (None,Ghost(k)),
Full(k,v) => {
self.length -= 1;
self.ghosts += 1;
(Some(v),Ghost(k))
},
};
self.table[i] = replacement;
result
}

Ignore self.expand() for the moment; I'll get to it in a bit. The first important thing this code does is util::replace(&mut self.table[i], Empty): swap out the current entry in self.table[i] for Empty, returning the original entry. Then the method examines the original value to determine the result of the method and the actual replacement value that goes in table[i].

The swap-out is necessary because, in the Full case for example, I would be moving the k and v values out of the existing entry, leaving it in an indeterminate, unsafe state. The unsafe state would only be temporary, but the compiler's guarantees will not let me do it. Fortunately, I can use Empty as a placeholder.

The final method is swap, used to insert values.


fn swap(&mut self, key: K, value: V) -> Option<V> {
self.expand();
let i = self.probe(&key);
match self.table[i] {
Empty => {
self.table[i] = Full(key,value);
self.length += 1;
None
},
Ghost(_) => {
self.table[i] = Full(key,value);
self.length += 1;
self.ghosts -= 1;
None
},
Full(_,ref mut v) => {
Some( util::replace(v, value) )
},
}
}

Replacing an Empty slot increases length; replacing a Ghost increases length and decreases ghosts, and replacing a Full involves replacing the value of the entry while returning the original value, again using util::replace to return the old value.

Maintaining invariants

So far, I have mentioned two invariants of the HashMap: the size of table is a power of 2, and table has at least one Empty element. Maintaining those invariants is the responsibility of expand, in two cases: adding new entries to the map, potentially requiring an increase in the table size, and removing elements, leaving Ghost entries.


fn expand(&mut self) {
if self.length * 3 > self.capacity * 2 {
// Expand table if live entries nearing capacity
self.do_expand( self.capacity * 2 );
} else if (self.length + self.ghosts) * 3 >= self.capacity * 2 {
// Rehash to flush out excess ghosts
self.do_expand( self.capacity );
}
}

Those two cases form the two branches of expand. In the first case, if length is greater than 2/3 of capacity, the table size is doubled (this is from Python). In the second, if length + ghosts is at least 2/3 capacity, the size is kept, but the table is rehashed.

Performing the expansion or rehashing is handled by do_expand (and a lovely name that is, too).


fn do_expand(&mut self, new_capacity : uint) {
let mut new_tbl = HashMap::with_capacity( new_capacity );
for elt in self.table.mut_iter() {
match util::replace(elt, Empty) {
Full(k,v) => { new_tbl.insert(k,v); },
Empty | Ghost(_) => { },
}
}
// Copy new table's elements into self. Note: attempting
// to do this more directly causes: "use of partially moved
// value"
let cap = new_tbl.capacity;
let mask = new_tbl.mask;
let len = new_tbl.length;
let ghosts = new_tbl.ghosts;
self.table = new_tbl.table;
self.capacity = cap;
self.mask = mask;
self.length = len;
self.ghosts = ghosts;
}

do_expand simply creates a new HashMap of the right size and moves the Full elements to the new table, using util::replace to swap them out of the original table. Then, it copies the new table's fields into self, being careful not to leave self or new_tbl in a state the compiler would regard as unsafe until after it is safe to do so.

And that's all. The constructors, with_capacity and new, simply set up the structure, initializing the invariants with


let capacity = uint::next_power_of_two(sz);

Results

A version of anagrams-hashmap built to use djbhash runs in approximately 3 seconds, half of the time of the standard library HashMap and the Python version.

The code for the faster hashmap is on Github.

active directory applied formal logic ashurbanipal authentication books c c++ comics conference continuations coq data structure digital humanities Dijkstra eclipse virgo electronics emacs goodreads haskell http java job Knuth ldap link linux lisp math naming nimrod notation OpenAM osgi parsing pony programming language protocols python quote quotes R random REST ruby rust SAML scala scheme shell software development system administration theory tip toy problems unix vmware yeti
Member of The Internet Defense League
Site proudly generated by Hakyll.