Rust Hashers

Posted on October 20, 2018 by Tommy M. McGuire
Labels: rust, data structure
A (very bad) hash function in the wild. (Thanks, Wikipedia!)

A hash function, for our purposes here, is a function that takes an arbitrary value as input and returns a number that is ideally unique to that value. This number can be used to store the value in an array and then locate it again later without searching the array; in other words, in O(1) time. More or less: there are a lot of other details. For more information, see Rust’s HashMap and various information sources online.

In Rust specifically, std::hash::Hasher is a trait:

pub trait Hasher {
    fn finish(&self) -> u64;
    fn write(&mut self, bytes: &[u8]);

    fn write_u8(&mut self, i: u8) { ... }
    fn write_u16(&mut self, i: u16) { ... }
    ...
}

Hasher has two required methods, finish and write, plus default implementations of other useful methods like write_u8 and write_u16 that call write. To use it, you create an implementation of Hasher, feed it data via the various write methods, then call finish to get the hash value out.
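As a concrete sketch of implementing the trait, here is a minimal FNV-1a 64-bit hasher. This is my own illustration (the constants are the standard FNV-1a offset basis and prime); the version in the hashers crate may differ in detail.

```rust
use std::hash::Hasher;

/// A minimal FNV-1a 64-bit hasher: XOR each byte into the state,
/// then multiply by the FNV prime.
struct Fnv1a64(u64);

impl Fnv1a64 {
    fn new() -> Fnv1a64 {
        // FNV-1a 64-bit offset basis.
        Fnv1a64(0xcbf29ce484222325)
    }
}

impl Hasher for Fnv1a64 {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            // FNV 64-bit prime; wrapping_mul avoids overflow panics.
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }
}

fn main() {
    let mut h = Fnv1a64::new();
    h.write(b"hello");
    println!("{:x}", h.finish());
}
```

Note that finish does not consume or reset the hasher; it just reports the current state, which is why a fresh Hasher is created for each value hashed.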

Comparing Rust Hashers

This post is based on a small project, a collection of Rust Hashers. I created this project to have a single place to provide a number of hash functions, demonstrate their use, and test some of the qualities of the hash functions. The functions currently included (as exercised by the tests below) are:

- bricolage
- default (Rust’s DefaultHasher)
- djb2
- fnv1a (32- and 64-bit)
- fxhash (32- and 64-bit)
- lookup3
- loselose
- null
- oaat (one-at-a-time)
- passthrough
- sdbm
- spooky

More should be added as I get time.

Statistics

Ideally, the values of a hash function would come from a uniform distribution; the probability of any value would be one over the number of possible values. This is useful because a hash table with n buckets and m items should have about m/n items in each bucket (modulo open addressing).

χ² test

The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. – Chi-squared test

This program attempts to compute the hash values for the members of one of a number of data sets, then simulates using those values in a 128-bucket hash table (a 2⁷ − 1 mask) and tries to determine if the hash buckets are uniformly distributed. I think. I’m not a statistician and apparently not much of a programmer any more. Sorry.

The samples are:

- a uniform distribution of random values,
- random alphanumeric strings (the “alphanumeric distribution”),
- generated identifiers, and
- dictionary words.

The setup for this test is taken (loosely) from Bob Jenkins’ Hash Function FAQ. Jenkins’ page above describes how to do this test, although I could not get his computation of χ² to produce reasonable values, so I went with the algorithm presented by Chi-Square Statistic: How to Calculate It / Distribution.

Anyway, it shows what may be the χ² test of the lower bits of the hash values for a number of samples and for each Hasher. Numbers closer to 0 are apparently better, and values between −3.0 and 3.0 are apparently “ok.” Maybe.
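The bucketing and χ² computation can be sketched roughly as follows. This is my own illustration, not the project’s chi2 program, and it computes the raw χ² statistic rather than the normalized values the tables below report.

```rust
// Sketch: bucket 64-bit hash values into 128 buckets with a 2^7 - 1 mask,
// then compute the chi-squared statistic against a uniform expectation.
fn chi_squared(hashes: &[u64]) -> f64 {
    const BUCKETS: usize = 128; // 2^7 buckets, as in the text
    let mut counts = [0usize; BUCKETS];
    for &h in hashes {
        // Keep only the low 7 bits of each hash value.
        counts[(h & (BUCKETS as u64 - 1)) as usize] += 1;
    }
    let expected = hashes.len() as f64 / BUCKETS as f64;
    // Sum of (observed - expected)^2 / expected over all buckets.
    counts
        .iter()
        .map(|&observed| {
            let d = observed as f64 - expected;
            d * d / expected
        })
        .sum()
}

fn main() {
    // A perfectly uniform sample fills every bucket equally,
    // giving a statistic of 0.
    let uniform: Vec<u64> = (0..1280).collect();
    println!("{}", chi_squared(&uniform));
}
```

A large statistic means the low bits of the hashes are far from uniformly distributed over the buckets.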

It’s a giant piece of crap. I hate statistics. Here are the results from one run of chi2:

 Uniform distribution			Alphanumeric distribution
bricolage:       -1.4086		bricolage:        0.1075
default  :        0.7410		default  :        0.4921
djb2     :       -1.0239		djb2     :       -1.9064
fnv1a 32 :        2.0534		fnv1a 32 :       -1.4312
fnv1a 64 :       -0.5713		fnv1a 64 :       -0.4582
fxhash   :       -2.0874		fxhash   :       -0.5487
fxhash32 :       -0.9560		fxhash32 :        3.9994
fxhash64 :       -2.0874		fxhash64 :       -0.5487
lookup3  :        1.1483		lookup3  :        1.8272
loselose :       -0.1188		loselose :       -3.4677
null     :    11214.0064		null     :    11214.0064
OAAT     :       -1.0918		OAAT     :        0.7863
Pass     :       -1.4312		Pass     :       94.6222
sdbm     :       -0.5940		sdbm     :        0.1754
spooky   :       -0.4356		spooky   :        0.6053

 Generated identifiers			   Dictionary words
bricolage:        8.7964		bricolage:        2.4863
default  :       -1.1823		default  :        2.0961
djb2     :       89.5989		djb2     :       -3.5788
fnv1a 32 :       -8.2873		fnv1a 32 :        4.6737
fnv1a 64 :       -5.7078		fnv1a 64 :        0.1587
fxhash   :       -5.2552		fxhash   :     1473.2659
fxhash32 :       -5.8888		fxhash32 :       44.0437
fxhash64 :       -5.2552		fxhash64 :     1473.2659
lookup3  :       -0.3224		lookup3  :        1.2984
loselose :      525.4030		loselose :       26.1546
null     :    11214.0064		null     :  1113214.9110
OAAT     :       -1.4991		OAAT     :       -1.0941
Pass     :     1031.6688		Pass     :   278857.6686
sdbm     :      214.3212		sdbm     :       42.9022
spooky   :       -1.0465		spooky   :       -1.2023

What can be concluded here is that null is horrible, pass isn’t much better, lose-lose is bad, and the older, simpler suspects are not very good. One thing that concerns me about the test is the performance of fxhash.

Kolmogorov-Smirnov test

The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution. – Kolmogorov–Smirnov test.

This one I have a bit more confidence in. It hashes the same samples as the chi2 program, then attempts to determine how far from uniformly distributed the 64-bit hash values are, reporting values between 0.0 and 1.0. Lower values are better. 32-bit hashes like DJB2 trivially fail this test, although they may be fine for HashMaps with far fewer than 2³² entries.
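The statistic itself can be sketched like this. Again, this is my own illustration (the function name ks_statistic is mine), not the project’s code: normalize the hashes into [0, 1], sort them, and take the largest distance between the empirical CDF and the uniform CDF.

```rust
// Sketch: the Kolmogorov-Smirnov statistic of a set of 64-bit hashes
// against the uniform distribution on [0, 2^64).
fn ks_statistic(hashes: &[u64]) -> f64 {
    // Normalize each hash into [0.0, 1.0].
    let mut v: Vec<f64> = hashes
        .iter()
        .map(|&h| h as f64 / u64::MAX as f64)
        .collect();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = v.len() as f64;
    // Largest distance between the empirical CDF (a step function
    // jumping at each sorted value) and the uniform CDF F(x) = x.
    v.iter()
        .enumerate()
        .map(|(i, &x)| {
            let below = (x - i as f64 / n).abs();
            let above = ((i + 1) as f64 / n - x).abs();
            below.max(above)
        })
        .fold(0.0, f64::max)
}

fn main() {
    // All-identical hashes are as far from uniform as possible.
    let clumped = vec![0u64; 100];
    println!("{:.4}", ks_statistic(&clumped)); // close to 1.0

    // Evenly spaced hashes are close to uniform.
    let spread: Vec<u64> = (0..100).map(|i| i * (u64::MAX / 100)).collect();
    println!("{:.4}", ks_statistic(&spread)); // close to 0.0
}
```

This also shows why 32-bit hashes fail: their values occupy only the bottom 1/2³² of the 64-bit range, so the empirical CDF jumps to 1.0 almost immediately.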

Uniform distribution		Alphanumeric distribution
 bricolage  0.0286    		bricolage  0.0162    
 default    0.0336    		default    0.0114    
 djb2       0.9990    		djb2       0.9999    
 fnv1a 64   0.0227    		fnv1a 64   0.0066    
 fxhash     0.0190    		fxhash     0.0083    
 fxhash32   0.9990    		fxhash32   0.9999    
 fxhash64   0.0190    		fxhash64   0.0083    
 lookup3    0.0137    		lookup3    0.0122    
 loselose   0.9990    		loselose   0.9999    
 null       0.9990    		null       0.9999    
 oaat       0.0314    		oaat       0.0071    
 passthru   0.9990    		passthru   0.9999    
 sdbm       0.9990    		sdbm       0.9999    
 spooky     0.0323    		spooky     0.0115    

Generated identifiers		Dictionary words
 bricolage  0.0059    		bricolage  0.0025    
 default    0.0053    		default    0.0029    
 djb2       0.9999    		djb2       1.0000    
 fnv1a 64   0.1072    		fnv1a 64   0.0050    
 fxhash     0.0090    		fxhash     0.0022    
 fxhash32   0.9999    		fxhash32   1.0000    
 fxhash64   0.0090    		fxhash64   0.0022    
 lookup3    0.0083    		lookup3    0.0034    
 loselose   0.9999    		loselose   1.0000    
 null       0.9999    		null       1.0000    
 oaat       0.3990    		oaat       0.0138    
 passthru   0.9984    		passthru   0.5251    
 sdbm       0.9999    		sdbm       1.0000    
 spooky     0.0082    		spooky     0.0034    

The results of this test are pretty much to be expected: modern, complicated hash functions have better properties.

Benchmarks

The other requirement for a hash function is speed: the hash function is a significant chunk of the overhead of an O(1) hash table. There are a number of different use cases, as demonstrated by the samples used by the statistical tests above. However, they break down into two broad classes: computing the hash of a relatively small object, and computing the hash of a large block of data.

Hash function benchmarks

Rust supplies a very nice benchmarking framework, and the project has a number of largely unhelpful benchmarks for each hash function. Two of them hash 1000 dictionary words and the entire dictionary file, respectively.

               1000 Words					           File
bricolage         79,908 ns/iter (+/- 3,959)		bricolage           9,500,089 ns/iter (+/- 266,247)
default            9,481 ns/iter (+/- 615)		default               361,301 ns/iter (+/- 101,824)
djb2              10,088 ns/iter (+/- 354)		djb2                1,237,351 ns/iter (+/- 29,693)
fnv1a64           11,435 ns/iter (+/- 701)		fnv1a32             1,237,659 ns/iter (+/- 26,776)
fxhash             5,009 ns/iter (+/- 390)		fnv1a64             1,239,444 ns/iter (+/- 23,934)
lookup3           11,313 ns/iter (+/- 481)		fxhash                256,959 ns/iter (+/- 9,187)
loselose           6,042 ns/iter (+/- 228)		lookup3               553,407 ns/iter (+/- 13,112)
oaat              17,661 ns/iter (+/- 605)		loselose              427,386 ns/iter (+/- 13,388)
passthrough        5,867 ns/iter (+/- 275)		oaat                2,178,518 ns/iter (+/- 81,012)
sdbm              10,746 ns/iter (+/- 321)		passthrough           616,022 ns/iter (+/- 36,241)
spooky            26,120 ns/iter (+/- 1,416)		sdbm                1,240,484 ns/iter (+/- 31,541)
							spooky                193,378 ns/iter (+/- 4,973)

And here we find the major downside of bricolage, which appears to have very nice properties and is not very complex: it is very slow. This is likely due to processing the input one byte at a time. The faster functions are either much simpler or are very heavily engineered for speed.

Hashmap benchmarks

This is where the purported rubber meets the alleged road: actual HashMap usage. Like all such simple tests, it is limited by its usage pattern: the hashed keys are dictionary-word-sized chunks.

This program finds the number of words that can be made from the letters “asdwtribnowplfglewhqagnbe”, based on the anagrams dictionary in data/anadict.txt. (There are 7440 of them, out of about 33 million HashMap lookups.) It uses implementations of HashMap and HashSet parameterized by Hashers, and reports the time taken by each hasher as well as a comparison with DefaultHasher.

For more information, check out my ancient series of blog posts:

The results are interesting:

default       2.475s
bricolage     8.498s (343.4%)
djb2          2.569s (103.8%)
fnv-1a 32     2.195s (88.7%)
fnv-1a 64     2.296s (92.8%)
fxhash        2.647s (107%)
fxhash32      1.376s (55.6%)
fxhash64      2.567s (103.7%)
lookup3       2.354s (95.1%)
oaat          2.952s (119.3%)
sdbm          2.319s (93.7%)
spooky        4.094s (165.4%)

The interesting result is from fxhash32: at 1.4 seconds, it is much faster than the default hash function and almost as fast as the simple, but rather obfuscated, C code I wrote for the blog posts. That program consistently runs in 1.0 second; at only 40% slower, the 32-bit FX hash version is better in almost every respect. (I think the times could be improved further for this program by improving the program’s IO behavior, as the C version and other Rust versions already do.)

Using different hash functions

How do you use these functions in your own programs?

Using a custom hash function with Rust’s HashMap or HashSet has long been regarded as a deep mystery. Now, I will strip back the curtains of ignorance and reveal the secrets in all their unholy glory!

Or something like that. It’s not really a big deal.

There are two ways to create a HashMap that uses a custom Hasher implementation: setting the hasher on a hash-map instance, and type-level hackery.

Explicitly telling a HashMap what Hasher to use

Every time a value needs to be hashed, when inserting into or querying the HashMap for example, a new Hasher instance is created. (Remember that the methods in the Hasher trait only update its state or return the final value.)

As a result, instead of passing in a Hasher, we have to pass an instance of another trait, std::hash::BuildHasher. Rust’s standard library currently has two implementations of that trait:

- std::collections::hash_map::RandomState, which creates randomly keyed DefaultHashers, and
- std::hash::BuildHasherDefault, which creates any Hasher that implements Default.

All of the Hashers in this collection also implement Default.

use std::collections::HashMap;
use std::hash::BuildHasherDefault;

use hashers::fx_hash::FxHasher;

// BuildHasherDefault also implements Default---it's not really interesting.
let mut map =
  HashMap::with_hasher( BuildHasherDefault::<FxHasher>::default() );

map.insert(1, 2);
assert_eq!(map.get(&1), Some(&2));

Using types to specify what Hasher to use

As an alternative, HashMap has three type-level parameters: the type of keys, the type of values, and a type implementing std::hash::BuildHasher. By default, the latter is RandomState, which securely creates DefaultHashers. By replacing RandomState, the Hashers used by the map can be determined by the HashMap’s concrete type. std::hash::BuildHasherDefault is useful here, as well.

use std::collections::HashMap;
use std::hash::BuildHasherDefault;

use hashers::fnv::FNV1aHasher64;

// This could be more complicated.
fn gimmie_a_map() -> HashMap<i32,i32,BuildHasherDefault<FNV1aHasher64>> {
    HashMap::default()
}

let mut map = gimmie_a_map();

map.insert(1,2);
assert_eq!(map.get(&1), Some(&2));

A more complicated example is the anagrams-hashmap.rs example program included with this module.
