Rust Hashers
A hash function, for our purposes here, is a function that takes an arbitrary value as input and returns a number that is ideally unique to that value. This number can be used to store the value in an array and then locate it again later without searching the array; in other words, in O(1) time. More or less: there are a lot of other details. For more information, see Rust’s HashMap and various information sources online.
In Rust specifically, std::hash::Hasher is a trait:
pub trait Hasher {
    fn finish(&self) -> u64;
    fn write(&mut self, bytes: &[u8]);
    fn write_u8(&mut self, i: u8) { ... }
    fn write_u16(&mut self, i: u16) { ... }
    ...
}
Hasher has two required methods, finish and write, plus default implementations of other useful methods like write_u8 and write_u16, all of which call write. In use, an implementation of Hasher is created, data is fed to it using the various write methods, and then finish is called to get the hash number out.
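As a concrete illustration, here is a minimal Hasher implementation, a toy 64-bit FNV-1a. This is a sketch of how the trait fits together, not the actual code from this crate:

```rust
use std::hash::Hasher;

// A toy 64-bit FNV-1a, written only to show the shape of the trait.
struct Fnv1a64(u64);

impl Default for Fnv1a64 {
    fn default() -> Self {
        Fnv1a64(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a64 {
    // Return the hash accumulated so far.
    fn finish(&self) -> u64 {
        self.0
    }

    // Fold each input byte into the state. The write_u8, write_u16, etc.
    // default methods all funnel into this one.
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0 ^ b as u64).wrapping_mul(0x100000001b3); // FNV prime
        }
    }
}

fn main() {
    let mut h = Fnv1a64::default();
    h.write(b"hello, hashers");
    println!("hash = {:016x}", h.finish());
}
```

Note that feeding the same bytes in different-sized chunks produces the same hash, since write processes one byte at a time.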
Comparing Rust Hashers
This post is based on a small project, a collection of Rust Hashers. I created this project to have a single place to provide a number of hash functions, demonstrate their use, and test some of the qualities of the hash functions. The functions currently included are:
Bob Jenkins’ one-at-a-time, lookup3, and spooky hash.
Ozan Yigit’s collection of djb2, sdbm, and lose-lose (the hideous function from K&R 1st edition).
Steven Pigeon’s bricolage function.
FXhash, extracted from rustc and used internally by Firefox. This hash function comes in 32- and 64-bit versions. (This module is imported rather than implemented in this package.)
FNV-1a, the Fowler-Noll-Vo hash function.
The default Rust hasher (SipHash-1-3, apparently).
A null hasher and a pass-through hasher, for comparison purposes.
More should be added as I get time.
Statistics
Ideally, the values of a hash function would come from a uniform distribution; the probability of a value would be one over the number of possible values. This is useful because a hash table with n buckets and m items should have m/n items in each bucket (modulo open addressing).
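A hedged sketch of that m/n claim: with a power-of-two bucket count, the low bits of the hash select the bucket, and a uniform hash spreads the items evenly. (The identity function stands in for a real hash function here.)

```rust
// With a power-of-two table, the low bits of the hash pick the bucket.
fn bucket(hash: u64, n_buckets: u64) -> u64 {
    debug_assert!(n_buckets.is_power_of_two());
    hash & (n_buckets - 1)
}

fn main() {
    const N: u64 = 128;  // buckets (power of two)
    const M: u64 = 1024; // items
    let mut counts = vec![0u32; N as usize];
    // The identity "hash" of sequential keys is perfectly uniform over
    // the low bits, so it stands in for an ideal hash function.
    for key in 0..M {
        counts[bucket(key, N) as usize] += 1;
    }
    // Every bucket holds exactly m/n = 8 items.
    assert!(counts.iter().all(|&c| c == (M / N) as u32));
}
```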
χ² test
The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. – Chi-squared test
This program computes the hash values for the members of one of a number of data sets, then simulates using those values in a 128-bucket hash table (a 2^7 − 1 mask) and tries to determine whether the hash buckets are uniformly distributed. I think. I’m not a statistician and apparently not much of a programmer any more. Sorry.
The samples are:
1000 uniformly distributed 6-byte binary values.
1000 uniformly distributed 6-byte alphanumeric (ASCII) values.
1000 generated identifiers of the form ‘annnnn’.
The words from data/words.txt.
The setup for this test is taken (loosely) from Bob Jenkins’ Hash Function FAQ. Jenkins’ page above describes how to do this test, although I could not get his computation of χ² to produce reasonable values, so I went with the algorithm presented by Chi-Square Statistic: How to Calculate It / Distribution.
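For what it’s worth, that computation can be sketched as follows. This is a reconstruction under my own assumptions, not the project’s actual code; the final normalization (χ² − df) / √(2·df) is what makes values near 0 “good”:

```rust
// Hedged sketch of the chi-squared check: hash every sample into 128
// buckets via the low 7 bits, compare observed bucket counts against
// the uniform expectation, then normalize the statistic so that values
// near 0 are good and |x| > 3 is suspicious.
fn chi2_score(hashes: &[u64]) -> f64 {
    const BUCKETS: usize = 128; // i.e. a 2^7 - 1 mask
    let mut counts = [0usize; BUCKETS];
    for &h in hashes {
        counts[(h & (BUCKETS as u64 - 1)) as usize] += 1;
    }
    let expected = hashes.len() as f64 / BUCKETS as f64;
    // Standard chi-squared statistic: sum of (observed - expected)^2 / expected.
    let chi2: f64 = counts
        .iter()
        .map(|&c| {
            let d = c as f64 - expected;
            d * d / expected
        })
        .sum();
    // Normalize against the degrees of freedom (buckets - 1).
    let df = (BUCKETS - 1) as f64;
    (chi2 - df) / (2.0 * df).sqrt()
}

fn main() {
    // Sequential "hashes" fill the buckets perfectly evenly (chi2 = 0),
    // which actually scores *below* zero, since a genuinely random
    // uniform sample would have chi2 near df.
    let hashes: Vec<u64> = (0..1280).collect();
    println!("{:.4}", chi2_score(&hashes));
}
```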
Anyway, the program reports what may be the χ² statistic of the lower bits of the hash values, for a number of samples and for each Hasher. Numbers closer to 0 are apparently better, and values between -3.0 and 3.0 are apparently “ok.” Maybe.
It’s a giant piece of crap. I hate statistics. Here’s the results from one run of chi2:
Uniform distribution              Alphanumeric distribution
bricolage: -1.4086                bricolage:     0.1075
default  :  0.7410                default  :     0.4921
djb2     : -1.0239                djb2     :    -1.9064
fnv1a 32 :  2.0534                fnv1a 32 :    -1.4312
fnv1a 64 : -0.5713                fnv1a 64 :    -0.4582
fxhash   : -2.0874                fxhash   :    -0.5487
fxhash32 : -0.9560                fxhash32 :     3.9994
fxhash64 : -2.0874                fxhash64 :    -0.5487
lookup3  :  1.1483                lookup3  :     1.8272
loselose : -0.1188                loselose :    -3.4677
null     : 11214.0064             null     : 11214.0064
OAAT     : -1.0918                OAAT     :     0.7863
Pass     : -1.4312                Pass     :    94.6222
sdbm     : -0.5940                sdbm     :     0.1754
spooky   : -0.4356                spooky   :     0.6053

Generated identifiers             Dictionary words
bricolage:    8.7964              bricolage:      2.4863
default  :   -1.1823              default  :      2.0961
djb2     :   89.5989              djb2     :     -3.5788
fnv1a 32 :   -8.2873              fnv1a 32 :      4.6737
fnv1a 64 :   -5.7078              fnv1a 64 :      0.1587
fxhash   :   -5.2552              fxhash   :   1473.2659
fxhash32 :   -5.8888              fxhash32 :     44.0437
fxhash64 :   -5.2552              fxhash64 :   1473.2659
lookup3  :   -0.3224              lookup3  :      1.2984
loselose :  525.4030              loselose :     26.1546
null     : 11214.0064             null     : 1113214.9110
OAAT     :   -1.4991              OAAT     :     -1.0941
Pass     : 1031.6688              Pass     : 278857.6686
sdbm     :  214.3212              sdbm     :     42.9022
spooky   :   -1.0465              spooky   :     -1.2023
What we can conclude here is that null is horrible, pass isn’t much better, lose-lose is bad, and the older, simpler suspects are not very good. One thing that concerns me about this test is the performance of fxhash on the dictionary words.
Kolmogorov-Smirnov test
The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution. – Kolmogorov–Smirnov test.
This one I have a bit more confidence in. It hashes the same samples as the chi2 program, then attempts to determine how far from uniformly distributed the 64-bit hash values are, reporting values between 0.0 and 1.0. Lower values are better. 32-bit hashes like djb2 trivially fail this test, although they may be fine for HashMaps with far fewer than 2^32 entries.
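A sketch of such a statistic, assuming the simplest empirical-CDF formulation (not necessarily the project’s exact code): scale each 64-bit hash into [0, 1), sort, and take the maximum distance between the empirical CDF and the uniform CDF.

```rust
// Hedged sketch of the KS statistic against the uniform distribution.
// 32-bit hashes leave the top of the 64-bit range empty, so their
// scaled values cluster near 0 and the statistic approaches 1.0.
fn ks_statistic(hashes: &[u64]) -> f64 {
    // Map each hash into [0, 1).
    let mut xs: Vec<f64> = hashes.iter().map(|&h| h as f64 / 2f64.powi(64)).collect();
    xs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = xs.len() as f64;
    // Maximum deviation between empirical CDF steps and the uniform CDF.
    xs.iter()
        .enumerate()
        .map(|(i, &x)| {
            let d_plus = (i as f64 + 1.0) / n - x;
            let d_minus = x - i as f64 / n;
            d_plus.max(d_minus)
        })
        .fold(0.0, f64::max)
}

fn main() {
    // Hashes evenly spaced over the full 64-bit range: statistic ~ 1/n.
    let even: Vec<u64> = (0..1000).map(|i| i * (u64::MAX / 1000)).collect();
    // "32-bit"-style hashes crowded at the bottom: statistic ~ 1.0.
    let narrow: Vec<u64> = (0..1000u64).collect();
    println!("{:.4} {:.4}", ks_statistic(&even), ks_statistic(&narrow));
}
```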
Uniform distribution              Alphanumeric distribution
bricolage  0.0286                 bricolage  0.0162
default    0.0336                 default    0.0114
djb2       0.9990                 djb2       0.9999
fnv1a 64   0.0227                 fnv1a 64   0.0066
fxhash     0.0190                 fxhash     0.0083
fxhash32   0.9990                 fxhash32   0.9999
fxhash64   0.0190                 fxhash64   0.0083
lookup3    0.0137                 lookup3    0.0122
loselose   0.9990                 loselose   0.9999
null       0.9990                 null       0.9999
oaat       0.0314                 oaat       0.0071
passthru   0.9990                 passthru   0.9999
sdbm       0.9990                 sdbm       0.9999
spooky     0.0323                 spooky     0.0115

Generated identifiers             Dictionary words
bricolage  0.0059                 bricolage  0.0025
default    0.0053                 default    0.0029
djb2       0.9999                 djb2       1.0000
fnv1a 64   0.1072                 fnv1a 64   0.0050
fxhash     0.0090                 fxhash     0.0022
fxhash32   0.9999                 fxhash32   1.0000
fxhash64   0.0090                 fxhash64   0.0022
lookup3    0.0083                 lookup3    0.0034
loselose   0.9999                 loselose   1.0000
null       0.9999                 null       1.0000
oaat       0.3990                 oaat       0.0138
passthru   0.9984                 passthru   0.5251
sdbm       0.9999                 sdbm       1.0000
spooky     0.0082                 spooky     0.0034
The results of this test are pretty much to be expected: modern, complicated hash functions have better properties.
Benchmarks
The other requirement for a hash function is speed: the hash function is a significant chunk of the overhead of an O(1) hash table. There are a number of different use cases, as demonstrated by the samples used by the statistical tests above. However, they break down into two broad classes: computing the hash of a relatively small object, and computing the hash of a large block of data.
Hash function benchmarks
Rust supplies a very nice benchmarking framework, and the project has a number of largely unhelpful benchmarks for each hash function. Two representative ones hash 1000 dictionary words and the entire dictionary file.
1000 Words                                 File
bricolage    79,908 ns/iter (+/- 3,959)    bricolage   9,500,089 ns/iter (+/- 266,247)
default       9,481 ns/iter (+/- 615)      default       361,301 ns/iter (+/- 101,824)
djb2         10,088 ns/iter (+/- 354)      djb2        1,237,351 ns/iter (+/- 29,693)
fnv1a64      11,435 ns/iter (+/- 701)      fnv1a32     1,237,659 ns/iter (+/- 26,776)
fxhash        5,009 ns/iter (+/- 390)      fnv1a64     1,239,444 ns/iter (+/- 23,934)
lookup3      11,313 ns/iter (+/- 481)      fxhash        256,959 ns/iter (+/- 9,187)
loselose      6,042 ns/iter (+/- 228)      lookup3       553,407 ns/iter (+/- 13,112)
oaat         17,661 ns/iter (+/- 605)      loselose      427,386 ns/iter (+/- 13,388)
passthrough   5,867 ns/iter (+/- 275)      oaat        2,178,518 ns/iter (+/- 81,012)
sdbm         10,746 ns/iter (+/- 321)      passthrough   616,022 ns/iter (+/- 36,241)
spooky       26,120 ns/iter (+/- 1,416)    sdbm        1,240,484 ns/iter (+/- 31,541)
                                           spooky        193,378 ns/iter (+/- 4,973)
And here we find the major downside of bricolage, which appears to have very nice properties and is not very complex: It is very slow. This is likely due to processing one byte of the input at a time. The faster functions are either much simpler or are very heavily engineered for speed.
Hashmap benchmarks
This is where the purported rubber meets the alleged road: actual HashMap usage. Like all such simple tests, it is limited by its usage pattern: the hashed keys are dictionary-word-sized chunks.
This program finds the number of words that can be made from the letters “asdwtribnowplfglewhqagnbe”, based on the anagrams dictionary in data/anadict.txt. (There are 7440 of them, out of about 33 million HashMap lookups.) It uses implementations of HashMap and HashSet parameterized by Hashers, and reports the time taken by each hasher as well as a comparison with DefaultHasher.
For more information, check out my ancient series of blog posts:
- Creating a Letterpress cheating program in Rust
- Letterpress cheating in Rust 0.9
- Letterpress cheating in Rust 1.6: How long has it been?!?
- And others.
The results are interesting:
default     2.475s
bricolage   8.498s (343.4%)
djb2        2.569s (103.8%)
fnv-1a 32   2.195s  (88.7%)
fnv-1a 64   2.296s  (92.8%)
fxhash      2.647s (107.0%)
fxhash32    1.376s  (55.6%)
fxhash64    2.567s (103.7%)
lookup3     2.354s  (95.1%)
oaat        2.952s (119.3%)
sdbm        2.319s  (93.7%)
spooky      4.094s (165.4%)
The interesting result is from fxhash32: at 1.4 seconds, it is much faster than the default hash function and almost as fast as the simple, but rather obfuscated, C code I wrote for the blog posts. That program consistently runs in 1.0 second; at only 40% slower, the 32-bit FX hash version is better in almost every respect. (I think the times could be improved further for this program by improving the program’s IO behavior, as the C version and other Rust versions already do.)
Using different hash functions
How do you use these functions in your own programs?
Using a custom hash function with Rust’s HashMap or HashSet has long been regarded as a deep mystery. Now, I will strip back the curtains of ignorance and reveal the secrets in all their unholy glory!
Or something like that. It’s not really a big deal.
There are two ways to create a HashMap that uses a custom Hasher implementation: setting the hasher on a hash-map instance, and type-level hackery.
Explicitly telling a HashMap what Hasher to use
Every time a value needs to be hashed, when inserting into or querying the HashMap for example, a new Hasher instance is created. (Remember that the only methods in the Hasher trait either update its state or return the final value.)
As a result, instead of passing in a Hasher, we pass an instance of another trait, std::hash::BuildHasher. Rust’s standard library currently has two implementations of that trait:
- std::collections::hash_map::RandomState, which creates instances of DefaultHasher, Rust’s implementation of SipHash, keyed with random data to prevent denial-of-service attacks.
- std::hash::BuildHasherDefault, which can create instances of any Hasher implementation that also implements the Default trait.
All of the Hashers in this collection also implement Default.
use std::collections::HashMap;
use std::hash::BuildHasherDefault;
use hashers::fx_hash::FxHasher;
// BuildHasherDefault also implements Default---it's not really interesting.
let mut map = HashMap::with_hasher(BuildHasherDefault::<FxHasher>::default());
map.insert(1, 2);
assert_eq!(map.get(&1), Some(&2));
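BuildHasherDefault is just a generic version of boilerplate you can also write by hand, which is occasionally useful, for example to seed a hasher. A sketch, where Fnv1a is a toy stand-in for any Hasher implementation:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hasher};

// A toy FNV-1a hasher; any Hasher implementation would do here.
struct Fnv1a(u64);

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0 ^ b as u64).wrapping_mul(0x100000001b3);
        }
    }
}

// The map calls build_hasher() to get a fresh hasher for every
// insert or lookup.
struct BuildFnv1a;

impl BuildHasher for BuildFnv1a {
    type Hasher = Fnv1a;
    fn build_hasher(&self) -> Fnv1a {
        Fnv1a(0xcbf29ce484222325) // FNV-1a offset basis
    }
}

fn main() {
    let mut map = HashMap::with_hasher(BuildFnv1a);
    map.insert(1, 2);
    assert_eq!(map.get(&1), Some(&2));
}
```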
Using types to specify what Hasher to use
As an alternative, HashMap has three type-level parameters: the type of keys, the type of values, and a type implementing std::hash::BuildHasher. By default, the latter is RandomState, which securely creates DefaultHasher instances. By replacing RandomState, the Hasher used by the map can be determined by the HashMap’s concrete type. std::hash::BuildHasherDefault is useful here as well.
use std::collections::HashMap;
use std::hash::BuildHasherDefault;
use hashers::fnv::FNV1aHasher64;
// This could be more complicated.
fn gimmie_a_map() -> HashMap<i32, i32, BuildHasherDefault<FNV1aHasher64>> {
HashMap::default()
}
let mut map = gimmie_a_map();
map.insert(1,2);
assert_eq!(map.get(&1), Some(&2));
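One convenience worth mentioning: a type alias hides the third parameter entirely, keeping it out of every signature. The sketch below uses std’s DefaultHasher only because it implements Default and needs no external crate; any Default-implementing Hasher can be substituted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

// The alias fixes the BuildHasher type once; callers just see FastMap.
// DefaultHasher is a stand-in for whatever Hasher you actually want.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<DefaultHasher>>;

fn gimmie_a_map() -> FastMap<i32, i32> {
    FastMap::default()
}

fn main() {
    let mut map = gimmie_a_map();
    map.insert(1, 2);
    assert_eq!(map.get(&1), Some(&2));
}
```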
A more complicated example is the anagrams-hashmap.rs example program included with this module.