Creating a Letterpress cheating program in Rust

Posted on February 25, 2013 by Tommy McGuire
Labels: rust, c, python, programming language

Update: Creating a Letterpress cheating program in Rust, v2.

As I believe I have mentioned, I have been casually learning Rust (currently, Rust 0.5) by working on a port of Jeff Knupp's Creating and Optimizing a Letterpress Cheating Program in Python. It's now time to describe my current progress.

For those of you with short attention spans, this is the punchline:

ProgramDuration (sec)Speedup
pesser_one.py49.71
anagrams-hashmap.rs24.02
anagrams-vectors.rs17.63
presser_two.py12.54
presser_three.py6.08
anagrams-vectors.c5.69
anagrams-hash.c0.955

The argument for all of these runs was "asdwtribnowplfglewhqagnbe", which produces 7440 results from my dictionary; I get 33,554,406 combinations. I used Python 2.7.3; gcc 4.6.3, with -O3; and rustc 0.5, with -O. (I missed "--opt-level 3" the first time around and did get some improvement when I used it. But that's not shown here.)

I started by getting all three of the Python versions from Knupp working, to have something to compare against. (I would like to point out that presser_one.py is not something I would consider idiomatic Python, but I do need something to make the rest look better.) Then, I wrote the anagrams-hashmap Rust version, the meat of which looks like:


fn main() {
let args = os::args();
if args.len() < 2 { fail ~"Usage: anagrams letters"; }
let mut letters = str::chars(args[1]);
std::sort::quick_sort(letters, |a,b| *a <= *b);
let dictionary = load_dictionary();
let mut set : Set<@~str> = HashMap();
for uint::range(2,letters.len()) |i| {
for combinations::each_combination_ref(letters,i) |combo| {
let ana = @vec::from_fn(combo.len(), |i| *combo[i]);
match dictionary.find(ana) {
Some(ref val) => {
for val.each() |word| { set_add(set, *word); }
}
None => { }
}
}
}
let result = vec_from_set(set);
println(fmt!("%u", result.len()));
}

This code uses the each_combination function I described last time. Unfortunately, the hashmap interface (and implementation) are a bit of a mess in 0.5, leading to some unpleasant code and probably poor performance. Specifically, I'm looking at the line "@vec::from_fn(combo.len(), |i| *combo[i])", which copies the contents of the combination vector into a separate vector, since the call to "find(ana)" is going to take ownership.

Unhappy with the performance of that, I wrote the anagrams-vectors Rust version, which avoids the use of the hashmap for the dictionary, replacing it with a binary search similar to presser_one.py. In fact, it uses a conversion of the same bisect_left function. The fact that anagrams-vectors shows improvement should tell you there's something fishy in the state of Denmark (er, hashmap). (Unfortunately, the Set used to contain the resulting words is a thin wrapper on the hashmap, so I did not completely remove that module.)

Following several attempts to improve the Rust results without much success, I decided to dust off my C and go old-school, medieval in fact, on this sucker. I first came up with the vector-based C version, which mmaps the anagram dictionary file and builds its dictionary structure based on that file's sorted lines. Given that attempt's not-completely-satisfactory success, I added the hash version, which builds a pseudo-hashtable over the mmap'ed dictionary. That version is about as far as I'm willing to go at the moment.

There are probably a few worthwhile notes on the C versions:

I expect the performance to improve as Rust work continues. In fact, the hashmap interface is much better in the version in the core library in incoming. When 0.6 lands (or sooner), I'll update this code and provide up-to-date results.

active directory applied formal logic ashurbanipal authentication books c c++ comics conference continuations coq data structure digital humanities Dijkstra eclipse virgo electronics emacs goodreads haskell http java job Knuth ldap link linux lisp math naming nimrod notation OpenAM osgi parsing pony programming language protocols python quote R random REST ruby rust SAML scala scheme shell software development system administration theory tip toy problems unix vmware yeti
Member of The Internet Defense League
Site proudly generated by Hakyll.