Letterpress cheating in Rust 0.11.0, part 1

Posted on July 19, 2014 by Tommy McGuire

Labels: rust, toy problems, programming language

It's been a while. I got busy with work and other things and have not revisited my collection of Rust toys recently. In fact, I skipped Rust 0.10 entirely. (It gets worse: I have about 2/3 of a Letterpress cheating program written in the dependently-typed ATS 1 language that I have not touched in months. I feel bad.)

But I am turning over a new leaf. Or flake of rust, as the case may be. I am going to revisit the programs for Rust 0.11 before 0.12 comes out. I hope.

In any case, instead of simply presenting the differences between the previous and current versions, I am going to show the entire, updated programs. The changes have been significant, with the replacement of '~' and the symbolic vec type by owned boxes and vectors in the standard library, and with a serious rearrangement of the standard library itself.

To remind you, this started out as a port of Jeff Knupp's Creating and Optimizing a Letterpress Cheating Program in Python as a way of learning Rust, back in the 0.5 days. Since then, the project has gotten out of hand, since I am also interested in performance, in exploring more esoteric parts of Rust, and have used it as a way to explore other languages. The release of Rust 0.9 resulted in the most recent update, back in January.

Creating the anagram dictionary

The first required step is to create an anagram dictionary, a file which maps a sorted collection of letters to the list of words that can be made from those letters. The format of the dictionary file is one mapping per line, where each line begins with the letter collection, separated by a space from the list of space-separated words. The following program, presented in its entirety for your viewing pleasure, implements that task.

In main, the program reads a list of words, one per line, from the ever-useful /usr/share/dict/words file. Then it creates a new file for the dictionary and writes it out.

Rust supports parametric polymorphism, along with a typeclass/interface feature called "traits". In the phrase, read_dict<R:Reader>, R is a type variable acting as a compile-time, type-level parameter to the read_dict function; and Reader is a trait which acts as a constraint on R. R is in turn used in the phrase, reader : &mut BufferedReader<R>, which indicates that the reader parameter should be a borrowed reference to a BufferedReader, a structure providing the ability to read buffered, line-oriented input from the bare Reader input stream.

A HashMap is a structure implementing a hash map. In this instance the type parameters of the HashMap are the key, String, and the value, a Vec<String>. A String is a structure with an owned, dynamically-allocated, box containing a UTF-8 character string; String replaces the previous versions' ~str. The String is therefore resizable. A Vec<T> is a similar structure containing an owned vector of T's; in this case Strings. Vec<T> replaces the previous ~[T].

read_dict reads each word from the input, checks that it matches the requirements of the original Letterpress cheating program, creates a sorted vector of its characters, and inserts the word into the dictionary, adding it to an existing vector value if possible and adding a new vector if not.

print_dict performs the inverse, output operation, by getting the list of keys from the dictionary, sorting them, and then iterating over them to print the key and the associated list of words on each line.

An alternative: Rust traits

Rust presents an interesting alternative to the bare-bones, imperative style of mk_anadict. Rust supports traits, which are somewhat similar to Haskell's typeclasses and vaguely related to Java's interfaces. A trait is a definition of a set of function (or method) declarations and definitions implemented by a type. Traits interact closely with Rust's type system and its support for generic code and type variables, by providing bounds on the types that can appear in the given positions. For more information, check out the Rust Tutorial and Reference Manual.

Traits also allow an existing type to be extended with new behavior. Unlike most object-oriented languages, the definitions of the traits and their methods that are available on a type are separate from the definition of the type itself. As a result, it is possible to define a new trait and provide an implementation of it for a separate, previously existing type. As a result, existing Rust types can be extended with the behaviors needed to create the anagram dictionary.

A dictionary with ordered keys

The Rust HashMap can return the keys contained in the map, but cannot do so in, say, alphabetical order. SortedKeys is a trait intended to help.

SortedKeys has a type parameter, K, with the constraint that K be ordered. It describes a method, sorted_keys, which can be called on the object for which the trait is implemented and returns a (sorted) list of K's. An implementation is provided for Rust's HashMap type, which is parameterized over the types of keys and values; the HashMap requires the other constraints on K. The implementation of sorted_keys uses the HashMap's methods and other library functions to retrieve the keys and sort them.

A writer for anagram dictionaries

Another capability of traits is to extend other traits. DictWriter extends the standard library's Writer trait to provide a method that writes the anagram dictionary to an output stream.

The DictWriter trait is not parameterized and simply uses the other methods of the extended trait for its implementation. As a result, the implementations for File and IoResult<File> are trivial.

IoResult is the current method for indicating I/O errors or passing results along. It would be better for the method to return an IOResult than to call fail!.

The main program

The main function of the program is uninteresting. In fact, the only virtue it has is to demonstrate that using the new traits on existing types requires only making the new traits and their implementations visible.

Iterating over combinations of letters

The code to run through all of the possible combinations of a given length from a larger array of values is given by each_combination. This function is fundamentally stolen from Python's itertools module.

values are source array of values, r is the length of the subsequences to look at, and fun is a function accepting a reference to a subsequence and doing something useful with it.

This is an internal iterator, which differs from the external iterators supporting Rust's for loop. While external iterators have a number of practical benefits over internal iterators, in this case an external iterator would be significantly slower than the design presented here.

The problem is that the iterator would be iterating over a synthetic collection, the collection of subsequences. Since there is nothing handy to hang the lifetime on, the subsequences returned by the external iterator would need to be newly allocated to preserve memory safety, and given the processing of each subsequence below, the allocation dominates the cost of each iteration.

(If you want to play with the effects, compare the permutation iterator in the Rust standard library for vectors with the internal permutation function found in combinations.rs. The Rust library version will return permutations in a different order to minimize the number of internal copies, and so should be faster.)

The implementation of each_combination has not changed greatly. The only interesting feature is the use of get() and get_mut(), "*indices.get(0)" and "*indices.get_mut(i)", to access vector elements. That used to use square brackets, but I believe there are plans to introduce traits for those operators, allowing that notation to be reintroduced.

I am not at all sure why "*indices.get(0) > max_indices0" requires the dereference but the right-hand-side of "*indices.get_mut(i) = indices.get(i-1) + 1" does not.

The program

Loading the anagram dictionary

Reading the anagram dictionary is largely the inverse of the corresponding operation when creating the dictionary. This function is probably more verbose that it could be.

One interesting feature is the function used to split an incoming line into separate strings. This uses Rust's iterators to good effect, mapping over the strings to get newly allocated words and collecting them into a vector.

At one time, to_string was to_owned and was supposed to be a general method for references to create new, distinct objects.

Arranging the letter-set

The letter-set provided as an argument must be sorted in order to find all of the possible anagrams. Failing to do this for one implementation cost me an inordinate amount of debugging time.

Searching

Searching for possible words requires a nested iteration: the outer loop covers the possible lengths of the subsequences of the letter-set, and the inner loop uses each_combination to cover each possible subsequence of a given length. The body of the inner loop simply looks for the subsequence in the anagram dictionary (which almost always fails) and adds the list of matching words, if any, to a resulting set.

Pay no attention to the commented-out timing lines.

Putting the pieces together

Like the original Letterpress cheating code, this program prints the number of found words, as a sanity check.

Results

For comparison, here are the timings for this program since Rust 0.6 (yeah, missing 0.10). These timings are all made with the same optimization level, same input letter-set, and same number of results found.

Language	Program	Rust 0.6 Duration (seconds)	Rust 0.7 Duration (seconds)	Rust 0.8 Duration (seconds)	Rust 0.9 Duration (seconds)	Rust 0.11 Duration (seconds)
Rust	anagrams-hashmap	6.0	35.5	7.2	7.0	6.6

I am working on updating all of the implementations for comparison. You can find the most recent version of the code on Github. Thanks to gist-it for the ability to insert parts of the Github source into this post.