Saturday, August 9, 2014

Letterpress cheating in Rust 0.11.0, part 2

I have finally completed upgrading all of the assorted toy Rust programs, my ports of Jeff Knupp's Creating and Optimizing a Letterpress Cheating Program in Python, to 0.11. I also re-executed the notoriously crappy benchmarks.

These programs look for all of the words that can be made from a given set of letters, based on the system dictionary. The argument was "asdwtribnowplfglewhqagnbe", which produces 7440 results from my dictionary with a possible 33,554,406 combinations made from those letters.

Language Program Rust 0.6
Duration
(seconds)
Rust 0.7
Duration
(seconds)
Rust 0.8
Duration
(seconds)
Rust 0.9
Duration
(seconds)
Rust 0.11
Duration
(seconds)
Python alternatives/presser_one.py 49.1 48.6 47.8 39.0 59.4
Nimrod alternatives/nimrod_anagrams 12.3 18.0
Python alternatives/presser_two.py 12.8 12.6 12.3 11.6 17.2
Rust anagrams-hashmap-wide 9.3 15.4 12.1 19.6 15.7
Rust anagrams-vectors-wide 11.8 13.1 12.2 16.8 12.4
Rust anagrams-vectors 8.0 8.2 11.9 8.1 11.0
Rust anagrams-hashmap 6.0 35.5 7.2 7.0 9.3
C alternatives/anagrams-vectors 8.0 5.8 5.8 6.0 9.6
Python alternatives/presser_three.py 6.0 6.3 6.0 5.8 8.3
Rust anagrams-vectors-tasks 27.1 13.8 4.2 4.6 7.7
Rust anagrams-djbhash-tasks 6.2 5.5
Rust anagrams-hashmap-mmap 4.8 10.6 7.3 6.3 2.9
Rust anagrams-djbhashmap 2.8 2.5
C alternatives/anagrams-hash 0.9 1.0 1.0 0.9 1.4

The programming languages and versions for this run are:

  • Python: Python 2.7.6, with Python 2.7.3 and 2.7.5 for previous versions.
  • C: gcc 4.8.2, with 4.6.3 and 4.8.1 for the prior runs, all with -O3.
  • Nimrod: Nimrod 0.9.4 this time, 0.9.2 last, compiled with -d:release.
  • Rust: Rust 0.11.0, compiled with -O.

The various versions of the programs take slightly different approaches. Those with hashmap use a hashtable to store the anagram dictionary while those with vector use a sorted array and binary search to look up anagrams. Those with djbhash use an alternative hashtable implementation, based on the DJB hash algorithm and Python's dictionary implementation. The mmap version, as well as both of the C versions, import the dictionary via mmap rather than reading. All of the programs are single threaded, except for the wide and tasks versions. The wide versions split the dictionary into segments and have each thread search all of the possible combinations in its reduced dictionary. The tasks versions allow each task to have a copy of the full dictionary and the master process divides round-robins the combinations to the tasks. The parameters of each were tuned a while back and have not been adjusted.

Friday, August 8, 2014

I'm rather proud of an answer to Robert Harper's discussion of C typing that I wrote as a comment to a post, Six Points about Type Safety.

The post includes the footnote:

Dr. Robert Harper describes such a type safe analysis of C in a comment here

and while I largely agree with the six points, I disagree with Harper (and don't feel the need to carry the disagreement to wherever Harper posted his comment).

He says, in part, "For example, C is perfectly type safe. It’s semantics is a mapping from 2^64 64-bit words to 2^64 64-bit words. It should be perfectly possible to call rnd(), cast the result as a word pointer, write to it, and read it back to get the same value. Unix never implemented the C dynamics properly, so we get absurdities like 'Bus Error' that literally have no meaning whatsoever in terms of C code."

I don't believe this to be the case for two reasons, philosophical and definitional.

In the first place, if Unix, etc., "never implemented the C dynamics properly", we are very definitely into the discussion of the status of the concept of "unicorns", given that no actual unicorn exists. As a result, I feel perfectly free to assert anything without any real worry about contradiction---what is he going to do, declare me a heretic? Further, philosophically---my previously existing, unreasoned prejudices---I find his stance silly.

In the second, he is wrong about the definition of C. It's semantics, for any reasonable definition of the "semantics of C", are not a mapping from any number of any sized words to other words. The C standard, which is not formal but which is C in a real sense, pretty clearly says that the operation he describes is either undefined or implementation defined or otherwise similarly verboten. In which case, Unix does implement C dynamics properly. He and many others may not include SIGSEGV in their mental model of C's semantics, but that does not mean that he nor they are right.

Those are both significant problems, although the first is the worse. In what sense can one talk about the semantics of a language if no implementation of the language follows those semantics (even assuming those are the semantics of the definition of the language)? I do know that such is not a useful thing to do.

[Update] ...and the post on Hacker News for the Six Points article has been declared dead. Boy, do I love HN.

Wednesday, July 23, 2014

Quote o' the day: Ouchie!

From a discussion of the Hemingway app on Hacker News, jaxn writes,

Could someone create a gmail plugin to filter my outgoing emails through this. Maybe it would help to make my emails clearer and more persuasive.

To which fl0wenol replies,

No, because instead of discerning if you're a dullard by the quality of your writing at a glance, I'll instead have to stumble through mechanically de-voiced inanity. At least without Hemmingway your stream-of-consciousness as committed to typed form can be examined and admired as a unique reflection of your particular damage.

And suddenly I've gone from frothing at Hemingway, "utilize", and "gift" (as a verb, for no particular reason) to looking for some pain-relieving ointment

And it's nice to know that I'm not the only one who wants to apply an extra "m" to Papa H.

Saturday, July 19, 2014

Letterpress cheating in Rust 0.11.0, part 1

It's been a while. I got busy with work and other things and have not revisited my collection of Rust toys recently. In fact, I skipped Rust 0.10 entirely. (It gets worse: I have about 2/3 of a Letterpress cheating program written in the dependently-typed ATS 1 language that I have not touched in months. I feel bad.)

But I am turning over a new leaf. Or flake of rust, as the case may be. I am going to revisit the programs for Rust 0.11 before 0.12 comes out. I hope.

In any case, instead of simply presenting the differences between the previous and current versions, I am going to show the entire, updated programs. The changes have been significant, with the replacement of '~' and the symbolic vec type by owned boxes and vectors in the standard library, and with a serious rearrangement of the standard library itself.

To remind you, this started out as a port of Jeff Knupp's Creating and Optimizing a Letterpress Cheating Program in Python as a way of learning Rust, back in the 0.5 days. Since then, the project has gotten out of hand, since I am also interested in performance, in exploring more esoteric parts of Rust, and have used it as a way to explore other languages. The release of Rust 0.9 resulted in the most recent update, back in January.

Creating the anagram dictionary

The first required step is to create an anagram dictionary, a file which maps a sorted collection of letters to the list of words that can be made from those letters. The format of the dictionary file is one mapping per line, where each line begins with the letter collection, separated by a space from the list of space-separated words. The following program, presented in its entirety for your viewing pleasure, implements that task.

In main, the program reads a list of words, one per line, from the ever-useful /usr/share/dict/words file. Then it creates a new file for the dictionary and writes it out.

Rust supports parametric polymorphism, along with a typeclass/interface feature called "traits". In the phrase, read_dict<R:Reader>, R is a type variable acting as a compile-time, type-level parameter to the read_dict function; and Reader is a trait which acts as a constraint on R. R is in turn used in the phrase, reader : &mut BufferedReader<R>, which indicates that the reader parameter should be a borrowed reference to a BufferedReader, a structure providing the ability to read buffered, line-oriented input from the bare Reader input stream.

A HashMap is a structure implementing a hash map. In this instance the type parameters of the HashMap are the key, String, and the value, a Vec<String>. A String is a structure with an owned, dynamically-allocated, box containing a UTF-8 character string; String replaces the previous versions' ~str. The String is therefore resizable. A Vec<T> is a similar structure containing an owned vector of T's; in this case Strings. Vec<T> replaces the previous ~[T].

read_dict reads each word from the input, checks that it matches the requirements of the original Letterpress cheating program, creates a sorted vector of its characters, and inserts the word into the dictionary, adding it to an existing vector value if possible and adding a new vector if not.

print_dict performs the inverse, output operation, by getting the list of keys from the dictionary, sorting them, and then iterating over them to print the key and the associated list of words on each line.

An alternative: Rust traits

Rust presents an interesting alternative to the bare-bones, imperative style of mk_anadict. Rust supports traits, which are somewhat similar to Haskell's typeclasses and vaguely related to Java's interfaces. A trait is a definition of a set of function (or method) declarations and definitions implemented by a type. Traits interact closely with Rust's type system and its support for generic code and type variables, by providing bounds on the types that can appear in the given positions. For more information, check out the Rust Tutorial and Reference Manual.

Traits also allow an existing type to be extended with new behavior. Unlike most object-oriented languages, the definitions of the traits and their methods that are available on a type are separate from the definition of the type itself. As a result, it is possible to define a new trait and provide an implementation of it for a separate, previously existing type. As a result, existing Rust types can be extended with the behaviors needed to create the anagram dictionary.

A dictionary with ordered keys

The Rust HashMap can return the keys contained in the map, but cannot do so in, say, alphabetical order. SortedKeys is a trait intended to help.

SortedKeys has a type parameter, K, with the constraint that K be ordered. It describes a method, sorted_keys, which can be called on the object for which the trait is implemented and returns a (sorted) list of K's. An implementation is provided for Rust's HashMap type, which is parameterized over the types of keys and values; the HashMap requires the other constraints on K. The implementation of sorted_keys uses the HashMap's methods and other library functions to retrieve the keys and sort them.

A writer for anagram dictionaries

Another capability of traits is to extend other traits. DictWriter extends the standard library's Writer trait to provide a method that writes the anagram dictionary to an output stream.

The DictWriter trait is not parameterized and simply uses the other methods of the extended trait for its implementation. As a result, the implementations for File and IoResult<File> are trivial.

IoResult is the current method for indicating I/O errors or passing results along. It would be better for the method to return an IOResult than to call fail!.

The main program

The main function of the program is uninteresting. In fact, the only virtue it has is to demonstrate that using the new traits on existing types requires only making the new traits and their implementations visible.

Iterating over combinations of letters

The code to run through all of the possible combinations of a given length from a larger array of values is given by each_combination. This function is fundamentally stolen from Python's itertools module.

values are source array of values, r is the length of the subsequences to look at, and fun is a function accepting a reference to a subsequence and doing something useful with it.

This is an internal iterator, which differs from the external iterators supporting Rust's for loop. While external iterators have a number of practical benefits over internal iterators, in this case an external iterator would be significantly slower than the design presented here.

The problem is that the iterator would be iterating over a synthetic collection, the collection of subsequences. Since there is nothing handy to hang the lifetime on, the subsequences returned by the external iterator would need to be newly allocated to preserve memory safety, and given the processing of each subsequence below, the allocation dominates the cost of each iteration.

(If you want to play with the effects, compare the permutation iterator in the Rust standard library for vectors with the internal permutation function found in combinations.rs. The Rust library version will return permutations in a different order to minimize the number of internal copies, and so should be faster.)

The implementation of each_combination has not changed greatly. The only interesting feature is the use of get() and get_mut(), "*indices.get(0)" and "*indices.get_mut(i)", to access vector elements. That used to use square brackets, but I believe there are plans to introduce traits for those operators, allowing that notation to be reintroduced.

I am not at all sure why "*indices.get(0) > max_indices0" requires the dereference but the right-hand-side of "*indices.get_mut(i) = indices.get(i-1) + 1" does not.

The program

Loading the anagram dictionary

Reading the anagram dictionary is largely the inverse of the corresponding operation when creating the dictionary. This function is probably more verbose that it could be.

One interesting feature is the function used to split an incoming line into separate strings. This uses Rust's iterators to good effect, mapping over the strings to get newly allocated words and collecting them into a vector.

At one time, to_string was to_owned and was supposed to be a general method for references to create new, distinct objects.

Arranging the letter-set

The letter-set provided as an argument must be sorted in order to find all of the possible anagrams. Failing to do this for one implementation cost me an inordinate amount of debugging time.

Searching

Searching for possible words requires a nested iteration: the outer loop covers the possible lengths of the subsequences of the letter-set, and the inner loop uses each_combination to cover each possible subsequence of a given length. The body of the inner loop simply looks for the subsequence in the anagram dictionary (which almost always fails) and adds the list of matching words, if any, to a resulting set.

Pay no attention to the commented-out timing lines.

Putting the pieces together

Like the original Letterpress cheating code, this program prints the number of found words, as a sanity check.

Results

For comparison, here are the timings for this program since Rust 0.6 (yeah, missing 0.10). These timings are all made with the same optimization level, same input letter-set, and same number of results found.

Language Program Rust 0.6
Duration
(seconds)
Rust 0.7
Duration
(seconds)
Rust 0.8
Duration
(seconds)
Rust 0.9
Duration
(seconds)
Rust 0.11
Duration
(seconds)
Rust anagrams-hashmap 6.0 35.5 7.2 7.0 6.6

I am working on updating all of the implementations for comparison. You can find the most recent version of the code on Github. Thanks to gist-it for the ability to insert parts of the Github source into this post.

Thursday, June 26, 2014

Quote o' the day: On the use of ducks

Because if your self already had the answer, you wouldn't need the duck, would you?

hypster on rubber-duck debugging.