Saturday, July 19, 2014

Letterpress cheating in Rust 0.11.0, part 1

It's been a while. I got busy with work and other things and have not revisited my collection of Rust toys recently. In fact, I skipped Rust 0.10 entirely. (It gets worse: I have about 2/3 of a Letterpress cheating program written in the dependently-typed ATS 1 language that I have not touched in months. I feel bad.)

But I am turning over a new leaf. Or flake of rust, as the case may be. I am going to revisit the programs for Rust 0.11 before 0.12 comes out. I hope.

In any case, instead of simply presenting the differences between the previous and current versions, I am going to show the entire, updated programs. The changes have been significant, with the replacement of '~' and the symbolic vec type by owned boxes and vectors in the standard library, and with a serious rearrangement of the standard library itself.

To remind you, this started out as a port of Jeff Knupp's Creating and Optimizing a Letterpress Cheating Program in Python as a way of learning Rust, back in the 0.5 days. Since then, the project has gotten out of hand, since I am also interested in performance, in exploring more esoteric parts of Rust, and have used it as a way to explore other languages. The release of Rust 0.9 resulted in the most recent update, back in January.

Creating the anagram dictionary

The first required step is to create an anagram dictionary, a file which maps a sorted collection of letters to the list of words that can be made from those letters. The format of the dictionary file is one mapping per line, where each line begins with the letter collection, separated by a space from the list of space-separated words. The following program, presented in its entirety for your viewing pleasure, implements that task.

In main, the program reads a list of words, one per line, from the ever-useful /usr/share/dict/words file. Then it creates a new file for the dictionary and writes it out.

Rust supports parametric polymorphism, along with a typeclass/interface feature called "traits". In the phrase, read_dict<R:Reader>, R is a type variable acting as a compile-time, type-level parameter to the read_dict function; and Reader is a trait which acts as a constraint on R. R is in turn used in the phrase, reader : &mut BufferedReader<R>, which indicates that the reader parameter should be a borrowed reference to a BufferedReader, a structure providing the ability to read buffered, line-oriented input from the bare Reader input stream.

A HashMap is a structure implementing a hash map. In this instance the type parameters of the HashMap are the key, String, and the value, a Vec<String>. A String is a structure with an owned, dynamically-allocated, box containing a UTF-8 character string; String replaces the previous versions' ~str. The String is therefore resizable. A Vec<T> is a similar structure containing an owned vector of T's; in this case Strings. Vec<T> replaces the previous ~[T].

read_dict reads each word from the input, checks that it matches the requirements of the original Letterpress cheating program, creates a sorted vector of its characters, and inserts the word into the dictionary, adding it to an existing vector value if possible and adding a new vector if not.

print_dict performs the inverse, output operation, by getting the list of keys from the dictionary, sorting them, and then iterating over them to print the key and the associated list of words on each line.

An alternative: Rust traits

Rust presents an interesting alternative to the bare-bones, imperative style of mk_anadict. Rust supports traits, which are somewhat similar to Haskell's typeclasses and vaguely related to Java's interfaces. A trait is a definition of a set of function (or method) declarations and definitions implemented by a type. Traits interact closely with Rust's type system and its support for generic code and type variables, by providing bounds on the types that can appear in the given positions. For more information, check out the Rust Tutorial and Reference Manual.

Traits also allow an existing type to be extended with new behavior. Unlike most object-oriented languages, the definitions of the traits and their methods that are available on a type are separate from the definition of the type itself. As a result, it is possible to define a new trait and provide an implementation of it for a separate, previously existing type. As a result, existing Rust types can be extended with the behaviors needed to create the anagram dictionary.

A dictionary with ordered keys

The Rust HashMap can return the keys contained in the map, but cannot do so in, say, alphabetical order. SortedKeys is a trait intended to help.

SortedKeys has a type parameter, K, with the constraint that K be ordered. It describes a method, sorted_keys, which can be called on the object for which the trait is implemented and returns a (sorted) list of K's. An implementation is provided for Rust's HashMap type, which is parameterized over the types of keys and values; the HashMap requires the other constraints on K. The implementation of sorted_keys uses the HashMap's methods and other library functions to retrieve the keys and sort them.

A writer for anagram dictionaries

Another capability of traits is to extend other traits. DictWriter extends the standard library's Writer trait to provide a method that writes the anagram dictionary to an output stream.

The DictWriter trait is not parameterized and simply uses the other methods of the extended trait for its implementation. As a result, the implementations for File and IoResult<File> are trivial.

IoResult is the current method for indicating I/O errors or passing results along. It would be better for the method to return an IOResult than to call fail!.

The main program

The main function of the program is uninteresting. In fact, the only virtue it has is to demonstrate that using the new traits on existing types requires only making the new traits and their implementations visible.

Iterating over combinations of letters

The code to run through all of the possible combinations of a given length from a larger array of values is given by each_combination. This function is fundamentally stolen from Python's itertools module.

values are source array of values, r is the length of the subsequences to look at, and fun is a function accepting a reference to a subsequence and doing something useful with it.

This is an internal iterator, which differs from the external iterators supporting Rust's for loop. While external iterators have a number of practical benefits over internal iterators, in this case an external iterator would be significantly slower than the design presented here.

The problem is that the iterator would be iterating over a synthetic collection, the collection of subsequences. Since there is nothing handy to hang the lifetime on, the subsequences returned by the external iterator would need to be newly allocated to preserve memory safety, and given the processing of each subsequence below, the allocation dominates the cost of each iteration.

(If you want to play with the effects, compare the permutation iterator in the Rust standard library for vectors with the internal permutation function found in combinations.rs. The Rust library version will return permutations in a different order to minimize the number of internal copies, and so should be faster.)

The implementation of each_combination has not changed greatly. The only interesting feature is the use of get() and get_mut(), "*indices.get(0)" and "*indices.get_mut(i)", to access vector elements. That used to use square brackets, but I believe there are plans to introduce traits for those operators, allowing that notation to be reintroduced.

I am not at all sure why "*indices.get(0) > max_indices0" requires the dereference but the right-hand-side of "*indices.get_mut(i) = indices.get(i-1) + 1" does not.

The program

Loading the anagram dictionary

Reading the anagram dictionary is largely the inverse of the corresponding operation when creating the dictionary. This function is probably more verbose that it could be.

One interesting feature is the function used to split an incoming line into separate strings. This uses Rust's iterators to good effect, mapping over the strings to get newly allocated words and collecting them into a vector.

At one time, to_string was to_owned and was supposed to be a general method for references to create new, distinct objects.

Arranging the letter-set

The letter-set provided as an argument must be sorted in order to find all of the possible anagrams. Failing to do this for one implementation cost me an inordinate amount of debugging time.

Searching

Searching for possible words requires a nested iteration: the outer loop covers the possible lengths of the subsequences of the letter-set, and the inner loop uses each_combination to cover each possible subsequence of a given length. The body of the inner loop simply looks for the subsequence in the anagram dictionary (which almost always fails) and adds the list of matching words, if any, to a resulting set.

Pay no attention to the commented-out timing lines.

Putting the pieces together

Like the original Letterpress cheating code, this program prints the number of found words, as a sanity check.

Results

For comparison, here are the timings for this program since Rust 0.6 (yeah, missing 0.10). These timings are all made with the same optimization level, same input letter-set, and same number of results found.

Language Program Rust 0.6
Duration
(seconds)
Rust 0.7
Duration
(seconds)
Rust 0.8
Duration
(seconds)
Rust 0.9
Duration
(seconds)
Rust 0.11
Duration
(seconds)
Rust anagrams-hashmap 6.0 35.5 7.2 7.0 6.6

I am working on updating all of the implementations for comparison. You can find the most recent version of the code on Github. Thanks to gist-it for the ability to insert parts of the Github source into this post.

Thursday, June 26, 2014

Quote o' the day: On the use of ducks

Because if your self already had the answer, you wouldn't need the duck, would you?

hypster on rubber-duck debugging.

Saturday, June 7, 2014

If you have a problem and you think, I'll use sed...

If you have a problem and you think, "I'll use sed," you now have two problems. One is that you know too much about sed. The second is your original problem, because unless you really know too much about sed, it does not actually do anything.

Recently, I discovered the need to translate a Ruby on Rails module using Builder::XmlMarkup into a small collection of FreeMarker templates. (SOAP requests, to be specific. Isn't my life great?) Considering that the basic structure of the Ruby code was identical to the XML that I needed, I decided to fire up my old sed skills for a line-by-line translation. This is the script I came up with:

sed \
    -e 's,xm.\([^(]*\)(\(.*\)).*$,<\1>${(\2)!},'                   \
    -e 's/nvl(\([^,]*\), *.\([^)]*\)\x27)/\1.\2/'                  \
    -e '/ do *$/{s/xm.\([^ ]*\).*/\1/; h; s/\( *\)\(.*\)/\1<\2>/}' \
    -e '/end/{ g; s/\( *\)\(.*\)/\1<\/\2>/}'

If that is not sanity-blasting enough, I will now explain what it does. Hide the impressionable youngsters and small pets.

xm is the Builder::XmlMarkup object; a call such as xm.VIN(...) produces XML of the form <VIN>...</VIN>. Therefore, the first expression converts a line like xm.VIN(...) # argh! into <VIN>${(...)!}</VIN> complete with the FreeMarker expansion syntax. The first capturing group is the XML tag and the second is the argument text which becomes the content of the tag.

The contents of many of the tags are properties of objects. However, the object may not actually exist when the Ruby code is used; therefore it used a function nvl(X, 'Y') to evaluate X.Y safely even if X is nil. (I do not know what nvl stands for. "No Value L'here" maybe.) The second sed expression captures the two arguments to nvl and translates them to X.Y directly. In this case, the FreeMarker syntax ${(...)!} will expand to the right side of the ! operator if anything in the ellipsis is invalid and the right side of this is "", the same thing as produced by nvl. Because the second argument to nvl is a string enclosed by single quotes which I am also using in my sed script, I had to use the ASCII hex value 0x27 for the closing single quote in the pattern.

According to the Builder::XmlMarkup documentation, "Any method with a block will be treated as an XML markup tag with nested markup in the block." xm.VEH_INFO do\n...\nend is such a block, creating an XML tag with nested content. Unfortunately, the do was on one line and the end was on a subsequent one. Fortunately, sed has a "hold space" to temporarily store text to be used on a subsequent line. (Learned something new about sed, eh? Don't say I didn't warn you.) The third expression finds do lines and translates them into the bare tag; then uses the h sed command to place the bare tag in hold space and translates the bare tag still in the pattern space into an XML open tag, complete with initial whitespace. Thus xm.VEH_INFO do becomes <VEH_INFO>.

The fourth expression matches the end of the Ruby block and uses the g sed command to replace the end line with the hold space and translates that into an XML close tag, changing end into </VEH_INFO>. This only works for a single nested block, but since I was using this script from within Emacs to produce the templates while I was also noting what needed to be present in the environment for template expansion, working in chunks was eminently feasible.

End result: Success! Yay! Go me!

I'm going to go wash my brain out with gin and tonic now.

Monday, April 21, 2014

Quote o' the day: If that's in my future I may be in trouble

A fortune I recently received:

If that tells me anything about my future, I think I want to stay now.

Saturday, April 19, 2014

Best of mcguire: ruthless simplicity

A few days ago, the article Boring Systems Build Badass Businesses appeared on Hacker News. In one comment, idlewan wrote, "It's not about the end result, it's about using better tools to get to it," arguing the case for specialized tools, in this case some CSS compiler. In response, mcguire wrote:

How old is your CSS compiler?

After gaining a fair amount of experience with various technologies over the years (by which I mean, getting burned (especially by things that seem wonderful and then cease to exist for a variety of reasons or change their nature drastically)), that has become the single most important question I ask about a new technology. If it's been around for ten years, it'll probably still be around in ten years. If it hasn't then it may not, especially in a form recognizable as related to today's.

C and Java are, well, not horrible, but not good. However, C has been C for my entire career. And I'm fairly sure that if I write something in simple, plain Java, in five or ten years when the whatzit needs some significant work then that work won't start with a complete rewrite.

(or at least he would have, if he'd bothered to reread and perhaps edit his comment.)

Faced with a response of, "...as long as you are satisfied with the current feature set, the code isn't going to disappear...I'll still be able to use it 5 years down the road...", he responded:

I was merely using the CSS compiler as an example. For something like that, just put a copy in your source repository and forget about it—in the worst case, it'll be fine for a couple of generations of browsers, until the "best current practices" have gone beyond deprecation. Hopefully, you won't mind throwing away your proto-CSS source at that point.

In the mean time, I'm sitting here looking at a JRuby on Rails application using JRuby 1.1.4 and a suitably ancient version of Rails (2.2, maybe?). This application has been in production for roughly five years and has received essentially no major love during that period; it Just WorkedTM. (Certainly a monument to the skill of the original author.) However, the poor sod who was responsible for it and our one other Rails application, our one and only Rails developer, finally managed to move on to other things.

At this point, security issues and (more likely) the inability to slather on new features have percolated up the chain of command and it has been agreed that Something Must Be Done. Since the upgrade path seems to recapitulate the phylogeny of Rails, our options are to rewrite the applications in a modern Rails and kick the can down the street, hoping it'll work better this time, or rewrite it using our common tools around here (and because I've got something to do with it, in the simplest manner possible). (And against the usual "never rewrite anything, ever" meme—which isn't an option—that project is going reasonably well.)

As for C, yes, it's feature set has been more-or-less frozen, which is good. It's environment is not. I was reasonably happy with the first version of gcc I used; call it 1.37 or so. Do you think you could use gcc 1.37 today?

* * * * *

Oh, and on a New York Times article on the death of Gabriel García Márquez, mcguire commented:

In his novels and stories, storms rage for years, flowers drift from the skies, tyrants survive for centuries, priests levitate and corpses fail to decompose. And, more plausibly, lovers rekindle their passion after a half-century apart."

"Less plausibly"! "Less plausibly"!

Geeze. Someone needs to teach these goobs to write.

But perhaps he's been reading too much Myles na gCopaleen lately.