Exploring Rust
Of the current crop of new programming languages, the one that I've been most interested in is Rust, a "safe, concurrent, practical language" that is intended for "frustrated C++ developers", with features intended to improve "systems" programming currently done in C and C++. Those features are mostly things that have been floating around the programming language community for decades, but have not been applied to a low-level language that also addresses issues like memory layout and garbage-collection avoidance.
The only way to learn a programming language is to write programs using it. Unfortunately, I have only limited natural creativity, so I need a little help in finding programs to write. In this case, the inspiration was Creating and Optimizing a Letterpress Cheating Program in Python, in which Jeff Knupp presents some Python to find all of the words which can be made from a set of letters. (No, it's not really anagrams, but it's close. Yes, I know my naming sucks.)
The first step in the program, and my first step in Rust, is a program to convert the Unix words file into an anagram dictionary. This is my version in Rust; I would greatly appreciate any comments or suggestions. I am currently using Rust 0.5, which is an important point; Rust is changing rapidly. Also, I am not making any claim that this is a particularly good Rust program.
fn main() {
match (file_reader(&Path("/usr/share/dict/words")),
file_writer(&Path("anadict-rust.txt"), [Create,Truncate])) {
(Ok(r), Ok(w)) => { print_dict(w, read_dict(r)); }
(Err(msg), _) => { fail msg; }
(_, Err(msg)) => { fail msg; }
}
}
Beginning at the end, the main function opens a reader on the words file and a writer on a dictionary file. Both of those operations can fail, so the return type for file_reader, for example, is Result<Reader, ~str>; a Reader object implements the interface to read the file while the str represents the error message. In fact, the error message is a ~str, an owned pointer to a dynamically allocated string; a string allocated on the "owned" heap which will be deleted when the (single) owning reference to it goes away. The two variants of the Result are the constructors Ok and Err. (If you are familiar with Haskell or any of the ML languages, Result is a specialization of an Either algebraic data type.)
This function first creates a pair of Results, then uses pattern matching to deconstruct the two results. If both are Ok, the program reads the words file, building a dictionary, and then writes the dictionary out. Otherwise, it fails with an error message. The patterns are tried in order, so if both fail then only the reader message will be produced.
The read_dict function produces a HashMap (on the owned heap) mapping strings to vectors of strings. The value is a list of words made from the same set of letters while the key is the sorted letters. There are some peculiarities here; let me go over the relatively normal processing before getting into them.
fn read_dict(reader : Reader) -> ~HashMap<@~str,@[@~str]> {
let map = ~HashMap(); // [1]
for reader.each_line |line| { // [2]
let line = line.trim();
let length = line.len();
// Original is using pre-strip() line for comparisons
if length >= 2 && length < 19 && line.all(|ch| (char::is_ascii(ch) && char::is_lowercase(ch))) {
let mut chars = str::chars(line); // [3]
std::sort::quick_sort(chars, |a,b| *a <= *b);
let key = str::from_chars(chars);
map.update(@key, @[@line], |old,nw| at_vec::append(old,nw));
}
}
return map;
}
The line marked [1] allocates the Hashmap on the owned heap. The type is inferred, based on the keys and values and the function return type.
The line [2] is a use of Rust's iterator pattern. The function each_line, used as a method on the reader, takes one argument, a function to which is passed the line just read. This function is supposed to return true or false, indicating whether or not the iteration is to continue. However, special syntax and semantics are provided by the for keyword; since each_line has no other arguments, the parentheses are optional, and the function is provided by an anonymous function immediately following each_line. "|line|" is the parameter list for the anonymous function, while the body follows in braces. Another feature of the for keyword is that the anonymous function automatically returns true.
The major processing of the anonymous function filters out words of less than one character or more than 18, or which are made up of non-lowercase ASCII. The key for the HashMap is made by breaking the words into a vector of characters, then sorting and re-forming a string. The value is a vector containing the word, if the map does not already have a matching key, or the combination of the existing list and the new word if it does.
The second argument to the quick_sort function, "|a,b| *a <= *b", is another anonymous function, used to compare characters while sorting. The "mut" keyword on line [3], by the way, marks the chars vector as mutable, allowing it to be sorted.
A tilde indicates a pointer to the owned heap, but an at-sign, @, indicates a pointer into the managed heap. Rust's managed heap is a garbage-collected area local to the one thread. With the current library, most functions such as str::chars and str::from_chars produce owned pointers, a vector of characters and a string, respectively. Also, the HashMap requires its keys and values to be copyable, which owned strings in particular are not. To avoid warnings and excess copying, I am using @~str, a managed pointer to an owned string.
Fortunately, writing the dictionary is simpler. Since it is important that the resulting file is in sorted order, print_dict first gets the sorted keys to the dictionary. For each one, it gets the values, joins them into a string separated by spaces, and writes the formatted line out.
fn print_dict(writer : Writer, dict : &HashMap<@~str,@[@~str]>) {
for sorted_keys(dict).each |key| {
let line = str::connect( dict.get(*key).map(|v| **v), " " );
writer.write_str( fmt!("%s %s\n", **key, line) );
}
}
The * operator in this code dereferences a pointer; the key in the anonymous function is a borrowed (the third type) pointer, pointing to the @~str HashMap key. As a result, "*key" (the argument to the map function, is a @~str, while the value of the anonymous method "|v| **v" is a ~str, a vector of which is the correct argument to str::connect.
Naturally, sorted_keys is not a function from the library.
fn sorted_keys<V:Copy>(dict : &HashMap<@~str,V>) -> ~[@~str] {
let mut keys = vec::with_capacity( dict.size() );
for dict.each_key |key| { keys.push(key); }
std::sort::quick_sort(keys, |a,b| *a <= *b);
return keys;
}
"<V:Copy>" is a type parameter to sorted_keys; V is the type variable, which must satisfy the Copy typeclass, which is in turn needed by the HashMap. This type signature simply indicates that this function does not care about the type of values of the map.
Pattern matching, algebraic data types, anonymous and higher-order functions, type inference, parametric polymorphism, type classes, a confusing variety of memory allocation strategies, a somewhat chaotic library.... Rust is entertaining, at least.
The code is available on Github.