Letterpress cheating in Rust: Parallelism
Or, how to turn an 8 second runtime into a 27 second runtime by throwing a stack of CPUs at it.
The anagram programs I have been using to explore Rust are CPU bound: they take command line arguments, read a file, do some number of seconds (or minutes) of computation, and print out a number. In other words, in modern terms, they are begging for parallelization1.
After a little thought (very little, as will probably become obvious), there are two ways of doing it:
- Split the anagram dictionary into \(N\) non-overlapping segments and feed those to \(N\) tasks along with the input letters. (Rust supports lightweight threads running under multiple CPUs, so-called "M:N threading".) Each task then runs all of the 30,000,000 queries implied by my standard input letters against its fragment of the dictionary, computing its subset of the anagram words as a LinearSet. Then each task communicates its result set back to the main thread, which accumulates the results and prints the final answer.
- Have each of the \(N\) tasks get its own copy of the complete dictionary. The main task then sends each request computed from the input letters (30,000,000 queries, remember) to one of the tasks, which accumulates a subset of the anagram words and then communicates those back to the main task.
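The first strategy can be sketched in today's Rust (the post's code predates Rust 1.0, so `std::thread` and `std::sync::mpsc` stand in for `spawn`, `stream`, and `SharedChan`). This is a toy, not the author's program: `parallel_count` and `is_anagram` are hypothetical helpers, and the six-word "dictionary" stands in for the real word file.

```rust
use std::collections::HashSet;
use std::sync::mpsc;
use std::thread;

// True when `word` uses exactly the same letters as `letters`.
fn is_anagram(word: &str, letters: &str) -> bool {
    let mut a: Vec<char> = word.chars().collect();
    let mut b: Vec<char> = letters.chars().collect();
    a.sort();
    b.sort();
    a == b
}

// Strategy 1: split the dictionary into (at most) `width` segments, give
// each segment to its own thread, and merge the partial result sets.
fn parallel_count(words: &[&str], letters: &str, width: usize) -> usize {
    let (tx, rx) = mpsc::channel::<HashSet<String>>();
    let chunk = (words.len() + width - 1) / width; // ceiling division
    for segment in words.chunks(chunk.max(1)) {
        let tx = tx.clone(); // one sending end per sub-task
        let letters = letters.to_string(); // each task gets its own copy
        let segment: Vec<String> = segment.iter().map(|w| w.to_string()).collect();
        thread::spawn(move || {
            // Each task searches only its own slice of the dictionary.
            let found: HashSet<String> = segment
                .into_iter()
                .filter(|w| is_anagram(w, letters.as_str()))
                .collect();
            tx.send(found).unwrap(); // ownership of `found` moves to the receiver
        });
    }
    drop(tx); // close the main thread's sender so the receive loop ends
    let mut all = HashSet::new();
    for partial in rx {
        all.extend(partial);
    }
    all.len()
}

fn main() {
    let words = ["cat", "act", "dog", "god", "tac", "pony"];
    println!("{}", parallel_count(&words, "act", 3)); // prints 3
}
```

The merge step in `main` is the same accumulate-and-count that the real program does at the end.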
I'll take a look at the first option here using, inexplicably, the vector/binary-search version of the programs. Here's its main function:
```
fn main() {
    let width = 6;
    let args = os::args();
    if args.len() < 2 { fail!(~"Usage: anagrams letters"); }
    let letters = get_letters(args[1]);
    let (response_port, response_chan)
        : (Port<~LinearSet<~str>>, Chan<~LinearSet<~str>>) = stream();
    let response_chan = SharedChan(response_chan);
    for load_dictionary(width).each |&kv_pair{ keys: keys,
                                               values: values }| {
        let response_chan = response_chan.clone();
        let letters = copy letters;
        do spawn {
            let set = search(letters, keys, values);
            response_chan.send(set);
        }
    }
    let mut set = ~LinearSet::new();
    for uint::range(0, width) |_| {
        let res = response_port.recv();
        for res.each |&word| { set.insert(word); }
    }
    println(fmt!("%u", set.len()));
}
```
width is the number of sub-tasks to spawn, and the next three lines set up the input query letters. The three lines after that build response_port and response_chan, which are used to communicate the sub-tasks' results back to the main task.
The next block does the work of the program. The load_dictionary function, in this program, reads the dictionary file and splits it into width individual keys and values pairs of vectors. (I don't want to talk about the "kv_pair" structure. Let's just pretend it didn't happen.) The next step is to set up each sub-task's copy of response_chan (via the call to clone) and a copy of the input letters. Then it calls spawn with an anonymous function; spawn executes that function in a separate task, where it performs the 30,000,000 lookups and sends back the resulting set.
The reason that letters has to be copied before calling spawn is that it is an owned value, and the anonymous function takes ownership of it when it moves to the new task; without the copy, subsequent tasks wouldn't be able to get their own letters. Or, more correctly, the compiler will make whiny noises, since the type system is watching for exactly that.
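The same rule survives in modern Rust, where `clone` plays the role of `copy`. A minimal sketch (not the author's code): each iteration's `move` closure gets its own clone, so the original owned value is never given away.

```rust
use std::thread;

fn main() {
    let letters = String::from("abcdef"); // an owned value, like the old ~str
    let mut handles = Vec::new();
    for i in 0..3 {
        // Without this clone, the first iteration's `move` closure would take
        // ownership of `letters`, and the compiler would reject the next
        // iteration with "use of moved value: `letters`".
        let letters = letters.clone();
        handles.push(thread::spawn(move || format!("task {}: {}", i, letters)));
    }
    for h in handles {
        println!("{}", h.join().unwrap());
    }
}
```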
How do the sub-tasks report their results back to the main task? That is the purpose of response_port and response_chan. Rust tasks communicate by sending messages, with Chans being the sending end and Ports being the receiving end of the message stream.
Creating that stream is handled by the second block of three lines, fundamentally by the call to core::comms::stream. The basic stream contains messages from exactly one task, destined for exactly one other task. In order to send results from multiple sub-tasks, the Chan end is wrapped in a SharedChan, which adds the clone method used further down. clone duplicates the SharedChan, allowing it to be used by multiple tasks to send messages to a single Port.
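For comparison, modern Rust kept this shape but folded SharedChan into the standard channel: `mpsc::channel` hands back a (Sender, Receiver) pair, and Sender implements Clone directly, so many tasks can feed one receiving end. A small sketch, not from the post's programs:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // channel() plays the role of stream(); Sender::clone plays the role
    // of SharedChan's clone -- many senders, one receiver.
    let (tx, rx) = mpsc::channel();
    for id in 0..4 {
        let tx = tx.clone(); // each sub-task gets its own sending end
        thread::spawn(move || tx.send(id).unwrap());
    }
    drop(tx); // drop the original sender so the receive loop can finish
    let sum: i32 = rx.iter().sum(); // arrival order doesn't matter here
    println!("{}", sum); // 0 + 1 + 2 + 3 = 6
}
```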
Sending the result is handled by response_chan.send(set), where set is a ~LinearSet<~str>; the important part is that set is an owned structure. Before the send, the sub-task owns it; after the send, the main (receiving) task owns it. Receiving is handled by response_port.recv(). The main task merges each received result into its own complete set of results for all tasks and prints out the size of that set.
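That ownership hand-off is still how channels work in modern Rust: send() moves the value, so no copying or sharing is involved. A toy illustration (assumed names, not the post's code):

```rust
use std::collections::HashSet;
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut set = HashSet::new();
        set.insert(String::from("alert")); // a pretend anagram result
        tx.send(set).unwrap();
        // `set` has been moved by send(); touching it here would not compile.
    });
    // recv() blocks until the sub-task's message arrives; the main task
    // now owns the set outright.
    let set = rx.recv().unwrap();
    println!("{}", set.len()); // prints 1
}
```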
So, how well does this work?
Program | Duration (secs) | Speedup (vs. first Python version) |
---|---|---|
anagrams-vectors-wide | 11.75 | 4 |
anagrams-hashmap-wide | 9.38 | 5 |
anagrams-vectors | 7.97 | 6 |
anagrams-hashmap | 5.98 | 8 |
The program we've been discussing is anagrams-vectors-wide; anagrams-hashmap-wide is a LinearMap-based version with identical structure. (Yeah, I don't get to name things.) anagrams-vectors and anagrams-hashmap are the serial versions, and they are indeed faster; the speedup column is measured against the slowest version of the program, the first Python version. Interestingly, if you multiply the time taken by a serial version by 6, the number of parallel tasks, you get roughly the user-plus-system CPU time taken by the corresponding parallel version. Anyway, the performance difference is almost certainly tasking overhead, although I have no idea what exactly is causing it at the moment. But more on that in the next post.
1. My $10 word of the day. I hope I spelled it right; it's not in my dictionary. It's also hard to pronounce.