Letterpress cheating in Rust: Parallelism part 2

Posted on April 16, 2013 by Tommy McGuire

Last time I wrote, "Or, how to turn an 8 second runtime to a 27 second runtime by throwing a stack of CPU's at it", but I did not actually describe that case. It turns out that the second option for parallelizing the anagram program is the way to do that. As a reminder, the second option is:

Have each of the \(N\) tasks get its own copy of the complete dictionary. The main task then sends each request computed from the input letters (30,000,000 queries, remember) to one of the tasks, which accumulates a subset of the anagram words and then communicates those back to the main task.

The code for this version is somewhat similar to that for option 1, so rather than presenting the whole main function I am just showing the key bits.

The first block creates the worker tasks and sets up streams to allow the master task to send requests.


    let mut request_chans : ~[Chan<~[~[int]]>] = ~[];
    for uint::range(0, width) |_| {
        let (request_port,request_chan) = stream();
        request_chans.push(request_chan);
        // Set up and start worker task
        let response_chan = response_chan.clone();
        do spawn {
            let (keys,values) = load_dictionary();
            response_chan.send( search(keys, values, &request_port) );
        }
    }

As a reminder, Rust tasks communicate only by sending and receiving messages; a Port is the sending side and a Chan is the receiving side. In this code, the main task creates an empty vector of Chan's and enters a loop where it

creates a stream,
stores the Chan for the main task's use,
duplicates the Chan used to report results from the worker task to the master, and
finally spawns a worker task.

The worker task loads a copy of the keys and values vectors (remember that the "vectors" version of the program uses binary search rather than a hash table to store and look-up matching letter sequences), and calls a search function to perform the useful work. When done, the results are returned to the master task via the response_chan.

The search function looks like this:


fn search(keys         : &[~[int]],
          values       : &[~[~str]],
          request_port : &Port<~[~[int]]>) -> ~LinearSet<~str> {
    let klen    = keys.len();
    let mut set = ~LinearSet::new();
    loop {
        let key_set = request_port.recv();
        if key_set.len() == 0 { break; }
        for key_set.each |key| {
            let j = bisect::bisect_left_ref(keys, key, 0, klen);
            if j < klen && keys[j] == *key {
                for values[j].each |&word| { set.insert(copy word); }
            }
        }
    }
    return set;
}

Notice that the type of request_port is a borrowed pointer to Port<~[~[int]]>: a Port on which the code can receive vectors of vectors of ints. The reason for the levels of vectors is simple: the key used for each look-up is a vector of ints, created from a string of letters. Unfortunately, sending each key individually (all 33,000,000 of them) would result in an enormous amount of communications overhead; this code receives a collection of keys (the badly named key_set) to look up at once. The master task signals that the worker should complete by sending an empty key_set.

The master task hands out requests to the workers with the following code:


    let mut t = 0;
    let mut key_set = ~[];
    for uint::range(2,letters.len() + 1) |i| {
        for combinations::each_combination(letters,i) |combo| {
            key_set.push( vec::from_slice(combo) );
            if key_set.len() >= depth {
                let mut ks = ~[];
                ks <-> key_set;
                request_chans[t].send(ks);
                t = (t + 1) % width;
            }
        }
    }
    if !key_set.is_empty() { request_chans[t].send(key_set); }
    for request_chans.each |chan| { chan.send(~[]) };

The variable t is used to identify which worker should receive a request; this code allocates the work round-robin style. It also accumulates depth keys into each request.

The dance in the innermost block is needed to avoid having key_set losing its value on each iteration. The send call transfers ownership of the message to the receiving task, so chan.send(key_set) would leave key_set without any meaningful value. This code instead creates a new variable, ks, containing an empty vector. ks and key_set then swap values, resetting key_set to the empty vector, and ks is used to send the request to the worker—ks is not used again, so there is no problem with it losing ownership while key_set is used again, on the next iteration.

The second-to-last line sends any remaining keys to the next worker thread, and the final line sends the empty request to inform the workers that no further requests will be made.

How well does this work?

Language	Program	Time (sec)	Speedup
Python	alternatives/presser_one.py	49.12	1
Rust	anagrams-vectors-tasks	27.13	2
Python	alternatives/presser_two.py	12.84	4
Rust	anagrams-vectors-wide	11.75	4
Rust	anagrams-hashmap-wide	9.38	5
C	alternatives/anagrams-vectors	8.05	6
Rust	anagrams-vectors	7.97	6
Rust	anagrams-hashmap	5.98	8
Python	alternatives/presser_three.py	5.94	8
C	alternatives/anagrams-hash	0.94	52

The program here, anagrams-vectors-tasks, takes about 27 seconds, making it the second-slowest of the current crop and much worse than the 8 seconds used by the sequential version of the same algorithm. Unlike the other parallel versions, anagrams-vectors-wide and anagrams-hashmap-wide, it also does not keep the CPU's busy; rather than seeing 100% utilization of five or six CPUs, I see 40 or 50%.

I believe Rust's send and receive implement a rendezvous: if a sender is not waiting with a message, the receiver blocks until a sender comes along; if a receiver is not waiting, the sender blocks. This is good in some cases, where it makes some optimizations possible. For example, it is possible to avoid invoking the thread scheduler if the sending thread can change its execution context and continue as the receiving thread. On the other hand, a rendezvous is bad here, because after the master task assigns a group of keys to the last worker task, it wraps around to send the next request to the first worker and must wait until the first worker task is finished with the previous group of keys, at which time that worker will call recv.