Rust Iterators

Posted on February 21, 2013 by Tommy McGuire

Labels: notation, rust, programming language

UPDATE 31 JULY 2013: THIS POST IS OBSOLETE. As of 0.7, Rust is moving to external iterators, rather than the "internal iterators" described in this post. Welcome to Rust language development. And if you're wondering why I haven't updated this post, I am still working on a version of each_combination (and each_permutation, in the vec module; I cannot seem to find a way to do it that both implements the iterator trait and does not allocate a new vector for each combination or permutation.

In my last post, I described my first Rust program, a simple utility to build an anagram dictionary based on part of Jeff Knupp's Python Letterpress cheating program. This time, I would like to take a closer look at a slightly unusual part of Rust by presenting an implementation of part of Python's itertools module that is needed by the remainder of the anagram-esqe program.

In order to find all the words that can be built from a set of characters, Jeff uses combinations, a function which "[returns] r length subsequences of elements from the input iterable. Combinations are emitted in lexicographic sort order. So, if the input iterable is sorted, the combination tuples will be produced in sorted order." According to the documentation, combinations('ABCD', 2) produces "AB", "AC", "AD", "BC", "BD", "CD". The Rust standard library doesn't include anything like this function.

What Rust does have is a style of iterators based on higher-order functions. For example, to iterate across all the elements of a vector, the function vec::each takes a vector and a function to be applied to each element: each(v, |elt| { frob(elt) }) (or alternatively, the "method" version is an element of a typeclass with an implementation for vectors, allowing it to be called as v.each(|elt| { frob(elt) })). Now, the way this function works is that the function argument should return true or false after processing each element; if it returns false, the iteration terminates. In other words, the frob function should return a boolean in those examples.

Rust provides a couple of special syntactic forms that make these higher-order functions easier to use: do and for expressions. The syntax for the for expression is:

"for" expr [ '|' ident_list '|' ] ? '{' block '}' ;

The do expression differs only in the keyword. do is the simpler of the two; to illustrate, the Rust manual describes the following two expressions as identical:


f(|j| g(j));

do f |j| {
    g(j);
}

The higher-order function is f, j is the parameter to the anonymous function, which in turn calls g with j. One feature of this syntactic form is that a return expression in the body of the anonymous function returns from the function enclosing the do expression, not the inner anonymous function. Likewise, break and loop are handled specially, allowing the expression to behave more like a control structure. (The latter is similar to continue in many other languages, in this use.)

The for expression is similar, but more specialized. In addition to the specialized handling of return, the anonymous function forming the body of the for expression implicitly returns true, continuing the iteration. The Rust tutorial provides this example:


for each([2, 4, 8, 5, 16]) |n| {
    if *n % 2 != 0 {
        println("found odd number!");
        break;
    }
}

The signature of each is fn each<'r,T>(v: &'r [T], f: &fn(&'r T) -> bool); v is the vector argument and f is the function. (For the moment, ignore the "&'r" bit.)

How does this interact with combinations? How about an each-like function that takes arguments describing what and how many to combine, and a function to apply to each combination as it is generated? Without too much introduction, this is what I came up with:


pub pure fn each_combination<T:Copy>(values : &[T],
                                     r : uint,
                                     fun : &fn(combo : &[T]) -> bool) {
    let length = values.len();
    if r == 0 || r > length { return; }
    let max_indices0 = length - r;
    let mut indices = vec::from_fn(r, |i| i);
    let mut combination = vec::from_fn(r, |i| values[i]);
    loop {
        if !fun(combination) { return; }
        let mut i = r - 1;
        indices[i] += 1;
        while i > 0 && indices[i] > max_indices0 + i {
            i -= 1;
            indices[i] += 1;
        }
        if indices[0] > max_indices0 { break; }
        combination[i] = values[indices[i]];
        for uint::range(i + 1, r) |i| {
            indices[i] = indices[i-1] + 1;
            combination[i] = values[indices[i]];
        }
    }
}

This function first tests whether any iterations are needed, then creates a pair of vectors to handle those needed. indices is a vector of integers, the size of the combinations that are needed, containing the indices into the values argument for the current combination. The combination vector contains the r elements from values (specifically, copies of them) for the current combination. Specifially, the loop tries to maintain combination[i] = values[indices[i], for i in 0 to r-1. The contents of the indices vector are incremented lexicographically, with the first, while, inner loop locating the right-most index to be incremented and the second, for loop, fixing up the subsequent indices as well as the combination vector. The overall loop terminates when the left-most index, indices[0] is too large to allow the remainder of indices to fit into values. Or when fun returns false after being invoked with a combination.

There are a couple of minor points to notice about this function. First, it is marked pure, indicating that it only modifies data owned by its own stack frame; meaning that aside from internal manipulations, it is (ideally) referentially transparent. Second, notice that indices, combination, and the helper i are declared mut, meaning that these variables are mutable; otherwise variables in Rust are immutable.

Third, the type parameter of each_combination, T. Because elements from values are copied into combination, T must be copyable, indicated by the Copy typeclass. Requiring that, in general, seems sub-optimal, so I created a second, much shorter, function that uses each_combination but does not require T to be copyable.


pub pure fn each_combination_ref<'v,T>(values : &'v [T],
                                    r : uint,
                                    fun : &fn(combo : &[&'v T]) -> bool) {
    each_combination(vec::from_fn(values.len(), |i| &values[i]), r, fun);
}

This function creates a vector containing borrowed pointers to the elements of the original values, then calls each_combination on it. Since the borrowed pointers are being copied, not the contents of the vector, T no longer needs to be copyable and as a side effect (if T is large enough, say) this function may be more efficient. The function parameter must be different, of course, in that it acts on a borrowed pointer to a vector containing borrowed pointers to the value elements. That is where Rust's type system gets interesting.

Check out the type of values and the type of combo, the parameter to the function fun. The type of values is &'v [T], indicating that it is a borrowed pointer to a vector containing T's, with a lifetime variable of v. The type of combo is &[&'v T], which is a borrowed pointer to a vector containing borrowed pointers to T's, here the inner borrowed pointers have the same lifetime variable, v. [Note: the syntax for lifetime variables will be changing in Rust 0.6.]

Those types are telling the Rust compiler that the contents of combo can be used anywhere that doesn't live any longer than values. Equivalent C code could easily leak the pointers into values, leaving them dangling if that vector were freed.

Niko Matsakis described the for loop protocol in detail, along with a significant problem and proposed solution: Revised for loop protocol.

Update 4/5/2013: Updated to use rust 0.6 lifetime syntax. "&v/T" became "&'v T". Also, the library modules have the lifetime, "'v", as the first of the type parameters as shown above, although that is not mentioned in the manual or RELEASES.txt.