Reimplementing ashurbanipal.web in Rust

Posted on July 14, 2015 by Tommy McGuire

Labels: ashurbanipal, software development, rust

Ashurbanipal is a project that I have been working on recently. The idea is to build a text recommendation engine based on the contents of the text, not on reviews, other purchases, or any other external data. I have an initial prototype up and running, based on 22,000-ish books on the Project Gutenberg 2010 DVD, at dpg.crsr.net. To use the prototype, go to the page, find a book that you know you like (using the search field at the upper left), and get style-based, topic-based, and combined recommendations for books which are in some way similar to your choice.

The Ashurbanipal project itself is written in Java, because most of the natural-language processing tools I have seen are written in Java. Also, I have been mostly using the language for the last few years, so I have the active skill-set. As a result, when I went to write the front-end interface for dpg.crsr.net, I wrote the web services that supply the data as Java servlets (actually, JSR-311 Java rest-y things) and ran it under Tomcat. The problem I am having at the moment is that I'm cheap.

The host I'm using is the second-smallest droplet from Digital Ocean, which is twice as expensive as the smallest droplet. Unfortunately, the important difference between the droplets is that one has 512MB of memory and the other has 1GB. Now, the data that the server uses are themselves pretty big:

The metadata for each text is about 6MB,
the stylometric data used for style recommendations is 21MB, and
the topic data for topic recommendations is 26MB.

Those are all in one-record-per-line, tab-separated text files. The actual memory use once they have been parsed into the server's data structures are likely larger. That, plus the footprint of Java and the memory management overhead means I've given the Tomcat process a 512MB heap size. It works well. But I really would like to go down to the $5/month smaller droplet. Because I'm cheap.

So I figured, what the heck, all I really need is a HTTP server library; the request processing is pretty simple. What happens if I use a lower-level language to write the server, without a garbage collector? Rust came to mind pretty quickly. For one thing, the language itself has recently stabilized, so I'm not too concerned about having to rewrite my "production" (ahem) code. (On the other hand the standard library has not gotten to the same level of stability; in practice that means that some features in the library are not available in the production builds of the Rust compiler.)

Rust has an excellent community scurrying around writing modules for the language, including the hyper HTTP library and several higher-level, lightweight web application frameworks built on top of it. A few minutes testing revealed that Rustful was somewhat faster than what seems to be the other contenders, Iron and Nickel. (Maybe. I dunno. YMMV. Don't shoot the messenger. I based this on the hello-world apps for each. Really.)

The results are pretty decent. On my 64-bit laptop, the application server currently looks like it consumes around 150M of memory while running and under load and the latency and throughput seem higher based on my completely unscientific tests. I haven't yet swapped out the application server, since I'm still validating the results. (The recommendations return almost exactly the same data, modulo floating point differences, but the book lookup produces slightly different results for reasons I will explain later.)

The rest of this post will walk through some parts of the Rust code for your edification and amusement.

Recommendations

The basic interface for the recommendation code is pretty simple. I have a type alias for Project Gutenberg etext numbers and another for the floating point scores used for recommendations.


pub type Etext = usize;
pub type Score = f64;

pub trait Recommendation : Sync {
    /// Return a vector of (etext number, score) pairs if possible.
    /// The vector will be sorted by etext_number.
    fn scored_results(&self, etext_no : Etext) -> Option<Vec<(Etext,Score)>>;

    /// Return a vector of (etext number, score) pairs if possible,
    /// sorted by score.
    fn sorted_results(&self, etext_no : Etext) -> Option<Vec<(Etext,Score)>> {
        self.scored_results(etext_no).map( |mut results| {
            results.sort_by( |&(_,l),&(_,r)| l.partial_cmp(&r).unwrap() );
            results
        })
    }
}

The Recommendation interaface requires an implementation of one method, scored_results, which returns a list of etext-number, score pairs, sorted by etext-number. It provides an additional utility function sorting those results by score. (Because floating point numbers are not totally ordered (NaN is not a number), I needed to use the partial_cmp method from the partial ordering trait. But, since the scores are never NaN, I do not worry about the partial aspect and treat it as a total ordering.

(I am speaking about the line,


l.partial_cmp(&r).unwrap()

l and r are the scores from the pairs being compared, which are floating point numbers. They implement the method


fn partial_cmp(&self, other: &Rhs) -> Option;

from the trait PartialOrd. This method returns an Ordering (one of Less, Equal, or Greater) wrapped in an Option since the two values may not be comparable. The unwrap method discards the Option, returning the Ordering value if it is there and causing the process to fail if it is not. This isn't the best style, but I don't have NaNs in my data so I am going with it for now.)

Three data structures implement the Recommendation trait: Style, Topic, and Combination. Style uses a vector of floating-point values to represent the location of each text in "style space", and uses the Euclidian distance between two text to measure they stylistic similarity.


type Proportion = f64;

pub struct Style {
    data           : Vec<Vec<Proportion>>,
    etext_to_index : HashMap<Etext,usize>,
    index_to_etext : Vec<Etext>,
}

For speed in processing, the vectors of data are stored in a matrix, requiring two mappings to translate etext-numbers to row indices.

The method that reads the data into the Style structure illustrates several features about Rust:¹


    pub fn read<P : AsRef<Path>>(path: P) -> Style {
        let (etexts, vectors) : (Vec<Etext>,Vec<Vec<Score>>) =
            BufReader::new( File::open(path).unwrap() ).lines()
            .map( |line| {
                let line                 = line.unwrap();
                let mut elements         = line.split('\t');
                // The first element of each line is the etext number.
                let etext_no: Etext      = elements.next().unwrap().parse().unwrap();
                // The remaining elements are part-of-speech data for the etext.
                let etext_data: Vec = elements.map( |s| s.parse().unwrap() ).collect();
                (etext_no, etext_data)
            } ).unzip();

        Style {
            data           : vectors,
            // Create the mappings from vector index to etext number, and vice versa.
            etext_to_index : etexts.iter()
                // duplicate etext_nos
                .cloned()
                // associate each etext_no with a row number
                .enumerate()
                // flip the pair, to map each etext_no to a row number
                .map(|(x,y)| (y,x))
                // collect into hashmap
                .collect(),
            index_to_etext : etexts,
        }
    }

The first is that it is very type-happy. Not necessarily in having a lot of type annotations; it's type inference algorithm usually works pretty well. But rather in that it uses types for fanciness. For example, this method is polymorphic in it's argument: it will accept anything that can be converted into a file system path. One example of such is a string, but there are others, all of which are magically transformed appropriately when they are passed in.

The second is that Rust's standard library is very fond of iterators. The first half of this method reads each line in turn from the input file, splits the line on each tab character, grabs the first element of the line as its etext-number, and collects the rest of the line as the vector of floating-point stylometric data. The results for each line is an etext-number, vector pair; the iterator being used to read the lines is unzipped to convert the iterator of pairs into a pair of collections, the vector of etext numbers mapping row indices to etexts and the vector of vectors of data. The second half of the method builds the Style data structure containing the data, the index-to-etext mapping, and a map of etext-numbers to indices.

Did I mention that Rust is type-happy? The calls to unzip and collect are another brilliant example. They convert an iterator into a collection or pair of collections, but they know how to do that because the compiler knows what the types of the results are supposed to be. Any collection implementing the trait FromIterator can be used at the receiver of collect, including Vec and HashMaps (which require an iterator that yields (key, value) pairs).

Computing the scored results are another example of the utility of iterators, in this case over the collection of all etexts' stylometric data and across each etext's data.


    fn scored_results(&self, etext_no : Etext) -> Option<Vec<(Etext,Score)>> {

        let row = match self.etext_to_index.get(&etext_no) {
            None      => return None,
            Some(idx) => &self.data[*idx],
        };

        let x = self.data.iter()
            // Compute the distance from row to v.
            .map( |v| distance(v,row) )
            // Associated each distance with its index.
            .enumerate()
            // Replace the index with the etext number.
            .map( |(i,d)| (self.index_to_etext[i], d) )
            // Create the result vector.
            .collect();

        Some(x)
    }

fn distance(v1 : &Vec<Proportion>, v2 : &Vec<Proportion>) -> Score {
    assert_eq!(v1.len(), v2.len());
    let sq = v1.iter()
        // Match each element with that from the other vector.
        .zip( v2.iter() )
        // Compute (elt1 - elt2)^2.
        .map( |(x,y)| Score::powi(x-y,2) )
        // Accumulate the value.
        .fold(0 as Score, Add::add);
    Score::sqrt(sq)
}

In the first part, this method returns None if the supplied etext number cannot be found. Otherwise, it identifies the row of data associated with the etext. In the second part, it goes over the rows in the data, computing the distance between the given row and each in turn. The map on the iterator arranges to have each etext number associated with the corresponding score and the result in returned. The function distance uses its own iterators to compute the Euclidian distance between two rows. (Note the use of Add::add in the fold; I am not actually sure how that works, but Add is the trait used to implement the '+' operation. Nifty.)

The implementation of the topic recommendations is very, very similar to this, except that it uses bitsets for data (and I had to write my own, quick, bitset implementation because that in the standard library is unstable and thus unusable) and the Jaccard distance between rows.

Combined recommendations are even simpler than style or topics; it uses the Style and Topic structures to return the individual recommendations, then it merges the two vectors (that are sorted by etext number, remember) by multiplying the two scores to compute the combination recommendation.

¹ I have used the unwrap method many times in this parsing code; doing so is safe because (a) I control the input data files as well as the code, (b) this parsing happens once, as the server is starting, so blowing chunks all over the floor is a fine response, and I don't have a better option. (Option in the standard library includes an expect method, which would allow me to supply a better error message, but it is currently unstable.²) Live with it.

² I have no idea where this is coming from; I blame someone borrowing my fingers to write it without permission. As /u/steveklabnik1 and /u/dbaupp point out over on Reddit, Option does have expect(), it is stable, and I (or someone who looks like me) may or may not be thinking about Result::expect, which doesn't exist in 1.1 stable at all but should soon. I'm going back to bed now.³

³ Ok, so after fooling around a bit more I have found a couple things I don't like about .unwrap(), .expect(), and try!.

In the first place, unwrap and expect both invoke panic! in Rust library code so that you get error messages like,


thread '<main>' panicked at 'missing etext number', /home/rustbuild/src/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/option.rs:330

Now, for this project panicking is the appropriate response here and an error message like that is fine, but, still, the line number in libcore is causes a certain chafing. On the other hand, actual, real error handling would be far too much work. Oh, and try! returns a Result, but when invoked inside a closure returns it from the closure rather than the enclosing function. (I think. The compiler's message is pretty unpleasant after midnight.) And converting a Vec<Option<T>> to an Option<Vec<T>> is tedious. I'm going to have to think about that one.

So I broke down and created some macros:


#[macro_export]
macro_rules! unpack {
    ($e:expr,$m:expr) => ( match $e {
        Ok(v) => v,
        Err(e) => panic!(format!("{}: {}", $m, e.to_string())),
        })
}

#[macro_export]
macro_rules! expect {
    ($e:expr,$m:expr) => ( match $e { Some(e) => e, None => panic!($m), })
}

#[macro_export]
macro_rules! panic_unless {
    ($m:expr,option: $e:expr) => ( match $e { Some(v) => v,
                                              None => panic!($m),
    } );
    ($m:expr,result: $e:expr) => ( match $e { Ok(v) => v,
                                              Err(e) => panic!(format!("{}: {}",
                                                                       $m,
                                                                       e.to_string())),
    } )
}

The first two have actually been replaced by the third; they do essentially the same thing as .expect() but report the line where the macro was invoked. So, the Style constructor and data parsing function now looks like:


    pub fn read<P : AsRef<Path>>(path : P) -> Style {
        let (etexts, vectors) : (Vec<Etext>,Vec<Vec<Score>>) =
            BufReader::new( panic_unless!("style data", result: File::open(path)) ).lines()
            .map( |line| {
                let line                 = line.unwrap();
                let mut elements         = line.split('\t');
                // The first element of each line is the etext number.
                let etext_no: Etext = panic_unless!("etext number",
                                                    option: elements.nth(0)
                                                    .and_then(|s| s.parse().ok())
                                                    );
                // The remaining elements are part-of-speech data for the etext.
                let etext_data: Vec<Proportion> = elements
                    .map( |s| panic_unless!("style data", result: s.parse::<Proportion>()) )
                    .collect();
                (etext_no, etext_data)
            } ).unzip();

        Style {
            data           : vectors,
            // Create the mappings from vector index to etext number, and vice versa.
            etext_to_index : etexts.iter()
                // duplicate etext_nos
                .cloned()
                // associate each etext_no with a row number
                .enumerate()
                // flip the pair, to map each etext_no to a row number
                .map(|(x,y)| (y,x))
                // collect into hashmap
                .collect(),
            index_to_etext : etexts,
        }
    }

By the way, if you haven't read it, Andrew Gallant's Error Handling in Rust is a great introduction to the right way to do error handling, rather than the spastic version I have here.

Next section: the web server.