Reimplementing ashurbanipal.web in Rust, pt. 3

Posted on July 16, 2015 by Tommy McGuire

In the last couple of posts, I described the background and Rust code to compute text recommendations from the Ashurbanipal data. In this post, I'm going show the code that the engine uses to supply the metadata associated with etexts to the front-end. As a reminder, this is using Rustful, a lightweight web-application framework in Rust, based on the hyper Rust HTTP library. Ashurbanipal is a prototype a text recommendation engine based on the contents of the text, not on reviews, other purchases, or any other external data. The prototype is based on 22,000-ish books on the Project Gutenberg 2010 DVD, at dpg.crsr.net. To use the it, go to the page, find a book that you know you like (using the search field at the upper left), and get style-based, topic-based, and combined recommendations for books which are in some way similar to your choice.

In the last post, I described using the rustc_serialize library to easily convert internal data structures to JSON. That same technique applies to Text, the structure containing the metadata for each etext.


#[derive(RustcEncodable)]
pub struct Text {
    pub etext_no:          Etext,
    pub link:              String,
    pub title:             String,
    pub author:            String,
    pub subject:           String,
    pub language:          String,
    pub release_date:      String,
    pub loc_class:         String,
    pub notes:             String,
    pub copyright_status:  String,
    pub score:             Option,
}

pub struct Metadata {
    metadata:     HashMap<Etext,Text>,
}

Each of the fields other than score and etext_no are metadata available from Project Gutenberg about the books. (Some fields may be empty, however. Your mileage may very. Do not operate heavy equipment after use.) The Metadata structure is simply a hashmap from etext numbers (the Etext type) to Text objects.

Creating the metadata map is fairly similar to parsing the stylometric or topical data. In this case, each line of the file represents one text and each tab-separated column one piece of metadata in a given order. And again, this uses the collect function to turn an iterator yielding etext number, Text pairs into a HashMap.


    pub fn read<P : AsRef<Path>>(path:P) -> Metadata {
        let texts: HashMap<Etext,Text> = 
            BufReader::new( File::open(path).unwrap() ).lines()
            .skip(1)
            .map( |line| {
                let line  = line.unwrap();
                let elements: Vec<&str> = line.split('\t').collect();
                let etext_no: Etext = elements[0].parse().unwrap();
                let t = Text {
                      etext_no:          etext_no,
                      link:              elements[1].to_string(),
                      title:             elements[2].to_string(),
                      author:            elements[3].to_string(),
                      subject:           elements[4].to_string(),
                      language:          elements[5].to_string(),
                      release_date:      elements[6].to_string(),
                      loc_class:         elements[7].to_string(),
                      notes:             elements[8].to_string(),
                      copyright_status:  elements[9].to_string(),
                      score:             None,
                };
                ( etext_no, t )
            } ).collect();

        Metadata { metadata: texts, }
    }

The one interesting method of Metadata was mentioned in passing last time, add_metadata. This method is passed a vector of etext number, score pairs, and returns a vector of filled-in Text objects. Once again, it uses iterators and the polymorphic collect.


    pub fn add_metadata<'a>(&'a self, rows: &Vec<(Etext,Score)>, start: usize, limit: usize) -> Vec<Text> {
        rows.iter()
            // limit rows to given window
            .skip(start).take(limit)
            // collect metadata for chosen texts
            .map( |&(e,s)| (self.get(e),s) )
            // filter out texts with no metadata
            .filter( |&(ref o,_)| o.is_some() )
            // combine the metadata and scored result
            .map( |(ref o,s)| o.unwrap().score(s) )
            // produce a vector
            .collect()
    }

The other use of the metadata is in allowing users to search for their favorite texts. As I mentioned above, the idea is that a user would enter part or all of their favorite book's title, author name, or subject keyword and the engine will present a list of possible matches, from which the user can choose the book they are thinking of. My basic reference for this task is the book, Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, although I probably shouldn't mention that here since this is such a mash-up. Don't blame them for my crap. The search is implemented by an Index structure.


type ScoredResult = (Etext,Score);
type ScoredDictionary = HashMap<String,Vec<ScoredResult>>;

pub struct Index {
    index: ScoredDictionary,
}

This data structure is a "reverse index" (which is strangely the same thing as an index), a map from keywords to a list of etexts mentioning that keyword and a score value trying to indicate the weight the book puts on the keyword. The list is known as a postings list.

Looking up keywords based on the user's request starts by splitting the query string on spaces and then merging the results from the various words. Finally, the results are sorted by score and returned to the user by a Rustful handler.


    pub fn get_entries(&self, s: &str) -> Vec<ScoredResult> {
        let mut results = Vec::new();
        for key in s.split(' ').map( encode ) {
            self.accept_or_merge_postings(&mut results, &key);
        }
        // Sort results by score, decreasing.
        results.sort_by( |l,r| {
            match l.1.partial_cmp(&r.1) {
                Some(o) => o.reverse(),
                // floating point numbers are a pain in the ass.
                None    => unimplemented!()
            }
        });
        results
    }

Because the initial results are empty, going straight to a merge would immediately result in empty results. Not good. Instead, accept_or_merge_postings accepts the next set of results from the lookup if the current results are empty. This is likely sub-optimal, but I didn't put a whole lot of thought into this. Sorry. (What, the recommendations aren't enough?)


    fn accept_or_merge_postings(&self, results: &mut Vec<ScoredResult>, key: &String) {
        match self.index.get(key) {
            Some(postings) => {
                if results.len() == 0 {
                    for posting in postings.iter() {
                        results.push(posting.clone());
                    }
                } else {
                    merge_postings(results, postings);
                }
            }
            None => { }
        }
    }

Once you have some results, merging the postings for remaining words requires running over the two vectors of scored results simultaneously. I didn't use iterators for this; I suppose I could have but I'm not sure.


fn merge_postings(results: &mut Vec<ScoredResult>, postings: &Vec<ScoredResult>) {
    let mut r = 0;
    let mut p = 0;
    while r < results.len() && p < postings.len() {
        if results[r].0 > postings[p].0 {
            p += 1;
        } else if results[r].0 < postings[p].0 {
            results.remove(r);
        } else /* results[r].0 == postings[p].0 */ {
            results[r].1 += postings[p].1;
            r += 1;
            p += 1;
        }
    }
    while r < results.len() {
        results.remove(r);
    }
}

The .0 and .1's are Rust's notation for accessing the two components of the ScoredResult pairs. When the current result's etext number is greater than the current posting's, the p index is incremented. When, vice-versa, the current posting's etext number is greater than the current result's, the current result is removed from the result set (since it does not occur in the postings list). And, when the two etext numbers agree, the result's score is incremented by the posting's score and both indices are incremented. This process implements a boolean and merge; given a previous set of results and a new postings list, the final set will be those etext numbers that occur in both.

Computing the index from the Project Gutenberg metadata is somewhat complex. The first step is to isolate the title, author, and subject string for each text, separate it into words, and then create a weighting for each based on 3 for title, 2 for author, and 1 for subject. (No, this doesn't work particularly well. Again, sorry.) Once that is done, the giant pre-postings list is sorted so that all of the entries for a given keyword are together, with all of the etext numbers within each keyword also together. Then, the list can be broken into equivalence classes on keyword and etext number, and the entries in each class can be combined by summing their scores. The result is inserted in to the map, creating postings lists with etext number, score pairs for each keyword. Each postings list is then sorted by etext number for convenient merging above.


    pub fn new(metadata: &Metadata) -> Index {
        // Compute a vector of keyword, etext_no, simple score triples.
        let mut postings: Vec<(String,Etext,Score)> = Vec::new();
        for (&etext_no, text) in metadata.iter() {
            postings.extend( text.title.split(' ').map(encode).map( |t| (t, etext_no, 3.0) ) );
            postings.extend( text.author.split(' ').map(encode).map( |a| (a, etext_no, 2.0) ) );
            postings.extend( text.subject.split(' ').map(encode).map( |s| (s, etext_no, 1.0) ) );
        }
        // Sort postings by keyword, then by etext_no.
        postings.sort_by(compare);
        // Accumulate scores for keyword, etext_no, then insert
        // etext_no and combined score into index under keyword.
        let mut index = HashMap::new();
        for cls in equivalence_classes(&postings, &equal) {
            if cls.len() > 0 {
                let key = cls[0].0.clone();
                let etext_no = cls[0].1;
                let score = cls.iter().fold(0.0 as Score, |a,p| a + p.2);
                index.entry(key).or_insert(Vec::new()).push((etext_no,score));
            }
        }
        // Sort stored postings lists by etext_no.
        for (_,postings) in index.iter_mut() {
            postings.sort_by(|l,r| l.0.cmp(&r.0));
        }
        Index { index: index }
    }

You may have noticed something that I did not mention in both creating the Index and up in getting entries from the index: an encode function. The problem is spelling; if you use the raw words from the title, author, and subject metadata, you will have a hard time finding Shakespeare's plays unless you remember the 'e' on the end of his name. To alleviate the problem, the keywords in the index and the words used to look up the postings lists are encoded using a Soundex algorithm. The algorithm I have found to be most successful is the New York State Information and Intelligence System phonetic code, NYSIIS. Unfortunately, there was no implementation of this algorithm in Rust, so I wrote one. It's horrible. I suggest not looking at it. It does seem to work, however.

My own implementation of NYSIIS makes the comparison between this search code and the Java version more difficult, since I did not look at the NYSIIS implementation in Apache Commons Codec while I was writing it.