Tuesday, July 28, 2015

More Abstracted Algebra in Rust

Or, Does 'Group' distribute over 'Ring'?

As I mentioned previously, I have been reading Alexander Stepanov and Daniel Rose's From Mathematics to Generic Programming. It's actually quite interesting, if you kind of fuzz over the math to get to the programming parts. It's not that I'm against math, heavens no, but I have seen many other, and some better, presentations of abstract algebra (I recommend A Book of Abstract Algebra by Charles Pinter) and of the history of algebra and number theory.

The best parts of From Mathematics to Generic Programming are about programming, specifically about generic programming using examples from algebra. In fact, I do wish Stepanov and Rose had spent more time on the history and theory behind the STL rather than on Galois. I would recommend this book to anyone getting into more advanced C++ or perhaps Java and Ada, or anyone getting into Rust. The generics system is built tightly into Rust and is an important part of the language environment.

The topic of this post is solutions to their Exercises 8.7 and 8.8, from the section Matrix Multiplication and Semirings. If you followed the last post on this stuff, you'll already know about generalizing addition, multiplication, and matrix multiplication operations to create a generic power function: a recurrence that repeatedly applies an operation to an initial value, then to the resulting value, and so on. In this section of the book, Stepanov and Rose take the matrix multiplication operation from there and generalize it to other underlying operations to compute the transitive closure of a social network graph and the lengths of the shortest paths in a network.

To begin with, I have to back off a bit. (Why? Why do I always do this? It's probably a long story.)

Generic algebra

This is sort of like math, so let's start with some basic definitions:

Term Definition Rust
Magma An algebraic structure consisting of a set of values and an operation closed over that set.
pub trait Magma {
    type Elt: Clone;
    fn op(Self::Elt, Self::Elt) -> Self::Elt;
}
Associativity (x `op` y) `op` z == x `op` (y `op` z)
pub trait Associative { }
Semigroup A magma whose operation is associative.
pub trait Semigroup: Magma + Associative { }
Monoid A semigroup equipped with an identity element such that x `op` id == id `op` x == x.
pub trait Monoid: Semigroup { fn identity() -> Self::Elt; }
Group A monoid that comes with an inverse operation such that x `op` x.inverse() == x.inverse() `op` x == id.
pub trait Group: Monoid {
  fn inverse(Self::Elt) -> Self::Elt;
}
Commutativity The arguments to the operation go to work on weekdays, such that x `op` y == y `op` x.
pub trait Commutative: Magma { }
Abelian Group A group whose operation is commutative. (Who names this stuff? It's ridiculous. Oh, Niels Henrik Abel. Nice.)
pub trait AbelianGroup: Group + Commutative { }

Now, if you look back at the previous post, where I had simpler definitions for semigroup and implementations for the primitive Rust types along with the definition of the power operation that used them, you'll see that none of this is actually anywhere near as complicated as it looks. Most of the traits are simply raw, type-level assertions; markers to indicate that addition is associative, for example. (It would be nice to be able to ensure that any operation declared as associative actually satisfied the definition, but I don't think I can do that without dependent types, and that would be as complicated as it looks. Or more.)
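
As a rough substitute (this is my addition, not the original code), a runtime spot check over sample values is easy to write; a passing check proves nothing, but a failing one proves the Associative marker is a lie. It needs one extra bound, PartialEq on the element type:

fn check_associative<S>(x: S::Elt, y: S::Elt, z: S::Elt) -> bool
    where S: Semigroup, S::Elt: PartialEq
{
    // Compare (x op y) op z against x op (y op z) for one sample triple.
    S::op(S::op(x.clone(), y.clone()), z.clone()) == S::op(x, S::op(y, z))
}

With the operations defined below, check_associative::<Addition<u32>>(1, 2, 3) had better come back true.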

For example, here are the addition and multiplication operations defined for all of the numeric primitive types, roughly the same as before:

pub struct Multiplication<Elt>(PhantomData<Elt>);
pub struct Addition<Elt>(PhantomData<Elt>);

macro_rules! add_mult_operations_for {
    ( $( $t:ty ),* ) => {
        $(
            impl Magma for Addition<$t> {
                type Elt = $t;
                fn op(x: $t, y: $t) -> $t { x + y }
            }
            impl Magma for Multiplication<$t> {
                type Elt = $t;
                fn op(x: $t, y: $t) -> $t { x * y }
            }
        )*
    }
}
add_mult_operations_for!(usize,u8,u16,u32,u64,i8,i16,i32,i64,f32,f64);

And, as for what you can do with them, here is the Egyptian-multiplication-algorithm-based implementation of the power operation:

fn power_accumulate_semigroup<N,S>(mut r: S::Elt, mut a: S::Elt, mut n: N) -> S::Elt
    where N: Natural,
          S: Semigroup
{
    loop {
        assert!(n > N::zero());
        if n.odd() {
            r = S::op(r.clone(), a.clone());
            if n == N::one() { return r; }
        }
        n = n.half();
        a = S::op(a.clone(), a.clone());
    }
}

pub fn power_semigroup<N: Natural, S: Semigroup>(mut a: S::Elt, mut n: N) -> S::Elt {
    assert!(n > N::zero());
    while !n.odd() {
        a = S::op(a.clone(), a.clone());
        n = n.half();
    }
    if n == N::one() { return a }
    let y = S::op(a.clone(), a.clone());
    return power_accumulate_semigroup::<N,S>(a, y, (n - N::one()).half());
}

(If you're interested in the Natural traits, see that previous post; I don't want to go into it here.)

The definition above actually makes serious use of associativity to execute the repeated operations in O(log n) rather than O(n) time. Below, I make use of the other algebraic definitions to handle edge cases for the Natural argument.

pub fn power_monoid<N: Natural, M: Monoid>(a: M::Elt, n: N) -> M::Elt {
    assert!(n >= N::zero());
    if n == N::zero() { M::identity() }
    else              { power_semigroup::<_,M>(a, n) }
}

pub fn power_group<N: Integral, G: Group>(mut a: G::Elt, mut n: N) -> G::Elt {
    if n < N::zero() {
        n = -n;
        a = G::inverse(a);
    }
    power_monoid::<N,G>(a, n)
}

So far, what I have is just additional formalization of the things I presented in that previous post. Not very interesting, eh? Heck, I'm already bored. So, let's introduce some new definitions.

Term Definition Rust
Distribution One way two operations can interact:
x `m` (y `a` z) = (x `m` y) `a` (x `m` z), and
(y `a` z) `m` x = (y `m` x) `a` (z `m` x).
pub trait DistributesOver<A>: Magma where A: Magma { }

macro_rules! mult_distributes_over_add_for {
    ( $( $t:ty ),* ) => { $(
        impl DistributesOver<Addition<$t>> for Multiplication<$t> { }
        )*
    }
}
mult_distributes_over_add_for!(usize,u8,u16,u32,u64,i8,i16,i32,i64,f32,f64);
Absorption There is a special, chosen element which, when used as an argument to the operation, always results in itself: x `op` 0 == 0 `op` x == 0.
pub trait Absorbing : Magma { fn absorbing() -> Self::Elt; }

macro_rules! absorbing_multiplication_for {
    ( $( $t:ty ),* ) => { $(
        impl Absorbing for Multiplication<$t> {
          fn absorbing() -> Self::Elt { 0 as $t }
        } )*
    }
}
absorbing_multiplication_for!(usize,u8,u16,u32,u64,i8,i16,i32,i64,f32,f64);
Semiring An algebraic structure with two operations (here called Ad and Mu), both monoids. Ad should be commutative, Mu should have an absorbing element (in the usual definition, this is exactly Ad's identity), and Mu should distribute across Ad. Also, Ad::identity != Mu::identity.
pub trait Semiring {
    type Elt;
    type Ad: Monoid<Elt=Self::Elt> + Commutative<Elt=Self::Elt>;
    type Mu: Monoid<Elt=Self::Elt> + Absorbing<Elt=Self::Elt> +
             DistributesOver<Self::Ad>;
}
Ring A semiring where Ad includes an inverse. (Completely unimportant for the rest of this post, but hey, somebody might need it sometime.)
pub trait Ring: Semiring where Self::Ad: Group { }

What we have here, besides a failure to communicate, is the definition of a semiring, the algebraic structure needed to generalize matrix multiplication. If you remember your linear algebra, or the code from the previous post, you know that matrix multiplication requires element-wise multiplication and then addition to compute the values in the result matrix. In fact, normal matrix multiplication uses what I'm going to call the Numerical semiring:

pub struct Numerical<Elt>(PhantomData<Elt>);

macro_rules! numerical_semiring_for {
    ( $( $t:ty ),* ) => { $(
        impl Semiring for Numerical<$t> {
            type Elt = $t;
            type Ad = Addition<$t>;
            type Mu = Multiplication<$t>;
        } )*
    }
}
numerical_semiring_for!(usize,u8,u16,u32,u64,i8,i16,i32,i64,f32,f64);

Technically, as Stepanov and Rose point out, you don't need a semiring for matrix multiplication; specifically, you don't need the identities for Ad and Mu. But I'm not going through this mess again to get a "weak semiring". In fact, for all of this algebraic nonsense, mathematical structures are only bags of properties; my semiring structure should be replaced by something that defines Ad and Mu in terms of Magmas and the actual requirements (associativity, commutativity, etc.) added on as required by the algorithm. Maybe next time.
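
(Purely as a sketch of what that might look like, and nothing the post actually uses: the structure would only name the two operations as magmas, and each algorithm would ask for the extra properties it needs in its own where clause.)

pub trait WeakSemiring {
    // Clone because Magma already requires it of its element type.
    type Elt: Clone;
    type Ad: Magma<Elt = Self::Elt>;
    type Mu: Magma<Elt = Self::Elt>;
}

A matrix multiplication written against this would then demand, say, S::Ad: Monoid + Commutative and S::Mu: DistributesOver<S::Ad> for itself, rather than having them bundled into the structure.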

Matrix multiplication

The matrix I presented last time was simple; this time I came up with something less simple: I added a special value to act as an identity for multiplication.

#[derive(Clone, Debug)]
pub enum Matrix<T> {
    Identity,
    Matrix {
        rows: usize,
        cols: usize,
        m: Vec<Vec<T>>,
    }
}

Matrix multiplication gets much more verbose when defined in terms of a semiring. The basic flow is this: if one of the arguments is the identity value, the result is the other. If neither is the identity, then it creates a new result matrix (an n x m matrix multiplied by an m x p matrix produces an n x p matrix) where the cells from the two matrices are combined as a[i][k] `Mu` b[k][j] and the results are summed using the Ad operation, with Ad::identity as the zero.

pub fn matrix_multiply<S>(a: Matrix<S::Elt>, b: Matrix<S::Elt>) -> Matrix<S::Elt>
    where S: Semiring, S::Elt: Clone
{
    match (a,b) {
        (Matrix::Identity, b) => b,
        (a, Matrix::Identity) => a,
        (Matrix::Matrix{rows: ref arows, cols: ref acols, m: ref am},
         Matrix::Matrix{rows: ref brows, cols: ref bcols, m: ref bm}) => {
            let n = *arows;
            let m = *acols;
            assert_eq!(m, *brows);
            let p = *bcols;
            let mut results: Vec<S::Elt> = Vec::new();
            for i in 0..n {
                for j in 0..p {
                    let cell = (0..m)
                        .map( |k| {
                          <S::Mu as Magma>::op( am[i][k].clone(), bm[k][j].clone() )
                        })
                        .fold( <S::Ad as Monoid>::identity(), <S::Ad as Magma>::op );
                    results.push(cell);
                }
            }
            Matrix::new(n, p, &results)
        }
    }
}

One thing to note: when you pick out S::Mu::op and S::Ad::identity, you need to name the specific trait whose operation or identity you mean, using the <Type as Trait>::item syntax. I'm not entirely sure why, since I don't think the declarations provided allow for multiple definitions.
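
One more aside: this code (and the examples later on) calls Matrix::new, whose definition isn't shown in this post. A minimal constructor consistent with those call sites, taking the elements as one flat, row-major vector and chunking it into rows, might look like the following sketch; the real definition may differ.

impl<T: Clone> Matrix<T> {
    /// Build a rows x cols matrix from a flat, row-major vector of elements.
    pub fn new(rows: usize, cols: usize, elements: &Vec<T>) -> Matrix<T> {
        assert_eq!(rows * cols, elements.len());
        Matrix::Matrix {
            rows: rows,
            cols: cols,
            m: elements.chunks(cols).map(|row| row.to_vec()).collect(),
        }
    }
}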

Matrix multiplication naturally forms an algebraic operation itself.

pub struct MatrixMultiply<S:Semiring>(PhantomData<S>);

impl<S> Magma for MatrixMultiply<S> where S: Semiring, S::Elt: Clone {
    type Elt = Matrix<S::Elt>;
    fn op(x: Matrix<S::Elt>, y: Matrix<S::Elt>) -> Matrix<S::Elt> {
        matrix_multiply::<S>(x,y)
    }
}

impl<S: Semiring> Associative for MatrixMultiply<S> { }
impl<S:Semiring> Semigroup for MatrixMultiply<S> where S::Elt: Clone { }

Given a semiring S, multiplication of matrices over S::Elts forms a semigroup itself. Which is good, because I'll need that later.

At this point, I could go back through and repeat the Fibonacci example using this generalized matrix multiplication, but I won't. I'm off to new things.
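
(For the record, and this is a sketch I haven't run: with fib_l, fib_r, and n standing for the 2x2 matrix, the 2x1 column, and the index from that post, the core of the Fibonacci computation would now read something like the following.)

    let f_n = matrix_multiply::<Numerical<i64>>(
        power_semigroup::<_, MatrixMultiply<Numerical<i64>>>(fib_l, n - 1),
        fib_r);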

Social networks

Assume you have seven people and the relationship of who is friends with whom among the seven. If you represent the friendship relation as a matrix of boolean values (true if the two people are friends and false otherwise), you can use matrix multiplication on a boolean semiring to compute the transitive closure of the relation, which describes the social network of the group of people.

Here is the friendship matrix:

    let friends: Matrix<bool> = Matrix::new(7,7, &vec!(
        true,  true,  false, true,  false, false, false,
        true,  true,  false, false, false, true,  false,
        false, false, true,  true,  false, false, false,
        true,  false, true,  true,  false, true,  false,
        false, false, false, false, true,  false, true,
        false, true,  false, true,  false, true,  false,
        false, false, false, false, true,  false, true,
        ));

The boolean semiring I need, the {Or,And}-semiring, replaces addition with boolean or and multiplication with boolean and. To start with, I need to define the two boolean operations.

Or is the boolean or operation, implemented by Rust's || operator. It is associative, commutative, and has an identity in the false value.

pub struct Or;

impl Magma for Or {
    type Elt = bool;
    fn op(x: bool, y: bool) -> bool { x || y }
}

impl Associative for Or { }

impl Semigroup for Or { }

impl Monoid for Or {
    fn identity() -> bool { false }
}

impl Commutative for Or { }

And is the boolean and operation, &&. It is associative; has an identity, true; has an absorbing element, false. (The absorbing part isn't mentioned in Stepanov and Rose, although the requirement is present in their defining axioms for the semiring. A lack of formalism, I suspect, led them to overlook it.) And also distributes over Or.

pub struct And;

impl Magma for And {
    type Elt = bool;
    fn op(x: bool, y: bool) -> bool { x && y }
}

impl Associative for And { }

impl Semigroup for And { }

impl Monoid for And {
    fn identity() -> bool { true }
}

impl Absorbing for And {
    fn absorbing() -> bool { false }
}

impl DistributesOver<Or> for And { }

The result is the boolean semiring:

pub struct BooleanSemiring;

impl Semiring for BooleanSemiring {
    type Elt = bool;
    type Ad = Or;
    type Mu = And;
}

Given all of the above, the code to actually produce the transitive closure is trivial, using the power_semigroup function.

    let result = power_semigroup::<_,MatrixMultiply<BooleanSemiring>> ( friends, 6 );

By the way, the 6 comes from 7, the number of people in the group, minus 1: a path between two of the seven people never needs more than six edges, so six applications of the boolean matrix "multiplication" are enough to reach the transitive closure.

The result is:

[
 true true true true false true false
 true true true true false true false
 true true true true false true false
 true true true true false true false
 false false false false true false true
 true true true true false true false
 false false false false true false true
]

There are two cliques in this network, one containing Ari (0), Bev (1), Cal (2), Don (3), and Fay (5), and another containing Eva (4) and Gia (6). If the two cliques were not disjoint, the result matrix would contain all true values; but they are disjoint, so there is no transitive friendship relation between Eva and Gia on one side and the rest of the group on the other.

Directed graphs

Suppose you have a directed graph with weighted links between some of the nodes. You can represent this graph as a matrix, with the weight between each node and itself as 0 and the weight between unconnected nodes as infinity:

    let graph: Matrix<f64> = Matrix::new(7, 7, &vec!(
        0.0,  6.0,  INF,  3.0,  INF,  INF,  INF,
        INF,  0.0,  INF,  INF,  2.0,  10.0, INF,
        7.0,  INF,  0.0,  INF,  INF,  INF,  INF,
        INF,  INF,  5.0,  0.0,  INF,  4.0,  INF,
        INF,  INF,  INF,  INF,  0.0,  INF,  3.0,
        INF,  INF,  6.0,  INF,  7.0,  0.0,  8.0,
        INF,  9.0,  INF,  INF,  INF,  INF,  0.0,
        ));

You can use repeated matrix multiplication over a very peculiar semiring to compute a matrix with the minimum path weight between any node and any other node. The peculiar semiring? It's tropical. (I really have no idea where these names come from. Tropical? Ridiculous. I do not have a daiquiri in my hand. There is no sand here. It's freaking hot outside, but I'm currently in Amarillo, TX, so it's a dry heat with a lot of wind so that I can taste the landscape. Bleah.)

pub struct TropicalSemiring;

impl Semiring for TropicalSemiring {
    type Elt = f64;
    type Ad = Min;
    type Mu = Addition<f64>;
}

In this tropical semiring, Min (as in min(a,b) = a if a < b; b otherwise) takes the place of addition and addition takes the place of multiplication. (That is one of the sources of peculiarity.)

Min itself makes a pretty simple algebraic structure. It is associative, commutative, and has an identity (infinity). Addition distributes over it since a + min(b,c) = min(a+b, a+c).

pub struct Min;

impl Magma for Min {
    type Elt = f64;
    fn op(x: f64, y: f64) -> f64 { f64::min(x,y) }
}

impl Associative for Min { }
impl Commutative for Min { }

impl Semigroup for Min { }

impl Monoid for Min {
    fn identity() -> f64 { f64::INFINITY }
}

impl DistributesOver<Min> for Addition<f64> { }

On the other hand, addition proves to be a problem. It's the standard addition operation described above, but the semiring definition requires whatever replaces the multiplication operation to have an absorbing element. How do you get an element such that e + x = x + e = e? Zero is not it, one is not it; are there any other interesting numbers? Fortunately, IEEE 754 floating point numbers come to our rescue here: they provide representations for positive and negative infinity, and infinity is almost as sticky as NaN. Cast your eyes at this chart: IEEE 754 Infinity and NaN Arithmetic Rules. In the upper-left quarter is addition; infinity plus any other number (except negative infinity and NaN) is always infinity. (Infinity plus negative infinity is NaN; any operation on NaN is also Not a Number, so infinity under addition is almost as sticky as NaN and, although the axioms are not strictly satisfied for negative infinity and NaN, it is close enough for government work. Especially since neither is used here.)
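
(A quick sanity check of that behavior; these assertions are my addition, not part of the original code.)

    let inf = f64::INFINITY;
    assert_eq!(inf + 42.0, inf);                  // infinity absorbs any finite addend
    assert_eq!(-1.0e308 + inf, inf);              // in either argument position
    assert!((inf + f64::NEG_INFINITY).is_nan());  // the exception: inf + -inf is NaN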

/// Only true for addition on floating point numbers; see IEEE754.
macro_rules! absorbing_addition_for {
    ( $( $t:ty: $v:expr ),* ) => { $(
        impl Absorbing for Addition<$t> { fn absorbing() -> Self::Elt { $v } }
        )*
    }
}
absorbing_addition_for!(f32: f32::INFINITY, f64: f64::INFINITY);

Once I have all of the definitions in place, using the power function to compute the transitive closure of the graph link weights is easy:

    let result = power_semigroup::<_,MatrixMultiply<TropicalSemiring>>(graph, 7-1);

The final result is:

[
 0 6 8 3 8 7 11
 23 0 16 26 2 10 5
 7 13 0 10 15 14 18
 12 18 5 0 11 4 12
 35 12 28 38 0 22 3
 13 17 6 16 7 0 8
 32 9 25 35 11 19 0
]

So, travelling from A(0) to B(1) (in the top row, second from the left) has weight 6; travelling from B to A has weight 23 (second row, leftmost column).

That is exercise 8.8; exercise 8.9, which calls for not only finding the shortest distance but the shortest paths, should be fairly easy; simply record which nodes you use when you reduce with min. Maybe next time.

Conclusions

Throughout this post, I used semirings even though I did not need all of their properties; in fact, the absorbing thing was a bit of a problem. At one point in the book, Stepanov and Rose write,

Categorical Theories versus STL

For a long time, people believed that only categorical theories [a theory is a set of propositions; in a categorical theory, all models, which define the operations in the theory and make the axioms true, of the theory are isomorphic] were good for programming. When the C++ Standard Template Library (STL) was first proposed, some computer scientists opposed it on the grounds that many of its fundamental concepts, such as Iterator, were underspecified. In fact, it is this underspecification that gives the library its generality. While linked lists and arrays are not computationally isomorphic, many STL algorithms are defined on input iterators and work on both data structures. If you can get by with fewer axioms, you allow for a wider range of implementations.

I seriously hope they are talking about the problem of dealing with bundles of propositions such as semirings, where not all of the requirements in the bundle are used by a given algorithm. In that case, it is easy for a structure to be underspecified, since a property that is not used never needs to be mentioned in the code. The alternative, a property that is used but cannot easily be described (like the associativity trait) and so remains implicit, is a dandy source of hard-to-find bugs. Making things like associativity explicit, but trivial, is only a bandaid: it reminds the programmer that water is needed, but cannot make them drink.

Tuesday, July 21, 2015

Link o' the day: Churchill and FDR on Basic English

And then I found this missive in the collection of the Franklin D. Roosevelt Presidential Library and Museum. (It contains a copy of the note Churchill sent on 20 April 1944, the reply from FDR June 5, 1944, and some other related things.)

Churchill (So much for automatic text recognition; the original was typed by an English speaker even if the text page doesn't look like it.):

When I was with you in the United States last August you expressed to me your interest in Basic English. The Cabinet Committee which I appointed here to consider the possibilities of Basic and means of promoting its wider use have reported and we have adopted the recommendations they have made. I thought it might be of interest to you to see the Report and am sending you copies.

[...]

If the United States authorities feel able to give their powerful support to the promotion of Basic English as a means of international intercourse, I feel sure that that would ensure its successful development. My conviction is that Basic English will then prove to be a great boon to mankind in the future and a powerful support to the influence of the Anglo-Saxon peoples in world affairs.

FDR:

...Incidentally, I wonder what the course of history would have been if in May 1940 you had been able to offer the British people only "blood, work, eye water and face water", which I understand is the best that Basic English can do with five famous words....

As an aside, the website for the FDR library at Marist College seems radically broken.

Sunday, July 19, 2015

Abstracted Algebra in Rust

or, Why Rust is better than C++

or, How To Break a Coding Interviewer

This week I have been reading Alexander Stepanov and Daniel Rose's From Mathematics to Generic Programming. (I've had Stepanov and Paul McJones' Elements of Programming on the unread pile for quite a while, but so far it has been the Australian sparkling wine of the bookshelf: something for laying down and avoiding. I'll have to unearth it, once I have recovered from this.) From Mathematics to Generic Programming is the sort of book intended to convince you that programming is not only applied formal logic but also applied abstract algebra, and it does a pretty fair job. If you're unfamiliar with Stepanov, he is one of the people behind generic programming in general and both Ada and the C++ Standard Template Library specifically. Do be warned, though, that so far the book has been rather heavy on abstract algebra and its history, and that of number theory, and thin on programming. On the other hand, the programming bits have been very good, particularly those on optimizing an algorithm.

While reading chapter 7, Deriving a Generic Algorithm, the authors pose an interesting example and I thought it might be entertaining to try to convert the example into Rust.

To begin, consider multiplication. (No, I'm serious. Just keep going for a minute.) Multiplication is repeated addition; a*n is a + a + ... + a, with n a's and n-1 + signs. Implemented naively, it is an O(n) algorithm, right? But there is an alternative algorithm, Egyptian multiplication, which is O(log n). I don't want to go into how this algorithm works, or the derivation of the following version; see Count like an Egyptian or From Mathematics to Generic Programming for yourself. But this is the optimized, accumulative, iterative version, in Rust:

fn odd(n: usize) -> bool { n & 0x01 == 1 }
fn half(n: usize) -> usize { n >> 1 }
fn mult_acc(mut r: usize, mut n: usize, mut a: usize) -> usize {
    loop {
        if odd(n) {
            r += a;
            if n == 1 {
                return r;
            }
        }
        n = half(n);
        a += a;
    }
}
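
A quick sanity check, not from the book: mult_acc(r, n, a) computes r + n*a, so, for example,

    assert_eq!(mult_acc(0, 4, 5), 20);   // 0 + 4*5
    assert_eq!(mult_acc(3, 7, 6), 45);   // 3 + 7*6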

The important part of this algorithm is that you have an element, a, an operation, + (that obeys some algebraic laws), and a counter, n, that indicates how many times to apply the operation to the element. And an efficient algorithm for doing so. Now, what can you do with this?

Natural numbers

To start with, what are the requirements on the counter? It needs a zero (the additive identity, purely for predicate purposes) and a one (also for predicates), plus a couple of operations: a predicate indicating that a value is odd, and integral division by 2. Further, it needs to be ordered (see the predicates), and it will be convenient later to have a subtraction operator. In Rust, those requirements are described by traits:

pub trait Zero { fn zero() -> Self; }   // see footnote 1
pub trait One { fn one() -> Self; }
pub trait Odd { fn odd(&self) -> bool; }
pub trait Half { fn half(&self) -> Self; }
pub trait Natural: cmp::Ord + Sub<Output=Self> + Zero + One + Odd + Half { }

macro_rules! natural_for {
    ( $( $t:ty ),* ) => {
        $(
            impl Zero for $t { fn zero()      -> $t   { 0 } }
            impl One  for $t { fn one()       -> $t   { 1 } }
            impl Half for $t { fn half(&self) -> $t   { *self >> 1 } }
            impl Odd  for $t { fn odd(&self)  -> bool { *self & 0x01 == 1 } }
            impl Natural for $t { }
        )*
    }
}
natural_for!(usize,u8,u16,u32,u64,i32,i64);

The trait Natural there indicates that a type implementing it also has to implement Ord and Sub from the standard library along with Zero, One, Odd, and Half. All of those either already are implemented by the primitive types or are pretty easy. I have taken the liberty of using a macro to implement this set of traits for a collection of useful counting types: usize, u8, u16, etc.

The collected Natural trait provides all the operations needed for the counter type, to implement the Egyptian multiplication algorithm in a general form. On the other hand, the algorithm operates on a type (for a) and an associated operation. These two form an algebraic structure, a semigroup.

Semigroups

trait Semigroup {
    type Elt: Clone;
    fn op(Self::Elt, Self::Elt) -> Self::Elt;
}

For a semigroup, the operation should be associative: (x `op` y) `op` z == x `op` (y `op` z). Unfortunately, I have not figured out how to enforce that. Maybe later. On the other hand, the type of elements should be closed under the operation, which is enforced by having the return type of the operation be Self::Elt.

Note that the operation, op, is not a trait method; it does not have a &self argument. In other words, it's a static function, not bound to any objects implementing the trait, although it does operate on the type associated with the trait, Elt.

On numbers (as in, integers), there are two common semigroups: addition and multiplication. This is actually one of the stumbling points of simple approaches to describing abstract algebra in code: because a value can only have one type, a simple approach would allow only one operation, addition say, rather than an unlimited number of semigroup definitions over the same types with different operations. Fortunately, Rust has a very frightening and complicated way out:

struct MultiplicativeSemigroup<Elt>(PhantomData<Elt>);
struct AdditiveSemigroup<Elt>(PhantomData<Elt>);

macro_rules! add_mult_semigroups_for {
    ( $( $t:ty ),* ) => {
        $(
            impl Semigroup for AdditiveSemigroup<$t> {
                type Elt = $t;
                fn op(x: $t, y: $t) -> $t { x + y }
            }
            impl Semigroup for MultiplicativeSemigroup<$t> {
                type Elt = $t;
                fn op(x: $t, y: $t) -> $t { x * y }
            }
        )*
    }
}
add_mult_semigroups_for!(usize,u8,u16,u32,u64,i8,i16,i32,i64,f32,f64);

This code defines additive and multiplicative semigroups over the generic type Elt. The structures, weirdly, don't contain any data fields. Instead, they will be used purely at the type level in the following code. (Beware! You have been warned! Here be dragons! And ferrets looking to get into your trousers.) The PhantomData bit tells the compiler that I know that the structure does not use the type Elt. To quote from std::marker::PhantomData<T>:

PhantomData<T> allows you to describe that a type acts as if it stores a value of type T, even though it does not. This allows you to inform the compiler about certain safety properties of your code.

Though they both have scary names, PhantomData<T> and "phantom types" are unrelated. 👻👻👻

Yes, those seem to be little unicode ghosts. I'm not the only one with a weird sense of humor.

I have again taken the liberty of implementing both semigroups for the common primitive types.

You may be wondering where Rust's Add and Mul traits appear in all this. Currently, for this code, they are implicit in the operations + and *, and the types I have defined the two semigroups for. However, when I get down to doing more complicated stuff, they will indeed be needed.

Power!

Once you have the traits Natural and Semigroup, and implementations of those traits, you can implement the Egyptian multiplication algorithm in a generic form:

fn power_accumulate_semigroup<N,S>(mut r: S::Elt, mut a: S::Elt, mut n: N) -> S::Elt
    where N: Natural,
          S: Semigroup
{
    loop {
        assert!(n > N::zero());   // see footnote 2
        if n.odd() {
            r = S::op(r.clone(), a.clone());
            if n == N::one() { return r; }
        }
        n = n.half();
        a = S::op(a.clone(), a.clone());
    }
}

fn power_semigroup<N: Natural, S: Semigroup>(mut a: S::Elt, mut n: N) -> S::Elt {
    assert!(n > N::zero());
    while !n.odd() {
        a = S::op(a.clone(),a.clone());
        n = n.half();
    }
    if n == N::one() { return a }
    let y = S::op(a.clone(),a.clone());
    return power_accumulate_semigroup::<N,S>(a, y, (n - N::one()).half());
}

Yes, I know, mutable function arguments are, stylistically, sketchy in Rust. But that's what the authors have and I'm just going with their work.

As an aside, the associativity of the semigroup operation is important here, because the Egyptian multiplication algorithm works by reordering the evaluation of the operations.

The first function there, power_accumulate_semigroup, should be visibly the same as the original algorithm above, with some changes related to:

  • the operator, which is fetched out of the type argument S,
  • the use of clone (and the related Clone requirement on Semigroup::Elt), which is needed because primitive types are implicitly copyable and passed by value, while the generic type Elt is not copyable but is still being passed by value here, and
  • the excess verbosity required by using generic Naturals for the counter.

The second function, power_semigroup, is a wrapper for the accumulative version, providing simplified parameters and optimizing the arguments to power_accumulate_semigroup. I do not want to go into the derivation of that, either. See Stepanov and Rose.

Why go to this much effort? Because power_semigroup provides a way of repeatedly applying an operation to a value, and is as generic as it can be over both the types of the values and the operations (and over the types of the counter describing how many times to apply the operation). With the additive and multiplicative semigroups, you can now write code like

    power_semigroup::<_,MultiplicativeSemigroup<usize>>(2, 5)

and get back the result 32 = 2^5. Or, you can write

    power_semigroup::<_,AdditiveSemigroup<u64>>(2, 5)

and get back the result 10 = 2 * 5.

Hardly seem worth it? Perhaps not, but that's the nature of abstraction: you go through a lot of rigmarole (my spelling checker suggested 'oleomargarine' there) to get something idiotically simple in trivial cases but remarkably powerful in complex ones. So, wait for it.

There is one thing you cannot do with this monstrosity. Calling

    power_semigroup::<_,AdditiveSemigroup<i64>>(2, 0)

results in a run-time error; because this is dealing with a semigroup, there are some weird limits on what you can do. In particular, we cannot currently describe what 2*0, or 2^0 for the multiplicative semigroup, means. But it is possible to extend our algebraic structure.

Monoids

(What was that about mathematical notation being clean, elegant, and beautiful?)

A monoid is a semigroup with an identified element, an identity. Calling the identity e, the additional required behavior of the operation is (x `op` e) == (e `op` x) == x. Following the style above, I define a Monoid trait and supply implementations for the additive and multiplicative operations. (Yeah, AdditiveSemigroup and MultiplicativeSemigroup are looking more and more like poor names.)

trait Monoid: Semigroup { fn identity() -> Self::Elt; }

macro_rules! add_mult_monoids_for {
    ( $( $t:ty ),* ) => {
        $(
            impl Monoid for AdditiveSemigroup<$t>       { fn identity() -> $t { 0 as $t } }
            impl Monoid for MultiplicativeSemigroup<$t> { fn identity() -> $t { 1 as $t } }
        )*
    }
}
add_mult_monoids_for!(usize,u8,u16,u32,u64,i8,i16,i32,i64,f32,f64);

The additive identity is 0, because x + 0 = 0 + x = x, and the multiplicative identity is 1, because x * 1 = 1 * x = x. As before, I supply an implementation for all of the standard primitive types. Using the Monoid trait, I can then define a function power_monoid:

fn power_monoid<N: Natural, M: Monoid>(a: M::Elt, n: N) -> M::Elt {
    if n == N::zero() { M::identity() }
    else              { power_semigroup::<_,M>(a, n) }
}

With this function, the operation that failed above now works; supplying an operation count argument of zero results in the appropriate identity value;

    power_monoid::<_,MultiplicativeSemigroup<i64>>(2, 0)

produces 1 = 2^0. Likewise, the additive operation would produce 0 = 2*0.

Unfortunately, there is still another problem. Calling

    power_monoid::<_,AdditiveSemigroup<i64>>(2, -2);

would result in another runtime error, because neither the semigroup nor the monoid supports a negative operation count. (What does that even mean? Screw it, this is mathematics. Damn the torpedoes; full speed ahead!)

Extending Natural to Integral

Before the power operation can be extended, the Natural trait needs to be extended to support negation; otherwise simply writing the function call would be an error. Once again, defining mathematical structures as traits is the necessary task:

trait Integral: Natural + Neg<Output=Self> { }

macro_rules! integral_for {
    ( $( $t:ty ),* ) => { $( impl Integral for $t { } )* }
}
integral_for!(i32,i64);

As with the traits Mul and Add, I am using Rust's standard Neg trait to express negation; anything implementing the trait Integral has to implement Natural and have the negation operation. The number of types for which I can implement the Integral trait is significantly limited. Floating point values seem wrong, and none of Rust's unsigned primitive types will work, leaving only i32 and i64.

Groups

A group, as an algebraic structure, is a monoid with an operation to compute the inverse of each element, relative to the operation. The specific requirement for the inverse is (x `op` x^-1) == (x^-1 `op` x) == e. This is the trait that describes groups:

trait Group: Monoid { fn inverse(&Self::Elt) -> Self::Elt; }

macro_rules! add_groups_for {
    ( $( $t:ty ),* ) => {
        $(
            impl Group for AdditiveSemigroup<$t> {
                fn inverse(n: &$t) -> $t { -(*n as $t) }
            }
        )*
    }
}
add_groups_for!(i8,i16,i32,i64,f32,f64);

macro_rules! mult_groups_for {
    ( $( $t:ty ),* ) => {
        $(
            impl Group for MultiplicativeSemigroup<$t> {
                fn inverse(n: &$t) -> $t { (1 as $t) / (*n as $t) }
            }
        )*
    }
}
mult_groups_for!(f32,f64);

And once again I provided an implementation of the Group for the types where it makes sense. Unfortunately, that is becoming limited: negative numbers, the additive inverses of positive numbers, are not supported by unsigned types and 1/n, the multiplicative inverse of a number n, is only supported by floating point types (since I don't have rationals handy). However, once Natural and Monoid have been extended to Integral and Group, a function power_group can be defined.

fn power_group<N: Integral, G: Group>(mut a: G::Elt, mut n: N) -> G::Elt {
    if n < N::zero() {
        n = -n;
        a = G::inverse(&a);
    }
    power_monoid::<_,G>(a, n)
}

This function is a wrapper for power_monoid, which wraps power_semigroup, which does the heavy lifting here. In this wrapper, if the count n is less than zero, it is inverted to be greater than zero and the base argument a is also inverted, by the operation that depends on the definition of the group for the operation. For addition, a becomes -a; for multiplication, a becomes 1/a.

As a result, the following function invocations work:

    power_group::<_,AdditiveSemigroup<i64>>(2, -2i64)

produces -4 = 2*-2, and

    power_group::<_,MultiplicativeSemigroup<f64>>(2.0, -2i64)

produces 0.25 = 2^-2. So, now we have the ability to invoke an operation on an element in a very general but well-defined way. Wouldn't it be nice if we could do something cool with it?

Fibonacci

The Fibonacci sequence (If you don't know what that is, who Fibonacci was, or why it exists at all, go look it up. I'll wait, but this post is getting too long anyway.) is normally defined as:

F(0) = 0
F(1) = 1
F(i) = F(i-1) + F(i-2)

This leads to a simple recursive implementation, which is spectacularly inefficient, and to a host of techniques, like memoization and iterative versions, that are more efficient. With those, finding the nth Fibonacci number is O(n). Or is it? What if we could use our Egyptian multiplication trick, plus slathering on some other stuff, to get an O(log n) algorithm?
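
For reference (this bit is mine, not the book's), the naive recursion looks like the function below. Each call spawns two more, so it recomputes the same subproblems exponentially often, which is exactly why the memoized and iterative O(n) versions exist.

fn fib(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        _ => fib(n - 1) + fib(n - 2),
    }
}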

Matrix multiplication

Although I have no intention to explore the reasoning, it turns out to be possible to find the nth Fibonacci number using matrix multiplication:

\[ \left[ \begin{array}{c}v_{i+1} \\ v_{i}\end{array} \right] = \left[ \begin{array}{cc}1 & 1 \\ 1 & 0\end{array} \right] \left[ \begin{array}{c}v_{i} \\ v_{i-1}\end{array} \right] \]

...and therefore...

\[ \left[ \begin{array}{c}v_n \\ v_{n-1}\end{array} \right] = \left[ \begin{array}{cc}1 & 1 \\ 1 & 0\end{array} \right]^{n-1} \left[ \begin{array}{c}1 \\ 0\end{array} \right] \]

(If the TeX doesn't show up correctly, check out the damn book. Trust me, it looks beautiful.)

In other words, we can compute the nth Fibonacci number by raising a certain matrix to a power. [...] Matrices are a multiplicative monoid, so we already have an O(log n) algorithm---our power algorithm from Section 7.6.

Therefore, all we need to do is whip up a quick matrix data structure...

struct Matrix<T:Clone> {
    m: Vec<Vec<T>>,
}

impl<T: Clone + Zero> Matrix<T> {
    fn zero(n: usize, m: usize) -> Matrix<T> {
        let mut rows = Vec::with_capacity(n);
        for _ in 0..n {
            let row = vec!(T::zero(); m);
            rows.push(row);
        }
        Matrix{ m: rows }
    }
}

impl<T:Clone + Zero> Clone for Matrix<T> {
    fn clone(&self) -> Matrix<T> {
        let m = self.m.len();
        let n = self.m[0].len();
        let mut res = Matrix::zero(m, n);
        for i in 0..m {
            for j in 0..n {
                res.m[i][j] = self.m[i][j].clone();
            }
        }
        res
    }
}

..., provide a spiffy multiplication operation...

impl<T> Mul for Matrix<T> where T: Clone + Zero + Add<Output=T> + Mul<Output=T>
{
    type Output = Matrix<T>;

    fn mul(self, rhs: Matrix<T>) -> Matrix<T> {
        let n = self.m.len();
        let m = self.m[0].len();        // lhs    is n x m
        let p = rhs.m[0].len();
        assert_eq!(rhs.m.len(), m);     // rhs    is m x p
        let mut ab = Matrix::zero(n,p); // result is n x p
        for i in 0..n {
            for j in 0..p {
                ab.m[i][j] =
                    (0..m)
                    .map( |k| self.m[i][k].clone() * rhs.m[k][j].clone() )
                    .fold( T::zero(), Add::add );
            }
        }
        ab
    }
}

(Note the requirement that T, the type of the elements in the matrix, have addition and multiplication (and a zero).)

...and then make a multiplicative semigroup for matrices:

impl<T> Semigroup for MultiplicativeSemigroup<Matrix<T>>
     where T: Clone + Zero + Add<Output=T> + Mul<Output=T>
{
    type Elt = Matrix<T>;
    fn op(x: Matrix<T>, y: Matrix<T>) -> Matrix<T> { (x) * (y) }
}

And with that I can print out all the Fibonacci numbers that you need (or at least all of those that will fit in a 64-bit integer).

    let mut fib_l: Matrix<i64> = Matrix::zero(2,2);
    fib_l.m[0][0] = 1;
    fib_l.m[1][0] = 1;
    fib_l.m[0][1] = 1;

    let mut fib_r: Matrix<i64> = Matrix::zero(2,1);
    fib_r.m[0][0] = 1;

    for i in 2..n {   // n is 63 here; the loop prints Fibonacci numbers 2 through 62
        let fib_l = fib_l.clone();
        let fib_r = fib_r.clone();
        let res =
            power_semigroup::<_,MultiplicativeSemigroup<Matrix<i64>>>(fib_l, i-1)
            * fib_r;
        println!("{} {}", i, res.m[0][0]);
    }

That code prints out the second through the sixty-second Fibonacci numbers; I cannot use it for the 0th or first number because I never added the identity necessary to make matrix multiplication a monoid; I'm lazy, and it would require either knowing what size matrix to produce or having a specialized identity matrix. (Stopping at 62 is comfortably conservative: a signed 64-bit integer can actually hold Fibonacci numbers up through F(92).)

2 1
3 2
4 3
5 5
6 8
7 13
...
57 365435296162
58 591286729879
59 956722026041
60 1548008755920
61 2504730781961
62 4052739537881

To sum up, I'll quote Stepanov and Rose:

The process we went through---taking an efficient algorithm, generalizing it (without losing efficiency) so that it works on abstract mathematical concepts, and then applying it to a variety of situations---is the essence of generic programming.

Subtitles

How is Rust better than C++? (I'm not sure it is better than C++.) Two reasons:

  • Stepanov and Rose's C++ code uses an additional operator parameter to pass the necessary algebraic structure operations into the generic functions; my Rust code here uses type arguments to do the same thing. In other words, in C++, they have

    template <Regular A, Integer N, SemigroupOperation Op>
    A power_semigroup(A a, N n, Op op) {...}
    

    While in Rust, I have

    fn power_semigroup<N: Natural, S: Semigroup>(mut a: S::Elt, mut n: N) -> S::Elt {...}
    

    Here, my Natural corresponds to their Integer, and their Regular and SemigroupOperation together correspond to my Semigroup, which bundles the element type and its operation. Using their power_semigroup requires three arguments; mine, two.

  • In the code examples above, Regular, Integer, and SemigroupOperation are C++ concepts, according to the authors. Unfortunately, as of C++11, as used by the authors, C++ did not include concepts, so for the code in the book, they #define Regular, etc., as "typename". (I don't follow C++ much; I have no idea about the status in C++14 or C++17.) To tell the truth, I'm not entirely sure I understand the concept of concepts, but it appears they act as a type system for C++ templates (the C++ template sublanguage is untyped and interpreted by the compiler) and that they act like Rust and Haskell traits: specifying an interface that must be implemented by types that use the generic function. Rust, on the other hand, has traits.

    One side effect of this is that Rust potentially has better compiler error messages; a trait can be typechecked when it is defined and implemented, while a C++ template is, for the most part, typechecked only when it is instantiated, and the resulting error messages are specific to the instantiation. Rust's errors aren't great, but they aren't 27 pages of angle brackets and line noise.

How do you break a coding interviewer?

Suppose you are in an interview, or on a phone interview, and the person on the other end asks you to write the code to find the nth Fibonacci number without all that icky recursion stuff. Memorize this post, and you can start your answer with, "You know what a semigroup is, right? And Egyptian multiplication?" The interviewer will be bleeding from the eyes and ears before you get through a few paragraphs.

Footnotes

1 As of this writing, Rust's own Zero and One traits are unstable.

2 Stepanov and Rose do not have this assertion there, but have more general code, seen below. But, I don't think their code makes algebraic sense here, and replacing it with what I have does not seem to change the outcome.

        assert!(n >= N::zero());
        if n == N::zero() { return r; }

Thursday, July 16, 2015

Reimplementing ashurbanipal.web in Rust, pt. 3

In the last couple of posts, I described the background and Rust code to compute text recommendations from the Ashurbanipal data. In this post, I'm going to show the code that the engine uses to supply the metadata associated with etexts to the front-end. As a reminder, this is using Rustful, a lightweight web-application framework in Rust, based on the hyper Rust HTTP library. Ashurbanipal is a prototype of a text recommendation engine based on the contents of the text, not on reviews, other purchases, or any other external data. The prototype is based on the 22,000-ish books on the Project Gutenberg 2010 DVD, at dpg.crsr.net. To use it, go to the page, find a book that you know you like (using the search field at the upper left), and get style-based, topic-based, and combined recommendations for books which are in some way similar to your choice.

  1. Part 1: Computing recommendations.
  2. Part 2: Curse of the web server.
  3. Part 3: You are here.

In the last post, I described using the rustc_serialize library to easily convert internal data structures to JSON. That same technique applies to Text, the structure containing the metadata for each etext.

#[derive(RustcEncodable)]
pub struct Text {
    pub etext_no:          Etext,
    pub link:              String,
    pub title:             String,
    pub author:            String,
    pub subject:           String,
    pub language:          String,
    pub release_date:      String,
    pub loc_class:         String,
    pub notes:             String,
    pub copyright_status:  String,
    pub score:             Option<Score>,
}

pub struct Metadata {
    metadata:     HashMap<Etext,Text>,
}

Each of the fields other than score and etext_no is a piece of metadata available from Project Gutenberg about the books. (Some fields may be empty, however. Your mileage may vary. Do not operate heavy equipment after use.) The Metadata structure is simply a hashmap from etext numbers (the Etext type) to Text objects.

Creating the metadata map is fairly similar to parsing the stylometric or topical data. In this case, each line of the file represents one text and each tab-separated column one piece of metadata in a given order. And again, this uses the collect function to turn an iterator yielding etext number, Text pairs into a HashMap.

    pub fn read<P : AsRef<Path>>(path:P) -> Metadata {
        let texts: HashMap<Etext,Text> = 
            BufReader::new( File::open(path).unwrap() ).lines()
            .skip(1)
            .map( |line| {
                let line  = line.unwrap();
                let elements: Vec<&str> = line.split('\t').collect();
                let etext_no: Etext = elements[0].parse().unwrap();
                let t = Text {
                      etext_no:          etext_no,
                      link:              elements[1].to_string(),
                      title:             elements[2].to_string(),
                      author:            elements[3].to_string(),
                      subject:           elements[4].to_string(),
                      language:          elements[5].to_string(),
                      release_date:      elements[6].to_string(),
                      loc_class:         elements[7].to_string(),
                      notes:             elements[8].to_string(),
                      copyright_status:  elements[9].to_string(),
                      score:             None,
                };
                ( etext_no, t )
            } ).collect();

        Metadata { metadata: texts, }
    }

The one interesting method of Metadata was mentioned in passing last time, add_metadata. This method is passed a vector of etext number, score pairs, and returns a vector of filled-in Text objects. Once again, it uses iterators and the polymorphic collect.

    pub fn add_metadata<'a>(&'a self, rows: &Vec<(Etext,Score)>, start: usize, limit: usize) -> Vec<Text> {
        rows.iter()
            // limit rows to given window
            .skip(start).take(limit)
            // collect metadata for chosen texts
            .map( |&(e,s)| (self.get(e),s) )
            // filter out texts with no metadata
            .filter( |&(ref o,_)| o.is_some() )
            // combine the metadata and scored result
            .map( |(ref o,s)| o.unwrap().score(s) )
            // produce a vector
            .collect()
    }

The other use of the metadata is in allowing users to search for their favorite texts. As I mentioned above, the idea is that a user would enter part or all of their favorite book's title, author name, or subject keyword and the engine will present a list of possible matches, from which the user can choose the book they are thinking of. My basic reference for this task is the book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, although I probably shouldn't mention that here since this is such a mash-up. Don't blame them for my crap. The search is implemented by an Index structure.

type ScoredResult = (Etext,Score);
type ScoredDictionary = HashMap<String,Vec<ScoredResult>>;

pub struct Index {
    index: ScoredDictionary,
}

This data structure is a "reverse index", more commonly called an inverted index (which is strangely the same thing as an index): a map from keywords to a list of etexts mentioning that keyword, along with a score value trying to indicate the weight the book puts on the keyword. The list is known as a postings list.

Looking up keywords based on the user's request starts by splitting the query string on spaces and then merging the results from the various words. Finally, the results are sorted by score and returned to the user by a Rustful handler.

    pub fn get_entries(&self, s: &str) -> Vec<ScoredResult> {
        let mut results = Vec::new();
        for key in s.split(' ').map( encode ) {
            self.accept_or_merge_postings(&mut results, &key);
        }
        // Sort results by score, decreasing.
        results.sort_by( |l,r| {
            match l.1.partial_cmp(&r.1) {
                Some(o) => o.reverse(),
                // floating point numbers are a pain in the ass.
                None    => unimplemented!()
            }
        });
        results
    }

Because the initial results are empty, going straight to a merge would immediately result in empty results. Not good. Instead, accept_or_merge_postings accepts the next set of results from the lookup if the current results are empty. This is likely sub-optimal, but I didn't put a whole lot of thought into this. Sorry. (What, the recommendations aren't enough?)

    fn accept_or_merge_postings(&self, results: &mut Vec<ScoredResult>, key: &String) {
        match self.index.get(key) {
            Some(postings) => {
                if results.len() == 0 {
                    for posting in postings.iter() {
                        results.push(posting.clone());
                    }
                } else {
                    merge_postings(results, postings);
                }
            }
            None => { }
        }
    }

Once you have some results, merging the postings for remaining words requires running over the two vectors of scored results simultaneously. I didn't use iterators for this; I suppose I could have but I'm not sure.

fn merge_postings(results: &mut Vec<ScoredResult>, postings: &Vec<ScoredResult>) {
    let mut r = 0;
    let mut p = 0;
    while r < results.len() && p < postings.len() {
        if results[r].0 > postings[p].0 {
            p += 1;
        } else if results[r].0 < postings[p].0 {
            results.remove(r);
        } else /* results[r].0 == postings[p].0 */ {
            results[r].1 += postings[p].1;
            r += 1;
            p += 1;
        }
    }
    while r < results.len() {
        results.remove(r);
    }
}

The .0 and .1's are Rust's notation for accessing the two components of the ScoredResult pairs. When the current result's etext number is greater than the current posting's, the p index is incremented. When, vice-versa, the current posting's etext number is greater than the current result's, the current result is removed from the result set (since it does not occur in the postings list). And, when the two etext numbers agree, the result's score is incremented by the posting's score and both indices are incremented. This process implements a boolean and merge; given a previous set of results and a new postings list, the final set will be those etext numbers that occur in both.
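
Here is a tiny worked example of that merge, written as a test with made-up etext numbers and scores; it assumes Etext is an integer type and Score is f64, as elsewhere in this code.

#[test]
fn merge_keeps_only_common_etexts() {
    // etexts 1 and 9 appear on only one side, so they drop out;
    // 3 and 7 appear on both, so their scores are summed.
    let mut results: Vec<ScoredResult> = vec![(1, 2.0), (3, 1.0), (7, 5.0)];
    let postings:    Vec<ScoredResult> = vec![(3, 4.0), (7, 1.0), (9, 2.0)];
    merge_postings(&mut results, &postings);
    assert_eq!(results, vec![(3, 5.0), (7, 6.0)]);
}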

Computing the index from the Project Gutenberg metadata is somewhat complex. The first step is to isolate the title, author, and subject string for each text, separate them into words, and then create a weighting for each word based on 3 for title, 2 for author, and 1 for subject. (No, this doesn't work particularly well. Again, sorry.) Once that is done, the giant pre-postings list is sorted so that all of the entries for a given keyword are together, with all of the etext numbers within each keyword also together. Then, the list can be broken into equivalence classes on keyword and etext number, and the entries in each class can be combined by summing their scores. The result is inserted into the map, creating postings lists with etext number, score pairs for each keyword. Each postings list is then sorted by etext number for convenient merging above.

    pub fn new(metadata: &Metadata) -> Index {
        // Compute a vector of keyword, etext_no, simple score triples.
        let mut postings: Vec<(String,Etext,Score)> = Vec::new();
        for (&etext_no, text) in metadata.iter() {
            postings.extend( text.title.split(' ').map(encode).map( |t| (t, etext_no, 3.0) ) );
            postings.extend( text.author.split(' ').map(encode).map( |a| (a, etext_no, 2.0) ) );
            postings.extend( text.subject.split(' ').map(encode).map( |s| (s, etext_no, 1.0) ) );
        }
        // Sort postings by keyword, then by etext_no.
        postings.sort_by(compare);
        // Accumulate scores for keyword, etext_no, then insert
        // etext_no and combined score into index under keyword.
        let mut index = HashMap::new();
        for cls in equivalence_classes(&postings, &equal) {
            if cls.len() > 0 {
                let key = cls[0].0.clone();
                let etext_no = cls[0].1;
                let score = cls.iter().fold(0.0 as Score, |a,p| a + p.2);
                index.entry(key).or_insert(Vec::new()).push((etext_no,score));
            }
        }
        // Sort stored postings lists by etext_no.
        for (_,postings) in index.iter_mut() {
            postings.sort_by(|l,r| l.0.cmp(&r.0));
        }
        Index { index: index }
    }

You may have noticed something that I did not mention in both creating the Index and up in getting entries from the index: an encode function. The problem is spelling; if you use the raw words from the title, author, and subject metadata, you will have a hard time finding Shakespeare's plays unless you remember the 'e' on the end of his name. To alleviate the problem, the keywords in the index and the words used to look up the postings lists are encoded using a Soundex-style phonetic algorithm. The algorithm I have found to be most successful is the New York State Identification and Intelligence System phonetic code, NYSIIS. Unfortunately, there was no implementation of this algorithm in Rust, so I wrote one. It's horrible. I suggest not looking at it. It does seem to work, however.
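
For context, encode has this shape: a &str in, a String out. The placeholder below is mine and is emphatically not the NYSIIS algorithm; it just lowercases and strips non-letters, where the real function applies the NYSIIS phonetic rules instead.

fn encode(word: &str) -> String {
    word.chars()
        .filter(|c| c.is_alphabetic())
        .flat_map(|c| c.to_lowercase())
        .collect()
}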

My own implementation of NYSIIS makes the comparison between this search code and the Java version more difficult, since I did not look at the NYSIIS implementation in Apache Commons Codec while I was writing it.

Wednesday, July 15, 2015

Reimplementing ashurbanipal.web in Rust, pt. 2

In the last post, I described the background and Rust code to compute text recommendations from the Ashurbanipal data. In this post, I'm going to show the web server code that uses those recommendation structures and supplies data to the browser front-end. As a reminder, this is using Rustful, a lightweight web-application framework in Rust, based on the hyper Rust HTTP library. Ashurbanipal is a prototype of a text recommendation engine based on the contents of the text, not on reviews, other purchases, or any other external data. The prototype is based on the 22,000-ish books on the Project Gutenberg 2010 DVD, at dpg.crsr.net. To use it, go to the page, find a book that you know you like (using the search field at the upper left), and get style-based, topic-based, and combined recommendations for books which are in some way similar to your choice.

  1. Part 1: Computing recommendations.
  2. Part 2: You are here.
  3. Part 3: My kingdom for metadata.

The main function for the application server is pretty simple; in fact it is based on the Rustful example code almost unchanged.

fn main() {
    let args : Vec<String> = env::args().collect();
    if args.len() < 4 { panic!("Usage: ashurbanipal_web pos-data topic-data metadata"); }

    let router = insert_routes! {
        TreeRouter::new() => {
            "style"            => Get : RecQuery::Style,
            "topic"            => Get : RecQuery::Topic,
            "combination"      => Get : RecQuery::Combination,
            "lookup/:etext_no" => Get : RecQuery::TextLookup,
            "lookup"           => Get : RecQuery::TextSearch
        }
    };

    let rec_state = RecState::new(&args[1], &args[2], &args[3]);

    println!("serving...");

    let server = Server {
        content_type : content_type!(Application / Json; Charset = Utf8),
        global       : (rec_state,).into(),
        handlers     : router,
        host         : FromStr::from_str("127.0.0.1:8080").unwrap(),
        log          : Box::new( rustful::log::StdOut ),
        server       : "ashurbanipal_web(Rust)".to_string(),
        ..Server::default()
    };

    if let Err(e) = server.run() {
        println!("could not start server: {}", e.description());
    }
}

In the last post, I mentioned how the collect method is a polymorphic translator between an iterator and a collection; std::env::args is a Rust function returning an iterator over the command-line arguments to the program. The first block ensures that we have been passed something like the right arguments, which should be paths to the data files.

The second block of code is a Rust macro invocation (identified by the trailing '!') which simplifies creating the routing tree used by the application. This application server listens for GET requests on /style, /topic, /combination, /lookup/etext-number, and /lookup. Matching each URL produces one of the possible values of RecQuery, which will be seen below. One odd thing about this macro (and many Rust macros): there cannot be a comma after the last element of a sequence (as in, following TextSearch there). In other parts of Rust, such as structure literals, a trailing comma is fine.

Following that, the third block of code creates a RecState, which contains the recommendation structures described last time (as well as metadata storage).

Finally, this code creates a Server and runs it, listening for connections on the localhost interface, port 8080. The contents of the Server structure configure the web application:

  • Since this server produces JSON data for the benefit of the web front end, the default content type is application/json; charset=utf-8.
  • The data for the server is stored in global storage, accessible from any request handler (see below).
  • Request handling is supplied by the router.
  • The server identifies itself in a header as ashurbanipal_web.

In the router above, the URLs the server is listening for are associated with values of RecQuery:

pub enum RecQuery {
    Style,
    Topic,
    Combination,
    TextLookup,
    TextSearch,
}

impl Handler for RecQuery {
    fn handle_request(&self, context: Context, response: Response) {
        let &RecState(ref style, ref topic, _, _) = context.global.get().unwrap();
        match *self {
            RecQuery::Style       => handle_recommendation_query(style, context, response),
            RecQuery::Topic       => handle_recommendation_query(topic, context, response),
            RecQuery::Combination => handle_recommendation_query(&Combination::new(style, topic), context, response),
            RecQuery::TextLookup  => handle_text_query(context, response),
            RecQuery::TextSearch  => handle_text_search(context, response),
        }
    }
}

The RecQuery type, whose variants carry no data, implements the Handler trait from Rustful, which requires an implementation of the handle_request method. This method recovers the data from the server's global storage and dispatches to one of three functions that actually produce results: handle_recommendation_query, handle_text_query, and handle_text_search. I will discuss the final two when I get to the metadata used by the project, but for the moment I am focusing on the recommendations themselves.

One odd bit here is the line,

            RecQuery::Combination => handle_recommendation_query(&Combination::new(style, topic), context, response),

In the combination case, I have to create a new Combination structure based on the style and topic recommendation structures. Happily, creating this structure is very lightweight; creating a new one for each request is necessary because I could not find a way to reference the same style and topic structures from a persistent Combination: the requirements of Rustful and the lifetimes of the Style, Topic, and Combination structures could never be made to line up. (I am aware that I could have used Rust's Box to store the structures on the heap, but I preferred not to do that. What can I say.)

In any case, handle_recommendation_query is:

fn handle_recommendation_query(r : &Recommendation, context: Context, mut response: Response) {
    let &RecState(_, _, ref metadata, _) = context.global.get().unwrap();
    let start = optional("start", 0, &context);
    let limit = optional("limit", 20, &context);
    match required("etext_no", &context) {
        Some(etext_no) => {
            match r.sorted_results(etext_no) {
                Some(rows) => {
                    let recommendation = Recommendations {
                        count : rows.len(),
                        rows  : metadata.add_metadata(&rows, start, limit)
                    };
                    response.set_status(StatusCode::Ok);
                    response.into_writer().send( json::encode(&recommendation).unwrap() );
                }
                None => {
                    response.set_status(StatusCode::NotFound);
                    response.into_writer().send("no matching etext");
                }
            }
        }
        None => {
            response.set_status(StatusCode::BadRequest);
            response.into_writer().send("parameter required: etext_no");
        }
    };
}

This function can be broken down into four sections, roughly from outside in:

  1. The first recovers the data structures from global storage and the arguments passed in the HTTP request. The start and limit parameters provide windowed access to the results, rather than transmitting the whole, large, result array; they are optional, with defaults of starting at element 0 and returning 20 elements. The one required parameter is the etext number to make recommendations from.
  2. If the necessary etext number is missing or otherwise invalid, this code sets a Bad Request status and replies with an error message.
  3. If the supplied etext number produces some recommendations, this response is put into a Recommendations structure along with added metadata (which process also handles the start and limit) and returned as JSON with a status of Ok.
  4. If, however, the etext number is a valid number but cannot be associated with a text, a Not Found status is returned.

The two functions required and optional pick Strings out of the request's query parameters and return them after converting them to a more useful type using the polymorphic parse method. Parse works with any type implementing the FromStr trait, so this code tries very hard to produce whatever value you want.

fn required<T:FromStr>(v : &str, context : &Context) -> Option<T> {
    context.query.get(v).and_then( |s| s.parse::<T>().ok() )
}

fn optional<T:FromStr>(v : &str, default : T, context : &Context) -> T {
    required(v, context).unwrap_or(default)
}

Required produces an Option, to be explicit about the case where the argument is missing by returning None. A failure to parse the value produces a Result's Err variant, which can be converted into an Option via the ok method. Since both the hash map's get method and the parse/ok chain here produce Option values, they can be strung together with and_then, which returns the first None it sees, or a Some containing the final, appropriate value.

Optional uses required to pick out the value, but it also must be passed a default value to be used when the parameter is missing or fails to parse.
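
To make the chaining concrete, here is a small, standalone illustration of the same Option pipeline, using a plain HashMap in place of Rustful's query map (the parameter values are made up):

use std::collections::HashMap;

fn main() {
    let mut query: HashMap<&str, &str> = HashMap::new();
    query.insert("etext_no", "11");
    query.insert("limit", "not-a-number");

    // get yields Some("11"); parse::<usize>().ok() turns it into Some(11).
    let etext_no: Option<usize> = query.get("etext_no").and_then(|s| s.parse().ok());
    assert_eq!(etext_no, Some(11));

    // The parse fails, so the chain yields None and unwrap_or supplies the default.
    let limit: usize = query.get("limit").and_then(|s| s.parse().ok()).unwrap_or(20);
    assert_eq!(limit, 20);

    // The parameter is missing entirely: again None, again the default.
    let start: usize = query.get("start").and_then(|s| s.parse().ok()).unwrap_or(0);
    assert_eq!(start, 0);
}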

A final interesting piece here is the Recommendations structure:

#[derive(RustcEncodable)]
struct Recommendations<'a> {
    count : usize,
    rows  : Vec<&'a Text>,
}

Remember, the handler produces JSON data. To do so, it uses the rustc_serialize library, which is not part of the Rust distribution. In fact, it is a library for serialization and deserialization that includes a compiler plug-in handling the RustcEncodable derivation annotation. This plug-in evaluates the annotated structure, Recommendations, to automatically provide an implementation of the Encodable trait from rustc_serialize. Once that is done, a reference to a Recommendations can be passed to the library's json::encode function to convert the value into JSON for the HTTP response.
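
For a sense of how little is involved, here is a minimal, standalone sketch of the same pattern, with a made-up Example struct standing in for Recommendations; the only parts that matter are the derive annotation and the json::encode call.

extern crate rustc_serialize;
use rustc_serialize::json;

// The derive generates the Encodable implementation for this struct.
#[derive(RustcEncodable)]
struct Example {
    count: usize,
    rows: Vec<u32>,
}

fn main() {
    let value = Example { count: 2, rows: vec![11, 1342] };
    // Should print something like {"count":2,"rows":[11,1342]}.
    println!("{}", json::encode(&value).unwrap());
}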

In this case, the recommendations include the total number of computed recommendations and the data for a selected window of those recommendations. The actual data for a text come from a Text structure (which also derives RustcEncodable), which I will get into next time.