Wednesday, March 19, 2014

Quote o' the day: excessive?

Araq wrote on proggit,

Disclaimer: I'm nimrod's "one guy".

  • Nimrod is being dogfooded too, the compiler and all of its infrastructure are written in Nimrod.
  • Nimrod is not only me, we're 3-5 depending on how you count. It's true that I'm the BDFL and that there is no company behind it and getting paid for it would really help a lot.
  • What does it offer over Rust? Lots of things just like Rust offers quite a lot over Nimrod, but these comparisons will be more meaningful when both reached version 1.0. Short list: exceptions tracking, an effects system, typesafe bitflags, excessive compile time evaluation, a taint mode, and quite a different concurrency model. The latter took a long time to develop and its implementation in its infancy, but I'm mildly optimistic we found a good "worse is better" solution, which is safe enough in practice while still remaining quite simple.

I can't really argue with that.

Saturday, March 15, 2014

Link o' the day: A Note on Distributed Computing

I just had cause to look this up, and I had a devil of a time finding it.

"A Note on Distributed Computing", Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. SMLI TR-94-29, Sun Microsystems Laboratories, Inc., Nov. 1994.

We argue that objects that interact in a distributed system need to be dealt with in ways that are intrinsically different from objects that interact in a single address space. These differences are required because distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure....

We conclude by discussing what is required of both systems-level and application-level programmers and designers if one is to take distribution seriously.

This is one of the most important papers in network protocol history, and one that nobody seems to have read.

Saturday, March 1, 2014

Automatic memory management: hybrid reference counting

When last we left, I had demonstrated that one form of garbage collection, reference counting, had a significant problem: it could not collect cycles of objects. If two objects point at each other, each will always have a reference count greater than zero even if nothing else points to them.

Various smart people created various fixes for the cycle problem. One persistent suggestion is to use weak pointers, pointers which do not increase the destination's reference count but which must be checked before they can be followed because the destination may no longer exist. Weak pointers work well if the structure being built includes designated, special references like back pointers in a doubly-linked list or parent pointers in a tree. On the other hand, in a general graph structure weak pointers seem less useful. One algorithm, mentioned in Garbage Collection by Jones and Lins attempts to automatically break cycles using weak pointers. They also mention that it does not actually work.

A better option is a cycle detector: a separate subsystem that scans objects that appear to be live, checking them for real, external pointers. Typically, cycle detectors work by removing the counts for the internal links in a graph of objects, leaving only valid external links. Python has a fine example of a cycle detector. (Which link I found via a Stack Overflow answer; some old documentation still seems to be relevant.) The Python documentation describes both the Python and lower-level C interfaces of its cycle detector. By the way, if you take a look at any of those links, pay attention to the handling of objects with __del__ methods. But more on that later.

A slightly different cycle collector is described in Garbage Collection, specifically one by Rafael Lins himself.

Lins' cycle collector

Lins' algorithm hooks into the normal reference counting mechanism when an object's reference count is decremented. If the count becomes zero, the object is indeed garbage and can be deleted. If the reference count is non-zero, however, the object is acting sketchy. It is not behaving normally. It is standing out. And you know what happens to nails that stick out.

Anyway, when an object's count is reduced but remains above zero, it is inserted into a control set maintained by the collector. It is also marked with a special color, purple.

Coloring is a frequent technique in collection algorithms. In Garbage Collection, the typical colors used are black, grey, and white, but the colors indicate different things for different algorithms; in the case of Lins' algorithm, black indicates an object is known to be in use, grey indicates an object that may be garbage, and white indicates that an object is known to be garbage. Purple means sketchy.

Cycle detection works in three phases, for each object found in the control set that is still purple (objects can be recolored black by being used by the mutator):

  1. The object is marked grey, and all of its children recursively have their counts decremented and are marked grey. This step subtracts the effects of all of the pointers between objects in the sub-graph consisting of this object and its children---any reference count greater than zero must represent a pointer from outside this group of objects. Note that no deletion happens now. The objects may still be alive and if so, will have their reference counts restored.
  2. Starting from the same element of the control set, the algorithm scans grey objects, testing their current reference count. If the count is greater than zero, the object has an external pointer. The object and its children are recolored black and their reference counts restored. If the count is equal to zero, the object is colored white.
  3. Again starting from the element of the control set, white elements are deleted.

The third step as presented in Garbage Collection has issues. The algorithm there is:

collect_white(S) =
  if colour(S) == white
    for T in Children(S)
      collect_white(*T)
    free(S)

The two problems are, first, if the structure is indeed cyclic then this algorithm will recur infinitely, and second, a deleted object may be visited by a later stage of collect_white. Many memory management systems would not find the latter to be incorrect since freeing an object still leaves its storage under the care of the manager, but it does become a problem when an underlying allocator such as malloc is used. Weirdly, the former does not seem to be present in Lins' original paper.

There are some preliminary complications that need to be handled before we can look at this collector in Rust.

Type management and allocations

Back in the last post, the Rc and Box types were parameterized by the type T, the type of the user data. Doing so has a number of advantages: it is always completely type safe since the type is tracked from the creation of the Boxes and Rc's, through borrowing the object, to the deletion step. Further, deletion is simplified because a Box<T> can be allocated as an owned pointer, converted to an unsafe pointer for most of its life, and returned to an owned pointer to be deleted, correctly handling any deletion process for T.

Unfortunately, the collector I am working on here will be looking at objects of many different types. To see why a collector cannot deal with only a single type, consider what happens when a type T contains a reference to a type U, which in turn contains a reference to T: we are back in the realm of uncollectable garbage. Fortunately, Rust provides some assistance for us. (What the heck, it has to track this stuff.)

The key I am using is the TyDesc pointer, a pointer to a type descriptor provided by the std::unstable::intrinsics module. (Does that make you nervous? It should. I'm using Rust 0.9; this is probably all changed in the head of master. Fortunately, intrinsics seems to have been reasonably stable since 0.7 and the facilities have been present and used by the arena module since 0.5.)

TyDesc (I'm not going to bold every instance of Rust code anymore; when I last fiddled with the Maniagnosis fonts, all the boldy things became annoyingly busy.) is a structure containing a number of useful things including the size and alignment requirements for the type being described and the type's drop_glue, a function to be called to invoke the type's destructor (which may be explicit as in an instance of the Drop trait, or not as in the case of a type containing an owned pointer to a vector or something.

Rust makes extensive use of the Resource Allocation Is Initialization (RAII) technique, particularly for memory management in its standard library. That is the whole "owned pointer" thing. And no, Rust's types are not parameterized with an allocator like C++'s standard templates. (And I hope they never will be; that is insane.) As a result, drop_glue is very important. The alternative for a memory management system like this would likely be to provide separate implementations of a great deal of the standard library. Rust's original @ sigil, the associated "garbage collector," managed strings, and the at_vec module are all an example of the result. Rust isn't Haskell; multiple implementations of the standard prelude in the standard library are not really appreciated. Consistent, if not deterministic, handling of destructors must be a requirement for a memory management system in Rust.

To start, Box will no longer be parameterized (explicitly, anyway) by the user's type T. Instead, it will contain the reference count, a color, and a pointer to T's type descriptor.

#[deriving(Eq,ToStr)]
enum Color { Black, Grey, White, Purple, }

struct Box {
    color  : Color,
    count  : uint,
    tydesc : *TyDesc,
}
drawing of a Box

The Box and its object will be allocated in a single call to malloc, leaving the object immediately following the Box header. Because of the object's alignment requirements, there may need to be some padding between the Box and the object itself; the size of the Box and this padding along with the location of the object can be calculated using the size of the Box and the type descriptor.

fn padded(size : uint, align : uint) -> uint { (size + align - 1) & !(align - 1) }
fn box_offset(t : *TyDesc) -> uint { unsafe { padded( intrinsics::size_of::<Box>(), (*t).align ) } }

Allocating and initializing the Box and object is...interesting. Welcome to systems programming!

impl Box {
    unsafe fn unsafe_new(t : T) -> *mut Box {
        let tydesc = intrinsics::get_tydesc::<T>();
        let offset = box_offset(tydesc);
        let block  = libc::malloc( (offset + (*tydesc).size) as u64 ) as *u8;
        intrinsics::move_val_init( cast::transmute(block),
                                   Box { color  : Black,
                                         count  : 1,
                                         tydesc : tydesc, } );
        intrinsics::move_val_init( cast::transmute( ptr::offset(block, offset as int) ),
                                   t );
        block as *mut Box
    }
...
}

move_val_init takes a pointer to an uninitialized block of memory plus a Rust value, and moves the value into the block without deinitializing the block or doing any other magic. cast::transmute is used to tell the type system that I really, really know what I'm doing. (It's a lie.) That is, to mean in the first use that block is a pointer to a Box-sized area, and in the second that the offset calculation is a pointer to a object-sized area. ptr::offset is used to compute the location of the object in a manner similar to C's array indexing; block plus offset times the size of *block. In this case, block is a pointer to bytes, which matches the offset calculation.

Deallocation is similarly interesting.

    unsafe fn unsafe_free(ptr : *mut Box) {
        let tydesc_t = (*ptr).tydesc;
        ((*tydesc_t).drop_glue)( ptr::offset(ptr as *u8, box_offset(tydesc_t) as int) as *i8 );
        libc::free( ptr as *libc::c_void );
    }

unsafe_free invokes the object type's drop_glue with a pointer to the object, as a pointer to (inexplicably) a signed i8. Then, it frees the entire Box and object block. This step is a pretty hairy issue.

If you look at Tim Peter's comment on python-dev, or a fair number of comments on python-dev in March, 2000, you will find a great deal of discussion about Python's cycle detector and its relationship to Python objects' finalizers.

If you would prefer to avoid reading Tim Peter's weird typographic tics in ancient Python mail, here's the takeaway: garbage collection and user-defined destructors are not a good combination in roughly the same way that licking a server rack is not a good idea. In either case, it doesn't taste very good and you end up with difficult-to-explain cuts. The bottom line is that Python does not automatically free objects that are participating in cycles and that have user-defined finalizers.

The problem is two-fold: on the one hand, finalizers can (sanely!) be used to clean up other resources such as file descriptors, but the nondeterministic nature of collection makes that much less attractive. (You will likely run out of file descriptors before you run out of memory.) On the other, finalizers can also be used to do weird things like resurrecting objects just before they would be deleted. These things may not be valid if some other, pointed-to, child, object has already been deleted.

Further, not calling the object's drop glue is not an option, since allowing the object to have owned pointers to other non-collected objects is important to avoiding the at_vec issue, and calling drop_glue seems to be the only way to free such owned pointers. I think it might be reasonable to prevent allocated objects from having user-defined finalizers, but I do not know how to do that at the moment. As a result, I am invoking the drop_glue before unconditionally releasing the object. (No resurrections here!)

Fortunately, Rust already offers plenty of opitions for non-collected objects with appropriate RAII semantics, which ameliorate the first problem and I am perfectly willing to put the second under the heading of "Don't do stupid things", with the corollary that one ought not to mix destructors and garbage collection. As a result, I don't think the Python answer to the question is particularly applicable to Rust.

I will cover the remaining reference counting operations later, but to illustrate one, the method used to get a reference to the object makes significant use of the type descriptor.

    unsafe fn unsafe_get<T>(&self) -> *T {
        // TODO: Ensure T matches tydesc?
        let ptr : *u8 = cast::transmute(self);
        ptr::offset( ptr, box_offset(self.tydesc) as int ) as *T
    }

This method takes a type T and returns a pointer to what is hoped to indeed be a T. Currently, that is not verified at runtime. However the smart-pointer type is parameterized by T and is the only thing that can get to this method, so any invocation of this method has to be called with the right T. That is my story and I'm sticking to it.

Lins' operations

initial state of Lins' algorithm
Initial state when collecting: objects on the right are live; objects on the left are garbage.

The basic operations of Lins' algorithm are mark_grey, scan, scan_black, and collect_white. Currently, these are all implemented by the Box struct.

    unsafe fn mark_grey(&mut self) {
        if self.color != Grey {
            self.color = Grey;
            let children = self.children();
            for &ch in children.iter() {
                (**ch).count -= 1;
                (**ch).mark_grey();
            }
        }
    }

mark_grey acts to remove the effect of pointers from this object to other objects from the other object's reference counts. As a result, when it is complete any non-zero counts represent pointers from outside the (possibly cyclic) graph that this object participates in.

state after mark_grey
After mark_grey: all objects have been marked grey.

I will get into the children method later; it is a complex topic. For the moment, understand that it returns a collection of pointers to references to Boxes that represents the set of children of the object associated with this Box. Alternatively, it returns a collection of uninterpreted values, each representing an object directly referenced by this object, but that is saying the same thing. (See tracing below. Do you like systems programming yet?)

mark_grey is almost a direct translation of the algorithm, as is scan.

    unsafe fn scan(&mut self) {
        if self.color == Grey {
            if self.count > 0 {
                self.scan_black();
            } else {
                self.color = White;
                let children = self.children();
                for &ch in children.iter() {
                    (**ch).scan();
                }
            }
        }
    }

scan_black restores the reference counts for objects that are still alive. Following the scan/scan_black step, the objects with non-zero external counts and their descendants account for both pointers from outside this graph and inside while objects without external references are colored white.

    unsafe fn scan_black(&mut self) {
        if self.color != Black {
            self.color = Black;
            let children = self.children();
            for &ch in children.iter() {
                (**ch).count += 1;
                (**ch).scan_black();
            }
        }
    }
state after scan
After scanning: live objects are black, garbage is white.

Note that all of the sketchy objects are scanned by the collector. It is not readily apparent that a run of Lins' stages on any specific object will result in a collection of the cycle, since there may be multiple internal references that would keep it alive. However, I believe that in any cyclic directed graph, there will be at least one node representing a key object that, under this algorithm, would result in the collection of the cycle. I should probably check Lins' paper.

This, specifically, is the difference between this algorithm and the improvement published in the The Garbage Collection Handbook: this algorithm runs each stage in order on each element of the control set while the improved version runs on the entire control set at each stage. This algorithm may repeatedly mark and unmark an object before it is conclusively determined to be garbage.

collect_white is written thusly:

    unsafe fn collect_white(&mut self, collected : &mut HashSet<*mut Box>) {
        if self.color == White {
            collected.insert(self as *mut Box);
            self.color = Purple;
            let children = self.children();
            for &ch in children.iter() { (**ch).collect_white(collected); }
            self.color = White;
            task_collector().remove(self);
        }
    }

This method uses a temporary purple color change to prevent the infinite recursion issue, and collects the white objects onto another structure for deletion to avoid the reference-after-free issue. The final line, task_collector().remove(self), removes this box from the control set maintained by the collector. (It also foreshadows more Rust follies, too.)

All of these operations are described and written recursively. mark_grey, scan, and scan_black can be trivially rewritten iteratively, using an auxiliary stack. I believe collect_white can be as well, with more complexity.

The collector


pub struct HybridRcCollector {
    control_set : HashSet<*mut Box>,
    trace_fns   : HashMap<*TyDesc,fn (*u8) -> ~[*mut Box]>,
}

impl HybridRcCollector {
    fn new() -> HybridRcCollector {
        HybridRcCollector { control_set : HashSet::new(),
                            trace_fns   : HashMap::new(), }
    }
...

For the moment, the important element of the HybridRcCollector structure is the control_set, the collection of sketchy objects. I will get to trace_fns in a bit.

The two primary operations on the former are insert and remove.

    fn insert(&mut self, ptr : *mut Box) { self.control_set.insert(ptr); }
    fn remove(&mut self, ptr : *mut Box) { self.control_set.remove(&ptr); }

In addition to the call to remove in collect_white, there are calls in unsafe_delete, the Box method used to handle decrementing an object's reference count.

    // Part of impl Box
    unsafe fn unsafe_delete(ptr : *mut Box) {
        if ptr.is_not_null() {
            (*ptr).count -= 1;
            if (*ptr).count == 0 {
                task_collector().remove(ptr);
                Box::unsafe_free(ptr);
            } else {
                (*ptr).color = Purple;
                task_collector().insert(ptr);
            }
        }
    }

This is the method that actually decrements a reference count under normal circumstances, deletes the object if the count is zero, and inserts the object into the collector's control_set if its count has been decremented but is still greater than zero. The test for a null pointer is important later; let me present the GC method itself before I get into it.

    unsafe fn unsafe_gc(&mut self) {
        let mut deleting : HashSet<*mut Box> = HashSet::new();
        loop {
            // there has to be a better way to get elements out of a set
            match self.control_set.iter().nth(0) {
                Some(&ptr) => {
                    self.control_set.remove(&ptr);
                    if (*ptr).color == Purple {
                        (*ptr).mark_grey();
                        (*ptr).scan();
                        (*ptr).collect_white( &mut deleting );
                    }
                }
                None => { break; }
            }
        }
        for &ptr in deleting.iter() {
            let children = (*ptr).children();
            for &ch in children.iter() {
                // convert ch to a mutable pointer
                let ch = ch as *mut *mut Box;
                *ch = ptr::mut_null();
            }
            Box::unsafe_free(ptr);
        }
    }

unsafe_gc is an implementation of Garbage Collection's gc_control_set algorithm. It loops over the elements in the control_set, removing those elements, and then calling the three major operations on each. The end result is that the live object graph is in the same state is was before, while garbage objects from cycles have been collected in deleting. At that point, all of the contents of deleting are deleted.

Note before actually calling Box::unsafe_free, the object's child references are set to null.

Remember discussing finalizers? The problem here is when an object A and an object B point to each other. Deleting A without taking special measures will drop the reference count on B, calling its destructor. Then, deleting B results in a second call to the destructor and seven years' bad luck all around, even if you avoid the infinite recursion. To break that cycle while still ensuring that every object's destructor is called, I null the object's reference-counted-child pointers.

The test in unsafe_delete prevents that method from doing anything when it is called from the drop method of a reference-counted pointer after that point; this breaks the recursion. When A is freed, its destructor is called, but B is not deleted. Instead, B is assumed to also be in the deleting collection; it will be cleanly removed when its turn arrives.

Be aware that if the object has a user-defined destructor that attempts to dereference reference-counted pointers in the object, bad things will happen. The rest of this code ensures safety, that doesn't.

Smart pointers: Hrc

Given what has gone before, the type-safe smart pointer type is pretty trivial. It has the same borrow methods as the Rc type from the last post and roughly the same Drop and Clone implementations.

#[no_send]
pub struct Hrc<T> { ptr : *mut Box, }

impl <T> Hrc<T> {
    #[inline]
    pub fn borrow<'l>(&'l self) -> &'l T {
        unsafe { cast::transmute( (*self.ptr).unsafe_get::<T>() ) }
    }

    // Bad programmer! Bad! No biscuit!
    #[inline]
    pub fn borrow_mut<'l>(&'l mut self) -> &'l mut T {
        unsafe { cast::transmute( (*self.ptr).unsafe_get::<T>() ) }
    }
}

#[unsafe_destructor]
impl<T> Drop for Hrc<T> {
    fn drop(&mut self) { unsafe { Box::unsafe_delete(self.ptr as *mut Box); } }
}

impl <T> Clone for Hrc<T> {
    #[inline]
    fn clone(&self) -> Hrc<T> {
        unsafe { (*(self.ptr)).unsafe_inc(); }
        Hrc { ptr : self.ptr }
    }
}

Now, about that children() thing.

Tracing

Going back to the taxonomy from the last post, this hybrid reference counter and cycle detector is actually a stationary tracing collector. The next hard part to deal with is tracing.

In a traditional collector, tracing starts from roots consisting of the stack, static variables, and registers. In this algorithm, tracing starts from the collector's control set (hybrid reference counting makes a slightly different use of tracing than mark-sweep or copying collectors). Tracing then follows pointers embedded in objects. If you thought that last post had all the vocabulary you might want, you were wrong. There are roughly two ways of identifying roots and pointers in objects: conservatively or precisely.

  • Conservative collection involves looking through memory accepting anything that looks like a pointer as one. Typically, that means an n-bit sequence of memory with appropriate alignment and a value in the heap range.
  • Precise collection uses clearly identified pointers, specifically those that are pointing to heap storage. This can best be done with compiler support, since the compiler knows pointer locations in objects and in function stack frames as well as the types associated with pointers.

One sort of middle ground can be found by using tags associated with pointers. Without compiler support, however, some variant of conservatism is almost the only option. However, pure conservatism would be problematic in Rust. Take for example the Cells used in the mutator simulator. Conservatively tracing a cell would mean following a pointer to the location of the owned vector (outside of any memory managed by a collector), then scanning it for heap pointers.

There are another couple of possibilities. One that I haven't explored is to use Rust's really unstable-looking reflection support, also in std::unstable::intrinsics, which might make it possible to find pointers in a data structure by type-directed scanning. That approach feels like it would be slow, however.

For now, I am forcing the user to supply a tracing function for types stored in reference-counted boxes by requiring the trait Trace.

An implementation of Trace should provide a static method that takes a reference to the same type and returns a vector of TracePtr's.

pub struct TracePtr { priv ptr : **mut Box, }

trait Trace {
    fn trace(ptr : &Self) -> ~[TracePtr];
}

There really has to be a better way to do that than return a vector.

TracePtr values are provided by the trace method in Hrc.

    // method in Hrc
    pub fn trace(&self) -> TracePtr {
        TracePtr{ ptr : &self.ptr }
    }

Some implementations of Trace are trivial.

impl Trace for int {
    fn trace(_ : &int) -> ~[TracePtr] { ~[] }
}

While others are less so. This is an implementation for objects consisting of a vector of smart pointers:

    impl Trace for ~[Cell] {
        fn trace(this : &~[Cell]) -> ~[TracePtr] {
            this.iter().map( |&C(ref e)| e.trace() ).collect()
        }
    }

The Trace trait is a requirement for an object being allocated in a smart pointer.

pub fn allocate<T:Trace>(t:T) -> Hrc<T> {
    task_collector().register::<T>();
    Hrc { ptr : unsafe { Box::unsafe_new(t) } }
}

Yes, I just smuggled the actual allocator interface, allocate, through in the middle of a completely unrelated section. Sue me.

The fancy trick here is the line, task_collector().register::<T>(). register is a method in HybridRcCollector that is only interested in the type parameter, T. It looks up the type descriptor for the type T, gets a pointer to the trace function implemented by T, and uses the former to store the latter in the trace_fns map.

    // method in HybridRcCollector
    unsafe fn register<T:Trace>(&mut self) {
        let tydesc = intrinsics::get_tydesc::<T>();
        let trace_fn : fn (&T) -> ~[TracePtr] = Trace::trace;
        self.trace_fns.insert(tydesc, cast::transmute(trace_fn));
    }

Here's the neat thing: because of the way I've defined Trace, TracePtr, and Hrc::trace, the functions stored in trace_fns are in fact of the type fn (*u8) -> ~[**mut Box], as long as the type used to get the key to trace_fns is the same as the type of the object pointed to by the argument of the function. (Did I mention that this was systems programming? There are a few other assumptions, such as that the ptr in TracePtr will be the only bytes of the object. These sorts of things are common in C programming and are the cause of some of the weird requirements of the C standards. I hope that the Rust standard follows along.)

Once the trace_fns map is filled correctly, the Box's children method can built in stages, first by recovering the trace function.

    // in HybridRcCollector
    fn get_trace_fn<'l>(&'l self, tydesc : *TyDesc) -> &'l (fn (*u8) -> ~[**mut Box]) {
        self.trace_fns.get(&tydesc)
    }

And then calling it with the object attached to the Box, to produce a vector of pointers to child Boxes.

    // in Box
    unsafe fn children(&self) -> ~[**mut Box] {
        let collector = task_collector();   // "does not live long enough" if used in the expression
        let trace = collector.get_trace_fn(self.tydesc);
        let t = ptr::offset( cast::transmute(self), box_offset(self.tydesc) as int );
        (*trace)(t)
    }

task_collector?

There is one remaining complication, and fantastic Rust-ism, necessary for this memory manager. Many of the Box methods, as well as user code, need to access the HybridRcCollector structure. Fortunately, this code is serial, meaning that it runs in a single task, and Rust provides a hook to hang task-local data on: the local_data module.

(I stole most of this from the rand module, which has the TaskRng task-local random number generator.)

local_data_key!(HYBRID_RC_COLLECTOR_KEY : ~HybridRcCollector)

#[no_send]
pub struct TaskCollector {
    priv collector: *mut HybridRcCollector,
}

pub fn task_collector() -> TaskCollector {
    local_data::get_mut(HYBRID_RC_COLLECTOR_KEY, |alloc| match alloc {
            None => {
                let mut hrc = ~HybridRcCollector::new();
                let ptr = &mut *hrc as *mut HybridRcCollector;
                local_data::set(HYBRID_RC_COLLECTOR_KEY, hrc);
                TaskCollector { collector : ptr }
            }
            Some(hrc) => {
                TaskCollector { collector : &mut **hrc }
            }
        })
}

A TaskCollector is a pointer to the task's collector, lazily initialized when task_collector is first called. The TaskCollector itself is mostly a facade calling the unsafe HybridRcCollector's methods.

impl TaskCollector {
    pub fn insert(&self, ptr : *mut Box) { unsafe { (*self.collector).insert(ptr); } }
    pub fn remove(&self, ptr : *mut Box) { unsafe { (*self.collector).remove(ptr); } }

    pub fn gc(&self) { unsafe { (*self.collector).unsafe_gc() }; }

    pub fn collecting(&self) -> bool { unsafe { (*self.collector).collecting } }

    pub fn register<T:Trace>(&self) { unsafe { (*self.collector).register::<T>() } }
    pub fn get_trace_fn<'l>(&'l self, tydesc : *TyDesc) -> &'l (fn (*u8) -> ~[*mut Box]) {
        unsafe { (*self.collector).get_trace_fn(tydesc) } }
}

Currently, a Rust program blows up amusingly if the HybridRcCollector has a Drop destructor. However, if it were to work, that would be an excellent place to put a final call to the garbage collector, to clean up the allocated memory as the task is exiting.

Result

How does it work? Pretty well.

$ valgrind ./hybrid-reference-count-trace-test 
...
[ creates: 399553, deletes: 300018, links: 200216, unlinks: 31167, ops: 1000000 ]
361
==20660== 
==20660== HEAP SUMMARY:
==20660==     in use at exit: 40 bytes in 1 blocks
==20660==   total heap usage: 805,675 allocs, 805,674 frees, 58,457,125 bytes allocated
==20660== 
==20660== LEAK SUMMARY:
==20660==    definitely lost: 0 bytes in 0 blocks
==20660==    indirectly lost: 0 bytes in 0 blocks
==20660==      possibly lost: 0 bytes in 0 blocks
==20660==    still reachable: 40 bytes in 1 blocks
==20660==         suppressed: 0 bytes in 0 blocks

361 is the length of the control set just before the collector is run as the task is finishing.

Future work

I have a few things to do before I can slap some lipstick on this pig and release it into the wild.

  • My copy of the Garbage Collection Handbook showed up, and although I have not read it yet, I did note that it describes a different algorithm for cyclic reference counting, with better asymptotic behavior than Lins'. I need to take a look at that algorithm.
  • I haven't spent a lot of time optimizing this. The recursive methods should probably be converted to iterative behavior, and there are a couple of other things that can probably be improved, both in terms of execution behavior and space overhead. One example is recursive implementations of Lins' algorithms. Another is the use of temporary vectors by the Trace trait.

On the other hand, I think I can get something like a mark-sweep collector out of the techniques I have here, which would possibly have better, the same, or worse overhead along with better or worse collection delays.

Wednesday, February 19, 2014

Quote o' the day: This is a great error message

When you really hork something in Rust, you get the most wonderful error messages:

task '<main>' failed at 'No entry found for key: &(0x609830 as *())', /home/mcguire/soft/rust/src/rust-0.9/src/libstd/hashmap.rs:497


The ocean ate the last of the land and poured into the smoking gulf, thereby
giving up all it had ever conquered. From the new-flooded lands it flowed
again, uncovering death and decay; and from its ancient and immemorial bed it
trickled loathsomely, uncovering nighted secrets of the years when Time was
young and the gods unborn. Above the waves rose weedy remembered spires. The
moon laid pale lilies of light on dead London, and Paris stood up from its damp
grave to be sanctified with star-dust. Then rose spires and monoliths that were
weedy but not remembered; terrible spires and monoliths of lands that men never
knew were lands...

fatal runtime error: thread-local pointer is null. bogus!

Quote from "The Crawling Chaos" by Elizabeth Berkeley and Lewis Theobald, Jun. (a.k.a. Winifred Virginia Jackson and H. P. Lovecraft).

Wednesday, February 12, 2014

Automatic memory management: reference counting

Last December, when I wrote about Flex- and BisonModule, I mentioned that it was one of my favorite hacks. But I did not describe what the hack was.

Python uses reference-counted memory management. When I wrote BisonModule, that became the chief obstacle: while flex behaves very simply, bison takes control of the program and calls back into Python code to read tokens and to build the parse tree. As a result, managing the lifetimes of symbol objects is a bit tricky. My first attempt dug rather deep into bison's runtime and used the fact that you can poke around with negative sub-expression references ("$1" in the right-hand part of a rule, for example) to access previous symbols. The result was ugly and stood a very good chance of leaking memory like a sieve.

So I scrubbed that and started over; I created an array of Python objects for tokens and symbols and arranged for this array to "own" the objects. Every new symbol or token object was added to the array, with a reference count of 1, for the array ownership. Adding an object to the parse tree incremented the reference count for that object and setting the tree root provided an external pointer and reference count bump to keep the tree from being deleted. Getting rid of the extraneous objects was as simple as walking the array when the parse completed and decrementing each object's reference count. Objects that then had zero counts were garbage and immediately deleted; objects that were part of the tree had non-zero counts and were returned to Python. The algorithm is simple and does not require intimate knowledge of bison's internals, but it does have a weaknesses---all of the tokens and symbols produced in intermediate steps in the parse are kept around until the parse is completed.

It also does not leak memory, even when using Python's reference counting.

Automatic memory management

(There was a beautiful segue here, somewhere.)

There are three fundamentally different styles of managing dynamically-allocated memory in a computer program:

  • Manually. Manual memory management involves calling new, malloc, or something similar, to allocate a new block of memory, and free, delete, or something like that to release it. Manual memory management is not difficult, really, but it is a horrendous pain in the rump. It is one of the primary factors that makes systems programming mostly bookkeeping. I don't intend to write much about manual memory management here.

  • Semi-manually. Or, alternatively, semi-automatic memory management. Oddly enough, this is one of the best approaches, if the problem allows it. Of course, the only technique that I know about in this category is the arena, where an allocator maintaining a large block of memory hands out that memory in small pieces for use by the program; rather than deleting the individual pieces, the allocator frees the entire block in one operation. With this scheme, allocations within the region are cheap: bump up a high-water-mark in the region and return a pointer to the space between the old and new values of the mark. Deallocation is also cheap, since nothing is done with the small pieces until the entire region is discarded. For some investigation of their performance, see "Reconsidering Custom Memory Allocation", by Emery Berger, Benjamin Zorn, and Kathryn McKinley.

    Unfortunately, arena-based management does not always apply. It works best when the program has (possibly nested) phases: at the beginning of a phase, the program can create a region, use it during the phase, and discard it at the end of the phase. It works less well when allocations of different lifespans are intermingled or when the program does not have clearly identifiable and separate phases. Two successful examples are Apache httpd, which supplies an arena whose lifespan matches the processing of a single request to the code handling the request, and the passes of a multi-pass compiler.

    Rust currently includes an arena memory manager module, although I haven't explored it closely. At the moment, I do not intend to look at semi-manual management further.

  • Automatically. Automated memory management puts the work of dynamic memory management, specifically, the work of deallocating memory, off onto a software system, saving the programmer effort at a greater or lesser cost of efficiency and convenience. This topic is what I want to look at today.

Before I go further, I need some specialized vocabulary. If you are peering through automatic-memory-manager-colored glasses, the world is a strange place.

The program that you or I would think of as doing all the work is just a mutator. A mutator does one of four things: it allocates an object; deallocates an object; links two objects, adding a pointer from one object to another; and unlinks two objects, removing a link from the first to the second. (Strictly speaking, the last may not be necessary; I am including it for extra verisimilitude.) And an object is a block of memory allocated and used by the mutator. These objects are allocated from a heap, the chaotic source of all memory.

Continuing with my list above, there are several different kinds of automated memory managers:

  • Reference counting. Yes, reference counting is an automatic memory management scheme, although it requires a considerable amount of help from the mutator. Reference counting works by associating a counter with every allocated object; this counter contains the number of pointers that refer to the allocated object. As long as the reference count is greater than or equal to one, the object may be reachable and potentially in use. But as soon as the reference count goes to zero, the object is garbage and may be deleted. Copying a pointer requires incrementing the reference count and removing a pointer requires decrementing it; those actions need to be taken by the mutator.

  • Tracing garbage collectors. In any memory management scheme, there will be roots, pointers outside the managed heap that point at objects inside the heap. Further, objects inside the heap may contain pointers to other objects inside the heap. A path of pointers from a root is required for any object to be considered live, to be available for use by the mutator. Tracing garbage collectors periodically follow, or trace, the pointers from the roots to identify all of the live objects. Or, really, to identify the objects which are not live, the garbage. These garbage objects can then be freed or reused.

    There are two broad categories of tracing garbage collectors:

    • Stationary garbage collectors, in which an object, once allocated, does not change its location in memory. A mark-sweep collector is a typical example; these work by walking through the live objects from the roots and marking them, then freeing the remaining, unmarked objects in a sweeping phase.

    • Moving garbage collectors, which may move an object once it has been allocated. A two-space copying collector is a typical example of a moving collector, where objects are initially allocated in one half of the heap. Then, when that half, or space, has filled, objects are traced from the roots and copied into the second space. Once all of the live objects have been moved to the second space, the original space contains only garbage, which is freed.

    Ignoring some very significant second-order effects, the primary distinction between stationary and moving collectors is that the former only need to identify one pointer to an object, to determine that it is live. The latter must identify all pointers to an object, in order to update them with its new location.

    My terms are a bit not-standard. Stationary tracing collectors are frequently referred to as mark-sweep collectors and moving collectors as copying collectors, after their prototypical algorithms, mark-sweep and two-space copying.

There are hordes of variations on each of these categories; the originals of most date back to Lisp in 1960, and many people have done a great deal of exploration; some of the most recent changes in the Java Virtual machine have been new garbage collectors. The reference I am currently using is Garbage Collection: Algorithms for Automatic Dynamic Memory Management, by Richard Jones and Rafael D. Lins, a book that I've had for years. It describes a great many variations on the basic themes. My copy of The Garbage Collection Handbook: The Art of Automatic Memory Management, by Jones, Antony Hosking, and Eliot Moss, is on its way from Amazon as I type; I hope it is as good and more up-to-date.

In the mean time, let's talk reference counting. Here is a simple reference counter in Rust:

Reference counting in Rust

#[no_send]
pub struct Rc<T> { priv ptr : *mut Box<T>, }

A value of type Rc<T> represents a pointer to a box containing a value of type T. The type is marked no_send to prevent it from being shared with another thread; there is no locking here. Also, an unsafe '*' pointer is used to prevent the behavior of Rust's other pointer types from getting in the way.

A Box is:

struct Box<T> {
    count : uint,
    value : T,
}

The count is the number of references from Rc pointers while value is the value. (Note that the Rust standard library's own Rc type goes through an extra level of indirection for mutable values, in order to support run-time borrow checking and to match the language's other safety guarantees.)

The basic operations

The basic operations are creating a new Rc pointer and Box, and borrowing the value in the Box.

// unsafe operations
impl<T> Rc<T> {
    pub unsafe fn unsafe_new(t : T) -> Rc<T> {
        Rc { ptr : cast::transmute( ~Box{ count : 1, value : t } ) }
    }
}

// safe operations
impl <T> Rc<T> {
    pub fn new(t:T) -> Rc<T> { unsafe { Rc::unsafe_new(t) } }

    pub fn borrow<'l>(&'l self) -> &'l T { unsafe { &(*self.ptr).value } }

    // DANGER! DANGER!
    pub fn borrow_mut<'l>(&'l mut self) -> &'l mut T {
        unsafe { &mut (*self.ptr).value }
    }
}

unsafe_new creates a new box with a given value and a count of one, and returns a pointer to it wrapped in an Rc type. cast::transmute converts the ~, owning pointer into a *, unsafe pointer; using an owned pointer makes it easy to free the Box when needed, but the Rc value does not own the box and the raw owned-pointer behavior would get in the way.

The new function provides a safe, simple wrapper for creating boxes and Rc values, and the borrow method allows the mutating code access to the value in the box. The lifetime on borrow indicates that the borrowed pointer must not outlive the Rc value; by construction, the Rc value will not outlive the Box containing the value.

borrow_mut is a bad idea. It allows a mutator to change the value in the box. Rust's standard library Rc type goes to considerable lengths to ensure that it cannot be used to violate Rust's safety. This thing doesn't. Don't do this. But for my simple example here, it's fine.

The magic

The magic happens on dropping and cloning Rc values.

#[unsafe_destructor]
impl<T> Drop for Rc<T> {
    fn drop(&mut self) {
        if self.ptr.is_not_null() {
            unsafe {
                (*self.ptr).count -= 1;
                if (*self.ptr).count == 0 {
                    let _ : ~Box<T> = cast::transmute(self.ptr);
                    self.ptr = ptr::mut_null();
                }
            }
        }
    }
}

impl <T> Clone for Rc<T> {
    fn clone(&self) -> Rc<T> {
        unsafe { (*self.ptr).count += 1; }
        Rc { ptr : self.ptr }
    }
}

The Drop trait specifies a destructor or finalizer for a Rust type. In this case, the Rc being removed results in a decrement of the count in the Box. Then, if the count is zero, the box is converted back into an owned value, which is immediately destroyed. Because the type T has been tracked all through the Rc code, this ensures that any finalization of the T value happens.

A "bad interaction" between Drop and @-pointers forces the #[unsafe_destructor] annotation; otherwise Rust rejects a Drop implementation for a parameterized type.

The Clone trait copies the Rc value by creating a new Rc value pointing to the same box, in which the count is incremented.

And that is all she wrote, as far as simple reference counters go.

Rust includes a reference counting memory management module in its standard library. That module is roughly the basis of this one, but mine is considerably simplified.

Mutants! Mutants everywhere!

In order to play with this and other memory managers, I need a mutator. Theoretically, I could write a number of real, semi-useful programs that use the managers and try them in a live situation. But I'm not going to because I'm lazy. Instead, I am going to use a quick and dirty mutator harness to exercise the manager, to see if it fails spectacularly.

The first requirement is for the mutator simulator is an implementation of the four operations.

enum Op { Create, Delete, Link, Unlink, }

pub trait Store {
    fn perform_create(&mut self);            // Create
    fn perform_delete(&mut self);            // Delete
    fn perform_link(&mut self);              // Link
    fn perform_unlink(&mut self) -> bool;    // Unlink
}

An object implementing the trait Store provides methods (using a heap) to create an object, delete an object, and link and unlink objects. Because the object chosen for the operation may not have links, perform_unlink returns a bool indicating whether it did anything. All of the other operations either succeed or cause the mutator simulator to fail.

The mutator

A Mutator itself contains an implementation of Store, a random-number generation framework, and a set of operation counts.

pub struct Mutator<'s,S> {
    store   : &'s mut S,
    range   : Range<f32>,
    rng     : TaskRng,
    c_count : uint,
    d_count : uint,
    l_count : uint,
    u_count : uint,
    o_count : uint,
}

impl<S> ToStr for Mutator<S> {
    fn to_str(&self) -> ~str {
        format!("[ creates: {}, deletes: {}, links: {}, unlinks: {}, ops: {} ]",
                self.c_count, self.d_count, self.l_count, self.u_count, self.o_count)
    }
}

The only output of the test program is the number of operations that have been performed. Any actual value will be provided by valgrind, an instrumentation framework that can identify a plethora of memory management bugs.

Creating a new Mutator is trivial, as expected.

impl<'s, S:Store> Mutator<'s,S> {
    pub fn new(s:&'s mut S) -> Mutator<'s,S> {
        Mutator {
            store   : s,
            range   : Range::new(0f32, 1f32),
            rng     : rand::task_rng(),
            c_count : 0,
            d_count : 0,
            l_count : 0,
            u_count : 0,
            o_count : 0,
        }
    }

Once the mutator has been configured, it can repeatedly choose and execute an operation.

    pub fn perform_operation(&mut self) {
        match self.choose() {
            Create => { self.store.perform_create();     self.c_count += 1; },
            Delete => { self.store.perform_delete();     self.d_count += 1; },
            Link   => { self.store.perform_link();       self.l_count += 1; },
            Unlink => { if self.store.perform_unlink() { self.u_count += 1; } },
        }
        self.o_count += 1;
    }

The mutator chooses the operation to perform randomly; in this code with a 40% of creating a new object, 30% chance of deleting an object, 20% chance of linking two objects, and 10% chance of unlinking them.

    fn choose(&mut self) -> Op {
        let x = self.range.ind_sample( &mut self.rng );
        if      x < 0.4f32 { Create }
        else if x < 0.7f32 { Delete }
        else if x < 0.9f32 { Link }
        else               { Unlink }
    }
}

The store

The final element of the mutator is a simulation of a heap of memory: a collection of cells of type T. For this heap simulation, there are two basic operations, peek, which returns the contents of a cell in the heap, actually removing it from the heap, and poke, which stores an object into a cell in the heap. In order to mimic the memory access behavior of a real program, for loose definitions of "mimic", a random cell is fetched for each call to peek.

trait RandomHeap<T> {
    fn peek(&mut self) -> T;
    fn poke(&mut self, t : T);
}

impl<T> RandomHeap<T> for ~[T] {
    fn peek(&mut self) -> T {
        let i = rand::task_rng().gen_range(0, self.len());
        return self.swap_remove(i)
    }
    fn poke(&mut self, t : T) { self.push(t); }
}

(In peek, rand::task_rng() returns a task-local random number generator.)

The test program

The representation of a memory cell capable of being linked and unlinked, using my simple Rc, is as Rc<~[...]>, a box containing an array of something. The something is ultimately going to be more references, so to prevent an infinite type-level recursion, I introduce a Cell type:

enum Cell { C( Rc<~[Cell]> ), }

Cell and its constructor, C, do not do anything other than breaking up Rc<~[Rc<...>]>.

As a next step, I implemented Store for a vector of Rc cells, making use of the random heap operations.

impl Store for ~[Rc<~[Cell]>] {
    fn perform_create(&mut self) { self.poke( Rc::new( ~[] ) ); }

    fn perform_delete(&mut self) { self.peek(); }

    fn perform_link(&mut self) {
        let mut l : Rc<~[Cell]> = self.peek();
        let r     : Rc<~[Cell]> = self.peek();
        l.borrow_mut().push( C(r.clone()) );
        self.poke( l );
        self.poke( r );
    }

    fn perform_unlink(&mut self) -> bool {
        let mut res = false;
        let mut l = self.peek();
        {
            let contents = l.borrow_mut();
            if contents.len() > 0 {
                match contents.pop() {
                    C(r) => { self.poke(r); res = true; },
                }
            }
        }
        self.poke(l);
        res
    }
}

And finally, I create the Store, the Mutator, introduce an initial population of 100 cells, and set the mutator off to work.

fn main() {
    let mut s : ~[Rc<~[Cell]>] = ~[];
    let mut m = Mutator::new(&mut s);
    for _ in range(0, 100) { m.store.perform_create(); }
    for _ in range(0, 1000000) { m.perform_operation(); }
    println(m.to_str());
}

Results

If the mutator simulator is run without the ability to link objects together, it allocates and frees a fair amount of memory without losing track of any.

[ creates: 398879, deletes: 301084, links: 200009, unlinks: 0, ops: 1000000 ]
==20013== 
==20013== HEAP SUMMARY:
==20013==     in use at exit: 40 bytes in 1 blocks
==20013==   total heap usage: 800,686 allocs, 800,685 frees, 45,327,266 bytes allocated
==20013== 
==20013== LEAK SUMMARY:
==20013==    definitely lost: 0 bytes in 0 blocks
==20013==    indirectly lost: 0 bytes in 0 blocks
==20013==      possibly lost: 0 bytes in 0 blocks
==20013==    still reachable: 40 bytes in 1 blocks
==20013==         suppressed: 0 bytes in 0 blocks

The mutator made approximately 40000 create operations, approximately 30000 deletes, and 20000 attempts to link objects (I emasculated the operation itself, not its invocation by the simulator). The remaining ten thousand operations were attempts to unlink objects, which failed.

The remainder of the output is courtesy of valgrind, and indicates that only one of the 800,686 objects allocated was not freed before the program ended. (That remaining 40 byte object is in the Rust runtime somewhere).

If, however, the simulator has the ability to link objects, valgrind tells a different story.

[ creates: 399155, deletes: 300946, links: 199940, unlinks: 31003, ops: 1000000 ]
==19863== 
==19863== HEAP SUMMARY:
==19863==     in use at exit: 124,616 bytes in 2,547 blocks
==19863==   total heap usage: 803,544 allocs, 800,997 frees, 45,679,158 bytes allocated
==19863== 
==19863== LEAK SUMMARY:
==19863==    definitely lost: 240 bytes in 3 blocks
==19863==    indirectly lost: 124,336 bytes in 2,543 blocks
==19863==      possibly lost: 0 bytes in 0 blocks
==19863==    still reachable: 40 bytes in 1 blocks
==19863==         suppressed: 0 bytes in 0 blocks

Here, the program has lost 2,546 objects, amounting to about 124kb of memory; not a lot on this run, but enough.