"(How to write a (Lisp) interpreter (in C++) (or make Peter Norvig cry, whichever comes first))"

Posted on October 11, 2010 by Tommy McGuire

Labels: c++, toy problems, lisp, programming language

[11/30/2010] Anthony C. Hay's Lisp interpreter in [more than] 90 lines of C++ is very sweet, and much more functional than mine. Check it out!

Last week, I was pointed (by reddit) to Peter Norvig's (How to Write a (Lisp) Interpreter (in Python)) and the follow-on, (An ((Even Better) Lisp) Interpreter (in Python)), and for no readily discernible reason was compelled to attempt a translation. Naturally, I chose C++, a language I have less than a year's experience with, which period was more than a couple of years ago, and which I do not especially like. So, in a spirit similar to that needed to properly enjoy tucking a live weasel into my trousers, I set to work.

What follows is the important part of the result, the eval function. Value::p and Environment::p are both pointers, to a Value and an Environment respectively. (The necessity of using the weird ::p idiom will be explained shortly, although you might consider the random dynamic_pointer_cast's to be a clue.)

Value::p eval(Value::p x, Environment::p env) {

The first branch handles symbols, the value of which is found in the environment.

if (x->type() == Value::SYMBOL) {                            // Look up the value
      Symbol::p x1 = dynamic_pointer_cast(x);
      return env->find(x1)[x1];

The second branch handles the other scalar values, which in this interpreter are integers and strings. These values are self-evaluating.

} else if (x->type() != Value::LIST) {                       // Everything else scalar is self-evaluating
      return x;

The remaining values are all lists of various flavors. To make it easier to manipulate the values, I broke out repr, which is the STL list representing the list, symbol, which is the symbol at the head of the list, it, which is a STL iterator pointing to the second element of the list, and end, which is a STL iterator pointing to the end of the list.

} else {
      List::p l = dynamic_pointer_cast<List>(x);                 // This is where it gets exciting
      list<Value::p> repr = l->to_list();
      if (repr.size() > 0 && repr.front()->type() == Value::SYMBOL) {
        Symbol::p symbol = dynamic_pointer_cast<Symbol>(repr.front());
        list<Value::p>::iterator it = ++(repr.begin());          // The second element, needed by most branches
        list<Value::p>::iterator end = repr.end();               // The end of the list

The first of the special forms is quote, which returns the remainder of the list.

if ((symbol->to_string()) == "quote") {                  // quote: Return the rest of the list
          return List::create(++it, end);

The second special form is if, which evaluates the next element and tests its boolean value; if true the result is the evaluation of third element, if false, the fourth element. The element chosen is handled by the test and possible extra increment.

} else if ((symbol->to_string()) == "if") {              // if: Yeah, like this is going to work
          if (!(*eval(*it++, env))) { ++it; }
          return eval(*it, env);

The third special form is set!, which updates a new binding in the environment. Environment handling is one thing I stole directly from Norvig; find() returns a map, and the ...[var] = val assignment is a normal operator on that map. Very nice.

} else if ((symbol->to_string()) == "set!") {            // set!: Life does not get worse if I remove this
          Symbol::p var = dynamic_pointer_cast<Symbol>( eval(*it++, env) );
          Value::p val = eval(*it++, env);
          env->find(var)[var] = val;
          return val;

The fourth special form is define, which adds a new binding to the environment. The second element of the list must be a symbol, which will not be evaluated, and the third will be the new value, which will be evaluated.

} else if ((symbol->to_string()) == "define") {          // define: Hey, Mickey! You're so defined...
          Symbol::p var = dynamic_pointer_cast<Symbol>(*it++);   // Yeah, that's not even funny to me
          Value::p val = eval(*it++, env);
          env->update(var, val);
          return val;

The fifth special form is lambda, which is used to create a new function. The second element of the list is a list of argument identifiers, while the third is the body of the function. You will take note that I called out the variables pars and body; if you put both values in the call to Lambda::create, you have to look up the definition of sequence points.

} else if ((symbol->to_string()) == "lambda") {          // lambda: first element is the parameters, next is the body
          List::p pars = dynamic_pointer_cast<List>(*it++);
          List::p body = dynamic_pointer_cast<List>(*it);
          return Lambda::create(pars, body);

The sixth and final special form is begin, which simply evaluates the remaining values of the list and returns the final value.

} else if ((symbol->to_string()) == "begin") {           // begin: evaluate 'em all and let McCarthy sort 'em out
          Value::p value;
          for (; it != end; ++it) { value = eval(*it, env); }
          return value;

And the final branch is function application. The first step is to evaluate all of the values in the list. The first element should be a Lambda value, which is called with the arguments and the environment.

} else {                                                 // application: this is where it gets really exciting
          list::iterator it = repr.begin();            // The beginning of the list
          list evaluated;
          for (; it != end; ++it) { evaluated.push_back( eval(*it, env) ); }
          {
            Lambda::p lambda = dynamic_pointer_cast(evaluated.front());
            List::p args = List::create(++(evaluated.begin()), evaluated.end());
            return (*lambda)(args, env);
          }
        }
      }
    }
  }

For those of you who are long-term Lisp users, the collection of closing braces should make you feel all warm inside, right?

Unlike Norvig's version, which effectively compiles to Python by using Python's lambda, in C++ I had to implement lambda and function application myself. The Lambda class itself maintains a list of parameter names and the body expression. Function application is handled by operator(), in other words by overloading the function call for Lambda values. The operator() method recursively calls eval on the body expression, after creating a new environment mapping the parameter names to argument values.

class Lambda : public Value {
  public:
    typedef shared_ptr<Lambda> p;
    static p create(List::p params, List::p exp) { return p(new Lambda(params,exp)); }
    Value::t type() { return LAMBDA; }
    operator bool() { return true; }
    string to_string() { return "LAMBDA"; }

    virtual Value::p operator()(List::p args, Environment::p outer) {
      return eval(_e, create_env(outer, _p->to_list(), args->to_list()));
    }

  private:
    List::p _p;
    List::p _e;
  protected:
    Lambda(List::p p, List::p e) : _p(p), _e(e) { }

    Environment::p create_env(Environment::p outer, list<Value::p> ¶ms, list<Value::p> &args) {
      list<Value::p>::iterator p_begin = params.begin();
      list<Value::p>::iterator p_end = params.end();
      list<Value::p>::iterator a_begin = args.begin();
      list<Value::p>::iterator a_end = args.end();
      Environment::p env = Environment::create(outer);
      for (; p_begin != p_end && a_begin != a_end; p_begin++, a_begin++) {
        Symbol::p symb = dynamic_pointer_cast<Symbol>(*p_begin);
        env->update(symb, *a_begin);
      }
      return env;
    }
  };

One (among many) downsides of implementing lambdas manually is that it also requires implementing all of the builtin functions.

struct Builtin : public Lambda {
    virtual Value::p operator()(List::p args, Environment::p outer) = 0;
    Builtin(List::p p) : Lambda(p,List::p()) { }
  };

As an example of a builtin function, Plus implements integer addition. Verbose, but tedious.

struct Plus : public Builtin {
    Plus(List::p p) : Builtin(p) { }
    Value::p operator()(List::p args, Environment::p outer) {
      list<Value::p>::iterator end = args->to_list().end();
      long sum = 0;
      for (list<Value::p>::iterator i = args->to_list().begin(); i != end; ++i) {
        sum += dynamic_pointer_cast<Int>(*i)->to_int();
      }
      return Int::p(new Int(sum));
    }
  };

Amazingly, the Environment is not completely hideous. It is fundamentally a loose wrapper around the C++ STL map, using Norvig's tactic of returning the map from find().

class Environment {
  public:
    typedef shared_ptr<Environment> p;
    typedef map<Symbol::p,Value::p,Symbol::less> env_t;
   
    static p create() { return p(new Environment()); }
    static p create(p outer) { return p(new Environment(outer)); }

    virtual void update(Symbol::p s, Value::p v) { _e.insert( make_pair(s,v) ); }
    virtual env_t& find(Symbol::p s) { return _e.count(s) > 0 ? _e : _outer->find(s); }

  private:
    env_t _e;
    p _outer;

    Environment() { }
    Environment(p outer) : _outer(outer) { }
  };

Lisp parsing is actually relatively simple, although C++'s string handling is somewhat lacking. To handle it similarly to Norvig, I had to implement replace to provide some convenience.

void replace(string &s, string from, string to) {
    int idx = 0;
    int next;
    while ((next = s.find(from, idx)) != string::npos) {
      s.replace(next, from.size(), to);
      idx = next + to.size(); 
    }
  }

Fortunately, C++ lets me faff about with string I/O to break the input string into a list of string tokens.

list tokenize(string s) {
    replace(s, "(", " ( ");
    replace(s, ")", " ) ");
    istringstream iss(s);
    list result;
    result.insert(result.end(), istream_iterator<string>(iss), istream_iterator<string>());
    return result;
  }

Not to mention using string I/O to parse integers.

Value::p atom(string token) {
    istringstream oss(token);
    long l;
    if (oss >> l) { return Int::p(new Int(l)); } else { return Symbol::p(new Symbol(token)); }
  }

Fortunately, I did not bother to do much error handling anywhere in the code, so the missing bits of read_from are (ahem) unexceptional.

Value::p read_from(list<string> &tokens) {
    if (tokens.size() <= 0) { /* error */ }
    string token = tokens.front();
    tokens.pop_front();
    if (token == "(") {
      list<Value::p> lst;
      while (tokens.front() != ")") {
        lst.push_back( read_from(tokens) );
      }
      tokens.pop_front();
      return List::create(lst);
    } else if (token == ")") { /* error */
    } else {
      return atom(token);
    }
  }

Anyway, to get back to a discussion I delayed at the top of this post, I avoided memory management almost entirely by using TR1 shared_ptr<T> templates, which almost trivially handle reference counting. To avoid some template syntax nastiness, I adopted an idiom of providing a typedef of "p" in a class C for the type shared_ptr<C>. Now, I really like type-safe collections, and I am duly impressed (and terrified) by some of the bizarre abuses that C++ templates are put to. But shared_ptr's are pretty darn sweet. The only downside is reference counting, which has problems with circular structures. Fortunately, I tend not to create circular structures and could probably deal with them in this Lisp by removing set!.

Amazingly, the interpreter does in fact seem to work; the main function below (sorry, no repl as I used up my I/O quota above)...

int
main(int argc, char *argv[]) {
  Lispic::Environment::p global = Lispic::global_env();
  std::list tokens = Lispic::tokenize(
    "(begin"
    "  (print (+ 2 3))"
    "  (define double (lambda (x) (+ x x)))"
    "  (print (+ 3 3) (double 4))"
    ")"
  );
  std::cout << Lispic::eval(Lispic::read_from(tokens), global)->to_string() << std::endl;
  return 0;
}

...produces the results:

$ ./lispic
5
6
8
8

All this in about 300 lines of code. For greater amusement, you can get the source code for this monstrosity.

If you are still reading, thank you very much for your attention. If you were looking for more about the CAP theorem, bear with me. I think I have some ideas on how to actually formalize it using either Unity (I took a class from Jay Misra about 15 years ago) or Leslie Lamport's TLA+ (I have a Microsoft Channel 9 video). I suspect that it is going to involve temporal logic. Duck and cover!

[Edit] I just remembered Paul Wilson's incomplete An Introduction to Scheme and its Implementation, which is fortunately still available. At least, I think it was written by Paul Wilson, of the garbage collection reading collection.