Parsing with derivatives: introduction

Posted on April 16, 2012 by Tommy McGuire

Labels: data structure, java, programming language

A few posts back, I waxed enthusiastic about Matt Might's derivative-based regular expression engine. However, those regular expressions are nowhere near the exciting part. Thanks to Matt, David Darais, and Daniel Spiewak, parsing context-free languages is not only possible for simpletons like myself, but both easy and flexible. However, it wasn't quite easy enough and is taking some time to bang my head against the necessary times to soften and penetrate. This post here is the first step, presenting the basic, simplified skeleton of a derivative-based context-free parser.

Background

But first, I want to present some background and definitions for those who have not been beaten about the head and shoulders with one or more automata theory classes. These terms make discussing parsing and the whole derivative thing much easier. However, if you have a well-thumbed copy of the dragon book glinting evilly close at hand, feel free to skip further down.

A language is a (finite or infinite) collection of strings. Such a collection of strings might be {"ab"} (which consists of one string), {""} (which also consists of one string, the empty string), or something like {"","ab","abab",...} (which is an infinite collection of zero or more repeats of "ab").

A grammar is a description of a language based on a mathematical formalism which is both simpler and shorter than a listing of strings, especially for very large or infinite languages. A grammar can be used in two ways, either to generate the strings of the language, or to recognize whether a given string is a member of the language or not.

Assorted nut-jobs (uh, mathematicians) have identified different classes of languages, based on the mathematical formalisms used to describe them. One class is the ever-present regular languages, which can be described by regular expressions. (Yes, most regular expression libraries include things like back-references that make them strictly more powerful than regular languages. I'm not talking about them right now.) Speaking of regular expressions as a formalism, they describe regular languages. Further, regular languages can be recognized by the automata-theoretic finite state machines.

A strictly larger class of languages is called context-free. These languages are called context-free because they can be described by grammar rules like: \[ S := \epsilon\ |\ 1\, \cdot\, S\, \cdot\, 2 \]

This context-free grammar consists of one rule; the rule says that the non-terminal symbol \(S\) can be defined by two options. The first option is \(\epsilon\), the empty string; the second is the terminal symbol 1 followed by some expansion of \(S\) itself, followed by the terminal symbol 2. The "context-free" part is the left-hand side of the rules, which must consist of one non-terminal symbol. A context-sensitive grammar would be one that allowed multiple symbols on the left-hand side. But I'm not going to write about those. The capability that separates context-free languages from regular languages is the ability to write recursive rules such as these. Regular expressions cannot be recursive; the only way to create an infinite set of strings uses the Kleene star. The stack automata handle context-free languages. (If you're curious, Turing machines are needed for context-sensitive languages.)

With all that being said, a parser is a piece of software that recognizes whether a string follows the rules of a grammar, i.e. determines whether or not the string is a member of the language. Further, the parser recovers the structure of the string in some sense. For example, the parser may build a parse tree that indicates which rules of the grammar were used to build the string.

The key here is to know that a language is a set of strings, a grammar is a set of rules that describe the strings in a language, and a parser is a chunk of code that embodies a grammar, recognizes the strings of a language and recovers some kind of structure from them. (If you read a previous post, the derivative-based regular expression engine described there is a recognizer, not a parser.) However, all three are equivalent in the sense that the grammar and the parser describe the same language.

So, what is this whole derivative thing? As defined by Brzozowski, the derivative of a language with respect to an input character c is the set of strings in that original language that start with that character c, with that initial c stripped from the beginning of each string. For example, the derivative of {"ab","ba"} with respect to 'a' is {"b"}; the derivative of {"","ab","abab",...} with respect to 'a' is {"b", "bab", "babab",...}. Notice that in the first case, the derivative of {"b"} with respect to 'b' would be {""}, or \(\epsilon\), the language consisting of the empty string. If the end of the input has been reached at this point, the input string is "ab", which is a member of the language and a recognizer would succeed.

The derivative of a grammar with respect to a character is another grammar that describes the set of strings in, the original grammar's language that start with the character, with the character stripped off.

The basic idea of parsing using derivatives is this: as each terminal symbol is read from the input, it is used to compute a new parser that would recognize the prefix seen so far if that is possible and which saves any information necessary from the prefix to recover whatever structure is desired from the string. When the last terminal symbol has been eaten, the current parser is examined to determine if it would accept the empty string; if it would not then there is an error, if it would then the recovered structure can be acquired from the parser. Simple, eh?

For the remainder of this discussion, I am going to assume that the recovered structure is a parse tree. It could easily be something else. For example, the traditional example is hoc, a four-function calculator, and the traditional method is to evaluate each expression as soon as it is parsed. In that case, the recovered structure is the resulting value. However, returning a parse tree that is then evaluated is equivalent but only somewhat less efficient than the traditional scheme.

The basic parser interface

From the discussion above, the basic interface for a Parser is two methods:

Derive a new from a parser with respect to an input character.
Test the parser at the end of the input and recover the parse tree

The Java abstract class for Parsers is defined below. Following Matt Might, the first operation is called derive and the second is deriveNull. Since these parsers are intended to handle context-free languages, the parse may be ambiguous; in other words, there may be more than one way to generate the input string from the rules of the grammar. as a result deriveNull returns a Set of parse trees. Also, because describing (and debugging) these parsers is a bit complex, I am punting on the generic typing of the Set (and the rest of the parsing code) for the moment. I have also simplified the code by only parsing characters, not some general terminal symbol, although that is perfectly possible, too.


public abstract class Parser
{
  abstract public Parser derive(char ch);
  abstract public Set    deriveNull();
}

There are three basic parsers: Empty, Epsilon, and Literal. Empty represents a parser of the empty language, one which accepts no strings. Epsilon is the parser for the language that consists only of the empty string. Literal is the parser that accepts a single character, passed as an argument to the constructor.


public static class Empty extends Parser
{
  public Empty() { }

  @Override public Parser derive(char ch) { return new Empty(); }
  @Override public Set    deriveNull()    { return new HashSet(); }
}

Deriving the Empty parser returns another Empty parser. Calling deriveNull on Empty results in an empty set.


public static class Epsilon extends Parser
{
  private Set trees = new HashSet();

  public Epsilon()        { trees.add(""); }
  public Epsilon(char ch) { trees.add(ch); }

  @Override public Parser derive(char ch) { return new Empty(); }
  @Override public Set    deriveNull()    { return trees; }
}

Attempting to derive the Epsilon parser results in the Empty parser, but deriveNull returns a singleton set of a basic parse tree. The set is defined by the constructor of the Epsilon instance. If the Epsilon is created while building the initial parser, the first constructor should be used, resulting in a parse tree set that contains the empty string. However, if the Epsilon instance is created by the Literal parser below, the parse tree contains a string holding the single character that was seen when the Epsilon was derived from the Literal parser. These instances of Epsilon are responsible for recording the positive parses.


public static class Literal extends Parser
{
  private final char ch;
  
  public Literal(char ch) { this.ch = ch; }

  @Override public Parser derive(char ch) { return this.ch == ch ? new Epsilon(ch) : new Empty(); }
  @Override public Set    deriveNull()    { return new HashSet(); }
}

When deriving the Literal parser, if the incoming character is the same as the Literal's value, the result is the version of Epsilon that records the character that has been seen. Otherwise, the derivative is the Empty parser. On the other hand, since a literal character is not the empty string, deriveNull returns the empty set.

In addition to those three, I need a couple of Parser classes to combine the base instances.


public static class Concat extends Parser
{
  private Parser l1;
  private Parser l2;
  
  public Concat(Parser l1, Parser l2)
  {
    this.l1 = l1;
    this.l2 = l2;
  }

  @Override
  public Parser derive(char ch)
  {
    return new Alternative(
        new Concat( l1.derive(ch), l2 ),
        new Concat( new Delta(l1), l2.derive(ch) )
        );
  }

  @Override
  public Set deriveNull()
  {
    Set set1   = l1.deriveNull();
    Set set2   = l2.deriveNull();
    Set result = new HashSet();
    for (Object o1 : set1)
    {
      for (Object o2 : set2)
      {
        result.add( new Pair(o1,o2) );
      }
    }
    return result;
  }
}

Concat represents the concatenation of two parsers, l2 handling input after l1 has had its chance. The derive method returns a parser that is either the concatenation of the derivation of l1 with l2 or a Delta parser concatenated with the derivative of l2. Both Alternative and Delta are described below. The deriveNull method returns a set containing pairs made from the results of l1.deriveNull() and l2.deriveNull().

Oh, and the comment about the contents of the Java standard generics library forcing every program to grow a Pair? It's true. Pair is pretty much what you think it is.


public static class Alternative extends Parser
{
  private Parser l1;
  private Parser l2;
  
  public Alternative(Parser l1, Parser l2)
  {
    this.l1 = l1;
    this.l2 = l2;
  }

  @Override
  public Parser derive(char ch)
  {
    return new Alternative( l1.derive(ch), l2.derive(ch) );
  }

  @Override
  public Set deriveNull()
  {
    final Set set = new HashSet();
    set.addAll( l1.deriveNull() );
    set.addAll( l2.deriveNull() );
    return set;
  }
}

The Alternative parser, which represents either l1 or l2, is somewhat simpler than Concat; the result of derive is the Alternative combination of the derivatives of l1 and l2, and the result of deriveNull is the union of l1.deriveNull() and l2.deriveNull().

Now, there are a couple of additional, necessary Parser classes.


public static class Delta extends Parser
{
  private Parser l;
  
  public Delta(Parser l) { this.l = l; }
  
  @Override public Parser derive(char ch) { return new Empty(); }
  @Override public Set    deriveNull()    { return l.deriveNull(); }
}

If you look back at my regular expression derivative post, it defined two major functions, D_c and δ. The latter function was used to test whether the first element of a concatenation was empty. The Delta parser performs a similar function here, by effectively being an Epsilon. However, the Delta parser also allows the parse tree to be passed up by deriveNull when parsing is complete.

Examples

Suppose we have the following parser:


      Parser p = new Empty();

This parser represents the empty language, so any derivation will also be the Empty and deriveNull will return the empty set.


      Parser p = new Epsilon();

Epsilon is the language containing only the empty string, so any derivation returns Empty. Calling p.deriveNull(), however, returns the set containing the empty string.


      Parser p = new Literal('a');

Calling p.derive(ch) where ch is anything other than 'a' results in the Empty parser, while 'a' produces the Epsilon parser via the second constructor which stores the 'a'; subsequently calling deriveNull results in a set containing 'a'.


      Parser p = new Concat(new Literal('a'), new Literal('b'));

This parser accepts the string "ab". Calling p.derive('a') results in:

Calling derive('b') on the result produces:

This structure seems a bit hairy, but the presence of the Empty parsers ensures that the result of calling deriveNull produces { <'a','b'> }, a set containing a single pair. I'll address the hair in a subsequent post.

The Reduction parser

A parser that returns a tree made of Pair nodes and character terminals might be entertaining, but it is certainly less than optimal. Reduce parsers are the solution.


public static interface Reduction<T,U>
{
  public U reduce(T t);
}

A Reduction is a function that is applied to each of the elements of the set returned by the sub-parser's deriveNull method. Other than that change to deriveNull, the Reduce parser is almost a pass-through.


public static class Reduce extends Parser
{
  private Parser parser;
  private Reduction reduction;

  public Reduce(Parser parser, Reduction reduction)
  {
    this.parser    = parser;
    this.reduction = reduction;
  }

  @Override
  public Parser derive(char ch)
  {
    return new Reduce(parser.derive(ch),reduction);
  }

  @Override
  public Set deriveNull()
  {
    Set newSet = new HashSet();
    for (Object o : parser.deriveNull())
    {
      newSet.add( reduction.reduce(o) );
    }
    return newSet;
  }
}

Ok, now, how does this help? Suppose you have the following reduction that combines the elements of a Pair into a string:


Reduction reduction = new Reduction()
{
  @Override
  public Object reduce(Object t)
  {
    if (t instanceof Pair)
    {
      t = String.format("%s%s", ((Pair) t).first(), ((Pair) t).second());
    }
    return t;
  }
};

And the following parser:


Parser p = new Reduce(new Concat(new Literal('a'), new Literal('b')), reduction);

Then, the following invocation:


p.derive('a').derive('b').deriveNull()

Produces a set containing the string "ab". Converting a Pair to a string might be a simplistic example, but hopefully how the Reduce works to manage the structure of the resulting parse tree is clear.

Wrap-up

This post is getting long, but there is a lot of work that remains to be done. The code above shows the basic skeleton of a derivative-based parser, but it is nowhere near useful. It can parse simple, finite collections of strings (no Kleene star, so far, remember), although it is not terribly efficient in doing so. However, it cannot parse recursive grammars like my example above: \[ S := \epsilon\ |\ 1\, \cdot\, S\, \cdot\, 2 \]

Adding support for that will require significant changes to the skeleton I have developed so far, but I hope by presenting things serially it will be much more understandable. That is the goal of the next post.

Subsequent posts will try to present a "compaction" phase that will keep the mongo structures produced by the derive method under control, and then add enough features to make it something sort of like a usable parser toolkit. That'll be the big, friendly bow on it at the end.