Parsing with derivatives: compaction

Posted on May 22, 2012 by Tommy McGuire

Labels: data structure, java, programming language

Recently, I have been spending some spare time implementing a parser system for context-free languages based on Matt Might's "Yacc is Dead". Derivative parsing make for a uniquely interesting problem: the ideas are actually conceptually simple while at the same time a bit algorithmically hairy and, strangely, not at all tedious.

Practical regular expression derivatives started the whole mess, presenting yet another implementation of regular expressions; this time inspired by Matt Might and based on the derivative of a regular expression:
The derivative of a regular expression with respect to a character is a new regular expression that matches the tails of all strings beginning with that character and which match the original regular expression [; it] computes a new regular expression that matches what the original expression would match, assuming it had just matched the character.
Parsing with derivatives: introduction provided the basic background for Might's extension from regular expressions to context-free grammars and also showed basic Java code for the basic, skeletal parsers.
Parsing with derivatives: recursion added recursion to the skeletal parsers, enabling them to work for context-free grammars. There are three tricks to handling recursive parsers:
- Lazy evaluation: delaying the evaluation of an expression until the result is actually required (and ensuring that the evaluation is only performed once).
- Computing fixed points of recursive functions.
- Memoizing function results, trading memory for computation.
Work, that is, if you have lots of memory, since the graph structures that the parsers build during parsing have many, many nodes.

This time, I am going to reduce the size of the graph structures built during parsing, using several techniques including what Might terms a compaction step between each derivation.

In the last post, I examined the grammar \(S\ :=\ \epsilon\ |\ S\ \cdot\ '1'\), which represents a possibly empty series of 1's. I found that parsing "111" produced a graph that has 12 nodes, the parse of "11111" has 20, ten 1's have 177 nodes, and twenty 1's have 5395, all before the deriveNull step (which is going to generate more nodes as it forces delayed derivations). Clearly, to be useful, those numbers are going to have to improve.

Step 1

The first and simplest idea is to notice that, for some parser classes, all of the instances are indistinguishable from one another. For example, the code presented so far creates new instances of Empty, although those instances do not have any interesting attributes. All of those instances can be replaced by one singleton, by creating a single instance:


  public static final Empty empty = new Empty();

and making the Empty constructor private while changing all of the uses of new Empty() to Empty.empty.

The same technique works for the basic uses of Epsilon, as seen in the definitions of parsers rather than the derivative computations.


  // singleton
  private static Parser emptyString = new Epsilon();

  // private constructor
  private Epsilon() { trees.add(""); }

(The emptyString value includes an empty string to ensure that when deriveNull is called on it, it returns something rather than Empty's empty set.)

You will notice that in this case, I made the singleton private. To provide access to the emptyString value, I added a static, smarter constructor method; a technique that will be used much more shortly.


  public static Parser epsilon() { return emptyString; }

Step 2

I want to say just two words to you. Just two words. Are you listening? Smarter constructors.


  public static Parser alternative(Parser l1, Parser l2)
  {
    if      (l1 == Empty.empty) { return l2; }
    else if (l2 == Empty.empty) { return l1; }
    else                        { return new Alternative(l1,l2); }
  }

The Alternative factory method now checks to see if either of the alternatives is Empty; if so then the alternative is equivalent to the other parser and that can be used rather than creating a new Alternative.

The Concat factory method is similar, except that if eitherof the parsers is Empty, the result is equivalent to the empty parser.


  public static Parser concat(Parser l1, Parser l2)
  {
    if (l1 == Empty.empty || l2 == Empty.empty) { return Empty.empty; }
    else                                        { return new Concat(l1,l2); }
  }

Step 3

Conceptually, compaction itself is a fairly simple process. The code walks over the graph generated by the parsers after every derivation step, simplifying the structure by replacing some nodes with other, simpler nodes. In practice, though, there are a lot of moving parts.

From the outside in, compaction is invoked by


    p = p.derive(ch).compact();

when the character ch is seen. The compact method is a wrapper for another method, compact(Set<Parser> seen), which is overridden by parsers that require more complex behavior. Typically, that means parsers that have references to other parsers. The parameter seen provides a way to ensure that the recursion only touches each instance once. (The same technique is used by the infrastructure for creating dot graphs of the parser nodes.) The base implementation is in the Parser superclass:


  public Parser compact()
  {
    return this.compact(new HashSet<Parser>());
  }
  
  public Parser compact(Set<Parser> seen)
  {
    if (! seen.contains(this))
    {
      seen.add(this);
    }
    return this;
  }

This base implementation just records seeing the instance and returns. It is used by most of the parser classes that do not have references to other parsers, such as Empty and Epsilon.

Recurrence

Probably the simplest overriding implementation of compact is from Recurrence, the parser class used to "tie the knot" when building recursive parsers.


  @Override
  public Parser compact(Set seen)
  {
    return l.compact(seen);
  }

Since the recurrence does not do anything, it is safe to throw away the node and return the parser that is referenced; other parsers that pointed to the Recurrence will now point to the value that was set in the recurrence (or specifically it's compaction).

Delay

The next-simplest implementation comes from Delay, the inner class of Fix that handles laziness when calling derive and deriveNull.


  @Override
  public Parser compact(Set seen)
  {
    force();
    return derivative.compact(seen);
  }

In the case of Delay, a single step of forcing a derivation can be performed and the Delay node replaced by the new derivative (compacted itself, of course). Removing the Delay nodes is one of the more effective compactions; not only does this discard all of the Delay nodes created by a derive invocation but it also may free up the parser from which the derivative was created.

Alternative

The compaction from Alternative is somewhat similar to the alternative parser factory method.


  @Override
  public Parser compact(Set seen)
  {
    if (! seen.contains(this))
    {
      seen.add(this);
      l1 = l1.compact(seen);
      l2 = l2.compact(seen);
    }
    if      (l1 == Empty.empty) { return l2; }
    else if (l2 == Empty.empty) { return l1; }
    else                        { return this; }
  }

(It is also only the second real use of the seenparameter, as well. The previous few parsers discarded themselves from the graph, so seen was unnecessary.)

The method here first compacts the two alternative branches (if this node has not been visited before) and then uses the same logic as the factory method to potentially eliminate the Alternative node if one or the other of the branches is Empty.

Now, I could have implemented these parsers functionally, by creating a new Alternative here rather than updating the l1 and l2 members, for example. But I didn't. Maybe next time.

Reduce

The compaction for Reduce parsers starts out similarly, by compacting the parser that the Reduce refers to. Further, if the parser is empty, the compaction is also Empty. However, this method also handles the situation of a Reduce parser immediately referencing another Reduce parser.


  @Override
  public Parser compact(Set seen)
  {
    if (! seen.contains(this))
    {
      seen.add(this);
      parser = parser.compact(seen);
    }
    if (parser == Empty.empty)
    {
      return Empty.empty;
    }
    else if (parser instanceof Reduce)
    {
      final Reduce subReduction = (Reduce) parser;
      final Reduction inner = subReduction.reduction;
      final Reduction outer = this.reduction;
      final Reduction combination = new Reduction<Object,Object>()
      {
        @Override
        public Object reduce(Object t)
        {
          return outer.reduce( inner.reduce(t) );
        }
      };
      return new Reduce(subReduction.parser,combination);
    }
    else
    {
      return this;
    }
  }

This Reduce instance (let's call it R1) is pointing to another Reduce instance (R2, say). In that case, the two Reduce nodes can be combined into one, by creating a new Reduce node composing the two reduction functions. The reduction for the new Reduce node is created by pulling out the reduction from R2 (called inner) and arranging for it to be called first in the new reduction, followed by the reduction from R1 (called outer) which is invoked on the result of inner. Java doesn't really make this process easy.

Concat

The compaction for Concat is similar in that it may simplify the graph by introducing reductions in some cases.


  @Override
  public Parser compact(Set seen)
  {
    if (! seen.contains(this))
    {
      seen.add(this);
      l1 = l1.compact(seen);
      l2 = l2.compact(seen);
    }
    if (l1 == Empty.empty || l2 == Empty.empty)
    {
      return Empty.empty;
    }
    else if (l1 instanceof Epsilon && ((Epsilon) l1).size() == 1)
    {
      final Set trees = l1.deriveNull();
      final Object o = trees.toArray()[0];
      final Reduction r =
        new Reduction<Object,Object>()
        {
          @Override public Object reduce(Object t) { return new Pair(o,t); }
        };
      return new Reduce(l2, r);
    }
    else if (l2 instanceof Epsilon && ((Epsilon) l2).size() == 1)
    {
      final Set trees = l2.deriveNull();
      final Object o = trees.toArray()[0];
      final Reduction r =
        new Reduction<Object,Object>()
        {
          @Override public Object reduce(Object t) { return new Pair(t,o); }
        };
      return new Reduce(l1, r);
    }
    else
    {
      return this;
    }
  }

Specifically, if l1 (the left, first sub-parser of the Concat) is an Epsilon containing a single parse tree (more about that later), the concatenation and the l1 subtree is replaced by a Reduce node that stores the parse tree from l1. The reduction uses it to produce Pairs with parse trees from the right branch, l2, in the same way the Concat does when deriveNull is called.

Step 4

The most productive compaction, however, is that for Delta, the pseudo-Empty parser which stores parse trees. Recall that Delta nodes are created when deriving Concat parsers, as the alternative where the left, first parser can be empty; it records the parse trees which are created by the left parser in order to reach its empty state.

The compaction method for Delta is trivial:


  @Override
  public Parser compact(Set seen)
  {
    return Epsilon.epsilon(l.deriveNull());
  }

The Delta nodes retain a reference to the parser, l. In compact, this parser is discarded and the Delta is replaced by an Epsilon instance containing the parse trees returned by l.deriveNull(). Epsilon, which stores parse trees, is extended to include a factory method and constructor that supports a set of trees in addition to the empty string and to single characters.


  public static Parser epsilon(Set trees)
  {
    if      (trees.isEmpty())                         { return Empty.empty; }
    else if (trees.size() == 1 && trees.contains("")) { return emptyString; }
    else                                              { return new Epsilon(trees); }
  }

  private Epsilon(Set trees)
  {
    this.trees.addAll(trees);
  }

Results

I mentioned earlier that the grammar \(S\ :=\ \epsilon\ |\ S\ \cdot\ '1'\), which represents a possibly empty series of 1's, produced a graph that has 12 nodes when parsing "111", 20 parsing "11111", 177 parsing ten 1's, and 5395 parsing twenty 1's.

Now, the parser for \(S\) has constant overhead, producing graphs after compaction that always have 6 nodes. (There is extra storage to record the '1' seen held in the reductions, but the structure of the parser itself is constant.)

Additionally, parsing the expression "1+1*1" with the simple expression parser produces a graph with 18 nodes and still produces the dual parse trees indicating the ambiguity of operator precedence.

These results may not be perfect. I also have not implemented all of the compaction options mentioned by Might, et al., although I have come up with some that they did not need. However, at this point I think that the behavior of the system is acceptable for further work and that larger, more complex grammars are going to be needed to validate any further optimizations. So I believe the next step is to improve the usability of the parser system.

If you want to take a look at the code, check github.