Saturday, April 28, 2012

Parsing with derivatives: recursion

Previously, I posted a description of "the basic, simplified skeleton of a derivative-based context-free parser", one which included literals, alternatives, concatenations, and reductions. Importantly, it did not include recursive parsers and could therefore only parse fixed strings. This time, I am going to extend the code from that post to handle recursion.

To handle a recursive grammar, I need to add another class of Parser, one which allows code to use a parser before it is defined.

public class Recurrence extends Parser
{
    private Parser l;
    
    public void setParser(Parser l) { this.l = l; }
    
    @Override public Parser derive(char ch) { return l.derive(ch); }
    @Override public Set deriveNull() { return l.deriveNull(); }
}

The Recurrence parser has a reference to another parser which can be updated after the Recurrence is created. Other than that, the operations of the Parser, derive and deriveNull, simply call the referenced parser.

Code using a Recurrence parser creates the recurrence, then uses it to build the grammar in question, finally setting the recurrence to close the loop. The following code uses a Recurrence to implement the grammar \( S\ :=\ \epsilon\ |\ S\ \cdot\ '1' \). The graph to the right shows the structure of the parser. (One thing I added is enough infrastructure to output dot visualizations of the parser structures.)

    Recurrence S = new Recurrence();
    Parser a = new Alternative(new Epsilon(), new Concat(S, new Literal('1')));
    S.setParser(a);

Now to make it work

This code has just two problems: derive and deriveNull. In many of the parsers, those two methods are recursive, and the Recurrence makes the language parser circular. However, Matt Might presents three ideas that neatly combine to make the two methods safe and the whole at least somewhat useful. The three ideas are:

  1. Lazy evaluation: delaying the evaluation of an expression until the result is actually required (and ensuring that the evaluation is only performed once). (Wikipedia)
  2. Computing fixed points of recursive functions. (Wikipedia)
  3. Memoizing function results, trading memory for computation. (Wikipedia)

Laziness

Lazy evaluation, in this case, is the technique of creating a record that a given expression has been seen without evaluating the expression. The expression is not actually evaluated until it is needed, at which time the result is also recorded to be used whenever the result is needed. Specifically, the derivation of a parser with respect to a given character is performed lazily by the Delay parser, in the derive method of the Parser class Fix:

public abstract class Fix extends Parser
{
  public abstract Parser innerDerive(char ch);
  public abstract Set innerDeriveNull();

  @Override
  public Parser derive(char ch)
  {
    return new Delay(this,ch);
  }
[...]
}

The result of the derivation is a Delay parser, a static inner class of Fix, which is constructed with a parser and a character and represents the evaluation of derive on the parser with respect to the character.

  private static class Delay extends Parser
  {
    private Fix parser;
    private char ch;
    private Parser derivative = null;
    
    public Delay(Fix parser, char ch)
    {
      this.parser = parser;
      this.ch = ch;
    }

    @Override
    public Parser derive(char ch)
    {
      this.force();
      return this.derivative.derive(ch);
    }

    @Override
    public Set deriveNull()
    {
      this.force();
      return this.derivative.deriveNull();
    }

    private void force()
    {
      if (derivative == null)
      {
        derivative = this.parser.innerDerive(this.ch);
      }
    }
  }

The force method performs the delayed evaluation, storing the result as the derivative parser. Delay's derive and deriveNull methods first force the evaluation before returning the result of the corresponding method of the derivative.

How does this prevent endless recursion? The derive method of a subclass of Fix returns immediately; when the actual derivative is needed, it is computed, but because Fix is (now) the superclass of Concat, Alternative, and all the other structural parsers, each step computes a single derivation rather than all of them.

Subclasses of Fix use the definition of derive (and deriveNull) from Fix, but their original definitions are renamed as the methods innerDerive and innerDeriveNull, the first of which is used by the force method of Delay to actually evaluate the derivative.

Fixing stuff

Unfortunately, laziness does not directly help with the deriveNull function, since it is computing the end result of a parse and therefore doesn't present the opportunity to be lazy in a strict language like Java. (Actually, I'm not sure how deriveNull would need to be handled in a non-strict language like Haskell.) Fortunately, the second trick is available: computing the fixed point of deriveNull. Computing a fixed point here is similar to the evaluation strategy used by spreadsheets: assume a tentative value for each cell (or each variable) and compute the expression, using the tentative values if needed in the expression. Then take the resulting value as the new tentative value and repeat until the resulting value is the same as the tentative value. At that state, the tentative value is no longer tentative; it is the fixed point and no further repetitions will change it.

In the case of the Fix class, the derive null method looks like:

  private Set _set = null;

  @Override
  public Set deriveNull()
  {
    if (_set != null) { return _set; }
    Set newSet = new HashSet();
    do {
      _set = newSet;
      newSet = innerDeriveNull();
    } while (! _set.equals( newSet ));
    return _set;
  }

The innerDeriveNull method, like the innerDerive method, is the renamed deriveNull (and derive) methods from the Concat, Alternative, and similar classes.

The one weakness of computing a fixed point is the necessity of demonstrating that there is one. In this case, I have no real, good argument that deriveNull is going to terminate; it is possible for a fixed point computation to iterate forever between some group of values or grow without bounds, or have some other poor behavior. For the moment, though, I am just going to assume Might has such an argument, even though I don't remember it from either the presentation or the papers I've seen. In short, it seems to work.

Take a memo

The third technique, memoization, is needed because the deriveNull method is not quite as nice as I have painted it. In fact, it seems to go into an infinite recursion, since it forces delayed derivatives, which in turn creates new delayed derivatives that are, again, in turn forced. So, I added memoization to the Fix.derive method shown above, replacing it with:

  private static PairMap<Parser,Character,Parser> derivatives = new PairMap<Parser, Character, Parser>();

  @Override
  public Parser derive(char ch)
  {
    if (! derivatives.contains(this, ch)) { derivatives.put(this, ch, new Delay(this, ch)); }
    return derivatives.get(this, ch);
  }

Memoization is a specific kind of caching, where the value of a function applied to a set of arguments is stored, associated with the arguments; if the function is called again with the same arguments, then the stored value is returned. This code uses a PairMap (think of it as a map from (Parser,Character) pairs to Parsers) to record the parser being derived and the character with a Delay instance that represents the derivative. In the future, when this derivation is asked for again, the same Delay instance is returned. And, for my deriveNull issue, the Delay instance may already have been forced, meaning no new derivative must be computed and no infinite recursion happens.

The big demo

How well does it work, since I assert the current state does work?

Take code implementing the S grammar above and apply it to "111" as so:

    Recurrence S = new Recurrence();
    Parser a = new Alternative( new Epsilon(),
                                new Reduce( new Concat(S,
                                                       new Literal('1') ),
                                            reduction) );
    S.setParser(a);
    final Parser S111 = a.derive('1').derive('1').derive('1');
    Set result = S111.deriveNull();

The result is a set containing the string "111", which is the reduction of the parsed values. The final state of the parse, S111, is diagrammed on the right.

For a more complex example, take the following (still relatively simple) expression grammar: \[ \begin{array}{rcl} 1 & := & '1' \\ + & := & '+' \\ * & := & '*' \\ \textrm{op} & := & +\ |\ * \\ \textrm{ex} & := & \textrm{term}\ \cdot\ \textrm{op}\ \cdot\ \textrm{term} \\ \textrm{term} & := & 1\ |\ \textrm{ex} \end{array} \]

Note that this grammar is ambiguous: a string like "1+1*1" can be parsed in two ways, with different precedences for the + and the *. Currently, building the parser is a bit hideous:

    Recurrence term = new Recurrence();
    Parser one = new Literal('1');
    Parser add = new Literal('+');
    Parser mul = new Literal('*');
    Parser op  = new Alternative(add, mul);
    Parser ex  = new Concat(term,new Concat(op,term));
    Parser l   = new Alternative(one,ex);
    term.setParser(l);
    for (char ch : "1+1*1".toCharArray())
    {
      l = l.derive(ch);
    }
    System.out.println( l.deriveNull() );

The interesting thing is that the result is a pair of parse trees:

[<1,<+,<1,<*,1>>>>, <<1,<+,1>>,<*,1>>]

These two trees represent "1+(1*1)" and "(1+1)*1". Might seems to feel that this is a better way of debugging ambiguous grammars, rather than yacc's shift/reduce and reduce/reduce errors. I'm a little dubious.

The weakness of the current system is the number of nodes in the parser graph. S111 above has 12 nodes, the parse of "11111" has 20, ten 1's have 177 nodes, and twenty 1's have 5395 before the deriveNull step (which is going to generate more nodes as it forces delayed derivations). But dealing with that is for next time.

If you want to take a look at the code, check github.

No comments: