## Saturday, February 25, 2012

### Practical regular expression derivatives

Once again, I am returning to the topic of implementing regular expressions. I don't know why I like them, other than that regexes are fun and easy and closely related to important parts of computer science. Anyway, this one is a new, old idea.

Disclaimer: "Practical" may not be entirely the correct term; I have not done anything new or special here, nor am I releasing feature-complete, industrial strength code. But, hey, I had to title the post something. I think the approach is practical, anyway.

## Background

Roughly speaking, there are two ways of turning a regular expression into something executable; both begin by building a nondeterministic finite state machine from the expression. The first then converts the nondeterministic FSM into a deterministic finite state machine (the proof that this is always possible, along with the details of how the conversion is done, can be found in your local automata theory textbook). The deterministic FSM can then be either interpreted directly or compiled to code. The second approach is to interpret the nondeterministic machine itself. Both approaches have strengths and weaknesses.

A deterministic finite state machine offers the possibility of speed: each character of the input is examined once and discarded. This is one reason why deterministic machines are frequently compiled to code, which can be executed directly and provides excellent performance. This is the method used by the lex and flex lexical analysis tools. On the other hand, the conversion to a deterministic machine can result in an exponential increase in the number of states, causing the method to use excessive amounts of memory. Also, the conversion changes the structure of the FSM, making it difficult (if not impossible, though I seem to recall seeing something to the contrary) to support features like matching groups that are common to most regular expression libraries.

Interpreting the nondeterministic machine is the way most regular expression libraries work. Because the nondeterministic machine is almost a direct translation of the original expression, the regular expression engine can easily support features like matching groups and back references that make regular expression libraries strictly more powerful than theoretical regular languages. On the other hand, because it is implementing a nondeterministic machine, the library must use backtracking algorithms; upon reading a character 'a', the engine may choose to enter some state s, and if subsequent characters in the input stream fail to match, the engine must back up to the choice point, re-read the 'a', and choose a different state t to enter. Backtracking means that matching an input stream can take time exponential in the length of the input, in the worst case.
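
The worst case is easy to provoke with java.util.regex, which is a backtracking engine; the pattern below is a standard pathological example (the long failing input is left commented out because it really does take exponential time):

```java
import java.util.regex.Pattern;

class Backtracking
{
    public static void main(String[] args)
    {
        // (a|a)* offers two identical choices at every 'a', so a failing
        // match forces the engine to explore roughly 2^n alternative paths.
        Pattern p = Pattern.compile("(a|a)*b");

        System.out.println(p.matcher("aaaab").matches());   // true, and fast

        // Uncomment at your peril: forty 'a's followed by a 'c' cannot match,
        // and the engine backtracks through on the order of 2^40 paths.
        // StringBuilder sb = new StringBuilder();
        // for (int i = 0; i < 40; i++) { sb.append('a'); }
        // System.out.println(p.matcher(sb.append('c').toString()).matches());
    }
}
```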

Regular expression derivatives are a third option.

## Regular expression derivatives

The derivative of a regular expression, according to Matt Might, is defined as:

> The derivative of a regular expression with respect to a character is a new regular expression that matches the tails of all strings beginning with that character and which match the original regular expression [; it] computes a new regular expression that matches what the original expression would match, assuming it had just matched the character.

So, the basic idea of a regular expression engine based on derivatives is, given a regular expression and upon reading a character from the input, to compute the derivative of the expression with respect to the character, giving a new expression that can be matched against the next character.

## Asides...

Before I go any further, in deference to my aeronautical engineering friends, a word about the use of "derivative". What's up with that? The usage goes back to Janusz A. Brzozowski in 1964, and I have not been able to find his paper to determine how he justifies the term. On the other hand, if you have a mathematical expression, say $$x^2$$, you can compute the derivative with respect to $$x$$, $$2x$$, but you can also compute the derivative with respect to $$y$$. So, why not use it for regexes?

Oh, and one other thing: why? Most existing regular expression libraries, including lex, flex, and the various Perl-compatible libraries, are what I would like to call closed. (Matt Might calls them blocking, but I think that is an overloading of that word that becomes confusing since it will be used in the same context as slightly different uses of "blocking".) Say you are attempting to parse data coming into your program on a TCP stream. One of the behaviors of TCP that occasionally confuses beginners is that, no matter how big a block of data you write to the sending side of a stream at once, the sending TCP will break the data into segments to transmit them; when the receiving TCP gets the next segment in the stream, it makes the data from that segment available to the application. As a result, the application gets data in batches, potentially one segment at a time. If you were attempting to parse space-separated tokens, you could get a non-space character at the end of a segment. In order to determine if the character ends a token or if the token is continued, the program must wait until the next segment becomes available. Waiting for another segment is problematic if it prevents the program from doing something else that you would like it to do.

Now, most regular expression libraries are closed, meaning they would block in that situation waiting for either the end of the entire input stream or at least the end of the provisional token. If instead you want non-blocking behavior, you need an open parser (Oleg Kiselyov refers to these as incremental), one which can return control to the rest of the program while maintaining its current parsing state on this input stream. There is no reason that either deterministic or nondeterministic libraries could not support being open; most just don't. Therefore, a new implementation is needed, with different trade-offs. The derivative technique makes this sort of behavior easy. The other primary method, using delimited (or non-delimited) continuations (à la Oleg), is a wee bit more complicated to implement, although it would allow any library to be used incrementally.

## Teh formal stuff

Matt Might provides the following formal definition for computing the derivative of a regular expression. Dc(re) is the derivative of the regular expression re with respect to the character c. ∅ is the empty language, the regular expression that matches no strings. ε is the regular expression that matches exactly (and only) the empty string. Regular expressions are built from ∅, ε, regular characters, sequential composition (re1 re2), union (re1 | re2), and the Kleene star (re*).

• Dc(∅) = ∅
• Dc(ε) = ∅
• Dc(c') = ε if c = c'
• Dc(c') = ∅ if c ≠ c'
• Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
• Dc(re1 | re2) = Dc(re1) | Dc(re2)
• Dc(re*) = Dc(re) re*

This definition uses an additional function, δ(re), that effectively determines whether or not its argument expression accepts the empty string. If the expression accepts the empty string, δ evaluates to ε; if it does not, δ evaluates to ∅. By using δ in the above rule for sequential composition, Dc enables or disables the left side of the result: if re1 accepts the empty string, then "re1 re2" could accept the strings matched by re2 alone; otherwise it cannot, and only strings that begin with something matching re1 are acceptable.

• δ(∅) = ∅
• δ(ε) = ε
• δ(c) = ∅
• δ(re1 re2) = δ(re1) δ(re2)
• δ(re1 | re2) = δ(re1) | δ(re2)
• δ(re*) = ε
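
As a quick worked example (my own, not from the post), take the expression a b* and compute its derivative with respect to the character a, applying the sequence rule and then δ:

```latex
D_a(a\,b^*) = \delta(a)\,D_a(b^*) \mid D_a(a)\,b^*
            = \emptyset\,D_a(b^*) \mid \varepsilon\,b^*
            = \emptyset \mid b^*
            = b^*
```

Reading a subsequent 'b' then gives Db(b*) = Db(b) b* = ε b* = b*, so the expression settles at b*, which matches the remaining input exactly as expected.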

The matching strategy with derivatives is straightforward:

1. If the string to match is empty and the current pattern matches empty, then the match succeeds.
2. If the string to match is non-empty, the new pattern is the derivative of the current pattern with respect to the first character of the current string, and the new string to match is the remainder of the current string.

Therefore, δ also allows us to check whether the overall pattern matches: if δ(re) is ε at the end of the input, the pattern matches; if δ(re) is ∅ at the end of the input, the pattern does not match.

In words, the derivative of ∅ and ε is ∅; neither matches any string that begins with a character. The derivative of a character is ε if the character matches the pattern and ∅ otherwise. The derivative of concatenated expressions, re1 re2, is the union of the derivative of re2 (included only if re1 accepts the empty string) and the derivative of re1 concatenated with re2. The derivative of a union expression is the union of the derivatives of the two expressions. Finally, the derivative of a Kleene star applied to an expression re is the derivative of re concatenated with the original re* expression.

How do these definitions play out in code? Not bad at all.

## Java

This code is strongly based on Matt Might's Scala code.

The code is defined in terms of the regular language which the given expression operator accepts, for example a single character or the union of two other regular languages. The conversion of a regular expression into the corresponding description of the regular language should be fairly obvious, after you have seen the code.

### RegularLanguage

A RegularLanguage, the superclass of the individual operators, requires four methods and has a few helpful operations.

```java
public abstract class RegularLanguage
{
    abstract public RegularLanguage derive(char c);
    abstract public boolean acceptsEmptyString();
    abstract public boolean rejectsAll();
    abstract public boolean isEmptyString();

    public boolean mustBeSubsumedBy(RegularLanguage other) { return false; }

    public RegularLanguage oneOrMore()
    {
        if (this.isEmptyString()) { return this; }
        if (this.rejectsAll()) { return EmptySet.instance; }
        return Catenation.concatenate(this, Star.star(this));
    }

    public RegularLanguage option()
    {
        if (this.isEmptyString()) { return this; }
        if (this.rejectsAll()) { return Epsilon.instance; }
        return Union.union(Epsilon.instance, this);
    }
}
```


derive() implements Dc; acceptsEmptyString() implements δ. rejectsAll() and isEmptyString() are tests, effectively for ∅ and ε.

mustBeSubsumedBy() is a small, but important, optimization; it tests whether this expression is a sub-language of the other expression. Subsumption is hard to compute in general, so the default implementation simply returns false.

oneOrMore() and option() are basic extensions to the regular expression language so far; oneOrMore() is equivalent to the '+' operation and option() is the '?' operation. EmptySet, Catenation, Star, Union, and Epsilon are described below.

### EmptySet

The EmptySet is the RegularLanguage that contains no strings; it does not accept the empty string, its derivative with respect to anything is also the EmptySet, and it rejects all input.

```java
public static class EmptySet extends RegularLanguage
{
    public static final EmptySet instance = new EmptySet();

    @Override public RegularLanguage derive(char c) { return this; }
    @Override public boolean acceptsEmptyString() { return false; }
    @Override public boolean rejectsAll() { return true; }
    @Override public boolean isEmptyString() { return false; }

    private EmptySet() { }
}
```


### Epsilon

Epsilon represents ε, the language consisting of only the empty string. The derivative is the EmptySet, since no string that begins with a character is the empty string.

```java
public static class Epsilon extends RegularLanguage
{
    public static final Epsilon instance = new Epsilon();

    @Override public RegularLanguage derive(char c) { return EmptySet.instance; }
    @Override public boolean acceptsEmptyString() { return true; }
    @Override public boolean rejectsAll() { return false; }
    @Override public boolean isEmptyString() { return true; }

    private Epsilon() { }
}
```


### Character

Character is the language consisting of the single string made up of the one given character. Unlike the above classes, Character cannot be a singleton, although you could optimize this a bit by caching instances.

The derivative is Epsilon, if the character matches, or EmptySet, if it does not. Character does not accept the empty string.

Character is one case where mustBeSubsumedBy() is possible to compute; it is true if the other language is also this same character or is AnyChar.

```java
public static class Character extends RegularLanguage
{
    public final char ch;

    public Character(char ch) { this.ch = ch; }

    @Override public RegularLanguage derive(char c) { return (this.ch == c) ? Epsilon.instance : EmptySet.instance; }
    @Override public boolean acceptsEmptyString() { return false; }
    @Override public boolean rejectsAll() { return false; }
    @Override public boolean isEmptyString() { return false; }

    @Override public boolean mustBeSubsumedBy(RegularLanguage other)
    {
        return ((other instanceof Character && this.ch == ((Character) other).ch) || (other instanceof AnyChar));
    }
}
```


### AnyChar, CharSet, and NotCharSet

AnyChar is equivalent to the regular expression '.'; CharSet implements '[...]', equivalent to the union of a finite set of Characters; and NotCharSet is the negation of a CharSet. I'm not going to present the code because they are very similar to Character.
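
Since they follow the same pattern, a sketch of what CharSet might look like is short. This is my guess, not the post's actual code; minimal stand-ins for the base classes are included so the sketch compiles on its own.

```java
import java.util.Set;

// Minimal stand-ins for the RegularLanguage, EmptySet, and Epsilon classes
// shown earlier, just enough to make this sketch self-contained.
abstract class RegularLanguage
{
    abstract RegularLanguage derive(char c);
    abstract boolean acceptsEmptyString();
    abstract boolean rejectsAll();
    abstract boolean isEmptyString();
}

class EmptySet extends RegularLanguage
{
    static final EmptySet instance = new EmptySet();
    RegularLanguage derive(char c) { return this; }
    boolean acceptsEmptyString() { return false; }
    boolean rejectsAll() { return true; }
    boolean isEmptyString() { return false; }
}

class Epsilon extends RegularLanguage
{
    static final Epsilon instance = new Epsilon();
    RegularLanguage derive(char c) { return EmptySet.instance; }
    boolean acceptsEmptyString() { return true; }
    boolean rejectsAll() { return false; }
    boolean isEmptyString() { return true; }
}

// Hypothetical CharSet: the language of all one-character strings whose
// character is in the given set.
class CharSet extends RegularLanguage
{
    final Set<Character> chars;

    CharSet(Set<Character> chars) { this.chars = chars; }

    // The derivative is ε when c is in the set, ∅ otherwise.
    RegularLanguage derive(char c)
    {
        return chars.contains(c) ? Epsilon.instance : EmptySet.instance;
    }
    boolean acceptsEmptyString() { return false; }
    boolean rejectsAll() { return chars.isEmpty(); }
    boolean isEmptyString() { return false; }
}
```

A CharSet built from, say, Set.of('a', 'b') derives to Epsilon on 'a' or 'b' and to EmptySet on anything else; NotCharSet would simply invert the contains() test.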

### Catenation

This is where things start to get interesting. In fact, it is the most complex of the classes.

A Catenation is built from a prefix expression and a suffix expression. As an optimization, the factory method concatenate() checks whether either is the empty string or if either rejects all input, in which cases the result is simpler than the inputs.

The derivative is,

• The concatenation of the derivative of the prefix and the suffix, if the prefix does not accept the empty string, or
• The union of the above and the derivative of the suffix, if the prefix accepts the empty string.

A Catenation accepts the empty string if both the prefix and the suffix accept the empty string. It rejects all input if either the prefix or the suffix rejects all input. It is the empty string if both the prefix and the suffix are the empty string, although this case, like the rejects-all case, should already have been handled by the concatenate() factory.

```java
public static class Catenation extends RegularLanguage
{
    public final RegularLanguage prefix;
    public final RegularLanguage suffix;

    public static RegularLanguage concatenate(RegularLanguage prefix, RegularLanguage suffix)
    {
        if (prefix.isEmptyString()) { return suffix; }
        if (suffix.isEmptyString()) { return prefix; }
        if (prefix.rejectsAll() || suffix.rejectsAll()) { return EmptySet.instance; }
        return new Catenation(prefix, suffix);
    }

    private Catenation(RegularLanguage prefix, RegularLanguage suffix) { this.prefix = prefix; this.suffix = suffix; }

    @Override
    public RegularLanguage derive(char c)
    {
        final RegularLanguage simple = concatenate(prefix.derive(c), suffix);
        return prefix.acceptsEmptyString() ? Union.union(simple, suffix.derive(c)) : simple;
    }

    @Override public boolean acceptsEmptyString() { return prefix.acceptsEmptyString() && suffix.acceptsEmptyString(); }
    @Override public boolean rejectsAll() { return prefix.rejectsAll() || suffix.rejectsAll(); }
    @Override public boolean isEmptyString() { return prefix.isEmptyString() && suffix.isEmptyString(); }
}
```


### Union

The Union has left and right subexpressions. The union() factory method simplifies the resulting expression if the left or right subexpressions reject all input or must be subsumed by one another.

The derivative of a Union is a Union of the derivatives of the components and a Union accepts the empty string if either of the components accepts the empty string.

```java
public static class Union extends RegularLanguage
{
    public final RegularLanguage left;
    public final RegularLanguage right;

    public static RegularLanguage union(RegularLanguage left, RegularLanguage right)
    {
        if (left.rejectsAll() || left.mustBeSubsumedBy(right)) { return right; }
        if (right.rejectsAll() || right.mustBeSubsumedBy(left)) { return left; }
        return new Union(left, right);
    }

    private Union(RegularLanguage left, RegularLanguage right) { this.left = left; this.right = right; }

    @Override public RegularLanguage derive(char c) { return Union.union(left.derive(c), right.derive(c)); }
    @Override public boolean acceptsEmptyString() { return left.acceptsEmptyString() || right.acceptsEmptyString(); }
    @Override public boolean rejectsAll() { return left.rejectsAll() && right.rejectsAll(); }
    @Override public boolean isEmptyString() { return left.isEmptyString() && right.isEmptyString(); }
}
```


### Star

Star implements the Kleene star; it has a base language and the factory method simplifies the result if the base is the empty string or rejects all input. The derivative is the concatenation of the derivative of the base language and this original Star instance. Star accepts the empty string, which would be zero repetitions of the base language.

```java
public static class Star extends RegularLanguage
{
    public final RegularLanguage language;

    public static RegularLanguage star(RegularLanguage l)
    {
        if (l.isEmptyString()) { return l; }
        if (l.rejectsAll()) { return Epsilon.instance; }
        return new Star(l);
    }

    private Star(RegularLanguage language) { this.language = language; }

    @Override public RegularLanguage derive(char c) { return Catenation.concatenate(language.derive(c), this); }
    @Override public boolean acceptsEmptyString() { return true; }
    @Override public boolean rejectsAll() { return false; }
    @Override public boolean isEmptyString() { return language.isEmptyString(); }
}
```
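
Tying the pieces together, the matching strategy from earlier is just a loop: derive once per input character, then check whether the final expression accepts the empty string. Here is a self-contained sketch of that loop over a deliberately trimmed re-implementation of the classes (no factory simplifications or rejectsAll() short-circuiting; the names RL, Ch, Cat, Or, and Rep are mine, not the post's):

```java
// A trimmed, self-contained re-implementation for illustration; the real
// classes above add factory-method simplification and early rejection.
abstract class RL
{
    abstract RL derive(char c);
    abstract boolean acceptsEmptyString();

    // The matching strategy: one derivative per character, then check
    // whether the final expression accepts the empty string.
    static boolean matches(RL re, String input)
    {
        for (int i = 0; i < input.length(); i++) { re = re.derive(input.charAt(i)); }
        return re.acceptsEmptyString();
    }
}

class Empty extends RL   // the empty set: rejects everything
{
    static final Empty I = new Empty();
    RL derive(char c) { return this; }
    boolean acceptsEmptyString() { return false; }
}

class Eps extends RL     // epsilon: only the empty string
{
    static final Eps I = new Eps();
    RL derive(char c) { return Empty.I; }
    boolean acceptsEmptyString() { return true; }
}

class Ch extends RL      // a single character
{
    final char ch;
    Ch(char ch) { this.ch = ch; }
    RL derive(char c) { return c == ch ? Eps.I : Empty.I; }
    boolean acceptsEmptyString() { return false; }
}

class Cat extends RL     // sequential composition
{
    final RL a, b;
    Cat(RL a, RL b) { this.a = a; this.b = b; }
    RL derive(char c)
    {
        RL d = new Cat(a.derive(c), b);
        return a.acceptsEmptyString() ? new Or(d, b.derive(c)) : d;
    }
    boolean acceptsEmptyString() { return a.acceptsEmptyString() && b.acceptsEmptyString(); }
}

class Or extends RL      // union
{
    final RL a, b;
    Or(RL a, RL b) { this.a = a; this.b = b; }
    RL derive(char c) { return new Or(a.derive(c), b.derive(c)); }
    boolean acceptsEmptyString() { return a.acceptsEmptyString() || b.acceptsEmptyString(); }
}

class Rep extends RL     // Kleene star
{
    final RL a;
    Rep(RL a) { this.a = a; }
    RL derive(char c) { return new Cat(a.derive(c), this); }
    boolean acceptsEmptyString() { return true; }
}

class Demo
{
    public static void main(String[] args)
    {
        RL re = new Cat(new Ch('a'), new Rep(new Ch('b')));   // a b*
        System.out.println(RL.matches(re, "abbb"));   // true
        System.out.println(RL.matches(re, "ba"));     // false
    }
}
```

Note that without the simplifications performed by the factory methods above, the derived expressions grow with each character; the concatenate(), union(), and star() factories are what keep them small.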


## Conclusion

The code for this engine is a little over 300 lines, including some comments and code that I haven't presented. That is about the same size as Matt Might's Scala version. A quick test showed that a moderately complex match over /usr/share/dict/words took 180-200ms, which is roughly an order of magnitude slower than wc.

How could I use it to implement an open scanner for TCP connections? The state of the match is the current RegularLanguage, created by deriving from the prior RegularLanguage based on the last input character. By exposing that RegularLanguage, or an interface to it, I can throw each incoming character at the regex engine, handling matches as they occur, and then go off to do something else when I reach the end of a segment and need to wait for further input.
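
A sketch of that open interface is tiny. The names here are my own, not code from the post; Lang is a minimal stand-in for the RegularLanguage interface (derive, acceptsEmptyString, rejectsAll), and Lit, which matches exactly one literal string, exists only to make the sketch runnable.

```java
// Minimal stand-in for the RegularLanguage interface.
interface Lang
{
    Lang derive(char c);
    boolean acceptsEmptyString();
    boolean rejectsAll();
}

// The matcher's entire parsing state is the current derivative. Feed it
// characters as they arrive -- say, one TCP segment at a time -- and go do
// something else in between.
class IncrementalMatcher
{
    private Lang state;

    IncrementalMatcher(Lang start) { this.state = start; }

    void feed(CharSequence chunk)
    {
        for (int i = 0; i < chunk.length(); i++) { state = state.derive(chunk.charAt(i)); }
    }

    boolean matchesSoFar()  { return state.acceptsEmptyString(); }
    boolean canStillMatch() { return !state.rejectsAll(); }
}

// The language of exactly one literal string, for demonstration purposes.
class Lit implements Lang
{
    static final Lang FAIL = new Lang()
    {
        public Lang derive(char c) { return this; }
        public boolean acceptsEmptyString() { return false; }
        public boolean rejectsAll() { return true; }
    };

    private final String rest;

    Lit(String rest) { this.rest = rest; }

    public Lang derive(char c)
    {
        return (!rest.isEmpty() && rest.charAt(0) == c) ? new Lit(rest.substring(1)) : FAIL;
    }
    public boolean acceptsEmptyString() { return rest.isEmpty(); }
    public boolean rejectsAll() { return false; }
}
```

A driver would call feed() with each arriving chunk, check matchesSoFar() and canStillMatch(), and return to the event loop; no blocking read is ever needed.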

At the moment, I am not going to make the code available; there is not much to it as it is. Instead, my next project is to look at the context free parsers that Might, et al., describe, created using grammar derivatives.

## Monday, February 13, 2012

### The Anti-Meeting Theory of Time

1. Meetings occupy time.
2. Meeting cancellations free up time.
3. Scheduling a meeting, then canceling it results in the original amount of free time.
4. What happens when you schedule a meeting, then cancel it more than once?
5. Time is created!

## Thursday, February 9, 2012

### Link o' the day: Java and memory

From proggit, this is a presentation from OOPSLA 2008:

Building Memory-efficient Java Applications

## Friday, February 3, 2012

### Quote o' the week: Don Stewart on the big picture

This is from Don Stewart's keynote at PADL'12 (formally, the Fourteenth International Symposium on Practical Aspects of Declarative Languages) and is one of the few fundamental truths of computer programming. (Number three might be controversial.)

Really big picture: why software?
• Increase economic productivity by automating everything
• Make the complexity of the world tractable by making it hackable
• Impose structure through typed data
• Transform work/jobs/tasks into reusable/composable functions
• Find new efficiencies. Repeat

(Demonstrate you can do this and you'll never be out of work)

## Wednesday, February 1, 2012

### State space search: A*

Last time, I quoted Wikipedia, "For the 15-puzzle, lengths of optimal solutions range from 0 to 80 moves". Unfortunately, the two solutions I found require 255 and 205 moves. That seems less than satisfactory.

So, here's the state of 15-puzzle solutions as it stands: we have depth-first search, which is relatively fast but which is not guaranteed to find a solution. On the other hand, we have breadth-first search, which is guaranteed to find an optimal solution, in the sense of the smallest number of moves for the 15-puzzle, but which is slow and expensive. And finally, we have a couple of heuristics which, used alone, relatively quickly find a solution, having some of the benefits of depth-first search, but which find markedly sub-optimal solutions.

What this situation calls for is a slightly modified algorithm. The algorithm I have in mind is called A*. A* is actually a basic modification of the state evaluation function used by an informed search. Instead of the Manhattan distance or the number of tiles out of position (which I have learned should be called the Hamming distance) as the evaluation function, A* uses the sum of a forward-looking heuristic function and the cost of the path from the initial state to the state being evaluated. For the 15-puzzle, this would be, say, the Manhattan distance of the current state from the goal state plus the number of moves from the initial state to the current state.

A* is a best-first search: it expands the state that looks like it will produce the best solution, rather than a state that is simply structurally next (as in the case of uninformed searches) or a state that merely looks nearest to a solution. As a result, A* is likely to use fewer resources than breadth-first search while finding a better solution than an informed depth-first search.

In fact, A* has a very nice property: if the forward-looking heuristic function is admissible, then the search algorithm is guaranteed to find the optimal solution. Admissible simply means that the heuristic function never overestimates the cost of the path from the current state to a goal. Possibly the simplest way of finding an admissible heuristic function is to relax some of the constraints on the problem. For example, the Hamming distance heuristic relaxes the 15-puzzle by allowing tiles to jump directly to their goal positions and counting the number of tiles thus moved, while the Manhattan distance heuristic relaxes the constraints by allowing tiles to slide one square at a time through occupied positions (rather than requiring them to slide into the empty space) and counting the number of moves needed. Both the Hamming distance and the Manhattan distance are admissible.
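
To make the heuristic concrete, here is a hypothetical Manhattan-distance function for a 4x4 board stored row-major in an int array, with 0 as the blank; the representation and names are my assumptions, not the fifteenpuzzle code's:

```java
class Manhattan
{
    // Sum, over all tiles, of the row distance plus the column distance from
    // the tile's current position to its goal position. The blank (0) is not
    // counted. In a solved board, tile t sits at index t - 1.
    static int distance(int[] board)
    {
        int sum = 0;
        for (int i = 0; i < 16; i++)
        {
            int tile = board[i];
            if (tile == 0) { continue; }
            int goal = tile - 1;
            sum += Math.abs(i / 4 - goal / 4) + Math.abs(i % 4 - goal % 4);
        }
        return sum;
    }
}
```

The solved board scores 0, and swapping two adjacent tiles scores 2; since each single move slides exactly one tile one square, the function never overestimates, which is what makes it admissible.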

Converting the fifteenpuzzle code to use A* is incredibly easy:

```java
public static class AStarDistance implements Comparator<FState>
{
    @Override
    public int compare(FState o1, FState o2)
    {
        return (o1.totalDistance() + o1.pathLength) - (o2.totalDistance() + o2.pathLength);
    }
}
```


I created a new implementation of Comparator<FState> which compares two states by computing, for each, the sum of its totalDistance value (the Manhattan distance) and the length of the path from the initial state, and then comparing the resulting sums. When creating an instance of Search<FState>, use a PriorityQueue based on this Comparator.

How well does it work? I've got no clue. Sorry. The A* search goes through hundreds of thousands of states before it exhausts the default JVM heap size, and it takes long enough to do so that I don't really feel interested in trying it with more memory. The implementation could be optimized, but this is an exponential algorithm and optimization would be unlikely to make too much of a dent. (On the other hand, there is the possibility, however remote, that I've really hoarked something up. If so, I'd appreciate knowing about it.)

But I have the next best thing: by modifying the A* evaluation function, I can bias the evaluation towards the heuristic function, producing a solution in reasonable time and space. Unfortunately, by doing so, I am likely to make the heuristic function inadmissible, meaning that A* is no longer guaranteed to find an optimal solution. This is the implementation:

```java
public static class BiasedAStarDistance implements Comparator<FState>
{
    private final int bias;

    public BiasedAStarDistance(int bias)
    {
        this.bias = bias;
    }

    @Override
    public int compare(FState o1, FState o2)
    {
        return (bias * o1.totalDistance() + o1.pathLength) - (bias * o2.totalDistance() + o2.pathLength);
    }
}
```


For me, it works significantly better than the un-biased A* search.

| Bias | States examined | Path length |
|------|-----------------|-------------|
| 5    | 3865            | 75          |
| 4    | 4266            | 83          |
| 3    | 15,442          | 73          |
| 2    | 19,879          | 57          |
You can get the code on github. The previous posts are State space search: the basics and State space search: heuristics.