Practical regular expression derivatives

Posted on February 25, 2012 by Tommy McGuire

Once again, I am returning to the topic of implementing regular expressions. I don't know why I like them, other than that regexes are fun and easy and closely related to important parts of computer science. Anyway, this one is a new, old idea.

Disclaimer: "Practical" may not be entirely the correct term; I have not done anything new or special here, nor am I releasing feature-complete, industrial strength code. But, hey, I had to title the post something. I think the approach is practical, anyway.

Background

Roughly speaking, there are two ways of turning a regular expression into something executable; both begin by building a nondeterministic finite state machine from the expression. The first then converts the nondeterministic FSM into a deterministic finite state machine (that this is always possible and the details of how it is done can be found in your local automata theory textbook). The deterministic FSM can then be either interpreted directly or compiled to code. The second is to interpret the nondeterministic machine itself. Both approaches have strengths and weaknesses.

A deterministic finite state machine offers the possibility of speed: each character of the input is examined once and discarded. This is one reason why deterministic machines are frequently compiled to code, which can be executed directly and provides excellent performance. This is the method used by the lex and flex lexical analysis tools. On the other hand, the conversion to a deterministic machine can result in an exponential increase in the number of states, causing the method to use excessive amounts of memory. Also, the conversion to a deterministic machine changes the structure of the FSM, making it difficult (if not impossible, but it seems like I've seen something to the contrary) to have features like matching groups common to most regular expression libraries.

Interpreting the nondeterministic machine is the way most regular expression libraries work. Because the nondeterministic machine is almost a direct translation of the original expression, the regular expression engine can easily support features like matching groups and back references that make regular expression libraries strictly more powerful than theoretical regular languages. On the other hand, because it is implementing a nondeterministic machine, the library must use backtracking algorithms; upon reading a character 'a', the engine must enter some state s and subsequent characters in the input stream may not match, forcing the engine to back up to the choice, re-read the 'a' and choose a different state t to enter. Backtracking means that matching an input stream is an exponential algorithm.

Regular expression derivatives are a third option.

Regular expression derivatives

The derivative of a regular expression, according to Matt Might, is defined as:

The derivative of a regular expression with respect to a character is a new regular expression that matches the tails of all strings beginning with that character and which match the original regular expression [; it] computes a new regular expression that matches what the original expression would match, assuming it had just matched the character.

So, the basic idea of a regular expression engine based on derivatives is, given a regular expression and upon reading a character from the input, to compute the derivative of the expression with respect to the character, giving a new expression that can be matched against the next character.

Asides...

Before I go any further, in deference to my aeronautical engineering friends, a word about the use of "derivative". What's up with that? The use goes back to Janusz A. Brzozoski in 1964, and I have not been able to find his paper to determine how he justifies the term. On the other hand, if you have a mathematical expression, say \(x^2\), you can compute the derivative with respect to \(x\), \(2x\), but you can also compute the derivative with respect to \(y\). So, why not use it for regexes?

Oh, and one other thing: why? Most existing regular expression libraries, including lex, flex, and the various Perl-compatible libraries, are what I would like to call closed. (Matt Might calls them blocking, but I think that is an overloading of that word that becomes confusing since it will be used in the same context as slightly different uses of "blocking".) Say you are attempting to parse data coming into your program on a TCP stream. One of the behaviors of TCP that occasionally confuses beginners is that, no matter how big a block of data you write to the sending side of a stream at once, the sending TCP will break the data into segments to transmit them; when the receiving TCP gets the next segment in the stream, it makes the data from that segment available to the application. As a result, the gets data in batches, potentially one segment at a time. If you were attempting to parse space-separated tokens, you could get a non-space character at the end of a segment. In order to determine if the character ends a token or if the token is continued, the program must wait until the next segment becomes available. Waiting for another segment is problematic if it prevents the program from doing something else that you would like it to do.

Now, most regular expression libraries are closed, meaning they would block in that situation waiting for either the end of the entire input stream or at least the end of the provisional token. If instead you want non-blocking behavior, you need an open parser (Oleg Kiselyov refers to these as incremental), one which can return control to the rest of the program while maintaining its current parsing state on this input stream. There is no reason that either deterministic or nondeterministic libraries could not support being open; most just don't. Therefore, a new implementation is needed, with different trade-offs. The derivative technique makes this sort of behavior easy. The other primary method, using delimited (or non-delimited) continuations (à la Oleg), is a wee bit more complicated to implement, although it would allow any library to be used incrementally.

Teh formal stuff

Matt Might provides the following formulaic definition of computing the derivative of a regular expression. D_c(re) is the derivative of the regular expression re. ∅ is the empty expression, the regular expression that matches no strings. ε is the regular expression that matches exactly (and only) the empty string. Regular expressions are built from ∅, ε, regular characters, sequential composition (re1 re2), union (re1 | re2), and the Kleene star (re*).

D_c(∅) = ∅
D_c(ε) = ∅
D_c(c) = ε if c = c'
D_c(c') = ∅ if c ≠ c'
D_c(re1 re2) = δ(re1) D_c(re2) | D_c(re1) re2
D_c(re1 | re2) = D_c(re1) | D_c(re2)
D_c(re*) = D_c(re) re*

This definition uses an additional function, δ(re), that effectively determines whether or not its argument expression accepts the empty string. If the expression accepts the empty string, δ evaluates to ε; if it does not, δ evaluates to ∅. By using δ in the above expression for sequential composition, D_c enables or disables the left side of the result; if re1 accepts the empty string then "re1 re2" could accept the strings matched by r2; otherwise it cannot and only those that begin with re1 are acceptable.

δ(∅) = ∅
δ(ε) = ε
δ(c) = ∅
δ(re1 re2) = δ(re1) δ(re2)
δ(re1 | re2) = δ(re1) | δ(re2)
δ(re*) = ε

The matching strategy with derivatives is straightforward:
If the string to match is empty and the current pattern matches empty, then the match succeeds.
If the string to match is non-empty, the new pattern is the derivative of the current pattern with respect to the first character of the current string, and the new string to match is the remainder of the current string.

Therefore, δ also allows us to check whether the overall pattern matches: if δ(re) is ε at the end of the input, the pattern matches; if δ(re) is ∅ at the end of the input, the pattern does not match.

In words, the derivative of ∅ and ε is ∅; they do not match anything. The derivative of a character is ε if the character matches the pattern and ∅ otherwise. The derivative of concatenated expressions, re1 re2, is the union of the derivative of re2 (if re1 accepts the empty string) and the derivative of re1 concatenated with re2. The derivative of a union expression is the union of the derivatives of the two expressions. Finally, the derivative of a Kleene star applied to an expression re is the derivative of re concatenated with the original re* expression.

How do these definitions play out in code? Not bad at all.

Java

This code is strongly based on Matt Might's Scala code.

The code is defined in terms of the regular language which the given expression operator accepts, for example a single character or the union of two other regular languages. The conversion of a regular expression into the corresponding description of the regular language should be fairly obvious, after you have seen the code.

RegularLanguage

A RegularLanguage, the superclass of the individual operators, requires four methods and has a few helpful operations.


public abstract class RegularLanguage
{

  abstract public RegularLanguage derive(char c);
  abstract public boolean acceptsEmptyString();
  abstract public boolean rejectsAll();
  abstract public boolean isEmptyString();
  
  public boolean mustBeSubsumedBy(RegularLanguage other) { return false; }
  
  public RegularLanguage oneOrMore()
  {
    if (this.isEmptyString()) { return this; }
    if (this.rejectsAll()) { return EmptySet.instance; }
    return Catenation.concatenate(this, Star.star(this));
  }
  
  public RegularLanguage option()
  {
    if (this.isEmptyString()) { return this; }
    if (this.rejectsAll()) { return Epsilon.instance; }
    return Union.union(Epsilon.instance, this);
  }
}

derive() implements D_c; acceptsEmptyString() implements δ. rejectsAll() and isEmptyString() are tests, effectively for ∅ and ε.

mustBeSubsumedBy() is a small, but important, optimization; it tests whether this expression is a sub-language of the other expression. It is hard to compute and as a result by default false.

oneOrMore() and option() are basic extensions to the regular expression language so far; oneOrMore() is equivalent to the '+' operation and option() is the '?' operation. EmptySet, Catenation, Star, Union, and Epsilon are described below.

EmptySet

The EmptySet is the RegularLanguage that contains no strings; It does not accept the empty string, the derivative with respect to anything is also the EmptySet, and it rejects any input.


  public static class EmptySet extends RegularLanguage
  {
    public static final EmptySet instance = new EmptySet();

    @Override public RegularLanguage derive(char c) { return this; }
    @Override public boolean acceptsEmptyString() { return false; }
    @Override public boolean rejectsAll() { return true; }
    @Override public boolean isEmptyString() { return false; }

    private EmptySet() { }
  }

Epsilon

Epsilon represents ε, the language consisting of the empty string. The derivative is the EmptySet, since any character is not the empty string.


  public static class Epsilon extends RegularLanguage
  {
    public static final Epsilon instance = new Epsilon();

    @Override public RegularLanguage derive(char c) { return EmptySet.instance; }
    @Override public boolean acceptsEmptyString() { return true; }
    @Override public boolean rejectsAll() { return false; }
    @Override public boolean isEmptyString() { return true; }
    
    private Epsilon() { }
  }

Character

Character is the language consisting of the string with the one, given character. Unlike the above classes, Character cannot be a singleton, although you could optimize this a bit by caching instances.

The derivative is Epsilon, if the character matches, or EmptySet, if it does not. Character does not accept the empty string.

Character is one case where mustBeSubsumedBy() is possible to compute; it is true if the other language is also this same character or is AnyChar.


  public static class Character extends RegularLanguage
  {
    public final char ch;
    
    public Character(char ch) { this.ch = ch; }

    @Override public RegularLanguage derive(char c) { return (this.ch == c) ? Epsilon.instance : EmptySet.instance; }
    @Override public boolean acceptsEmptyString() { return false; }
    @Override public boolean rejectsAll() { return false; }
    @Override public boolean isEmptyString() { return false; }
    
    @Override public boolean mustBeSubsumedBy(RegularLanguage other)
    {
      return ((other instanceof Character && this.ch == ((Character) other).ch) || (other instanceof AnyChar));
    }
  }

AnyChar, CharSet, and NotCharSet

AnyChar is equivalent to the regular expression '.', CharSet implements '[...]', equivalent to the union of a finite set of Characters. NotCharSet is the negation of a CharSet. I'm not going to present the code because they are very similar to Character.

Catenation

This is where things start to get interesting. In fact, it is the most complex of the classes.

A Catenation is built from a prefix expression and a suffix expression. As an optimization, the factory method concatenate() checks whether either is the empty string or if either rejects all input, in which cases the result is simpler than the inputs.

The derivative is,

The concatenation of the derivative of the prefix and the suffix, if the prefix does not accept the empty string, or
The union of the above and the derivative of the suffix, if the prefix accepts the empty string.

A Catenation accepts the empty string if both the prefix and the suffix accept the empty string. It rejects all input if either the prefix or the suffix reject all input. It is the empty string if both the prefix and the suffix are the empty string, although this and the rejects all input case should have been handled by the concatenate() factory.


  public static class Catenation extends RegularLanguage
  {
    public final RegularLanguage prefix;
    public final RegularLanguage suffix;
    
    public static RegularLanguage concatenate(RegularLanguage prefix, RegularLanguage suffix)
    {
      if (prefix.isEmptyString()) { return suffix; }
      if (suffix.isEmptyString()) { return prefix; }
      if (prefix.rejectsAll() || suffix.rejectsAll()) { return EmptySet.instance; }
      return new Catenation(prefix, suffix);
    }
    
    private Catenation(RegularLanguage prefix, RegularLanguage suffix) { this.prefix = prefix; this.suffix = suffix; }
    
    @Override
    public RegularLanguage derive(char c)
    {
      final RegularLanguage simple = concatenate(prefix.derive(c), suffix);
      return prefix.acceptsEmptyString() ? Union.union(simple, suffix.derive(c)) : simple;
    }

    @Override public boolean acceptsEmptyString() { return prefix.acceptsEmptyString() && suffix.acceptsEmptyString(); }
    @Override public boolean rejectsAll() { return prefix.rejectsAll() || suffix.rejectsAll(); }
    @Override public boolean isEmptyString() { return prefix.isEmptyString() && suffix.isEmptyString(); }
  }

Union

The Union has left and right subexpressions. The union() factory method simplifies the resulting expression if the left or right subexpressions reject all input or must be subsumed by one another.

The derivative of a Union is a Union of the derivatives of the components and a Union accepts the empty string if either of the components accepts the empty string.


  public static class Union extends RegularLanguage
  {
    public final RegularLanguage left;
    public final RegularLanguage right;
    
    public static RegularLanguage union(RegularLanguage left, RegularLanguage right)
    {
      if (left.rejectsAll() || left.mustBeSubsumedBy(right)) { return right; }
      if (right.rejectsAll() || right.mustBeSubsumedBy(left)) { return left; }
      return new Union(left,right);
    }
    
    private Union(RegularLanguage left, RegularLanguage right) { this.left = left; this.right = right; }
    
    @Override public RegularLanguage derive(char c) { return Union.union(left.derive(c), right.derive(c)); }
    @Override public boolean acceptsEmptyString() { return left.acceptsEmptyString() || right.acceptsEmptyString(); }
    @Override public boolean rejectsAll() { return left.rejectsAll() && right.rejectsAll(); }
    @Override public boolean isEmptyString() { return left.isEmptyString() && right.isEmptyString(); }
  }

Star

Star implements the Kleene star; it has a base language and the factory method simplifies the result if the base is the empty string or rejects all input. The derivative is the concatenation of the derivative of the base language and this original Star instance. Star accepts the empty string, which would be zero repetitions of the base language.


  public static class Star extends RegularLanguage
  {
    public final RegularLanguage language;
    
    public static RegularLanguage star(RegularLanguage l)
    {
      if (l.isEmptyString()) { return l; }
      if (l.rejectsAll()) { return Epsilon.instance; }
      return new Star(l);
    }
    
    private Star(RegularLanguage language) { this.language = language; }
    
    @Override public RegularLanguage derive(char c) { return Catenation.concatenate(language.derive(c), this); }
    @Override public boolean acceptsEmptyString() { return true; }
    @Override public boolean rejectsAll() { return false; }
    @Override public boolean isEmptyString() { return language.isEmptyString(); }
  }

Conclusion

The code for this engine is a little over 300 lines, including some comments and code that I haven't presented. That is about the same size as Matt Might's Scala version. A quick test showed a moderately complex match of /usr/share/dict/words took 180-200ms, which is roughly an order of magnitude slower than wc.

How could I use it to implement an open scanner for TCP connections? The state of the match is the current RegularLanguage, created by deriving from the prior RegularLanguage based on the last input character. By exposing that RegularLanguage, or an interface to it, I can throw each incoming character at the regex engine, handling matches as they occur, and then go off to do something else when I reach the end of a segment and need to wait for further input.

At the moment, I am not going to make the code available; there is not much to it as it is. Instead, my next project is to look at the context free parsers that Might, et al., describe, created using grammar derivatives.

Comments

You might enjoy looking at how we compile a simple implementation of regular expression derivatives using partial evaluation, and get super-fast regular expression matchers as a result.

William Cook