Monads and regular expressions

Posted on August 8, 2010 by Tommy McGuire

Labels: software development, notation, toy problems, haskell, books, programming language

The thing that makes Software Tools, by Brian W. Kernighan and P. J. Plauger, a classic among programming books is that it begins with a minimal set of very basic operations and uses them to implement a good collection of the standard Unix tool set, while demonstrating good design and development principles and good coding techniques. This is in spite of being more than thirty years old, being written around languages that have not been widely used in decades (Pascal, in one version of the book) or that most readers have never heard of (Ratfor, in the other version), and containing some (rare) ideas that thirty years of programming experience have shown to be ill-advised.

As for basic operations, Software Tools begins with literally nothing more than reading a character from the standard input stream. From that, it presents file manipulation tools, an archiver, an editor (ok, a very simple one), and some text processing and source-to-source compiling tools. In doing so, it shows the use of the command-line-level tools that are being developed as well as the re-use of code tools built for previous programs.

The range of the book, from the simple to the complex if not comprehensive, is the reason I chose it as a tool for learning Haskell. As I mentioned on the pages I wrote about the project, I had previously bounced off Haskell precisely because all of the tutorials I found were built around topics that I was already familiar with from LISP, ML and O'Caml, while none of them dealt with the kind of grubby programming that Software Tools presents and that is so different in Haskell. (No, I have not read Real World Haskell yet. Soon, I promise.)

Anyway, one of the software tools presented in Software Tools is a simple nondeterministic regular expression engine, used as the basis of the sed- and grep-like tools, as well as used in the ed-like editor. While reimplementing it, dividing the work into an expression parser and the actual engine itself was easiest. I based the engine on the Maybe data type, allowing me to return "Just String" in the case where a regular expression component matched (and the string value was the remainder of the input) and "Nothing" where the component did not match.

It was at that point that I remembered that Maybe was a monad.

Note: This post is not a monad tutorial. I do not know enough category-theoretical mathematics to understand what the heck a monad is, so I am completely unqualified to write a meaningful one. Sorry.

However, what I do know about monads is that they are a way of sequencing operations and that they are not anything new. If you have written any code in an imperative language like Java, C, C#, C++, FORTRAN, Pascal, etc., then you have used a monad. In particular, you have used what Haskell calls the IO monad, because Haskell uses the monad structure to sequence I/O operations, a perennial problem for pure functional languages. Indeed, with Haskell's do notation and the IO monad, code is almost indistinguishable from any normal procedural language. What separates Haskell is that it goes further, exposing the monad as a general programming feature. And the neat thing is that there is more than one way to sequence operations.

The Maybe monad is one of those ways. Consider a series of operations each of which can succeed, passing its results on to the next operation, or fail, causing the entire series to fail. The Maybe monad exactly encapsulates this form of sequencing: a successful operation returns "Just x" where x is the result of the operation; a failing operation returns "Nothing", which is passed on through the remainder of the operations to be the result of the whole series. Operations after the failing one are not executed.

The thing is, expressing the regular expression engine using monadic notation was obviously better than expressing it using Maybe directly, since the latter required handling the Nothing cases that the former dealt with automatically. On the other hand, both forms meant the same thing. A similar effect is apparent when comparing, say, walking over a tree structure using a loop control structure, which requires manipulating a stack structure, with walking over the tree recursively, where the stack is implicit.

Anyway, while working on Software Tools, I realized the potential of monadic operations but did not immediately see how to go further to take advantage of them. That is the topic of that post.


> module Pattern where
> 
> import Prelude hiding (or,seq)
> import Control.Monad
> import List hiding (or)

A regular expression pattern is represented by an instance of the Pattern type, which is built from a constructor, P, and a function from a pair of lists to some type (which will be the monad) containing another pair of lists.

The definition of the Pattern type represents an instance of an "extractor" pattern (ahem) in Haskell; the contents of the P value is actually a record containing a single field, runPattern, which is the function. Because of the way that Haskell records work, applying the runPattern extractor to a record produces the function contained in the record, which can be immediately applied. Clever, no? But not especially interesting....

The two lists may require some explanation: the first represents the portion of the input that has already been matched by prior parts of the expression while the second represents the remainder of the input to be matched.

When I originally wrote this, the contents of the pairs were Strings, which happen to be lists of Chars. Subsequently, I realized I could generalize it further simply by removing the reference to the String type and using a plain list of elements. Further down, some operations add an Eq a requirement on the list elements, since they are compared. However, even with that change, the engine works on Strings as it did originally.


> newtype Pattern m a = P { runPattern :: ([a],[a]) -> m ([a],[a]) }

Most of the rest of the functions in the module represent specific types of regular expression elements. "done", for example represents the end of a pattern; whatever has already been matched is good enough, and whatever is left of the input string is will not be examined. The implementation of "done" is by the monadic "return" operation, which just passes the input pair into the result monad. The regular expression form of "done" is as the end of the pattern.


> done :: Monad m => Pattern m a
> done = P return

"end," on the other hand, requires the end of the input string. It also uses a MonadPlus type class constraint, since the concept of failure in monadic terms is represented by mzero; mzero is in turn represented by different special values for different MonadPlus instances: Nothing for the Maybe instance, the empty list for the list instance, and so on.. The regular expression form of "end" is as the metacharacter '$'.


> end :: MonadPlus m => Pattern m a
> end = P end'
>     where
>       end' (m,[]) = return (m,[])
>       end' _ = mzero

"any" is the opposite of "end." Where "end" requires the end of the input string, "any" requires any character of the input string. Note that the character matched is removed from the input string and added to the previously matched string. The regular expression form of "any" is the metacharacter '.'.


> any :: MonadPlus m => Pattern m a
> any = P any'
>     where
>       any' (m,(c:cs)) = return (m++[c],cs)
>       any' (_,[]) = mzero

"lit" is very similar to "any", except that it requires the input to match a given string, passed as an argument to create the Pattern. The regular expression form of "lit' is a non-metacharacter string.

This function is the first to add a constraint on the type a, Eq. The constraint is needed because the literal value must be compared to the input being matched. Fortunately, the String type is an alias for [Char], and there is an Eq Char instance.


> lit :: (MonadPlus m, Eq a) => [a] -> Pattern m a
> lit literal = P lit'
>     where
>       lit' (m,s) | literal `isPrefixOf` s = return (m++literal, drop len s)
>                  | otherwise = mzero
>       len = length literal

Positive ("cls") and negative ("ncls") character classes are also similar to "any" and "lit".


> cls :: (MonadPlus m, Eq a) => [a] -> Pattern m a
> cls set = P cls'
>     where
>       cls' (m,(c:cs)) | c `elem` set = return (m++[c], cs)
>                       | otherwise = mzero
>       cls' _ = mzero
> 
> ncls :: (MonadPlus m, Eq a) => [a] -> Pattern m a
> ncls set = P ncls'
>     where
>       ncls' (m,(c:cs)) | c `elem` set = mzero
>                        | otherwise = return (m++[c], cs)
>       ncls' _ = mzero

The remaining three functions are very different from the previous, base, pattern elements. The first two provide two ways of combining two Patterns to form a new Pattern, while the third modifies a Pattern to construct a new Pattern.

"seq" combines the two patterns sequentially, attempting the first pattern and then passing the result on to the second pattern. The regular expression form is just the concatenation of patterns.

"seq" is the first function to really use monadic capabilities, beyond the return function and mzero "failure" value. The monadic bind operator, "(>>=)," provides the primary sequencing in a regular expression, allowing the structure underlying the monad to do whatever it is that it does.


> seq :: Monad m => Pattern m a -> Pattern m a -> Pattern m a
> seq p1 p2 = P $ \ cur -> runPattern p1 cur >>= runPattern p2

Fundamentally, the "seq" operation tries the first pattern on the input pattern-matching state and then tries the second on the result. The "or" operation is significantly different. It applies the original, input pattern-matching state to both of the input patterns and combines the results using "mplus" rather than (>>=). If mzero represents failure and (>>=) represents sequencing, mplus represents alternation or a combination of the input patterns. If the first pattern fails, the second still has the chance to examine the input from the same state. The regular expression form of "or" is "|", as in "a|b", which matches "a..." or "b...".


> or :: MonadPlus m => Pattern m a -> Pattern m a -> Pattern m a
> or p1 p2 = P $ \ cur -> (runPattern p1 cur) `mplus` (runPattern p2 cur)

One of the downsides of nondeterministic regular expression engines such as this one is that some patterns and inputs can take exponential time to match. For example, consider the expression "a*aaa" and the input "aaa". Most nondeterministic regular expression engines will use the "a*" to match the entire string, then fail to find the "aaa" literal. The engine must then back up one character at a time until it can find the match for the whole string. The roots of the exponential explosion are the "or" operation above, which stores the current state for reuse, and the Kleene star operation, "kst", which uses "or" to implement recursive, depth-first search. The regular expression form of "kst" is "*", as in "a*".

"kst" implements greedy matching, like most regular expression engines. It attempts to match p, followed by a recursive call to (kst p); if either fails, kst calls "done", to indicate that it is happy to match whereever it is. [My original implementation here had the "done" call first, which implemented non-greedy matching. I blame operator error as to why I thought it had to be that way.]


> kst :: MonadPlus m => Pattern m a -> Pattern m a
> kst p = (p `seq` (kst p)) `or` done

So, how does it work?

*Pattern> runPattern done ("","abc") :: Maybe (String,String)
Just ("","abc")
*Pattern> runPattern end ("","abc") :: Maybe (String,String)
Nothing
*Pattern> runPattern end ("","") :: Maybe (String,String)
Just ("","")
*Pattern> runPattern (lit "a") ("","a") :: Maybe (String,String)
Just ("a","")
*Pattern> runPattern (lit "b") ("","a") :: Maybe (String,String)
Nothing
*Pattern> runPattern ((lit "a") `seq` (lit "b")) ("","a") :: Maybe (String,String)
Nothing
*Pattern> runPattern ((lit "a") `seq` (lit "b")) ("","ab") :: Maybe (String,String)
Just ("ab","")
*Pattern> runPattern ((lit "a") `seq` (lit "b")) ("","b") :: Maybe (String,String)
Nothing
*Pattern> runPattern ((lit "a") `or` (lit "b")) ("","ab") :: Maybe (String,String)
Just ("a","b")
*Pattern> runPattern ((lit "a") `or` (lit "b")) ("","ba") :: Maybe (String,String)
Just ("b","a")
*Pattern> runPattern (kst (lit "a")) ("","ba") :: Maybe (String,String)
Just ("","ba")
*Pattern> runPattern (kst (lit "a")) ("","aba") :: Maybe (String,String)
Just ("a","ba")
*Pattern> runPattern (kst (lit "a")) ("","aaba") :: Maybe (String,String)
Just ("aa","ba")

These examples show the output of the eengine in the Maybe monad, which provides the first match.

If Maybe were the only option, the monadic approach to the code would not have much to recommend it. Fortunately, it is not, and the List monad can be used to create a list of successes:

*Pattern> runPattern done ("","abc") :: [(String,String)]
[("","abc")]
*Pattern> runPattern end ("","") :: [(String,String)]
[("","")]
*Pattern> runPattern end ("","abc") :: [(String,String)]
[]
*Pattern> runPattern (lit "a") ("","abc") :: [(String,String)]
[("a","bc")]
*Pattern> runPattern (lit "b") ("","abc") :: [(String,String)]
[]
*Pattern> runPattern ((lit "a") `seq` (lit "b")) ("","abc") :: [(String,String)]
[("ab","c")]
*Pattern> runPattern ((lit "a") `seq` (lit "b")) ("","a") :: [(String,String)]
[]
*Pattern> runPattern ((lit "a") `seq` (lit "b")) ("","ba") :: [(String,String)]
[]
*Pattern> runPattern ((lit "a") `or` (lit "b")) ("","ab") :: [(String,String)]
[("a","b")]
*Pattern> runPattern ((lit "a") `or` (lit "b")) ("","ba") :: [(String,String)]
[("b","a")]
*Pattern> runPattern ((lit "b") `or` (lit "ba")) ("","ba") :: [(String,String)]
[("b","a"),("ba","")]
*Pattern> runPattern (kst (lit "a")) ("","ba") :: [(String,String)]
[("","ba")]
*Pattern> runPattern (kst (lit "a")) ("","aba") :: [(String,String)]
[("a","ba"),("","aba")]
*Pattern> runPattern (kst (lit "a")) ("","aaba") :: [(String,String)]
[("aa","ba"),("a","aba"),("","aaba")]

There are some small but very significant differences between the previous uses and the latest. For one thing, the result is a List rather than a Maybe value; an empty list for failure and a list of one or more pairs for success. The interesting bit is the "or more" part: the "or" operation can provide two potential matching patterns, in which case both appear in the output. Likewise, the Kleene star operation always succeeds at least once, but can succeed potentially many times.

As for the generalization to lists of type a rather than Strings, check out these interesting but not otherwise useful examples:

*Pattern> runPattern (lit ["a"]) ([],["a","b","c"]) :: [([String],[String])]
[(["a"],["b","c"])]
*Pattern> runPattern (lit ["a","b"]) ([],["a","b","c"]) :: [([String],[String])]
[(["a","b"],["c"])]
*Pattern> runPattern (lit [1,2]) ([],[1,2,3]) :: [([Int],[Int])]
[([1,2],[3])]

Postscript

I could not write a post about regular expressions without linking to The Prime That Wasn’t, just to show an extended regular expression (which is not supported by this current engine) that is more powerful than (non-extended) regular languages should be.

Comments

You should really give real world Haskell a read because it would explain monads better because programming in an imperative language is not the same as using a monad. They were only made to look and feel the same for the purpose of feeling like you could write this imperative, line by line, code in Haskell. That was one of the reasons that IO in Haskell is a Monad. A monad in Haskell still follows rules and has a type whereas, in a language like C, you can do whatever you want whenever you want. No rules.

Anonymous
2010-08-08

Anonymous, you seem confused. "Following rules" and "having a type" are totally independent from whether something is imperative, or whether it's monadic.

Monads are not, in general, about imperative programming. But I think it's fair to say that the IO monad (and a few others, like ST) are imperative mini-languages embedded within Haskell.

Anonymous
2010-08-08

The cool thing about parameterizing on the monad is that you should be able to get an implementation that is not backtracking by simply switching to an appropriate monad.

Cool stuff!

Josef
2010-08-09

Anonymous #1: I know, I know. I really do need to read RWH.

I have to differ with the rest of your comment, though. The monadic structure, in the IO implementation anyway, is imperative code. Beyond the fact that IO is a dumping ground of non-IO things like mutable variables, I would be willing to bet that Haskell's IO monad is the same mathematical structure as sequential operations in any non-pure language. The advantage of Haskell is that it exposes that monadic structure to other implementations, which gives you all sorts of useful features as well as letting you isolate particular state operations from other state operations as well as from IO, specifically with things like monad transformers.

Anonymous #2: You're right, the type system is independent from imperative or monadic code, although it is how you choose what monad to use. But the IO monad specifically is not an "imperative mini-language"; it is an imperative language, although one that is missing stuff like loops.

Josef: That was exactly the idea that I started with when I was playing with the regex engine. My original was pretty much specific to Maybe, although using a monadic structure cleaned it up greatly. What I wanted to do was to generalize it to get things like backtracking.

I am still a little upset that doing that seemed to require the Kleene star to be non-greedy; there should be some simple modification that I am missing, that would not require that. And I need to find some other monads to see what neat things they'll do with the same code.

Tommy McGuire
2010-08-10