Variations on the theme of monadic regular expressions: Back references

Posted on September 14, 2010 by Tommy McGuire
Labels: notation, toy problems, haskell, programming language
Adding the capability of matching previously-matched input, or back references, completely changes the power of a regular expression engine. In fact, as a result the engine is strictly more powerful than the class of regular expressions. The key reason why is that back references add a storage capability.

This module reinforces that key difference by showing an implementation of a monadic regular expression engine that includes back references, by building on the previous Pattern module and by explicitly implementing the state.

> module BackrefPattern (PatternState, newState, matched, remain,
> Pattern, runPattern', runPattern, done, end,
> any, litc, lits, cls, ncls, seq, or, kst,
> stor, ref) where
> import qualified Pattern as P
> import qualified Data.Map as M
> import qualified List as L
> import Prelude hiding (or,seq,any)
> import Control.Monad.State

The storage facility is represented by the bak element of this module's PatternState record. The bak element is a mapping from integers to strings, where the integer is the "number" of the reference and the string is the previously matched text. The cur element is used to record the number of matching references the engine has seen.

This module does not implement most of the regular expression combinators. Instead, it uses the previous Pattern module, and thus incorporates that module's state, in the form of the st :: P.PatternState element.

> data PatternState = PS {
> cur :: Int,
> bak :: M.Map Int String,
> st :: P.PatternState
> } deriving Show

The newstate function is a utility used to simplify constructing a PatternState value.

> newState :: String -> PatternState
> newState str = PS { cur = 0, bak = M.empty, st = P.newState str }

The matched and remain functions make accessing the elements of the st element easier, in effect flattening the accessors to the contained record. An alternative way of looking at this is that the lower-level P.matched and P.remain functions are lifted into this module by first applying the st accessor to get the lower-level state.

> matched, remain :: PatternState -> String
> matched = P.matched . st
> remain = P.remain . st

The next declaration is where things get exciting.

Heretofore, the regular expression engines have passed their own states around directly, which works acceptibly becaues those states are relatively simple. With the addition of the new state elements, I thought it worthwhile to add a State monad, which controls access to the PatternState while the engine is being evaluated.

In fact, since this engine is based on the lower-level monad m, I use the StateT monad transformer. Monad transformers act as "layerable" monads: a transformer applied to an underlying monad based on the underlying monad but adding the new capabilities of the transformer.

StateT is a transformer which adds a modifyable state to the computation. Picking apart the type declaration below, StateT s m a is a monad transformer, carrying a state of type s through its computation. m is the type of the underlying monad, and a is the result of the current computation. (For example, in the IO monad, the getChar function has the type IO Char; Char is the type of the result of a getChar computation.) In this case, the state is a PatternState and the result of the computations will always be (). (All of the information necessary for regular expression matching is carried through the PatternState.)

> type Pattern m = StateT PatternState m ()

The runPattern' function is similar to the similarly named function in the Pattern module. In this case, the StateT monad transformer provides a runStateT function, which accepts a Pattern m and an initial PatternState and which executes the computation described by the Pattern m in a similar way to the use of runPattern' in the Pattern module, In that case, runPattern' exposed the function contained in the P.Pattern value and subsequently applied it to an initial P.PatternState value. In this module, runStateT works similarly (if I understand it correctly) and returns a value of the form m ((), PatternState); that value is then forwarded to the second expression in runPattern', which discards the unnecessary ().

> runPattern' :: Monad m => Pattern m -> PatternState -> m PatternState
> runPattern' pat ps = runStateT pat ps >>= return . snd

The runPattern function is a utility which constructs an initial PatternState from the input String and then calls runPattern'. It forms the basic interface to this regular expression engine.

> runPattern :: Monad m => Pattern m -> String -> m PatternState
> runPattern pat = runPattern' pat . newState

The following collection of functions implement the base combinators of the regular expression engine. They do it by lifting the corresponding function from the Pattern module into this module. The idea of lifting a function from one computational context into another is very important, particularly when dealing with monad transformers. Since transformers organize monads into layers, operations in the lower-level monad must be lifted into the higher-level layer in order to be used there. For more examples of monad transformer layers and lifting and the hideous, hideous code needed (in addition to the boilerplate pasted over records), see Chapter 7 and Chapter 8 of Software Tools in Haskell. You'll probably regret it.

> done, end, any :: MonadPlus m => Pattern m
> done = runP P.done
> end = runP P.end
> any = runP P.any
> litc :: MonadPlus m => Char -> Pattern m
> litc c = runP $ P.litc c
> lits :: MonadPlus m => String -> Pattern m
> lits c = runP $ P.lits c
> cls, ncls :: MonadPlus m => String -> Pattern m
> cls set = runP $ P.cls set
> ncls set = runP $ P.ncls set

The actual lifting is done by the runP function. This function takes a pattern combinator from the Pattern module and uses it to create a Pattern in this module with similar behavior. The actual definition of runP below is a bit of a one-liner, but it expands into the following function:

-- runP pat = do
-- pps <- gets st
-- pps' <- P.runPattern' pat pps
-- modify (\ps -> ps { st = pps' })

The basic idea is that the current lower-level engine state, pps (think,, is recovered from the StateT, then this lower-level state is used to run the lower-level pattern, creating a new lower-level state, pps'. The new lower-level state is then injected into the PatternState maintained by StateT.

The one-line version is:

> runP pat = gets st >>= P.runPattern' pat >>= \pps' -> modify $ \ps -> ps { st = pps' }

I would like to provide a type for runP, really I would. In fact, I would like to provide this type for it:
-- runP :: MonadPlus m => P.Pattern m -> Pattern m
Unfortunately, that type makes ghc unhappy:
Occurs check: cannot construct the infinite type:
m = StateT PatternState m
When generalising the type(s) for `runP'
That is, in spite of the fact that runP acts as if it had that type.

According to ghci, runP actually has the type:
-- runP :: MonadState PatternState m => P.Pattern m -> m ()
which seems to be to be very wrong. As if to complete my confusion, ghc returns the following error. (I chose not to use any extensions to Haskell, so I do not know what would happen if I used -XFlexibleContexts.)
Non type-variable argument
in the constraint: MonadState PatternState m
(Use -XFlexibleContexts to permit this)
In the type signature for `runP':
runP :: (MonadState PatternState m) => P.Pattern m -> m ()
If anyone can help resolve my confusion, I would greatly appreciate it.

Fortunately, runP works even if I am not sure why.

The remainder of the regular expression combinators are more straightforward to implement directly.

> seq :: Monad m => Pattern m -> Pattern m -> Pattern m
> seq p1 p2 = p1 >> p2
> or :: MonadPlus m => Pattern m -> Pattern m -> Pattern m
> or p1 p2 = p1 `mplus` p2
> kst :: MonadPlus m => Pattern m -> Pattern m
> kst p = (p `seq` (kst p)) `or` done

We are now getting to the meat of this module. In addition to the normal regular expression combinators, it adds stor and ref. The stor function is similar to kst in that it takes as an argument a Pattern m and produces another Pattern m. When evaluated, stor runs its contained pattern and, if the pattern matches, stores the matched text in the bak map under the next back reference number.

Specifically, it increments cur, reserving a position in the map. Then, it runs the pattern with a new lower-level state. When the runPattern completes, the matched text will be the text to record; this is safe because the matched text is not relevant to the behavior of the pattern on the input. Finally, it records the matched text and new lower-level state via put.

> stor p = do
> modify $ \ ps -> ps { cur = (cur ps) + 1 }
> ps <- get
> ps' <- runPattern' p $ ps { st = P.newState (remain ps) }
> put $ ps {
> bak = M.insert (cur ps) (matched ps') (bak ps'),
> st = P.PS ((matched ps) ++ (matched ps')) (remain ps')
> }

The implementation of stor as a combinator causes a bit of odd behavior. Consider the regular expression "((a)*)", or as represented by combinators, "stor (kst (stor (litc 'a')))", on the string "aa"; most engines would produce two mappings, { 1 => "aa", 2 => "a" }. This engine will potentially produce three: { 1 => "aa", 2 => "a", 3 => "a" }. The 1 value represents the outer stor matching a*, and the 2 represents the inner stor matching against the first a, while the 3 represents the inner stor matching against the second a. Bizzare, but probably fixable.

stor, however, has two bigger issues: first, like runP, it doesn't type right, and second, it does not correctly backtrack unless m is the List monad.

First, the type I would like to assing to stor is Monad m => Pattern m -> Pattern m; the type that ghci gives it is (MonadState PatternState m) => Pattern m -> m ().

Second, stor does not actually backtrack at all. If the lower-level monad m is Maybe then this function runs normally, recording the first text matched by the pattern p. If subsequent combinators fail, the recorded text is never revised, meaning that a PatternState which could succeed is never examined. Fortunately, everything works correctly if the lower-level monad is List.

*BackrefPattern> runPattern (stor (kst $ litc 'a') `seq` (ref 1) `seq` end) "aa" :: Maybe PatternState
*BackrefPattern> runPattern (stor (kst $ litc 'a') `seq` (ref 1) `seq` end) "aa" :: [PatternState]
[PS {cur = 1, bak = fromList [(1,"a")], st = PS {matched = "aa", remain = ""}}]

The ref combinator, on the other hand, is fairly simple and acts much like the lits combinator.

> ref :: MonadPlus m => Int -> Pattern m
> ref n = do
> ps <- get
> r <- gets remain
> m <- gets matched
> case M.lookup n (bak ps) of
> Nothing -> mzero
> Just s | s `L.isPrefixOf` r -> put $ ps { st = P.PS (m ++ s) (drop (length s) r) }
> | otherwise -> mzero

Now, for the real reason I did this, and the reason the bugs and type issues don't matter: the world's most (maybe) inefficient primality test (described in Avinash Meetoo: Blog and in detail in Perl tricks by Neil Kandalgaonkar and originally seen in Ruby).

The plu combinator is the + regular expression special character: it causes the pattern to match one or more times, rather than kst's zero or more times. The opt combinator is the ? special character, which matches zero or one times.

> plu :: MonadPlus m => Pattern m -> Pattern m
> plu pat = pat `seq` (kst pat)
> opt :: MonadPlus m => Pattern m -> Pattern m
> opt pat = pat `or` done

The one function is a Pattern matching a literal '1'.

> one :: MonadPlus m => Pattern m
> one = litc '1'

The primePattern is the regular expression that will check if a number is prime or not.

> primePattern :: MonadPlus m => Pattern m
> primePattern = ((opt (litc '1')) `or` (stor (one `seq` (plu one))) `seq` (plu (ref 1))) `seq` end

isPrime attempts to match the primePattern against a sequence of n '1' characters. If it matches, n is not prime, otherwise n is prime. Yay!

> isPrime n = null $ runPattern primePattern sequence
> where
> sequence = replicate n '1'

In practice:

*BackrefPattern> take 10 $ filter isPrime [0..]

Seems pretty snappy until I get up to 650 or so. The non-greedy version of kst might be useful.

But no! This is the last post on monadic regular expressions for a while! Really! Something completely different, if not more interesting, is coming.


I mentioned in the post my problems typing a couple of the functions, runP and stor. Today, I was reading The Gateway Drug to Type Programming and Haskell features I'd like to see in other languages, and I realized that rank-2 types might work. I came up with the following:

> runP :: MonadPlus m => (forall m. MonadPlus m => P.Pattern m) -> (Pattern m)
> runP pat = ...


> stor :: MonadPlus m => (forall m. MonadPlus m => Pattern m) -> Pattern m
> stor p = ...

In the case of runP, I kind of understand it; the Monad for P.Pattern may not be the same Monad as for Pattern, conceptually, as far as the type checker thinks.

Clearly, I need to do some reading on rank-n types.

Tommy McGuire
active directory applied formal logic ashurbanipal authentication books c c++ comics conference continuations coq data structure digital humanities Dijkstra eclipse virgo electronics emacs goodreads haskell http java job Knuth ldap link linux lisp math naming nimrod notation OpenAM osgi parsing pony programming language protocols python quote R random REST ruby rust SAML scala scheme shell software development system administration theory tip toy problems unix vmware yeti
Member of The Internet Defense League
Site proudly generated by Hakyll.