Variations on the theme of monadic regular expressions: Abstraction

Posted on September 6, 2010 by Tommy McGuire
Labels: toy problems, haskell, programming language
Regular expressions are a more general tool than appears in normal usage. They can be used to specify any finite state machine, although without extensions to the expressions such machines would have no visible behavior beyond accepting or rejecting. (They're also both a floor wax and a dessert topping, but are artificial whipped cream and a some kind of very lame floor wax, if such a thing is possible.)

In any case, it might be useful to have a regular expression engine that can operate on different collections than Haskell's List and on elements of a different type than Char. (In Haskell, a String is a List of Char's.) That is the purpose of this variation on the monadic regular expression engine.

> module AbstractEngine where
>
> import Prelude hiding (or,seq,length,drop,elem)
> import Control.Monad
> import List as L hiding (or,drop,elem)


In addition to the underlying monad and the base element type, the Pattern type must be parameterized by the type of the collection of elements, s. As with the previous iteration, the two elements of the pair are the previously matched text and the yet-to-be-examined text.

> newtype Pattern m s a = P { runPattern :: (s a, s a) -> m (s a, s a) }


Most of the regular expression operations are very similar to their counterparts in the previous version. Certainly, done is.

> done :: Monad m => Pattern m s a
> done = P return


The end function is different in two respects. First, it adds a type class, Q, which must be instanced by the collection s. Second, instead of using pattern matching to examine the state of the input, it uses the isEmpty predicate, part of the Q type class.

> end :: (Q s, MonadPlus m) => Pattern m s a
> end = P $ \ (m,e) -> if isEmpty e then return (m,e) else mzero


Likewise, any uses the isEmpty predicate in addition to the deq function to decompose the input and the enq function to re-compose the matched output. The next function, lit, adds prefix, append, drop, and len.

> any :: (Q s, MonadPlus m) => Pattern m s a
> any = P $ \ (m,s) -> let (hd,tl) = deq s in
>                          if isEmpty s then mzero else return (m `enq` hd, tl)
>
> lit :: (MonadPlus m, Q s, Eq a) => s a -> Pattern m s a
> lit l = P $ \ (m,s) -> if l `prefix` s then return (m `append` l, drop (len l) s) else mzero


The cls and ncls functions are made up of a plethora of functions from the Q type class. Part of the reason is that, for simplicity, I chose to represent the set arguments to cls and ncls as values of the type of the collections.

> cls :: (Q s, MonadPlus m, Eq a) => s a -> Pattern m s a
> cls set = P cls'
>     where
>       cls' (m,s) | (isEmpty s) || not (hd `elem` set)  = mzero
>                  | otherwise                           = return (m `enq` hd, tl)
>                  where
>                    (hd,tl) = deq s
>
> ncls :: (Q s, MonadPlus m, Eq a) => s a -> Pattern m s a
> ncls set = P ncls'
>     where
>       ncls' (m,s) | (isEmpty s) || (hd `elem` set)  = mzero
>                   | otherwise                       = return (m `enq` hd, tl)
>                  where
>                    (hd,tl) = deq s


The seq, or, and kst combinators are identical to the previous version, with the exception of the additional type parameter.

> seq :: Monad m => Pattern m s a -> Pattern m s a -> Pattern m s a
> seq p1 p2 = P $ \ cur -> runPattern p1 cur >>= runPattern p2
>
> or :: MonadPlus m => Pattern m s a -> Pattern m s a -> Pattern m s a
> or p1 p2 = P $ \ cur -> (runPattern p1 cur) `mplus` (runPattern p2 cur)
>
> kst :: MonadPlus m => Pattern m s a -> Pattern m s a
> kst p = (p `seq` (kst p)) `or` done


The type class Q is the key to this version of the regular expression engine. Q is used as a constraint on the collection type of the input, and is named somewhat appropriately, Any instance of this class must supply the basic functions isEmpty, enq, and deq.



> class Q s where
>     isEmpty :: s a -> Bool
>     enq     :: s a -> a -> s a
>     deq     :: s a -> (a, s a)


The remaining functions in the type class are provided with implementations based on the previous functions. For the most part, they are fairly simple exercises in basic Haskell, if of limited style, given their reliance on only the three base Q functions.

>     len :: s a -> Int
>     len s | isEmpty s  = 0
>           | otherwise  = 1 + (len tl)
>           where
>           (_,tl) = deq s
>
>     prefix :: Eq a => s a -> s a -> Bool
>     prefix s t | isEmpty s   = True
>                | isEmpty t   = False
>                | hds == hdt  = prefix tls tlt
>                | otherwise   = False
>                where
>                (hds, tls) = deq s
>                (hdt, tlt) = deq t
>
>     drop :: Int -> s a -> s a
>     drop 0 s = s
>     drop i s = let (_,tl) = deq s in
>                    if isEmpty s then s else drop (i-1) tl
>
>     elem :: Eq a => a -> s a -> Bool
>     elem a set | isEmpty set  = False
>                | a == elt     = True
>                | otherwise    = elem a rem
>                where
>                (elt,rem) = deq set
>
>     append :: s a -> s a -> s a
>     append s t | isEmpty t  = s
>                | otherwise  = (s `enq` elt) `append` rem
>                where
>                (elt,rem) = deq t


The instance declaration of Q for Lists allows the engine to handle Strings as well as potentially other structures, and illustrates how to develop a suitable instance for another collection type.

> instance Q [] where
>     isEmpty = L.null
>     enq l a = l ++ [a]
>     deq q   = (L.head q, L.tail q)


The point of this article is not necessarily to present the wondrous generic regular expression parser. If there is any point to it at all, it is the technique of using a type class to parameterize a module and to require, and supply, the minimal set of features for necessary functionality. The same technique is applicable to almost any language.
active directory applied formal logic ashurbanipal authentication books c c++ comics conference continuations coq data structure digital humanities Dijkstra eclipse virgo electronics emacs goodreads haskell http java job Knuth ldap link linux lisp math naming nimrod notation OpenAM osgi parsing pony programming language protocols python quote R random REST ruby rust SAML scala scheme shell software development system administration theory tip toy problems unix vmware yeti
Member of The Internet Defense League
Site proudly generated by Hakyll.