Variations on the theme of monadic regular expressions: Records

Posted on September 13, 2010 by Tommy McGuire

Labels: notation, toy problems, haskell, programming language

Like the previous very abstract pattern module, this implementation of a regular expression engine has very little reason to exist. Mostly it demonstrates the use of Haskell records rather than the pair (of Strings and other things) used in previous versions. As it turns out, I do not particularly like the record syntax, for reasons that should become clear as I go through this monstrosity.

However, this module is used as the basis of the next version, which provides back-references in patterns, making that module strictly stronger than typical regular expressions. Since that version uses records and is considerably more complex, it seemed like a good idea to start simpler here.

> module Pattern (PatternState(PS), newState, remain, matched,
>                 Pattern, runPattern, runPattern', any, end, done,
>                 litc, lits, cls, ncls, seq, or, kst) where
> 
> import Prelude hiding (or,seq,length,drop,elem,any)
> import qualified List as L
> import Control.Monad
> import Control.Monad.State

A PatternState replaces the pair. Although this example is isomorphic to the pair and obviously gratuitous, records in general are useful to bind together more elements than a tuple could reasonably manage.

Records in Haskell are very similar to tuples: they contain a fixed number of values of any type, where the types within a given tuple are heterogeneous. To access the elements of the record, it generates accessor functions named after the record elements. These functions take a record value and project out the element value corresponding to the function name. In this case, since both elements are Strings, the two functions are, effectively,

matched :: PatternState -> String
remain :: PatternState -> String

Records are always immediately preceded by a type constructor, in this case PS. In fact, the record is tightly connected to the type constructor, which has a type of String -> String -> PatternState.

In the case of records, there is an alternative syntax to construct a value:

PS { matched = "abc", remain = "def" }

This syntax, like records themselves, is more applicable when there are many fields in the record. I have generally used the function syntax to create PatternState values because it is marginally shorter in this instance.

Updates are the other record peculiarity. ("Update" is perhaps a misnomer since these records are purely functional.) Rather than actually updating a record value, the idea is to create a new record value, based on an existing value with specified differences. The syntax is:

ps { matched = "ghi" }

where ps is an existing PatternState value.

> data PatternState = PS { matched :: String, remain :: String } deriving Show

The next few functions are utilities for PatternState values.

newState creates a PatternState from a String.
empty is a predicate which is true when all of the input text has been matched.
next returns the next character in the input. It is not total, and unsafe to use if the PatternState is empty.
advance moves the first character from the input to the end of the matched string. It to is not total
advanceN does the same for n characters.

> newState :: String -> PatternState
> newState str = PS "" str
> 
> empty :: PatternState -> Bool
> empty ps = remain ps == ""
> 
> next :: PatternState -> Char
> next ps = head $ remain ps
> 
> advance :: PatternState -> PatternState
> advance ps = PS ((matched ps) ++ [h]) t
>     where
>       (h:t) = remain ps
> 
> advanceN :: PatternState -> Int -> PatternState
> advanceN ps n = PS ((matched ps) ++ pre) suf
>     where
>       (pre,suf) = splitAt n $ remain ps

If you followed the discussion of records, you know how the Pattern extractor thing works. P is a type constructor that takes a function PatternState -> m PatternState (where m will be the underlying monad) and produces a Pattern. runPattern' is a function taking a Pattern and returning the original function, which can be applied to evaluate the pattern.

> newtype Pattern m = P { runPattern' :: PatternState -> m PatternState }

The function runPattern is a utility that constructs an initial PatternState and then executes the given Pattern.

> runPattern :: Pattern m -> String -> m PatternState
> runPattern pat s = runPattern' pat $ PS "" s

Because it is a very common operation in the pattern combinators below, madvance is separated out as a function. It calls advance on a PatternState and then returns the PatternState into the monad m.

> madvance :: Monad m => PatternState -> m PatternState
> madvance = return . advance

The majority of the combinators are very similar to their previous incarnations, with the exception that these functions do not use pattern matching to deconstruct their inputs but instead use the record and utility functions from above.

> done :: (Monad m) => Pattern m
> done = P return
>
> end :: (MonadPlus m) => Pattern m
> end = P $ \ps -> if empty ps then return ps else mzero
>
> any :: (MonadPlus m) => Pattern m
> any = P $ \ps -> if empty ps then mzero else madvance ps
> 
> litc :: (MonadPlus m) => Char -> Pattern m
> litc c = P $ \ps -> if empty ps || (next ps) /= c then mzero else madvance ps
> 
> lits :: (MonadPlus m) => String -> Pattern m
> lits s = P $ \ps -> if empty ps || not (s `L.isPrefixOf` (remain ps))
>                        then mzero
>                        else return $ advanceN ps (L.length s)
> 
> cls :: (MonadPlus m) => String -> Pattern m
> cls set = P $ \ps -> if empty ps || not ((next ps) `L.elem` set)
>                         then mzero
>                         else madvance ps
> 
> ncls :: (MonadPlus m) => String -> Pattern m
> ncls set = P $ \ps -> if empty ps || (next ps) `L.elem` set
>                          then mzero
>                          else madvance ps

On the other hand, the combining operations, seq, or, and kst, are almost identical to their corresponding previous versions.

> seq :: (Monad m) => Pattern m -> Pattern m -> Pattern m
> seq p1 p2 = P $ \ps -> runPattern' p1 ps >>= runPattern' p2
> 
> or :: (MonadPlus m) => Pattern m -> Pattern m -> Pattern m
> or p1 p2 = P $ \ps -> (runPattern' p1 ps) `mplus` (runPattern' p2 ps)
> 
> kst :: MonadPlus m => Pattern m -> Pattern m
> kst p = (p `seq` (kst p)) `or` done

The resulting engine runs as expected. Yay.

*Pattern> runPattern done "abc" :: [PatternState]
[PS {matched = "", remain = "abc"}]
*Pattern> runPattern end "abc" :: [PatternState]
[]
*Pattern> runPattern end "" :: [PatternState]
[PS {matched = "", remain = ""}]
*Pattern> runPattern (litc 'a') "" :: [PatternState]
[]
*Pattern> runPattern (litc 'a') "a" :: [PatternState]
[PS {matched = "a", remain = ""}]
*Pattern> runPattern (litc 'a') "b" :: [PatternState]
[]
*Pattern> runPattern ((litc 'a') `seq` (litc 'b')) "b" :: [PatternState]
[]
*Pattern> runPattern ((litc 'a') `seq` (litc 'b')) "a" :: [PatternState]
[]
*Pattern> runPattern ((litc 'a') `seq` (litc 'b')) "ab" :: [PatternState]
[PS {matched = "ab", remain = ""}]
*Pattern> runPattern ((litc 'a') `or` (litc 'b')) "b" :: [PatternState]
[PS {matched = "b", remain = ""}]
*Pattern> runPattern ((litc 'a') `or` (litc 'b')) "a" :: [PatternState]
[PS {matched = "a", remain = ""}]
*Pattern> runPattern ((litc 'a') `or` (litc 'b')) "ab" :: [PatternState]
[PS {matched = "a", remain = "b"}]
*Pattern> runPattern (kst (litc 'a')) "b" :: [PatternState]
[PS {matched = "", remain = "b"}]
*Pattern> runPattern (kst (litc 'a')) "ab" :: [PatternState]
[PS {matched = "a", remain = "b"},PS {matched = "", remain = "ab"}]
*Pattern> runPattern (kst (litc 'a')) "aab" :: [PatternState]
[PS {matched = "aa", remain = "b"},PS {matched = "a", remain = "ab"},PS {matched = "", remain = "aab"}]