Variations on the theme of monadic regular expressions: Records

Posted on September 13, 2010 by Tommy McGuire
Labels: notation, toy problems, haskell, programming language
Like the previous very abstract pattern module, this implementation of a regular expression engine has very little reason to exist. Mostly it demonstrates the use of Haskell records rather than the pair (of Strings and other things) used in previous versions. As it turns out, I do not particularly like the record syntax, for reasons that should become clear as I go through this monstrosity.

However, this module is used as the basis of the next version, which provides back-references in patterns, making that module strictly stronger than typical regular expressions. Since that version uses records and is considerably more complex, it seemed like a good idea to start simpler here.

> module Pattern (PatternState(PS), newState, remain, matched,
> Pattern, runPattern, runPattern', any, end, done,
> litc, lits, cls, ncls, seq, or, kst) where
> import Prelude hiding (or,seq,length,drop,elem,any)
> import qualified List as L
> import Control.Monad
> import Control.Monad.State

A PatternState replaces the pair. Although this example is isomorphic to the pair and obviously gratuitous, records in general are useful to bind together more elements than a tuple could reasonably manage.

Records in Haskell are very similar to tuples: they contain a fixed number of values of any type, where the types within a given tuple are heterogeneous. To access the elements of the record, it generates accessor functions named after the record elements. These functions take a record value and project out the element value corresponding to the function name. In this case, since both elements are Strings, the two functions are, effectively,
matched :: PatternState -> String
remain :: PatternState -> String

Records are always immediately preceded by a type constructor, in this case PS. In fact, the record is tightly connected to the type constructor, which has a type of String -> String -> PatternState.

In the case of records, there is an alternative syntax to construct a value:
PS { matched = "abc", remain = "def" }
This syntax, like records themselves, is more applicable when there are many fields in the record. I have generally used the function syntax to create PatternState values because it is marginally shorter in this instance.

Updates are the other record peculiarity. ("Update" is perhaps a misnomer since these records are purely functional.) Rather than actually updating a record value, the idea is to create a new record value, based on an existing value with specified differences. The syntax is:
ps { matched = "ghi" }
where ps is an existing PatternState value.

> data PatternState = PS { matched :: String, remain :: String } deriving Show

The next few functions are utilities for PatternState values.

> newState :: String -> PatternState
> newState str = PS "" str
> empty :: PatternState -> Bool
> empty ps = remain ps == ""
> next :: PatternState -> Char
> next ps = head $ remain ps
> advance :: PatternState -> PatternState
> advance ps = PS ((matched ps) ++ [h]) t
> where
> (h:t) = remain ps
> advanceN :: PatternState -> Int -> PatternState
> advanceN ps n = PS ((matched ps) ++ pre) suf
> where
> (pre,suf) = splitAt n $ remain ps

If you followed the discussion of records, you know how the Pattern extractor thing works. P is a type constructor that takes a function PatternState -> m PatternState (where m will be the underlying monad) and produces a Pattern. runPattern' is a function taking a Pattern and returning the original function, which can be applied to evaluate the pattern.

> newtype Pattern m = P { runPattern' :: PatternState -> m PatternState }

The function runPattern is a utility that constructs an initial PatternState and then executes the given Pattern.

> runPattern :: Pattern m -> String -> m PatternState
> runPattern pat s = runPattern' pat $ PS "" s

Because it is a very common operation in the pattern combinators below, madvance is separated out as a function. It calls advance on a PatternState and then returns the PatternState into the monad m.

> madvance :: Monad m => PatternState -> m PatternState
> madvance = return . advance

The majority of the combinators are very similar to their previous incarnations, with the exception that these functions do not use pattern matching to deconstruct their inputs but instead use the record and utility functions from above.

> done :: (Monad m) => Pattern m
> done = P return
> end :: (MonadPlus m) => Pattern m
> end = P $ \ps -> if empty ps then return ps else mzero
> any :: (MonadPlus m) => Pattern m
> any = P $ \ps -> if empty ps then mzero else madvance ps
> litc :: (MonadPlus m) => Char -> Pattern m
> litc c = P $ \ps -> if empty ps || (next ps) /= c then mzero else madvance ps
> lits :: (MonadPlus m) => String -> Pattern m
> lits s = P $ \ps -> if empty ps || not (s `L.isPrefixOf` (remain ps))
> then mzero
> else return $ advanceN ps (L.length s)
> cls :: (MonadPlus m) => String -> Pattern m
> cls set = P $ \ps -> if empty ps || not ((next ps) `L.elem` set)
> then mzero
> else madvance ps
> ncls :: (MonadPlus m) => String -> Pattern m
> ncls set = P $ \ps -> if empty ps || (next ps) `L.elem` set
> then mzero
> else madvance ps

On the other hand, the combining operations, seq, or, and kst, are almost identical to their corresponding previous versions.

> seq :: (Monad m) => Pattern m -> Pattern m -> Pattern m
> seq p1 p2 = P $ \ps -> runPattern' p1 ps >>= runPattern' p2
> or :: (MonadPlus m) => Pattern m -> Pattern m -> Pattern m
> or p1 p2 = P $ \ps -> (runPattern' p1 ps) `mplus` (runPattern' p2 ps)
> kst :: MonadPlus m => Pattern m -> Pattern m
> kst p = (p `seq` (kst p)) `or` done

The resulting engine runs as expected. Yay.

*Pattern> runPattern done "abc" :: [PatternState]
[PS {matched = "", remain = "abc"}]
*Pattern> runPattern end "abc" :: [PatternState]
*Pattern> runPattern end "" :: [PatternState]
[PS {matched = "", remain = ""}]
*Pattern> runPattern (litc 'a') "" :: [PatternState]
*Pattern> runPattern (litc 'a') "a" :: [PatternState]
[PS {matched = "a", remain = ""}]
*Pattern> runPattern (litc 'a') "b" :: [PatternState]
*Pattern> runPattern ((litc 'a') `seq` (litc 'b')) "b" :: [PatternState]
*Pattern> runPattern ((litc 'a') `seq` (litc 'b')) "a" :: [PatternState]
*Pattern> runPattern ((litc 'a') `seq` (litc 'b')) "ab" :: [PatternState]
[PS {matched = "ab", remain = ""}]
*Pattern> runPattern ((litc 'a') `or` (litc 'b')) "b" :: [PatternState]
[PS {matched = "b", remain = ""}]
*Pattern> runPattern ((litc 'a') `or` (litc 'b')) "a" :: [PatternState]
[PS {matched = "a", remain = ""}]
*Pattern> runPattern ((litc 'a') `or` (litc 'b')) "ab" :: [PatternState]
[PS {matched = "a", remain = "b"}]
*Pattern> runPattern (kst (litc 'a')) "b" :: [PatternState]
[PS {matched = "", remain = "b"}]
*Pattern> runPattern (kst (litc 'a')) "ab" :: [PatternState]
[PS {matched = "a", remain = "b"},PS {matched = "", remain = "ab"}]
*Pattern> runPattern (kst (litc 'a')) "aab" :: [PatternState]
[PS {matched = "aa", remain = "b"},PS {matched = "a", remain = "ab"},PS {matched = "", remain = "aab"}]
active directory applied formal logic ashurbanipal authentication books c c++ comics conference continuations coq data structure digital humanities Dijkstra eclipse virgo electronics emacs goodreads haskell http java job Knuth ldap link linux lisp math naming nimrod notation OpenAM osgi parsing pony programming language protocols python quote R random REST ruby rust SAML scala scheme shell software development system administration theory tip toy problems unix vmware yeti
Member of The Internet Defense League
Site proudly generated by Hakyll.