*Debunking Economics: The Naked Emperor Dethroned?* is a book by Steve Keen, at the time professor of economics and finance at the University of Western Sydney. According to the blurb, it “exposes what a minority of economists have long known and many of the rest of us have long suspected: that economic theory is not only unpalatable, but also plain wrong.” The book contains a “scathing critique of conventional economic theory whilst [whilst?] explaining what mainstream economists cannot: why the [2008] crisis occurred, why it is proving to be intractable [the first edition was published in 2001, this edition was published in 2011], and what needs to be done to end it.”

Keen’s approach to each chapter is admirable: a “kernel” section describing the basic argument, a “roadmap” section spelling out how it will go, and a presentation of the neoclassical idea followed by his counter-arguments. However, he spends a great number of words attacking neoclassical economics and economists. These attacks may be needed for what the book is trying to do, but they do serve to hide the underlying arguments and counter-arguments.

It’s been a rather long time since I was anywhere near an economics classroom (and I cannot find an econ textbook around here!), so these notes are my attempt to understand Keen’s fundamental arguments. In each chapter in the bulk of the book, he presents a traditional “undergraduate” version of a basic economic idea (as taught by what he calls “neoclassical economists”) and then presents an apparently conclusive argument that the idea has major faults, if it is not completely wrong. I am attempting to pick out those basic ideas and avoid the criticism of neoclassical economics and economists.

Chapter 3, the first chapter of part 1, Foundations, addresses the demand curve, one part of introductory economics’ laws of supply and demand.

Note: if you are already familiar with how a demand curve is created, you can skip to the chapter’s punchline at The market demand curve.

Economics begins building a description of demand based on Jeremy Bentham’s (1748-1832) utilitarianism. (I like Jeremy Bentham; he spent time designing prisons and you can visit him at University College London today.) The idea is that, given some commodity such as bananas, having one banana will give you a certain amount of utility, and having more bananas always gives you more utility, although the amount of utility you gain from one *more* banana is less than the utility you gained from getting the previous banana. Here’s a table, denominated in a utility currency of “utils”:

```
##   Bananas Utils Change.in.Utils
## 1       1     8               8
## 2       2    15               7
## 3       3    19               4
## 4       4    20               1
```

(These examples are taken from the book.) Plotted, that looks like:

The problem here is that it uses a currency of “utils”, which doesn’t exist and may or may not be definable. It also does not refer to price. To remove the dependency on utils and introduce a relationship to price requires a few steps, the first of which is to relate the bananas commodity to a new commodity, say biscuits. Biscuits obey the same utility rules as bananas, although with different util values, and biscuits can be traded off against bananas at varying utilities:

```
##      [,1] [,2] [,3] [,4]
## [1,]    0    9   15   18
## [2,]    8   13   19   25
## [3,]   13   15   24   30
## [4,]   14   18   25   31
```

Graphed, that looks like:

In this graph, the highest utility is provided by 3 bananas and 3 biscuits, but 2 bananas and 3 biscuits is nearly as high. 1 banana and 3 biscuits has the same utility as 3 bananas and 2 biscuits. This last point is important; Keen writes,

The final abstraction en route to the modern theory was to drop this ‘3D’ perspective—since the actual ‘height’ couldn’t be specified numerically anyway—and to instead link points of equal ‘utility height’ into curves, just as contours on a geographic map indicate locations of equal height, or isobars on a weather chart indicate regions of equal pressure.

Based on this depiction, Keen writes,

Since consumers were presumed to be motivated by the utility they gained from consumption, and points of equal utility height gave them the same satisfaction, then a consumer should be ‘indifferent’ between any two points on any given curve, since they both represent the same height, or degree of utility. These contours were therefore christened ‘indifference curves’.

The properties of these curves are:

Completeness. Given two different combinations of commodities, a consumer can decide which is preferred or that he is indifferent between the two. (The combinations are ordered.)

Transitivity. If combination A is preferred to combination B, and B is preferred to combination C, combination A is preferred to C.

Non-satiation. More is better than less: if A contains the same amount of every commodity as B, except for one, and A has more of that one commodity than B, then A is preferred to B.

Convexity. “The marginal utility … falls with additional units, so that indifference curves are convex in shape.” (In the graph above, not all of the indifference curves are convex. Weird.)

Price enters the picture at this point.

Assume that bananas and biscuits both cost $1 each, and the consumer has $3 to spend. This allows the consumer to buy 3 bananas and no biscuits, or no bananas and 3 biscuits, or some combination. This assumption allows a “budget line” to be drawn on the graph:

The point on this line that maximizes the utility “height” indicates the combination of commodities that maximizes the utility and therefore satisfaction of the consumer while keeping the price of the combination to the budget; it looks like 1 banana and 2 biscuits is the appropriate combination.
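The search along the budget line is small enough to do by brute force. Here is a minimal sketch in Python. It assumes the rows of the utility table above are bananas (0 to 3) and the columns are biscuits (0 to 3), which is the reading that matches the combinations quoted earlier. The name `best_on_budget_line` is made up for illustration.

```python
# Utility table from the text. Assumed layout: UTILS[bananas][biscuits].
UTILS = [
    [0, 9, 15, 18],
    [8, 13, 19, 25],
    [13, 15, 24, 30],
    [14, 18, 25, 31],
]


def best_on_budget_line(budget, banana_price=1, biscuit_price=1):
    """Return (bananas, biscuits, utils) for the best bundle costing exactly `budget`."""
    best = None
    for bananas, row in enumerate(UTILS):
        for biscuits, utils in enumerate(row):
            cost = bananas * banana_price + biscuits * biscuit_price
            # Only bundles sitting on the budget line are candidates.
            if cost == budget and (best is None or utils > best[2]):
                best = (bananas, biscuits, utils)
    return best


print(best_on_budget_line(3))  # (1, 2, 19)
```

With $3 and both prices at $1, the four points on the line score 18, 19, 15 and 14 utils, so the sketch agrees with the eyeballed answer of 1 banana and 2 biscuits.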

Keen writes:

Economic theory then repeats this process numerous times—each time considering the same income [“budget”] and price for biscuits, but a [different] price for bananas. Each time, there will be a new combination of biscuits and bananas that the consumer will buy, and the combination of the prices and quantities of bananas purchased is the consumer’s demand curve for bananas.

Given the varying price of bananas and the corresponding number of bananas purchased in the optimal combination, we can construct something that sort of looks like a demand curve:

The resulting graph is a *demand curve*: it follows the *Law of Demand*, that demand increases as price falls. Although Keen does not go into the math, he notes (and I believe) that the process is well defined mathematically, so that a corresponding demand function can be created to play with analytically. There are some other twists, though. For one, lowering the price of a commodity that is not particularly desirable can result in no more, or even less, of that commodity being purchased.
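The repeat-at-each-price procedure Keen describes can be sketched by brute force in Python. This assumes the utility table from earlier (rows as bananas, columns as biscuits), a $3 budget, $1 biscuits, whole units only, and a handful of banana prices I picked for illustration. Because the table stops at 3 of each, a bundle only has to be affordable rather than sit exactly on the budget line.

```python
# Utility table from the text. Assumed layout: UTILS[bananas][biscuits].
UTILS = [
    [0, 9, 15, 18],
    [8, 13, 19, 25],
    [13, 15, 24, 30],
    [14, 18, 25, 31],
]


def bananas_demanded(banana_price, budget=3.0, biscuit_price=1.0):
    """Bananas in the highest-utility bundle that the budget can afford."""
    affordable = [
        (utils, bananas)
        for bananas, row in enumerate(UTILS)
        for biscuits, utils in enumerate(row)
        if bananas * banana_price + biscuits * biscuit_price <= budget
    ]
    # max() compares utils first, so this picks the best affordable bundle.
    return max(affordable)[1]


# Illustrative prices (an assumption, not taken from the book).
for price in (0.25, 0.5, 1.0, 1.5):
    print(price, bananas_demanded(price))
# 0.25 3
# 0.5 2
# 1.0 1
# 1.5 0
```

For this particular table the quantity falls as the price rises, which is the well-behaved case; nothing in the sketch guarantees that for other utility tables.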

If, for example, the price of bananas falls while the budget (or “income”, traditionally) and other prices remain the same, then the consumer can buy more bananas and has a higher utility value. On the other hand, if bananas are a major component of the budget yet are relatively undesirable, an increase in income can induce the consumer to buy fewer bananas and more of something more desirable, even if the change in income is the result of changing prices. The canonical example is potatoes during the Irish potato famine: consumers bought more potatoes even as the prices rose because they could no longer afford more desirable alternatives like pork.

As a result, two effects need to be separated to produce a useful demand curve: the *substitution effect* is the change in demand purely due to the change in prices, and the *income effect* is the change in demand due to a change in prices affecting the consumer’s perceived income. The substitution effect is always inversely related to prices: an increase in price produces a reduction in demand. The income effect, however, can have any relation to demand, depending on the commodity.

The income effect produces four classes of commodities:

Necessities, which take a diminishing *share* of spending as income grows. Think of, say, toilet paper; you only buy a certain number of rolls per month, no matter what your income is.

“Giffen” commodities, whose actual consumption falls as income grows. I have not bought many packages of Ramen noodles since I graduated.

Luxuries, which take an increasing share of spending as income grows. The percentage of income I spend on Picassos is essentially zero, but if I made more, I might find myself buying some.

Neutral or “homothetic” commodities, which take a constant share of spending as income grows.

Note that the second class is a sub-class of the first, and that the fourth class is unoccupied—what kind of commodity would you spend 10% of your income on, for any income from $10,000 per year to $1,000,000 per year? Pizza?

Anyway, back to the substitution effect and the income effect. The substitution effect is what the demand curve is trying to isolate. Fortunately, it is possible to neutralize the income effect to produce a “Hicksian compensated demand curve” which is well behaved: the demand for a commodity will rise if its price falls. There are certain assumptions that are needed to neutralize the income effect, though, such as that changing the price of bananas does not *directly* alter an individual’s income—the only change in income is the income effect.

(As an aside, Keen notes that Ted Wheelwright once described this as “tobogganing up and down your indifference curves until you disappear up your own abscissa”. The verb “to toboggan” needs to see more use.)

This is a demand curve for *one person* and one commodity. The process for creating a *market demand curve,* though, is very simple: given a market consisting of many consumers, each with their own demand curve, the overall market demand curve is simply the sum of the individual demand curves.

Without considering any other effects, that works: the sum of a set of individual demand curves will be a demand curve.
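Ignoring those other effects, the summation takes only a few lines. Here is a sketch in Python with two invented linear demand curves. The names and coefficients are assumptions for illustration, not from the book. At each price, the market quantity is the sum of the individual quantities.

```python
def crusoe_demand(price):
    """Hypothetical individual demand curve: bananas wanted at a given price."""
    return max(0.0, 10.0 - 2.0 * price)


def friday_demand(price):
    """A second hypothetical consumer with a flatter curve."""
    return max(0.0, 6.0 - price)


def market_demand(price, consumers=(crusoe_demand, friday_demand)):
    """Market demand at a price is the sum of every consumer's demand at that price."""
    return sum(demand(price) for demand in consumers)


for price in range(0, 7):
    print(price, market_demand(price))
```

Since each individual curve here never rises with price, neither does the sum. Note that the incomes of the two consumers are silently held fixed as the price changes.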

Unfortunately, considering other effects, there is a problem: changing the price of bananas in a market environment will change the income of some of the individuals. Keen uses the example of a market consisting of Robinson Crusoe and Man Friday, and more bananas.

Suppose that Crusoe is a banana consumer, and that Friday is both a consumer and a producer. An increase in the price of bananas will make Friday richer while making Crusoe poorer; Friday will be able to buy more biscuits. (What happens if they’re both producers? The same thing, assuming they are not equally good at producing bananas.)

As a result (Gorman 1953), the market demand function can be any polynomial; the graph is not necessarily negatively sloped. Demand can increase as prices rise. It can then decrease as prices continue to rise. And then increase again. Whee!

Fortunately, the market demand function can be made to obey the Law of Demand under two conditions, also known as the *Sonnenschein-Mantel-Debreu (SMD) conditions*:

All of the Engel curves of all consumers are straight lines, and

All of the Engel curves are parallel.

Keen quotes Gorman:

…we will show that there is just one community indifference locus through each point if, and only if, the Engel curves for different individuals at the same prices are parallel straight lines…

An Engel curve is based on an indifference graph, with multiple “budget” or income lines at different distances from the origin, showing the effects of an increasing income.

Joining the points of maximum utility on each budget line in sequence produces the Engel curve. (I have fudged this example; the actual curve for this silly graph should go through the indicated points.)

Engel curves relate back to the classes of commodities described above, with the income effect. Necessities and Giffen commodities produce upward-curving Engel curves, luxuries produce downward-curving Engel curves (such as the line in the bananas vs. biscuits graph above), and neutral commodities produce linear Engel curves.

The problem here is that the two SMD conditions mean that

All commodities are neutral, and

Since all Engel curves start from (0,0) and parallel lines through the same point are the same line, all consumers have the same Engel curve.

As a result, the conditions imply that there is a single, “generic” commodity, and a single, “representative” consumer.

About this, Gorman writes,

The necessary and sufficient condition quoted above is intuitively reasonable. It says, in effect, that an extra unit of purchasing power should be spent in the same way no matter to whom it is given.

…which doesn’t seem all that intuitively reasonable.

There is a good deal more to the chapter, such as discussions of further ways of looking at the SMD conditions (including a “benevolent dictator” who redistributes wealth prior to market activity), and demonstrations that the SMD caveats are not discussed in introductory textbooks and not clearly discussed in advanced textbooks.

As an addendum to the chapter, Keen presents some arguments that the initial idea of a rational consumer maximizing utility to produce an individual demand curve is not necessarily valid, either. His one example is a study of a group of consumers trying to pick the “best” basket of a number of different products. Keen argues that their failure is due to the computational complexity of picking the optimal solution to a largish combinatorial problem.

I believe there are enough other examples from behavioral economics to assert that the “rational” consumer, in practice, does not really exist.


(Of course, in this post I do not intend to address ESR’s comments on Rust’s crate environment or the necessity (or lack thereof) of being “batteries included”. And I certainly wish to avoid appearing to present “stupid flamage from zealots” or being a “clueless Rust fanboy” (Heaven forfend!).)

There are two kinds of people who will have excessive difficulty with Rust ownership:

Those who have extensive experience with C and, to a lesser extent, C++.

Those who have not had any experience with un-managed memory.

(And yes, I’m aware that those two categories cover just about everyone.)

I myself am a member of the first group. I grew up with C. C is my native tongue. In particular, I have a lot of experience with C on “normal” systems—the workstations and personal computers in which C originated.

C, the language, has essentially no compiler-enforced rules preventing the programmer from doing hideously stupid things with memory. Feel the urge to pass a pointer to a pointer to a `struct ssl_ctx_st` to a function expecting a pointer to a `struct ssl_ctx_st` or to a function expecting a floating point number? Want to return a pointer to a buffer you allocated on the stack? Sure! Go ahead! The compiler will generate code that does exactly what you told it to do. It’s up to the programmer to figure out what went wrong and why, after the resulting nuclear apocalypse.

In order to be productive with this sort of entertainment, I have internalized a whole lot of rules describing the “C virtual machine” and what you can and cannot do with it. This, incidentally, is why I mentioned “normal” systems above; there are machines out there that violate what I have internalized as the “C virtual machine”, for which the code that I might expect to work happily will instead fail dramatically. That is the primary difficulty in writing “portable” C code: the rules for doing so are even stricter than the rules for 90% of the systems in existence, and code that works almost everywhere will blow up on that one ridiculous machine.

Having internalized those rules, I am in the position of instinctively writing code that I know to be (or quite reasonably expect to be) memory safe *in this specific time and place*, but I am unused to having any tools that try to verify that for me. I sit down and write me some Rust but when I try to compile it, the compiler becomes very unhappy. The compiler is unhappy because it cannot verify that what I have done is safe *in general* and I am unhappy because I know that what I have done is safe *in this specific case*.

The rub is that the compiler is more likely to be right than I am.

Anyway, at this point, I have to “fight the compiler”. I can either abandon the technique that I am trying to use and find something more to the compiler’s liking, or I can try to add enough information to allow the compiler to judge that my technique is safe. Unfortunately, the tools to do the latter are a bit obscure, seemingly weird, and often not available. After all, Rust is not a dependently typed language and adding a proof of correctness to my code would likely be somewhat more painful than using Rust as it exists.

As a result, to learn Rust I have to unlearn some of the lessons I have internalized from years of writing C and relearn the related, but different, approaches needed for Rust.

Those who know C++ are in a similar, but less extreme, position. Rust is, after all, aimed at C++. Programmers using “standard” C++, specifically avoiding the keywords `new` and `delete` and avoiding raw pointers, are likely to have an easier time. On the other hand, there will be bumps in the road, particularly regarding other parts of Rust.

The second group, above, are in a whole ’nother world of hurt: they’re engaged in a journey of adventure and discovery wherein they need to learn how to manage their own memory without amputating any of their own limbs. Fortunately, on the one hand, Rust inherits many of C++’s techniques such as RAII, which make the task significantly easier. And on the other hand, the compiler’s insistence on its ownership model should ensure that they don’t have to also learn to love valgrind. But still, managing memory is a new requirement that has nothing to do with the actual logic of the problem to be solved, and the compiler is going to be an enemy until the programmer has learned how to handle this extra requirement.

Triangle Pen Show 2016

I managed to miss the Atlanta Pen Show this year, but fortunately, I was in North Carolina for the Triangle Pen Show in Cary, June 2-5. It was somewhat smaller and more intimate, but no less fun. Only one room of tables (plus a couple of hallways), but the contents of those tables were no less mind-blowing.

This time I did manage to see Susan Wirth’s seminar on “How some pens can make anyone’s handwriting look good. How to find them at a Pen show.” Unfortunately, I have developed the complete inability to remember the third item of any list of three things, so I don’t remember her last suggestion for a pen to improve my generally horrifying writing; the first two were a fine italic (a smidgen of line variation makes everything prettier) and an extra fine or needlepoint (for those situations that call for very small writing). She also had lots of tips for enjoying a pen show.

I had counted on Andreas Lambrou being at the show, and he did not disappoint. As a result, I quickly found myself carrying around 15lbs (6.8038856 kg) of autographed, beautiful pen books. There’s a funny story there: I didn’t have the cash for such a grand purchase, but Andreas had recently started using a Square card scanner. Unfortunately, he added an extra “04” to the price without either of us noticing. A couple of weeks later, I got a bill recording a purchase of roughly $3000. I got on the horn with Bank of America and contested the charge, then went to track down how to contact Andreas. I discovered that he’d already sent me email and had reversed the excess charges through Square. I think all the counter- and counter-counter-transactions have settled down, but my credit card at one point was showing I owed $6000. Whee.

The other attendee I was counting on was Franklin Christoph, which is a pretty sure thing: they’re based in North Carolina. If they didn’t show up, I could drive over to their location and pound on the doors. After fooling with some of their testers at Atlanta, I could not stop thinking about their Model 66, so I decided to combine my strange attraction for their pens with Susan’s pen suggestions (at least those I remember). I bought a Model 66 with a Masuyama fine italic nib and a Model 55 with a Masuyama needlepoint. Both are very nice to write with.

While the Franklin Christoph folks did have some of their standard colors there, they also had limited edition, non-production colors: the Model 66 is in flaming, Corvette red and the Model 65 is in a mottled reddish-brown and black. Both are pretty, but very hard to take accurate pictures of.

To that end, I’ll start with regular expression matching, then go on to the conversion to a deterministic finite automaton and finally discuss the minor extras needed to build a lexical analyzer, similar to lex or flex (but better).

I have written about regular expressions before, too.^{3} They are cool, they are useful, they are fairly simple to understand and work with…what’s not to like? Even better, how about a way of working with them that does not involve a trip into non-determinism land? Hence, Regular expression derivatives reexamined, by Scott Owens, John Reppy, and Aaron Turon. Let’s take a closer look at that.

When I was taking classes in graduate school, I noticed something peculiar: every professor, in every class, spent a couple of sessions on predicate logic, usually first-order predicate calculus. Naturally, all of the introductions were different in some aspect or another. *And,* I had gone into graduate school with a pretty fair understanding of predicate logic courtesy of an excellent undergraduate course on the subject. Now, I recognize that all of the incoming students from various places could not be assumed to have had a brilliant introduction to the topic, nor to have followed the same schedule as I did, and further, it is an important topic to computer science…but geeze. Couldn’t they have compared notes or something?

As a result, I spent some of the first few class sessions doing one of two things: napping or becoming confused. Neither is an especially good way to start a computer science class, although napping probably does the lesser harm. Eventually, I got to the point where I no longer really understood predicate calculus, and to this day, I could not come up with a coherent description of it.

At the risk of causing someone else to suffer my pain, here is my brief introduction to regular expressions. Feel free to take a nap.

This section starts with a basic definition of regular expressions and the terminology around them, and then goes into the definition of the derivative of a regular expression. We will do something useful with them in the next section.

A regular expression starts with an *alphabet*, *Σ*, containing a finite number of basic *symbols*; think of the ASCII characters or Unicode or something. A *string* is a sequence of symbols from *Σ*, kind of like this sentence. *Σ** is the Kleene-closure (more on this in a minute) of the alphabet; i.e. the set of all strings that can be constructed from the alphabet, by gluing together zero or more symbols. Note that *Σ** is infinite; although every string has some finite length, *Σ** includes every string. This set does include one special member, *ϵ*, which represents the string with zero components, the empty string.

A *language* is a subset of *Σ**, dividing that set of strings into those strings that are in the language and those strings that are not in the language. There is one special language, ∅, consisting of the empty set; all strings, including *ϵ*, are on the “not in the language” side.

A *regular expression* is a (relatively) concise description of a (simple) language. (There are many other ways of describing languages, such as grammars (context free, context sensitive, and “I acknowledge there is a context but I don’t have the social skills to follow it”), but regular expressions, or regexs, or even regexps, are the topic here.)

Regular expressions are built from the symbols of the alphabet *Σ*, combined using a specific group of operations:

*Concatenation.* If *a* and *b* are symbols, *a**b* is the regular expression describing the language {*a**b*}; i.e. the language containing the string consisting of *a* followed by *b*.

*Alternation.* If *a* and *b* are symbols, *a*|*b* describes the language {*a*, *b*}; i.e. the language containing the two strings, *a* and *b*.

*Kleene-closure.* The Kleene star operator, *, is kind of special; it says, “any number, zero or more, of copies of whatever is in front of it.” If *a* is a symbol, the regular expression *a** describes the language {*ϵ*, *a*, *a**a*, ...}. Unlike the two examples above, this language contains an infinite number of strings. (That should explain what *Σ** means; it is another use of the Kleene star.)

In the examples above, I used the simplest regular expressions, the symbols *a* and *b*. All of these operations work for “sub-expressions” similar to arithmetic operators: if *r* and *s* are regular expressions, *r**s* is the regular expression describing the language built from the languages described by *r* and *s*, where every member of the language *r**s* is a string consisting of a prefix from the language *r* followed by a suffix from the language *s*. The language described by *r*|*s* is the union of the languages described by *r* and *s*, and the language described by *r** is the language made up of zero or more concatenated copies of the strings in *r*.
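For small finite languages, these three operations can be tried out directly on sets of strings. Here is a sketch in Python. The function names are mine, the empty string `""` plays the role of *ϵ*, and since a Kleene closure is infinite, `star_up_to` only builds the strings made from at most `n` copies.

```python
def concat(r, s):
    """Every string of r followed by every string of s."""
    return {x + y for x in r for y in s}


def alternate(r, s):
    """Union of the two languages."""
    return r | s


def star_up_to(r, n):
    """Finite slice of the Kleene closure: zero to n concatenated strings from r."""
    result = {""}
    layer = {""}
    for _ in range(n):
        layer = concat(layer, r)
        result |= layer
    return result


print(sorted(concat({"a"}, {"b"})))     # ['ab']
print(sorted(alternate({"a"}, {"b"})))  # ['a', 'b']
print(sorted(star_up_to({"a"}, 3)))     # ['', 'a', 'aa', 'aaa']
```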

For our use here, this relatively normal description of regular expressions is extended with two other operations. While these operations are not normally seen in descriptions of regular expressions, they do not add any expressive power; theoretically, they can be built from the operations above.

*Conjunction.* If *r* and *s* are regular expressions, *r*∧*s* describes the language consisting of those strings that are in *both* *r* and *s*.

*Negation.* If *r* is a regular expression, ¬*r* describes the language consisting of strings that are *not* in the language *r*.

Finally, feel free to add parentheses to resolve any ambiguity in the expression.^{4}

Given the brief introduction above, what is the derivative of a regular expression? It certainly has nothing obvious to do with differential calculus, that’s for sure.

The *derivative of a regular expression* *r*, with respect to a symbol *a*, is another regular expression, call it *r*′. The language described by *r*′ consists of all of the strings from the language of *r* that start with *a*, with the *a* prefix stripped off. Here’s some examples:

The derivative of *a**b* (i.e. the language {*a**b*}) with respect to *a* is the set of strings from the original language that start with *a* (i.e. {*a**b*}) with the initial *a* stripped off: {*b*}. This is the regular expression *b*.

The derivative of *a**b* ({*a**b*}, remember) with respect to *c* is the set of strings from the language described by *a**b* that start with *c*, with the *c* removed: the empty set, ∅. (For simplicity (if not clarity), let’s say that the language ∅ is represented by the regular expression ∅.)

The derivative of *a** (the set {*ϵ*, *a*, *a**a*, ...}, remember) with respect to *a* is the set of strings described by *a** with the initial *a* removed. The *ϵ* is removed from the set, but it is replaced by *a* with the *a* removed, which is itself replaced by *a**a* with *its* initial *a* removed: *a*. So, the language of the derivative *r*′ is still {*ϵ*, *a*, *a**a*, ...}. The derivative of *a** with respect to *a* is simply *a**. Neat, eh?

The language of the derivative of *a*|*b* with respect to *a* is the subset of {*a*, *b*} of strings starting with *a*, which is {*a*}, with the *a* prefix removed: {*ϵ*}. (For convenience, this language is described by *ϵ*, treated as a regular expression. Can’t make this stuff too easy, ya’ know.)
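The four examples above can be checked at the level of languages, before any rules about expressions come into play. Here is a sketch in Python, treating a language as a set of strings, with `""` standing in for *ϵ*. The function name is my own, and the infinite language of *a** is cut off at four copies of *a* as an approximation.

```python
def language_derivative(language, symbol):
    """Strings of `language` that start with `symbol`, with that first symbol stripped."""
    return {s[1:] for s in language if s.startswith(symbol)}


print(language_derivative({"ab"}, "a"))      # {'b'}
print(language_derivative({"ab"}, "c"))      # set()
print(language_derivative({"a", "b"}, "a"))  # {''}

# Finite stand-in for the language of a*: zero to four copies of "a".
a_star = {"a" * n for n in range(5)}
print(sorted(language_derivative(a_star, "a")))  # ['', 'a', 'aa', 'aaa']
```

The last result is one string short only because of the cut-off; in the full, infinite language nothing is lost, which is why the derivative of *a** is *a** again.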

The paper, “Regular-expression derivatives reexamined”, gives the following simple, mathematical definition of the derivative of a regular expression with respect to a character, *a*. (The notation *δ*(*r*, *a*) stands for the derivative of *r* with respect to *a* (and yes, I simplified the notation used in the paper, because a subscript does not make any damn sense as an argument).)

$$
\begin{array}{rcl}
\delta(\epsilon, a) & = & \emptyset \\
\delta(a, a) & = & \epsilon \\
\delta(b, a) & = & \emptyset\ \textrm{ when $a \neq b$} \\
\delta(\emptyset, a) & = & \emptyset \\
\delta(r \cdot s, a) & = & (\delta(r,a) \cdot s) | (\nu(r) \cdot \delta(s,a))\\
\delta(r^{*}, a) & = & \delta(r, a) \cdot r^{*} \\
\delta(r|s, a) & = & \delta(r,a) | \delta(s,a) \\
\delta(r \wedge s, a) & = & \delta(r,a) \wedge \delta(s,a) \\
\delta(\neg r, a) & = & \neg \delta(r,a) \\
\delta(r, \epsilon) & = & r \\
\delta(r, a \cdot u) & = & \delta(\delta(r,a),u)\ \textrm{ where $u$ is a string} \\
\delta((r_1, r_2, \ldots, r_n), a) & = & (\delta(r_1,a), \delta(r_2,a), \ldots, \delta(r_n,a))
\end{array}
$$

(While concatenation is usually represented by juxtaposition, i.e. *a**b*, for clarity in this table I have represented it as an explicit ⋅ (no, that’s a dot, not a speck on your monitor) between two other expressions. Alternation and conjunction are represented by “|” and “∧”, respectively.)

The last three equations may require some special explanation. The first two, for *δ*(*r*, *ϵ*) and *δ*(*r*, *a**u*), extend the definition of the derivative of a regular expression with respect to a character into the definition of the derivative with respect to a string. This is done straightforwardly, by defining the derivative of *r* with respect to *ϵ* to be *r* itself, and by defining the derivative of *r* with respect to a non-empty string to be the first-order derivative of *r* with respect to the first character, computing a second-order derivative with respect to the next character, and so on. (Taking the derivative with respect to a string is nice to be able to do, for completeness anyway, but won’t further be used in this post. See, stuff’s getting simpler already!)

The final equation describes another extension to the regular expression, the idea of a *regular vector*; the derivative is just the derivative of the individual components. There will be much more about this extension later, though; it is primarily important as a tool to build a scanner, where you have a number of regular expressions and are interested in the one that matches at the beginning of a longer string.

If you look closely, you will see something odd in the line defining *δ*(*r* ⋅ *s*, *a*): a function *ν*(*r*). The function *ν* is a *nullability test*, a test if the expression *r* describes a language that contains the empty string, *ϵ*. If *r* includes *ϵ*, then *ν*(*r*) ultimately evaluates to *ϵ*; if *r* doesn’t include *ϵ*, then *ν*(*r*) evaluates to ∅. Mathematically, *ν* is:

$$
\begin{array}{rcl}
\nu(\epsilon) & = & \epsilon \\
\nu(a) & = & \emptyset \\
\nu(\emptyset) & = & \emptyset \\
\nu(r \cdot s) & = & \nu(r) \wedge \nu(s) \\
\nu(r|s) & = & \nu(r) | \nu(s) \\
\nu(r^{*}) & = & \epsilon \\
\nu(r \wedge s) & = & \nu(r) \wedge \nu(s) \\
\nu(\neg r) & = & \left\{ \begin{array}{rl}
\epsilon & \textrm{if $\nu(r) = \emptyset$} \\
\emptyset & \textrm{if $\nu(r) = \epsilon$}
\end{array} \right.
\end{array}
$$

The point of *ν* in the derivative of concatenation is, if the left component, *r*, is nullable, then we need to push the derivative operation onto the right component, *s*. Consider what happens when you take the derivative of the regular expression *a**b* with respect to *a*: *a* is not nullable, so that equation evaluates to (*ϵ* ⋅ *b*)|(∅ ⋅ ∅), which can be simplified to *ϵ* ⋅ *b*. The derivative of *that* with respect to *b* (note that *ϵ* *is* nullable) is (∅⋅*b*)|(*ϵ* ⋅ *ϵ*), which simplifies to *ϵ*. Which is exactly what we need: the derivative of the regular expression *a**b* with respect to the string *a**b* should be exactly *ϵ*.
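The two tables are mechanical enough to transcribe and check against the worked example. What follows is a minimal sketch in Python, not the Scheme code this post goes on to build. The tuple representation and every function name are my own assumptions. No simplification is done, so derivatives grow with each step, and regular vectors are left out.

```python
# Regular expressions as nested tuples. All names are illustrative assumptions.
EMPTY = ("empty",)  # the language ∅
EPS = ("eps",)      # the language {ϵ}

def sym(c): return ("sym", c)
def cat(r, s): return ("cat", r, s)
def alt(r, s): return ("alt", r, s)
def star(r): return ("star", r)
def conj(r, s): return ("and", r, s)
def neg(r): return ("not", r)


def nullable(r):
    """True when the language of r contains the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag in ("cat", "and"):
        return nullable(r[1]) and nullable(r[2])
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "not":
        return not nullable(r[1])
    raise ValueError("unknown expression: %r" % (r,))


def nu(r):
    """ν(r): ϵ if r is nullable, ∅ otherwise."""
    return EPS if nullable(r) else EMPTY


def delta(r, a):
    """δ(r, a): derivative of r with respect to the single symbol a."""
    tag = r[0]
    if tag in ("eps", "empty"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == a else EMPTY
    if tag == "cat":
        return alt(cat(delta(r[1], a), r[2]), cat(nu(r[1]), delta(r[2], a)))
    if tag == "star":
        return cat(delta(r[1], a), r)
    if tag == "alt":
        return alt(delta(r[1], a), delta(r[2], a))
    if tag == "and":
        return conj(delta(r[1], a), delta(r[2], a))
    if tag == "not":
        return neg(delta(r[1], a))
    raise ValueError("unknown expression: %r" % (r,))


def matches(r, string):
    """Take the derivative symbol by symbol; the string matches if the result is nullable."""
    for a in string:
        r = delta(r, a)
    return nullable(r)


ab = cat(sym("a"), sym("b"))
# The first step of the worked example: (ϵ ⋅ b)|(∅ ⋅ ∅), unsimplified.
print(delta(ab, "a") == alt(cat(EPS, sym("b")), cat(EMPTY, EMPTY)))  # True
print(matches(ab, "ab"), matches(ab, "a"))  # True False
```

Using a plain boolean for the nullability test and wrapping it in `nu` only where the concatenation rule needs an expression is a convenience of this sketch; the paper’s *ν* returns *ϵ* or ∅ directly.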

**A brief aside about Greek letters** I have an embarrassing admission to make: I never memorized the Greek alphabet. I recognize the common ones, sure, such as *α*, *π*, and *λ*. I even know some of the upper case: *Σ* and *Ω*. But generally, when I run across a letter in some math in some paper, well… I could look it up, but really, who does that? I generally just refer to it as “smegma”. Multiple letters? “Big smegma”, “the other smegma” and so on. It’s probably a good thing I don’t have to do this in public very often. In any case, I did look them up this time: *δ* is *delta* and *ν* is *nu*.

Now that we have a good, clean definition of the derivative of a regular expression, what can you do with it?

How about writing a proggy?

In this section, I’m going to start by presenting a basic internal representation of regular expressions, as a combination of simple expressions (*ϵ*, symbols, etc.) and higher-level operations on one or more lower-level expressions (concatenation, alternation, Kleene-star, etc.), plus some operations on the internal representation of expressions. Then, I’ll present a simple parser for expressions, to build the internal representation. Finally, I’ll show a regular expression matching engine using the derivative operation.

I have not written about Scheme before; in fact, this is the first serious work I’ve done in Scheme. But it seems like a nice enough language and I thought that, since I was doing programming language work, I would use one of the favorite tools for that task. Hence, Scheme, specifically Guile 2.0.11.

I may not have used Scheme before, but I do have a lot of experience with Common Lisp, although I haven’t used *it* seriously in quite a number of years. But one thing I have written about before is the weird love-affair that Lisp programmers have with representations over abstractions for data structures. Matt Might, in his own Scheme code for regular expression derivatives, provides some dandy examples:

```
;; Special regular expressions.
(define regex-NULL #f) ; -- the empty set
(define regex-BLANK #t) ; -- the empty string
```

Yes, using Booleans for ∅ and *ϵ*. He even goes so far as to write, for his version of *ν*,

```
; regex-empty : regex -> boolean
(define (regex-empty re)
  (cond
    ((regex-empty? re) #t)
    ((regex-null? re) #f)
    ((regex-atom? re) #f)
    ((match-seq re (lambda (pat1 pat2)
                     (seq (regex-empty pat1) (regex-empty pat2)))))
    ((match-alt re (lambda (pat1 pat2)
                     (alt (regex-empty pat1) (regex-empty pat2)))))
    ((regex-rep? re) #t)
    (else #f)))
```

Not only is that not using standard Scheme (Might is using Racket), but the only reason that function results in a Boolean is a carefully arranged accident of representation along with careful construction of the alternation and concatenation operators.

As a result, I’m going to follow my own advice and create a nice pile of structures. (Who’s scared of a little verbosity, anyway?) To start with, here are representations of ∅ and *ϵ* expressions:

```
(define-record-type dre-null-t    ; The empty language; the empty set
  (dre-null-raw) dre-null?)
(define dre-null (dre-null-raw))  ; Uninteresting value; use a constant
(define-record-type dre-empty-t   ; The empty string
  (dre-empty-raw) dre-empty?)
(define dre-empty (dre-empty-raw))
```

The records use SRFI-9 record types, as recommended in the Guile documentation. Since ∅ and *ϵ* do not have any interesting structure, I am defining constants for them, `dre-null` and `dre-empty`.

I am presenting the completed proof-of-concept work here, and there are some differences between it and the simple presentation in the previous section. One of those differences is the regular vectors mentioned above (and again, later). Another is this: Rather than using single, constant characters as the base regular expressions, this code uses character sets.

The fundamental idea is that, rather than using raw symbols such as *a*, I use symbol sets, *S*, such as the set {*a*}. *Nothing changes* from the presentation above, except that the definitions

$$
\begin{array}{rcl}
\delta(a, a) & = & \epsilon\ \textrm{ when $a$ = $a$} \\
\delta(b, a) & = & \emptyset\ \textrm{ when $a$ $\neq$ $b$}
\end{array}
$$

become

$$
\begin{array}{rcl}
\delta(S, a) & = & \left\{ \begin{array}{rl}
\epsilon\ & \textrm{ when $a$ $\in$ $S$} \\
\emptyset\ & \textrm{ when $a$ $\not\in$ $S$}
\end{array} \right.
\end{array}
$$

Using character sets has a number of advantages. For one thing, trying to enumerate a large number of characters, like Unicode, would be silly. Instead, large character sets can be defined positively and negatively: a finite set of given characters, or the negation of a finite set of characters.

```
(define-record-type dre-chars-t   ; Set of characters
  (dre-chars-raw positive chars) dre-chars?
  (positive dre-chars-pos?)
  (chars dre-chars-set))
(define (dre-chars chars)
  (dre-chars-raw #t (list->set chars)))
(define (dre-chars-neg chars)
  (dre-chars-raw #f (list->set chars)))
(define dre-chars-sigma (dre-chars-neg '()))
```

*Σ* is defined here as the negation of the empty set.

The basic set operations are complicated by the positive/negative distinction, but not overwhelmingly so.

```
(define (dre-chars-member? re ch)
  (let ([is-member (set-member? (dre-chars-set re) ch)])
    (if (dre-chars-pos? re) is-member (not is-member))))
(define (dre-chars-empty? chars)
  (cond
    [(not (dre-chars? chars)) (error "not a character set:" chars)]
    [(dre-chars-pos? chars) (set-empty? (dre-chars-set chars))]
    [else #f]))
(define (dre-chars-intersection l r)
  (cond
    [(and (dre-chars-pos? l) (dre-chars-pos? r))
     ;; both positive: simple intersection
     (dre-chars-raw #t (set-intersection (dre-chars-set l) (dre-chars-set r)))]
    [(dre-chars-pos? l)
     ;; l positive, r negative: elts in l also in r by dre-chars-member?
     (dre-chars (filter (lambda (elt) (dre-chars-member? r elt))
                        (set-elts (dre-chars-set l))))]
    [(dre-chars-pos? r)
     ;; l negative, r positive: the mathematician's answer
     (dre-chars-intersection r l)]
    [else
     ;; both negative: slightly less simple union
     (dre-chars-raw #f (set-union (dre-chars-set l) (dre-chars-set r)))]))
```
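The set helpers these functions call (`list->set`, `set`, `set-elts`, `set-empty?`, `set-member?`, `set-union`, `set-intersection`) come from my own set code, which is not shown in this post. As a rough stand-in only, assuming a set is nothing more than a duplicate-free list compared with `equal?`, they could be sketched with SRFI-1 like this. It is not the implementation the code here was actually run with.

```scheme
(use-modules (srfi srfi-1))  ; delete-duplicates, lset-union, lset-intersection

;; Hypothetical list-backed sets: a set is a list with no duplicates.
;; Element equality is assumed to be equal?.
(define (list->set elts) (delete-duplicates elts equal?))
(define (set . elts) (list->set elts))
(define (set-elts s) s)
(define (set-empty? s) (null? s))
(define (set-member? s elt) (and (member elt s) #t))
(define (set-union l r) (lset-union equal? l r))
(define (set-intersection l r) (lset-intersection equal? l r))
```

A list-backed set makes membership linear in the size of the set, which is tolerable for the handful of characters in a typical character class but would be a poor choice for anything large.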

If you take a close look at `dre-chars-empty?`, you’ll see that a negated character set cannot be empty; it always returns false. Interesting, eh? (It means *Σ* is infinite.)

The implications of that appear in the one further special operation: the choice of a character from a set. It is easy if the set is positive, but if the set is negative I am using Scheme’s typelessness and taking the cheap, quick way out: returning something that cannot possibly be a member of a character set.

```
(define (dre-chars-choice chars)
  (cond
    [(not (dre-chars? chars))
     (error "not a character set:" chars)]
    [(and (dre-chars-pos? chars) (not (dre-chars-empty? chars)))
     (car (set-elts (dre-chars-set chars)))]
    [else (gensym)]))               ; Not a character
```

`dre-chars-choice` is not immediately needed, but becomes important when converting a regular expression directly into a deterministic automaton, which will appear further down.

The concatenation of two regular expressions is represented by `dre-concat-t`, a record with left and right fields.

Constructing a concatenation is another moderately complicated operation.

```
(define-record-type dre-concat-t  ; Concatenation; sequence
  (dre-concat-raw left right) dre-concat?
  (left dre-concat-left)
  (right dre-concat-right))
(define (dre-concat left right)
  (cond
    [(not (dre? left))
     (error "not a regular expression: " left)]
    [(not (dre? right))
     (error "not a regular expression: " right)]
    ;; ∅ ∙ r => ∅
    [(dre-null? left)
     dre-null]
    ;; r ∙ ∅ => ∅
    [(dre-null? right)
     dre-null]
    ;; ϵ ∙ r => r
    [(dre-empty? left)
     right]
    ;; r ∙ ϵ => r
    [(dre-empty? right)
     left]
    ;; (r ∙ s) ∙ t => r ∙ (s ∙ t)
    [(dre-concat? left)
     (dre-concat (dre-concat-left left)
                 (dre-concat (dre-concat-right left)
                             right))]
    [else
     (dre-concat-raw left right)]))
```

The idea here is to keep the structure in a “canonical” state. If either the left or the right expression is ∅, the concatenation is ∅. If the left expression is *ϵ*, the concatenation is simply the right expression; likewise for the right expression. Finally, a tree structure built of concatenations should be “right-heavy”: the leftmost expression should always be a non-concatenation, recursively.
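As a small hand-worked illustration of the “right-heavy” rule (my own example, traced through the `(r ∙ s) ∙ t => r ∙ (s ∙ t)` clause of `dre-concat`), appending *d* to an already canonical *a* ⋅ (*b* ⋅ *c*) applies the re-association twice:

$$
\begin{array}{rcl}
(a \cdot (b \cdot c)) \cdot d & \Rightarrow & a \cdot ((b \cdot c) \cdot d) \\
& \Rightarrow & a \cdot (b \cdot (c \cdot d))
\end{array}
$$

Every left operand in the final tree is a plain symbol, which is the property the constructor is maintaining.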

This canonical structure, which will be repeated for alternation, Kleene-closure, conjunction, and negation, is used when comparing two regular expression structures for equality (or at least semi-equality; it’s still possible for two differing expressions to describe the same language). Once again, a topic that will become important when converting the expression to a DFA.

Given that alternation in a regular expression is the same as union of the two languages described by the two expressions, the simplifications for the canonical form should be easy to see.

```
(define-record-type dre-or-t      ; Logical or; alternation; union
  (dre-or-raw left right) dre-or?
  (left dre-or-left)
  (right dre-or-right))
(define (dre-or left right)
  (cond
    [(not (dre? left))
     (error "not a regular expression: " left)]
    [(not (dre? right))
     (error "not a regular expression: " right)]
    ;; ∅ + r => r
    [(dre-null? left)
     right]
    [(dre-null? right)
     left]
    ;; r + r => r
    [(dre-equal? left right)
     left]
    ;; ¬∅ + r => ¬∅
    [(and (dre-negation? left)
          (dre-null? (dre-negation-regex left)))
     left]
    ;; (r + s) + t => r + (s + t)
    [(dre-or? left)
     (dre-or (dre-or-left left)
             (dre-or (dre-or-right left)
                     right))]
    [else
     (dre-or-raw left right)]))
```

Just to keep everything notationally interesting, I have used “|” for alternation, but the “practical” paper uses “+”, and I copied the comments in the code from the paper.

There are a couple of slightly odd facts about the Kleene closure.

The closure of the empty language is the language containing nothing but the empty string.

The closure of the closure of an expression, `(r*)*`, is the same as the unnested closure of the expression, `r*`.

Weird, eh?

```
(define-record-type dre-closure-t ; Kleene closure; repetition
  (dre-closure-raw regex) dre-closure?
  (regex dre-closure-regex))
(define (dre-closure regex)
  (cond
    [(not (dre? regex)) (error "not a regular expression: " regex)]
    ;; ∅* => ϵ
    [(dre-null? regex) dre-empty]
    ;; ϵ* => ϵ
    [(dre-empty? regex) dre-empty]
    ;; (r*)* => r*
    [(dre-closure? regex) regex]
    [else (dre-closure-raw regex)]))
```

Conjunction represents the intersection of the two languages, so the canonical simplifications, similar to alternation, should be self-explanatory.

Don’t you hate it when an author says, “This should be immediately clear”?

```
(define-record-type dre-and-t     ; Logical and; intersection
  (dre-and-raw left right) dre-and?
  (left dre-and-left)
  (right dre-and-right))
(define (dre-and left right)
  (cond
    [(not (dre? left))
     (error "not a regular expression: " left)]
    [(not (dre? right))
     (error "not a regular expression: " right)]
    ;; ∅ & r => ∅
    [(dre-null? left)
     dre-null]
    [(dre-null? right)
     dre-null]
    ;; r & r => r
    [(dre-equal? left right)
     left]
    ;; (r & s) & t => r & (s & t)
    [(dre-and? left)
     (dre-and (dre-and-left left)
              (dre-and (dre-and-right left)
                       right))]
    ;; ¬∅ & r => r
    [(and (dre-negation? left)
          (dre-null? (dre-negation-regex left)))
     right]
    [else
     (dre-and-raw left right)]))
```

And once again, the negation of a regular expression is the complement of the language set. And the negation of a negation is the original expression and set.

```
(define-record-type dre-negation-t ; Complement
  (dre-negation-raw regex) dre-negation?
  (regex dre-negation-regex))
(define (dre-negation regex)
  ;; ¬(¬r) => r
  (if (dre-negation? regex)
      (dre-negation-regex regex)
      (dre-negation-raw regex)))
```

Regular expression vectors are remarkably simple, all the way through this. That is fortunate, because they become very, very important much later on.

```
(define-record-type dre-vector-t  ; A vector of regexs
  (dre-vector v) dre-vector?
  (v dre-vector-list))
```

I am almost done with the preliminaries, I promise. Now, we have a couple of predicates that are both used in the constructors above (as well as the code below).

```
(define (dre? re)
  (or (dre-null? re)
      (dre-empty? re)
      (dre-chars? re)
      (dre-concat? re)
      (dre-or? re)
      (dre-closure? re)
      (dre-and? re)
      (dre-negation? re)
      (dre-vector? re)))
```

The function `dre?` returns true if the argument is a regular expression and false otherwise. It’s built from the predicates for the constituent structures.

**A brief aside about names** I have used `dre` as the prefix for the structures and functions I am writing here. This stands for “Derivatives of Regular Expressions”. Possibly. Maybe. Or something. This is why I don’t get to name things.

At this point, I feel my choice of Guile Scheme is unfortunate. If I had chosen Racket, I could have called the package “Dr. Dre.” Not that I would ever actually do that. I really shouldn’t name things.

In any case, the next predicate is `dre-equal?`, a function to compare two regular expressions for equality.

```
(define (dre-equal? left right)
  (cond
    [(not (dre? left))
     #f]
    [(not (dre? right))
     #f]
    [(and (dre-and? left) (dre-and? right))
     (let ([l1 (dre-and-left left)]
           [l2 (dre-and-right left)]
           [r1 (dre-and-left right)]
           [r2 (dre-and-right right)])
       (or (and (dre-equal? l1 r1)
                (dre-equal? l2 r2))
           (and (dre-equal? l1 r2)
                (dre-equal? l2 r1))))]
    [(and (dre-or? left) (dre-or? right))
     (let ([l1 (dre-or-left left)]
           [l2 (dre-or-right left)]
           [r1 (dre-or-left right)]
           [r2 (dre-or-right right)])
       (or (and (dre-equal? l1 r1)
                (dre-equal? l2 r2))
           (and (dre-equal? l1 r2)
                (dre-equal? l2 r1))))]
    [(and (dre-vector? left) (dre-vector? right))
     (every dre-equal?
            (dre-vector-list left)
            (dre-vector-list right))]
    [else
     (equal? left right)]))
```

Truly comparing expressions for equality means deciding whether they describe the same language. For regular expressions that is decidable (build a minimal DFA for each and compare them), but it is expensive, and it would be circular here, since building the DFA is the very thing the comparison is meant to support.

What this predicate is doing is comparing the two expressions for limited, structural, equality, with the additional proviso that `dre-and` and `dre-or` are symmetric: the order of the arguments doesn’t matter. This, combined with the (minimal) work I put into making the structural representations canonical ought to be good enough, especially since any two expressions being compared were probably derived from some single previous expression.

“Good enough”, in this case, means optimizing the structures above (where minor problems will not affect their semantics) and minimizing the number of states below (much further below, where the argument that the expressions are derived from a single parent is probably more important for correctness).
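To make the limits concrete, here are two comparisons traced by hand through `dre-equal?` (my own examples, not test output). Both pairs describe the same language, but only the first is recognized as equal, because the symmetry check looks at the two immediate operands only:

$$
\begin{array}{rcll}
a | b & \textrm{vs.} & b | a & \textrm{equal: the operands match after swapping} \\
a | (b | c) & \textrm{vs.} & c | (b | a) & \textrm{not equal: $a$ matches neither $c$ nor $b | a$}
\end{array}
$$

The cost of a miss like the second one is an extra, redundant state in the automaton, not a wrong answer.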

At this point, I have the structural tools to create representations of regular expressions by gluing things together by hand. But I would very much like not to do that. So, I have (ahem) borrowed a recursive-descent regex parser from Matt Might.

The original grammar (I’ve modified it a bit, specifically to extend to more regular expressions) was:

$$
\begin{array}{rcl}
\textit{regex} & := & \textit{term}\ \textbf{|}\ \textit{regex} \\
& | & \textit{term} \\
\textit{term} & := & \{ \textit{factor} \} \\
\textit{factor} & := & \textit{base} \{ \textbf{*} \} \\
\textit{base} & := & \textit{char} \\
& | & \textbf{(}\ \textit{regex}\ \textbf{)} \\
& | & \textbf{[}\ \textit{set}\ \textbf{]} \\
\textit{set} & := & \{ \textit{char} \} \\
& | & \textbf{^}\ \{ \textit{char} \} \\
\textit{char} & := & \textrm{character} \\
& | & \textbf{\\}\ \textrm{character}
\end{array}
$$

The parser is `string->dre`; given a string, it produces a regular expression built from the structures above. It reads from a string argument and uses the value `cur` to record its current location in the string.

```
(define (string->dre str)
  (let ([cur 0])
    ;; Peek at the next character
    (define (peek)
      (if (more) (string-ref str cur) (error "unexpected end of string")))
    ;; Advance past the next character
    (define (eat ch)
      (if (char=? ch (peek))
          (set! cur (+ cur 1))
          (error "unexpected character:" ch (peek))))
    ;; Eat and return the next character
    (define (next)
      (let ([ch (peek)]) (eat ch) ch))
    ;; Is more input available?
    (define (more)
      (< cur (string-length str)))
    ;; The regex rule
    (define (regex)
      (let ([trm (term)])
        (cond
          [(and (more) (char=? (peek) #\|))
           (eat #\|)
           (dre-or trm (regex))]
          [else
           trm])))
    ;; The term rule
    (define (term)
      (let loop ([fact dre-empty])
        (cond
          [(and (more) (and (not (char=? (peek) #\)))
                            (not (char=? (peek) #\|))))
           (loop (dre-concat fact (factor)))]
          [else
           fact])))
    ;; The factor rule
    (define (factor)
      (let loop ([b (base)])
        (cond
          [(and (more) (char=? (peek) #\*))
           (eat #\*)
           (loop (dre-closure b))]
          [else
           b])))
    ;; The base rule
    (define (base)
      (cond
        [(char=? (peek) #\()
         ;; parenthesized sub-pattern
         (eat #\()
         (let ([r (regex)])
           (eat #\))
           r)]
        [(char=? (peek) #\[)
         ;; character set
         (eat #\[)
         (let ([s (set)])
           (eat #\])
           s)]
        [(char=? (peek) #\.)
         ;; any character except newline
         (eat #\.)
         (dre-chars-neg '(#\newline))]
        [else
         ;; single character
         (dre-chars (list (char)))]))
    ;; The set rule
    (define (set)
      (cond
        [(char=? (peek) #\^)
         ;; negated set
         (eat #\^)
         (dre-chars-neg (chars))]
        [else
         ;; positive set
         (dre-chars (chars))]))
    ;; The chars rule
    (define (chars)
      (let loop ([ch '()])
        (if (char=? (peek) #\])
            ch
            (loop (cons (char) ch)))))
    ;; The char rule
    (define (char)
      (cond
        [(char=? (peek) #\\)
         ;; quoted character
         (eat #\\)
         (next)]
        [else
         ;; unquoted character
         (next)]))
    ;; Read the regex from the string
    (let ([r (regex)])
      (if (more)
          (error "incomplete regular expression:" (substring str cur))
          r))))
```

Note the brilliant use of lexical scoping, to enclose the functions implementing the grammar rules, which simplifies their interfaces and hides them from the outside. Lexical scoping is damn nice.

And finally, I reach the heart of regular expression derivatives: implementing the functions *δ* and *ν*. They are actually pretty trivial, simply translating the definitions above to use the structures and functions that I’ve defined.

The function *ν* is first, implemented as `nu`.

```
(define (nu re)
  (cond
    [(not (dre? re))
     (error "not a regular expression: " re)]
    [(dre-empty? re)
     dre-empty]
    [(dre-chars? re)
     dre-null]
    [(dre-null? re)
     dre-null]
    [(dre-concat? re)
     (dre-and (nu (dre-concat-left re))
              (nu (dre-concat-right re)))]
    [(dre-or? re)
     (dre-or (nu (dre-or-left re))
             (nu (dre-or-right re)))]
    [(dre-closure? re)
     dre-empty]
    [(dre-and? re)
     (dre-and (nu (dre-and-left re))
              (nu (dre-and-right re)))]
    [(dre-negation? re)
     (if (dre-null? (nu (dre-negation-regex re)))
         dre-empty
         dre-null)]))
```

The second function is *δ*, implemented as `delta`.

```
(define (delta re ch)
  (cond
    [(not (dre? re))
     (error "not a regular expression: " re)]
    [(dre-empty? re)
     dre-null]
    [(dre-null? re)
     dre-null]
    [(dre-chars? re)
     (if (dre-chars-member? re ch) dre-empty dre-null)]
    [(dre-concat? re)
     (dre-or (dre-concat (delta (dre-concat-left re) ch)
                         (dre-concat-right re))
             (dre-concat (nu (dre-concat-left re))
                         (delta (dre-concat-right re) ch)))]
    [(dre-closure? re)
     (dre-concat (delta (dre-closure-regex re) ch) re)]
    [(dre-or? re)
     (dre-or (delta (dre-or-left re) ch)
             (delta (dre-or-right re) ch))]
    [(dre-and? re)
     (dre-and (delta (dre-and-left re) ch)
              (delta (dre-and-right re) ch))]
    [(dre-negation? re)
     (dre-negation (delta (dre-negation-regex re) ch))]
    [(dre-vector? re)
     (dre-vector (map (lambda (r) (delta r ch))
                      (dre-vector-list re)))]))
```

Ok, so now I have code for representing regular expressions, parsing them, and computing their derivatives with respect to a character. What can I do with it?

How about a simple, quick, and dirty regular expression matcher?

```
(define (dre-match-list? re list)
  (cond
    [(null? list) (dre-empty? (nu re))]
    [else (dre-match-list? (delta re (car list)) (cdr list))]))
(define (dre-match? re str) (dre-match-list? re (string->list str)))
```

`dre-match?` returns true if the expression matches the string and false otherwise. It does this by recursively computing the derivative of the starting regular expression with respect to each character in the string, in order. When the end of the string is reached, the regular expression is declared to match if it matches the empty string, using *ν* (in case the current expression has some significant structure).

Things don’t really get much simpler than that.
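For the curious, here is the chain of derivatives the matcher walks through for the expression `ab*`, worked by hand from the definitions of *δ* and *ν* (with the constructors’ simplifications applied at each step). First the string `abbb`, which matches:

$$
\begin{array}{rcl}
\delta(a \cdot b*, a) & = & (\epsilon \cdot b*)\ |\ (\emptyset \cdot \delta(b*, a)) \ =\ b* \\
\delta(b*, b) & = & \delta(b, b) \cdot b* \ =\ \epsilon \cdot b* \ =\ b* \qquad \textrm{(three times)} \\
\nu(b*) & = & \epsilon
\end{array}
$$

And then the string `acbb`, which does not, because the derivative collapses to ∅ at the `c` and stays there:

$$
\begin{array}{rcl}
\delta(a \cdot b*, a) & = & b* \\
\delta(b*, c) & = & \delta(b, c) \cdot b* \ =\ \emptyset \cdot b* \ =\ \emptyset \\
\delta(\emptyset, b) & = & \emptyset \qquad \textrm{(twice)} \\
\nu(\emptyset) & = & \emptyset
\end{array}
$$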

How does it work? Pretty well:

```
> (dre-match? (string->dre "ab") "ab")
$2 = #t
> (dre-match? (string->dre "ab*") "abbb")
$3 = #t
> (dre-match? (string->dre "ab*") "acbb")
$4 = #f
> (dre-match? (string->dre "\"[^\"]*\"") "\"A string!\"")
$5 = #t
> (dre-match? (string->dre "\"[^\"]*\"") "\"A string!\" not really")
$6 = #f
> (dre-match? (string->dre "\"[^\"]*\"") "\"A \\\"silly\\\" string!\"")
$7 = #f
> (dre-match? (string->dre "\"(\\\"\|[^\"])*\"") "\"A \\\"silly\\\" string!\"")
$8 = #t
```

I apologize for the extra backslashes there. In the fourth example, the string is `"A string!"` and the regex is `"[^"]*"`. The final example shows a regular expression which matches a quoted string with escaped quotes inside.

I know what you’re thinking. You’re saying to yourself, “We went through all that and now we have a moderately poor, slow, regular expression matcher. Yay.” But there are two points to consider:

By my count, including blank lines and comments, and some stuff that I haven’t shown (I had to implement my own set for characters, and I wrote pretty-printers for the structures), we are at about 600 lines. Yeah, that’s not great, but I haven’t been trying to compress things, either.

That’s not all of the fun we can have here. Using derivative techniques, I can convert a regular expression into a deterministic finite automaton. Without first creating a non-deterministic finite automaton, which is the usual process as described in the Dragon book, for example.

This section begins by describing deterministic finite automata and discusses one minor road bump in building one from a regular expression: character sets. Then I go into the Scheme code for finite automata and describe the direct conversion from regular expressions to DFAs. Finally, I cover how to use the resulting DFA to build a lexical scanner.

“What’s a deterministic finite automaton?” you’re asking. (Well, maybe not. If not, you can probably take another nap through this section.)

An *automaton* in this context is a machine for recognizing strings that are in a given language, as opposed to strings that are not in the language. A *finite state automaton* is a machine that has a finite number of states. (Ok, that was just redundant.) Unlike other automata which have external memory, like the stack of a pushdown automaton or the tape of a Turing machine, an FA only has a fixed number of states that it can be in. It reads a character, makes a transition into another state or possibly the same one, and then looks to read another character. If the machine is in an *accepting state* when the last character has been read, the string is declared to be in the language.

The following illustration shows a DFA for `ab*c`.

State 1 is the starting state, state 3 is accepting, and each edge is either labeled with a character set or not labeled, indicating any character.

More formally, a FA is a directed graph made of nodes (representing states), *S*, and edges (representing transitions), *T*. A *state* *s* ∈ *S* has three important properties:

- An identity: each node can be distinguished from every other node.
- Whether or not the node is a starting node for the machine. There is only one starting node in a given FA.
- Whether or not the node is an accepting node for the machine. There can be any number of accepting nodes (although there should be at least one, otherwise the language will be ∅).

A *transition* *t* ∈ *T* is a triple, (*s*, *c*, *s*′). The initial state for the transition is *s*; *c* is an input character (or a set of input characters), and *s*′ is a destination state. What the transition *t* means is that when the FA is in state *s*, if it sees a character in *c*, then it moves to state *s*′.

A *deterministic* finite automaton is an FA where every state has exactly one outgoing transition for every possible input character. As a result, when reading a character, the machine does not need to make any choices; it just matches the current state and the input character to determine the single possible destination state.

“Great!” you say, “Wonderful! But, so what?” A DFA is the fastest way to do regular expression matching. It won’t work for all the fancy features of other regular expression engines, and there are techniques that it cannot match, like using a fast string-search algorithm to look for fixed strings. But a DFA is an *O*(*n*) algorithm with very fast constants, if you can determine which transition matches an input character quickly—frequently this is a table look-up.
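To show how little work a DFA does per character, here is a throwaway, hand-built table for `ab*c`. It is only a sketch: the names `transitions`, `step`, and `dfa-accepts?` are hypothetical and are not part of the code in this post, states 1 through 3 follow the description above, and state 0 is an error state I have assumed for every character not listed.

```scheme
(use-modules (srfi srfi-1))  ; fold

;; Hypothetical transition table for ab*c: (state . ((char . next) ...)).
;; Any (state, char) pair that is not listed goes to error state 0.
(define transitions
  '((1 . ((#\a . 2)))
    (2 . ((#\b . 2) (#\c . 3)))
    (3 . ())
    (0 . ())))

(define start-state 1)
(define accepting-states '(3))

;; One table look-up per input character.
(define (step state ch)
  (let* ([row (assv-ref transitions state)]
         [next (and row (assv-ref row ch))])
    (or next 0)))

;; Run the machine over the whole string, then check the final state.
(define (dfa-accepts? str)
  (let ([final (fold (lambda (ch state) (step state ch))
                     start-state
                     (string->list str))])
    (and (memv final accepting-states) #t)))
```

With this table, `(dfa-accepts? "abbc")` and `(dfa-accepts? "ac")` are true, while `(dfa-accepts? "ab")` and `(dfa-accepts? "abca")` are false. An association list is a slow stand-in for the array a real scanner would use, but the shape of the loop is the same.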

Now, about those input characters…

When doing anything with regular expressions, there is a very basic question that I have not asked until now: What are your characters, and how are you handling them? This question suddenly becomes important right at this point.

Up to this point, the answer is pretty simple: we have positive and negative sets of finite numbers of characters and the only operations we have used have been construction and a membership test. But, I defined all those extra operations…

Here’s the deal: traditional DFA construction algorithms iterate over *all* characters in the alphabet, *Σ*, to identify the transitions leaving a state. In effect, if not in practice, they build a matrix where the rows are states and there is one column for all possible input characters; the value of each entry is the next state to enter.

That approach is feasible for small alphabets, such as 7-bit ASCII or 8-bit ISO8859-1 (or -2, or -9, and I hope you’re seeing the problem). But it will not work at all for a Unicode alphabet.

Instead, “Regular expression derivatives reexamined” provides a suggested extension to Brzozowski’s original work on DFA construction using regular expression derivatives, using character sets (which I have been following all along) and computing approximate equivalence classes of characters for each state to identify the outgoing transitions. To quote the paper,

> Given an RE *r* over *Σ*, and symbols *a* and *b* ∈ *Σ*, we say that *a* ≃_{r} *b* if and only if *δ*(*r*, *a*) ≡ *δ*(*r*, *b*) [notation modified]. The *derivative classes* of *r* are the equivalence classes *Σ*/≃_{r}.

In other words, the derivative classes divide up *Σ* in such a way that all of the symbols in a given derivative class compute the same derivative expression from a starting expression *r*.

Computing these derivative classes exactly would require iteration over the elements of *Σ*, which is exactly what we’re trying to avoid. Fortunately, the authors of the paper are able to define a function *C* by recursing over the structure of *r* that computes an approximation of the derivative classes.

In effect, given a regular expression *r*, *C* returns a partitioning of *Σ* into sets of characters, the derivative classes, that cause *r* to behave differently. This partitioning is approximate; two characters *a* and *b* that cause *r* to “do the same thing” may end up in different derivative classes, but the authors prove that all of the characters in the same derivative class cause *r* to behave the same way.

One thing that is important to note: each derivative class is a character set, the same structures that we have been using all along to handle input.

*C* is:

$$
\begin{array}{rcl}
C(\epsilon) & = & \{ \Sigma \} \\
C(S) & = & \{ S, \Sigma\setminus S \} \textrm{ where $\Sigma\setminus S$ is the negation of $S$} \\
C(r\ \cdot\ s) & = & \left\{ \begin{array}{ll}
C(r) & \textrm{ when $r$ is not nullable } \\
C(r)\ \hat\cap\ C(s) & \textrm{ otherwise }
\end{array} \right. \\
C(r | s) & = & C(r)\ \hat\cap\ C(s) \\
C(r \wedge s) & = & C(r)\ \hat\cap\ C(s) \\
C(r*) & = & C(r) \\
C(\neg r) & = & C(r)
\end{array}
$$

**Notational note** This is one of the most notationally crazed papers I’ve read in a long time. What I have rendered as

$$c_1 \hat\cap\ c_2$$

is ∧ in the paper, and while most of the definitions are equational like *C*, there is still another oddity coming up.

Speaking of that function (you know, intersection wearing a hat), it is defined as:

$$
C(r)\ \hat\cap\ C(s) = \{ S_r \cap S_s | S_r \in C(r), S_s \in C(s) \}
$$

i.e. the set consisting of the pair-wise intersection of all of the members of *C*(*r*) with the members of *C*(*s*).
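As a hand-worked example of my own (not one from the paper), here is *C* applied to *b*\* ⋅ *c*. Since *b*\* is nullable, the concatenation rule takes the pair-wise intersection of the classes of both halves:

$$
\begin{array}{rcl}
C(b* \cdot c) & = & C(b*)\ \hat\cap\ C(c) \\
& = & \{ \{b\},\ \Sigma\setminus\{b\} \}\ \hat\cap\ \{ \{c\},\ \Sigma\setminus\{c\} \} \\
& = & \{ \{b\}\cap\{c\},\ \ \{b\}\cap(\Sigma\setminus\{c\}),\ \ (\Sigma\setminus\{b\})\cap\{c\},\ \ (\Sigma\setminus\{b\})\cap(\Sigma\setminus\{c\}) \} \\
& = & \{ \emptyset,\ \{b\},\ \{c\},\ \Sigma\setminus\{b, c\} \}
\end{array}
$$

Three of the four classes are useful: `b`, `c`, and “everything else”. The empty class contains no characters, so it can never label a transition that is actually taken.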

In Scheme, *C* and its auxiliary are defined as follows:

```
(define (C re)
  (cond
    [(dre-empty? re)
     (set dre-chars-sigma)]
    [(dre-chars? re)
     (let ([elts (set-elts (dre-chars-set re))])
       (set (dre-chars elts)
            (dre-chars-neg elts)))]
    [(dre-concat? re)
     (let ([r (dre-concat-left re)]
           [s (dre-concat-right re)])
       (if (dre-empty? (nu r))
           (C-hat (C r) (C s))
           (C r)))]
    [(dre-or? re)
     (C-hat (C (dre-or-left re)) (C (dre-or-right re)))]
    [(dre-and? re)
     (C-hat (C (dre-and-left re)) (C (dre-and-right re)))]
    [(dre-closure? re)
     (C (dre-closure-regex re))]
    [(dre-negation? re)
     (C (dre-negation-regex re))]
    [(dre-null? re)
     (set dre-chars-sigma)]
    [(dre-vector? re)
     (fold C-hat
           (set dre-chars-sigma)
           (map C (dre-vector-list re)))]
    [else
     (error "unhelpful regular expression:" re)]))
```

(Oh, and the auxiliary (whose name I apparently cannot use inline, thanks MathJAX) is called `C-hat` for simplicity and defined using an SRFI-42 list comprehension.)

```
(define (C-hat r s)
  ;; pair-wise intersection of two sets of character sets
  (list->set (list-ec (:list elt-r (set-elts r))
                      (:list elt-s (set-elts s))
                      (dre-chars-intersection elt-r elt-s))))
```

And `C-hat` uses `dre-chars-intersection`, another of the `dre-chars` operations.

Now, since we have the input character question settled, I can get on with building a deterministic finite-state automaton.

As I mentioned before, a state has three important properties: an identity, whether it is a starting state and whether it is an accepting state. And, as there is only one starting node, that property is pretty unimpressive. For purposes of implementing a transformation to a DFA, a state structure such as `dre-state-t` has some extra fields.

- `dre-state-number`: This is the effective identity of the state.
- `dre-state-regex`: Given that the regular expression system I’m building is based on the calculation of the derivative of a regular expression with respect to an input character, which is itself a regular expression, it is possible to label each state with the regular expression that represents *what the automaton would accept, starting from this state.*
- `dre-state-accepting`: True if the state is accepting, false otherwise. (Ignore the regular expression vector lurking behind the curtain. More on it later.)
- `dre-state-null`: True if the state is an error state: one from which there is no accepting string.

```
(define-record-type dre-state-t   ; A state in the machine
  (dre-state-raw n re accept null) dre-state?
  (n dre-state-number)
  (re dre-state-regex)
  (accept dre-state-accepting)
  (null dre-state-null))
```

`dre-state` uses a lexically-scoped counter to generate the state numbers. If you haven’t seen an example of that particular linguistic feature, well, you can’t say that any more.

It also has two auxiliary functions: `accepting`, which sets `dre-state-accepting` depending on the regular expression label, and `nulling`, which sets `dre-state-null`, also depending on the regular expression.

```
(define dre-state
  (let ([state-count 0])
    (lambda (re)
      (define (accepting re)
        (cond
          [(dre-vector? re) (map accepting (dre-vector-list re))]
          [else (dre-empty? (nu re))]))
      (define (nulling re)
        (cond
          [(dre-vector? re)
           (every (lambda (b) b) (map nulling (dre-vector-list re)))]
          [else
           (dre-null? re)]))
      (set! state-count (+ state-count 1))
      (let* ([accept
              (accepting re)]
             [error-state
              (nulling re)])
        (dre-state-raw state-count re accept error-state)))))
```
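For readers who have not met the lexically-scoped counter idiom, here is a minimal sketch of the same idea. It is written in Python rather than the Scheme above so that it stands alone; `make_state_factory`, `new_state`, and the dictionary layout are names invented for this illustration, not part of the `dre` code.

```python
def make_state_factory():
    """Return a function that builds numbered states.

    The counter lives in the enclosing scope, so nothing outside
    new_state can read or reset it (like state-count in dre-state).
    """
    state_count = 0

    def new_state(label):
        nonlocal state_count          # rebind the enclosing variable
        state_count += 1
        return {"number": state_count, "label": label}

    return new_state


new_state = make_state_factory()
first = new_state("a*")
second = new_state("b")
print(first["number"], second["number"])
```

Each call to `make_state_factory` gets its own private counter, while the Scheme version creates exactly one counter when `dre-state` is defined.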

The machine also needs to record transitions. These contain two states, an origin and a destination, and the character set labeling the transition.

```
(define-record-type dre-transition-t ; <state, input, state'> transition
  (dre-transition state input state') dre-transition?
  (state dre-transition-origin)
  (input dre-transition-input)
  (state' dre-transition-destination))
```

Finally, there is the deterministic finite state machine itself, made up of a group of states, a group of transitions, and the identified start state.

```
(define-record-type dre-machine-t ; Finite state machine
  (dre-machine states start transitions) dre-machine?
  (states dre-machine-states)
  (start dre-machine-start)
  (transitions dre-machine-transitions))
```

The function `dre-transitions-for` picks out the transitions whose origin is a given state.

```
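(define (dre-transitions-for machine state)
  (let ([state-num (dre-state-number state)])
    ;; Keep only the transitions that leave the given state.
    ;; State numbers are compared with = rather than eq?, since eq? on
    ;; numbers is not guaranteed by the Scheme standard.
    (filter (lambda (trans)
              (= state-num
                 (dre-state-number (dre-transition-origin trans))))
            (set-elts (dre-machine-transitions machine)))))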
```

Finally, we have all of the components necessary to convert a regular expression directly into a deterministic finite automaton.

The basic idea is to create a state, *s*_{i}, labeled with a regular expression. Then, compute the derivative classes for the regular expression. For each class, pick out a representative member using `dre-chars-choice` and use it to compute the derivative of the regular expression. If this new expression is equivalent to the expression labeling an existing state (using `dre-equal?`), say *s*_{j}, record a transition from *s*_{i} to *s*_{j} associated with the derivative class. If the new expression is not equivalent to any existing state’s label, create a new state for it, *s*_{k}, and record the transition from *s*_{i} to *s*_{k}. Follow this process recursively until all states have had all derivative classes examined.

The result is a deterministic finite state automaton.
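The construction can be sketched in a few dozen lines. This is a Python illustration under simplifying assumptions, not the `dre` implementation: regular expressions are plain tuples, the constructors `cat`, `alt`, and `star` do just enough simplification for equivalent labels to compare equal, and, instead of derivative classes, the loop tries every character of a small explicit alphabet. All of the names here are invented for the example.

```python
NULL = ("null",)   # matches nothing (the ∅ expression)
EPS = ("eps",)     # matches only the empty string

def char(c):
    return ("chr", c)

def cat(r, s):
    if r == NULL or s == NULL:
        return NULL
    if r == EPS:
        return s
    if s == EPS:
        return r
    return ("cat", r, s)

def alt(r, s):
    # Flatten nested alternations into a set, so that order, nesting
    # and duplicates do not produce distinct labels.
    parts = set()
    for x in (r, s):
        if x[0] == "alt":
            parts |= x[1]
        elif x != NULL:
            parts.add(x)
    if not parts:
        return NULL
    if len(parts) == 1:
        return next(iter(parts))
    return ("alt", frozenset(parts))

def star(r):
    if r in (NULL, EPS):
        return EPS
    return r if r[0] == "star" else ("star", r)

def nullable(r):
    """True if r matches the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("null", "chr"):
        return False
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    return any(nullable(x) for x in r[1])          # alt

def deriv(r, c):
    """Derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ("null", "eps"):
        return NULL
    if tag == "chr":
        return EPS if r[1] == c else NULL
    if tag == "star":
        return cat(deriv(r[1], c), r)
    if tag == "cat":
        first = cat(deriv(r[1], c), r[2])
        return alt(first, deriv(r[2], c)) if nullable(r[1]) else first
    result = NULL                                   # alt
    for x in r[1]:
        result = alt(result, deriv(x, c))
    return result

def build_dfa(r, alphabet):
    """Explore derivatives from r, numbering each new label as a state."""
    states = {r: 0}          # regular expression label -> state number
    transitions = {}         # (state number, character) -> state number
    worklist = [r]
    while worklist:
        q = worklist.pop()
        for c in alphabet:
            qc = deriv(q, c)
            if qc not in states:
                states[qc] = len(states)
                worklist.append(qc)
            transitions[(states[q], c)] = states[qc]
    accepting = {n for label, n in states.items() if nullable(label)}
    return states, transitions, accepting

# (a|b)*abb
regex = cat(star(alt(char("a"), char("b"))),
            cat(char("a"), cat(char("b"), char("b"))))
states, transitions, accepting = build_dfa(regex, "ab")
print(len(states), sorted(accepting))
```

Iterating over an explicit alphabet is only workable for tiny alphabets, which is exactly the problem that derivative classes and `dre-chars-choice` solve in the Scheme code.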

**Another note on notation** This is the final place where “Regular-expression derivatives reexamined” gets a little wacky. The algorithm for this transformation, unlike everything else in the paper, is given in mutually recursive pseudocode. Fortunately, it is very close to the Scheme code below.

```
(define (dre->dfa r)
  (define (goto q S engine)
    (let* ([Q (car engine)]
           [d (cdr engine)]
           [c (dre-chars-choice S)]
           [qc (delta (dre-state-regex q) c)]
           [q' (set-find Q (lambda (q') (dre-equal? (dre-state-regex q') qc)))])
      (if q'
          (cons Q (set-union d (set (dre-transition q S q'))))
          (let ([q' (dre-state qc)])
            (explore (set-union Q (set q'))
                     (set-union d (set (dre-transition q S q')))
                     q')))))
  (define (explore Q d q)
    (fold (lambda (S engine) (goto q S engine))
          (cons Q d)
          (remove dre-chars-empty?
                  (set-elts (C (dre-state-regex q))))))
  (let* ([q0 (dre-state r)]
         [engine (explore (set q0) (set) q0)]
         [states (car engine)]
         [transitions (cdr engine)])
    (dre-machine states q0 transitions)))
```

There are a few minor nits left to pick.

`explore` computes the derivative classes for the labels of each visited state, then calls `goto` with each. It filters out empty character sets (using `dre-chars-empty?`) since (a) those don’t have any members to be representative, and (b) an empty set wouldn’t be a meaningful transition anyway.

`goto` picks a representative element from the derivative class / character set, uses it to compute the derivative, and makes the choice on whether a new state should be added. If you recall, `dre-chars-choice` returns a character from the set if the set is positive and a newly generated Scheme symbol if it isn’t. The symbol is a valid argument to `dre-chars-member?`, which is used when computing the derivative. That is good, since the symbol is used to represent *any character outside the (negative) set,* which would otherwise be potentially painful to identify.

Computing a DFA, any way you do it, is a fairly expensive operation. Further, deterministic regular expression engines are unable to use the fancy, more-powerful-than-regular-expressions features of non-deterministic engines such as PCRE. So, what can you do with one?

There is one simple answer: lexical analysis. The scanner part of a parser.

The limitations of a DFA are less cumbersome when you are picking the next token from an input stream. The DFA can be precomputed, meaning that the expense of creating the DFA does not hurt the parser execution. Further, according to the authors of “Regular-expression derivatives reexamined”, the DFA created using this method, while it may have more than the optimal number of states, has fewer states than those created from the traditional method. And finally, running a DFA across an input string is an *O*(*n*) operation, where *n* is the length of the input.
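To make the *O*(*n*) claim concrete, here is a minimal sketch, in Python, of running a precomputed DFA over a string: one table lookup per input character. The table below is written out by hand for the language "strings of a and b ending in abb"; the dictionary layout and the name `run_dfa` are assumptions made for this example.

```python
def run_dfa(transitions, start, accepting, text):
    """Return True if the DFA accepts text; one lookup per character."""
    state = start
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:        # no transition: the input cannot match
            return False
    return state in accepting


# Hand-written table: (state, character) -> next state.
ENDS_IN_ABB = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}

print(run_dfa(ENDS_IN_ABB, 0, {3}, "aabb"))
```

The work done per character does not depend on the size of the original regular expression, only on the cost of the table lookup.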

Here’s how it works: a lexical analyzer is specified as a list of regular expressions; one for identifiers, one for numbers, one for each keyword, etc. This is where the *regular vectors* enter the picture. Each of the individual expressions is an element of the vector, which is treated as a single, extended regular expression while creating the DFA. At the end of the process, you have a single finite automaton whose states are the combination of the individual regular expressions.

And this is where we wrap up `dre-state-accepting` and `dre-state-null`. A *null state*, *s*_{∅}, is one in which all of the elements of the vector are `dre-null`, ∅. In this state, it is impossible for the machine to match any input, no matter what follows. As a result, a state *s*_{i} with a transition into a null state marks the end of a token.

Which token? `dre-state-accepting`, called on *s*_{i}, returns a list of booleans, one for each element of the regular vector. If, in state *s*_{i}, there is only one accepting element, then that element identifies the regular expression for a token *t*, and *t* is the token that should be passed on.

If there is more than one accepting element in state *s*_{i}, then theoretically you might have an error in your scanner specification. However, treating that as an error places very harsh requirements on the token regular expressions; the expressions for keywords, such as “if” and “then”, frequently overlap with the expressions for identifiers, say. It would be better to use some ordering, such as the order of token specifications, to resolve this ambiguity.
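One way to apply such an ordering is to take the first accepting element, so that tokens listed earlier in the specification win. A minimal Python sketch, assuming the accepting flags arrive as a list of booleans in specification order; `pick_token` and the token names are hypothetical:

```python
def pick_token(token_names, accepting_flags):
    """Return the first token whose expression accepts, or None."""
    for name, accepts in zip(token_names, accepting_flags):
        if accepts:
            return name
    return None


# Keywords are listed before identifiers, so "if" scans as a keyword
# even though the identifier expression accepts it too.
TOKENS = ["IF", "THEN", "IDENT"]
print(pick_token(TOKENS, [True, False, True]))
```

With this rule the specification order becomes part of the scanner's meaning, so keywords need to be listed before the identifier expression.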

On the other hand, if there are *no* accepting elements in *s*_{i}, then you do have a problem. This would be a lexical error. Fortunately, I note that it would be trivial to identify states creating this error at the time the DFA is generated, by scanning the incoming transitions to *s*_{∅}.
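That check might look something like the following Python sketch. It assumes transitions are available as (origin, input, destination) triples and that each state number maps to its list of accepting flags; the function name and the sample data are invented for the illustration.

```python
def lexical_error_states(transitions, null_state, accepting):
    """States that step into null_state while no vector element accepts."""
    suspects = {
        origin
        for origin, _label, dest in transitions
        if dest == null_state
        and origin != null_state
        and not any(accepting[origin])
    }
    return sorted(suspects)


# State 9 plays the role of the null state in this made-up machine.
sample_transitions = [
    (1, "a", 2), (1, "b", 9),
    (2, "a", 2), (2, "b", 9),
    (9, "a", 9), (9, "b", 9),
]
sample_accepting = {1: [False, False], 2: [True, False], 9: [False, False]}
print(lexical_error_states(sample_transitions, 9, sample_accepting))
```

Here state 1 is reported because it can fall into the null state without any element accepting, while state 2 is not, because its first element accepts.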

So, in one paper and in a fairly small amount of code (about 770 lines of my very verbose Scheme), I have demonstrated a regular expression parser, matcher, and the conversion to a deterministic finite automaton, with extensions to allow the straightforward construction of a lexical analyzer.

And some people think regular expression derivatives aren’t cool.

Back up at the top, I mentioned that I was interested in parser derivatives because regular expression derivatives were so nice. Unfortunately, things like On the Complexity and Performance of Parsing with Derivatives have brought me to believe that parser derivatives are less cool. Sure, the specification is even simpler than that of regular expressions here. On the other hand, though, there are a number of complications that require memoization, laziness, and computing fixed points.

None of those is particularly difficult, but the combination will be pretty hairy in a more complicated language than Scheme. Further, getting acceptable performance seems to require very tight coding to avoid very slow parsing.

As a result, although I have here the beginning of a very nice scanner using regular expression derivatives, I am currently exploring other context-free grammar parsing algorithms; specifically I am focusing on the Earley algorithm. Let’s see how that goes.

Ahem. Let’s see: Link o’ the day: Matt Might on parsing with derivatives, Parsing with derivatives: introduction, Parsing with derivatives: recursion, and Parsing with derivatives: compaction.↩

Software Tools in Haskell, Chapter 5: Text patterns; Monads and regular expressions; Variations on the theme of monadic regular expressions: Abstraction; Variations on the theme of monadic regular expressions: Records; Variations on the theme of monadic regular expressions: Back references; as well as a post on the specific paper that I am going to be looking at here: Practical regular expression derivatives.↩

Many regular expression engines in use have expressions that are extended with more powerful features, like back-references and the giant bag of things in Perl compatible regular expressions. Strictly speaking, the result is *not* a regular expression. Mathematically, they are much more powerful than the basic expressions that I have described here, and further (jumping ahead), they cannot be converted to deterministic finite state machines (to my knowledge). So I’m ignoring them.↩

The mathematical theory is fine with multiple starting nodes, but handling them simply uses nondeterminism, which I’m avoiding like the plague. Further, automata constructed using this method only have one start state anyway.↩

These are like footnotes, but more warm and comforting. Like feety pajamas.↩

I ended by wondering if SDE (Syntax Directed Editing) might be the solution. What I’ve discovered is that SDE is a wonderful way of guessing someone’s age: those under 40 or so have rarely heard of SDE; and those above 45 [2] tend to either convulse in fits of laughter when it’s mentioned, become very afraid, or assume I’m an idiot. In other words, those who know of SDE have a strong reaction to it (which, almost invariably, is against SDE).

[2] I’m not sure if my sample size of 40-45 year olds is insufficient, or whether they really are the “inbetween” generation when it comes to SDE. Fortunately, I will sleep soundly enough without knowing the answer to this particular question.

His statement is entirely true. I note that I am older than 45.
