In my last post, I presented a brief introduction to Frama-C and to the process of verifying properties about a very simple C function, a brute force string search. This time around, I intended to do roughly the same thing, using a slightly more complicated function, a faster string searching algorithm. Unfortunately, in doing so, I found a bug in the algorithm. Admittedly, the bug is rather minor and should not affect the actual behavior of an implementation in practice, but still, it is doing something it shouldn’t ought to be doing.

The string searching algorithm I am looking at this time is Quick Search, so named by Daniel M. Sunday in 1990 because it “is a simple, fast, practical algorithm [and] because it can be both coded and debugged quickly” (“A Very Fast Substring Search Algorithm” [PDF]). It is my personal favorite string search because it is, indeed, simple, fast, and practical. Let me quickly show why.

The algorithm I presented last time is slow. It is slow because it tries to match the needle (the string to be looked for) against the haystack (the string being searched) at every possible position in the haystack, one after another. As a result, it does potentially *n* character comparisons (trying to match a needle of length *n*), *h* times. (*h* is the length of the haystack; technically, it is *h-n* times, but the difference isn’t very important.) In short, it’s an *O(nh)* algorithm, which is quadratic when the needle and haystack are of comparable length.

On the other hand, the current winner of the string searching algorithm competition is known as Boyer-Moore, an algorithm I’m not really going to try to describe in detail. It’s complicated. Boyer-Moore, like the other faster searches, achieves its performance by not searching for the needle at every location in the haystack; rather it does some extra work up-front to be able to advance by more than one location on a failing comparison. (All algorithms still potentially do *n* character comparisons for each visited location; there is no better way to handle that. The faster algorithms can only be smarter by reducing the *h*-side of things.) Boyer-Moore uses two tricks to increase its advances:

One, sometimes called a “bad character shift”, is the work-horse; it is normally what is used to skip forward in the haystack.

The second, sometimes called a “good suffix shift”, is there to prevent *O(n^2)* behavior in the cases where the bad character shift would allow it.

Combining these two tricks, Boyer-Moore is an *O(n)* algorithm (specifically *O(3h - h/n)* according to the reference I’m looking at right now, if I’m correctly translating the variable names). It is fast. But it is also complicated.

(As an aside, there are other fast string searching algorithms with similar guarantees; Knuth-Morris-Pratt comes to mind. But Boyer-Moore is popular, for some reason.)

Given the complexity, other researchers presented algorithms that simplify Boyer-Moore by removing bits, typically the good suffix shift. Quick Search is one of those algorithms. (Note that this simplification may have performance benefits; Boyer-Moore does a lot of work and if Quick Search does less but still gets the algorithmic benefits of being smart in the normal case, then in the normal case Quick Search will be faster. But it still doesn’t have the linear worst-case guarantee of Boyer-Moore.)

Without further ado, here’s Quick Search.

As I mentioned above, faster searches do some work up front to reduce the work they have to do later. In the case of Quick Search, hereafter called QS, the extra work is to examine the needle to create a “bad character shift” table. When a comparison of the needle with part of the haystack fails, QS looks to the character of the haystack that follows the failed match and identifies the largest shift it can make for the next comparison without missing a possible match. If, for example, the next character in the haystack does not occur in the needle, then the shift can be made by the length of the needle, plus one. Perhaps an example might be useful.

Suppose the haystack is “xxxxxxxabcbxx” and the needle is “abcb”. In code I’ll show momentarily, QS creates a `bad_shift` table from the needle that looks like:

```
{
default -> 5,
'a' -> 4,
'b' -> 1,
'c' -> 2
}
```

The initial state of the algorithm looks like this:

```
xxxxxxxabcbxx
abcb
```

This comparison obviously fails, since ‘x’ is not equal to ‘a’. The character after the needle is ‘x’; the shift associated with ‘x’ is 5, so the next state is:

```
xxxxxxxabcbxx
     abcb
```

This comparison fails as well, and ‘c’ is the next character to check. The resulting shift is 2:

```
xxxxxxxabcbxx
       abcb
```

This comparison succeeds. Yay!

Using the `bad_shift` table to advance in the haystack potentially improves the performance of the algorithm greatly. The table is initialized by `make_bad_shift`, the following function.

```
void
make_bad_shift (char *needle, int n, int bad_shift[], int chars)
{
  int i;
  for (i = 0; i < chars; ++i) {
    bad_shift[i] = n + 1;
  }
  for (i = 0; i < n; ++i) {
    bad_shift[needle[i]] = n - i;
  }
}
```

There are two kinds of values in the bad shift table:

- The default values for `bad_shift` are one more than the length of the needle. Since a character that doesn’t appear in the needle cannot participate in the match, the shift puts the start of the needle past the character used as an index into the table.
- The remaining values in `bad_shift` are set to the offset of the last instance of the character from the end of the needle. The shift aligns the last instance of the character in the needle with the matching instance of the character in the haystack.
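As a minimal sketch of how those two kinds of values come about, the following builds the table for the example needle. The name `build_shift_table`, the `TABLE_SIZE` macro, and the use of `unsigned char` (so that every byte is a non-negative index) are assumptions of this sketch rather than part of the listing discussed here.

```c
#include <limits.h>

#define TABLE_SIZE (UCHAR_MAX + 1)

/* Hypothetical helper: fill shift[] for a needle of length n. */
static void
build_shift_table (const unsigned char *needle, int n, int shift[TABLE_SIZE])
{
  int i;
  /* Default: the character is not in the needle, so skip past it. */
  for (i = 0; i < TABLE_SIZE; ++i) {
    shift[i] = n + 1;
  }
  /* Later occurrences overwrite earlier ones, leaving the distance
   * from the last occurrence of each character to the end. */
  for (i = 0; i < n; ++i) {
    shift[needle[i]] = n - i;
  }
}
```

For the needle “abcb”, this yields 4 for 'a', 1 for 'b', 2 for 'c', and 5 for every other character, matching the table shown earlier.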

Once you understand the bad character shift, the search function itself, `QS`, should be easy to understand. That is, if you understood the brute force version in the last post.

```
int
QS (char *needle, int n, char *haystack, int h)
{
  int i, j, bad_shift[UCHAR_MAX + 1];
  /* Preprocessing */
  make_bad_shift (needle, n, bad_shift, UCHAR_MAX + 1);
  /* Searching */
  i = 0;
  while (i <= h - n) {
    for (j = 0; j < n && needle[j] == haystack[i + j]; ++j);
    if (j >= n) {
      return i;
    }
    i += bad_shift[haystack[i + n]]; /* shift */
  }
  return -1;
}
```

While the brute force search advanced by one character after each failing comparison, Quick Search advances by at least one (if the last character in the needle matches the next character in the haystack) and often more.

By the way, did you see the bug? It’s there, and once you know what it is, it’s rather glaring. And no, I’m not talking about the use of `char` for the needle and haystack, even though a signed character can be less than 0, which would do bad things when used as an index into the `bad_shift` table. I’ll have to fix that, too, though. Anyway, if you didn’t see the bug, you’re in good company; neither did:

- Charras, Christian; Thierry Lecroq. “Exact string matching algorithms,” http://www-igm.univ-mlv.fr/~lecroq/string/index.html.
- Lecroq, Thierry. “Experimental results on string matching algorithms,” *Software - Practice & Experience* 25(7):727-765, 1995. http://www-igm.univ-mlv.fr/~lecroq/articles/spe95.pdf.
- Lin, Jie; Donald Adjeroh and Yue Jiang. “A faster quick search algorithm,” *Algorithms* 7(2), 2014. http://www.mdpi.com/1999-4893/7/2/253.
- Sunday, Daniel M. “A very fast substring search algorithm,” *Communications of the ACM*, Vol. 33, No. 8, Aug. 1990. https://csclub.uwaterloo.ca/~pbarfuss/p132-sunday.pdf.

I tripped over the problem when starting to write this post, specifically when I was making the goofy shift diagrams above. Even when I saw the problem, though, I didn’t see what to do about it. But, rather than point out the bug now, how about I go on to the proof of safety and let Frama-C show the problem?

If you read the previous post, the contract and internal assertions for the function `QS` are nothing surprising. The first three requires clauses and the assigns clause are copied directly from `brute_force`, and the final requires clause is nothing too unusual. (To be honest, I’m not sure why it’s required. I have already adjusted the `char` types to be `unsigned char`, so the maximum value of any element of *needle* should already be `UCHAR_MAX`.)
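To illustrate why the signedness matters for table indices, here is a small sketch of the conversion involved. The helper name `table_index` is hypothetical; the point is only that converting to `unsigned char` brings any byte into the range 0 to `UCHAR_MAX`, whether or not plain `char` is signed on a given platform.

```c
#include <limits.h>

/* Hypothetical helper: map any char to an index in 0 .. UCHAR_MAX. */
static int
table_index (char c)
{
  /* Where plain char is signed, a byte such as 0xE9 is negative;
   * converting to unsigned char wraps it back into range. */
  return (unsigned char) c;
}
```

Declaring the needle and haystack as `unsigned char *` from the start, as the annotated version does, avoids the need for a conversion at every table lookup.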

Inside the implementation, the loop assigns clauses are present to prevent anything untoward from being yanked on, and the loop invariant clauses assert that the loop variables have bounds.

```
/*@
@ // Original requirements for searching
@ requires \valid(needle + (0 .. n-1)) && 0 <= n < INT_MAX;
@ requires \valid(haystack + (0 .. h-1)) && 0 <= h < INT_MAX;
@ requires n <= h;
@ // the elements of needle are valid indices into bad_shift
@ requires \forall int i; 0 <= i < n ==> 0 <= needle[i] < UCHAR_MAX + 1;
@ assigns \nothing;
*/
int
QS (unsigned char *needle, int n, unsigned char *haystack, int h)
{
  int i, j, bad_shift[UCHAR_MAX + 1];
  /* Preprocessing */
  make_bad_shift(needle, n, bad_shift, UCHAR_MAX + 1);
  /* Searching */
  i = 0;
  /*@
    @ loop assigns i, j;
    @ loop invariant 0 <= i <= h + 1;
    */
  while (i <= h - n) {
    /*@
      @ loop assigns j;
      @ loop invariant 0 <= j <= n;
      */
    for (j = 0; j < n && needle[j] == haystack[i + j]; ++j);
    if (j >= n) {
      return i;
    }
    i += bad_shift[ haystack[i + n] ]; /* shift */
  }
  return -1;
}
```

One interesting detail is the loop invariant for the outer loop: 0 ≤ *i* ≤ *h* + 1. The upper bound cannot be *h* − *n*, copied from the condition of the `while`, because the shift is being calculated from the `bad_shift` table. However, I already know that the maximum value of *i* for the last iteration of the loop is *h* − *n* and the maximum value in the table is *n* + 1, so the maximum value that *i* can achieve on the final iteration is

(*h* − *n*)+(*n* + 1) = *h* + 1

Which brings me to the safety of `make_bad_shift`. The requirement for bounds on the bad shift table means that getting a proof for the safety of `QS` calls for more complexity in the contract and proof for `make_bad_shift`. Here is its contract and function header:

```
/*@
@ // needle and bad_shift are valid arrays.
@ requires \valid(needle + (0 .. n-1)) && 0 <= n < INT_MAX;
@ requires \valid(bad_shift + (0 .. chars-1)) && 0 <= chars < INT_MAX;
@ // needle and bad_shift are separate
@ requires \separated(needle + (0 .. n-1), bad_shift + (0 .. chars-1));
@ // the elements of needle are valid indices into bad_shift
@ requires \forall int i; 0 <= i < n ==> 0 <= needle[i] < chars;
@ // this function initializes bad_shift
@ assigns *(bad_shift + (0 .. chars-1));
@ // bad_shift is initialized to be between 1 and n+1
@ ensures \forall int i; 0 <= i < chars ==> 1 <= bad_shift[i] <= n+1;
*/
void
make_bad_shift (unsigned char *needle, int n, int bad_shift[], int chars)
```

Walking down through the clauses in the contract, I have:

- Precondition requirements that both `needle` and `bad_shift` are valid arrays of length *n* and `chars`, respectively.
- A requirement that `needle` and `bad_shift` (the actual arrays, not the pointers) don’t overlap. If `needle` and `bad_shift` overlapped, then writing to `bad_shift` would change `needle`, and potentially all of the other properties would go out the window.
- A requirement that all of the elements of `needle` are usable as indices into `bad_shift`.
- An assigns clause indicating that `make_bad_shift` modifies the array pointed to by `bad_shift`.
- And finally, and possibly most importantly, a postcondition indicating that `bad_shift`, after being initialized, will contain values between 1 and *n* + 1. The postcondition is required to make the safety proof of the outer loop in `QS` work.

The implementation of `make_bad_shift` needs six clauses, three for each loop.

```
void
make_bad_shift (unsigned char *needle, int n, int bad_shift[], int chars)
{
  int i;
  /*@
    @ loop assigns i, *(bad_shift + (0 .. chars-1));
    @ loop invariant 0 <= i <= chars;
    @ loop invariant \forall int k; 0 <= k < i ==> bad_shift[k] == n + 1;
    */
  for (i = 0; i < chars; ++i) {
    bad_shift[i] = n + 1;
  }
  /*@
    @ loop assigns i, *(bad_shift + (0 .. chars-1));
    @ loop invariant 0 <= i <= n;
    @ loop invariant \forall int k; 0 <= k < chars ==> 1 <= bad_shift[k] <= n+1;
    */
  for (i = 0; i < n; ++i) {
    bad_shift[needle[i]] = n - i;
  }
}
```

The first loop

- modifies *i* and the `bad_shift` array,
- uses a loop variable *i* that ranges from 0 to `chars`, and
- maintains that all of the elements of `bad_shift` that have been initialized have been initialized to *n* + 1.

Following the first loop, all of the elements of `bad_shift` have been set to *n* + 1.

The second loop

- modifies *i* and the `bad_shift` array,
- uses *i* ranging from 0 to *n* (in other words, through the `needle`), and
- maintains that all of the elements of `bad_shift` are always between 1 and *n* + 1.

The last clause is different from the loop invariants seen before, in that it applies to all of the elements of `bad_shift` initially and at all end-of-loop states. But it does establish the function postcondition.

The assorted assertions, plus those generated automatically to avoid run-time errors, turn into 58 goals for the proof engine. 57 of those are satisfied. The last one is…

Here is the output of `-wp-print` for the one remaining failing goal:

```
Goal Assertion 'rte,mem_access' (file quicksearch.c, line 73):
Let a = shift_uint8(needle_0, j).
Let x = i + j.
Let m = Malloc_0[L_bad_shift_690 <- 256].
Let x_1 = i + n.
Let a_1 = shift_A256_sint32(global(L_bad_shift_690), 0).
Let a_2 = shift_sint32(a_1, 0).
Let a_3 = shift_uint8(needle_0, 0).
Let x_2 = Mint_0[shift_uint8(haystack_0, x_1)].
Let a_4 = shift_uint8(haystack_0, to_sint32(x_1)).
Let x_3 = Mint_0[a_4].
Assume {
Type: is_sint32(h) /\ is_sint32(i) /\ is_sint32(j) /\ is_sint32(n) /\
is_uint8(x_2) /\ is_uint8(x_3) /\
is_sint32(Mint_0[shift_sint32(a_1, x_2)]) /\
is_sint32(Mint_0[shift_sint32(a_1, x_3)]).
(* Heap *)
Have: linked(Malloc_0) /\ (region(haystack_0.base) <= 0) /\
(region(needle_0.base) <= 0).
(* Pre-condition *)
Have: (0 <= n) /\ (n <= 2147483646) /\ valid_rw(Malloc_0, a_3, n).
(* Pre-condition *)
Have: (0 <= h) /\ (h <= 2147483646) /\
valid_rw(Malloc_0, shift_uint8(haystack_0, 0), h).
(* Pre-condition *)
Have: n <= h.
(* Pre-condition *)
Have: forall i_1 : Z. let x_4 = Mint_1[shift_uint8(needle_0, i_1)] in
((i_1 < n) -> ((0 <= i_1) -> ((0 <= x_4) /\ (x_4 <= 255)))).
(* Call 'make_bad_shift' *)
Have: valid_rw(m, a_3, n) /\ separated(a_3, n, a_2, 256) /\
(forall i_1 : Z. let x_4 = Mint_1[shift_uint8(needle_0, i_1)] in
((i_1 < n) -> ((0 <= i_1) -> ((0 <= x_4) /\ (x_4 <= 255))))) /\
(forall i_1 : Z. let x_4 = Mint_0[shift_sint32(a_1, i_1)] in
((0 <= i_1) -> ((i_1 <= 255) -> ((0 < x_4) /\ (x_4 <= (1 + n)))))).
(* Call Effects *)
Have: havoc(Mint_1, Mint_0, a_2, 256).
(* Invariant *)
Have: (0 <= i) /\ (i <= (1 + h)).
(* Assertion 'rte,signed_overflow' *)
Have: n <= (2147483648 + h).
(* Assertion 'rte,signed_overflow' *)
Have: h <= (2147483647 + n).
(* Then *)
Have: x_1 <= h.
(* Invariant *)
Have: (0 <= j) /\ (j <= n).
Have: j < n.
(* Assertion 'rte,mem_access' *)
Have: valid_rd(m, a, 1).
(* Assertion 'rte,signed_overflow' *)
Have: (-2147483648) <= x.
(* Assertion 'rte,signed_overflow' *)
Have: x <= 2147483647.
(* Assertion 'rte,mem_access' *)
Have: valid_rd(m, shift_uint8(haystack_0, to_sint32(x)), 1).
(* Else *)
Have: Mint_0[a] != Mint_0[shift_uint8(haystack_0, x)].
}
Prove: valid_rd(m, a_4, 1).
Prover Alt-Ergo returns Unknown (Qed:8ms) (4.3s)
```

Whoo. Nice. To summarize that mess, the goal to be proved is

```
Let m = Malloc_0[L_bad_shift_690 <- 256].
Let x_1 = i + n.
Let a_4 = shift_uint8(haystack_0, to_sint32(x_1)).
valid_rd(m, a_4, 1).
```

Or in English, “mumble, mumble, valid read, `bad_shift`, mumble, `haystack`, *i* + *n*, mumble”.

Line 73 is

`i += bad_shift[ haystack[i + n] ]; /* shift */`

If I break the right hand side apart, the error can be found in the expression `haystack[i + n]`. (Can you see it now?)

The maximum value of *i* in the loop is *h* − *n*; therefore the maximum value of the array index in that expression is *h* − *n* + *n* or *h*. But the array itself is *h* elements long; it only goes from 0 to *h* − 1.

In that line, the code tries to access one element past the end of the valid `haystack` array. This is, as I mentioned back at the start, not normally a problem. With C strings, that element will be the 0 character marking the end of the string, and 0 is a valid index into `bad_shift`. If it is not a C string, it will access some random byte immediately after the array, which should not cause anything to blow up, and the byte will also be a valid index into `bad_shift`, because any value < 256 is valid. Further, all of the values in `bad_shift` are at least one, so in the next step, when the code returns to the while conditional, *i* will be greater than *h* − *n* and the loop will terminate. But still, it’s the principle of the thing.

One way to fix the bug would be to insert a test just before line 73, checking if *i* is *h* − *n*. At this point, the algorithm already ensures that no match has been found; this statement merely breaks out of the loop before trying the invalid access to `haystack`.

`if (i == h - n) { break; }`
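In context, that first fix might look something like the following sketch. The function name `qs_with_break` is hypothetical, the table setup is inlined so the block stands alone, and the ACSL annotations are omitted; it assumes 0 ≤ *n* ≤ *h* as in the contract above.

```c
#include <limits.h>

/* Sketch of Quick Search with an early exit before the look-ahead read.
 * Returns the index of the first match, or -1 if there is none. */
static int
qs_with_break (const unsigned char *needle, int n,
               const unsigned char *haystack, int h)
{
  int i, j, bad_shift[UCHAR_MAX + 1];

  /* Preprocessing, as in make_bad_shift. */
  for (i = 0; i <= UCHAR_MAX; ++i) {
    bad_shift[i] = n + 1;
  }
  for (i = 0; i < n; ++i) {
    bad_shift[needle[i]] = n - i;
  }

  /* Searching */
  i = 0;
  while (i <= h - n) {
    for (j = 0; j < n && needle[j] == haystack[i + j]; ++j) { }
    if (j >= n) {
      return i;
    }
    if (i == h - n) {
      break;                    /* haystack[i + n] would be haystack[h] */
    }
    i += bad_shift[haystack[i + n]];    /* shift */
  }
  return -1;
}
```

On the earlier example, `qs_with_break` finds “abcb” in “xxxxxxxabcbxx” at index 7 without ever reading past the end of the haystack.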

On the other hand, this is supposed to be a fast string search algorithm, and inserting an extra condition in the loop may harm performance. An alternative would be to change the outer loop condition to be `i < h - n` and to add a test for the final match after the outer loop.

```
if (i == h - n) {
  /*@
    @ loop assigns j;
    @ loop invariant 0 <= j <= n;
    */
  for (j = 0; j < n && needle[j] == haystack[i + j]; ++j);
  if (j >= n) {
    return i;
  }
}
```
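Put together as a complete function, the second variant might look like this sketch. The name `qs_split_loop` is hypothetical, the table setup is inlined so the block stands alone, and the ACSL annotations are left out; it assumes 0 ≤ *n* ≤ *h*.

```c
#include <limits.h>

/* Sketch of Quick Search with the last alignment handled after the loop.
 * Returns the index of the first match, or -1 if there is none. */
static int
qs_split_loop (const unsigned char *needle, int n,
               const unsigned char *haystack, int h)
{
  int i, j, bad_shift[UCHAR_MAX + 1];

  /* Preprocessing, as in make_bad_shift. */
  for (i = 0; i <= UCHAR_MAX; ++i) {
    bad_shift[i] = n + 1;
  }
  for (i = 0; i < n; ++i) {
    bad_shift[needle[i]] = n - i;
  }

  /* Searching: inside the loop i + n < h, so the look-ahead is in bounds. */
  i = 0;
  while (i < h - n) {
    for (j = 0; j < n && needle[j] == haystack[i + j]; ++j) { }
    if (j >= n) {
      return i;
    }
    i += bad_shift[haystack[i + n]];    /* shift */
  }

  /* One last alignment, flush with the end of the haystack. */
  if (i == h - n) {
    for (j = 0; j < n && needle[j] == haystack[i + j]; ++j) { }
    if (j >= n) {
      return i;
    }
  }
  return -1;
}
```

The trade-off is a duplicated comparison loop in exchange for keeping the extra test out of the main loop.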

Some quick benchmarking should tell which option is faster, or if it matters at all.

With either fix, all of the goals are satisfied and the Quick Search function is verified as being safe.

Functional correctness, or progress, is another kettle of fish. In order to prove that the code finds the needle in the haystack (or possibly finds the left-most needle in the haystack), or not, it is necessary to show that when the algorithm skips characters using the `bad_shift` table, it cannot miss a match. I see why it can’t, but I have not figured out how to express that in ACSL. Watch this space.
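Short of a proof, one way to build some confidence is to compare Quick Search against the brute force search on every small input. The sketch below is testing, not verification, and everything in it (the names `bf_search`, `qs_search`, `nth_string`, and `count_mismatches`, the two-letter alphabet, and the `MAX_LEN` bound) is an assumption made for illustration.

```c
#include <limits.h>

#define MAX_LEN 8   /* assumption: small enough to enumerate quickly */

/* Reference implementation: leftmost match, or -1. */
static int
bf_search (const unsigned char *needle, int n,
           const unsigned char *haystack, int h)
{
  int i, j;
  for (i = 0; i <= h - n; ++i) {
    for (j = 0; j < n && needle[j] == haystack[i + j]; ++j) { }
    if (j >= n) {
      return i;
    }
  }
  return -1;
}

/* Quick Search with the early-exit fix described above. */
static int
qs_search (const unsigned char *needle, int n,
           const unsigned char *haystack, int h)
{
  int i, j, bad_shift[UCHAR_MAX + 1];
  for (i = 0; i <= UCHAR_MAX; ++i) {
    bad_shift[i] = n + 1;
  }
  for (i = 0; i < n; ++i) {
    bad_shift[needle[i]] = n - i;
  }
  i = 0;
  while (i <= h - n) {
    for (j = 0; j < n && needle[j] == haystack[i + j]; ++j) { }
    if (j >= n) {
      return i;
    }
    if (i == h - n) {
      break;
    }
    i += bad_shift[haystack[i + n]];
  }
  return -1;
}

/* Write the k-th string of length len over the alphabet {a, b}. */
static void
nth_string (unsigned k, int len, unsigned char *buf)
{
  int i;
  for (i = 0; i < len; ++i) {
    buf[i] = (unsigned char) ('a' + ((k >> i) & 1u));
  }
}

/* Count needle/haystack pairs (1 <= n <= h <= MAX_LEN) on which the
 * two searches disagree. */
static int
count_mismatches (void)
{
  unsigned char needle[MAX_LEN], haystack[MAX_LEN];
  unsigned hk, nk;
  int h, n, bad = 0;
  for (h = 1; h <= MAX_LEN; ++h) {
    for (hk = 0; hk < (1u << h); ++hk) {
      nth_string (hk, h, haystack);
      for (n = 1; n <= h; ++n) {
        for (nk = 0; nk < (1u << n); ++nk) {
          nth_string (nk, n, needle);
          if (qs_search (needle, n, haystack, h)
              != bf_search (needle, n, haystack, h)) {
            ++bad;
          }
        }
      }
    }
  }
  return bad;
}
```

A two-letter alphabet keeps the enumeration small while still producing plenty of repeated characters, which is where the shift table does its work. A result of zero from `count_mismatches` says nothing about inputs outside that range.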

By the way, the code is available on github: https://github.com/tmmcguire/frama-c-toys/blob/master/string-search/quicksearch.c. Enjoy!

Need is there, but tools are not.

– zzz95

Let’s play with Frama-C.

Frama-C, which apparently stands for “Framework for Modular Analysis of C programs”, is “a suite of tools dedicated to the analysis of the source code of software written in C”. According to the description,

Frama-C is closer to these [bug-finding] heuristic tools than it is to software metrics tools, but it has two important differences with them: it aims at being *correct*, that is, never to remain silent for a location in the source code where an error can happen at run-time. And it allows its user to manipulate *functional specifications*, and to *prove* that the source code satisfies these specifications.

That sounds like fun, right? Who wouldn’t want to prove their source code satisfies formal specifications?

Well, ok, many people wouldn’t. Formal methods have a long, troubled history in computer science and computer programming. They’re regarded as the best idea since sliced bread by some of the more bondage and discipline-loving elements of the academic community. But in practice, nobody (much) uses them, because they’re a gigantic pain in the rump. The famously cranky (and socks-and-sandals wearing) Edsger Dijkstra pushed hard for correctness-by-construction by calculating programs from their specifications. You can probably guess how far that got. Various very smart people introduced specification languages and tools like Promela and Spin, Z, and TLA+ (and Alloy, and probably some I’m forgetting now), in order to get away from the fine details of actual programming while proving stuff about systems. The greatest success of these tools is that sometimes people recognize the names. And then there’s programming languages: types are periodically hot, and more types are more better, so there’s a hearty push for dependent typing and programming languages with names like ATS, Agda, and Idris. Dependently typed programming languages have one great advantage: they introduce completely new shapes of learning curves. (Vertical? I swear, ATS looks like it hangs over at the top.)

So, yeah, few people are really interested in proving properties about code.

But still, mistakes are embarrassing. Code gets complicated quickly and all the other methods for not screwing up suck, too. My interest here is in seeing how far I can get, without putting too much effort into the process. Maybe I can get some decent bang without putting too many bucks in.

If you are going to play with formal methods, what language do you choose? How about the worst possible case: The language everyone loves to hate because of its impressive safety record. The language that invented the term “undefined behavior” (and “sequence point”). Here it is: C. C is actually one of my favorite languages, and I occasionally refer to it as my native tongue. Sure, it’s a room full of rabid, poisonous animals in shoddily-constructed cages, but it’s *my* room full of poisonous animals.

Frama-C is what links C and formal methods.

Frama-C is a framework for parsing and statically analyzing ANSI/ISO C code, using a collection of plugins to perform specific analyses. According to the `-plugins` argument to `frama-c`, there are more than a few options available. I’m starting with WP, which “implements a weakest precondition calculus for ACSL annotations through C programs”. A “weakest precondition calculus” is that logical, formal system about code created by Tony Hoare and Robert Floyd, and popularized by our old socks-n-sandals buddy, EWD. (Did I mention he also wore a cowboy hat?) Hoare logic involves predicates before and after each statement (hey, like sequence points!) describing the state of the program before and after the statement. ACSL describes the syntax of the predicates: The *ANSI/ISO C Specification Language* is a language for logical predicates (and related information) written in C comments. ACSL is not the only such language; JML, for Java, and Spark, a subset of Ada, are very close relatives. And the research language Dafny has the same bag of tricks built in.

The WP plugin, like JML, Spark, and so forth, works by parsing the ACSL comments and the C code, and translating them to the language used by SMT solvers. Those solvers then produce an indication of success, indicating that the properties of the specification are valid according to the code, or failure and a more-or-less indecipherable error.

To play with a simple example, I’ll start with a function performing a string search: the brute force algorithm from Exact String Matching Algorithms.

```
/*
 * Search for needle in haystack, using a brute force algorithm.
 *
 * Expected complexity is O(n*h), where n is the length of the
 * needle and h is the length of the haystack.
 */
int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;
  for (i = 0; i <= h - n; ++i) {
    for (j = 0; j < n && needle[j] == haystack[j + i]; ++j);
    if (j >= n) {
      return i;
    }
  }
  return -1;
}
```

This is about as bare-bones as a string search could be. Given a *needle* string, of length *n*, and a *haystack* string, of length *h*, return the index where *needle* is located in *haystack*. If *needle* is not found, return -1. The outer loop takes *i* through the possible indices of the haystack (0 to *h-n*, since the needle has some length), and for each position tries to find the needle starting at that position with the inner loop using *j*.

You’ll be seeing this code several times in this post. I intend to re-use it, as-is, and just add the necessary ACSL specifications to get it to pass verification with an appropriate set of correctness properties. I’ll start with safety properties and then go on to functional correctness guarantees.

What happens when I rub Frama-C with the WP plugin against this code?

```
$ frama-c -wp brute-force.c
[kernel] Parsing FRAMAC_SHARE/libc/__fc_builtin_for_normalization.i (no preprocessing)
[kernel] Parsing brute-force.c (with preprocessing)
[wp] warning: Missing RTE guards
[wp] 0 goal scheduled
[wp] Proved goals: 0 / 0
```

A warning? That can’t be good. The problem is the lack of another plugin, Rtegen, which “generates annotations for runtime error checking and preconditions at call sites”. Using the RTE plugin requires another argument to Frama-C:

```
$ frama-c -wp -wp-rte brute-force.c
[kernel] Parsing FRAMAC_SHARE/libc/__fc_builtin_for_normalization.i (no preprocessing)
[kernel] Parsing brute-force.c (with preprocessing)
[rte] annotating function brute_force
brute-force.c:16:[wp] warning: Missing assigns clause (assigns 'everything' instead)
brute-force.c:15:[wp] warning: Missing assigns clause (assigns 'everything' instead)
[wp] 8 goals scheduled
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_mem_access_2 : Unknown (108ms)
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_mem_access : Unknown (Qed:4ms) (107ms)
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_2 : Unknown (105ms)
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow : Unknown (104ms)
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_4 : Unknown (Qed:4ms) (55ms)
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_3 : Unknown (54ms)
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_6 : Unknown (Qed:4ms) (159ms)
[wp] Proved goals: 1 / 8
Qed: 0 (4ms)
Alt-Ergo: 1 (28ms) (41) (unknown: 7)
```

That looks like I’m making progress; I have some goals and one of them is even satisfied already. But there are a couple of warnings, which I think I should address first.

Lines 15 and 16 refer to the two `for` loops in the code, and in ACSL, an “assigns” clause identifies the memory locations that code modifies. With no assigns clause, the code is allowed to modify any location anywhere, which is potentially problematic as far as verification is concerned. Getting rid of the warnings provides the first introduction to ACSL.

```
int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;
  /*@
    loop assigns i,j;
   */
  for (j = 0; j <= h - n; ++j) {
    /*@
      loop assigns i;
     */
    for (i = 0; i < n && needle[i] == haystack[i + j]; ++i);
    if (i >= n) {
      return j;
    }
  }
  return -1;
}
```

The special ACSL comments start with “/*@” and in this case apply to the `for` statement immediately following. An assigns clause can apply to a function, as part of the function’s contract, or to a loop as in this case. The inner loop modifies only the variable *i*, while the outer loop modifies *j* and (indirectly) *i*.

C has certain…quirks. For example, any given block of code can modify just about anything else. In order to reason about this code, Frama-C needs to know that the code is not going to alter the loop variables *i* and *j* in any way other than the obviously visible loop increments. Hence the necessity of the assigns clauses; they say that *i* and *j* are going to be modified, but *nothing else is*. With this code and that knowledge, it is then possible to deduce that the only way they can be changed is the visible increments.
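To show the two places an assigns clause can appear side by side, here is a small sketch on an unrelated function. The function `zero_fill` is hypothetical, and I have not run this particular fragment through WP; to an ordinary C compiler the annotations are just comments.

```c
/*@
  assigns a[0 .. n-1];
 */
void
zero_fill (int *a, int n)
{
  int i;
  /*@
    loop assigns i, a[0 .. n-1];
   */
  for (i = 0; i < n; ++i) {
    a[i] = 0;
  }
}
```

The function-level clause describes what a caller can see being modified (the array contents), while the loop-level clause also lists the local variable *i*, which the loop changes but the caller never sees.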

With that modification, I can get down to business.

```
$ frama-c -wp -wp-rte brute-force.c
[kernel] Parsing FRAMAC_SHARE/libc/__fc_builtin_for_normalization.i (no preprocessing)
[kernel] Parsing brute-force.c (with preprocessing)
[rte] annotating function brute_force
[wp] 10 goals scheduled
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_2 : Unknown (54ms)
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow : Unknown (Qed:4ms) (53ms)
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_mem_access_2 : Unknown (107ms)
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_mem_access : Unknown (105ms)
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_3 : Unknown (58ms)
[wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_6 : Unknown (Qed:4ms) (103ms)
[wp] Proved goals: 4 / 10
Qed: 2
Alt-Ergo: 2 (16ms) (39) (unknown: 6)
```

So now, I have 10 goals (including the loop assignment checks), 4 of which have already been satisfied (including the loop assignment checks). How do I find out more about the unsatisfied goals? Adding `-wp-print` generates the following output:

```
$ frama-c -wp -wp-rte -wp-print brute-force.c
...
[wp] Proved goals: 4 / 10
Qed: 2 (4ms)
Alt-Ergo: 2 (16ms-20ms) (39) (unknown: 6)
------------------------------------------------------------
Function brute_force
------------------------------------------------------------
Goal Assertion 'rte,signed_overflow' (file brute-force.c, line 18):
Assume {
Type: is_sint32(h) /\ is_sint32(n).
(* Heap *)
Have: linked(Malloc_0) /\ sconst(Mchar_0) /\
(region(haystack_0.base) <= 0) /\ (region(needle_0.base) <= 0).
}
Prove: n <= (2147483648 + h).
Prover Alt-Ergo returns Unknown (53ms)
------------------------------------------------------------
Goal Assertion 'rte,signed_overflow' (file brute-force.c, line 18):
Assume {
Type: is_sint32(h) /\ is_sint32(n).
(* Heap *)
Have: linked(Malloc_0) /\ sconst(Mchar_0) /\
...
```

By the looks of that first block, the particular goal has something to do with run-time errors and signed overflow, around line 18 of the file. The particular property to be proved is “n <= (2147483648 + h)”.

This, by the way, is the last time I am going to show the command invocation (and you will notice I’ve already elided the summary output produced initially). Most of the output with `-wp-print` resembles this section: a description of the goal under consideration and its location, the collection of information the solver has at that point (which can be much more complicated than this example), the actual goal to be proved, and the results of one of the solvers (in this case Alt-Ergo).

In any case, 2147483648 is 2^31, or 2^32 / 2, which is one more than the maximum signed integer representable in 32 bits. Line 18 is:

`for (j = 0; j <= h - n; ++j) {`

And *n* and *h* are the lengths of the input strings. Maybe at this point I should formally describe what I know about the inputs to the function. That might clear up some of those pesky overflow goals.
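To make the overflow concern concrete, here is a small sketch of the check that the goal `n <= (2147483648 + h)` amounts to for a 32-bit `int`: the difference *h* − *n* must not fall below `INT_MIN` (and, symmetrically, must not exceed `INT_MAX`). The helper name `difference_fits_in_int` is hypothetical, and the sketch assumes `long long` is wider than `int`, as it is on common platforms.

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if h - n fits in an int, 0 otherwise.
 * The subtraction is done in long long so it cannot itself overflow
 * (assuming long long is wider than int). */
static int
difference_fits_in_int (int h, int n)
{
  long long d = (long long) h - (long long) n;
  return INT_MIN <= d && d <= INT_MAX;
}
```

Without any preconditions, nothing stops a caller from passing, say, a very negative *h* and a positive *n*, which is exactly the case the solver cannot rule out.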

Now, obviously, *needle* and *haystack* have to be valid pointers, and the blocks they point to have to be *n* and *h* bytes long, respectively. Further, *n* and *h* have to be less than INT_MAX, the C constant describing the maximum value of an `int`. *n* and *h* have to be *strictly* less than INT_MAX because they are being used here in for loops and range from 0 to *b* for some meaningful *b*, one more than the valid range of indices from 0 to *b-1*. Iterator boundary conditions, and all that.

Finally, *n* has to be at most *h*. Otherwise, the needle cannot be in the haystack, right?

```
/*@
requires \valid(needle + (0 .. n-1)) && n < INT_MAX;
requires \valid(haystack + (0 .. h-1)) && h < INT_MAX;
requires n <= h;
*/
int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;
  /*@
    loop assigns i,j;
   */
  for (j = 0; j <= h - n; ++j) {
    /*@
      loop assigns i;
     */
    for (i = 0; i < n && needle[i] == haystack[i + j]; ++i);
    if (i >= n) {
      return j;
    }
  }
  return -1;
}
```

As a result, *that* one goal is satisfied, but nothing more. A little disappointing.

```
...
------------------------------------------------------------
Function brute_force
------------------------------------------------------------
Goal Assertion 'rte,signed_overflow' (file brute-force.c, line 23):
Assume {
Type: is_sint32(h) /\ is_sint32(n).
(* Heap *)
Have: linked(Malloc_0) /\ sconst(Mchar_0) /\
(region(haystack_0.base) <= 0) /\ (region(needle_0.base) <= 0).
(* Pre-condition *)
Have: (n <= 2147483646) /\ valid_rw(Malloc_0, shift_sint8(needle_0, 0), n).
(* Pre-condition *)
Have: (h <= 2147483646) /\
valid_rw(Malloc_0, shift_sint8(haystack_0, 0), h).
(* Pre-condition *)
Have: n <= h.
}
Prove: n <= (2147483648 + h).
Prover Alt-Ergo returns Valid (16ms) (16)
------------------------------------------------------------
...
```
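The goal in that output is the lower half of a signed-overflow check on `h - n`: the difference must not drop below `INT_MIN`, and rearranging `-2147483648 <= h - n` gives `n <= 2147483648 + h`. The upper half, `h - n <= INT_MAX`, rearranges the same way to `h <= 2147483647 + n`. Here is a minimal run-time sketch of the same check, assuming a 32-bit `int` and a wider `long long`. The helper name `sub_fits_in_int` is made up for illustration and is not part of Frama-C.

```c
#include <limits.h>

/* Hypothetical helper: would the C expression h - n stay inside the
 * range of int?  The difference is computed in long long, which is
 * assumed to be wider than int, so the check itself cannot overflow. */
int sub_fits_in_int(int h, int n)
{
    long long diff = (long long)h - (long long)n;

    /* Lower bound: INT_MIN <= h - n, i.e. n <= 2147483648 + h. */
    /* Upper bound: h - n <= INT_MAX, i.e. h <= 2147483647 + n. */
    return diff >= INT_MIN && diff <= INT_MAX;
}
```

For example, `sub_fits_in_int(5, 2)` returns 1, while `sub_fits_in_int(INT_MIN, 1)` returns 0 because the true difference lies below `INT_MIN`.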

Here’s the scorecard of satisfied and unsatisfied goals:

Property | Proved? | Notes |
---|---|---|
`n <= (2147483648 + h)` | Valid | |
`h <= (2147483647 + n)` | Unknown | |
`valid_rd(Malloc_0, a, 1)` | Unknown | `a = shift_sint8(needle_0, i)` |
`valid_rd(Malloc_0, shift_sint8(haystack_0, to_sint32(x)), 1)` | Unknown | `x = i + j` |
`(-2147483648) <= x` | Unknown | `x = i + j` |
`x <= 2147483647` | Valid | `x = i + j` |
`i <= 2147483646` | Valid | |
`j <= 2147483646` | Unknown | |
Loop assigns | Valid | |
Loop assigns | Valid | |

Most of the unknown goals involve *i,* *j,* and the loops. It appears that Frama-C is smart enough to identify possible bounds violations but not smart enough to notice that they shouldn’t happen. I need to give it some help.

The simplest possible help I can see is to specify loop invariants describing the indices of the loops. For the outer loop, *j* varies from 0 to *h-n+1* inclusive (remember the final value); for the inner loop, *i* varies from 0 to *n* inclusive.

```
/*@
  requires \valid(needle + (0 .. n-1)) && n < INT_MAX;
  requires \valid(haystack + (0 .. h-1)) && h < INT_MAX;
  requires n <= h;
*/
int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;
  /*@
    loop assigns i,j;
    loop invariant 0 <= j <= (h-n) + 1;
  */
  for (j = 0; j <= h - n; ++j) {
    /*@
      loop assigns i;
      loop invariant 0 <= i <= n;
    */
    for (i = 0; i < n && needle[i] == haystack[i + j]; ++i);
    if (i >= n) {
      return j;
    }
  }
  return -1;
}
```

In ACSL, the “loop invariant” clause specifies the invariant. The important parts about loop invariants, if you remember anything about them, are that:

they should be established before the loop starts (in this case, by assigning 0 to *i* and *j*),

they should be maintained by the code in the body of the loop (in this case, by the increment), and

the state of the program after the loop terminates is the invariant logically anded with the negation of the loop condition (in this case, that means that the inner loop ends with *i* = *n*, or (*i* < *n* and `needle[i] != haystack[i + j]`); the outer loop ends with *j* = *h - n + 1*).
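Those three points can be mimicked at run time, with `assert` standing in for the `loop invariant` clauses. This is only a sketch: the name `brute_force_checked`, the `const` qualifiers, and the asserted precondition `0 <= n` are additions for illustration, and a run-time assertion checks a single execution where Frama-C reasons about all of them.

```c
#include <assert.h>

/* Run-time analogue of the annotated loops: each assert mirrors a
 * "loop invariant" clause or the state on loop exit. */
int brute_force_checked(const char *needle, int n, const char *haystack, int h)
{
    int i, j;

    assert(0 <= n && n <= h);  /* assumed precondition */

    for (j = 0; j <= h - n; ++j) {
        /* Outer invariant: established by j = 0, maintained by ++j. */
        assert(0 <= j && j <= (h - n) + 1);

        for (i = 0; i < n && needle[i] == haystack[i + j]; ++i) {
            /* Inner invariant: established by i = 0, maintained by ++i. */
            assert(0 <= i && i <= n);
        }

        /* On exit: the invariant holds and the loop condition is false. */
        assert(0 <= i && i <= n);
        assert(i == n || needle[i] != haystack[i + j]);

        if (i >= n) {
            return j;
        }
    }

    /* The outer loop ran to completion: invariant plus negated condition. */
    assert(j == (h - n) + 1);
    return -1;
}
```

Calling `brute_force_checked("lo", 2, "hello", 5)` walks through positions 0 to 3 and returns 3 without tripping any assertion.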

That helps quite a lot. Running Frama-C now, the scorecard looks like:

Property | Proved? | Notes |
---|---|---|
`n <= (2147483648 + h)` | Valid | |
`h <= (2147483647 + n)` | Unknown | |
`valid_rd(Malloc_0, a, 1)` | Valid | `a = shift_sint8(needle_0, i)` |
`valid_rd(Malloc_0, shift_sint8(haystack_0, to_sint32(x)), 1)` | Valid | `x = i + j` |
`(-2147483648) <= x` | Valid | `x = i + j` |
`x <= 2147483647` | Valid | `x = i + j` |
`i <= 2147483646` | Valid | |
`j <= 2147483646` | Valid | |
Goal Loop assigns | Valid | |
Goal Loop assigns | Valid | |
`(-1) <= j` | Valid | |
`n <= (1 + h)` | Valid | |
`-1 <= i` | Valid | |
`0 <= n` | Unknown | |

12 out of 14 goals have been satisfied. I’m cookin’ with gas, now!

Looking at the two that haven’t been satisfied, I realize that there is some additional information about the arguments to the function that I haven’t given Frama-C: *lower* bounds. Obviously, if *n* or *h* is less than zero, something is not going to work right.

```
/*@
  requires \valid(needle + (0 .. n-1)) && 0 <= n < INT_MAX;
  requires \valid(haystack + (0 .. h-1)) && 0 <= h < INT_MAX;
  requires n <= h;
*/
int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;
  /*@
    loop assigns i,j;
    loop invariant 0 <= j <= (h-n) + 1;
  */
  for (j = 0; j <= h - n; ++j) {
    /*@
      loop assigns i;
      loop invariant 0 <= i <= n;
    */
    for (i = 0; i < n && needle[i] == haystack[i + j]; ++i);
    if (i >= n) {
      return j;
    }
  }
  return -1;
}
```

There. Neither *n* nor *h* can be less than zero. Given that addition to the requires clauses in the function contract, all 14 of the goals are satisfied, and Frama-C has proved that this function does not have any errors. It is safe.

*If*, that is, it is called with the right arguments: *needle* and *haystack* must be valid, *n* and *h* must be at least 0 and strictly less than INT_MAX, and *needle* must be no longer than *haystack.* Otherwise, all bets are off. Fortunately, if Frama-C is used to ensure that `brute_force` is only called with the right arguments, then everything will work out all right.
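When a caller is not itself verified, the same preconditions can at least be approximated with run-time checks. The following is a minimal sketch, assuming a hypothetical wrapper named `checked_search` and a made-up error code of `-2` for a violated contract. C has no run-time equivalent of `\valid`, so the pointer checks here only rule out `NULL`.

```c
#include <limits.h>
#include <stddef.h>

/* The search from the post, without annotations (const added here). */
static int brute_force(const char *needle, int n, const char *haystack, int h)
{
    int i, j;
    for (j = 0; j <= h - n; ++j) {
        for (i = 0; i < n && needle[i] == haystack[i + j]; ++i) { }
        if (i >= n) {
            return j;
        }
    }
    return -1;
}

/* Hypothetical wrapper: returns -2 when a precondition of the contract
 * is visibly violated, otherwise whatever brute_force returns. */
int checked_search(const char *needle, int n, const char *haystack, int h)
{
    /* Weak stand-in for \valid(...): only NULL can be detected. */
    if (needle == NULL || haystack == NULL) return -2;
    /* 0 <= n < INT_MAX and 0 <= h < INT_MAX */
    if (n < 0 || n >= INT_MAX || h < 0 || h >= INT_MAX) return -2;
    /* n <= h */
    if (n > h) return -2;
    return brute_force(needle, n, haystack, h);
}
```

With this wrapper, `checked_search("ell", 3, "hello", 5)` returns 1, and a needle longer than the haystack is rejected with `-2` before the search runs.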

When I write that the function doesn’t have any errors, I mean that it’s safe. It can’t go wrong. If the arguments are ok, the function cannot access out-of-bounds memory, trip over an arithmetic error, or do anything else nasty. But I haven’t demonstrated that it’ll do anything right, either.

What I need to do next is to describe in ACSL the function’s functional correctness properties; what the brute-force search is supposed to do: find a needle string in a haystack string.

I chose to do so in a couple of steps, since this lets me abstract out the details and avoid repeating them several times. I will skip most of the description of the ACSL syntax. It will hopefully be reasonably obvious to anyone who has seen both predicate logic and C before.

The ACSL block below shows some fun special features. First, ACSL comments start with “//”; yes, that’s a comment within a comment. Secondly, it defines a logical predicate, `partial_match_at`, which specifies what it means for a prefix of a needle to be found in a haystack at a specific location.

```
/*@
  // There is a partial match of the needle at location loc in the
  // haystack, of length len.
  predicate partial_match_at(char *needle, char *haystack, int loc, int len) =
    \forall int i; 0 <= i < len ==> needle[i] == haystack[loc + i];
*/
```

English translation: there is a partial match of length *len* of a *needle* in a *haystack* at a location *loc* if all of the elements of *needle* less than *len* match the corresponding elements of *haystack* starting at *loc*.

The predicate is only used by Frama-C static analysis (although I believe there are Frama-C plugins which generate run-time assertions from function contracts). The first place it is used is in another predicate, `match_at`. This predicate supplies *n,* the total length of the *needle*, as the length of the partial match, as a result requiring a complete match of the needle. It also adds a requirement that *loc* be at most *h-n*, so that it will be an acceptable location, as well as adding the other requirements from the `brute_force` contract. Why? I have no idea. All of those clauses are preconditions for the function. Belt-and-suspenders mathematics, anyone?

```
/*@
  // There is a complete match of the needle at location loc in the
  // haystack.
  predicate match_at(char *needle, int n, char *haystack, int h, int loc) =
    \valid(needle + (0 .. n-1)) && 0 <= n < INT_MAX &&
    \valid(haystack + (0 .. h-1)) && 0 <= h < INT_MAX &&
    n <= h && loc <= h - n &&
    partial_match_at(needle, haystack, loc, n);
*/
```

Anyway, `match_at` specifies a valid match of *needle* in *haystack* at location *loc.*
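For readers more comfortable with C than with predicate logic, the two predicates can be read as ordinary boolean functions. This is a rough executable translation, not something Frama-C uses: the `\forall` becomes a loop, `\valid` is approximated by a `NULL` check, and the rejection of negative `loc` is an extra assumption added so the array reads stay in bounds.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* needle[0 .. len-1] equals haystack[loc .. loc+len-1]. */
bool partial_match_at(const char *needle, const char *haystack, int loc, int len)
{
    int i;
    for (i = 0; i < len; ++i) {
        if (needle[i] != haystack[loc + i]) {
            return false;
        }
    }
    return true;  /* vacuously true when len <= 0 */
}

/* A complete match of the needle at location loc in the haystack. */
bool match_at(const char *needle, int n, const char *haystack, int h, int loc)
{
    /* \valid(...) has no run-time equivalent; NULL checks are a weak stand-in. */
    if (needle == NULL || haystack == NULL) return false;
    if (n < 0 || n >= INT_MAX || h < 0 || h >= INT_MAX) return false;
    if (n > h || loc > h - n) return false;
    /* Assumption beyond the ACSL predicate: negative locations are rejected. */
    if (loc < 0) return false;
    return partial_match_at(needle, haystack, loc, n);
}
```

For example, `match_at("llo", 3, "hello", 5, 2)` is true, while the same call with `loc` set to 3 is false because 3 exceeds `h - n`.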

The updated version of `brute_force` has a number of changes, both in the function contract and in the loop invariants.

In the contract, I added an assigns clause, `assigns \nothing;`, to indicate that the function has no visible side effects. Then, I added a basic postcondition on the return value:

`ensures -1 <= \result <= (h-n);`

Here, `\result` refers to the return value, and this postcondition indicates that the result will be between -1 and *h-n*, inclusive. The two behavior specifications add detail to the result:

In the successful case, a return value at least 0 implies a match of the needle at the location indicated by the return value.

In the failure case, a return value of -1 implies that there is no match of the needle at any location in the haystack.

```
/*@
  requires \valid(needle + (0 .. n-1)) && 0 <= n < INT_MAX;
  requires \valid(haystack + (0 .. h-1)) && 0 <= h < INT_MAX;
  requires n <= h;
  assigns \nothing;
  ensures -1 <= \result <= (h-n);
  behavior success:
    ensures \result >= 0 ==> match_at(needle, n, haystack, h, \result);
  behavior failure:
    ensures \result == -1 ==>
      \forall int i; 0 <= i < h ==>
        !match_at(needle, n, haystack, h, i);
*/
int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;
  /*@
    loop assigns i, j;
    loop invariant 0 <= i <= (h-n) + 1;
    loop invariant \forall int k; 0 <= k < i ==>
      !match_at(needle, n, haystack, h, k);
  */
  for (i = 0; i <= h - n; ++i) {
    /*@
      loop assigns j;
      loop invariant 0 <= j <= n;
      loop invariant partial_match_at(needle, haystack, i, j);
    */
    for (j = 0; j < n && needle[j] == haystack[j + i]; ++j);
    if (j >= n) {
      return i;
    }
  }
  return -1;
}
```

To convince Frama-C that this function satisfies those new postconditions, I added two loop invariants.

The first, on the outer loop, indicates that there is no match starting at any location less than *i*; that every position in the haystack between 0 and *i* has been checked and found not to match the needle. This is true initially as *i* is 0, and is maintained because a match that invalidates the invariant results in an early return from the function. As a result, in the failing case, the return of -1 will indeed imply that all of the locations in the haystack between 0 and *h-n+1* were checked and did not match.

The second, on the inner loop, indicates that there *is* a partial match starting at *i* and extending for *j* locations. This is true initially because *j* is 0 (a zero-length match). If the loop breaks early, with *j < n*, there is a partial match of length *j*, but not a complete match since the next characters differ. As a result, if the loop completes successfully, with *j >= n*, then there is a complete match of the needle at location *i*, which is the return value. Otherwise, if *j < n*, it is time to increment *i* and check the next position.

With these changes, the following goals are verified by Frama-C:

Property | Proved? | Notes |
---|---|---|
`((-1) <= brute_force_0) /\ ((brute_force_0 + n) <= h)` | Valid | `brute_force_0` is the result |
`(-1) <= i` | Valid | |
`n <= (1 + h)` | Valid | |
`!P_match_at(Malloc_0, Mchar_0, needle_0, n, haystack_0, h, i)` | Valid | |
Establishment of Invariant (line 58) | Valid | |
`n <= (2147483648 + h)` | Valid | |
`h <= (2147483647 + n)` | Valid | |
`(-1) <= j` | Valid | |
Establishment of Invariant (line 65) | Valid | |
`P_partial_match_at(Mchar_0, needle_0, haystack_0, i, x_4)` | Valid | `x_4 = 1 + j` |
`P_partial_match_at(Mchar_0, needle_0, haystack_0, i, 0)` | Valid | |
`valid_rd(Malloc_0, shift_sint8(needle_0, j), 1)` | Valid | |
`valid_rd(Malloc_0, shift_sint8(haystack_0, to_sint32(i + j)), 1)` | Valid | |
`(-2147483648) <= (i + j)` | Valid | |
`(i + j) <= 2147483647` | Valid | |
`j <= 2147483646` | Valid | |
`i <= 2147483646` | Valid | |
Loop assigns (line 56) | Valid | |
Loop assigns (line 64) | Valid | |
Assigns nothing in `brute_force` (line 61) | Valid | |
Assigns nothing in `brute_force` (line 61) | Valid | |
Assigns nothing in `brute_force` (line 68) | Valid | |
Assigns nothing in `brute_force` (line 68) | Valid | |
Assigns nothing in `brute_force` (line 70) | Valid | |
Assigns nothing in `brute_force` (line 73) | Valid | |
`!P_match_at(Malloc_0, Mchar_0, needle_0, n, haystack_0, h, i)` | Valid | |
`P_match_at(Malloc_0, Mchar_0, needle_0, n, haystack_0, h, brute_force_0)` | Valid | |

And with those goals satisfied, I think I can safely say that `brute_force` has been verified as both safe and functionally correct.

This has been a long mass of verbiage, but I think the actual amount of work done, modulo the learning curve for Frama-C, hasn’t been too great. Certainly, not more than what would be required to test `brute_force` to anywhere near the same level of certainty.
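For comparison, here is roughly what a testing approach might look like: a small exhaustive check of the three postconditions over every needle and haystack built from a two-letter alphabet, up to a chosen length. It is a sketch under stated assumptions. The names `postconditions_hold`, `fill`, and `count_violations` are invented, `memcmp` stands in for `match_at`, and the failure case only scans locations 0 to *h-n*, since `match_at` rejects anything larger.

```c
#include <string.h>

/* The function under test, as in the post but without annotations. */
static int brute_force(const char *needle, int n, const char *haystack, int h)
{
    int i, j;
    for (i = 0; i <= h - n; ++i) {
        for (j = 0; j < n && needle[j] == haystack[j + i]; ++j) { }
        if (j >= n) {
            return i;
        }
    }
    return -1;
}

/* Returns 1 if the contract's postconditions hold for this one input. */
static int postconditions_hold(const char *needle, int n,
                               const char *haystack, int h)
{
    int r = brute_force(needle, n, haystack, h);
    int k;

    /* ensures -1 <= \result <= (h-n) */
    if (r < -1 || r > h - n) return 0;
    /* success: a non-negative result is a real match */
    if (r >= 0 && memcmp(needle, haystack + r, (size_t)n) != 0) return 0;
    /* failure: -1 means no location matches */
    if (r == -1) {
        for (k = 0; k <= h - n; ++k) {
            if (memcmp(needle, haystack + k, (size_t)n) == 0) return 0;
        }
    }
    return 1;
}

/* Fill buf with the len-letter string over {a, b} encoded by the bits of code. */
static void fill(char *buf, int len, unsigned code)
{
    int i;
    for (i = 0; i < len; ++i) {
        buf[i] = ((code >> i) & 1u) ? 'b' : 'a';
    }
}

/* Count inputs with 0 <= n <= h <= max_len that violate a postcondition. */
int count_violations(int max_len)
{
    char hay[8], nee[8];
    int h, n, bad = 0;
    unsigned hc, nc;

    if (max_len > 8) max_len = 8;  /* buffers above hold at most 8 letters */

    for (h = 0; h <= max_len; ++h) {
        for (hc = 0; hc < (1u << h); ++hc) {
            fill(hay, h, hc);
            for (n = 0; n <= h; ++n) {
                for (nc = 0; nc < (1u << n); ++nc) {
                    fill(nee, n, nc);
                    if (!postconditions_hold(nee, n, hay, h)) ++bad;
                }
            }
        }
    }
    return bad;
}
```

A call such as `count_violations(4)` covers only short strings over two letters, which is the gap between this kind of testing and the proof: the test samples a corner of the input space, while the contract covers every input that meets the preconditions.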

There is one…issue (I won’t use the word “error”) with this verification, as written. It’s related to a problem described in An empirical study on the correctness of formally verified distributed systems. I’ll leave it as an exercise to the reader to find it.

If there are any other problems with this code-and-specification, I’d love to hear about them. I may have missed something (that’s always the worry when dealing with formal specifications) or I may be using the tools wrong (yeah, I don’t actually know what I’m doing). Please, let me know!

The code, if you are interested, can be found on GitHub.

*Debunking Economics: The Naked Emperor Dethroned?* is a book by Steve Keen, at the time professor of economics and finance at the University of Western Sydney. According to the blurb, it “exposes what a minority of economists have long known and many of the rest of us have long suspected: that economic theory is not only unpalatable, but also plain wrong.” The book contains a “scathing critique of conventional economic theory whilst [whilst?] explaining what mainstream economists cannot: why the [2008] crisis occurred, why it is proving to be intractable [the first edition was published in 2001, this edition was published in 2011], and what needs to be done to end it.”

Keen’s approach to each chapter (a “kernel” section describing the basic argument, a “roadmap” section spelling out how it will go, and a presentation of the neoclassical idea followed by his counter-arguments) is admirable, but he spends a great number of words attacking neoclassical economics and economists. These attacks may be needed for what the book is trying to do, but they do serve to hide the underlying arguments and counter-arguments.

It’s been a rather long time since I was anywhere near an economics classroom (and I cannot find an econ textbook around here!), so these notes are my attempt to understand Keen’s fundamental arguments. In each chapter in the bulk of the book, he presents a traditional “undergraduate” version of a basic economic idea (as taught by what he calls “neoclassical economists”) and then presents an apparently conclusive argument that the idea has major faults, if it is not completely wrong. I am attempting to pick out those basic ideas and avoid the criticism of neoclassical economics and economists.

Chapter 3, the first chapter of part 1, Foundations, addresses the demand curve, one part of introductory economics’ laws of supply and demand.

Note: if you are already familiar with how a demand curve is created, you can skip to the chapter’s punchline at The market demand curve.

Economics begins building a description of demand based on Jeremy Bentham’s (1748-1832) utilitarianism. (I like Jeremy Bentham; he spent time designing prisons and you can visit him at University College London today.) The idea is that, given some commodity such as bananas, having one banana will give you a certain amount of utility, and having more bananas always gives you more utility, although the amount of utility you gain from one *more* banana is less than the utility you gained from getting the previous banana. Here’s a table, denominated in a utility currency of “utils”:

```
## Bananas Utils Change.in.Utils
## 1 1 8 8
## 2 2 15 7
## 3 3 19 4
## 4 4 20 1
```

(These examples are taken from the book.) Plotted, that looks like:

The problem here is that it uses a currency of “utils”, which doesn’t exist and may or may not be definable. It also does not refer to price. To remove the dependency on utils and introduce a relationship to price requires a few steps, the first of which is to relate the bananas commodity to a new commodity, say biscuits. Biscuits obey the same utility rules as bananas, although with different util values, and biscuits can be traded off against bananas at varying utilities:

```
## [,1] [,2] [,3] [,4]
## [1,] 0 9 15 18
## [2,] 8 13 19 25
## [3,] 13 15 24 30
## [4,] 14 18 25 31
```

Graphed, that looks like:

In this graph, the highest utility is provided by 3 bananas and 3 biscuits, but 2 bananas and 3 biscuits is nearly as high. 1 banana and 3 biscuits has the same utility as 3 bananas and 2 biscuits. This last point is important; Keen writes,

The final abstraction en route to the modern theory was to drop this ‘3D’ perspective—since the actual ‘height’ couldn’t be specified numerically anyway—and to instead link points of equal ‘utility height’ into curves, just as contours on a geographic map indicate locations of equal height, or isobars on a weather chart indicate regions of equal pressure.

Based on this depiction, Keen writes,

Since consumers were presumed to be motivated by the utility they gained from consumption, and points of equal utility height gave them the same satisfaction, then a consumer should be ‘indifferent’ between any two points on any given curve, since they both represent the same height, or degree of utility. These contours were therefore christened ‘indifference curves’.

The properties of these curves are:

Completeness. Given two different combinations of commodities, a consumer can decide which is preferred or that he is indifferent between the two. (The combinations are ordered.)

Transitivity. If combination A is preferred to combination B, and B is preferred to combination C, combination A is preferred to C.

Non-satiation. More is better than less: if A contains the same amount of every commodity as B, except for one, and A has more of that one commodity than B, then A is preferred to B.

Convexity. “The marginal utility … falls with additional units, so that indifference curves are convex in shape.” (In the graph above, not all of the indifference curves are convex. Weird.)

Price enters the picture at this point.

Assume that bananas and biscuits both cost $1 each, and the consumer has $3 to spend. This allows the consumer to buy 3 bananas and no biscuits, or no bananas and 3 biscuits, or some combination. This assumption allows a “budget line” to be drawn on the graph:

The point on this line that maximizes the utility “height” indicates the combination of commodities that maximizes the utility and therefore satisfaction of the consumer while keeping the price of the combination to the budget; it looks like 1 banana and 2 biscuits is the appropriate combination.
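As a worked check of that reading, the utility table can be searched directly for the best affordable bundle. This is a small sketch under assumptions of mine: rows of the table are taken to be bananas (0 to 3) and columns biscuits (0 to 3), which is the orientation that agrees with the comparisons in the text, and the function name `best_bundle` is invented.

```c
/* Utility table from the text; assumed layout UTILS[bananas][biscuits],
 * with quantities running from 0 to 3. */
static const int UTILS[4][4] = {
    { 0,  9, 15, 18},
    { 8, 13, 19, 25},
    {13, 15, 24, 30},
    {14, 18, 25, 31},
};

/* Search every bundle that fits the budget and return the highest
 * utility found, storing the quantities through the two pointers.
 * Returns -1 (and leaves the pointers untouched) if nothing is affordable. */
int best_bundle(int budget, int banana_price, int biscuit_price,
                int *bananas, int *biscuits)
{
    int best = -1;
    int b, c;

    for (b = 0; b < 4; ++b) {
        for (c = 0; c < 4; ++c) {
            int cost = b * banana_price + c * biscuit_price;
            if (cost <= budget && UTILS[b][c] > best) {
                best = UTILS[b][c];
                *bananas = b;
                *biscuits = c;
            }
        }
    }
    return best;
}
```

With a budget of 3 and both prices at 1, the search settles on 1 banana and 2 biscuits at 19 utils, ahead of 0 bananas and 3 biscuits at 18.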

Keen writes:

Economic theory then repeats this process numerous times, each time considering the same income [“budget”] and price for biscuits, but a [different] price for bananas. Each time, there will be a new combination of biscuits and bananas that the consumer will buy, and the combination of the prices and quantities of bananas purchased is the consumer’s demand curve for bananas.

Given the varying price of bananas and the corresponding number of bananas purchased in the optimal combination, we can construct something that sort of looks like a demand curve:

The resulting graph is a *demand curve*: it follows the *Law of Demand*, that demand increases as price falls. Although Keen does not go into the math, he notes, and I believe, that the process is well defined mathematically, so that a corresponding demand function can be created to play with analytically. There are some other twists, though. For one, lowering the price of a commodity that is not particularly desirable can result in no more, or even less, of that commodity being purchased.

If, for example, the price of bananas falls while the budget (or “income”, traditionally) and other prices remain the same, then the consumer can buy more bananas and has a higher utility value. On the other hand, if bananas are a major component of the budget yet are relatively undesirable, an increase in income can induce the consumer to buy fewer bananas and more of something more desirable, even if the change in income is the result of changing prices. The canonical example is potatoes during the Irish potato famine: consumers bought more potatoes even as the prices rose because they could no longer afford more desirable alternatives like pork.

As a result, two effects need to be separated to produce a useful demand curve: the *substitution effect* is the change in demand purely due to the change in prices, and the *income effect* is the change in demand due to a change in prices affecting the consumer’s perceived income. The substitution effect is always inversely related to prices: an increase in price produces a reduction in demand. The income effect, however, can have any relation to demand, depending on the commodity.

The income effect produces four classes of commodities:

Necessities, which take a diminishing *share* of spending as income grows. Think of, say, toilet paper; you only buy a certain number of rolls per month, no matter what your income is.

“Giffen” commodities, whose actual consumption falls as income grows. I have not bought many packages of Ramen noodles since I graduated.

Luxuries, which take an increasing share of spending as income grows. The percentage of income I spend on Picassos is essentially zero, but if I made more, I might find myself buying some.

Neutral or “homothetic” commodities, which take a constant share of spending as income grows.

Note that the second class is a sub-class of the first, and that the fourth class is unoccupied—what kind of commodity would you spend 10% of your income on, for any income from $10,000 per year to $1,000,000 per year? Pizza?

Anyway, back to the substitution effect and the income effect. The substitution effect is what the demand curve is trying to isolate. Fortunately, it is possible to neutralize the income effect to produce a “Hicksian compensated demand curve” which is well behaved: the demand for a commodity will rise if its price falls. There are certain assumptions that are needed to neutralize the income effect, though, such as that changing the price of bananas does not *directly* alter an individual’s income—the only change in income is the income effect.

(As an aside, Keen notes that Ted Wheelwright once described this as “tobogganing up and down your indifference curves until you disappear up your own abscissa”. The verb “to toboggan” needs to see more use.)

This is a demand curve for *one person* and one commodity. The process for creating a *market demand curve,* though, is very simple: given a market consisting of many consumers, each with their own demand curve, the overall market demand curve is simply the sum of the individual demand curves.

Without considering any other effects, that works: the sum of a set of individual demand curves will be a demand curve.

Unfortunately, considering other effects, there is a problem: changing the price of bananas in a market environment will change the income of some of the individuals. Keen uses the example of a market consisting of Robinson Crusoe and Man Friday, and more bananas.

Suppose that Crusoe is a banana consumer, and that Friday is both a consumer and a producer. An increase in the price of bananas will make Friday richer while making Crusoe poorer; Friday will be able to buy more biscuits. (What happens if they’re both producers? The same thing, assuming they are not equally good at producing bananas.)

As a result (Gorman 1953), the market demand function can be any polynomial; the graph is not necessarily negatively sloped. Demand can increase as prices rise. It can then decrease as prices continue to rise. And then increase again. Whee!

Fortunately, the market demand function can be made to obey the Law of Demand under two conditions, also known as the *Sonnenschein-Mantel-Debreu (SMD) conditions*:

All of the Engel curves of all consumers are straight lines, and

All of the Engel curves are parallel.

Keen quotes Gorman:

…we will show that there is just one community indifference locus through each point if, and only if, the Engel curves for different individuals at the same prices are parallel straight lines…

An Engel curve is based on an indifference graph, with multiple “budget” or income lines at different distances from the origin, showing the effects of an increasing income.

Joining the points of maximum utility on each budget line in sequence produces the Engel curve. (I have fudged this example; the actual curve for this silly graph should go through the indicated points.)

Engel curves relate back to the classes of commodities described above, with the income effect. Necessities and Giffen commodities produce upward-curving Engel curves, luxuries produce downward-curving Engel curves (such as the line in the bananas vs. biscuits graph above), and neutral commodities produce linear Engel curves.

The problem here is that the two SMD conditions mean that

All commodities are neutral, and

Since all Engel curves start from (0,0) and parallel lines through the same point are the same line, all consumers have the same Engel curve.

As a result, the conditions imply that there is a single, “generic” commodity, and a single, “representative” consumer.

About this, Gorman writes,

The necessary and sufficient condition quoted above is intuitively reasonable. It says, in effect, that an extra unit of purchasing power should be spent in the same way no matter to whom it is given.

…which doesn’t seem all that intuitively reasonable.

There is a good deal more to the chapter, such as discussions of further ways of looking at the SMD conditions (including a “benevolent dictator” who redistributes wealth prior to market activity), and demonstrations that the SMD caveats are not discussed in introductory textbooks and not clearly discussed in advanced textbooks.

As an addendum to the chapter, Keen presents some arguments that the initial idea of a rational consumer maximizing utility to produce an individual demand curve is not necessarily valid, either. His one example is a study of a group of consumers trying to pick the “best” basket of a number of different products. Keen argues that their failure is due to the computational complexity of picking the optimal solution to a largish combinatorial problem.

I believe there are enough other examples from behavioral economics to assert that the “rational” consumer, in practice, does not really exist.


(Of course, in this post I do not intend to address ESR’s comments on Rust’s crate environment or the necessity (or lack thereof) of being “batteries included”. And I certainly wish to avoid appearing to present “stupid flamage from zealots” or being a “clueless Rust fanboy” (Heaven forfend!).)

There are two kinds of people who will have excessive difficulty with Rust ownership:

Those who have extensive experience with C and, to a lesser extent, C++.

Those who have not had any experience with un-managed memory.

(And yes, I’m aware that those two categories cover just about everyone.)

I myself am a member of the first group. I grew up with C. C is my native tongue. In particular, I have a lot of experience with C on “normal” systems—the workstations and personal computers in which C originated.

C, the language, has essentially no compiler-enforced rules preventing the programmer from doing hideously stupid things with memory. Feel the urge to pass a pointer to a pointer to a `struct ssl_ctx_st` to a function expecting a pointer to a `struct ssl_ctx_st` or to a function expecting a floating point number? Want to return a pointer to a buffer you allocated on the stack? Sure! Go ahead! The compiler will generate code that does exactly what you told it to do. It’s up to the programmer to figure out what went wrong and why, after the resulting nuclear apocalypse.
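To make the stack-buffer case concrete, here is a small sketch. The function name `make_label` and its format string are invented for illustration. The dangerous version is shown only inside a comment, and the live code is one conventional C repair, in which the caller owns the buffer and passes it in.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * The kind of code C will accept (kept in a comment so that nothing
 * here actually invokes undefined behavior):
 *
 *     char *make_label(int id)
 *     {
 *         char buf[32];
 *         snprintf(buf, sizeof buf, "item-%d", id);
 *         return buf;   // buf stops existing when the function returns
 *     }
 */

/* One conventional repair: the caller supplies the storage.
 * Returns 0 on success, -1 if the label did not fit in out. */
int make_label(char *out, size_t out_size, int id)
{
    int len = snprintf(out, out_size, "item-%d", id);
    return (len >= 0 && (size_t)len < out_size) ? 0 : -1;
}
```

Nothing in C forces the second form over the first; that rule lives in the programmer's head, which is exactly the kind of internalized knowledge at issue here.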

In order to be productive with this sort of entertainment, I have internalized a whole lot of rules describing the “C virtual machine” and what you can and cannot do with it. This, incidentally, is why I mentioned “normal” systems above; there are machines out there that violate what I have internalized as the “C virtual machine”, for which the code that I might expect to work happily will instead fail dramatically. That is the primary difficulty in writing “portable” C code: the rules for doing so are even stricter than the rules for 90% of the systems in existence, and code that works almost everywhere will blow up on that one ridiculous machine.

Having internalized those rules, I am in the position of instinctively writing code that I know to be (or quite reasonably expect to be) memory safe *in this specific time and place*, but I am unused to having any tools that try to verify that for me. I sit down and write me some Rust but when I try to compile it, the compiler becomes very unhappy. The compiler is unhappy because it cannot verify that what I have done is safe *in general* and I am unhappy because I know that what I have done is safe *in this specific case*.

The rub is that the compiler is more likely to be right than I am.

Anyway, at this point, I have to “fight the compiler”. I can either abandon the technique that I am trying to use and find something more to the compiler’s liking, or I can try to add enough information to allow the compiler to judge that my technique is safe. Unfortunately, the tools to do the latter are a bit obscure, seemingly weird, and often not available. After all, Rust is not a dependently typed language and adding a proof of correctness to my code would likely be somewhat more painful than using Rust as it exists.

As a result, to learn Rust I have to unlearn some of the lessons I have internalized from years of writing C and relearn the related, but different, approaches needed for Rust.

Those who know C++ are in a similar, but less extreme, position. Rust is, after all, aimed at C++. Programmers using “standard” C++, specifically avoiding the keywords `new` and `delete` and avoiding raw pointers, are likely to have an easier time. On the other hand, there will be bumps in the road, particularly regarding other parts of Rust.

The second group, above, are in a whole ’nother world of hurt: they’re engaged in a journey of adventure and discovery wherein they need to learn how to manage their own memory without amputating any of their own limbs. Fortunately, on the one hand, Rust inherits many of C++’s techniques such as RAII, which make the task significantly easier. And on the other hand, the compiler’s insistence on its ownership model should ensure that they don’t have to also learn to love valgrind. But still, managing memory is a new requirement that has nothing to do with the actual logic of the problem to be solved, and the compiler is going to be an enemy until the programmer has learned how to handle this extra requirement.

```
Triangle Pen Show 2016
```

I managed to miss the Atlanta Pen Show this year, but fortunately, I was in North Carolina for the Triangle Pen Show in Cary, June 2-5. It was somewhat smaller and more intimate, but no less fun. Only one room of tables (plus a couple of hallways), but the contents of those tables were no less mind-blowing.

This time I did manage to see Susan Wirth’s seminar on “How some pens can make anyone’s handwriting look good. How to find them at a Pen show.” Unfortunately, I have developed the complete inability to remember the third item of any list of three things, so I don’t remember her last suggestion for a pen to improve my generally horrifying writing; the first two were a fine italic (a smidgen of line variation makes everything prettier) and an extra fine or needlepoint (for those situations that call for very small writing). She also had lots of tips for enjoying a pen show.

I had counted on Andreas Lambrou being at the show, and he did not disappoint. As a result, I quickly found myself carrying around 15lbs (6.8038856 kg) of autographed, beautiful pen books. There’s a funny story there: I didn’t have the cash for such a grand purchase, but Andreas had recently started using a Square card scanner. Unfortunately, he added an extra “04” to the price without either of us noticing. A couple of weeks later, I got a bill recording a purchase of roughly $3000. I got on the horn with Bank of America and contested the charge, then went to track down how to contact Andreas. I discovered that he’d already sent me email and had reversed the excess charges through Square. I think all the counter- and counter-counter-transactions have settled down, but my credit card at one point was showing I owed $6000. Whee.

The other attendee I was counting on was Franklin Christoph, which is a pretty sure thing: they’re based in North Carolina. If they didn’t show up, I could drive over to their location and pound on the doors. After fooling with some of their testers at Atlanta, I could not stop thinking about their Model 66, so I decided to combine my strange attraction for their pens with Susan’s pen suggestions (at least those I remember). I bought a Model 66 with a Masuyama fine italic nib and a Model 55 with a Masuyama needlepoint. Both are very nice to write with.

While the Franklin Christoph folks did have some of their standard colors there, they also had limited edition, non-production colors: the Model 66 is in flaming, Corvette red and the Model 65 is in a mottled reddish-brown and black. Both are pretty, but very hard to take accurate pictures of.

To that end, I’ll start with regular expression matching, then go on to the conversion to a deterministic finite automaton and finally discuss the minor extras needed to build a lexical analyzer, similar to lex or flex (but better).

I have written about regular expressions before, too.^{3} They are cool, they are useful, they are fairly simple to understand and work with…what’s not to like? Even better, how about a way of working with them that does not involve a trip into non-determinism land? Hence, “Regular-expression derivatives reexamined”, by Scott Owens, John Reppy, and Aaron Turon. Let’s take a closer look at that.

When I was taking classes in graduate school, I noticed something peculiar: every professor, in every class, spent a couple of sessions on predicate logic, usually first-order predicate calculus. Naturally, all of the introductions were different in some aspect or another. *And,* I had gone into graduate school with a pretty fair understanding of predicate logic courtesy of an excellent undergraduate course on the subject. Now, I recognize that all of the incoming students from various places could not be assumed to have had a brilliant introduction to the topic, nor to have followed the same schedule as I did, and further, it is an important topic to computer science…but geeze. Couldn’t they have compared notes or something?

As a result, I spent some of the first few class sessions doing one of two things: napping or becoming confused. Neither is an especially good way to start a computer science class, although napping probably does the lesser harm. Eventually, I got to the point where I no longer really understood predicate calculus, and to this day, I could not come up with a coherent description of it.

At the risk of causing someone else to suffer my pain, here is my brief introduction to regular expressions. Feel free to take a nap.

This section starts with a basic definition of regular expressions and the terminology around them, and then goes into the definition of the derivative of a regular expression. We will do something useful with them in the next section.

A regular expression starts with an *alphabet*, *Σ*, containing a finite number of basic *symbols*; think of the ASCII characters or Unicode or something. A *string* is a sequence of symbols from *Σ*, kind of like this sentence. *Σ** is the Kleene-closure (more on this in a minute) of the alphabet; i.e. the set of all strings that can be constructed from the alphabet, by gluing together zero or more symbols. Note that *Σ** is infinite; although every string has some finite length, *Σ** includes every string. This set does include one special member, *ϵ*, which represents the string with zero components, the empty string.

A *language* is a subset of *Σ**, dividing that set of strings into those strings that are in the language and those strings that are not in the language. There is one special language, ∅, consisting of the empty set; all strings, including *ϵ*, are on the “not in the language” side.

A *regular expression* is a (relatively) concise description of a (simple) language. (There are many other ways of describing languages, such as grammars (context free, context sensitive, and “I acknowledge there is a context but I don’t have the social skills to follow it”), but regular expressions, or regexs, or even regexps, are the topic here.)

Regular expressions are built from the symbols of the alphabet *Σ*, combined using a specific group of operations:

*Concatenation.* If *a* and *b* are symbols, *a**b* is the regular expression describing the language {*a**b*}; i.e. the language containing the string consisting of *a* followed by *b*.

*Alternation.* If *a* and *b* are symbols, *a*|*b* describes the language {*a*, *b*}; i.e. the language containing the two strings, *a* and *b*.

*Kleene-closure.* The Kleene star operator, *, is kind of special; it says, “any number, zero or more, of copies of whatever is in front of it.” If *a* is a symbol, the regular expression *a** describes the language {*ϵ*, *a*, *a**a*, ...}. Unlike the two examples above, this language contains an infinite number of strings. (That should explain what *Σ** means; it is another use of the Kleene star.)

In the examples above, I used the simplest regular expressions, the symbols *a* and *b*. All of these operations work for “sub-expressions” similar to arithmetic operators: if *r* and *s* are regular expressions, *r**s* is the regular expression describing the language built from the languages described by *r* and *s*, where every member of the language *r**s* is a string consisting of a prefix from the language *r* followed by a suffix from the language *s*. The language described by *r*|*s* is the union of the languages described by *r* and *s*, and the language described by *r** is the language made up of zero or more concatenated copies of the strings in *r*.

For our use here, this relatively normal description of regular expressions is extended with two other operations. While these operations are not normally seen in descriptions of regular expressions, they do not add any expressive power; theoretically, they can be built from the operations above.

*Conjunction.* If *r* and *s* are regular expressions, *r*∧*s* describes the language consisting of those strings that are in *both* *r* and *s*.

*Negation.* If *r* is a regular expression, ¬*r* describes the language consisting of strings that are *not* in the language *r*.

Finally, feel free to add parentheses to resolve any ambiguity in the expression.^{4}
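For readers who prefer set notation, here is the same list of operations as a table. *L*(*r*) is my own shorthand (not the paper’s) for the language described by *r*, concatenation is written with an explicit dot, and *L*(*r*)^k means *k* members of *L*(*r*) glued together, with the zeroth power being {*ϵ*}.

$$
\begin{array}{rcl}
L(r \cdot s) & = & \{ u \cdot v \mid u \in L(r),\ v \in L(s) \} \\
L(r|s) & = & L(r) \cup L(s) \\
L(r*) & = & \bigcup_{k \geq 0} L(r)^k \\
L(r \wedge s) & = & L(r) \cap L(s) \\
L(\neg r) & = & \Sigma* \setminus L(r)
\end{array}
$$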

Given the brief introduction above, what is the derivative of a regular expression? It certainly has nothing obvious to do with differential calculus, that’s for sure.

The *derivative of a regular expression* *r*, with respect to a symbol *a*, is another regular expression, call it *r*′. The language described by *r*′ consists of all of the strings from the language of *r* that start with *a*, with the *a* prefix stripped off. Here are some examples:

The derivative of *a**b* (i.e. the language {*a**b*}) with respect to *a* is the set of strings from the original language that start with *a* (i.e. {*a**b*}) with the initial *a* stripped off: {*b*}. This is the regular expression *b*.

The derivative of *a**b* ({*a**b*}, remember) with respect to *c* is the set of strings from the language described by *a**b* that start with *c*, with the *c* removed: the empty set, ∅. (For simplicity (if not clarity), let’s say that the language ∅ is represented by the regular expression ∅.)

The derivative of *a** (the set {*ϵ*, *a*, *a**a*, ...}, remember) with respect to *a* is the set of strings described by *a** with the initial *a* removed. The *ϵ* is removed from the set, but it is replaced by *a* with the *a* removed, which is itself replaced by *a**a* with *its* initial *a* removed: *a*. So, the language of the derivative *r*′ is still {*ϵ*, *a*, *a**a*, ...}. The derivative of *a** with respect to *a* is simply *a**. Neat, eh?

The language of the derivative of *a*|*b* with respect to *a* is the subset of {*a*, *b*} of strings starting with *a*, which is {*a*}, with the *a* prefix removed: {*ϵ*}. (For convenience, this language is described by *ϵ*, treated as a regular expression. Can’t make this stuff too easy, ya’ know.)
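All four examples follow one rule, which fits on a line. Writing *L*(*r*) for the language described by *r* (my shorthand, not the paper’s notation) and *r*′ for the derivative of *r* with respect to *a*, the second row applies the rule to the last example, *r* = *a*|*b*.

$$
\begin{array}{rcl}
L(r') & = & \{ w \mid a \cdot w \in L(r) \} \\
L((a|b)') & = & \{ w \mid a \cdot w \in \{a, b\} \} = \{ \epsilon \}
\end{array}
$$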

The paper, “Regular-expression derivatives reexamined”, gives the following simple, mathematical definition of the derivative of a regular expression with respect to a character, *a*. (The notation *δ*(*r*, *a*) stands for the derivative of *r* with respect to *a* (and yes, I simplified the notation used in the paper, because a subscript does not make any damn sense as an argument).)

$$
\begin{array}{rcl}
\delta(\epsilon, a) & = & \emptyset \\
\delta(a, a) & = & \epsilon \\
\delta(b, a) & = & \emptyset\ \textrm{ when $a \neq b$} \\
\delta(\emptyset, a) & = & \emptyset \\
\delta(r \cdot s, a) & = & (\delta(r,a) \cdot s) | (\nu(r) \cdot \delta(s,a))\\
\delta(r*, a) & = & \delta(r, a) \cdot r* \\
\delta(r|s, a) & = & \delta(r,a) | \delta(s,a) \\
\delta(r \wedge s, a) & = & \delta(r,a) \wedge \delta(s,a) \\
\delta(\neg r, a) & = & \neg \delta(r,a) \\
\delta(r, \epsilon) & = & r \\
\delta(r, a \cdot u) & = & \delta(\delta(r,a),u)\ \textrm{ where $u$ is a string} \\
\delta((r_1, r_2, \ldots, r_n), a) & = & (\delta(r_1,a), \delta(r_2,a), \ldots, \delta(r_n,a))
\end{array}
$$

(While concatenation is usually represented by juxtaposition, i.e. *a**b*, for clarity in this table I have represented it as an explicit ⋅ (no, that’s a dot, not a speck on your monitor) between two other expressions. Alternation and conjunction are represented by “|” and “∧”, respectively.)

The last three equations may require some special explanation. The first two, for *δ*(*r*, *ϵ*) and *δ*(*r*, *a**u*), extend the definition of a regular expression derivative with respect to a character into the definition of a regular expression derivative with respect to a string. This is done straightforwardly, by defining the derivative of *r* with respect to *ϵ* to be *r* itself, and by defining the derivative of *r* with respect to a non-empty string to be the first-order derivative of *r* with respect to the first character, computing a second-order derivative with respect to the next character, and so on. (Taking the derivative with respect to a string is nice to be able to do, for completeness anyway, but won’t further be used in this post. See, stuff’s getting simpler already!)

The final equation describes another extension to the regular expression, the idea of a *regular vector*; the derivative is just the derivative of the individual components. There will be much more about this extension later, though; it is primarily important as a tool to build a scanner, where you have a number of regular expressions and are interested in the one that matches at the beginning of a longer string.

If you look closely, you will see something odd in the line defining *δ*(*r* ⋅ *s*, *a*): a function *ν*(*r*). The function *ν* is a *nullability test*, a test of whether the expression *r* describes a language that contains the empty string, *ϵ*. If *r* includes *ϵ*, then *ν*(*r*) ultimately evaluates to *ϵ*; if *r* doesn’t include *ϵ*, then *ν*(*r*) evaluates to ∅. Mathematically, *ν* is:

$$
\begin{array}{rcl}
\nu(\epsilon) & = & \epsilon \\
\nu(a) & = & \emptyset \\
\nu(\emptyset) & = & \emptyset \\
\nu(r \cdot s) & = & \nu(r) \wedge \nu(s) \\
\nu(r|s) & = & \nu(r) | \nu(s) \\
\nu(r*) & = & \epsilon \\
\nu(r \wedge s) & = & \nu(r) \wedge \nu(s) \\
\nu(\neg r) & = & \left\{ \begin{array}{rl}
\epsilon & \textrm{if $\nu(r) = \emptyset$} \\
\emptyset & \textrm{if $\nu(r) = \epsilon$}
\end{array} \right.
\end{array}
$$

The point of *ν* in the derivative of concatenation is, if the left component, *r*, is nullable, then we need to push the derivative operation onto the right component, *s*. Consider what happens when you take the derivative of the regular expression *a**b* with respect to *a*: *a* is not nullable, so that equation evaluates to (*ϵ* ⋅ *b*)|(∅ ⋅ ∅), which can be simplified to *ϵ* ⋅ *b*. The derivative of *that* with respect to *b* (note that *ϵ* *is* nullable) is (∅⋅*b*)|(*ϵ* ⋅ *ϵ*), which simplifies to *ϵ*. Which is exactly what we need: the derivative of the regular expression *a**b* with respect to the string *a**b* should be exactly *ϵ*.
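Here is one more worked example, this time with a nullable left component: the derivative of *a** ⋅ *b* with respect to the string *a**a**b*. I am assuming the usual simplifications *ϵ* ⋅ *r* = *r*, ∅ ⋅ *r* = ∅, and ∅|*r* = *r*, which follow from the languages involved.

$$
\begin{array}{rcl}
\delta(a* \cdot b, a) & = & (\delta(a*, a) \cdot b) | (\nu(a*) \cdot \delta(b, a)) \\
& = & ((\delta(a, a) \cdot a*) \cdot b) | (\epsilon \cdot \emptyset) \\
& = & ((\epsilon \cdot a*) \cdot b) | \emptyset \\
& = & a* \cdot b \\
\delta(a* \cdot b, b) & = & (\delta(a*, b) \cdot b) | (\nu(a*) \cdot \delta(b, b)) \\
& = & ((\emptyset \cdot a*) \cdot b) | (\epsilon \cdot \epsilon) \\
& = & \emptyset | \epsilon \\
& = & \epsilon \\
\delta(a* \cdot b, a \cdot a \cdot b) & = & \delta(\delta(\delta(a* \cdot b, a), a), b) \\
& = & \delta(a* \cdot b, b) \\
& = & \epsilon
\end{array}
$$

Each *a* hands back the expression we started with, and the *b* only gets through because *ν*(*a**) is *ϵ*. The final result is nullable, so *a**a**b* is in the language.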

**A brief aside about Greek letters** I have an embarrassing admission to make: I never memorized the Greek alphabet. I recognize the common ones, sure, such as *α*, *π*, and *λ*. I even know some of the upper case: *Σ* and *Ω*. But generally, when I run across a letter in some math in some paper, well… I could look it up, but really, who does that? I generally just refer to it as “smegma”. Multiple letters? “Big smegma”, “the other smegma” and so on. It’s probably a good thing I don’t have to do this in public very often. In any case, I did look them up this time: *δ* is *delta* and *ν* is *nu*.

Now that we have a good, clean definition of the derivative of a regular expression, what can you do with it?

How about writing a proggy?

In this section, I’m going to start by presenting a basic internal representation of regular expressions, as a combination of simple expressions (*ϵ*, symbols, etc.) and higher-level operations on one or more lower-level expressions (concatenation, alternation, Kleene-star, etc.), plus some operations on the internal representation of expressions. Then, I’ll present a simple parser for expressions, to build the internal representation. Finally, I’ll show a regular expression matching engine using the derivative operation.

I have not written about Scheme before; in fact, this is the first serious work I’ve done in Scheme. But it seems like a nice enough language and I thought that, since I was doing programming language work, I would use one of the favorite tools for that task. Hence, Scheme, specifically Guile 2.0.11.

I may not have used Scheme before, but I do have a lot of experience with Common Lisp, although I haven’t used *it* seriously in quite a number of years. But one thing I have written about before is the weird love-affair that Lisp programmers have with representations over abstractions for data structures. Matt Might, in his own Scheme code for regular expression derivatives, provides some dandy examples:

```
;; Special regular expressions.
(define regex-NULL #f) ; -- the empty set
(define regex-BLANK #t) ; -- the empty string
```

Yes, using Booleans for ∅ and *ϵ*. He even goes so far as to write, for his version of *ν*,

```
; regex-empty : regex -> boolean
(define (regex-empty re)
  (cond
    ((regex-empty? re) #t)
    ((regex-null? re) #f)
    ((regex-atom? re) #f)
    ((match-seq re (lambda (pat1 pat2)
                     (seq (regex-empty pat1) (regex-empty pat2)))))
    ((match-alt re (lambda (pat1 pat2)
                     (alt (regex-empty pat1) (regex-empty pat2)))))
    ((regex-rep? re) #t)
    (else #f)))
```

Not only is that not using standard Scheme (Might is using Racket), but the only reason that function results in a Boolean is a carefully arranged accident of representation along with careful construction of the alternation and concatenation operators.

As a result, I’m going to follow my own advice and create a nice pile of structures. (Who’s scared of a little verbosity, anyway?) To start with, here are representations of ∅ and *ϵ* expressions:

```
(define-record-type dre-null-t ; The empty language; the empty set
  (dre-null-raw) dre-null?)
(define dre-null (dre-null-raw)) ; Uninteresting value; use a constant
(define-record-type dre-empty-t ; The empty string
  (dre-empty-raw) dre-empty?)
(define dre-empty (dre-empty-raw))
```

The records use SRFI-9 record types, as recommended in the Guile documentation. Since ∅ and *ϵ* do not have any interesting structure, I am defining constants for them, `dre-null` and `dre-empty`.

I am presenting the completed proof-of-concept work here, and there are some differences between it and the simple presentation in the previous section. One of those differences is the regular vectors mentioned above (and again, later). Another is this: Rather than using single, constant characters as the base regular expressions, this code uses character sets.

The fundamental idea is that, rather than using raw symbols such as *a*, I use symbol sets, *S*, such as the set {*a*}. *Nothing changes* from the presentation above, except that the definitions

$$
\begin{array}{rcl}
\delta(a, a) & = & \epsilon\ \textrm{ when $a$ = $a$} \\
\delta(b, a) & = & \emptyset\ \textrm{ when $a$ $\neq$ $b$}
\end{array}
$$

become

$$
\begin{array}{rcl}
\delta(S, a) & = & \left\{ \begin{array}{rl}
\epsilon\ & \textrm{ when $a$ $\in$ $S$} \\
\emptyset\ & \textrm{ when $a$ $\not\in$ $S$}
\end{array} \right.
\end{array}
$$

Using character sets has a number of advantages. For one thing, trying to enumerate a large number of characters, like Unicode, would be silly. Instead, large character sets can be defined positively and negatively: a finite set of given characters, or the negation of a finite set of characters.

```
(define-record-type dre-chars-t ; Set of characters
  (dre-chars-raw positive chars) dre-chars?
  (positive dre-chars-pos?)
  (chars dre-chars-set))
(define (dre-chars chars)
  (dre-chars-raw #t (list->set chars)))
(define (dre-chars-neg chars)
  (dre-chars-raw #f (list->set chars)))
(define dre-chars-sigma (dre-chars-neg '()))
```

*Σ* is defined here as the negation of the empty set.

The basic set operations are complicated by the positive/negative distinction, but not overwhelmingly so.

```
(define (dre-chars-member? re ch)
  (let ([is-member (set-member? (dre-chars-set re) ch)])
    (if (dre-chars-pos? re) is-member (not is-member)) ))
(define (dre-chars-empty? chars)
  (cond
    [(not (dre-chars? chars)) (error "not a character set:" chars)]
    [(dre-chars-pos? chars) (set-empty? (dre-chars-set chars))]
    [else #f] ))
(define (dre-chars-intersection l r)
  (cond
    [(and (dre-chars-pos? l) (dre-chars-pos? r))
     ;; both positive: simple intersection
     (dre-chars-raw #t (set-intersection (dre-chars-set l) (dre-chars-set r)))]
    [(dre-chars-pos? l)
     ;; l positive, r negative: elts in l also in r by dre-chars-member?
     (dre-chars (filter (lambda (elt) (dre-chars-member? r elt))
                        (set-elts (dre-chars-set l))))]
    [(dre-chars-pos? r)
     ;; l negative, r positive: the mathematician's answer
     (dre-chars-intersection r l)]
    [else
     ;; both negative: slightly less simple union
     (dre-chars-raw #f (set-union (dre-chars-set l) (dre-chars-set r)))] ))
```

If you take a close look at `dre-chars-empty?`, you’ll see that a negated character set cannot be empty; it always returns false. Interesting, eh? (It means *Σ* is infinite.)
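To poke at the positive/negative idea without my set implementation (which I have not shown), here is a minimal stand-alone sketch. Everything in it is hypothetical and for illustration only: the `cs-` names are made up, and a plain list of characters stands in for the real set type.

```scheme
;; Sketch only: a character set is a pair (positive? . chars),
;; with chars a plain list standing in for the real set type.
(define (cs-pos chars) (cons #t chars))
(define (cs-neg chars) (cons #f chars))
(define (cs-pos? cs) (car cs))
(define (cs-chars cs) (cdr cs))

;; A positive set contains what it lists; a negative set, everything else.
(define (cs-member? cs ch)
  (let ((listed (if (memv ch (cs-chars cs)) #t #f)))
    (if (cs-pos? cs) listed (not listed))))

;; Keep the elements of lst that satisfy pred.
(define (keep pred lst)
  (cond ((null? lst) '())
        ((pred (car lst)) (cons (car lst) (keep pred (cdr lst))))
        (else (keep pred (cdr lst)))))

(define (cs-intersection l r)
  (cond
    ;; l positive: keep the listed characters that r also accepts
    ((cs-pos? l)
     (cs-pos (keep (lambda (ch) (cs-member? r ch)) (cs-chars l))))
    ;; l negative, r positive: intersection is symmetric, so swap
    ((cs-pos? r) (cs-intersection r l))
    ;; both negative: (not A) and (not B) is not (A or B)
    (else (cs-neg (append (cs-chars l) (cs-chars r))))))

(define vowels (cs-pos '(#\a #\e #\i #\o #\u)))
(define not-ae (cs-neg '(#\a #\e)))

(cs-chars (cs-intersection vowels not-ae))                  ; => (#\i #\o #\u)
(cs-member? (cs-intersection not-ae (cs-neg '(#\z))) #\q)   ; => #t
(cs-member? (cs-intersection not-ae (cs-neg '(#\z))) #\z)   ; => #f
```

The both-negative case is the only place a union shows up, and the list it produces may contain duplicates; for a membership test that is harmless.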

The implications of that appear in the one further special operation: the choice of a character from a set. It is easy if the set is positive, but if the set is negative I am using Scheme’s typelessness and taking the cheap, quick way out: returning something that cannot possibly be a member of a character set.

```
(define (dre-chars-choice chars)
  (cond
    [(not (dre-chars? chars))
     (error "not a character set:" chars)]
    [(and (dre-chars-pos? chars) (not (dre-chars-empty? chars)))
     (car (set-elts (dre-chars-set chars)))]
    [else (gensym)] ; Not a character
    ))
```

`dre-chars-choice` is not immediately needed, but becomes important when converting a regular expression directly into a deterministic automaton, which will appear further down.

The concatenation of two regular expressions is represented by `dre-concat-t`, a record with left and right fields.

Constructing a concatenation is another moderately complicated operation.

```
(define-record-type dre-concat-t ; Concatenation; sequence
  (dre-concat-raw left right) dre-concat?
  (left dre-concat-left)
  (right dre-concat-right))
(define (dre-concat left right)
  (cond
    [(not (dre? left))
     (error "not a regular expression: " left)]
    [(not (dre? right))
     (error "not a regular expression: " right)]
    ;; ∅ ∙ r => ∅
    [(dre-null? left)
     dre-null]
    ;; r ∙ ∅ => ∅
    [(dre-null? right)
     dre-null]
    ;; ϵ ∙ r => r
    [(dre-empty? left)
     right]
    ;; r ∙ ϵ => r
    [(dre-empty? right)
     left]
    ;; (r ∙ s) ∙ t => r ∙ (s ∙ t)
    [(dre-concat? left)
     (dre-concat (dre-concat-left left)
                 (dre-concat (dre-concat-right left)
                             right))]
    [else
     (dre-concat-raw left right)]
    ))
```

The idea here is to keep the structure in a “canonical” state. If either the left or the right expression is ∅, the concatenation is ∅. If the left expression is *ϵ*, the concatenation is simply the right expression; likewise for the right expression. Finally, a tree structure built of concatenations should be “right-heavy”: the leftmost expression should always be a non-concatenation, recursively.

This canonical structure, which will be repeated for alternation, Kleene-closure, conjunction, and negation, is used when comparing two regular expression structures for equality (or at least semi-equality; it’s still possible for two differing expressions to describe the same language). Once again, a topic that will become important when converting the expression to a DFA.

Given that alternation in a regular expression is the same as union of the two languages described by the two expressions, the simplifications for the canonical form should be easy to see.

```
(define-record-type dre-or-t ; Logical or; alternation; union
  (dre-or-raw left right) dre-or?
  (left dre-or-left)
  (right dre-or-right))
(define (dre-or left right)
  (cond
    [(not (dre? left))
     (error "not a regular expression: " left)]
    [(not (dre? right))
     (error "not a regular expression: " right)]
    ;; ∅ + r => r
    [(dre-null? left)
     right]
    [(dre-null? right)
     left]
    ;; r + r => r
    [(dre-equal? left right)
     left]
    ;; ¬∅ + r => ¬∅
    [(and (dre-negation? left)
          (dre-null? (dre-negation-regex left)))
     left]
    ;; (r + s) + t => r + (s + t)
    [(dre-or? left)
     (dre-or (dre-or-left left)
             (dre-or (dre-or-right left)
                     right))]
    [else
     (dre-or-raw left right)]
    ))
```

Just to keep everything notationally interesting, I have used “|” for alternation, but the “practical” paper uses “+”, and I copied the comments in the code from the paper.

There are a couple of slightly odd facts about the Kleene closure.

The closure of the empty language is the language containing nothing but the empty string.

The closure of the closure of an expression, (*r**)*, is the same as the unnested closure of the expression, *r**.

Weird, eh?

```
(define-record-type dre-closure-t ; Kleene closure; repetition
  (dre-closure-raw regex) dre-closure?
  (regex dre-closure-regex))
(define (dre-closure regex)
  (cond
    [(not (dre? regex)) (error "not a regular expression: " regex)]
    ;; ∅* => ϵ
    [(dre-null? regex) dre-empty]
    ;; ϵ* => ϵ
    [(dre-empty? regex) dre-empty]
    ;; (r*)* => r*
    [(dre-closure? regex) regex]
    [else (dre-closure-raw regex)]
    ))
```
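Both of the odd facts about the closure fall out once the star is unrolled into “zero copies, or one, or two, and so on”. Here *L*(*r*) is my shorthand for the language described by *r*; the “zero copies” term is always {*ϵ*}, no matter what *r* is.

$$
\begin{array}{rcl}
L(r*) & = & \{\epsilon\} \cup L(r) \cup L(r \cdot r) \cup \cdots \\
L(\emptyset*) & = & \{\epsilon\} \cup \emptyset \cup \emptyset \cup \cdots = \{\epsilon\} \\
L((r*)*) & = & \{\epsilon\} \cup L(r*) \cup L(r* \cdot r*) \cup \cdots = L(r*)
\end{array}
$$

In the last line, gluing together strings that are themselves built from copies of *r* only ever produces more strings built from copies of *r*, so nothing new is added.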

Conjunction represents the intersection of the two languages, so the canonical simplifications, similar to alternation, should be self-explanatory.

Don’t you hate it when an author says, “This should be immediately clear”?

```
(define-record-type dre-and-t ; Logical and; intersection
  (dre-and-raw left right) dre-and?
  (left dre-and-left)
  (right dre-and-right))
(define (dre-and left right)
  (cond
    [(not (dre? left))
     (error "not a regular expression: " left)]
    [(not (dre? right))
     (error "not a regular expression: " right)]
    ;; ∅ & r => ∅
    [(dre-null? left)
     dre-null]
    [(dre-null? right)
     dre-null]
    ;; r & r => r
    [(dre-equal? left right)
     left]
    ;; (r & s) & t => r & (s & t)
    [(dre-and? left)
     (dre-and (dre-and-left left)
              (dre-and (dre-and-right left)
                       right))]
    ;; ¬∅ & r => r
    [(and (dre-negation? left)
          (dre-null? (dre-negation-regex left)))
     right]
    [else
     (dre-and-raw left right)]
    ))
```

And once again, the negation of a regular expression is the complement of the language set. And the negation of a negation is the original expression and set.

```
(define-record-type dre-negation-t ; Complement
  (dre-negation-raw regex) dre-negation?
  (regex dre-negation-regex))
(define (dre-negation regex)
  ;; ¬(¬r) => r
  (if (dre-negation? regex)
      (dre-negation-regex regex)
      (dre-negation-raw regex))
  )
```

Regular expression vectors are remarkably simple, all the way through this. That is fortunate, because they become very, very important much later on.

```
(define-record-type dre-vector-t ; A vector of regexs
  (dre-vector v) dre-vector?
  (v dre-vector-list))
```

I am almost done with the preliminaries, I promise. Now, we have a couple of predicates that are both used in the constructors above (as well as the code below).

```
(define (dre? re)
  (or (dre-null? re)
      (dre-empty? re)
      (dre-chars? re)
      (dre-concat? re)
      (dre-or? re)
      (dre-closure? re)
      (dre-and? re)
      (dre-negation? re)
      (dre-vector? re) ))
```

The function `dre?` returns true if the argument is a regular expression and false otherwise. It’s built from the predicates for the constituent structures.

**A brief aside about names** I have used `dre` as the prefix for the structures and functions I am writing here. This stands for “Derivatives of Regular Expressions”. Possibly. Maybe. Or something. This is why I don’t get to name things.

At this point, I feel my choice of Guile Scheme is unfortunate. If I had chosen Racket, I could have called the package “Dr. Dre.” Not that I would ever actually do that. I really shouldn’t name things.

In any case, the next predicate is `dre-equal?`, a function to compare two regular expressions for equality.

```
(define (dre-equal? left right)
  (cond
    [(not (dre? left))
     #f]
    [(not (dre? right))
     #f]
    [(and (dre-and? left) (dre-and? right))
     (let ([l1 (dre-and-left left)]
           [l2 (dre-and-right left)]
           [r1 (dre-and-left right)]
           [r2 (dre-and-right right)])
       (or (and (dre-equal? l1 r1)
                (dre-equal? l2 r2))
           (and (dre-equal? l1 r2)
                (dre-equal? l2 r1))))]
    [(and (dre-or? left) (dre-or? right))
     (let ([l1 (dre-or-left left)]
           [l2 (dre-or-right left)]
           [r1 (dre-or-left right)]
           [r2 (dre-or-right right)])
       (or (and (dre-equal? l1 r1)
                (dre-equal? l2 r2))
           (and (dre-equal? l1 r2)
                (dre-equal? l2 r1))))]
    [(and (dre-vector? left) (dre-vector? right))
     (every dre-equal?
            (dre-vector-list left)
            (dre-vector-list right))]
    [else
     (equal? left right)]
    ))
```

Truly comparing expressions for equality is decidable, but it is expensive: it means deciding whether the two expressions describe the same language, and that problem is PSPACE-complete even for ordinary regular expressions without conjunction or negation. That is far more work than I want to do every time a constructor is called.

What this predicate is doing is comparing the two expressions for limited, structural, equality, with the additional proviso that `dre-and` and `dre-or` are symmetric: the order of the arguments doesn’t matter. This, combined with the (minimal) work I put into making the structural representations canonical ought to be good enough, especially since any two expressions being compared were probably derived from some single previous expression.

“Good enough”, in this case, means optimizing the structures above (where minor problems will not affect their semantics) and minimizing the number of states below (much further below, where the argument that the expressions are derived from a single parent is probably more important for correctness).
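To make “limited, structural, equality” concrete, here is a small stand-alone sketch. It is an illustration only, not the code above: it assumes a throwaway representation of expressions as tagged lists such as `(alt r s)` and `(seq r s)`, and `re-equal?` and `symmetric-tag?` are hypothetical names.

```scheme
;; Sketch only: regexes as tagged lists, e.g. (alt #\a #\b) or (seq r s).
;; Structural equality, except that alt and and ignore argument order.
(define (symmetric-tag? tag) (if (memq tag '(alt and)) #t #f))

(define (re-equal? l r)
  (if (and (pair? l) (pair? r)
           (eq? (car l) (car r))
           (symmetric-tag? (car l)))
      (let ((l1 (cadr l)) (l2 (caddr l))
            (r1 (cadr r)) (r2 (caddr r)))
        (or (and (re-equal? l1 r1) (re-equal? l2 r2))
            (and (re-equal? l1 r2) (re-equal? l2 r1))))
      (equal? l r)))

;; Swapped arguments to an alternation still compare equal.
(re-equal? '(alt #\a #\b) '(alt #\b #\a))            ; => #t

;; a(b|c) and ab|ac describe the same language, {ab, ac},
;; but their structures differ, so the comparison says no.
(re-equal? '(seq #\a (alt #\b #\c))
           '(alt (seq #\a #\b) (seq #\a #\c)))       ; => #f
```

The second result is the price of staying structural: a “no” only means the two trees differ, not that the languages do.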

At this point, I have the structural tools to create representations of regular expressions by gluing things together by hand. But I would very much like not to do that. So, I have (ahem) borrowed a recursive-descent regex parser from Matt Might.

The original grammar (I’ve modified it a bit, specifically to extend to more regular expressions) was:

$$
\begin{array}{rcl}
\textit{regex} & := & \textit{term}\ \textbf{|}\ \textit{regex} \\
& | & \textit{term} \\
\textit{term} & := & \{ \textit{factor} \} \\
\textit{factor} & := & \textit{base} \{ \textbf{*} \} \\
\textit{base} & := & \textit{char} \\
& | & \textbf{(}\ \textit{regex}\ \textbf{)} \\
& | & \textbf{[}\ \textit{set}\ \textbf{]} \\
\textit{set} & := & \{ \textit{char} \} \\
& | & \textbf{^}\ \{ \textit{char} \} \\
\textit{char} & := & \textrm{character} \\
& | & \textbf{\\}\ \textrm{character}
\end{array}
$$

The parser is `string->dre`; given a string, it produces a regular expression built from the structures above. It reads from a string argument and uses the value `cur` to record its current location in the string.

```
(define (string->dre str)
  (let ([cur 0])
    ;; Peek at the next character
    (define (peek)
      (if (more) (string-ref str cur) (error "unexpected end of string")))
    ;; Advance past the next character
    (define (eat ch)
      (if (char=? ch (peek)) (set! cur (+ cur 1))
          (error "unexpected character:" ch (peek))))
    ;; Eat and return the next character
    (define (next)
      (let ([ch (peek)]) (eat ch) ch))
    ;; Is more input available?
    (define (more)
      (< cur (string-length str)))
    ;; The regex rule
    (define (regex)
      (let ([trm (term)])
        (cond
          [(and (more) (char=? (peek) #\|))
           (eat #\|)
           (dre-or trm (regex))]
          [else
           trm]
          )))
    ;; The term rule
    (define (term)
      (let loop ([fact dre-empty])
        (cond
          [(and (more) (and (not (char=? (peek) #\)))
                            (not (char=? (peek) #\|))))
           (loop (dre-concat fact (factor)))]
          [else
           fact]
          )))
    ;; The factor rule
    (define (factor)
      (let loop ([b (base)])
        (cond
          [(and (more) (char=? (peek) #\*))
           (eat #\*)
           (loop (dre-closure b))]
          [else
           b]
          )))
    ;; The base rule
    (define (base)
      (cond
        [(char=? (peek) #\()
         ;; parenthesized sub-pattern
         (eat #\()
         (let ([r (regex)])
           (eat #\))
           r)]
        [(char=? (peek) #\[)
         ;; character set
         (eat #\[)
         (let ([s (set)])
           (eat #\])
           s)]
        [(char=? (peek) #\.)
         ;; any character except newline
         (eat #\.)
         (dre-chars-neg '(#\newline))]
        [else
         ;; single character
         (dre-chars (list (char)))]
        ))
    ;; The set rule
    (define (set)
      (cond
        [(char=? (peek) #\^)
         ;; negated set
         (eat #\^)
         (dre-chars-neg (chars))]
        [else
         ;; positive set
         (dre-chars (chars))]
        ))
    ;; The chars rule
    (define (chars)
      (let loop ([ch '()])
        (if (char=? (peek) #\])
            ch
            (loop (cons (char) ch)))))
    ;; The char rule
    (define (char)
      (cond
        [(char=? (peek) #\\)
         ;; quoted character
         (eat #\\)
         (next)]
        [else
         ;; unquoted character
         (next)]
        ))
    ;; Read the regex from the string
    (let ([r (regex)])
      (if (more)
          (error "incomplete regular expression:" (substring str cur))
          r))
    ))
```

Note the brilliant use of lexical scoping, to enclose the functions implementing the grammar rules, which simplifies their interfaces and hides them from the outside. Lexical scoping is damn nice.
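To see what kind of tree a parser like this builds, here is a cut-down sketch that can be run on its own. It is not `string->dre`: it assumes a throwaway tagged-list output such as `(alt r s)`, `(seq r s)`, and `(star r)` instead of the records, it handles only plain characters, parentheses, `|`, and `*` (no character sets, no escapes), and `parse-regex` is a hypothetical name.

```scheme
;; Sketch only: a reduced recursive-descent parser producing tagged lists.
(define (parse-regex str)
  (let ((cur 0))                       ; cursor shared by all the rules below
    (define (more?) (< cur (string-length str)))
    (define (peek)
      (if (more?) (string-ref str cur) (error "unexpected end of string")))
    (define (next)
      (let ((ch (peek))) (set! cur (+ cur 1)) ch))
    (define (eat ch)
      (if (char=? ch (peek))
          (set! cur (+ cur 1))
          (error "unexpected character:" (peek))))
    ;; regex := term | term '|' regex
    (define (regex)
      (let ((trm (term)))
        (if (and (more?) (char=? (peek) #\|))
            (begin (eat #\|) (list 'alt trm (regex)))
            trm)))
    ;; term := { factor }
    (define (term)
      (let loop ((acc 'empty))
        (if (and (more?)
                 (not (char=? (peek) #\)))
                 (not (char=? (peek) #\|)))
            (let ((f (factor)))
              ;; unlike dre-concat, this nests to the left and
              ;; only drops a leading 'empty
              (loop (if (eq? acc 'empty) f (list 'seq acc f))))
            acc)))
    ;; factor := base { '*' }
    (define (factor)
      (let loop ((b (base)))
        (if (and (more?) (char=? (peek) #\*))
            (begin (eat #\*) (loop (list 'star b)))
            b)))
    ;; base := char | '(' regex ')'
    (define (base)
      (if (char=? (peek) #\()
          (begin (eat #\()
                 (let ((r (regex))) (eat #\)) r))
          (next)))
    (let ((r (regex)))
      (if (more?)
          (error "incomplete regular expression:" (substring str cur))
          r))))

(parse-regex "ab*|c")     ; => (alt (seq #\a (star #\b)) #\c)
(parse-regex "(a|b)*")    ; => (star (alt #\a #\b))
```

None of the rule functions takes the string or the cursor as an argument; they all reach `str` and `cur` through the enclosing scope, which is the point made above.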

And finally, I reach the heart of regular expression derivatives: implementing the functions *δ* and *ν*. They are actually pretty trivial, simply translating the definitions above to use the structures and functions that I’ve defined.

The function *ν* is first, implemented as `nu`.

```
(define (nu re)
  (cond
    [(not (dre? re))
     (error "not a regular expression: " re)]
    [(dre-empty? re)
     dre-empty]
    [(dre-chars? re)
     dre-null]
    [(dre-null? re)
     dre-null]
    [(dre-concat? re)
     (dre-and (nu (dre-concat-left re))
              (nu (dre-concat-right re)))]
    [(dre-or? re)
     (dre-or (nu (dre-or-left re))
             (nu (dre-or-right re)))]
    [(dre-closure? re)
     dre-empty]
    [(dre-and? re)
     (dre-and (nu (dre-and-left re))
              (nu (dre-and-right re)))]
    [(dre-negation? re)
     (if (dre-null? (nu (dre-negation-regex re)))
         dre-empty
         dre-null)]
    ))
```

The second function is *δ*, implemented as `delta`.

```
(define (delta re ch)
  (cond
    [(not (dre? re))
     (error "not a regular expression: " re)]
    [(dre-empty? re)
     dre-null]
    [(dre-null? re)
     dre-null]
    [(dre-chars? re)
     (if (dre-chars-member? re ch) dre-empty dre-null)]
    [(dre-concat? re)
     (dre-or (dre-concat (delta (dre-concat-left re) ch)
                         (dre-concat-right re))
             (dre-concat (nu (dre-concat-left re))
                         (delta (dre-concat-right re) ch)))]
    [(dre-closure? re)
     (dre-concat (delta (dre-closure-regex re) ch) re)]
    [(dre-or? re)
     (dre-or (delta (dre-or-left re) ch)
             (delta (dre-or-right re) ch))]
    [(dre-and? re)
     (dre-and (delta (dre-and-left re) ch)
              (delta (dre-and-right re) ch))]
    [(dre-negation? re)
     (dre-negation (delta (dre-negation-regex re) ch))]
    [(dre-vector? re)
     (dre-vector (map (lambda (r) (delta r ch))
                      (dre-vector-list re)))]
    ))
```

Ok, so now I have code for representing regular expressions, parsing them, and computing their derivatives with respect to a character. What can I do with it?

How about a simple, quick, and dirty regular expression matcher?

```
(define (dre-match-list? re list)
  (cond
    [(null? list) (dre-empty? (nu re))]
    [else (dre-match-list? (delta re (car list)) (cdr list))]
    ))
(define (dre-match? re str) (dre-match-list? re (string->list str)))
```

`dre-match?` returns true if the expression matches the string and false otherwise. It does this by recursively computing the derivative of the starting regular expression with respect to each character in the string, in order. When the end of the string is reached, the regular expression is declared to match if it matches the empty string, using *ν* (in case the current expression has some significant structure).

Things don’t really get much simpler than that.

How does it work? Pretty well:

```
> (dre-match? (string->dre "ab") "ab")
$2 = #t
> (dre-match? (string->dre "ab*") "abbb")
$3 = #t
> (dre-match? (string->dre "ab*") "acbb")
$4 = #f
> (dre-match? (string->dre "\"[^\"]*\"") "\"A string!\"")
$5 = #t
> (dre-match? (string->dre "\"[^\"]*\"") "\"A string!\" not really")
$6 = #f
> (dre-match? (string->dre "\"[^\"]*\"") "\"A \\\"silly\\\" string!\"")
$7 = #f
> (dre-match? (string->dre "\"(\\\"\|[^\"])*\"") "\"A \\\"silly\\\" string!\"")
$8 = #t
```

I apologize for the extra backslashes there. The string in the fourth example is `"A string!"`; the regex is `"[^"]*"`. The final example shows a regular expression that matches quoted strings with escaped quotes inside.
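For anyone who would rather poke at the idea outside Scheme, here is a minimal Python sketch of the same matching loop. It is not the `dre-` code from this post: the tuple encoding and the names `chars`, `concat`, `alt`, `star`, `nullable`, `delta`, and `matches` are assumptions made up for illustration. It only covers character sets, concatenation, alternation, and closure, and `nullable` returns a boolean where `nu` returns a regular expression.

```python
# Regular expressions as plain tuples (an illustrative encoding, not the post's).
NULL = ("null",)    # matches nothing at all
EMPTY = ("empty",)  # matches only the empty string


def chars(s):
    return ("chars", frozenset(s))


def concat(r, s):
    return ("concat", r, s)


def alt(r, s):
    return ("or", r, s)


def star(r):
    return ("star", r)


def nullable(r):
    """True if r matches the empty string (the role nu plays in the post)."""
    tag = r[0]
    if tag == "concat":
        return nullable(r[1]) and nullable(r[2])
    if tag == "or":
        return nullable(r[1]) or nullable(r[2])
    return tag in ("empty", "star")


def delta(r, ch):
    """Derivative of r with respect to the single character ch."""
    tag = r[0]
    if tag == "chars":
        return EMPTY if ch in r[1] else NULL
    if tag == "concat":
        first = concat(delta(r[1], ch), r[2])
        return alt(first, delta(r[2], ch)) if nullable(r[1]) else first
    if tag == "or":
        return alt(delta(r[1], ch), delta(r[2], ch))
    if tag == "star":
        return concat(delta(r[1], ch), r)
    return NULL  # derivatives of NULL and EMPTY match nothing


def matches(r, text):
    # Take the derivative for each character, then ask about the empty string.
    for ch in text:
        r = delta(r, ch)
    return nullable(r)


ab_star = concat(chars("a"), star(chars("b")))  # the regex ab*
```

With those definitions, `matches(ab_star, "abbb")` is true and `matches(ab_star, "acbb")` is false, mirroring the second and third REPL examples. Nothing here simplifies the intermediate expressions, so they grow with every character; that is fine for a toy, less so for long inputs.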

I know what you’re thinking. You’re saying to yourself, “We went through all that and now we have a moderately poor, slow, regular expression matcher. Yay.” But there are two points to consider:

By my count, including blank lines and comments, and some stuff that I haven’t shown (I had to implement my own set for characters, and I wrote pretty-printers for the structures), we are at about 600 lines. Yeah, that’s not great, but I haven’t been trying to compress things, either.

That’s not all of the fun we can have here. Using derivative techniques, I can convert a regular expression into a deterministic finite automaton. Without first creating a non-deterministic finite automaton, which is the usual process as described in the Dragon book, for example.

This section begins by describing deterministic finite automata and discusses one minor road bump in building one from a regular expression: character sets. Then I go into the Scheme code for finite automata and finally describe the direct conversion from regular expressions to DFAs. Finally, I cover how to use the resulting DFA to build a lexical scanner.

“What’s a deterministic finite automaton?” you’re asking. (Well, maybe not. If not, you can probably take another nap through this section.)

An *automaton* in this context is a machine for recognizing strings that are in a given language, as opposed to strings that are not in the language. A *finite state automaton* is a machine that has a finite number of states. (Ok, that was just redundant.) Unlike other automata which have external memory, like the stacks of a stack automaton or the tape of a Turing machine, an FA only has a fixed number of states that it can be in. It reads a character, makes a transition into another state or possibly the same one, and then looks to read another character. If the machine is in an *accepting state* when the last character has been read, the string is declared to be in the language.

The following illustration shows a DFA for *ab\*c*.

State 1 is the starting state, state 3 is accepting, and each edge is either labeled with a character set or not labeled, indicating any character.

More formally, an FA is a directed graph made of nodes (representing states), *S*, and edges (representing transitions), *T*. A *state* *s* ∈ *S* has three important properties:

An identity—each node can be distinguished from every other node.

Whether or not the node is a starting node for the machine. There is only one starting node in a given FA.^{5}

Whether or not the node is an accepting node for the machine. There can be any number of accepting nodes (although there should be at least one, otherwise the language will be ∅).

A *transition* *t* ∈ *T* is a triple of three elements, (*s*, *c*, *s*′). The initial state for the transition is *s*; *c* is an input character (or a set of input characters), and *s*′ is a destination state. What the transition *t* means is that when the FA is in state *s*, if it sees a character in *c*, then it moves to state *s*′.

A *deterministic* finite automaton is an FA where every state has exactly one outgoing transition for every possible input character. As a result, when reading a character, the machine does not need to make any choices; it just matches the current state and the input character to determine the single possible destination state.

“Great!” you say, “Wonderful! But, so what?” A DFA is the fastest way to do regular expression matching. It won’t work for all the fancy features of other regular expression engines, and there are techniques that it cannot match, like using a fast string-search algorithm to look for fixed strings. But a DFA is an *O*(*n*) algorithm with very fast constants, if you can determine which transition matches an input character quickly—frequently this is a table look-up.
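To make the table look-up concrete, here is a small Python sketch of a DFA for *ab\*c*. The state numbers 1 through 3 follow the illustration as described above; the middle state 2, the explicit error state 4, and the names `TRANSITIONS` and `accepts` are my assumptions for the sake of the example.

```python
# (state, character) -> next state. Anything not listed goes to the error
# state, which has no way back to an accepting state.
START = 1
ACCEPTING = {3}
ERROR = 4
TRANSITIONS = {
    (1, "a"): 2,
    (2, "b"): 2,
    (2, "c"): 3,
}


def accepts(text):
    state = START
    for ch in text:
        # One dictionary look-up per character: O(n) in the input length.
        state = TRANSITIONS.get((state, ch), ERROR)
    return state in ACCEPTING
```

`accepts("abbc")` is true, while `accepts("abcb")` is false because the trailing `b` moves the machine from state 3 into the error state. Treating every missing entry as a move to the error state keeps the machine deterministic without writing out a full matrix.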

Now, about those input characters…

When doing anything with regular expressions, there is a very basic question that I have not asked until now: What are your characters, and how are you handling them? This question suddenly becomes important right at this point.

Up to this point, the answer is pretty simple: we have positive and negative sets of finite numbers of characters and the only operations we have used have been construction and a membership test. But, I defined all those extra operations…

Here’s the deal: traditional DFA construction algorithms iterate over *all* characters in the alphabet, *Σ*, to identify the transitions leaving a state. In effect, if not in practice, they build a matrix where the rows are states and there is one column for all possible input characters; the value of each entry is the next state to enter.

That approach is feasible for small alphabets, such as 7-bit ASCII or 8-bit ISO8859-1 (or -2, or -9, and I hope you’re seeing the problem). But it will not work at all for a Unicode alphabet.

Instead, “Regular expression derivatives reexamined” provides a suggested extension to Brzozowski’s original work on DFA construction using regular expression derivatives, using character sets (which I have been following all along) and computing approximate equivalence classes of characters for each state to identify the outgoing transitions. To quote the paper,

Given an RE *r* over *Σ*, and symbols *a* and *b* ∈ *Σ*, we say that *a* ≃_{r} *b* if and only if *δ*(*r*, *a*) ≡ *δ*(*r*, *b*) [notation modified]. The *derivative classes* of *r* are the equivalence classes *Σ*/≃_{r}.

In other words, the derivative classes divide up *Σ* in such a way that all of the symbols in a given derivative class compute the same derivative expression from a starting expression *r*.

Computing these derivative classes exactly would require iteration over the elements of *Σ*, which is exactly what we’re trying to avoid. Fortunately, the authors of the paper are able to define a function *C* by recursing over the structure of *r* that computes an approximation of the derivative classes.

In effect, given a regular expression *r*, *C* returns a partitioning of *Σ* into sets of characters, the derivative classes, that cause *r* to behave differently. This partitioning is approximate; two characters *a* and *b* that cause *r* to “do the same thing” may end up in different derivative classes, but the authors prove that all of the characters in the same derivative class cause *r* to behave the same way.

One thing that is important to note: each derivative class is a character set, the same structures that we have been using all along to handle input.

*C* is:

$$
\begin{array}{rcl}
C(\epsilon) & = & \{ \Sigma \} \\
C(S) & = & \{ S, \Sigma\setminus S \} \textrm{ where $\Sigma\setminus S$ is the negation of $S$} \\
C(r\ \cdot\ s) & = & \left\{ \begin{array}{ll}
C(r) & \textrm{ when $r$ is not nullable } \\
C(r)\ \hat\cap\ C(s) & \textrm{ otherwise }
\end{array} \right. \\
C(r | s) & = & C(r)\ \hat\cap\ C(s) \\
C(r \wedge s) & = & C(r)\ \hat\cap\ C(s) \\
C(r*) & = & C(r) \\
C(\neg r) & = & C(r)
\end{array}
$$

**Notational note** This is one of the most notationally crazed papers I’ve read in a long time. What I have rendered as

$$c_1 \hat\cap\ c_2$$

is ∧ in the paper, and while most of the definitions are equational like *C*, there is still another oddity coming up.

Speaking of that function (you know, intersection wearing a hat), it is defined as:

$$
C(r)\ \hat\cap\ C(s) = \{ S_r \cap S_s | S_r \in C(r), S_s \in C(s) \}
$$

i.e. the set consisting of the pair-wise intersection of all of the members of *C*(*r*) with the members of *C*(*s*).

In Scheme, *C* and its auxiliary are defined as follows:

```
(define (C re)
  (cond
   [(dre-empty? re)
    (set dre-chars-sigma)]
   [(dre-chars? re)
    (let ([elts (set-elts (dre-chars-set re))])
      (set (dre-chars elts)
           (dre-chars-neg elts)))]
   [(dre-concat? re)
    (let ([r (dre-concat-left re)]
          [s (dre-concat-right re)])
      (if (dre-empty? (nu r))
          (C-hat (C r) (C s))
          (C r)))]
   [(dre-or? re)
    (C-hat (C (dre-or-left re)) (C (dre-or-right re)))]
   [(dre-and? re)
    (C-hat (C (dre-and-left re)) (C (dre-and-right re)))]
   [(dre-closure? re)
    (C (dre-closure-regex re))]
   [(dre-negation? re)
    (C (dre-negation-regex re))]
   [(dre-null? re)
    (set dre-chars-sigma)]
   [(dre-vector? re)
    (fold C-hat
          (set dre-chars-sigma)
          (map C (dre-vector-list re)))]
   [else
    (error "unhelpful regular expression:" re)]
   ))
```

(Oh, and the auxiliary (whose name I apparently cannot use inline, thanks MathJax) is called `C-hat` for simplicity and defined using an SRFI-42 list comprehension.)

```
(define (C-hat r s)
  ;; pair-wise intersection of two sets of character sets
  (list->set (list-ec (:list elt-r (set-elts r))
                      (:list elt-s (set-elts s))
                      (dre-chars-intersection elt-r elt-s))))
```

And `C-hat` uses `dre-chars-intersection`, another of the `dre-chars` operations.
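A worked example may help with the hatted intersection. This Python sketch cheats by assuming a tiny finite alphabet, `SIGMA`, so that a negated set can be written out in full; the post's character sets avoid exactly that. The names `SIGMA`, `classes_of_set`, and `c_hat` are made up for the example.

```python
SIGMA = frozenset("abcd")  # assumed toy alphabet


def classes_of_set(s):
    """C(S) = { S, Sigma \\ S }."""
    s = frozenset(s)
    return {s, SIGMA - s}


def c_hat(cs, ds):
    """Pair-wise intersection of two sets of character sets."""
    return {c & d for c in cs for d in ds}


# C([ab]) is {{a,b}, {c,d}} and C([bc]) is {{b,c}, {a,d}}.
overlapping = c_hat(classes_of_set("ab"), classes_of_set("bc"))

# Disjoint sets produce an empty intersection among the results.
disjoint = c_hat(classes_of_set("a"), classes_of_set("b"))
```

Here `overlapping` splits the alphabet into the four singletons, since each character behaves differently with respect to the two sets. `disjoint` contains `{a}`, `{b}`, `{c, d}`, and the empty set; an empty class has no representative character, so something downstream has to filter it out.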

Now, since we have the input character question settled, I can get on with building a deterministic finite-state automaton.

As I mentioned before, a state has three important properties: an identity, whether it is a starting state, and whether it is an accepting state. And, as there is only one starting node, that property is pretty unimpressive. For purposes of implementing a transformation to a DFA, a state structure such as `dre-state-t` has some extra fields.

`dre-state-number`: This is the effective identity of the state.

`dre-state-regex`: Given that the regular expression system I’m building is based on the calculation of the derivative of a regular expression with respect to an input character, which is itself a regular expression, it is possible to label each state with the regular expression that represents *what the automaton would accept, starting from this state.*

`dre-state-accepting`: True if the state is accepting, false otherwise. (Ignore the regular expression vector lurking behind the curtain. More on it later.)

`dre-state-null`: True if the state is an error state: one from which there is no accepting string.

```
(define-record-type dre-state-t         ; A state in the machine
  (dre-state-raw n re accept null) dre-state?
  (n dre-state-number)
  (re dre-state-regex)
  (accept dre-state-accepting)
  (null dre-state-null))
```

`dre-state` uses a lexically-scoped counter to generate the state numbers. If you haven’t seen an example of that particular linguistic feature, well, you can’t say that any more.

It also has two auxiliary functions: `accepting`, which sets `dre-state-accepting` depending on the regular expression label, and `nulling`, which sets `dre-state-null`, also depending on the regular expression.

```
(define dre-state
  (let ([state-count 0])
    (lambda (re)
      (define (accepting re)
        (cond
         [(dre-vector? re) (map accepting (dre-vector-list re))]
         [else (dre-empty? (nu re))] ))
      (define (nulling re)
        (cond
         [(dre-vector? re)
          (every (lambda (b) b) (map nulling (dre-vector-list re)))]
         [else
          (dre-null? re)] ))
      (set! state-count (+ state-count 1))
      (let* ([accept
              (accepting re)]
             [error-state
              (nulling re)])
        (dre-state-raw state-count re accept error-state)))))
```

The machine also needs to record transitions. These contain two states, an origin and a destination, and the character set labeling the transition.

```
(define-record-type dre-transition-t    ; <state, input, state'> transition
  (dre-transition state input state') dre-transition?
  (state dre-transition-origin)
  (input dre-transition-input)
  (state' dre-transition-destination))
```

Finally, there is the deterministic finite state machine itself, made up of a group of states, a group of transitions, and the identified start state.

```
(define-record-type dre-machine-t       ; Finite state machine
  (dre-machine states start transitions) dre-machine?
  (states dre-machine-states)
  (start dre-machine-start)
  (transitions dre-machine-transitions))
```

The function `dre-transitions-for` implements the process of picking out the transitions whose origin is a given state.

```
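(define (dre-transitions-for machine state)
  ;; Keep only the transitions whose origin is the given state.
  ;; State numbers are compared with =, which is well defined for numbers.
  (let ([state-num (dre-state-number state)])
    (filter (lambda (trans)
              (= state-num
                 (dre-state-number (dre-transition-origin trans))))
            (set-elts (dre-machine-transitions machine)))))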
```

Finally, we have all of the components necessary to convert a regular expression directly into a deterministic finite automaton.

The basic idea is to create a state, *s*_{i}, labeled with a regular expression. Then, compute the derivative classes for the regular expression. For each class, pick out a representative member using `dre-chars-choice` and use it to compute the derivative of the regular expression. If this new expression is equivalent to the expression labeling an existing state (using `dre-equal?`), say *s*_{j}, record a transition from *s*_{i} to *s*_{j} associated with the derivative class. If the new expression is not equivalent to any existing state’s label, create a new state for it, *s*_{k}, and record the transition from *s*_{i} to *s*_{k}. Follow this process recursively until all states have had all derivative classes examined.

The result is a deterministic finite state automaton.

**Another note on notation** This is the final place where “Regular-expression derivatives reexamined” gets a little wacky. The algorithm for this transformation, unlike everything else in the paper, is given in mutually recursive pseudocode. Fortunately, it is very close to the Scheme code below.

```
(define (dre->dfa r)
  (define (goto q S engine)
    (let* ([Q (car engine)]
           [d (cdr engine)]
           [c (dre-chars-choice S)]
           [qc (delta (dre-state-regex q) c)]
           [q' (set-find Q (lambda (q') (dre-equal? (dre-state-regex q') qc)))])
      (if q'
          (cons Q (set-union d (set (dre-transition q S q'))))
          (let ([q' (dre-state qc)])
            (explore (set-union Q (set q'))
                     (set-union d (set (dre-transition q S q')))
                     q')) )))
  (define (explore Q d q)
    (fold (lambda (S engine) (goto q S engine))
          (cons Q d)
          (remove dre-chars-empty?
                  (set-elts (C (dre-state-regex q))))))
  (let* ([q0 (dre-state r)]
         [engine (explore (set q0) (set) q0)]
         [states (car engine)]
         [transitions (cdr engine)])
    (dre-machine states q0 transitions)
    ))
```

There are a few minor nits left to pick.

`explore` computes the derivative classes for the labels of each visited state, then calls `goto` with each. It filters out empty character sets (using `dre-chars-empty?`) since (a) those don’t have any members to be representative, and (b) an empty set wouldn’t be a meaningful transition anyway.

`goto` picks a representative element from the derivative class / character set, uses it to compute the derivative, and makes the choice on whether a new state should be added. If you recall `dre-chars-choice`, it returns a character from the set if the set is positive and a newly generated Scheme symbol if it isn’t. The symbol is a valid argument to `dre-chars-member?`, which is used when computing the derivative, which is good since the symbol is used to represent *any character outside the (negative) set,* which would otherwise be potentially painful to identify.
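The whole construction fits in a fairly small Python sketch, given some liberties. The assumptions: a small explicit alphabet stands in for the positive and negative character sets, a worklist replaces the mutual recursion of `goto` and `explore`, only character sets, concatenation, alternation, and closure are covered, and every name below (`build_dfa`, `classes`, `meet`, and so on) is invented for the example. The smart constructors `concat` and `alt` do just enough simplification that equivalent derivatives compare equal, a crude stand-in for `dre-equal?`.

```python
from functools import reduce

NULL, EMPTY = ("null",), ("empty",)


def chars(s):
    return ("chars", frozenset(s))


def star(r):
    return ("star", r)


def concat(r, s):
    if NULL in (r, s):
        return NULL
    if r == EMPTY:
        return s
    return r if s == EMPTY else ("concat", r, s)


def alt(r, s):
    # Alternatives live in a flat set, so order and duplicates do not matter.
    parts = set()
    for x in (r, s):
        parts |= x[1] if x[0] == "or" else {x}
    parts.discard(NULL)
    if len(parts) <= 1:
        return next(iter(parts), NULL)
    return ("or", frozenset(parts))


def nullable(r):
    if r[0] == "concat":
        return nullable(r[1]) and nullable(r[2])
    if r[0] == "or":
        return any(nullable(x) for x in r[1])
    return r[0] in ("empty", "star")


def delta(r, ch):
    if r[0] == "chars":
        return EMPTY if ch in r[1] else NULL
    if r[0] == "concat":
        first = concat(delta(r[1], ch), r[2])
        return alt(first, delta(r[2], ch)) if nullable(r[1]) else first
    if r[0] == "or":
        return reduce(alt, (delta(x, ch) for x in r[1]), NULL)
    if r[0] == "star":
        return concat(delta(r[1], ch), r)
    return NULL


def meet(cs, ds):
    return {c & d for c in cs for d in ds}


def classes(r, sigma):
    """Approximate derivative classes, following the definition of C."""
    if r[0] == "chars":
        return {r[1], sigma - r[1]}
    if r[0] == "concat":
        left = classes(r[1], sigma)
        return meet(left, classes(r[2], sigma)) if nullable(r[1]) else left
    if r[0] == "or":
        return reduce(meet, (classes(x, sigma) for x in r[1]), {sigma})
    if r[0] == "star":
        return classes(r[1], sigma)
    return {sigma}


def build_dfa(r, sigma):
    sigma = frozenset(sigma)
    numbers = {r: 0}   # regular expression label -> state number
    transitions = {}   # (state number, character class) -> state number
    work = [r]
    while work:
        q = work.pop()
        for cls in classes(q, sigma):
            if not cls:
                continue                     # empty class: no representative
            target = delta(q, min(cls))      # any member of the class will do
            if target not in numbers:
                numbers[target] = len(numbers)
                work.append(target)
            transitions[(numbers[q], cls)] = numbers[target]
    accepting = {n for label, n in numbers.items() if nullable(label)}
    return numbers, transitions, accepting


regex = concat(chars("a"), concat(star(chars("b")), chars("c")))  # ab*c
numbers, transitions, accepting = build_dfa(regex, "abcd")
```

For *ab\*c* over the assumed alphabet `abcd`, this produces four states (the start state, the state after `a`, the accepting state, and the null state) and seven transitions. State numbering depends on set iteration order, so only the start state's number, 0, is fixed.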

Computing a DFA, any way you do it, is a fairly expensive operation. Further, deterministic regular expression engines are unable to use the fancy, more-powerful-than-regular-expressions features of non-deterministic engines such as PCRE. So, what can you do with one?

There is one simple answer: lexical analysis. The scanner part of a parser.

The limitations of a DFA are less cumbersome when you are picking the next token from an input stream. The DFA can be precomputed, meaning that the expense of creating the DFA does not hurt the parser execution. Further, according to the authors of “Regular-expression derivatives reexamined”, the DFA created using this method, while it may have more than the optimal number of states, has fewer states than those created from the traditional method. And finally, running a DFA across an input string is an *O*(*n*) operation, where *n* is the length of the input.

Here’s how it works: a lexical analyzer is specified as a list of regular expressions; one for identifiers, one for numbers, one for each keyword, etc. This is where the *regular vectors* enter the picture. Each of the individual expressions is an element of the vector, which is treated as a single, extended regular expression while creating the DFA. At the end of the process, you have a single finite automaton whose states are the combination of the individual regular expressions.

And this is where we wrap up `dre-state-accepting` and `dre-state-null`. A *null state*, *s*_{∅}, is one in which all of the elements of the vector are `dre-null`, ∅. In this state, it is impossible for the machine to match any input, no matter what follows. As a result, a state *s*_{i} with a transition into a null state marks the end of a token.

Which token? `dre-state-accepting`, called on *s*_{i}, returns a vector of booleans, one for each element of the regular vector. If, in state *s*_{i}, there is only one accepting element, then that element identifies the regular expression for a token *t*—*t* is the token that should be passed on.

If there is more than one accepting element in state *s*_{i}, then theoretically you might have an error in your scanner specification. However, treating that as an error places very harsh requirements on the token regular expressions; the expressions for keywords, such as “if” and “then” frequently overlap with the expressions for identifiers, say. It would be better to use some ordering, such as the order of token specifications, to resolve this ambiguity.

On the other hand, if there are *no* accepting elements in *s*_{i}, then you do have a problem. This would be a lexical error. Fortunately, I note that it would be trivial to identify states creating this error at the time the DFA is generated, by scanning the incoming transitions to *s*_{∅}.
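The three cases above (one accepting element, several, or none) boil down to a few lines. This Python sketch assumes the "first specification wins" ordering suggested above; `pick_token` and the token names are hypothetical, and the list of booleans stands in for what `dre-state-accepting` returns for a regular vector.

```python
def pick_token(names, accepting):
    """Return the first token whose regular expression accepts, else None.

    names:     token names, in specification order
    accepting: one boolean per token, as reported for the current state
    """
    for name, ok in zip(names, accepting):
        if ok:
            return name
    return None  # no accepting element: a lexical error


TOKENS = ["IF", "THEN", "IDENTIFIER", "NUMBER"]

# After reading "if", both the keyword and the identifier expressions accept.
after_if = pick_token(TOKENS, [True, False, True, False])

# After reading "ifx", only the identifier expression still accepts.
after_ifx = pick_token(TOKENS, [False, False, True, False])
```

Listing keywords ahead of identifiers means `after_if` is `"IF"` while `after_ifx` is `"IDENTIFIER"`. A result of `None` corresponds to the lexical error case.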

So, in one paper and in a fairly small amount of code (about 770 lines of my very verbose Scheme), I have demonstrated a regular expression parser, matcher, and the conversion to a deterministic finite automaton, with extensions to allow the straightforward construction of a lexical analyzer.

And some people think regular expression derivatives aren’t cool.

Back up at the top, I mentioned that I was interested in parser derivatives because regular expression derivatives were so nice. Unfortunately, things like On the Complexity and Performance of Parsing with Derivatives have brought me to believe that parser derivatives are less cool. Sure, the specification is even simpler than that of regular expressions here. On the other hand, though, there are a number of complications that require memoization, laziness, and computing fixed points.

None of those is particularly difficult, but the combination will be pretty hairy in a more complicated language than Scheme. Further, getting acceptable performance seems to require very tight coding to avoid very slow parsing.

As a result, although I have here the beginning of a very nice scanner using regular expression derivatives, I am currently exploring other context-free grammar parsing algorithms; specifically I am focusing on the Earley algorithm. Let’s see how that goes.

Ahem. Let’s see: Link o’ the day: Matt Might on parsing with derivatives, Parsing with derivatives: introduction, Parsing with derivatives: recursion, and Parsing with derivatives: compaction.↩

Software Tools in Haskell, Chapter 5: Text patterns; Monads and regular expressions; Variations on the theme of monadic regular expressions: Abstraction; Variations on the theme of monadic regular expressions: Records; Variations on the theme of monadic regular expressions: Back references; as well as a post on the specific paper that I am going to be looking at here: Practical regular expression derivatives.↩

Many regular expression engines in use have expressions that are extended with more powerful features, like back-references and the giant bag of things in Perl compatible regular expressions. Strictly speaking, the result is *not* a regular expression. Mathematically, they are much more powerful than the basic expressions that I have described here, and further (jumping ahead), they cannot be converted to deterministic finite state machines (to my knowledge). So I’m ignoring them.↩

The mathematical theory is fine with multiple starting nodes, but handling them simply uses nondeterminism, which I’m avoiding like the plague. Further, automata constructed using this method only have one start state anyway.↩

These are like footnotes, but more warm and comforting. Like feety pajamas.↩

I ended by wondering if SDE (Syntax Directed Editing) might be the solution. What I’ve discovered is that SDE is a wonderful way of guessing someone’s age: those under 40 or so have rarely heard of SDE; and those above 45 [2] tend to either convulse in fits of laughter when it’s mentioned, become very afraid, or assume I’m an idiot. In other words, those who know of SDE have a strong reaction to it (which, almost invariably, is against SDE).

[2] I’m not sure if my sample size of 40-45 year olds is insufficient, or whether they really are the “inbetween” generation when it comes to SDE. Fortunately, I will sleep soundly enough without knowing the answer to this particular question.

His statement is entirely true. I note that I am older than 45.

I haven’t hand written a parser in a long time, so that’s what I want to do now. But there is a bit of a catch: it reads from stdin in Pony. And the interface for that is:

```
interface StdinNotify
  """
  Notification for data arriving via stdin.
  """
  fun ref apply(data: Array[U8] iso) =>
    """
    Called when data is available on stdin.
    """
    None

  fun ref dispose() =>
    """
    Called when no more data will arrive on stdin.
    """
    None
```

**This is Main, Pony’s pony.**

Main wants to help you, not make you scream like a ruptured duck. Really.

*OMG! I can’t even! That’s asynchronous! Pony is an asynchronous language. It’s gonna be callback hell! It’s Red vs. Blue all over again. Er, wrong link. It’s Red vs. Blue all over again. Devil bunnies! Aaaaaaaaiiigghhh!*

Take a deep breath. Calm down. Remember way back in the ’90s, when object-oriented programming was new? Remember that letter to the editor of *Computer Language*, I think? The one that said an object was…

*…a function with multiple entry points and persistent state? And that we’d discovered it was a bad idea in the ’70s? That object-oriented programming was a horrible mess? Yeah, that was funny.*

And remember what Mittens always says, that back in the ’90s, we were all doing GUIs and it was all callbacks and we never really had any problems with it?

*Yeah, as long as you don’t do a long-running computation on the OS/2 Presentation Manager interface thread and lock up the whole GUI.*

Right. So calm down. It’ll be ok. Pony is an actor-based language, not imperative. All we have to do is figure out how to work with the language, rather than trying to swim upstream, working against it. And we’re not alone here; Simon Peyton Jones has been here before. In fact, here’s a copy of Wearing the hair shirt. Hang on to it, and if you start feeling light-headed, flip through it and look at the pretty clip art.

Let’s start with something simple. Here’s a basic grammar for this version of hoc:

$$
\begin{array}{rcl}
\textrm{Start} & \rightarrow & \textrm{Expr}\ \\
\textrm{Expr} & \rightarrow & \textrm{Term}\ +\ \textrm{Expr} \\
& | & \textrm{Term}\ -\ \textrm{Expr} \\
& | & \textrm{Term}\ \\
\textrm{Term} & \rightarrow & \textrm{Number}\ *\ \textrm{Term} \\
& | & \textrm{Number}\ /\ \textrm{Term} \\
& | & \textrm{Number}\ \\
\textrm{Number} & \rightarrow & [0-9]+
\end{array}
$$

For a recursive descent parser, which only looks at an input character once and has a relatively simple control path, I needed to factor out the common prefixes, putting the grammar in something like Greibach normal form except with more Greibach.

$$
\begin{array}{rcl}
\textrm{Start} & \rightarrow & \textrm{Expr}\ \\
\textrm{Expr} & \rightarrow & \textrm{Term}\ \textrm{Expr2} \\
\textrm{Expr2} & \rightarrow & +\ \textrm{Expr} \\
& | & -\ \textrm{Expr} \\
& | & \textit{empty}\ \\
\textrm{Term} & \rightarrow & \textrm{Number}\ \textrm{Term2} \\
\textrm{Term2} & \rightarrow & *\ \textrm{Term} \\
& | & /\ \textrm{Term} \\
& | & \textit{empty}\ \\
\textrm{Number} & \rightarrow & [0-9]+
\end{array}
$$

That’s pretty simple, right? It also has the neat property that, if a rule is going to fail then it is going to fail while looking at the left-most symbol. As a result, the parser doesn’t have to have any complicated back-tracking machinery to restore the parse state after an individual rule fails.

In fact, to avoid too much complexity, I’m going to make several simplifying assumptions. For one thing, I don’t care about errors. At all, really. Error handling is a pain in the rump at the best of times, and I haven’t written a recursive descent parser since before I sat down and figured out how yacc works. I’m not really enjoying the process this time, either. Another thing I’m not caring about is Unicode. It’s 7-bit ASCII all the way down, here. And whitespace. I don’t care about whitespace; everything can be all mushed together.

Anyway, back to parsing. To build a recursive descent parser normally, you would write functions for each of the rules in the grammar, with each function trying to recognize its own expansion. For example, Expr would try to call Term and, if that succeeded, call Expr2. Term would call Number to read a number and would then call Term2. Term2 is a little more complicated, since it has several options. First, it would try to read a ‘*’. If that failed, without losing track of any input, it would have to try to read a ‘/’. If *that* fails, it needs to succeed while having read *no* input.
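For comparison, this is roughly what that ordinary, synchronous recursive descent parser looks like when all of the input is already in hand. It is a Python sketch rather than Pony; the function names are mine, it evaluates as it goes instead of building a tree, and, like the grammar, it is right-recursive, so `8-2-1` groups as `8-(2-1)`.

```python
def parse(text):
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else None

    def number():                 # Number -> [0-9]+
        nonlocal pos
        start = pos
        while peek() is not None and peek() in "0123456789":
            pos += 1
        if start == pos:
            raise ValueError(f"expected a digit at position {pos}")
        return int(text[start:pos])

    def term():                   # Term -> Number Term2
        nonlocal pos
        left = number()
        if peek() == "*":         # Term2 -> * Term
            pos += 1
            return left * term()
        if peek() == "/":         # Term2 -> / Term
            pos += 1
            return left / term()
        return left               # Term2 -> empty: succeed, read nothing

    def expr():                   # Expr -> Term Expr2
        nonlocal pos
        left = term()
        if peek() == "+":         # Expr2 -> + Expr
            pos += 1
            return left + expr()
        if peek() == "-":         # Expr2 -> - Expr
            pos += 1
            return left - expr()
        return left               # Expr2 -> empty

    value = expr()
    if pos != len(text):
        raise ValueError(f"unexpected input at position {pos}")
    return value
```

Every function here pulls its input by calling `peek`, and the nesting of the parse lives on Python's call stack. Both of those are exactly what an interface that pushes data at you takes away.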

None of that works with the asynchronous interface provided by Pony for stdin. Instead of calling a function that waits for input, input from stdin is passed to Pony code through an StdinNotify interface implementation. Now, if you’re familiar with JavaScript, your first inclination might be to start slinging inline callbacks all over the place. But to my mind, that is the wrong thing to do in *JavaScript.*

But Pony, on the other hand, is an actor oriented language. I do not want to go into the formal definition of actors here, especially since it is not relevant. You can think of an actor as an object that has an associated execution thread: creating an actor also creates a thread that the actor uses to respond to incoming messages. Messages are the only way to communicate between threads (and, fortunately or unfortunately, behaviors (containing the definition of a message handler) syntactically resemble method definitions and message sends resemble method calls). Most of an actor’s lifetime is spent waiting for incoming messages, processing those messages, and sending messages of its own. Actors start when they are created and terminate when they have no outstanding messages to handle and no other actor has a reference to them so that no further messages can be sent to the actor.

One further note on the subject: Pony is an actor oriented language. Actors are the basic, fundamental components of a Pony program. “Main” is the name of the actor initially started by the program. If you attempt to write much serial code in Pony, bad things will happen. For one thing, Pony’s garbage collector operates on individual actors, and is only invoked when the actor is quiescent—not processing a message and not having any messages waiting. Without multiple actors interacting, the garbage collector will never run.

I can’t say I have spent a long time working with actors; Pony is the first actor-oriented programming language I have spent any time with, and, although it does have many concepts in common with the formal tools I *have* spent a few long weekends pondering, there are also significant differences. I believe the basic ideas cross over, though.

Think of actors as top-level concepts: an actor is created, waits to process messages, communicates with other actors, and eventually disappears. In Pony, using an inline callback handler to process input would be unnatural; passing the input to a separately defined, well-defined actor for processing is more natural. So, that’s how we’re going to do it.

One important way to look at actors, by the way, is as the elements of serialization. Following its creation or beginning handling a message, an actor cannot pause or be interrupted until it finishes and goes back to wait for the next message. The execution of a behavior is serial, and since the only way for actors to interact is through messages, it is possible to think of the execution of a network of actors as happening one behavior execution at a time, atomically.

The problem here is that the sequential structure of the parsed text is represented, in a recursive descent parser, by control flow; the “recursive descent” part exists on the program’s control stack. That won’t work with Pony and asynchronous interfaces, so I need to come up with a representation of structure that will work. There will be stacks involved.

How about this:

A stack of syntactic rule expansions. Whenever a non-terminal is expanded, the expansion is pushed on the top of the stack. Each of the elements of the stack is a sequence of symbols, representing the expansion of the non-terminal. The currently active token or non-terminal is the first element of the top of the stack.

When a symbol succeeds, it is shifted off the top element. When the top element is empty, the expansion it represents has succeeded.

A stack of values. A succeeding token pushes the token on the stack. A succeeding non-terminal pops the rule’s values off the stack, combines them appropriately, and pushes the result on the stack.
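Before the Pony version, here is a rough Python sketch of the two-stack idea, driven one character at a time the way a notifier would be. Everything in it is an assumption made for illustration, not the Pony design: the names (`PushParser`, `feed`, `expand`), digits handled by the grammar rules `Number -> digit Digits` and `Digits -> digit Digits | empty` instead of a separate token object, each expansion carrying the action that combines its values, and `None` standing for end of input.

```python
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}
TERMINALS = {"digit": "0123456789", "op": "+-*/"}


def starts(ch, allowed):
    return ch is not None and ch in allowed


def apply_tail(values):            # Expr -> Term Expr2, Term -> Number Term2
    tail, left = values.pop(), values.pop()
    values.append(left if tail is None else OPS[tail[0]](left, tail[1]))


def make_tail(values):             # Expr2 -> op Expr, Term2 -> op Term
    right, op = values.pop(), values.pop()
    values.append((op, right))


def join_digits(values, convert=str):
    rest, digit = values.pop(), values.pop()
    values.append(convert(digit + rest))


def expand(symbol, ch):
    """Pick an expansion, and the action to run when it succeeds."""
    if symbol in ("Expr", "Term"):
        first, tail = ("Term", "Expr2") if symbol == "Expr" else ("Number", "Term2")
        return [first, tail], apply_tail
    if symbol in ("Expr2", "Term2"):
        ops, operand = ("+-", "Expr") if symbol == "Expr2" else ("*/", "Term")
        if starts(ch, ops):
            return ["op", operand], make_tail
        return [], lambda values: values.append(None)
    if symbol == "Number":
        return ["digit", "Digits"], lambda values: join_digits(values, int)
    if starts(ch, TERMINALS["digit"]):                   # symbol == "Digits"
        return ["digit", "Digits"], join_digits
    return [], lambda values: values.append("")


class PushParser:
    def __init__(self):
        self.rules = [[["Expr"], lambda values: None]]   # stack of expansions
        self.values = []                                 # stack of values

    def feed(self, ch):
        """Handle one character; pass None to signal the end of input."""
        while True:
            if not self.rules:
                if ch is None:
                    return
                raise ValueError(f"unexpected {ch!r} after the expression")
            symbols, action = self.rules[-1]
            if not symbols:                  # top expansion has succeeded
                self.rules.pop()
                action(self.values)
                if self.rules:
                    self.rules[-1][0].pop(0)   # shift the finished non-terminal
            elif symbols[0] in TERMINALS:
                if not starts(ch, TERMINALS[symbols[0]]):
                    raise ValueError(f"unexpected {ch!r}")
                self.values.append(ch)       # a succeeding token pushes itself
                symbols.pop(0)
                return                       # this character is consumed
            else:
                self.rules.append(list(expand(symbols[0], ch)))


def parse(text):
    parser = PushParser()
    for ch in text:
        parser.feed(ch)
    parser.feed(None)
    return parser.values.pop()
```

`parse("1+2*3")` comes out as 7. The point is that `feed` returns after every consumed character with the whole parse state sitting in the two stacks, so nothing needs to block while it waits for more input.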

This framework can be translated into Pony code, first by describing the symbols of the grammar:

```
trait Nonterminal[T]
  """
  A non-terminal symbol in the grammar. Can be expanded, possibly multiple
  times, into an array of symbols representing the right-hand side of the rule.
  If the rule succeeds, manipulates the value stack to update the state of the
  parse.
  """
  new create(ruler: Parser[T])
  fun ref expand(): Array[Symbol[T] ref] ref ?
  fun ref matched(values: ValueStack[T] ref) => None

trait Token[T]
  """
  A terminal symbol in the grammar. `apply` and `dispose` are roughly the same
  as `StdinNotify`: they feed input to the token. If the token is recognized,
  `value` produces the value of the token.
  """
  fun ref apply(data: Array[U8] val)
  fun ref dispose()
  fun ref value(): (T^ | None) => None

type Symbol[T] is (Nonterminal[T] | Token[T])
```

Note that a good chunk of the `Token` interface is the same as `StdinNotify`: `apply` and `dispose`. Want to guess where input data is going to end up?

The next step is the right-hand side of a rule, the expansion of a non-terminal. For no readily apparent reason, the `expand` method of `Nonterminal` doesn’t produce an `RHS`, but the array it does produce is immediately used to construct an `RHS`. Que c’est, c’est.

```
class RHS[T]
  """
  A nonterminal symbol expands to an array of symbols.
  """
  let _a: Array[Symbol[T] ref] ref = _a.create()

  new create(n: Nonterminal[T]) => _a.push(n)
  new expand(n: Nonterminal[T]) ? => _a.concat(n.expand().values())
  fun ref head(): Symbol[T] ? => _a(0)
  fun ref shift(): Symbol[T]^ ? => _a.shift()
  fun empty(): Bool => _a.size() == 0
  fun debug() => Debug("----rhs: " + _a.size().string())
```

The next component is the `RuleStack`:

```
class RuleStack[T]
  let _s: Array[RHS[T]] ref = _s.create()

  fun ref top(): RHS[T] ? => _s(_s.size() - 1)
  fun ref push(rhs: RHS[T]) => _s.push(rhs)
  fun ref pop(): RHS[T] ? => _s.pop()

  fun ref current(): Current[T] =>
    try
      if is_empty() then RuleStackEmpty
      elseif is_current_empty() then CurrentEmpty
      else top().head()
      end
    else
      IllegalState
    end

  fun is_empty(): Bool => _s.size() == 0

  fun ref is_current_empty(): Bool =>
    (not is_empty()) and try top().empty() else false end

  fun ref is_current_token(): Bool => _is_current[Token[T]]()
  fun ref is_current_nonterminal(): Bool => _is_current[Nonterminal[T]]()

  fun ref _is_current[S: Symbol[T]](): Bool =>
    try
      (not is_empty()) and (not top().empty())
        and match top().head() | let t: S => true else false end
    else
      false
    end

  fun debug() =>
    Debug("---stack: " + _s.size().string())
    for i in _s.values() do i.debug() end
```

The most important method here is `current`, which attempts to return the current, head symbol of the parse state. The type `Current` is:

```
primitive CurrentEmpty
primitive RuleStackEmpty
primitive IllegalState
type Current[T] is (Symbol[T] | CurrentEmpty | RuleStackEmpty | IllegalState)
```

`CurrentEmpty` indicates that the top of the stack is an empty RHS. `RuleStackEmpty` indicates that the `RuleStack` is empty. `IllegalState` indicates an illegal state for the parser, which *shouldn’t happen* (TM), but is required by Pony’s type system.

Finally, this is the value stack, `ValueStack`:

```
class ValueStack[T]
  let _s: Array[T] = _s.create()

  fun ref push(value: T) =>
    Debug("++push")
    _s.push(consume value)

  fun ref is_empty(): Bool => _s.size() == 0

  fun ref pop(): (T^ | None) =>
    Debug("++pop")
    try _s.pop() else None end

  fun ref swap() =>
    if _s.size() > 1 then
      try
        let a = _s.pop()
        let b = _s.pop()
        _s.push(consume a)
        _s.push(consume b)
      end
    end
```

It has a few helpful methods used by rules to manipulate the stack of values.

The parser itself is just the combination of the `RuleStack` and the `ValueStack`, along with a couple of other helpful references to other actors: a `Writer` to write error messages, a `Chain` used to pass along input once this parser has terminated, and a `ParseResults` which is notified of the overall success or failure of the parser.

```
class Parser[T]
  let _writer: Writer tag
  let _chain: Chain tag
  let _result: ParseResults tag
  let _active: RuleStack[T] = _active.create()
  let _values: ValueStack[T] = _values.create()

  new start[Start: Nonterminal[T]](
    writer: Writer tag,
    result: ParseResults tag,
    chain: (Chain tag | None) = None)
  =>
    _writer = writer
    _result = result
    _chain = match chain | let c: Chain tag => c else NullChain(writer) end
    _active.push(RHS[T](Start.create(this)))
```

There is a snazzy type system trick in the `start` method there. It takes a type parameter of `Start`, a `Nonterminal`, which is expanded to start the parsing process. An alternative would be to accept an instance of a `Nonterminal` or something, but that wouldn’t be typeriffic, would it?

The `Parser` class has four major methods: `input` and `dispose`, which mirror the methods of the `StdinNotify` trait and accept input into the parser; `succeed`, which is called when the currently active token succeeds in reading whatever input it was reading; and `fail`, which is called when the currently active token *fails* in reading whatever input it was expecting.

`input` is:

```
fun ref input(data: Array[U8] iso) =>
  Debug("-input")
  match _active.current()
  | let t: Token[T] => t(consume data)
  | let n: Nonterminal[T] =>
    // The head of the active stack is an unexpanded nonterminal. Attempt to
    // expand it and apply the data to the resulting token; if expansion
    // fails, the rule stack is empty and we're terminating the parse.
    _expand_and_handle(consume data)
  | CurrentEmpty =>
    // The head of the active stack represents a successful parse of the next
    // nonterminal up. Remove it, do_expand, and apply the input to the next
    // token, if possible.
    try
      _active.pop()
      (_active.top().shift() as Nonterminal[T]).matched(_values)
      _expand_and_handle(consume data)
    end
  | RuleStackEmpty => _chain.input(consume data)
  end
```

The basic idea here is:

If a `Token` is current, pass the data to it.

If a `Nonterminal` is current, try to expand it. If that succeeds, recursively call `input` and the first branch will take it. If expansion fails, the parse is done, one way or another. Chain the input to the next thingy. This is the purpose of `_expand_and_handle`: `fun ref _expand_and_handle(data: Array[U8] iso) => if _do_expand() then input(consume data) else _chain.input(consume data) end`

If the current rule is empty, it has done its job: parsed some input. Pop it off, shift off the nonterminal at the head of the next rule and tell it to do its thing with the values, then try to expand the next symbol.

If the `RuleStack` is empty, the parse is unconditionally done. Chain the input on.

`dispose` is very similar to `input`, except that it uses the `dispose` method to indicate the end of the input.

```
fun ref dispose() =>
  Debug("-dispose")
  match _active.current()
  | let t: Token[T] => t.dispose()
  | let n: Nonterminal[T] =>
    // The head of the active stack is an unexpanded nonterminal. Attempt to
    // expand it and notify the resulting token of the end of input; if
    // expansion fails, the rule stack is empty and we're terminating the parse.
    _expand_and_dispose()
  | CurrentEmpty =>
    // The head of the active stack represents a successful parse of the next
    // nonterminal up. Remove it, do_expand, and notify the next token of the
    // end of input, if possible.
    try
      _active.pop()
      (_active.top().shift() as Nonterminal[T]).matched(_values)
      _expand_and_dispose()
    end
  | RuleStackEmpty => _chain.dispose()
  end
```

It, too, breaks down into four cases with the same logic as `input`.

When a token succeeds, it calls `succeed`, passing in the remaining, unread part of the input.

```
fun ref succeed(remainder: Array[U8] iso) =>
  """
  The current token has succeeded. Shift it off and try expanding the
  next symbol in this rule. If a token is found, pass it the remaining
  input. Otherwise, the rule stack is empty and the parse is terminating.
  """
  Debug("-succeed")
  if _active.is_current_token() then
    try
      match (_active.top().shift() as Token[T]).value()
      | let t: T => _values.push(consume t)
      end
    end
    _expand_and_handle(consume remainder)
  else
    _writer.err("Illegal state: succeeded with non-token?")
  end
```

`succeed` removes the current token from the `RuleStack`, pushes its value (if it produces one) on the `ValueStack`, then expands the next symbol and continues with the rest of the input.

`fail` is similar to `succeed` but with the opposite effect on the `RuleStack`:

```
fun ref fail(remainder: Array[U8] iso) =>
  """
  The current token has failed. Remove the current active rule and
  try expanding the previous rule again. If a token is found, reapply
  the input. Otherwise, the rule stack is empty and the parse is
  terminating.
  """
  Debug("-fail")
  if _active.is_current_token() then
    try
      _active.pop()
    end
    _expand_and_handle(consume remainder)
  else
    _writer.err("Illegal state: failure with non-token?")
  end
```

In this case, the active rule has failed, so it is removed and an attempt is made to re-expand the previous non-terminal (for cases like Expr2 which have multiple right-hand sides).

It is important to note that `remainder` in this method includes the bytes that were passed to the failing token; it is all of the input from the end of the last token that was recognized.

All of these methods have called a helper, `_do_expand`, which attempts to arrange for the current symbol at the head of the `RuleStack` to be a `Token`.

```
fun ref _do_expand(): Bool =>
  Debug("-do_expand")
  match _active.current()
  | let t: Token[T] => true
  | let n: Nonterminal[T] =>
    try
      // Attempt to expand the nonterminal. If successful, recurse to find
      // the current token.
      _active.push(RHS[T].expand(n))
      _do_expand()
    else
      // Expanding the nonterminal failed. This rule has failed; remove it
      // and recurse on the previous nonterminal.
      try
        _active.pop()
        _do_expand()
      else
        // NOTREACHED
        false
      end
    end
  | CurrentEmpty =>
    try
      // The current rule has succeeded, probably trivially by expanding.
      // Remove it, shift the successful nonterminal, and recurse.
      _active.pop()
      if not _active.is_empty() then
        (_active.top().shift() as Nonterminal[T]).matched(_values)
      end
      _do_expand()
    else
      // NOTREACHED
      false
    end
  | RuleStackEmpty =>
    // Success (for some value of "success"). Terminate parse.
    match _values.pop()
    | None => _result.failed()
    | let res: Stringable => _result.success(res.string())
    end
    false
  else
    _writer.err("Illegal state: rule stack broken on do_expand")
    false
  end
```

If the current symbol is already a token, it returns true.

If the current symbol is a non-terminal, `_do_expand` attempts to expand it, pushing the result on the `RuleStack` and recurring. If it fails, this rule has failed and it is removed from the `RuleStack` and, again, it recurs.

If the top of the stack is empty, it is removed and the `Nonterminal` at the head of the next `RHS` has succeeded. (There has to be a non-terminal there since nothing else could have expanded to the currently-empty top of the `RuleStack`.)

If the `RuleStack` is empty, then we have reached the point where we are terminating the parse. For simplicity, I am assuming that if the parse left a value on the `ValueStack`, the parse succeeded; the value is passed to the `ParseResults`. Otherwise, a failure notice is passed.

The implementation of the grammar starts simply:

`type Value is (ISize | U8)`

A value on the `ValueStack` is either an `ISize` (a machine-word-sized signed integer), or a byte. The integer will be either a parsed number or an intermediate result; the byte will be an operator such as ‘*’ or ‘+’.

The grammar is rooted with a starting non-terminal:

```
class StartingNonterminal is Nonterminal[Value]
  fun ref expand(): Array[Symbol[Value] ref] ref ? =>
    Debug("starting " + _state.string())
    match _state
    | 0 => _state = 1; Array[Symbol[Value] ref].create().push(Expr(_parser))
    else error
    end
```

I have elided all but the important part, the method to `expand` the non-terminal. In this case, there is one right-hand side: expanding to an Expr.

The default implementation of `matched`, called when the non-terminal has succeeded, leaves the `ValueStack` alone, which is what we want in this case.

An `Expr` has a little more meat.

```
class Expr is Nonterminal[Value]
  fun ref expand(): Array[Symbol[Value] ref] ref ? =>
    Debug("expr " + _state.string())
    match _state
    | 0 => _state = 1; Array[Symbol[Value]].create().push(Term(_parser)).push(Expr2(_parser))
    else error
    end

  fun ref matched(values: ValueStack[Value] ref) =>
    try
      match values.pop()
      | let v: ISize => values.push(v)
      | let op: U8 if op == '+' =>
        let snd = values.pop() as ISize
        let fst = values.pop() as ISize
        values.push(fst + snd)
      | let op: U8 if op == '-' =>
        let snd = values.pop() as ISize
        let fst = values.pop() as ISize
        values.push(fst - snd)
      else
        values.push(ISize(0))
      end
    else
      values.push(ISize(0))
    end
```

`expand` similarly only expands into a single right-hand side:

$$
\begin{array}{rcl}
\textrm{Expr} & \rightarrow & \textrm{Term}\ \textrm{Expr2}
\end{array}
$$

However, the result of succeeding is more complicated. `matched` implements the result of matching an addition or subtraction: it looks at the operator byte, grabs the next two operands from the stack, and performs the operation. (See Expr2 below to find out how the `ValueStack` got into a state to allow this.)

Expr2 is the grammar rule that recognizes addition or subtraction, or allows a single term to bubble up. (I’ll leave the rant about operator precedence in recursive descent parsers alone for now.)

```
class Expr2 is Nonterminal[Value]
  fun ref expand(): Array[Symbol[Value] ref] ref ? =>
    Debug("expr2 " + _state.string())
    match _state
    | 0 => _state = 1; Array[Symbol[Value]].create().push(CharToken(_parser, '+')).push(Expr(_parser))
    | 1 => _state = 2; Array[Symbol[Value]].create().push(CharToken(_parser, '-')).push(Expr(_parser))
    | 2 => _state = 3; Array[Symbol[Value]].create()
    else error
    end

  fun ref matched(values: ValueStack[Value] ref) =>
    match _state
    | 1 => values.swap() // Put operator on top of stack
    | 2 => values.swap() // Put operator on top of stack
    | 3 => None
    else
      None
    end
```

Remember that the Expr2 rule has three possibilities:

$$
\begin{array}{rcl}
\textrm{Expr2} & \rightarrow & +\ \textrm{Expr} \\
& | & -\ \textrm{Expr} \\
& | & \textit{empty}
\end{array}
$$

An addition, a subtraction, and an empty rule. This is what `expand` implements: the first time it is called, this non-terminal expands as an addition; if that fails, it becomes a subtraction; if *that* fails, then it becomes an empty rule, which always succeeds.

`matched`, on the other hand, has a bit of a peculiar responsibility. At the point when this rule succeeds, the `ValueStack` will have (at least) three values: a number from the Term part of Expr, a byte representing the operator, and another number from the Expr. All in that order.

To make the processing up in `Expr.matched` easier, this rule swaps the top two elements, leaving the operator on top of the stack.

Oh, and how does it know the difference between the first and second branches and the third, which *doesn’t* have an operator? If `_state` is 1, the first branch succeeded; likewise, 2 and 3.
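The hand-off between the two `matched` methods can be pictured with plain Python lists. This is only a sketch of the stack shuffling: the function names and the `took_operator_branch` flag are hypothetical stand-ins for `Expr2.matched`, `Expr.matched`, and the `_state` check, and the fallback that pushes 0 in the Pony code is left out.

```python
from typing import List, Union

Value = Union[int, str]


def expr2_matched(values: List[Value], took_operator_branch: bool) -> None:
    """Mimic Expr2.matched: swap the top two values if an operator was parsed."""
    if took_operator_branch and len(values) > 1:
        values[-1], values[-2] = values[-2], values[-1]


def expr_matched(values: List[Value]) -> None:
    """Mimic Expr.matched: apply an operator sitting on top of the stack."""
    top = values.pop()
    if isinstance(top, int):
        values.append(top)          # a lone number bubbles up unchanged
    elif top == "+":
        snd, fst = values.pop(), values.pop()
        values.append(fst + snd)
    elif top == "-":
        snd, fst = values.pop(), values.pop()
        values.append(fst - snd)


stack: List[Value] = [7, "-", 2]    # Term value, operator, inner Expr value
expr2_matched(stack, took_operator_branch=True)
after_swap = list(stack)            # operator is now on top
expr_matched(stack)                 # operands are popped in reverse order
```

Popping the second operand first is what keeps `7 - 2` from turning into `2 - 7`.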

Term and Term2 are very similar to Expr and Expr2, albeit recognizing multiplication and division.

There are two `Token` classes: single-character operators and numbers.

```
class CharToken is Token[Value]
  let _parser: Parser[Value]
  let _ch: U8

  new create(parser: Parser[Value], ch: U8) => _parser = parser; _ch = ch

  fun ref apply(data: Array[U8] val) =>
    try
      if (data.size() < 1) or (data(0) != _ch) then
        _parser.fail(recover Array[U8].append(data) end)
      else
        _parser.succeed(recover Array[U8].append(data,1) end)
      end
    end

  fun ref dispose() => _parser.fail(recover Array[U8] end)
  fun value(): U8 => _ch
```

The `CharToken` is the simpler of the two; the character to look for is passed to the constructor. If that character is not seen at the head of the input, `CharToken` fails, returning all the input to the `Parser`. Otherwise, it succeeds, allowing the character to be placed on the `ValueStack` and returning the rest of the input.

`NumberToken` has a few more moving parts. When `apply` is called with fresh input, it scans the input, recording those characters that are numerical digits.

```
class NumberToken is Token[Value]
  let _token: Array[U8] = _token.create()

  fun ref apply(data: Array[U8] val) =>
    try
      for i in Range(0, data.size()) do
        if (data(i) >= 0x30) and (data(i) <= 0x39) then
          _token.push(data(i))
        elseif _token.size() > 0 then
          _parser.succeed(recover Array[U8].append(data, i) end)
          return
        else
          _parser.fail(recover Array[U8].append(data) end)
          return
        end
      end
    end

  fun ref dispose() =>
    if _token.size() > 0 then
      _parser.succeed(recover Array[U8] end)
    else
      _parser.fail(recover Array[U8] end)
    end

  fun ref value(): ISize =>
    let a = recover iso Array[U8]() end
    for v in _token.values() do
      a.push(v)
    end
    try
      String.from_array(consume a).isize()
    else
      0
    end
```

However, it cannot report success until it finds a character which is *not* a digit, or finds the end of the file. Otherwise, it could leave parts of a number on the input stream, resulting in a parse error. Managing this one group of characters is the one example of backtracking in this parser.

Anyway, once the `NumberToken` has found the first non-numeric character, it can declare success if it has previously seen digits, or failure if it has not.
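The waiting behavior is easier to see when the input arrives in pieces. Here is a rough Python sketch of the same idea; `NumberScanner` and its methods are made-up names, and instead of calling back into a parser it returns `None` for "need more input" or a `(succeeded, remainder)` pair.

```python
from typing import Optional, Tuple


class NumberScanner:
    """Accumulates ASCII digits across chunks until a non-digit or end of input."""

    def __init__(self) -> None:
        self._digits = bytearray()

    def feed(self, data: bytes) -> Optional[Tuple[bool, bytes]]:
        """Return None if more input is needed, else (succeeded, remainder)."""
        for i, byte in enumerate(data):
            if 0x30 <= byte <= 0x39:
                self._digits.append(byte)
            elif self._digits:
                return (True, data[i:])     # number ended; hand back the rest
            else:
                return (False, data)        # no digits seen at all
        return None                         # chunk exhausted; still undecided

    def end_of_input(self) -> bool:
        """At end of input, succeed only if some digits were collected."""
        return len(self._digits) > 0

    def value(self) -> int:
        return int(self._digits.decode("ascii")) if self._digits else 0


scanner = NumberScanner()
first = scanner.feed(b"12")     # undecided: the number might continue
second = scanner.feed(b"3+4")   # the '+' ends the number
number = scanner.value()
```

The first chunk cannot be reported as a success on its own, since the digits in the second chunk turn out to belong to the same number.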

I implied at the top of this article that there would be actors involved, but there hasn’t been nary a sight of one yet. It’s all been plain, sequential Pony code. Let’s remedy that.

```
actor ParserActor[T]
  let _parser: Parser[T]

  new start[Start: Nonterminal[T]](
    writer: Writer tag,
    result: ParseResults tag,
    chain: (Chain tag | None) = None)
  =>
    _parser = Parser[T].start[Start](writer, result, chain)

  be apply(data: Array[U8] iso) => _parser.input(consume data)
  be dispose() => _parser.dispose()
```

This is the actor which owns the parser state. It’s created with the same arguments as the `Parser`, and creates an instance of the `Parser` using those arguments. I mentioned earlier that an actor was the unit of sequential execution in Pony; this actor controls the execution of the `Parser`, ensuring, for example, that a given block of input is fully consumed before the message containing the next input is accepted.

Heck, yeah, it works. Think I’d post this article if it didn’t work?

```
$ echo -n 2+3*4 | ./hoc 2>&1
Parse succeeded: 14
Chain: 0 bytes
```

How does it work? Let’s take a look at the state of the rule stack during the execution. The first input causes the `StartingNonterminal` to be expanded, producing:

```
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ NumberToken Term2 ]
```

*Note: In computer science, stacks and trees grow down. Don’t ask me why, I just work here.*

At this point, a token, `NumberToken`, is the current symbol. It reads the ‘2’, producing the following rule stack:

```
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
```

and the value stack:

`2`

The rule stack gets expanded to:

```
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ * Term ]
```

The next character, however, is a ‘+’. So that fails, and `Term2` is re-expanded.

```
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ / Term ]
```

That still fails.

```
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ ]
```

That succeeds, though. In this particular state, where it accepts *empty*, `Term2` does not do anything to the value stack.

```
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ ]
```

Likewise, `Term`, when succeeding, leaves the value stack alone when the top element is a number.

```
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
```

`Expr2` is expanded (remember, we’re still looking at the ‘+’ in the input “2+3*4”) to:

```
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ + Expr ]
```

Yay! This accepts our ‘+’! The `CharToken(+)` is shifted from the current rule, and the value stack becomes:

```
2
+
```

The `Expr` is expanded, producing the rule stack:

```
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ NumberToken Term2 ]
```

The `NumberToken` accepts the ‘3’, is shifted from the rule stack, and the `Term2` is expanded, producing a value stack of:

```
2
+
3
```

and a rule stack of:

```
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ * Term ]
```

The ‘*’ is accepted and the value and rule stacks are transformed into:

```
2
+
3
*
```

and

```
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ Term ]
[ NumberToken Term2 ]
```

The final ‘4’ is read leaving us with a value stack of:

```
2
+
3
*
4
```

and a rule stack of:

```
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ Term ]
[ Term2 ]
```

Since the input is completed at this point, neither of the first two options for `Term2` can succeed and the empty branch is taken. `Term2` succeeds without modifying the value stack, as does `Term`. The rule stack is now:

```
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
```

And the value stack is:

```
2
+
3
*
4
```

This `Term2` is the one that accepted the ‘*’, however; it is in a state to modify the value stack on succeeding:

```
2
+
3
4
*
```

Then, when the `Term` succeeds, it performs the multiplication, leaving the value stack as:

```
2
+
12
```

The rule stack now is:

```
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Expr2 ]
```

Neither of the first two branches of `Expr2` can succeed, so it accepts with the empty branch, followed by `Expr` succeeding. Neither of these modifies the value stack.

```
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
```

This `Expr2` accepted the ‘+’ in “2+3*4”, so it rearranges the value stack when it succeeds.

```
2
12
+
```

So that when `Expr` succeeds, it performs the addition.

`14`

The original `StartingNonterminal` succeeds as well, resulting in a terminal state: an empty rule stack. The value of the computation is 14, and no additional input needed to be passed on to the chained receiver.
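For anyone who wants to poke at the trace without a Pony compiler, the whole rule-stack and value-stack dance fits in a page of Python. This is a loose sketch under my own assumptions, not a port: the grammar is a table instead of classes, tokens return `None` for "need more input" rather than calling back into the parser, an `eof` flag stands in for `dispose`, and every name in it (`OnlineParser`, `GRAMMAR`, `feed`, `leftover`) is invented for the example.

```python
import operator

GRAMMAR = {  # alternatives per non-terminal, tried in order; [] is the empty rule
    "Start": [["Expr"]],
    "Expr": [["Term", "Expr2"]],
    "Expr2": [["+", "Expr"], ["-", "Expr"], []],
    "Term": [["num", "Term2"]],
    "Term2": [["*", "Term"], ["/", "Term"], []],
}
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.floordiv}

class Char:
    def __init__(self, ch):
        self.ch = ch

    def feed(self, data, eof):
        if not data:
            return (False, data) if eof else None   # None: need more input
        ok = data[:1] == self.ch.encode()
        return (ok, data[1:] if ok else data)

    def value(self):
        return self.ch

class Num:
    def __init__(self):
        self.digits = ""

    def feed(self, data, eof):
        for i, byte in enumerate(data):
            if 0x30 <= byte <= 0x39:
                self.digits += chr(byte)
            else:  # first non-digit: succeed only if digits were seen
                return (bool(self.digits), data[i:])
        return (bool(self.digits), b"") if eof else None

    def value(self):
        return int(self.digits)

class Nonterminal:
    def __init__(self, name):
        self.name, self.state = name, 0

    def expand(self):
        if self.state >= len(GRAMMAR[self.name]):
            return None  # every alternative has been tried
        self.state += 1
        return [make(s) for s in GRAMMAR[self.name][self.state - 1]]

    def matched(self, values):
        if self.name.endswith("2"):
            if GRAMMAR[self.name][self.state - 1] and len(values) > 1:
                values[-1], values[-2] = values[-2], values[-1]  # operator on top
        elif len(values) >= 3 and isinstance(values[-1], str):
            op, snd, fst = values.pop(), values.pop(), values.pop()
            values.append(OPS[op](fst, snd))

def make(symbol):
    if symbol in GRAMMAR:
        return Nonterminal(symbol)
    return Num() if symbol == "num" else Char(symbol)

class OnlineParser:
    def __init__(self):
        self.rules = [[Nonterminal("Start")]]
        self.values = []
        self.leftover = b""  # input handed on once the parse has ended

    def feed(self, data=b"", eof=False):
        while self.rules:
            top = self.rules[-1]
            if not top:  # the top expansion is used up, so it succeeded
                self.rules.pop()
                if self.rules:
                    self.rules[-1].pop(0).matched(self.values)
            elif isinstance(top[0], Nonterminal):
                rhs = top[0].expand()
                if rhs is None:
                    self.rules.pop()  # the enclosing alternative fails
                else:
                    self.rules.append(rhs)
            else:
                outcome = top[0].feed(data, eof)
                if outcome is None:
                    return  # suspend until the next chunk arrives
                ok, data = outcome
                if ok:
                    self.values.append(top.pop(0).value())
                else:
                    self.rules.pop()  # let the non-terminal try again
        self.leftover += data

parser = OnlineParser()
for chunk in (b"2+", b"3*", b"4"):
    parser.feed(chunk)
parser.feed(eof=True)
result = parser.values[-1]  # 14 for the input 2+3*4
```

The chunk boundaries are arbitrary; feeding the same bytes in one piece walks through the same sequence of stack states as the trace above.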

So, what have we achieved here?

Certainly not a parsing system that I’d want to use again. Bleah. I don’t even want to look at it now.

Fortunately, there are some positive outcomes.

This is a proof-of-concept online parser in Pony that doesn’t rely on blocking threads, blocking coroutines, or blocking anything. That’s good, right? Online algorithms are darned useful, especially for network-related tasks like parsing HTTP (ok, that’s easy anyway), and in Pony (which, I remind you, doesn’t have any blocking operations) they’re a necessity. That’s something you can’t say for traditionally developed recursive descent parsers, or the results of most parser generators like yacc.

There are other online parsing techniques, however. Derivative-based parsing is one appealing approach, but GLR parsing and possibly CYK parsing and Earley parsing are other options.

Pony’s actors work with asynchronous interfaces to write code that, nasty parsing algorithms aside, is relatively simple and clear. It takes some work to get used to the coding style, to transition from an imperative (or traditionally functional) style. Whether many programmers are willing to make the transition is another question, as is whether it is ultimately worthwhile. But as for me, I like it.

The real challenge in computer architecture today is not memory *capacity*, but memory *speed*. Your brand new shiny red Pentium chip [1994, remember?] isn’t going to win you anything if your software is actually constrained by disk and memory latency (access time). To be precise, there is a wide and increasing gap between memory and CPU performance. Over the past decade CPU’s have doubled in speed every one-and-a-half to two years. Memory gets twice as dense (64-Kb chips increase to 128Kb [Ahem.]) in the same period, but its access time only improves by 10%. Main memory access time will be even more important on huge address space machines. When you have access to huge amounts of data, the latency for moving it around will start to dominate software performance. Expect to see a lot more use of cache and related technologies in the future.

I claim independent discovery. Thpth.

I’m different!

– Crow T. Robot

Well, lookie here! I’ve done it. I’ve converted this blog from Blogger to a self-hosted Hakyll setup. And it only took several hundred automated Lego ferrets a couple of days.

The process wasn’t actually too difficult. Blogger supplies a .xml backup, which can be parsed by Jekyll’s jekyll-import command to produce .html files which can be simply plopped into a Hakyll posts directory. Almost. jekyll-import leaves the `tags` component of the posts’ metadata blocks as a YAML-style list, one tag per line with a hyphen prefix; Hakyll wants them to be comma-separated on one line. Further, setting `replace-internal-link` to true is needed to transform internal links in the posts into a URL that can be man-handled into something relative.

Here’s an **ed** script to do the heavy lifting.

```
/tags:/+1,/^m/-1 s/$/,/
/tags:/+1,/^m/-1 s/-//
/tags:/,/^m/-1 j
s/,$//
1,$ s/{{ site.baseurl }}//g
1,$ s/{% post_url \([^ ]*\) %}/\/posts\/\1.html/g
w
q
```

The first four lines handle the tag situation, first adding a comma to the end of every tag line, then stripping off the hyphen, and then joining the lines into one, and finally removing the terminal, extraneous comma. The next two lines handle URLs, first by removing the `site.baseurl` variable and then transforming the `post_url` thing into a default Hakyll `/posts/...` local URL.
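If **ed** is not your idea of a good time, the same two transformations can be sketched in Python. This is an approximation rather than a translation: it spots tag lines by their leading hyphen instead of looking for the next line that starts with `m`, and the function names and the sample front matter are invented for the example.

```python
import re
from typing import List

TAG_ITEM = re.compile(r"^\s*-\s+(.*\S)\s*$")


def flatten_tags(lines: List[str]) -> List[str]:
    """Turn a `tags:` key followed by `- tag` lines into one comma-separated line."""
    out: List[str] = []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        if line.strip() != "tags:":
            out.append(line)
            continue
        tags = []
        # Collect the hyphen-prefixed lines that follow the `tags:` key.
        while i < len(lines):
            item = TAG_ITEM.match(lines[i])
            if item is None:
                break
            tags.append(item.group(1))
            i += 1
        out.append("tags: " + ", ".join(tags) if tags else line)
    return out


def rewrite_links(text: str) -> str:
    """Drop `{{ site.baseurl }}` and map `{% post_url X %}` to `/posts/X.html`."""
    text = text.replace("{{ site.baseurl }}", "")
    return re.sub(r"\{% post_url ([^ ]*) %\}", r"/posts/\1.html", text)


# Hypothetical front matter in the shape jekyll-import leaves behind.
front_matter = ["---", "title: Example", "tags:", "- pony", "- parsing",
                "modified_time: 2016", "---"]
flattened = flatten_tags(front_matter)
link = rewrite_links('<a href="{{ site.baseurl }}{% post_url 2016-01-02-example %}">')
```

The `---` delimiters are left alone because the tag pattern insists on whitespace after the hyphen.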

**Note:** I’ve since altered how Hakyll generates URLs for posts, so the internal links are all broken. I just realized I need to train a ferret to go through and fix them up again. Geeze. Is my work never done? Do you all realize how much training snackies cost these days?

Copying comments out of the backup required a combination of hacking on jekyll-import and manual work; existing comments should be preserved but won’t be pretty. Sorry.

Anyway, for anyone interested, here’s my current `site.hs`:

```
main :: IO ()
main = hakyllWith hakyllConfig $ do
    match "images/*" $ do
        route idRoute
        compile copyFileCompiler

    match "css/*" $ do
        route idRoute
        compile compressCssCompiler
```

These two rules copy the images and css directories into their appropriate locations.

```
    tags <- buildTags "posts/*" (fromCapture "tags/*.html")

    tagsRules tags $ \tag pattern -> do
        let title = "Posts tagged \"" ++ tag ++ "\""
        route idRoute
        compile $ do
            posts <- chronological =<< loadAll pattern
            let ctx = constField "title" title `mappend`
                      listField "posts" postCtx (return posts) `mappend`
                      defaultContext
            makeItem ""
                >>= loadAndApplyTemplate "templates/tag.html" ctx
                >>= loadAndApplyTemplate "templates/default.html" ctx
                >>= relativizeUrls
```

This bit is pretty much cobbled together from examples in the documentation, to generate tags pages (click on the haskell link above to see one).

```
    match "pages/*" $ do
        route $ gsubRoute "pages/" (const "p/") `composeRoutes` setExtension "html"
        let tagsCtx = postCtxWithTags $ sorted tags
        compile $ pandocCompiler
            >>= loadAndApplyTemplate "templates/post.html" tagsCtx
            >>= loadAndApplyTemplate "templates/default.html" tagsCtx
            >>= relativizeUrls

    match "posts/*" $ do
        route $ customRoute oldStylePath `composeRoutes` setExtension "html"
        let tagsCtx = postCtxWithTags $ sorted tags
        compile $ pandocCompiler
            >>= saveSnapshot "content"
            >>= loadAndApplyTemplate "templates/post.html" tagsCtx
            >>= loadAndApplyTemplate "templates/default.html" tagsCtx
            >>= relativizeUrls
```

Next are rules for the `pages` and `posts` directories; pages are the Web Authentication and Parsing with Derivatives pages in the header. What you are currently reading is a post.

```
    create ["atom.xml"] $ do
        route idRoute
        compile $ do
            let feedCtx = postCtx `mappend` bodyField "description"
            posts <- fmap (take 10) . recentFirst =<< loadAllSnapshots "posts/*" "content"
            renderAtom feedConfiguration feedCtx posts

    create ["rss.xml"] $ do
        route idRoute
        compile $ do
            let feedCtx = postCtx `mappend` bodyField "description"
            posts <- fmap (take 10) . recentFirst =<< loadAllSnapshots "posts/*" "content"
            renderRss feedConfiguration feedCtx posts
```

And these two rules create the RSS and Atom feeds, as plain XML files.

```
    create ["archive.html"] $ do
        route idRoute
        compile $ do
            posts <- recentFirst =<< loadAll "posts/*"
            let archiveCtx =
                    listField "posts" postCtx (return posts) `mappend`
                    constField "title" "Archives" `mappend`
                    defaultContext
            makeItem ""
                >>= loadAndApplyTemplate "templates/archive.html" archiveCtx
                >>= loadAndApplyTemplate "templates/default.html" archiveCtx
                >>= relativizeUrls

    match "index.html" $ do
        route idRoute
        compile $ do
            posts <- fmap (take 10) . recentFirst =<< loadAll "posts/*"
            let indexCtx =
                    listField "posts" postCtx (return posts) `mappend`
                    constField "title" "Home" `mappend`
                    tagCloudField "tagCloud" 50 150 tags `mappend`
                    defaultContext
            getResourceBody
                >>= applyAsTemplate indexCtx
                >>= loadAndApplyTemplate "templates/default.html" indexCtx
                >>= relativizeUrls

    match "templates/*" $ compile templateBodyCompiler
```

Then it creates the archive and index pages.

The only part that isn’t taken more or less directly from some tutorial or example code is the custom route for posts, which is used to convert what Hakyll would come up with normally into what Blogger used.

```
-- Blogger paths for posts were /year/month/titlish; this function converts a
-- filename starting with year-month-date-titlish and turns it into a filepath
-- matching the Blogger format.
oldStylePath :: Identifier -> FilePath
oldStylePath ident = year </> month </> titlish
  where basename = takeBaseName $ toFilePath ident
        parts = splitAll "-" basename
        [year,month] = take 2 parts
        titlish = intercalate "-" $ drop 3 parts

The formatting is done with Bootstrap, the fonts Merriweather and Merriweather Sans are from Google Fonts, and I added a Disqus comment thingy. I have also added MathJax for the occasional math formatting (so don’t complain if you momentarily see some TeX which is replaced by pretty mathy stuff).

And then a miracle occurs, and Maniagnosis exists. Can I get an Amen?

Now, as to why I did it…I’m afraid I don’t understand the question.
