Applied Formal Logic: The bug in Quick Search

Posted on June 23, 2017 by Tommy M. McGuire

Labels: c, applied formal logic, math, notation, programming language, software development, theory, toy problems, applied formal logic

In my last post, I presented a brief introduction to Frama-C and to the process of verifying properties about a very simple C function, a brute force string search. This time around, I intended to do roughly the same thing, using a slightly more complicated function, a faster string searching algorithm. Unfortunately, in doing so, I found a bug in the algorithm. Admittedly, the bug is rather minor and should not affect the actual behavior of an implementation in practice, but still, it is doing something it shouldn’t ought to be doing.

The string searching algorithm I am looking at this time is Quick Search, so named by Daniel M. Sunday in 1990 because it “is a simple, fast, practical algorithm [and] because it can be both coded and debugged quickly” (“A Very Fast Substring Search Algorithm” [PDF]. It is my personal favorite string search because it is, indeed, simple, fast, and practical. Let me quickly show why.

The algorithm I presented last time is slow. It is slow because it tries to match the needle (the string to be looked for) against the haystack (the string being searched) at every possible position in the haystack, one after another. As a result, it does potentially n character comparisons (trying to match a needle of length n), h times. (h is the length of the haystack; technically, it is h-n times, but the difference isn’t very important.) In short, it’s an O(n^2) algorithm.

On the other hand, the current winner of the string searching algorithm competition is known as Boyer-Moore, an algorithm I’m not really going to try to describe in detail. It’s complicated. Boyer-Moore, like the other faster searches, achieves its performance by not searching for the needle at every location in the haystack; rather it does some extra work up-front to be able to advance by more than one location on a failing comparison. (All algorithms still potentially do n character comparisons for each visited location; there is no better way to handle that. The faster algorithms can only be smarter by reducing the h-side of things.) Boyer-Moore uses two tricks to increase its advances:

One, sometimes called a “bad character shift”, is the work-horse; it is normally what is used to skip forward in the haystack.
The second, sometimes called a “good suffix shift”, is there to prevent O(n^2) behavior in the cases where the bad character shift would allow it.

Combining these two tricks, Boyer-Moore is an O(n) algorithm (specifically O(3h - h/n) according to the reference I’m looking at right now, if I’m correctly translating the variable names). It is fast. But it is also complicated.

(As an aside, there are other fast string searching algorithms with similar guarantees; Knuth-Morris-Pratt comes to mind. But Boyer-Moore is popular, for some reason.)

Given the complexity, other researchers presented algorithms that simplify Boyer-Moore by removing bits, typically the good suffix shift. Quick Search is one of those algorithms. (Note that this simplification may have performance benefits; Boyer-Moore does a lot of work and if Quick Search does less but still gets the algorithmic benefits of being smart in the normal case, then in the normal case Quick Search will be faster. But it still doesn’t have the linear worst-case guarantee of Boyer-Moore.)

Without further ado, here’s Quick Search.

Quick Search

As I mentioned above, faster searches do some work up front to reduce the work they have to do later. In the case of Quick Search, hereafter called QS, the extra work is to examine the needle to create a “bad character shift” table. When a comparison of the needle with part of the haystack fails, QS looks to the character of the haystack that follows the failed match and identifies the largest shift it can make for the next comparison without missing a possible match. If, for example, the next character in the haystack does not occur in the needle, then the shift can be made by the length of the needle, plus one. Perhaps an example might be useful.

Suppose the haystack is “xxxxxxxabcbxx” and the needle is “abcb”. In code I’ll show momentarily, QS creates a bad_shift table from the needle that looks like:

{
  default -> 5,
  'a'     -> 4,
  'b'     -> 1,
  'c'     -> 2
}

The initial state of the algorithm looks like this:

xxxxxxxabcbxx
abcb

This comparison obviously fails, since ‘x’ is not equal to ‘a’. The character after the needle is ‘x’; the shift associated with ‘x’ is 5, so the next state is:

xxxxxxxabcbxx
     abcb

This comparison fails as well, and ‘c’ is the next character to check. The resulting shift is 2:

xxxxxxxabcbxx
       abcb

This comparison succeeds. Yay!

Using the bad_shift table to advance in the haystack potentially improves the performance of the algorithm greatly. The table is initialized by make_bad_shift, the following function.

void
make_bad_shift (char *needle, int n, int bad_shift[], int chars)
{
  int i;
  for (i = 0; i < chars; ++i) {
    bad_shift[i] = n + 1;
  }
  for (i = 0; i < n; ++i) {
    bad_shift[needle[i]] = n - i;
  }
}

There are two kinds of values in the bad shift table:

The default values for bad_shift are one more than the length of the needle. Since a character that doesn’t appear in the needle cannot participate in the match, the shift puts the start of the needle past the character used as an index into the table.
The remaining values in bad_shift are set to the offset of the last instance of the character from the end of the needle. The shift aligns the last instance of the character in the needle with the matching instance of the character in the haystack.

Once you understand the bad character shift, the search function itself, QS, should be easy to understand. That is, if you understood the brute force version in the last post.

int
QS (char *needle, int n, char *haystack, int h)
{
  int i, j, bad_shift[UCHAR_MAX + 1];
  /* Preprocessing */
  make_bad_shift (needle, n, bad_shift, UCHAR_MAX + 1);
  /* Searching */
  i = 0;
  while (i <= h - n) {
    for (j = 0; j < n && needle[j] == haystack[i + j]; ++j);
    if (j >= n) {
      return i;
    }
    i += bad_shift[haystack[i + n]];	/* shift */
  }
  return -1;
}

While the brute force search advanced by one character after each failing comparison, Quick Search advances by at least one (if the last character in the needle matches the next character in the haystack) and often more.

By the way, did you see the bug? It’s there, and once you know what it is, it’s rather glaring. And no, I’m not talking about the use of char for the needle and haystack, even though a signed character can be less than 0, which would do bad things when used as an index into the bad_shift table. I’ll have to fix that, too, though. Anyway, if you didn’t see the bug, you’re in good company; neither did:

Charras, Christian; Thierry Lecroq. “Exact string matching algorithms,” http://www-igm.univ-mlv.fr/~lecroq/string/index.html.
Lecroq, Thierry. “Experimental results on string matching algorithms,” Software - Practice & Experience 25(7):727-765, 1995. http://www-igm.univ-mlv.fr/~lecroq/articles/spe95.pdf.
Lin, Jie; Donald Adjeroh and Yue Jiang. “A faster quick search algorithm,” Algorithms 7(2), 2014. http://www.mdpi.com/1999-4893/7/2/253.
Sunday, Daniel M. “A very fast substring search algorithm,” Communications of the ACM, Vol. 33, No. 8, Aug. 1990. https://csclub.uwaterloo.ca/~pbarfuss/p132-sunday.pdf.

I tripped over the problem when starting to write this post, specifically when I was making the goofy shift diagrams above. Even when I saw the problem, though, I didn’t see what to do about it. But, rather than point out the bug now, how about I go on to the proof of safety and let Frama-C show the problem?

Safety

If you read the previous post, the contract and internal assertions for the function QS are nothing surprising. The first three requires clauses and the assigns clause are copied directly from brute_force, and the final requires clause is nothing too unusual. (To be honest, I’m not sure why it’s required. I have already adjusted the char types to be unsigned char, so the maximum value of any element of needle should already be UCHAR_MAX.

Inside the implementation, the loop assigns clauses are present to prevent anything untoward from being yanked on, and the loop invariant clauses assert that the loop variables have bounds.

/*@
 @ // Original requirements for searching
 @ requires \valid(needle + (0 .. n-1)) && 0 <= n < INT_MAX;
 @ requires \valid(haystack + (0 .. h-1)) && 0 <= h < INT_MAX;
 @ requires n <= h;
 @ // the elements of needle are valid indices into bad_shift
 @ requires \forall int i; 0 <= i < n ==> 0 <= needle[i] < UCHAR_MAX + 1;
 @ assigns \nothing;
 */
int
QS (unsigned char *needle, int n, unsigned char *haystack, int h)
{
  int i, j, bad_shift[UCHAR_MAX + 1];
  /* Preprocessing */
  make_bad_shift(needle, n, bad_shift, UCHAR_MAX + 1);
  /* Searching */
  i = 0;
  /*@
   @ loop assigns i, j;
   @ loop invariant 0 <= i <= h + 1;
   */
  while (i <= h - n) {
    /*@
     @ loop assigns j;
     @ loop invariant 0 <= j <= n;
     */
    for (j = 0; j < n && needle[j] == haystack[i + j]; ++j);
    if (j >= n) {
      return i;
    }
    i += bad_shift[ haystack[i + n] ];	/* shift */
  }
  return -1;
}

One interesting detail is the loop invariant for the outer loop: 0 ≤ i ≤ h + 1. The upper bound cannot be h − n, copied from the condition of the while, because the shift is being calculated from the bad_shift table. However, I already know that the maximum value of i for the last iteration of the loop is h − n and the maximum value in the table is n + 1, so the maximum value that i can achieve on the final iteration is

(h − n)+(n + 1) = h + 1

Which brings me to the safety of make_bad_shift. The requirement for bounds on the bad shift table means that getting a proof for the safety of QS calls for more complexity in the contract and proof for make_bad_shift. Here is its contract and function header:

/*@
 @ // needle and bad_shift are valid arrays.
 @ requires \valid(needle + (0 .. n-1)) && 0 <= n < INT_MAX;
 @ requires \valid(bad_shift + (0 .. chars-1)) && 0 <= chars < INT_MAX;
 @ // needle and bad_shift are separate
 @ requires \separated(needle + (0 .. n-1), bad_shift + (0 .. chars-1));
 @ // the elements of needle are valid indices into bad_shift
 @ requires \forall int i; 0 <= i < n ==> 0 <= needle[i] < chars;
 @ // this function initializes bad_shift
 @ assigns *(bad_shift + (0 .. chars-1));
 @ // bad_shift is initialized to be between 1 and n+1
 @ ensures \forall int i; 0 <= i < chars ==> 1 <= bad_shift[i] <= n+1;
 */
void
make_bad_shift (unsigned char *needle, int n, int bad_shift[], int chars)

Walking down through the clauses in the contract, I have:

Precondition requirements that both needle and bad_shift are valid arrays of length n and chars, respectively.
A requirement that needle and bad_shift (the actual arrays, not the pointers) don’t overlap. If needle and bad_shift overlapped, then writing to bad_shift would change needle, and potentially all of the other properties would go out the window.
A requirement that all of the elements of needle are usable as indices into bad_shift.
An assigns clause indicating that make_bad_shift modifies the array pointed to by bad_shift.
And finally, and possibly most importantly, a postcondition indicating that bad_shift, after being initialized, will contain values between 1 and n + 1. The postcondition is required to make the safety proof of the outer loop in QS work.

The implementation of make_bad_shift needs six clauses, three for each loop.

void
make_bad_shift (unsigned char *needle, int n, int bad_shift[], int chars)
{
  int i;
  /*@
   @ loop assigns i, *(bad_shift + (0 .. chars-1));
   @ loop invariant 0 <= i <= chars;
   @ loop invariant \forall int k; 0 <= k < i ==> bad_shift[k] == n + 1;
   */
  for (i = 0; i < chars; ++i) {
    bad_shift[i] = n + 1;
  }
  /*@
   @ loop assigns i, *(bad_shift + (0 .. chars-1));
   @ loop invariant 0 <= i <= n;
   @ loop invariant \forall int k; 0 <= k < chars ==> 1 <= bad_shift[k] <= n+1;
   */
  for (i = 0; i < n; ++i) {
    bad_shift[needle[i]] = n - i;
  }
}

The first loop

modifies i and the bad_shift array,
using a loop variable i that ranges from 0 to chars, and
all of the elements of bad_shift that have been initialized, have been initialized to n + 1.

Following the first loop, all of the elements of bad_shift have been set to n + 1.

The second loop,

modifies i and the bad_shift array,
using i ranging from 0 to n (in other words, through the needle), and
all of the elements of bad_shift are always between 1 and n + 1.

The last clause is different from the loop invariants seen before, in that it applies to all of the elements of bad_shift initially and at all end-of-loop states. But it does establish the function postcondition.

The assorted assertions, plus those generated automatically to avoid run-time errors, turn into 58 goals for the proof engine. 57 of those are satisfied. The last one is…

The bug

Here is the output of -wp-print for the one remaining failing goal:

Goal Assertion 'rte,mem_access' (file quicksearch.c, line 73):
Let a = shift_uint8(needle_0, j).
Let x = i + j.
Let m = Malloc_0[L_bad_shift_690 <- 256].
Let x_1 = i + n.
Let a_1 = shift_A256_sint32(global(L_bad_shift_690), 0).
Let a_2 = shift_sint32(a_1, 0).
Let a_3 = shift_uint8(needle_0, 0).
Let x_2 = Mint_0[shift_uint8(haystack_0, x_1)].
Let a_4 = shift_uint8(haystack_0, to_sint32(x_1)).
Let x_3 = Mint_0[a_4].
Assume {
  Type: is_sint32(h) /\ is_sint32(i) /\ is_sint32(j) /\ is_sint32(n) /\
      is_uint8(x_2) /\ is_uint8(x_3) /\
      is_sint32(Mint_0[shift_sint32(a_1, x_2)]) /\
      is_sint32(Mint_0[shift_sint32(a_1, x_3)]).
  (* Heap *)
  Have: linked(Malloc_0) /\ (region(haystack_0.base) <= 0) /\
      (region(needle_0.base) <= 0).
  (* Pre-condition *)
  Have: (0 <= n) /\ (n <= 2147483646) /\ valid_rw(Malloc_0, a_3, n).
  (* Pre-condition *)
  Have: (0 <= h) /\ (h <= 2147483646) /\
      valid_rw(Malloc_0, shift_uint8(haystack_0, 0), h).
  (* Pre-condition *)
  Have: n <= h.
  (* Pre-condition *)
  Have: forall i_1 : Z. let x_4 = Mint_1[shift_uint8(needle_0, i_1)] in
      ((i_1 < n) -> ((0 <= i_1) -> ((0 <= x_4) /\ (x_4 <= 255)))).
  (* Call 'make_bad_shift' *)
  Have: valid_rw(m, a_3, n) /\ separated(a_3, n, a_2, 256) /\
      (forall i_1 : Z. let x_4 = Mint_1[shift_uint8(needle_0, i_1)] in
       ((i_1 < n) -> ((0 <= i_1) -> ((0 <= x_4) /\ (x_4 <= 255))))) /\
      (forall i_1 : Z. let x_4 = Mint_0[shift_sint32(a_1, i_1)] in
       ((0 <= i_1) -> ((i_1 <= 255) -> ((0 < x_4) /\ (x_4 <= (1 + n)))))).
  (* Call Effects *)
  Have: havoc(Mint_1, Mint_0, a_2, 256).
  (* Invariant *)
  Have: (0 <= i) /\ (i <= (1 + h)).
  (* Assertion 'rte,signed_overflow' *)
  Have: n <= (2147483648 + h).
  (* Assertion 'rte,signed_overflow' *)
  Have: h <= (2147483647 + n).
  (* Then *)
  Have: x_1 <= h.
  (* Invariant *)
  Have: (0 <= j) /\ (j <= n).
  Have: j < n.
  (* Assertion 'rte,mem_access' *)
  Have: valid_rd(m, a, 1).
  (* Assertion 'rte,signed_overflow' *)
  Have: (-2147483648) <= x.
  (* Assertion 'rte,signed_overflow' *)
  Have: x <= 2147483647.
  (* Assertion 'rte,mem_access' *)
  Have: valid_rd(m, shift_uint8(haystack_0, to_sint32(x)), 1).
  (* Else *)
  Have: Mint_0[a] != Mint_0[shift_uint8(haystack_0, x)].
}
Prove: valid_rd(m, a_4, 1).
Prover Alt-Ergo returns Unknown (Qed:8ms) (4.3s)

Whoo. Nice. To summarize that mess, the goal to be proved is

Let m = Malloc_0[L_bad_shift_690 <- 256].
Let x_1 = i + n.
Let a_4 = shift_uint8(haystack_0, to_sint32(x_1)).
valid_rd(m, a_4, 1).

Or in English, “mumble, mumble, valid read, bad_shift, mumble, haystack, i + n, mumble”.

Line 73 is

i += bad_shift[ haystack[i + n] ];	/* shift */

If I break the right hand side apart, the error can be found in the expression haystack[i + n]. (Can you see it now?)

The maximum value of i in the loop is h − n; therefore the maximum value of the array index in that expression is h − n + n or h. But the array itself is h elements long; it only goes from 0 to h − 1.

In that line, the code tries to access one element past the end of the valid haystack array. This is, as I mentioned back at the start, not normally a problem. With C strings, that element will be the 0 character marking the end of the string, and 0 is a valid index into bad_shift. If it is not a C string, it will access some random byte immediately after the array, which should not cause anything to blow up, and the byte will also be a valid index into bad_shift, because any value < 256 is valid. Further, all of the values in bad_shift are greater than one, so in the next step, when the code returns to the while conditional, i will be greater than h − n and the loop will terminate. But still, it’s the principle of the thing.

One way to fix the bug would be to insert a test just before line 73, checking if i is h − n. At this point, the algorithm already ensures that no match has been found; this statement merely breaks out of the loop before trying the invalid access to haystack.

if (i == h - n) { break; }

On the other hand, this is supposed to be a fast string search algorithm, and inserting an extra condition in the loop may harm performance. An alternative would be to change the outer loop condition to be i < h - n and to add a test for the final match after the outer loop.

  if (i == h - n) {
    /*@
     @ loop assigns j;
     @ loop invariant 0 <= j <= n;
     */
    for (j = 0; j < n && needle[j] == haystack[i + j]; ++j);
    if (j >= n) {
      return i;
    }
  }

Some quick benchmarking should tell which option is faster, or if it matters at all.

With either fix, all of the goals are satisfied and the Quick Search function is verified as being safe.

Functional correctness, or progress, is another kettle of fish. In order to prove that the code finds the needle in the haystack (or possibly finds the left-most needle in the haystack), or not, it is necessary to show that when the algorithm skips characters using the bad_shift, it cannot miss a match. I see why it can’t, but I have not figured out how to express that in ACSL. Watch this space.

By the way, the code is available on github: https://github.com/tmmcguire/frama-c-toys/blob/master/string-search/quicksearch.c. Enjoy!