Applied Formal Logic: Brute Force String Search

Posted on June 19, 2017 by Tommy M. McGuire
Labels: c, applied formal logic, math, notation, programming language, software development, theory, toy problems, applied formal logic

Need is there, but tools are not.

– zzz95

Let’s play with Frama-C.

Frama-C, which apparently stands for “Framework for Modular Analysis of C programs”, is “a suite of tools dedicated to the analysis of the source code of software written in C”. According to the description,

Frama-C is closer to these [bug-finding] heuristic tools than it is to software metrics tools, but it has two important differences with them: it aims at being correct, that is, never to remain silent for a location in the source code where an error can happen at run-time. And it allows its user to manipulate functional specifications, and to prove that the source code satisfies these specifications.

That sounds like fun, right? Who wouldn’t want to prove their source code satisfies formal specifications?

Well, ok, many people wouldn’t. Formal methods have a long, troubled history in computer science and computer programming. They’re regarded as the best idea since sliced bread by some of the more bondage and discipline-loving elements of the academic community. But in practice, nobody (much) uses them, because they’re a gigantic pain in the rump. The famously cranky (and socks-and-sandals wearing) Edsger Dijkstra pushed hard for correctness-by-construction by calculating programs from their specifications. You can probably guess how far that got. Various very smart people introduced specification languages and tools like Promela and Spin, Z, and TLA+ (and Alloy, and probably some I’m forgetting now), in order to get away from the fine details of actual programming while proving stuff about systems. The greatest success of these tools is that sometimes people recognize the names. And then there’s programming languages: types are periodically hot, and more types are more better, so there’s a hearty push for dependent typing and programming languages with names like ATS, Agda, and Idris. Dependently typed programming languages have one great advantage: they introduce completely new shapes of learning curves. (Vertical? I swear, ATS looks like it hangs over at the top.)

So, yeah, few people are really interested in proving properties about code.

But still, mistakes are embarrassing. Code gets complicated quickly and all the other methods for not screwing up suck, too. My interest here is in seeing how far I can get, without putting too much effort into the process. Maybe I can get some decent bang without putting too many bucks in.

How does it work?

If you are going to play with formal methods, what language do you choose? How about the worst possible case: The language everyone loves to hate because of its impressive safety record. The language that invented the term “undefined behavior” (and “sequence point”). Here it is: C. C is actually one of my favorite languages, and I occasionally refer to it as my native tongue. Sure, it’s a room full of rabid, poisonous animals in shoddily-constructed cages, but it’s my room full of poisonous animals.

Frama-C is what links C and formal methods.

Frama-C is a framework for parsing and statically analyzing ANSI/ISO C code, using a collection of plugins to perform specific analyses. According to the -plugins argument to frama-c, there are more than a few options available. I’m starting with WP, which “implements a weakest precondition calculus for ACSL annotations through C programs”. A “weakest precondition calculus” is that logical, formal system about code created by Tony Hoare and Robert Floyd, and popularized by our old socks-n-sandals buddy, EWD. (Did I mention he also wore a cowboy hat?) Hoare logic involves predicates before and after each statement (hey, like sequence points!) describing the state of the program before and after the statement. ACSL describes the syntax of the predicates: The ANSI/ISO C Specification Language is a language for logical predicates (and related information) written in C comments. ACSL is not the only such language; JML, for Java, and Spark, a subset of Ada, are very close relatives. And the research language Dafny has the same bag of tricks built in.

The WP plugin, like JML, Spark, and so forth, work by parsing the ACSL comments and the C code, and translating them to the language used by SMT solvers. Those solvers then produce an indication of success, indicating that the properties of the specification are valid according to the code, or failure and a more-or-less indecipherable error.

To play with a simple example, I’ll start with a function performing a string search: the brute force algorithm from Exact String Matching Algorithms.

/*
 * Search for needle in haystack, using a brute force algorithm.
 *
 * Expected complexity is O(n*h), where n in the length of the
 * needle and h is the length of the haystack.
 */
int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;

  for (i = 0; i <= h - n; ++i) {
    for (j = 0; j < n && needle[j] == haystack[j + i]; ++j);
    if (j >= n) {
      return i;
    }
  }
  return -1;
}

This is about as bare-bones as a string search could be. Given a needle string, of length n, and a haystack string, of length h, return the index where needle is located in haystack. If needle is not found, return -1. The outer loop takes i through the possible indices of the haystack (0 to h-n, since the needle has some length), and for each position tries to find the needle starting at that position with the inner loop using j.

You’ll be seeing this code several times in this post. I intend to re-use it, as-is, and just add the necessary ACSL specifications to get it to pass verification with an appropriate set of correctness properties. I’ll start with safety properties and then go on to functional correctness guarantees.

What happens when I rub Frama-C with the WP plugin against this code?

    $ frama-c -wp brute-force.c
    [kernel] Parsing FRAMAC_SHARE/libc/__fc_builtin_for_normalization.i (no preprocessing)
    [kernel] Parsing brute-force.c (with preprocessing)
    [wp] warning: Missing RTE guards
    [wp] 0 goal scheduled
    [wp] Proved goals:    0 / 0

A warning? That can’t be good. The problem is the lack of another plugin, Rtegen, which “generates annotations for runtime error checking and preconditions at call sites”. Using the RTE plugin requires another argument to Frama-C:

    $ frama-c -wp -wp-rte brute-force.c
    [kernel] Parsing FRAMAC_SHARE/libc/__fc_builtin_for_normalization.i (no preprocessing)
    [kernel] Parsing brute-force.c (with preprocessing)
    [rte] annotating function brute_force
    brute-force.c:16:[wp] warning: Missing assigns clause (assigns 'everything' instead)
    brute-force.c:15:[wp] warning: Missing assigns clause (assigns 'everything' instead)
    [wp] 8 goals scheduled
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_mem_access_2 : Unknown (108ms)
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_mem_access : Unknown (Qed:4ms) (107ms)
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_2 : Unknown (105ms)
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow : Unknown (104ms)
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_4 : Unknown (Qed:4ms) (55ms)
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_3 : Unknown (54ms)
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_6 : Unknown (Qed:4ms) (159ms)
    [wp] Proved goals:    1 / 8
         Qed:             0  (4ms)
         Alt-Ergo:        1  (28ms) (41) (unknown: 7)

That looks like I’m making progress; I have some goals and one of them is even satisfied already. But there are a couple of warnings, which I think I should address first.

Lines 15 and 16 refer to the two for loops in the code, and in ACSL, an “assigns” clause identifies the memory locations that code modifies. With no assigns clause, the code is allowed to modify any location anywhere, which is potentially problematic as far as verification is concerned. Getting rid of the warnings provides the first introduction to ACSL.

int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;

  /*@
   loop assigns i,j;
   */
  for (j = 0; j <= h - n; ++j) {
    /*@
     loop assigns i;
    */
    for (i = 0; i < n && needle[i] == haystack[i + j]; ++i);
    if (i >= n) {
      return j;
    }
  }
  return -1;
}

The special ACSL comments start with “/*@” and in this case apply to the for statement immediately following. An assigns clause can apply to a function, as part of the function’s contract, or to a loop as in this case. The inner loop modifies only the variable i, while the outer loop modifies j and (indirectly) i.

C has certain…quirks. For example, any given block of code can modify just about anything else. In order to reason about this code, Frama-C needs to know that the code is not going to alter the loop variables i and j in any way other than the obviously visible loop increments. Hence the necessity of the assigns clauses; they say that i and j are going to be modified, but nothing else is. With this code and that knowledge, it is then possible to deduce that the only way they can be changed is the visible increments.

With that modification, I can get down to business.

    $ frama-c -wp -wp-rte brute-force.c
    [kernel] Parsing FRAMAC_SHARE/libc/__fc_builtin_for_normalization.i (no preprocessing)
    [kernel] Parsing brute-force.c (with preprocessing)
    [rte] annotating function brute_force
    [wp] 10 goals scheduled
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_2 : Unknown (54ms)
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow : Unknown (Qed:4ms) (53ms)
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_mem_access_2 : Unknown (107ms)
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_mem_access : Unknown (105ms)
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_3 : Unknown (58ms)
    [wp] [Alt-Ergo] Goal typed_brute_force_assert_rte_signed_overflow_6 : Unknown (Qed:4ms) (103ms)
    [wp] Proved goals:    4 / 10
         Qed:             2 
         Alt-Ergo:        2  (16ms) (39) (unknown: 6)

So now, I have 10 goals (including the loop assignment checks), 4 of which have already been satisfied (including the loop assignment checks). How do I find out more about the unsatisfied goals? Adding -wp-pretty generates the following output:

    $ frama-c -wp -wp-rte -wp-print brute-force.c
    ...
    [wp] Proved goals:    4 / 10
         Qed:             2  (4ms)
         Alt-Ergo:        2  (16ms-20ms) (39) (unknown: 6)
    ------------------------------------------------------------
      Function brute_force
    ------------------------------------------------------------
    
    Goal Assertion 'rte,signed_overflow' (file brute-force.c, line 18):
    Assume {
      Type: is_sint32(h) /\ is_sint32(n).
      (* Heap *)
      Have: linked(Malloc_0) /\ sconst(Mchar_0) /\
          (region(haystack_0.base) <= 0) /\ (region(needle_0.base) <= 0).
    }
    Prove: n <= (2147483648 + h).
    Prover Alt-Ergo returns Unknown (53ms)
    
    ------------------------------------------------------------
    
    Goal Assertion 'rte,signed_overflow' (file brute-force.c, line 18):
    Assume {
      Type: is_sint32(h) /\ is_sint32(n).
      (* Heap *)
      Have: linked(Malloc_0) /\ sconst(Mchar_0) /\
    ...

By the looks of that first block, the particular goal has something to do with run-time errors and signed overflow, around line 18 of the file. The particular property to be proved is “n <= (2147483648 + h)”.

This, by the way, is the last time I am going to show the command invocation (and you will notice I’ve already elided the summary output produced initially.) Most of the output with -wp-pretty resembles this section: a description of the goal under consideration and its location, the collection of information the solver has at that point (which can be much more complicated than this example), the actual goal to be proved, and the results of one of the solvers (in this case Alt-Ergo).

In any case, 2147483648 is 2^31, or 2^32 / 2, or the maximum signed integer representable in 32 bits. Line 18 is:

for (j = 0; j <= h - n; ++j) {

And n and h are the lengths of the input strings. Maybe at this point I should formally describe what I know about the inputs to the function. That might clear up some of those pesky overflow goals.

Now, obviously, needle and haystack have to be valid pointers, and the blocks they point to have to be n and h bytes long, respectively. Further, n and h have to be less than INT_MAX, the C constant describing the maximum value of an int. n and h have to be strictly less than INT_MAX because they are being used here in for loops and range from 0 to b for some meaningful b, one more than the valid range of indices from 0 to b-1. Iterator boundary conditions, and all that.

Finally, n has to be at most h. Otherwise, the needle cannot be in the haystack, right?

/*@
 requires \valid(needle + (0 .. n-1)) && n < INT_MAX;
 requires \valid(haystack + (0 .. h-1)) && h < INT_MAX;
 requires n <= h;
 */
int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;

  /*@
   loop assigns i,j;
   */
  for (j = 0; j <= h - n; ++j) {
    /*@
     loop assigns i;
    */
    for (i = 0; i < n && needle[i] == haystack[i + j]; ++i);
    if (i >= n) {
      return j;
    }
  }
  return -1;
}

As a result, that one goal is satisfied, but nothing more. A little disappointing.

    ...
    ------------------------------------------------------------
      Function brute_force
    ------------------------------------------------------------
    
    Goal Assertion 'rte,signed_overflow' (file brute-force.c, line 23):
    Assume {
      Type: is_sint32(h) /\ is_sint32(n).
      (* Heap *)
      Have: linked(Malloc_0) /\ sconst(Mchar_0) /\
          (region(haystack_0.base) <= 0) /\ (region(needle_0.base) <= 0).
      (* Pre-condition *)
      Have: (n <= 2147483646) /\ valid_rw(Malloc_0, shift_sint8(needle_0, 0), n).
      (* Pre-condition *)
      Have: (h <= 2147483646) /\
          valid_rw(Malloc_0, shift_sint8(haystack_0, 0), h).
      (* Pre-condition *)
      Have: n <= h.
    }
    Prove: n <= (2147483648 + h).
    Prover Alt-Ergo returns Valid (16ms) (16)
    
    ------------------------------------------------------------
    ...

Here’s the scorecard of satisfied and unsatisfied goals:

Property Proved? Notes
n <= (2147483648 + h) Valid
h <= (2147483647 + n) Unknown
valid_rd(Malloc_0, a, 1) Unknown a = shift_sint8(needle_0, i)
valid_rd(Malloc_0, shift_sint8(haystack_0, to_sint32(x)), 1) Unknown x = i + j
(-2147483648) <= x Unknown x = i + j
x <= 2147483647 Valid x = i + j
i <= 2147483646 Valid
j <= 2147483646 Unknown
Loop assigns Valid
Loop assigns Valid

Most of the unknown goals involve i, j, and the loops. It appears that Frama-C is smart enough to identify possible bounds violations but not smart enough to notice that they shouldn’t happen. I need to give it some help.

The simplest possible help I can see is to specify loop invariants describing the indices of the loops. For the outer loop, j varies from 0 to h-n+1 inclusive (remember the final value); for the inner loop, i varies from 0 to n inclusive.

/*@
 requires \valid(needle + (0 .. n-1)) && n < INT_MAX;
 requires \valid(haystack + (0 .. h-1)) && h < INT_MAX;
 requires n <= h;
 */
int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;

  /*@
   loop assigns i,j;
   loop invariant 0 <= j <= (h-n) + 1;
   */
  for (j = 0; j <= h - n; ++j) {
    /*@
     loop assigns i;
     loop invariant 0 <= i <= n;
    */
    for (i = 0; i < n && needle[i] == haystack[i + j]; ++i);
    if (i >= n) {
      return j;
    }
  }
  return -1;
}

In ACSL, the “loop invariant” clause specifies the invariant. The important parts about loop invariants, if you remember anything about them, is that:

That helps quite a lot. Running Frama-C now, the scorecard looks like:

Property Proved? Notes
n <= (2147483648 + h) Valid
h <= (2147483647 + n) Unknown
valid_rd(Malloc_0, a, 1) Valid a = shift_sint8(needle_0, i)
valid_rd(Malloc_0, shift_sint8(haystack_0, to_sint32(x)), 1) Valid x = i + j
(-2147483648) <= x Valid x = i + j
x <= 2147483647 Valid x = i + j
i <= 2147483646 Valid
j <= 2147483646 Valid
Goal Loop assigns Valid
Goal Loop assigns Valid
(-1) <= j Valid
n <= (1 + h) Valid
-1 <= i Valid
0 <= n Unknown

12 out of 14 goals have been satisfied. I’m cookin’ with gas, now!

Looking at the two that haven’t been satisfied, I realize that there is some additional information about the arguments to the function that I haven’t given Frama-C: lower bounds. Obviously, if n or h is less than zero, something is not going to work right.

/*@
 requires \valid(needle + (0 .. n-1)) && 0 <= n < INT_MAX;
 requires \valid(haystack + (0 .. h-1)) && 0 <= h < INT_MAX;
 requires n <= h;
 */
int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;

  /*@
   loop assigns i,j;
   loop invariant 0 <= j <= (h-n) + 1;
   */
  for (j = 0; j <= h - n; ++j) {
    /*@
     loop assigns i;
     loop invariant 0 <= i <= n;
    */
    for (i = 0; i < n && needle[i] == haystack[i + j]; ++i);
    if (i >= n) {
      return j;
    }
  }
  return -1;
}

There. Neither n nor h can be less than zero. Given that addition to the requires clauses in the function contract, all 14 of the goals are satisfied, and Frama-C has proved that this function does not have any errors. It is safe.

If, that is, it is called with the right arguments: needle and haystack must be valid, n and h must be between 0 and INT_MAX, and needle must be smaller than haystack. Otherwise, all bets are off. Fortunately, if Frama-C is used to ensure that brute_force is only called with the right arguments, then everything will work out all right.

Progress?

When I write that the function doesn’t have any errors, I mean that it’s safe. It can’t go wrong. If the arguments are ok, the function cannot access out-of-bounds memory, trip over an arithmetic error, or do anything else nasty. But I haven’t demonstrated that it’ll do anything right, either.

What I need to do next is to describe in ACSL the function’s functional correctness properties; what the brute-force search is supposed to do: find a needle string in a haystack string.

I chose to do so in a couple of steps, since this lets me abstract out the details and avoid repeating them several times. I will skip most of the description of the ACSL syntax. It will hopefully be reasonably obvious to anyone who has seen both predicate logic and C before.

The ACSL block below shows some fun special features. First, ACSL comments start with “//”; yes, that’s a comment within a comment. Secondly, it defines a logical predicate, partial_match_at, which specifies what it means for a prefix of a needle to be found in a haystack at a specific location.

/*@
  // There is a partial match of the needle at location loc in the
  // haystack, of length len.
  predicate partial_match_at(char *needle, char *haystack, int loc, int len) =
    \forall int i; 0 <= i < len ==> needle[i] == haystack[loc + i];
 */

English translation: there is a partial match of length len of a needle in a haystack at a location loc if all of the elements of needle less than len match the corresponding elements of haystack starting at loc.

The predicate is only used by Frama-C static analysis (although I believe there are Frama-C plugins which generate run-time assertions from function contracts). The first place it is used is in another predicate match_at. This predicate supplies n, the total length of the needle as the length of the partial match, as a result requiring a complete match of the needle. It also adds a requirement that loc be less than h-n, so that it will be an acceptable location, as well as adding the other requirements from the brute_force contract. Why? I have no idea. All of those clauses are preconditions for the function. Belt-and-suspenders mathematics, anyone?

/*@
  // There is a complete match of the needle at location loc in the
  // haystack.
  predicate match_at(char *needle, int n, char *haystack, int h, int loc) =
    \valid(needle + (0 .. n-1)) && 0 <= n < INT_MAX &&
    \valid(haystack + (0 .. h-1)) && 0 <= h < INT_MAX &&
    n <= h && loc <= h - n &&
    partial_match_at(needle, haystack, loc, n);
 */

Anyway, match_at specifies a valid match of needle in haystack at location loc.

The updated version of brute_force has a number of changes, both in the function contract and in the loop invariants.

In the contract, I added an assigns clause, assigns \nothing;, to indicate that the function has no visible side effects. Then, I added a basic postcondition on the return value:

ensures -1 <= \result <= (h-n);

Here, \result refers to the return value, and this postcondition indicates that the result will be between -1 and h-n, inclusive. The two behavior specifications add detail to the result:

/*@
 requires \valid(needle + (0 .. n-1)) && 0 <= n < INT_MAX;
 requires \valid(haystack + (0 .. h-1)) && 0 <= h < INT_MAX;
 requires n <= h;
 assigns \nothing;
 ensures -1 <= \result <= (h-n);
 behavior success:
    ensures \result >= 0 ==> match_at(needle, n, haystack, h, \result);
 behavior failure:
    ensures \result == -1 ==>
      \forall int i; 0 <= i < h ==>
        !match_at(needle, n, haystack, h, i);
 */
int
brute_force (char *needle, int n, char *haystack, int h)
{
  int i, j;

  /*@
    loop assigns i, j;
    loop invariant 0 <= i <= (h-n) + 1;
    loop invariant \forall int k; 0 <= k < i ==>
      !match_at(needle, n, haystack, h, k);
   */
  for (i = 0; i <= h - n; ++i) {
    /*@
      loop assigns j;
      loop invariant 0 <= j <= n;
      loop invariant partial_match_at(needle, haystack, i, j);
    */
    for (j = 0; j < n && needle[j] == haystack[j + i]; ++j);
    if (j >= n) {
      return i;
    }
  }
  return -1;
}

To convince Frama-C that this function satisfies those new postconditions, I added two loop invariants.

With these changes, the following goals are verified by Frama-C:

Property Proved? Notes
((-1) <= brute_force_0) /\ ((brute_force_0 + n) <= h) Valid brute_force_0 is the result
(-1) <= i Valid
n <= (1 + h) Valid
!P_match_at(Malloc_0, Mchar_0, needle_0, n, haystack_0, h, i) Valid
Establishment of Invariant (line 58) Valid
n <= (2147483648 + h) Valid
h <= (2147483647 + n) Valid
(-1) <= j Valid
Establishment of Invariant (line 65) Valid
P_partial_match_at(Mchar_0, needle_0, haystack_0, i, x_4) Valid x_4 = 1 + j
P_partial_match_at(Mchar_0, needle_0, haystack_0, i, 0) Valid
valid_rd(Malloc_0, shift_sint8(needle_0, j), 1) Valid
valid_rd(Malloc_0, shift_sint8(haystack_0, to_sint32(i + j)), 1) Valid
(-2147483648) <= (i + j) Valid
(i + j) <= 2147483647 Valid
j <= 2147483646 Valid
i <= 2147483646 Valid
Loop assigns (line 56) Valid
Loop assigns (line 64) Valid
Assigns nothing in brute_force (line 61) Valid
Assigns nothing in brute_force (line 61) Valid
Assigns nothing in brute_force (line 68) Valid
Assigns nothing in brute_force (line 68) Valid
Assigns nothing in brute_force (line 70) Valid
Assigns nothing in brute_force (line 73) Valid
!P_match_at(Malloc_0, Mchar_0, needle_0, n, haystack_0, h, i) Valid
P_match_at(Malloc_0, Mchar_0, needle_0, n, haystack_0, h, brute_force_0) Valid

And with those goals satisfied, I think I can safely say that brute_force has been verified as both safe and functionally correct.

This has been a long mass of verbiage, but I think the actual amount of work done, modulo the learning curve for Frama-C, hasn’t been too great. Certainly, not more than what would be required to test brute_force to anywhere near the same level of certainty.

There is one…issue (I won’t use the word “error”) with this verification, as written. It’s related to a problem described in An empirical study on the correctness of formally verified distributed systems. I’ll leave it as an exercise to the reader to find it.

If there are any other problems with this code-and-specification, I’d love to hear it. I may have missed something—that’s always the worry when dealing with formal specifications—or I may be using the tools wrong (yeah, I don’t actually know what I’m doing). Please, let me know!

The code, if you are interested, can be found on GitHub.

active directory applied formal logic ashurbanipal authentication books c c++ comics conference continuations coq data structure digital humanities Dijkstra eclipse virgo electronics emacs goodreads haskell http java job Knuth ldap link linux lisp math naming nimrod notation OpenAM osgi parsing pony programming language protocols python quote quotes R random REST ruby rust SAML scala scheme shell software development system administration theory tip toy problems unix vmware yeti
Member of The Internet Defense League
Site proudly generated by Hakyll.