Flex- and BisonModules updated (sort of)

Posted on December 28, 2013 by Tommy McGuire
Labels: c, python, programming language

The short version: I've added a github repository for FlexModule and BisonModule: tools to create Python modules from flex-based lexical scanners and bison-based parsers. The code is old, but seems to work with a recent Python 2.7.

A while back, I got email from Jerry Barrington about a project that I had actually completely forgotten about: FlexModule and BisonModule.

This all started many years ago, when Python was my favorite language and I was just starting to work on the Austin Protocol Compiler. I needed a way to write the parser in Python and could not find any satisfactory existing Python parser libraries. Fortunately, I was fairly familiar with flex and bison, the replacements for lex and yacc.

In what became one of my favorite hacks, I decided to write semi-generic wrappers for flex-based lexical analyzers and bison-based parsers. Flex is easy---code using it calls the scanner to deliver the next token, which can then be wrapped in a Python data structure trivially. All the lexer has to do is return the number representing the token; the wrapper can get everything else from the lexer state.

As an example, the lexer for my favorite four-function calculator example is:


%{
#include "hocgrammar.h"
#include "FlexModule.h"
%}

num [0-9]*\.?[0-9]+
id [a-z]
ws [ \t]+
str \"([^"\\\n]|\\.)*\"

%%

"input "{str} { PUSH_FILE_YYTEXT(7,yyleng-1); }
{num} { return(NUMBER); }
{id} { return(VAR); }
{ws} { ADVANCE; /* and skip */ }
.|\n { return(yytext[0]); }

%%

TokenValues module_tokens[] = {
{"NUMBER", NUMBER},
{"VAR", VAR},
{0,0}
};

FLEXMODULEINIT(hoclexer, module_tokens);

And as an advanced feature, that includes importing files recursively. Further, most of the lexer rules take care of the state updates to handle tracking the location for error messages.

Bison is a bit more of a problem, because it assumes control flow and calls back, into the lexer to get tokens and into the user's code for rule execution. My wrapper is simplified because I am only interested in building a parse tree, not executing any domain specific code. On the other hand, it is complicated by bison's syntax error handling procedure. In the end, though, building a parse tree (and handling errors) is reasonably easy. This is a sample from the four-function calculator grammar:


start: list { RETURNTREE($1); }

list: /* nothing */ { $$ = REDUCE(LIST); }
| list '\n' { $$ = $1; }
| list expr '\n' { $$ = REDUCELEFT($1, $2); }
| list error '\n' { $$ = REDUCELEFT($1, REDUCEERROR); }
;

expr: NUMBER { $$ = $1; }
| VAR { $$ = $1; }
| VAR '=' expr { $$ = REDUCELEFT($2, $1, $3); }
| expr '+' expr { $$ = REDUCELEFT($2, $1, $3); }
| expr '-' expr { $$ = REDUCELEFT($2, $1, $3); }
| expr '*' expr { $$ = REDUCELEFT($2, $1, $3); }
| expr '/' expr { $$ = REDUCELEFT($2, $1, $3); }
| '(' expr ')' { $$ = $2; }
| '-' expr %prec UNARYMINUS
{ $$ = REDUCELEFT($1, $2); }
;

RETURNTREE sets the root of the parse tree into a well-known location, where it can be returned to Python code. REDUCE introduces a new Symbol object, with the LEFT symbol id. REDUCELEFT handles most recursive duties by adding its second and subsequent arguments as children of its first argument. Finally REDUCEERROR cleans up after errors and creates an error symbol to insert into the parse tree.

As I said, this is one of my favorite hacks, and following the email from Jerry, I dusted it off, verified that it seems to work with Python 2.7, and put a copy out on github. The only major thing I had to deal with is actually building modules. (I would appreciate any assistance with the setup.py in the hoc example code.) Take a look if you are interested in such things.

active directory applied formal logic ashurbanipal authentication books c c++ comics conference continuations coq data structure digital humanities Dijkstra eclipse virgo electronics emacs goodreads haskell http java job Knuth ldap link linux lisp math naming nimrod notation OpenAM osgi parsing pony programming language protocols python quote R random REST ruby rust SAML scala scheme shell software development system administration theory tip toy problems unix vmware yeti
Member of The Internet Defense League
Site proudly generated by Hakyll.