Push-parsing and Pony
Hey, let’s write a quick parser! A simple one, for the simple version of hoc from The Unix Programming Environment. The chapter on hoc is used to demonstrate the use of the yacc parser; it’s a dandy example to play with parsing.
I haven’t hand written a parser in a long time, so that’s what I want to do now. But there is a bit of a catch: it reads from stdin in Pony. And the interface for that is:
interface StdinNotify
"""
Notification for data arriving via stdin.
"""
fun ref apply(data: Array[U8] iso) =>
"""
Called when data is available on stdin.
"""
None
fun ref dispose() =>
"""
Called when no more data will arrive on stdin.
"""
None
This is Main, Pony’s pony.
Main wants to help you, not make you scream like a ruptured duck. Really.
OMG! I can’t even! That’s asynchronous! Pony is an asynchronous language. It’s gonna be callback hell! It’s Red vs. Blue all over again. Er, wrong link. It’s Red vs. Blue all over again. Devil bunnies! Aaaaaaaaiiigghhh!
Take a deep breath. Calm down. Remember way back in the ’90s, when object-oriented programming was new? Remember that letter to the editor of Computer Language, I think? The one that said an object was…
…a function with multiple entry points and persistent state? And that we’d discovered it was a bad idea in the ’70s? That object-oriented programming was a horrible mess? Yeah, that was funny.
And remember what Mittens always says, that back in the ’90s, we were all doing GUIs and it was all callbacks and we never really had any problems with it?
Yeah, as long as you don’t do a long-running computation on the OS/2 Presentation Manager interface thread and lock up the whole GUI.
Right. So calm down. It’ll be ok. Pony is an actor-based language, not imperative. All we have to do is figure out how to work with the language, rather than trying to swim upstream, working against it. And we’re not alone here; Simon Peyton Jones has been here before. In fact, here’s a copy of Wearing the hair shirt. Hang on to it, and if you start feeling light-headed, flip through it and look at the pretty clip art.
A grammar
Let’s start with something simple. Here’s a basic grammar for this version of hoc:
$$
\begin{array}{rcl}
\textrm{Start} & \rightarrow & \textrm{Expr}\ \\
\textrm{Expr} & \rightarrow & \textrm{Term}\ +\ \textrm{Expr} \\
& | & \textrm{Term}\ -\ \textrm{Expr} \\
& | & \textrm{Term}\ \\
\textrm{Term} & \rightarrow & \textrm{Number}\ *\ \textrm{Term} \\
& | & \textrm{Number}\ /\ \textrm{Term} \\
& | & \textrm{Number}\ \\
\textrm{Number} & \rightarrow & [0-9]+
\end{array}
$$
For a recursive descent parser–which only looks at an input character once and has a relatively simple control path—I needed to factor out the common prefixes, putting the grammar in something like Greibach normal form except with more Greibach.
$$
\begin{array}{rcl}
\textrm{Start} & \rightarrow & \textrm{Expr}\ \\
\textrm{Expr} & \rightarrow & \textrm{Term}\ \textrm{Expr2} \\
\textrm{Expr2} & \rightarrow & +\ \textrm{Expr} \\
& | & -\ \textrm{Expr} \\
& | & \textit{empty}\ \\
\textrm{Term} & \rightarrow & \textrm{Number}\ \textrm{Term2} \\
\textrm{Term2} & \rightarrow & *\ \textrm{Term} \\
& | & /\ \textrm{Term} \\
& | & \textit{empty}\ \\
\textrm{Number} & \rightarrow & [0-9]+
\end{array}
$$
That’s pretty simple, right? It also has the neat property that, if a rule is going to fail then it is going to fail while looking at the left-most symbol. As a result, the parser doesn’t have to have any complicated back-tracking machinery to restore the parse state after an individual rule fails.
In fact, to avoid too much complexity, I’m going to make several simplifying assumptions. For one thing, I don’t care about errors. At all, really. Error handling is a pain in the rump at the best of times, and I haven’t written a recursive descent parser since before I sat down and figured out how yacc works. I’m not really enjoying the process this time, either. Another thing I’m not caring about is Unicode. It’s 7-bit ASCII all the way down, here. And whitespace. I don’t care about whitespace; everything can be all mushed together.
Anyway, back to parsing. To build a recursive descent parser normally, you would write functions for each of the rules in the grammar, with each function trying to recognize its own expansion. For example, Expr would try to call Term and, if that succeeded, call Expr2. Term would call Number to read a number and would then call Term2. Term2 is a little more complicated, since it has several options. First, it would try to read a ‘*’. If that failed, without losing track of any input, it would have to try to read a ‘/’. If that fails, it needs to succeed while having read no input.
Pony
None of that works with the asynchronous interface provided by Pony for stdin. Instead of calling a function that waits for input, input from stdin is passed to Pony code through an StdinNotify interface implementation. Now, if you’re familiar with Javascript, your first inclination might be to start slinging inline callbacks all over the place. But to my mind, that is the wrong thing to do in Javascript.
But Pony, on the other hand, is an actor oriented language. I do not want to go into the formal definition of actors here, especially since it is not relevant. You can think of an actor as an object that has an associated execution thread: creating an actor also creates a thread that the actor uses to respond to incoming messages. Messages are the only way to communicate between threads (and, fortunately or unfortunately, behaviors (containing the definition of a message handler) syntactically resemble method definitions and message sends resemble method calls). Most of an actor’s lifetime is spent waiting for incoming messages, processing those messages, and sending messages of its own. Actors start when they are created and terminate when they have no outstanding messages to handle and no other actor has a reference to them so that no further messages can be sent to the actor.
One further note on the subject: Pony is an actor oriented language. Actors are the basic, fundamental components of a Pony program. “Main” is the name of the actor initially started by the program. If you attempt to write much serial code in Pony, bad things will happen. For one thing, Pony’s garbage collector operates on individual actors, and is only invoked when the actor is quiescent—not processing a message and not having any messages waiting. Without multiple actors interacting, the garbage collector will never run.
I can’t say I have spent a long time working with actors; Pony is the first actor-oriented programming language I have spent any time with, and, although it does have many concepts in common with the formal tools I have spent a few long weekends pondering, there are also significant differences. I believe the basic ideas cross over, though.
Think of actors as top-level concepts: an actor is created, waits to process messages, communicates with other actors, and eventually disappears. In Pony, using an inline callback handler to process input would be unnatural; passing the input to an separately defined, well-defined. actor for processing is more natural. So, that’s how we’re going to do it.
One important way to look at actors, by the way, is as the elements of serialization. Following its creation or beginning handling a message, an actor cannot pause or be interrupted until it finishes and goes back to wait for the next message. The execution of a behavior is serial, and since the only way for actors to interact is through messages, it is possible to think of the execution of a network of actors as happening one behavior execution at a time, atomically.
A parser framework
The problem here is that the sequential structure of the parsed text is represented, in a recursive descent parser, by control flow; the “recursive descent” part exists on the program’s control stack. That won’t work with Pony and asynchronous interfaces, so I need to come up with a representation of structure that will work. There will be stacks involved.
How about this:
A stack of syntactic rule expansions. Whenever a non-terminal is expanded, the expansion is pushed on the top of the stack. Each of the elements of the stack is a sequence of symbols, representing the expansion of the non-terminal. The currently active token or non-terminal is the first element of the top of the stack.
When a symbol succeeds, it is shifted off the top element. When the top element is empty, the expansion it represents has succeeded.
A stack of values. A succeeding token pushes the token on the stack. A succeeding non-terminal pops the rule’s values off the stack, combines them appropriately, and pushes the result on the stack.
This framework can be translated into Pony code, first by describing the symbols of the grammar:
trait Nonterminal[T]
"""
A non-terminal symbol in the grammar. Can be expanded, possibly multiple
times, into an array of symbols representing the right-hand side of the rule.
If the rule succeeds, manipulates the value stack to update the state of the
parse.
"""
new create(ruler: Parser[T])
fun ref expand(): Array[Symbol[T] ref] ref ?
fun ref matched(values: ValueStack[T] ref) => None
trait Token[T]
"""
A terminal symbol in the grammar. `apply` and `dispose` are roughly the same
as `StdinNotify`: they feed input to the token. If the token is recognized,
`value` produces the value of the token.
"""
fun ref apply(data: Array[U8] val)
fun ref dispose()
fun ref value(): (T^ | None) => None
type Symbol[T] is (Nonterminal[T] | Token[T])
Note that a good chunk of the Token
interface is the same as StdinNotify
: apply
and dispose
. Want to guess where input data is going to end up?
The next step is the right-hand side of a rule, the expansion of a non-terminal. For no readily apparent reason, the expand
method of Nonterminal
doesn’t produce an RHS
, but the array it does produce is immediately used to construct a RHS
. Que c’est, c’est.
class RHS[T]
"""
A nonterminal symbol expands to an array of symbols.
"""
let _a: Array[Symbol[T] ref] ref = _a.create()
new create(n: Nonterminal[T]) => _a.push(n)
new expand(n: Nonterminal[T]) ? => _a.concat(n.expand().values())
fun ref head(): Symbol[T] ? => _a(0)
fun ref shift(): Symbol[T]^ ? => _a.shift()
fun empty(): Bool => _a.size() == 0
fun debug() => Debug("----rhs: " + _a.size().string())
The next component is the RuleStack
:
class RuleStack[T]
let _s: Array[RHS[T]] ref = _s.create()
fun ref top(): RHS[T] ? => _s(_s.size() - 1)
fun ref push(rhs: RHS[T]) => _s.push(rhs)
fun ref pop(): RHS[T] ? => _s.pop()
fun ref current(): Current[T] =>
try
if is_empty() then RuleStackEmpty
elseif is_current_empty() then CurrentEmpty
else top().head()
end
else
IllegalState
end
fun is_empty(): Bool => _s.size() == 0
fun ref is_current_empty(): Bool =>
(not is_empty()) and try top().empty() else false end
fun ref is_current_token(): Bool => _is_current[Token[T]]()
fun ref is_current_nonterminal(): Bool => _is_current[Nonterminal[T]]()
fun ref _is_current[S: Symbol[T]](): Bool =>
try
(not is_empty()) and (not top().empty())
and match top().head() | let t: S => true else false end
else
false
end
fun debug() =>
Debug("---stack: " + _s.size().string())
for i in _s.values() do i.debug() end
The most important method here is current
, which attempts to return the current, head symbol of the parse state. The type Current
is:
primitive CurrentEmpty
primitive RuleStackEmpty
primitive IllegalState
type Current[T] is (Symbol[T] | CurrentEmpty | RuleStackEmpty | IllegalState)
CurrentEmpty
indicates that the top of the stack is an empty RHS. RuleStackEmpty
indicates thaht the RuleStack
is empty. IllegalState
indicates an illegal state for the parser, which shouldn’t happen (TM), but is required by Pony’s type system.
Finally, this is the value stack, ValueStack
:
class ValueStack[T]
let _s: Array[T] = _s.create()
fun ref push(value: T) =>
Debug("++push")
_s.push(consume value)
fun ref is_empty(): Bool => _s.size() == 0
fun ref pop(): (T^ | None) =>
Debug("++pop")
try _s.pop() else None end
fun ref swap() =>
if _s.size() > 1 then
try
let a = _s.pop()
let b = _s.pop()
_s.push(consume a)
_s.push(consume b)
end
end
It has a few helpful methods used by rules to manipulate the stack of values.
A parser
The parser itself is just the combination of the RuleStack
and the ValueStack
, along with a couple of other helpful references to other actors: a Writer
to write error messages, a Chain
used to pass along input once this parser has terminated, and a ParseResults
which is notified of the overall success or failure of the parser.
class Parser[T]
let _writer: Writer tag
let _chain: Chain tag
let _result: ParseResults tag
let _active: RuleStack[T] = _active.create()
let _values: ValueStack[T] = _values.create()
new start[Start: Nonterminal[T]](
writer: Writer tag,
result: ParseResults tag,
chain: (Chain tag | None) = None)
=>
_writer = writer
_result = result
_chain = match chain | let c: Chain tag => c else NullChain(writer) end
_active.push(RHS[T](Start.create(this)))
There is a snazzy type system trick in the start
method there. It takes a type parameter of Start
, a Nonterminal
, which is expanded to start the parsing process. An alternative would be to accept an instance of a Nonterminal
or something, but that wouldn’t be typeriffic, would it?
The Parser
class has four major methods:
input
anddispose
, which mirror the methods of theStdinNotify
trait and accept input into the parser,succeed
, which is called when the currently active token succeeds in reading whatever input it was reading, andfail
, which is called when the currently active token fails in reading whatever input it was expecting.
input
is:
fun ref input(data: Array[U8] iso) =>
Debug("-input")
match _active.current()
| let t: Token[T] => t(consume data)
| let n: Nonterminal[T] =>
// The head of the active stack is a un-expanded nonterminal. Attempt to
// expand it and apply the data to the resulting token; if expansion
// fails, the rule stack is empty and we're terminating the parse.
_expand_and_handle(consume data)
| CurrentEmpty =>
// The head of the active stack represents a successful parse of the next
// nonterminal up. Remove it, do_expand, and apply the input to the next
// token, if possible.
try
_active.pop()
(_active.top().shift() as Nonterminal[T]).matched(_values)
_expand_and_handle(consume data)
end
| RuleStackEmpty => _chain.input(consume data)
end
The basic idea here is:
If a
Token
is current, pass the data to it.If a
Nonterminal
is current, try to expand it. If that succeeds, recursively callinput
and the first branch will take it. If expansion fails, the parse is done, one way or another. Chain the input to the next thingy.This is the purpose of
_expand_and_handle
:fun ref _expand_and_handle(data: Array[U8] iso) => if _do_expand() then input(consume data) else _chain.input(consume data) end
If the current rule is empty, it has done its job: parsed some input. Pop it off, shift off the nonterminal at the head of the next rule and tell it to do its thing with the values, The try to expand the next symbol.
If the
RuleStack
is empty, the parse is unconditionally done. Chain the input on.
dispose
is very similar to input
except using the dispose
method to indicate the end of the input.
fun ref dispose() =>
Debug("-dispose")
match _active.current()
| let t: Token[T] => t.dispose()
| let n: Nonterminal[T] =>
// The head of the active stack is a un-expanded nonterminal. Attempt to
// expand it and notify the resulting token of the end of input; if
// expansion fails, the rule stack is empty and we're terminating the parse.
_expand_and_dispose()
| CurrentEmpty =>
// The head of the active stack represents a successful parse of the next
// nonterminal up. Remove it, do_expand, and notify the next token of the
// end of input, if possible.
try
_active.pop()
(_active.top().shift() as Nonterminal[T]).matched(_values)
_expand_and_dispose()
end
| RuleStackEmpty => _chain.dispose()
end
It, too, breaks down into four cases with the same logic as input
.
When a token succeeds, it calls succeed
, passing in the remaining, unread, part of the input.
fun ref succeed(remainder: Array[U8] iso) =>
"""
The current token has succeeded. Shift it off and try expanding the
next symbol in this rule. If a token is found, pass it the remaining
input. Otherwise, the rule stack is empty and the parse is terminating.
"""
Debug("-succeed")
if _active.is_current_token() then
try
match (_active.top().shift() as Token[T]).value()
| let t: T => _values.push(consume t)
end
end
_expand_and_handle(consume remainder)
else
_writer.err("Illegal state: succeeded with non-token?")
end
succeed
removes the current token from the RuleStack
, pushes its value (if it produces one) on the ValueStack
, then expands the next symbol and continues with the rest of the input.
fail
is similar to succeed
but with the opposite effect on the RuleStack
:
fun ref fail(remainder: Array[U8] iso) =>
"""
The current token has failed. Remove the current active rule and
try expanding the previous rule again. If a token is found, reapply
the input. Otherwise, the rule stack is empty and the parse is
terminating.
"""
Debug("-fail")
if _active.is_current_token() then
try
_active.pop()
end
_expand_and_handle(consume remainder)
else
_writer.err("Illegal state: failure with non-token?")
end
In this case, the active rule has failed, so it is removed and an attempt is made to re-expand the previous non-terminal (for cases like Expr2 which have multiple right-hand sides).
It is important to note that remainder
in this method includes the bytes that were passed to the failing token; it is all of the input from the end of the last token that was recognized.
All of these methods have called a helper, _do_expand
, which attempts to arrange for the current symbol at the head of the RuleStack
to be a Token
.
fun ref _do_expand(): Bool =>
Debug("-do_expand")
match _active.current()
| let t: Token[T] => true
| let n: Nonterminal[T] =>
try
// Attempt to expand the nonterminal. If successful, recurse to find
// the current token.
_active.push(RHS[T].expand(n))
_do_expand()
else
// Expanding the nonterminal failed. This rule has failed; remove it
// and recurse on the previous nonterminal.
try
_active.pop()
_do_expand()
else
// NOTREACHED
false
end
end
| CurrentEmpty =>
try
// The current rule has succeeded, probably trivially by expanding.
// Remove it, shift the successful nonterminal, and recurse.
_active.pop()
if not _active.is_empty() then
(_active.top().shift() as Nonterminal[T]).matched(_values)
end
_do_expand()
else
// NOTREACHED
false
end
| RuleStackEmpty =>
// Success (for some value of "success"). Terminate parse.
match _values.pop()
| None => _result.failed()
| let res: Stringable => _result.success(res.string())
end
false
else
_writer.err("Illegal state: rule stack broken on do_expand")
false
end
If the current symbol is already a token, it returns true.
If the current symbol is a non-terminal,
_do_expand
attempts to expand it, pushing the result on theRuleStack
and recurring. If it fails, this rule has failed and it is removed from theRuleStack
and, again, it recurs.If the top of the stack is empty, it is removed an the
Nonterminal
at the head of the nextRHS
has succeeded. (There has to be a non-terminal there since nothing else could have expanded to the currently-empty top of theRuleStack
.)If the
RuleStack
is empty, then we have reached the point where we are terminating the parse. For simplicity, I am assuming that if the parse left a value on theValueStack
, the parse succeeded; the value is passed as to theParserResults
. Otherwise, a failure notice is passed.
Rules
The implementation of the grammar starts simply:
type Value is (ISize | U8)
A value on the ValueStack
is either an ISize
(a machine-word-sized signed integer), or a byte. The integer will be either a parsed number or an intermediate result; the byte will be an operator such as ’*‘or’+’.
Non-terminals
The grammar is rooted with a starting non-terminal:
class StartingNonterminal is Nonterminal[Value]
fun ref expand(): Array[Symbol[Value] ref] ref ? =>
Debug("starting " + _state.string())
match _state
| 0 => _state = 1; Array[Symbol[Value] ref].create().push(Expr(_parser))
else error
end
I have elided all but the important part, the method to expand
the non-terminal. In this case, there is one right-hand side: expanding to an Expr.
The default implementation of matched
, called when the non-terminal has succeeded, leaves the ValueStack
alone, which is what we want in this case.
An Expr
has a little more meat.
class Expr is Nonterminal[Value]
fun ref expand(): Array[Symbol[Value] ref] ref ? =>
Debug("expr " + _state.string())
match _state
| 0 => _state = 1; Array[Symbol[Value]].create().push(Term(_parser)).push(Expr2(_parser))
else error
end
fun ref matched(values: ValueStack[Value] ref) =>
try
match values.pop()
| let v: ISize => values.push(v)
| let op: U8 if op == '+' =>
let snd = values.pop() as ISize
let fst = values.pop() as ISize
values.push(fst + snd)
| let op: U8 if op == '-' =>
let snd = values.pop() as ISize
let fst = values.pop() as ISize
values.push(fst - snd)
else
values.push(ISize(0))
end
else
values.push(ISize(0))
end
expand
similarly only expands into a single right-hand side:
$$
\begin{array}{rcl}
\textrm{Expr} & \rightarrow & \textrm{Term}\ \textrm{Expr2}
\end{array}
$$
However the result of succeeding is more complicated. matched
implements the result of matching an addition or subtraction: it looks at the operator byte, grabs the next two operands from the stack, and performs the operation. (See Expr2 below to find out how the ValueStack
got into a state to allow this.)
Expr2 is the grammar rule that recognizes addition or subtraction, or allows a single term to bubble up. (I’ll leave the rant about operator precedence in recursive descent parsers alone for now.)
class Expr2 is Nonterminal[Value]
fun ref expand(): Array[Symbol[Value] ref] ref ? =>
Debug("expr2 " + _state.string())
match _state
| 0 => _state = 1; Array[Symbol[Value]].create().push(CharToken(_parser, '+')).push(Expr(_parser))
| 1 => _state = 2; Array[Symbol[Value]].create().push(CharToken(_parser, '-')).push(Expr(_parser))
| 2 => _state = 3; Array[Symbol[Value]].create()
else error
end
fun ref matched(values: ValueStack[Value] ref) =>
match _state
| 1 => values.swap() // Put operator on top of stack
| 2 => values.swap() // Put operator on top of stack
| 3 => None
else
None
end
Remember that the Expr2 rule has three possibilities:
$$
\begin{array}{rcl}
\textrm{Expr2} & \rightarrow & +\ \textrm{Expr} \\
& | & -\ \textrm{Expr} \\
& | & \textit{empty}
\end{array}
$$
An addition, a subtraction, and an empty rule. This is what expand
implements: the first time it is called, this non-terminal expands as an addition; if that fails, it becomes a subtraction; if that fails, then it becomes an empty rule, which always succeeds. This is what expand
implements.
matched
, on the other hand, has a bit of a peculiar responsibility. At the point when this rule succeeds, the ValueStack
will have (at least) three values: a number from the Term part of Expr, a byte representing the operator, and another number from the Expr. All in that order.
To make the processing up in Expr.matched
easier, this rule swaps the top two elements, leaving the operator on top of the stack.
Oh, and how does it know the difference between the first and second branches and the third, which doesn’t have an operator? If _state
is 1, the first branch succeeded; likewise, 2 and 3.
Term and Term2 are very similar to Expr and Expr2, albeit recognizing multiplication and division.
Tokens
There are two Token
classes: single-character operators and numbers.
class CharToken is Token[Value]
let _parser: Parser[Value]
let _ch: U8
new create(parser: Parser[Value], ch: U8) => _parser = parser; _ch = ch
fun ref apply(data: Array[U8] val) =>
try
if (data.size() < 1) or (data(0) != _ch) then
_parser.fail(recover Array[U8].append(data) end)
else
_parser.succeed(recover Array[U8].append(data,1) end)
end
end
fun ref dispose() => _parser.fail(recover Array[U8] end)
fun value(): U8 => _ch
The CharToken
is the simpler of the two; the character to look for is passed to the constructor. If that character is not seen at the head of the input, CharToken
fails, returning all the input to the Parser
. Otherwise, it succeeds, allowing the character to be placed on the ValueStack
and returning the rest of the input.
NumberToken
has a few more moving parts. When apply
is called with fresh input, it scans the input, recording those characters that are numerical digits.
class NumberToken is Token[Value]
let _token: Array[U8] = _token.create()
fun ref apply(data: Array[U8] val) =>
try
for i in Range(0, data.size()) do
if (data(i) >= 0x30) and (data(i) <= 0x39) then
_token.push(data(i))
elseif _token.size() > 0 then
_parser.succeed(recover Array[U8].append(data, i) end)
return
else
_parser.fail(recover Array[U8].append(data) end)
return
end
end
end
fun ref dispose() =>
if _token.size() > 0 then
_parser.succeed(recover Array[U8] end)
else
_parser.fail(recover Array[U8] end)
end
fun ref value(): ISize =>
let a = recover iso Array[U8]() end
for v in _token.values() do
a.push(v)
end
try
String.from_array(consume a).isize()
else
0
end
However, it cannot report success until it finds a character which is not a digit, or finds the end of the file. Otherwise, it could leave parts of a number on the input stream, resulting in a parse error. Managing this one group of characters is the one example of backtracking in this parser.
Anyway, once the NumberToken
has found the first non-numeric character, it can declare success if it has previously seen digits, or failure if it has not.
The actor
I implied at the top of this article that there would be actors involved, but there hasn’t been nary a sight of one yet. It’s all been plain, sequential Pony code. Let’s remedy that.
actor ParserActor[T]
let _parser: Parser[T]
new start[Start: Nonterminal[T]](
writer: Writer tag,
result: ParseResults tag,
chain: (Chain tag | None) = None)
=>
_parser = Parser[T].start[Start](writer, result, chain)
be apply(data: Array[U8] iso) => _parser.input(consume data)
be dispose() => _parser.dispose()
This is the actor which owns the parser state. It’s created with the same arguments as the Parser
, and creates an instance of the Parser
using those arguments. I mentioned earlier that an actor was the unit of sequential execution in Pony; this actor controls the execution of the Parser
, ensuring, for example, that a given block of input is fully consumed before the message containing next input is accepted.
Does it work?
Heck, yeah, it works. Think I’d post this article if it didn’t work?
$ echo -n 2+3*4 | ./hoc 2>&1
Parse succeeded: 14
Chain: 0 bytes
How does it work? Let’s take a look at the state of the rule stack during the execution. The first input causes the StartingNonterminal
to be expanded, producing:
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ NumberToken Term2 ]
Note: In computer science, stacks and trees grow down. Don’t ask me why, I just work here.
At this point, a token, NumberToken
, is the current symbol. It reads the ‘2’, producing the following rule stack:
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
and the value stack:
2
The rule stack gets expanded to:
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ * Term ]
The next character, however, is a ‘+’. So that fails, and Term2
is re-expanded.
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ / Term ]
That still fails.
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ ]
That succeeds, though. In this particular state, where it accepts empty, Term2
does not do anything to the value stack.
[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ ]
Likewise, Term
, when succeeding, leaves the value stack alone when the top element is a number.
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
Expr2
is expanded (remember, we’re still looking at the ‘+’ in the input “2+3*4“) to:
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ + Expr ]
Yay! This accepts our ‘+’! The CharToken(+)
is shifted from the current rule, and the value stack becomes:
2
+
The Expr
is expanded, producing the rule stack:
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ NumberToken Term2 ]
The NumberToken
accepts the ‘3’, is shifted from the rule stack, and the Term2
is expanded, producing a value stack of:
2
+
3
and a rule stack of:
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ * Term ]
The ’*’ is accepted and the value and rule stacks are transformed into:
2
+
3
*
and
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ Term ]
[ NumberToken Term2 ]
The final ‘4’ is read leaving us with a value stack of:
2
+
3
*
4
and a rule stack of:
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ Term ]
[ Term2 ]
Since the input is completed at this point, neither of the first two options for Term2
can succeed and the empty branch is taken. Term2
succeeds without modifying the value stack, as does Term
. The rule stack is now:
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
And the value stack is:
2
+
3
*
4
This Term2
is the one that accepted the ’*’, however; it is in a state to modify the value stack on succeeding:
2
+
3
4
*
Then, when the Term
succeeds, it performs the multiplication, leaving the value stack as:
2
+
12
The rule stack now is:
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Expr2 ]
Neither of the first two branches of Expr2
can succeed, so it accepts with the empty branch followed by Expr
succeeding. Neither of these modify the value stack.
[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
This Expr2
accepted the ‘+’ in “2+3*4“, so it rearranges the value stack when it succeeds.
2
12
+
So that when Expr
succeeds, it performs the addition.
14
The original StartingNonterminal
succeeds as well, resulting in a terminal state: an empty rule stack. The value of the computation is 14, and no additional input needed to be passed on to the chained receiver.
Conclusion
So, what have we achieved here?
Certainly not a parsing system that I’d want to use again. Bleah. I don’t even want to look at it now.
Fortunately, there are some positive outcomes.
This is a proof-of-concept online parser in Pony that doesn’t rely on blocking threads, blocking coroutines, or blocking anything. That’s good, right? Online algorithms are darned useful, especially for network-related tasks like parsing HTTP (ok, that’s easy anyway), and in Pony (which, I remind you, doesn’t have any blocking operations) they’re a necessity. That’s something you can’t say for traditionally developed recursive descent parsers, or the results of most parser generators like yacc.
There are other online parsing techniques, however. Derivative-based parsing is one appealing approach, but GLR parsing and possibly CYK parsing and Earley parsing are other options.
Pony’s actors work with asynchronous interfaces to write code that, nasty parsing algorithms aside, is relatively simple and clear. It takes some work to get used to the coding style, to transition from an imperative (or traditionally functional) style. Whether many programmers are willing to make the transition is another question, as is whether it is ultimately worthwhile. But as for me, I like it.