Push-parsing and Pony

Posted on July 2, 2016 by Tommy M. McGuire
Labels: pony, toy problems, programming language

Hey, let’s write a quick parser! A simple one, for the simple version of hoc from The Unix Programming Environment. The chapter on hoc demonstrates the use of the yacc parser generator; it’s a dandy example for playing with parsing.

I haven’t hand-written a parser in a long time, so that’s what I want to do now. But there is a bit of a catch: it reads from stdin, and in Pony the interface for that is:

interface StdinNotify
  """
  Notification for data arriving via stdin.
  """
  fun ref apply(data: Array[U8] iso) =>
    """
    Called when data is available on stdin.
    """
    None

  fun ref dispose() =>
    """
    Called when no more data will arrive on stdin.
    """
    None

This is Main, Pony’s pony.

Main wants to help you, not make you scream like a ruptured duck. Really.

OMG! I can’t even! That’s asynchronous! Pony is an asynchronous language. It’s gonna be callback hell! It’s Red vs. Blue all over again. Er, wrong link. It’s Red vs. Blue all over again. Devil bunnies! Aaaaaaaaiiigghhh!

Take a deep breath. Calm down. Remember way back in the ’90s, when object-oriented programming was new? Remember that letter to the editor of Computer Language, I think? The one that said an object was…

…a function with multiple entry points and persistent state? And that we’d discovered it was a bad idea in the ’70s? That object-oriented programming was a horrible mess? Yeah, that was funny.

And remember what Mittens always says, that back in the ’90s, we were all doing GUIs and it was all callbacks and we never really had any problems with it?

Yeah, as long as you don’t do a long-running computation on the OS/2 Presentation Manager interface thread and lock up the whole GUI.

Right. So calm down. It’ll be ok. Pony is an actor-based language, not imperative. All we have to do is figure out how to work with the language, rather than trying to swim upstream, working against it. And we’re not alone here; Simon Peyton Jones has been here before. In fact, here’s a copy of Wearing the hair shirt. Hang on to it, and if you start feeling light-headed, flip through it and look at the pretty clip art.

A grammar

Let’s start with something simple. Here’s a basic grammar for this version of hoc:


$$ \begin{array}{rcl} \textrm{Start} & \rightarrow & \textrm{Expr}\ \\ \textrm{Expr} & \rightarrow & \textrm{Term}\ +\ \textrm{Expr} \\ & | & \textrm{Term}\ -\ \textrm{Expr} \\ & | & \textrm{Term}\ \\ \textrm{Term} & \rightarrow & \textrm{Number}\ *\ \textrm{Term} \\ & | & \textrm{Number}\ /\ \textrm{Term} \\ & | & \textrm{Number}\ \\ \textrm{Number} & \rightarrow & [0-9]+ \end{array} $$

For a recursive descent parser, which looks at each input character only once and has a relatively simple control path, I needed to factor out the common prefixes, putting the grammar into something like Greibach normal form, except with more Greibach.


$$ \begin{array}{rcl} \textrm{Start} & \rightarrow & \textrm{Expr}\ \\ \textrm{Expr} & \rightarrow & \textrm{Term}\ \textrm{Expr2} \\ \textrm{Expr2} & \rightarrow & +\ \textrm{Expr} \\ & | & -\ \textrm{Expr} \\ & | & \textit{empty}\ \\ \textrm{Term} & \rightarrow & \textrm{Number}\ \textrm{Term2} \\ \textrm{Term2} & \rightarrow & *\ \textrm{Term} \\ & | & /\ \textrm{Term} \\ & | & \textit{empty}\ \\ \textrm{Number} & \rightarrow & [0-9]+ \end{array} $$

That’s pretty simple, right? It also has the neat property that, if a rule is going to fail then it is going to fail while looking at the left-most symbol. As a result, the parser doesn’t have to have any complicated back-tracking machinery to restore the parse state after an individual rule fails.

In fact, to avoid too much complexity, I’m going to make several simplifying assumptions. For one thing, I don’t care about errors. At all, really. Error handling is a pain in the rump at the best of times, and I haven’t written a recursive descent parser since before I sat down and figured out how yacc works. I’m not really enjoying the process this time, either. Another thing I’m not caring about is Unicode. It’s 7-bit ASCII all the way down, here. And whitespace. I don’t care about whitespace; everything can be all mushed together.

Anyway, back to parsing. To build a recursive descent parser normally, you would write functions for each of the rules in the grammar, with each function trying to recognize its own expansion. For example, Expr would try to call Term and, if that succeeded, call Expr2. Term would call Number to read a number and would then call Term2. Term2 is a little more complicated, since it has several options. First, it would try to read a ‘*’. If that failed, without losing track of any input, it would have to try to read a ‘/’. If that failed too, it would need to succeed while having read no input.
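
For contrast, here is roughly what that traditional, blocking structure looks like. This is a hypothetical sketch, not part of the real program: it only recognizes (it does not evaluate), and it reads from an in-memory String, since that is the only thing Pony will let us do synchronously.

class RDParser
  """
  Hypothetical sketch of a conventional recursive descent recognizer over a
  buffered string, shown only for comparison.
  """
  let _input: String
  var _pos: USize = 0

  new create(input: String) => _input = input

  fun ref expr(): Bool => term() and expr2()

  fun ref expr2(): Bool =>
    // Try '+', then '-'; if neither matches, succeed having read nothing.
    if _eat('+') or _eat('-') then expr() else true end

  fun ref term(): Bool => number() and term2()

  fun ref term2(): Bool =>
    if _eat('*') or _eat('/') then term() else true end

  fun ref number(): Bool =>
    // One or more digits.
    var seen = false
    while _digit() do
      _pos = _pos + 1
      seen = true
    end
    seen

  fun ref _eat(ch: U8): Bool =>
    // Consume ch if it is next; otherwise leave the input untouched.
    try
      if _input(_pos) == ch then _pos = _pos + 1; true else false end
    else
      false
    end

  fun _digit(): Bool =>
    try (_input(_pos) >= '0') and (_input(_pos) <= '9') else false end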

Pony

None of that works with the asynchronous interface provided by Pony for stdin. Instead of calling a function that waits for input, input from stdin is passed to Pony code through a StdinNotify interface implementation. Now, if you’re familiar with Javascript, your first inclination might be to start slinging inline callbacks all over the place. But to my mind, that is the wrong thing to do in Javascript.

But Pony, on the other hand, is an actor-oriented language. I do not want to go into the formal definition of actors here, especially since it is not relevant. You can think of an actor as an object with an associated execution thread: creating an actor also creates a thread that the actor uses to respond to incoming messages. Messages are the only way to communicate between threads, and, fortunately or unfortunately, behaviors (which define message handlers) syntactically resemble method definitions, while message sends resemble method calls. Most of an actor’s lifetime is spent waiting for incoming messages, processing those messages, and sending messages of its own. Actors start when they are created and terminate when they have no outstanding messages to handle and no other actor holds a reference to them, so that no further messages can be sent to them.
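
If you have not seen Pony before, here is a minimal, hypothetical actor that has nothing to do with the parser, just to show the shape of the thing: be introduces a behavior, an asynchronous message handler, and what looks like a method call on an actor is actually a message send.

actor Counter
  """
  A toy actor. Behaviors run one at a time, and messages from a given sender
  arrive in the order they were sent.
  """
  let _out: OutStream
  var _count: U64 = 0

  new create(out: OutStream) => _out = out

  be increment() => _count = _count + 1

  be report() => _out.print(_count.string())

actor Main
  new create(env: Env) =>
    let counter = Counter(env.out)
    counter.increment() // a message send; returns immediately
    counter.increment()
    counter.report()    // eventually prints "2"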

One further note on the subject: Pony is an actor-oriented language. Actors are the basic, fundamental components of a Pony program. “Main” is the name of the actor initially started by the program. If you attempt to write much serial code in Pony, bad things will happen. For one thing, Pony’s garbage collector operates on individual actors, and is only invoked when an actor is quiescent: not processing a message and not having any messages waiting. Without multiple actors interacting, the garbage collector will never run.

I can’t say I have spent a long time working with actors; Pony is the first actor-oriented programming language I have spent any time with, and, although it does have many concepts in common with the formal tools I have spent a few long weekends pondering, there are also significant differences. I believe the basic ideas cross over, though.

Think of actors as top-level concepts: an actor is created, waits to process messages, communicates with other actors, and eventually disappears. In Pony, using an inline callback handler to process input would be unnatural; passing the input to a separate, well-defined actor for processing is more natural. So, that’s how we’re going to do it.

One important way to look at actors, by the way, is as the units of serialization. Once an actor starts its constructor or begins handling a message, it cannot pause or be interrupted until it finishes and goes back to waiting for the next message. The execution of a behavior is serial, and since the only way for actors to interact is through messages, it is possible to think of the execution of a network of actors as happening one behavior execution at a time, atomically.

A parser framework

The problem here is that the sequential structure of the parsed text is represented, in a recursive descent parser, by control flow; the “recursive descent” part exists on the program’s control stack. That won’t work with Pony and asynchronous interfaces, so I need to come up with a representation of structure that will work. There will be stacks involved.

How about this: make the structure explicit as data rather than control flow. Keep a stack of partially-expanded right-hand sides, taking the place of the call stack in a recursive descent parser, plus a separate stack of semantic values, and let each chunk of input that arrives drive those stacks forward.

This framework can be translated into Pony code, first by describing the symbols of the grammar:

trait Nonterminal[T]
  """
  A non-terminal symbol in the grammar. Can be expanded, possibly multiple
  times, into an array of symbols representing the right-hand side of the rule.
  If the rule succeeds, manipulates the value stack to update the state of the
  parse.
  """
  new create(ruler: Parser[T])
  fun ref expand(): Array[Symbol[T] ref] ref ?
  fun ref matched(values: ValueStack[T] ref) => None

trait Token[T]
  """
  A terminal symbol in the grammar. `apply` and `dispose` are roughly the same
  as `StdinNotify`: they feed input to the token. If the token is recognized,
  `value` produces the value of the token.
  """
  fun ref apply(data: Array[U8] val)
  fun ref dispose()
  fun ref value(): (T^ | None) => None

type Symbol[T] is (Nonterminal[T] | Token[T])

Note that a good chunk of the Token interface is the same as StdinNotify: apply and dispose. Want to guess where input data is going to end up?

The next step is the right-hand side of a rule, the expansion of a non-terminal. For no readily apparent reason, the expand method of Nonterminal doesn’t produce an RHS, but the array it does produce is immediately used to construct an RHS. It is what it is.

class RHS[T]
  """
  A nonterminal symbol expands to an array of symbols.
  """
  let _a: Array[Symbol[T] ref] ref = _a.create()

  new create(n: Nonterminal[T]) => _a.push(n)
  new expand(n: Nonterminal[T]) ? => _a.concat(n.expand().values())

  fun ref head(): Symbol[T] ? => _a(0)
  fun ref shift(): Symbol[T]^ ? => _a.shift()
  fun empty(): Bool => _a.size() == 0

  fun debug() => Debug("----rhs: " + _a.size().string())

The next component is the RuleStack:

class RuleStack[T]
  let _s: Array[RHS[T]] ref = _s.create()

  fun ref top(): RHS[T] ?   => _s(_s.size() - 1)
  fun ref push(rhs: RHS[T]) => _s.push(rhs)
  fun ref pop(): RHS[T] ?   => _s.pop()

  fun ref current(): Current[T] =>
    try
      if is_empty() then RuleStackEmpty
      elseif is_current_empty() then CurrentEmpty
      else top().head()
      end
    else
      IllegalState
    end

  fun is_empty(): Bool    => _s.size() == 0

  fun ref is_current_empty(): Bool =>
    (not is_empty()) and try top().empty() else false end

  fun ref is_current_token(): Bool => _is_current[Token[T]]()
  fun ref is_current_nonterminal(): Bool => _is_current[Nonterminal[T]]()

  fun ref _is_current[S: Symbol[T]](): Bool =>
    try
      (not is_empty()) and (not top().empty())
      and match top().head() | let t: S => true else false end
    else
      false
    end

  fun debug() =>
    Debug("---stack: " + _s.size().string())
    for i in _s.values() do i.debug() end

The most important method here is current, which attempts to return the current, head symbol of the parse state. The type Current is:

primitive CurrentEmpty
primitive RuleStackEmpty
primitive IllegalState
type Current[T] is (Symbol[T] | CurrentEmpty | RuleStackEmpty | IllegalState)

CurrentEmpty indicates that the top of the stack is an empty RHS. RuleStackEmpty indicates that the RuleStack is empty. IllegalState indicates an illegal state for the parser, which shouldn’t happen (TM), but is required by Pony’s type system.

Finally, this is the value stack, ValueStack:

class ValueStack[T]
  let _s: Array[T] = _s.create()

  fun ref push(value: T) =>
    Debug("++push")
    _s.push(consume value)

  fun ref is_empty(): Bool => _s.size() == 0

  fun ref pop(): (T^ | None) =>
    Debug("++pop")
    try _s.pop() else None end

  fun ref swap() =>
    if _s.size() > 1 then
      try
        let a = _s.pop()
        let b = _s.pop()
        _s.push(consume a)
        _s.push(consume b)
      end
    end

It has a few helpful methods used by rules to manipulate the stack of values.

A parser

The parser itself is just the combination of the RuleStack and the ValueStack, along with references to a few other helpful actors: a Writer to write error messages, a Chain used to pass along input once this parser has terminated, and a ParseResults which is notified of the overall success or failure of the parse.

class Parser[T]
  let _writer: Writer tag
  let _chain: Chain tag
  let _result: ParseResults tag
  let _active: RuleStack[T] = _active.create()
  let _values: ValueStack[T] = _values.create()

  new start[Start: Nonterminal[T]](
    writer: Writer tag,
    result: ParseResults tag,
    chain: (Chain tag | None) = None)
   =>
    _writer = writer
    _result = result
    _chain = match chain | let c: Chain tag => c else NullChain(writer) end
    _active.push(RHS[T](Start.create(this)))

There is a snazzy type system trick in the start method there. It takes a type parameter of Start, a Nonterminal, which is expanded to start the parsing process. An alternative would be to accept an instance of a Nonterminal or something, but that wouldn’t be typeriffic, would it?
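
So, assuming writer and result actors have been constructed somewhere (they are not shown here), kicking off a parse of the grammar described below looks something like this:

  // Hypothetical usage; Value and StartingNonterminal are defined later in
  // this post, and writer and result come from elsewhere.
  let parser = Parser[Value].start[StartingNonterminal](writer, result)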

The Parser class has four major methods: input, dispose, succeed, and fail.

input is:

  fun ref input(data: Array[U8] iso) =>
    Debug("-input")
    match _active.current()
    | let t: Token[T] => t(consume data)
    | let n: Nonterminal[T] =>
      // The head of the active stack is an un-expanded nonterminal. Attempt to
      // expand it and apply the data to the resulting token; if expansion
      // fails, the rule stack is empty and we're terminating the parse.
      _expand_and_handle(consume data)
    | CurrentEmpty =>
      // The head of the active stack represents a successful parse of the next
      // nonterminal up. Remove it, do_expand, and apply the input to the next
      // token, if possible.
      try
        _active.pop()
        (_active.top().shift() as Nonterminal[T]).matched(_values)
        _expand_and_handle(consume data)
      end
    | RuleStackEmpty => _chain.input(consume data)
    end

The basic idea here is:

  1. If the current symbol is a Token, feed it the data.
  2. If it is an un-expanded Nonterminal, expand until a Token is current and feed that token the data.
  3. If the current right-hand side is empty, the rule has succeeded: pop it, shift the nonterminal that produced it off the rule above and notify it that it matched, then expand and feed the data as above.
  4. If the rule stack is empty, the parse has already terminated, so the data is passed along the chain.

dispose is very similar to input, except that it uses the dispose method to indicate the end of the input.

  fun ref dispose() =>
    Debug("-dispose")
    match _active.current()
    | let t: Token[T] => t.dispose()
    | let n: Nonterminal[T] =>
      // The head of the active stack is an un-expanded nonterminal. Attempt to
      // expand it and notify the resulting token of the end of input; if
      // expansion fails, the rule stack is empty and we're terminating the parse.
      _expand_and_dispose()
    | CurrentEmpty =>
      // The head of the active stack represents a successful parse of the next
      // nonterminal up. Remove it, do_expand, and notify the next token of the
      // end of input, if possible.
      try
        _active.pop()
        (_active.top().shift() as Nonterminal[T]).matched(_values)
        _expand_and_dispose()
      end
    | RuleStackEmpty => _chain.dispose()
    end

It, too, breaks down into four cases with the same logic as input.

When a token succeeds, it calls succeed, passing in the remaining, unread part of the input.

  fun ref succeed(remainder: Array[U8] iso) =>
    """
    The current token has succeeded. Shift it off and try expanding the
    next symbol in this rule. If a token is found, pass it the remaining
    input. Otherwise, the rule stack is empty and the parse is terminating.
    """
    Debug("-succeed")
    if _active.is_current_token() then
      try
        match (_active.top().shift() as Token[T]).value()
        | let t: T => _values.push(consume t)
        end
      end
      _expand_and_handle(consume remainder)
    else
      _writer.err("Illegal state: succeeded with non-token?")
    end

succeed removes the current token from the RuleStack, pushes its value (if it produces one) on the ValueStack, then expands the next symbol and continues with the rest of the input.

fail is similar to succeed but with the opposite effect on the RuleStack:

  fun ref fail(remainder: Array[U8] iso) =>
    """
    The current token has failed. Remove the current active rule and
    try expanding the previous rule again. If a token is found, reapply
    the input. Otherwise, the rule stack is empty and the parse is
    terminating.
    """
    Debug("-fail")
    if _active.is_current_token() then
      try
        _active.pop()
      end
      _expand_and_handle(consume remainder)
    else
      _writer.err("Illegal state: failure with non-token?")
    end

In this case, the active rule has failed, so it is removed and an attempt is made to re-expand the previous non-terminal (for cases like Expr2 which have multiple right-hand sides).

It is important to note that remainder in this method includes the bytes that were passed to the failing token; it is all of the input from the end of the last token that was recognized.

All of these methods ultimately rely on a helper, _do_expand, which attempts to arrange for the current symbol at the head of the RuleStack to be a Token.

  fun ref _do_expand(): Bool =>
    Debug("-do_expand")
    match _active.current()
    | let t: Token[T] => true
    | let n: Nonterminal[T] =>
      try
        // Attempt to expand the nonterminal. If successful, recurse to find
        // the current token.
        _active.push(RHS[T].expand(n))
        _do_expand()
      else
        // Expanding the nonterminal failed. This rule has failed; remove it
        // and recurse on the previous nonterminal.
        try
          _active.pop()
          _do_expand()
        else
          // NOTREACHED
          false
        end
      end
    | CurrentEmpty =>
      try
        // The current rule has succeeded, probably trivially by expanding.
        // Remove it, shift the successful nonterminal, and recurse.
        _active.pop()
        if not _active.is_empty() then
          (_active.top().shift() as Nonterminal[T]).matched(_values)
        end
        _do_expand()
      else
        // NOTREACHED
        false
      end
    | RuleStackEmpty =>
      // Success (for some value of "success"). Terminate parse.
      match _values.pop()
      | None => _result.failed()
      | let res: Stringable => _result.success(res.string())
      end
      false
    else
      _writer.err("Illegal state: rule stack broken on do_expand")
      false
    end
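
The _expand_and_handle and _expand_and_dispose helpers called by input, dispose, succeed, and fail are not shown in this post. A plausible sketch, assuming they do nothing more than run _do_expand and then hand the input (or the end-of-input notification) to whatever token is now current, passing anything left over along the chain otherwise:

  fun ref _expand_and_handle(data: Array[U8] iso) =>
    // Hypothetical: expand until a token is current, then feed it the input.
    if _do_expand() then
      match _active.current()
      | let t: Token[T] => t(consume data)
      end
    else
      // The rule stack emptied while expanding; the parse is over, so the
      // remaining input belongs to the chained receiver.
      _chain.input(consume data)
    end

  fun ref _expand_and_dispose() =>
    // Hypothetical: the same, but signalling the end of input instead.
    if _do_expand() then
      match _active.current()
      | let t: Token[T] => t.dispose()
      end
    else
      _chain.dispose()
    end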

Rules

The implementation of the grammar starts simply:

type Value is (ISize | U8)

A value on the ValueStack is either an ISize (a machine-word-sized signed integer), or a byte. The integer will be either a parsed number or an intermediate result; the byte will be an operator such as ‘*’ or ‘+’.

Non-terminals

The grammar is rooted with a starting non-terminal:

class StartingNonterminal is Nonterminal[Value]

  fun ref expand(): Array[Symbol[Value] ref] ref ? =>
    Debug("starting " + _state.string())
    match _state
    | 0 => _state = 1; Array[Symbol[Value] ref].create().push(Expr(_parser))
    else error
    end

I have elided all but the important part, the method to expand the non-terminal. In this case, there is one right-hand side: expanding to an Expr.

The default implementation of matched, called when the non-terminal has succeeded, leaves the ValueStack alone, which is what we want in this case.

An Expr has a little more meat.

class Expr is Nonterminal[Value]

  fun ref expand(): Array[Symbol[Value] ref] ref ? =>
    Debug("expr " + _state.string())
    match _state
    | 0 => _state = 1; Array[Symbol[Value]].create().push(Term(_parser)).push(Expr2(_parser))
    else error
    end

  fun ref matched(values: ValueStack[Value] ref) =>
    try
      match values.pop()
      | let v: ISize => values.push(v)
      | let op: U8 if op == '+' =>
        let snd = values.pop() as ISize
        let fst = values.pop() as ISize
        values.push(fst + snd)
      | let op: U8 if op == '-' =>
        let snd = values.pop() as ISize
        let fst = values.pop() as ISize
        values.push(fst - snd)
      else
        values.push(ISize(0))
      end
    else
      values.push(ISize(0))
    end

expand similarly only expands into a single right-hand side:


$$ \begin{array}{rcl} \textrm{Expr} & \rightarrow & \textrm{Term}\ \textrm{Expr2} \end{array} $$

However the result of succeeding is more complicated. matched implements the result of matching an addition or subtraction: it looks at the operator byte, grabs the next two operands from the stack, and performs the operation. (See Expr2 below to find out how the ValueStack got into a state to allow this.)

Expr2 is the grammar rule that recognizes addition or subtraction, or allows a single term to bubble up. (I’ll leave the rant about operator precedence in recursive descent parsers alone for now.)

class Expr2 is Nonterminal[Value]

  fun ref expand(): Array[Symbol[Value] ref] ref ? =>
    Debug("expr2 " + _state.string())
    match _state
    | 0 => _state = 1; Array[Symbol[Value]].create().push(CharToken(_parser, '+')).push(Expr(_parser))
    | 1 => _state = 2; Array[Symbol[Value]].create().push(CharToken(_parser, '-')).push(Expr(_parser))
    | 2 => _state = 3; Array[Symbol[Value]].create()
    else error
    end

  fun ref matched(values: ValueStack[Value] ref) =>
    match _state
    | 1 => values.swap() // Put operator on top of stack
    | 2 => values.swap() // Put operator on top of stack
    | 3 => None
    else
      None
    end

Remember that the Expr2 rule has three possibilities:


$$ \begin{array}{rcl} \textrm{Expr2} & \rightarrow & +\ \textrm{Expr} \\ & | & -\ \textrm{Expr} \\ & | & \textit{empty} \end{array} $$

An addition, a subtraction, and an empty rule. This is what expand implements: the first time it is called, this non-terminal expands as an addition; if that fails, it becomes a subtraction; if that fails, it becomes an empty rule, which always succeeds.

matched, on the other hand, has a bit of a peculiar responsibility. At the point when this rule succeeds, the ValueStack will have (at least) three values: a number from the Term part of Expr, a byte representing the operator, and another number from the Expr. All in that order.

To make the processing up in Expr.matched easier, this rule swaps the top two elements, leaving the operator on top of the stack.

Oh, and how does it know the difference between the first and second branches and the third, which doesn’t have an operator? If _state is 1, the first branch succeeded; likewise, 2 and 3.

Term and Term2 are very similar to Expr and Expr2, albeit recognizing multiplication and division.
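
They are not shown in the post, but transliterating Expr2 gives a plausible sketch of Term2 (with the same fields and constructor elided as in the listings above); Term would mirror Expr in the same way, performing the multiplication or division in its matched.

class Term2 is Nonterminal[Value]

  fun ref expand(): Array[Symbol[Value] ref] ref ? =>
    Debug("term2 " + _state.string())
    match _state
    | 0 => _state = 1; Array[Symbol[Value]].create().push(CharToken(_parser, '*')).push(Term(_parser))
    | 1 => _state = 2; Array[Symbol[Value]].create().push(CharToken(_parser, '/')).push(Term(_parser))
    | 2 => _state = 3; Array[Symbol[Value]].create()
    else error
    end

  fun ref matched(values: ValueStack[Value] ref) =>
    match _state
    | 1 => values.swap() // Put operator on top of stack
    | 2 => values.swap() // Put operator on top of stack
    else
      None
    end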

Tokens

There are two Token classes: single-character operators and numbers.

class CharToken is Token[Value]
  let _parser: Parser[Value]
  let _ch: U8

  new create(parser: Parser[Value], ch: U8) => _parser = parser; _ch = ch

  fun ref apply(data: Array[U8] val) =>
    try
      if (data.size() < 1) or (data(0) != _ch) then
        _parser.fail(recover Array[U8].append(data) end)
      else
        _parser.succeed(recover Array[U8].append(data,1) end)
      end
    end

  fun ref dispose() => _parser.fail(recover Array[U8] end)

  fun value(): U8 => _ch

The CharToken is the simpler of the two; the character to look for is passed to the constructor. If that character is not seen at the head of the input, CharToken fails, returning all the input to the Parser. Otherwise, it succeeds, allowing the character to be placed on the ValueStack and returning the rest of the input.

NumberToken has a few more moving parts. When apply is called with fresh input, it scans the input, recording those characters that are numerical digits.

class NumberToken is Token[Value]
  let _token: Array[U8] = _token.create()

  fun ref apply(data: Array[U8] val) =>
    try
      for i in Range(0, data.size()) do
        if (data(i) >= 0x30) and (data(i) <= 0x39) then
          _token.push(data(i))
        elseif _token.size() > 0 then
          _parser.succeed(recover Array[U8].append(data, i) end)
          return
        else
          _parser.fail(recover Array[U8].append(data) end)
          return
        end
      end
    end

  fun ref dispose() =>
    if _token.size() > 0 then
      _parser.succeed(recover Array[U8] end)
    else
      _parser.fail(recover Array[U8] end)
    end

  fun ref value(): ISize =>
    let a = recover iso Array[U8]() end
    for v in _token.values() do
      a.push(v)
    end
    try
      String.from_array(consume a).isize()
    else
      0
    end

However, it cannot report success until it finds a character which is not a digit, or finds the end of the file. Otherwise, it could leave parts of a number on the input stream, resulting in a parse error. Managing this one group of characters is the one example of backtracking in this parser.

Anyway, once the NumberToken has found the first non-numeric character, it can declare success if it has previously seen digits, or failure if it has not.

The actor

I implied at the top of this article that there would be actors involved, but there has been nary a sight of one yet. It’s all been plain, sequential Pony code. Let’s remedy that.

actor ParserActor[T]
  let _parser: Parser[T]

  new start[Start: Nonterminal[T]](
    writer: Writer tag,
    result: ParseResults tag,
    chain: (Chain tag | None) = None)
  =>
    _parser = Parser[T].start[Start](writer, result, chain)

  be apply(data: Array[U8] iso) => _parser.input(consume data)

  be dispose() => _parser.dispose()

This is the actor which owns the parser state. It’s created with the same arguments as the Parser, and creates an instance of the Parser using those arguments. I mentioned earlier that an actor is the unit of sequential execution in Pony; this actor controls the execution of the Parser, ensuring, for example, that a given block of input is fully consumed before the message containing the next input is accepted.
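
The glue between env.input and ParserActor is not shown in the post. A hypothetical adapter, assuming writer and results actors have been built elsewhere, could look like this:

class ParserNotify is StdinNotify
  """
  Hypothetical adapter: forwards each StdinNotify callback to the parser
  actor as a message.
  """
  let _parser: ParserActor[Value] tag

  new iso create(parser: ParserActor[Value] tag) => _parser = parser

  fun ref apply(data: Array[U8] iso) => _parser(consume data)

  fun ref dispose() => _parser.dispose()

// Wiring it up, somewhere in Main:
//   let parser = ParserActor[Value].start[StartingNonterminal](writer, results)
//   env.input(ParserNotify(parser))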

Does it work?

Heck, yeah, it works. Think I’d post this article if it didn’t work?

$ echo -n 2+3*4 | ./hoc  2>&1 
Parse succeeded: 14
Chain: 0 bytes

How does it work? Let’s take a look at the state of the rule stack during the execution. The first input causes the StartingNonterminal to be expanded, producing:

[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ NumberToken Term2 ]

Note: In computer science, stacks and trees grow down. Don’t ask me why, I just work here.

At this point, a token, NumberToken, is the current symbol. It reads the ‘2’, producing the following rule stack:

[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]

and the value stack:

2

The rule stack gets expanded to:

[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ * Term ]

The next character, however, is a ‘+’. So that fails, and Term2 is re-expanded.

[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ / Term ]

That still fails.

[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ ]

That succeeds, though. In this particular state, where it accepts empty, Term2 does not do anything to the value stack.

[ StartingNonterminal ]
[ Expr ]
[ Term Expr2 ]
[ ]

Likewise, Term, when succeeding, leaves the value stack alone when the top element is a number.

[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]

Expr2 is expanded (remember, we’re still looking at the ‘+’ in the input “2+3*4”) to:

[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ + Expr ]

Yay! This accepts our ‘+’! The CharToken(+) is shifted from the current rule, and the value stack becomes:

2
+

The Expr is expanded, producing the rule stack:

[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ NumberToken Term2 ]

The NumberToken accepts the ‘3’, is shifted from the rule stack, and the Term2 is expanded, producing a value stack of:

2
+
3

and a rule stack of:

[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ * Term ]

The ’*’ is accepted and the value and rule stacks are transformed into:

2
+
3
*

and

[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ Term ]
[ NumberToken Term2 ]

The final ‘4’ is read leaving us with a value stack of:

2
+
3
*
4

and a rule stack of:

[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]
[ Term ]
[ Term2 ]

Since the input is completed at this point, neither of the first two options for Term2 can succeed and the empty branch is taken. Term2 succeeds without modifying the value stack, as does Term. The rule stack is now:

[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Term Expr2 ]
[ Term2 ]

And the value stack is:

2
+
3
*
4

This Term2 is the one that accepted the ’*’, however; it is in a state to modify the value stack on succeeding:

2
+
3
4
*

Then, when the Term succeeds, it performs the multiplication, leaving the value stack as:

2
+
12

The rule stack now is:

[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]
[ Expr ]
[ Expr2 ]

Neither of the first two branches of Expr2 can succeed, so it accepts with the empty branch followed by Expr succeeding. Neither of these modify the value stack.

[ StartingNonterminal ]
[ Expr ]
[ Expr2 ]

This Expr2 accepted the ‘+’ in “2+3*4”, so it rearranges the value stack when it succeeds.

2
12
+

So that when Expr succeeds, it performs the addition.

14

The original StartingNonterminal succeeds as well, resulting in a terminal state: an empty rule stack. The value of the computation is 14, and no additional input needed to be passed on to the chained receiver.

Conclusion

So, what have we achieved here?

Certainly not a parsing system that I’d want to use again. Bleah. I don’t even want to look at it now.

Fortunately, there are some positive outcomes.

  1. This is a proof-of-concept online parser in Pony that doesn’t rely on blocking threads, blocking coroutines, or blocking anything. That’s good, right? Online algorithms are darned useful, especially for network-related tasks like parsing HTTP (ok, that’s easy anyway), and in Pony (which, I remind you, doesn’t have any blocking operations) they’re a necessity. That’s something you can’t say for traditionally developed recursive descent parsers, or the results of most parser generators like yacc.

    There are other online parsing techniques, however. Derivative-based parsing is one appealing approach, but GLR parsing and possibly CYK parsing and Earley parsing are other options.

  2. Pony’s actors make it possible to work with asynchronous interfaces and still write code that, nasty parsing algorithms aside, is relatively simple and clear. It takes some work to get used to the coding style, to transition from an imperative (or traditionally functional) style. Whether many programmers are willing to make the transition is another question, as is whether it is ultimately worthwhile. But as for me, I like it.
