Packrat: Simple Packrat Parsing

2.4

top ← prev up next →

Packrat: Simple Packrat Parsing

David Van Horn <dvanhorn@ccs.neu.edu>
and Simon Haines <simon.haines@scalardata.com>

package: Packrat

This module provides a small library of parsing combinators and a syntax for defining parsers based on the portable packrat parsing library by Tony Garnock-Jones: http://www.lshift.net/~tonyg/packrat.pdf

Bug reports, suggestions and pull requests are welcome via GitHub.

1 Combinator library

(require packrat/combinator)

package: Packrat

struct
(struct parse-position (filename line column)
    #:extra-constructor-name make-parse-position)
  filename : string?
  line : number?
  column : number?

A parse-position structure represents a character location in an input stream.

struct
(struct parse-result (successful? semantic-value next error)
    #:extra-constructor-name make-parse-result)
  successful? : boolean?
  semantic-value : any/c
  next : (or/c false? parse-results?)
  error : (or/c false? parse-error?)

A parse-result structure describes the results of an attempt at a parse at a particular position in the input stream. It can either record a successful parse, in which case it contains an associated semantic-value, or a failed parse, in which case it contains a parse-error structure.

struct
(struct parse-results (position base next* map)
    #:extra-constructor-name make-parse-results)
  position : (or/c false? parse-position?)
  base : any/c
  next* : (or/c false? parse-results? (-> parse-results?))
  map : (hash/c symbol? (or/c false? parse-result?))

A parse-results structure notionally describes all possible parses that can be attempted from a particular point in an input stream, and the results of those parses. It contains a parse-position structure, which corresponds to the position in the input stream that this parse-results represents, and a map associated "key objects" with instances of parse-result.

Atomic objects (known as "base values"; usually either character or token/semantic-value pairs) are represented specially in the parse-results data structure, as an optimisation: the two fields base and next* represent the implicit successful parse of a base value at the current position. The base field contains a pair of a token-class-identifier and a semantic value unless the parse-results data structure as a whole is representing the end of the input stream, in which case it will contain #f.

struct
(struct parse-error (position expected messages)
    #:extra-constructor-name make-parse-error)
  position : (or/c parse-position? false?)
  expected : (or/c false? (listof any/c))
  messages : (listof string?)

A parse-error structure represents collected error information from attempted parses. It contains two kinds of error report: a collection of "expected token" messages, and a collection of free-format message strings.

procedure
(top-parse-position filename) → parse-position?
filename : string?

Constructs a parse-position representing the very beginning of an input stream. The argument is passed into make-parse-position as the filename parameter.

procedure
(update-parse-position pos ch) → parse-position?
pos : parse-position?
ch : char?

Given a position, and the character occuring at that position, returns the position of the next character in the input stream. Most characters simple increment the column number. Expections to this rule are: #\return, which resets the column number to zero; #\newline, which both resets the column number to zero and increments the line number; and #\tab, which increments the column number to the nearest multiple of eight, just as a terminal with an eight-column tab stop setting might do.

procedure
(empty-results pos) → parse-results?
pos : (or/c parse-position? false?)

Creates an empty parse-results structure using the pos argument as the current position.

procedure
(make-results pos base next-generator) → parse-results?
  pos : (or/c parse-position? false?)
  base : (or/c false? (cons/c any/c any/c))
  next-generator : (-> parse-results?)

Constructs a parse-results instance with the supplied pos, base and next-generator, and a hash-eq dictionary for the map.

procedure
(make-error-expected pos thing) → parse-error?
pos : (or/c parse-position? false?)
thing : any/c

Constructs a parse-error with the supplied pos and thing as the "expected token" message.

procedure
(make-error-message pos msg) → parse-error?
pos : parse-position?
msg : string?

Constructs a parse-error with the supplied pos and msg as the "general error" message.

procedure
(make-result semantic-value next) → parse-result?
semantic-value : any/c
next : parse-results?

Constructs a successful parse-result with the supplied semantic-value and next instance.

procedure
(parse-error->parse-result err) → parse-result?
err : parse-error?

Transforms the err into a failed parse-result.

procedure
(make-expected-result pos thing) → parse-result?
pos : (or/c parse-position? false?)
thing : any/c

Combines make-error-expected and parse-error->parse-result to construct a failed parse-result with the supplied pos and thing as the "expected token" message.

procedure
(make-message-result pos msg) → parse-result?
pos : (or/c parse-position? false?)
msg : string?

Combines make-error-message and parse-error->parse-result to construct a failed parse-result with the supplied pos and msg as the "general error" message.

procedure
(base-generator->results generator) → parse-results?
generator :
(-> (values (or/c parse-position? false?)
(or/c (cons/c any/c any/c) false?)))

This function is used to set up an initial input stream of base tokens. The argument is to be a nullary function returning multiple values, the first is a parse-position or #f, and the second is a base token–a pair of a token class identifier and a semantic value. The argument is called every time the parser needs to read a fresh base token from the input stream.

procedure
(parse-results-next results) → parse-results?
results : parse-results?

Accesses the next value of a parse-results instance, and removes this value from the list of parse-results. If the next value is #f an error is generated.

procedure
(results->result results key fn) → parse-result?
  results : parse-results?
  key : symbol?
  fn : (-> parse-result?)

This is the central function that drives the parsing process. It examines the result map in the parse-results, searching for an entry matching the key. If an entry is found, the associated parse-result is returned; otherwise, the fn is called and the resulting parse-result is both stored in the result map and returned.

procedure
(parse-position>? a b) → boolean?
a : (or/c parse-position? false?)
b : (or/c parse-position? false?)

Compares the two parse-position instances, and returns true if the first position is later in the input stream (according to the line and col values) than the second position.

procedure
(parse-error-empty? e) → boolean?
e : parse-error?

A predicate for parse-error instances that returns #t if the instance contains no "expected token" or "general error" messages.

procedure
(merge-parse-errors e1 e2) → (or/c parse-error? false?)
e1 : (or/c parse-error? false?)
e2 : (or/c parse-error? false?)

Merges two parse-error instances. If one instance represents a position earlier in the input stream than the other, that instance is returned. If they both represent the same position, the "expected token" sets are united and the "general error" sets are appended to form a new parse-error at the same position.

procedure
(merge-result-errors result errs) → parse-result?
result : parse-result?
errs : (or/c parse-error? false?)

Merges parse-result and parse-error instances to create a new parse-result with the errors merged with merge-parse-errors.

procedure
(packrat-check-base token-kind k)
→ (-> parse-results? parse-result?)
token-kind : any/c
k : (-> any/c (-> parse-results? parse-result?))

Returns a combinator which, if the next base token has a token class identifier equal to token-kind, evaluates k with the semantic value of the next base token. The result should be another parser combinator, which is applied to the parse-results representing the remainder of the input stream.

procedure
(packrat-check-pred token-pred k)
→ (-> parse-results? parse-result?)
token-pred : (-> any/c boolean?)
k : (-> any/c (-> parse-results? parse-result?))

Returns a combinator which tests the next base token with token-pred and if the result is #t, evaluates k with the semantic value of the token, similarly to packrat-check-base.

procedure
(packrat-check parser k) → (-> parse-results? parse-result?)
parser : (-> parse-results? parse-result?)
k : (-> any/c (-> parse-results? parse-result?))

Returns a combinator which attempts to parse using parser, and if the parse is successful, hands the resulting semantic value to k and continues parsing using the resulting combinator.

procedure
(packrat-or p1 p2) → (-> parse-results? parse-result?)
p1 : (-> parse-results? parse-result?)
p2 : (-> parse-results? parse-result?)

Returns a combinator which attempts to parse using p1, only trying p2 if p1 fails to parse the input. This is the basic combinator used to implement a choice among several alternative means of parsing an input stream.

procedure
(packrat-unless explanation p1 p2)
→ (-> parse-results? parse-result?)
  explanation : string?
  p1 : (-> parse-results? parse-result?)
  p2 : (-> parse-results? parse-result?)

The combinator returned from this function first tries p1. If it fails, p2 is tried; otherwise an error containing explanation is returned. This can be used to assert that a particular sequence of tokens does not occur at the current position before continuing.

procedure
(packrat-port-results filename p) → parse-results?
filename : string?
p : port?

Returns an input stream generator from the filename and p port. See base-generator->results.

procedure
(packrat-string-results filename s) → parse-results?
filename : string?
s : string?

Returns an input stream generator from the filename and s string. See base-generator->results.

procedure
(packrat-list-results tokens) → parse-results?
tokens : (listof any/c)

Returns an input stream generator from the tokens. See base-generator->results.

2 Parser syntax

(require packrat/parse)

package: Packrat

syntax
(parse id (nonterminal-id (sequence body body0 ...) ...) ...)

sequence = (part ...)

part = (! part ...)
| (/ sequence ...)
| (? expr)
| nonterminal-id
| id := 'kind
| id := @
| id := nonterminal-id
| id := (? expr)

This macro provides syntactic sugar for building complex parser combinators from simpler combinators.

Each nonterminal definition expands into a parser-combinator, and the result of the form is the parser-combinator for the id nonterminal, which must be defined as one of the nonterminal-id forms.

The (! part ...) syntax expands into a packrat-unless form.

The (/ sequence ...) syntax expands into a packrat-or form.

The (? expr) syntax expands into a packrat-check-pred form.

Each nonterminal-id expands into a results->result formed from the body of the nonterminal definition.

The id := parts create a variable binding for id in the body expressions:

The 'kind form expands into packrat-check-base.
The @ binds id to the parse-position at that point in the input stream.
The nonterminal-id form expands into packrat-check with the procedure associated with the nonterminal passed as the combinator argument.
The ? expands into packrat-check-pred.

3 Examples

Here is an example of a simple calculator.

Examples:

> (define calc
    (parse expr
           (expr ((a := mulexp '+ b := mulexp)
                  (+ a b))
                 ((a := mulexp) a))
           (mulexp ((a := simple '* b := simple)
                    (* a b))
                   ((a := simple '* b := simple)
                    (* a b))
                   ((a := simple) a))
           (simple ((a := 'num) a)
                   (('oparen a := expr 'cparen) a))))
> (define g
          (packrat-list-results '((num . 1) (+) (num . 2) (*) (num . 3))))
> (parse-result-semantic-value (calc g))
7

See the tests source file for an example of a parser for a simplified Scheme grammar.

4 Test suite

(require packrat/test)

package: Packrat

This module contains the test suite which is run during package installation, or from the command line: raco test -p packrat.

top ← prev up next →