brag: a better Racket AST generator

7.7

top ← prev up next →

brag: a better Racket AST generator

Danny Yoo (95%)
and Matthew Butterick (5%)

This is a fork of the ragg package. It has a variety of bugfixes and new features. Some of these features have required new notation that’s not necessarily compatible with all existing ragg files.
Originally brag was built to support the Beautiful Racket project. But it has no dependencies on any Beautiful Racket libraries (and will remain thus). So it can be used independently.

1 Quick start

Suppose we’re given the following string:

"(radiant (humble))"

How would we turn this string into a structured value? That is, how would we parse it? (Let’s also suppose we’ve never heard of read.)

First, we need to consider the structure of the things we’d like to parse. The string above looks like a nested list of words. Good start.

Second, how might we describe this formally — meaning, in a way that a computer could understand? A common notation to describe the structure of these things is Backus-Naur Form (BNF). So let’s try to notate the structure of nested word lists in BNF.

nested-word-list: WORD
| LEFT-PAREN nested-word-list* RIGHT-PAREN

What we intend by this notation is this: nested-word-list is either a WORD, or a parenthesized list of nested-word-lists. We use the character * to represent zero or more repetitions of the previous thing. We treat the uppercased LEFT-PAREN, RIGHT-PAREN, and WORD as placeholders for tokens (a token being the smallest meaningful item in the parsed string):

Here are a few examples of tokens:

> (require brag/support)
> (token 'LEFT-PAREN)
(token-struct 'LEFT-PAREN #f #f #f #f #f #f)
> (token 'WORD "crunchy" #:span 7)
(token-struct 'WORD "crunchy" #f #f #f 7 #f)
> (token 'RIGHT-PAREN)
(token-struct 'RIGHT-PAREN #f #f #f #f #f #f)

This BNF description is also known as a grammar. Just as it does in a natural language like English or French, a grammar describes something in terms of what elements can fit where.

Have we made progress? We have a valid grammar. But we’re still missing a parser: a function that can use that description to make structures out of a sequence of tokens.

Meanwhile, it’s clear that we don’t yet have a valid program because there’s no #lang line. Let’s add one: put #lang brag at the top of the grammar, and save it as a file called "nested-word-list.rkt".

"nested-word-list.rkt"
#lang brag
nested-word-list: WORD
| LEFT-PAREN nested-word-list* RIGHT-PAREN

Now it’s a proper program. But what does it do?

> (require "nested-word-list.rkt")
> parse
#<procedure:parse>

It gives us a parse function. Let’s investigate what parse does. What happens if we pass it a sequence of tokens?

> (define a-parsed-value
    (parse (list (token 'LEFT-PAREN "(")
                 (token 'WORD "some")
                 (token 'LEFT-PAREN "[")
                 (token 'WORD "pig")
                 (token 'RIGHT-PAREN "]")
                 (token 'RIGHT-PAREN ")"))))
> a-parsed-value
#<syntax:unknown::-1 (nested-word-list "(" (nested-word-list "some") (nested-word-list "[" (nested-word-list "pig") "]") ")")>

Those who have messed around with macros will recognize this as a syntax object.

> (syntax->datum a-parsed-value)
'(nested-word-list
  "("
  (nested-word-list "some")
  (nested-word-list "[" (nested-word-list "pig") "]")
  ")")

That’s (some [pig]), essentially.

What happens if we pass our parse function a bigger source of tokens?

; tokenize: string -> (sequenceof token-struct?)
; Generate tokens from a string:
> (define (tokenize s)
    (for/list ([str (regexp-match* #px"\$|\$|\\w+" s)])
              (match str
                ["("
                 (token 'LEFT-PAREN str)]
                [")"
                 (token 'RIGHT-PAREN str)]
                [else
                 (token 'WORD str)])))
; For example:
> (define token-source (tokenize "(welcome (to (((brag)) ())))"))
> (define v (parse token-source))
> (syntax->datum v)
'(nested-word-list
  "("
  (nested-word-list "welcome")
  (nested-word-list
   "("
   (nested-word-list "to")
   (nested-word-list
    "("
    (nested-word-list
     "("
     (nested-word-list "(" (nested-word-list "brag") ")")
     ")")
    (nested-word-list "(" ")")
    ")")
   ")")
  ")")

Welcome to brag.

2 Introduction

brag is a parser generator designed to be easy to use:

It provides a #lang for writing BNF grammars. A module written in #lang brag automatically generates a parser. The grammar controls the structure of the syntax objects it generates.
The language uses a few conventions to simplify the expression of grammars. The first rule in the grammar is assumed to be the starting production. Identifiers in UPPERCASE are treated as terminal tokens. All other identifiers are treated as nonterminals.
Tokenizers can be developed independently of parsers. brag takes a liberal view on tokens: they can be strings, symbols, or instances constructed with token. Tokens can optionally provide source location, in which case a syntax object generated by the parser will too.
The parser can usually handle ambiguous grammars.
It integrates with the rest of the Racket language toolchain.

2.1 Example: a small DSL for ASCII diagrams

Suppose we’d like to define a language for drawing simple ASCII diagrams. So if we write something like this:

3 9 X;
6 3 b 3 X 3 b;
3 9 X;

It should generate the following picture:

XXXXXXXXX
XXXXXXXXX
XXXXXXXXX
   XXX
   XXX
   XXX
   XXX
   XXX
   XXX
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX

This makes sense in a casual way. But let’s be more precise about how the language works.

Each line of the program has a semicolon at the end, and describes the output of several rows of the line drawing. Let’s look at two of the lines in the example:

3 9 X;: “Repeat the following 3 times: print "X" nine times, followed by a newline.”
6 3 b 3 X 3 b;: “Repeat the following 6 times: print " " three times, followed by "X" three times, followed by " " three times, followed by a newline.”

Then each line consists of a repeat number, followed by pairs of (number, character) chunks. We’ll assume here that the intent of the lowercased character b is to represent the printing of a 1-character whitespace " ", and for other uppercase letters to represent the printing of themselves.

By understanding the pieces of each line, we can more easily capture that meaning in a grammar. Once we have each instruction of our ASCII DSL in a structured format, we should be able to parse it.

Here’s a first pass at expressing the structure of these line-drawing programs.

2.2 Parsing the concrete syntax

"simple-line-drawing.rkt"
#lang brag
drawing: rows*
rows: repeat chunk+ ";"
repeat: INTEGER
chunk: INTEGER STRING

Syntax and terminology describes brag’s syntax in more detail.

We write a brag program as an BNF grammar, where patterns can be:

the names of other rules (e.g. chunk)
literal and symbolic token names (e.g. ";", INTEGER)
quantified patterns (e.g. + to represent one or more repetitions)

The result of a brag program is a module with a parse function that can parse tokens and produce a syntax object as a result.

Let’s try this function:

> (require brag/support)
> (require "simple-line-drawing.rkt")
> (define stx
    (parse (list (token 'INTEGER 6)
                 (token 'INTEGER 2)
                 (token 'STRING " ")
                 (token 'INTEGER 3)
                 (token 'STRING "X")
                 ";")))
> (syntax->datum stx)
'(drawing (rows (repeat 6) (chunk 2 " ") (chunk 3 "X") ";"))

A token is the smallest meaningful element of a source program. Tokens can be strings, symbols, or instances of the token data structure. (Plus a few other special cases, which we’ll discuss later.) Usually, a token holds a single character from the source program. But sometimes it makes sense to package a sequence of characters into a single token, if the sequence has an indivisible meaning.

If possible, we also want to attach source location information to each token. Why? Because this information will be incorporated into the syntax objects produced by parse.

A parser often works in conjunction with a helper function called a lexer that converts the raw code of the source program into tokens. The br-parser-tools/lex library can help us write a position-sensitive tokenizer:

> (require br-parser-tools/lex)
> (define (tokenize ip)
    (port-count-lines! ip)
    (define my-lexer
      (lexer-src-pos
       [(repetition 1 +inf.0 numeric)
        (token 'INTEGER (string->number lexeme))]
       [upper-case
        (token 'STRING lexeme)]
       ["b"
        (token 'STRING " ")]
       [";"
        (token ";" lexeme)]
       [whitespace
        (token 'WHITESPACE lexeme #:skip? #t)]
       [(eof)
        (void)]))
    (define (next-token) (my-lexer ip))
    next-token)
> (define a-sample-input-port (open-input-string "6 2 b 3 X;"))
> (define token-thunk (tokenize a-sample-input-port))
; Now we can pass token-thunk to the parser:
> (define another-stx (parse token-thunk))
> (syntax->datum another-stx)
'(drawing (rows (repeat 6) (chunk 2 " ") (chunk 3 "X") ";"))
; The syntax object has location information:
> (syntax-line another-stx)
1
> (syntax-column another-stx)
0
> (syntax-span another-stx)
10

Note also from this lexer example:

parse accepts as input either a sequence of tokens, or a function that produces tokens (which parse will call repeatedly to get the next token).
As an alternative to the basic token structure, a token can also be an instance of the position-token structure (also found in br-parser-tools/lex). In that case, the token will try to derive its position from that of the position-token.
parse will stop if it gets void (or 'eof) as a token.
parse will skip any token that has #:skip? attribute set to #t. For instance, tokens representing comments often use #:skip?.

2.3 From parsing to interpretation

We now have a parser for programs written in this simple-line-drawing language. Our parser will return syntax objects:

> (define parsed-program
    (parse (tokenize (open-input-string "3 9 X; 6 3 b 3 X 3 b; 3 9 X;"))))
> (syntax->datum parsed-program)
'(drawing
  (rows (repeat 3) (chunk 9 "X") ";")
  (rows (repeat 6) (chunk 3 " ") (chunk 3 "X") (chunk 3 " ") ";")
  (rows (repeat 3) (chunk 9 "X") ";"))

Better still, these syntax objects will have a predictable structure that follows the grammar:

(drawing (rows (repeat <number>)
(chunk <number> <string>) ... ";")
...)

where drawing, rows, repeat, and chunk should be treated literally, and everything else will be numbers or strings.

Still, these syntax-object values are just inert structures. How do we interpret them, and make them print? We claimed at the beginning of this section that these syntax objects should be easy to interpret. So let’s do it.

This is a very quick-and-dirty treatment of syntax-parse. See the syntax/parse documentation for a gentler guide to its features.

Racket provides a special form called syntax-parse in the syntax/parse library. syntax-parse lets us do a structural case-analysis on syntax objects: we provide it a set of patterns to parse and actions to perform when those patterns match.

As a simple example, we can write a function that looks at a syntax object and says #t if it’s the literal yes, and #f otherwise:

> (require syntax/parse)
; yes-syntax-object?: syntax-object -> boolean
; Returns true if the syntax-object is yes.
> (define (yes-syntax-object? stx)
    (syntax-parse stx
      [(~literal yes)
       #t]
      [else
       #f]))
> (yes-syntax-object? #'yes)
#t
> (yes-syntax-object? #'nooooooooooo)
#f

Here, we use ~literal to let syntax-parse know that yes should show up literally in the syntax object. The patterns can also have some structure to them, such as:

({~literal drawing} rows-stxs ...)

which matches on syntax objects that begin, literally, with drawing, followed by any number of rows (which are syntax objects too).

Now that we know a little bit more about syntax-parse, we can use it to do a case analysis on the syntax objects that our parse function gives us. We start by defining a function on syntax objects of the form (drawing rows-stx ...).

> (define (interpret-drawing drawing-stx)
    (syntax-parse drawing-stx
      [({~literal drawing} rows-stxs ...)

       (for ([rows-stx (syntax->list #'(rows-stxs ...))])
            (interpret-rows rows-stx))]))

When we encounter a syntax object with (drawing rows-stx ...), then interpret-rows each rows-stx.

Let’s define interpret-rows now:

> (define (interpret-rows rows-stx)
    (syntax-parse rows-stx
      [({~literal rows}
        ({~literal repeat} repeat-number)
        chunks ... ";")

       (for ([i (syntax-e #'repeat-number)])
            (for ([chunk-stx (syntax->list #'(chunks ...))])
                 (interpret-chunk chunk-stx))
            (newline))]))

For a rows, we extract out the repeat-number out of the syntax object and use it as the range of the for loop. The inner loop walks across each chunk-stx and calls interpret-chunk on it.

Finally, we need to write a definition for interpret-chunk. We want it to extract out the chunk-size and chunk-string portions, and print to standard output:

> (define (interpret-chunk chunk-stx)
    (syntax-parse chunk-stx
      [({~literal chunk} chunk-size chunk-string)

       (for ([k (syntax-e #'chunk-size)])
            (display (syntax-e #'chunk-string)))]))

Here are the definitions in a single file: interpret.rkt.

With these definitions in hand, now we can pass it syntax objects that we construct directly by hand:

> (interpret-chunk #'(chunk 3 "X"))
XXX
> (interpret-drawing #'(drawing (rows (repeat 5) (chunk 3 "X") ";")))
XXX
XXX
XXX
XXX
XXX

or we can pass it the result generated by our parser:

> (define parsed-program
    (parse (tokenize (open-input-string "3 9 X; 6 3 b 3 X 3 b; 3 9 X;"))))
> (interpret-drawing parsed-program)
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX
   XXX
   XXX
   XXX
   XXX
   XXX
   XXX
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX

And now we’ve got an interpreter!

2.4 From interpretation to compilation

For a gentler tutorial on writing #lang-based languages, see Beautiful Racket.

(Just as a warning: the following material is slightly more advanced, but shows how writing a compiler for the line-drawing language reuses the ideas for the interpreter.)

Wouldn’t it be nice to be able to write something like:

3 9 X;
6 3 b 3 X 3 b;
3 9 X;

and have Racket automatically compile this down to something like this?

(for ([i 3])
     (for ([k 9]) (displayln "X"))
     (newline))

(for ([i 6])
     (for ([k 3]) (displayln " "))
     (for ([k 3]) (displayln "X"))
     (for ([k 3]) (displayln " "))
     (newline))

(for ([i 3])
     (for ([k 9]) (displayln "X"))
     (newline))

Well, of course it won’t work: we don’t have a #lang line.

Let’s add one.

"letter-i.rkt"
#lang brag/examples/simple-line-drawing
3 9 X;
6 3 b 3 X 3 b;
3 9 X;

Now "letter-i.rkt" is a program.

How does this work? From the previous sections, we’ve seen how to take the contents of a file and interpret it. What we want to do now is teach Racket how to compile programs labeled with this #lang line. We’ll do two things:

Tell Racket to use the brag-generated parser and lexer we defined earlier whenever it sees a program written with #lang brag/examples/simple-line-drawing.
Define transformation rules for drawing, rows, and chunk to rewrite these into standard Racket forms.

The second part, the writing of the transformation rules, will look very similar to the definitions we wrote for the interpreter, but the transformation will happen at compile-time. (We could just resort to simply calling into the interpreter we just wrote up, but this section is meant to show that compilation is also viable.)

We do the first part by defining a module reader: a module reader tells Racket how to parse and compile a file. Whenever Racket sees a #lang <name>, it looks for a corresponding module reader in "<name>/lang/reader".

Here’s the definition for "brag/examples/simple-line-drawing/lang/reader.rkt":

"brag/examples/simple-line-drawing/lang/reader.rkt"
#lang s-exp syntax/module-reader
brag/examples/simple-line-drawing/semantics
#:read my-read
#:read-syntax my-read-syntax
#:whole-body-readers? #t

(require brag/examples/simple-line-drawing/lexer
brag/examples/simple-line-drawing/grammar)

(define (my-read in)
(syntax->datum (my-read-syntax #f in)))

(define (my-read-syntax src ip)
(list (parse src (tokenize ip))))

We use a helper module syntax/module-reader, which provides utilities for creating a module reader. It uses the lexer and brag-generated parser we defined earlier, and also tells Racket that it should compile the forms in the syntax object using a module called "semantics.rkt".

Let’s look into "semantics.rkt" and see what’s involved in compilation:

"brag/examples/simple-line-drawing/semantics.rkt"
#lang racket/base
(require (for-syntax racket/base syntax/parse))

(provide #%module-begin
;; We reuse Racket's treatment of raw datums, specifically
;; for strings and numbers:
#%datum

;; And otherwise, we provide definitions of these three forms.
;; During compiliation, Racket uses these definitions to
;; rewrite into for loops, displays, and newlines.
drawing rows chunk)

;; Define a few compile-time functions to do the syntax rewriting:
(begin-for-syntax
(define (compile-drawing drawing-stx)
(syntax-parse drawing-stx
[({~literal drawing} rows-stxs ...)

(syntax/loc drawing-stx
(begin rows-stxs ...))]))

(define (compile-rows rows-stx)
(syntax-parse rows-stx
[({~literal rows}
({~literal repeat} repeat-number)
chunks ...
";")

(syntax/loc rows-stx
(for ([i repeat-number])
chunks ...
(newline)))]))

(define (compile-chunk chunk-stx)
(syntax-parse chunk-stx
[({~literal chunk} chunk-size chunk-string)

(syntax/loc chunk-stx
(for ([k chunk-size])
(display chunk-string)))])))

;; Wire up the use of "drawing", "rows", and "chunk" to these
;; transformers:
(define-syntax drawing compile-drawing)
(define-syntax rows compile-rows)
(define-syntax chunk compile-chunk)

The semantics hold definitions for compile-drawing, compile-rows, and compile-chunk, similar to what we had for interpretation with interpret-drawing, interpret-rows, and interpret-chunk. However, compilation is not the same as interpretation: each definition does not immediately execute the act of drawing, but rather returns a syntax object whose evaluation will do the actual work.

There are a few things to note:

By the way, we can just as easily rewrite the semantics so that compile-rows does explicitly call compile-chunk. Often, though, it’s easier to write the transformation functions in this piecemeal way and depend on the Racket macro expansion system to do the rewriting as it encounters each of the forms.

brag’s native data structure is the syntax object because the majority of Racket’s language-processing infrastructure knows how to read and write this structured value.
Unlike in interpretation, compile-rows doesn’t compile each chunk by directly calling compile-chunk. Rather, it depends on the Racket macro expander to call each compile-XXX function as it encounters a drawing, rows, or chunk in the parsed value. The three statements at the bottom of "semantics.rkt" inform the macro expansion system to do this:
(define-syntax drawing compile-drawing)
(define-syntax rows compile-rows)
(define-syntax chunk compile-chunk)

Altogether, brag’s intent is to be a parser generator for Racket that’s easy and fun to use. It’s meant to fit naturally with the other tools in the Racket language toolchain. Hopefully, it will reduce the friction in making new languages with alternative concrete syntaxes.

The rest of this document describes the brag language and the parsers it generates.

3 The language

3.1 Syntax and terminology

A program in the brag language consists of the language line #lang brag, followed by a collection of rules and possibly line comments or multiline comments.

A rule is a sequence consisting of: a rule identifier, a separator (either ":" or "::="), and a pattern.

A rule identifier is an identifier that is not in upper case.

A symbolic token identifier is an identifier that is in upper case.

A line comment begins with either # or ; and continues till the end of the line.

A multiline comment begins with (* and ends with *).

An identifier is a sequence of letters, numbers, or characters in the set "-.!$%&/<=>?^_~@". It must not contain *, +, or { and }, as those characters are used to denote quantification.

A pattern is one of the following:

an implicit sequence of patterns separated by whitespace or commas.
a terminal: either a literal string or a symbolic token identifier.
When used in a pattern, both kinds of terminals will match the same set of inputs.
A literal string can match the string itself, or a token structure whose type field contains that string (or its symbol form). So "FOO" in a rule pattern would match the tokens "FOO", (token "FOO" "bar"), or (token 'FOO "bar").
A symbolic token identifier can also match the string version of the identifier, or a token whose type field is the symbol or string form of the identifier. So FOO in a rule pattern would also match the tokens "FOO", (token 'FOO "bar"), or (token "FOO" "bar"). (In every case, the value of a token, like "bar", can be anything, and may or may not be the same as the symbolic token identifier.)
Because their underlying meanings are the same, the symbolic token identifier ends up being a notational convenience for readability inside a rule pattern. Typically, the literal string "FOO" is used to connote “match the string "FOO" exactly” and the symbolic token identifier FOO specially connotes “match a token of type 'FOO”.
You cannot use the literal string "error" as a terminal in a grammar, because it’s reserved for brag. You can, however, adjust your lexer to package it inside a token structure — say, (token 'ERROR "error") — and then use the symbolic token identifier ERROR in the grammar to match this token structure.
a rule identifier.
a choice pattern: a sequence of patterns delimited with | characters.
a quantified pattern: a pattern followed by either * (“zero or more”), ? (“zero or one”), or + (“one or more”). Quantification can also be denoted by integers within curly brackets. So {2} means “exactly 2”; {2,5} means “between 2 and 5, inclusive”; {2,} means “2 or more”; and {,5} means “up to 5”.
an optional pattern: a pattern surrounded by [ and ]. (The ? zero-or-one quantifier means the same thing.)
an explicit sequence: a pattern surrounded by ( and ).
the empty set: a special pattern that matches a list of zero tokens. When it appears on the right side of a rule, the empty set will match empty input (which obviously contains zero tokens), but also the “gap” between two existing tokens (which less obviously also contains zero tokens). The empty set can be denoted by () (empty parentheses), ∅ (the Unicode empty-set character), or Ø (the slashed O).

For example, in the following program:

#lang brag
;; A parser for a silly language
sentence: verb optional-adjective object
verb: greeting
optional-adjective: ["happy" | "frumpy"]
greeting: "hello" | "hola" | "aloha"
object: "world" | WORLD

the elements sentence, verb, greeting, and object are rule identifiers. The first rule, sentence: verb optional-adjective object, is a rule whose right side is an implicit pattern sequence of three sub-patterns. The uppercased WORLD is a symbolic token identifier. The fourth rule in the program associates greeting with a choice pattern.

More examples:

A BNF for binary strings that contain an equal number of zeros and ones.
#lang brag
equal: [zero one | one zero]   ;; equal number of "0"s and "1"s.
zero: "0" equal | equal "0"    ;; has an extra "0" in it.
one: "1" equal | equal "1"     ;; has an extra "1" in it.
A BNF for JSON-like structures.
#lang brag
json: number | string
| array | object
number: NUMBER
string: STRING
array: "[" [json ("," json)*] "]"
object: "{" [kvpair ("," kvpair)*] "}"
kvpair: ID ":" json

3.2 Cuts & splices

By default, every matched token shows up in the parse tree. But sometimes that means that the parse tree ends up holding a bunch of tokens that were only needed to complete the parsing. Once they’ve served their purpose, it’s sometimes useful to filter them out (for instance, to simplify the implementation of a language expander). To help with this kind of housekeeping, brag supports cuts and splices.

A cut in a grammar will delete an item from the parse tree. A cut is notated by prefixing either the left-hand rule name or a right-hand pattern element with a slash /.

If the cut is applied to a left-hand rule name, the rule name is omitted from the parse tree, but its node and its matched elements remain.

If the cut is applied to a right-hand pattern element, then that element is omitted from every node matching that rule.

For instance, consider this simple grammar for arithmetic expressions:

#lang brag

expr : term ('+' term)*

term : factor ('*' factor)*

factor : ("0" | "1" | "2" | "3"

| "4" | "5" | "6" | "7"

| "8" | "9")+

If we use it to parse this string:

1+2*3

We get this parse tree:

'(expr (term (factor "1")) "+" (term (factor "2") "*" (factor "3")))

Suppose we felt the + and * characters were superfluous. We can add cuts to the grammar by prefixing these pattern elements with /:

#lang brag

expr : term (/'+' term)*

term : factor (/'*' factor)*

factor : ("0" | "1" | "2" | "3"

| "4" | "5" | "6" | "7"

| "8" | "9")+

Our parse tree changes accordingly:

'(expr (term (factor "1")) (term (factor "2") (factor "3")))

Now suppose we apply a cut on the rule name, factor:

#lang brag

expr : term (/'+' term)*

term : factor (/'*' factor)*

/factor : ("0" | "1" | "2" | "3"

| "4" | "5" | "6" | "7"

| "8" | "9")+

This time, the rule name disappears from the parse tree, but its nodes and elements remain:

'(expr (term ("1")) (term ("2") ("3")))

A splice in a grammar will merge the elements of a node into the surrounding node. A splice is notated by prefixing either the left-hand rule name or a right-hand pattern element with an at sign @.

If the splice is applied to a left-hand rule name, then the splice is applied every time the rule is used in the parse tree.

If the splice is applied to a right-hand pattern element, that element is spliced only when it appears as part of the production for that rule.

Suppose we remove the cut from the factor rule name and instead splice the second appearance of factor in the pattern for the term rule:

#lang brag

expr : term (/'+' term)*

term : factor (/'*' @factor)*

factor : ("0" | "1" | "2" | "3"

| "4" | "5" | "6" | "7"

| "8" | "9")+

The factor elements matching the first position of the term pattern remain as they were, but the factor element matching the second position is spliced into the surrounding node:

'(expr (term (factor "1")) (term (factor "2") "3"))

Finally, suppose we add a splice to the term rule name:

#lang brag

expr : term (/'+' term)*

@term : factor (/'*' @factor)*

factor : ("0" | "1" | "2" | "3"

| "4" | "5" | "6" | "7"

| "8" | "9")+

This time, all the appearances of term nodes in the parse tree will have their elements spliced into the surrounding nodes:

'(expr (factor "1") (factor "2") "3")

As a convenience, when a grammar element is spliced, or a rule name is cut, brag preserves the rule name by adding it as a syntax property to the residual elements, using the rule name as a key, and the original syntax object representing the rule name as the value.

Caveat for the top-level rule: though the rule name can have a cut, it cannot have a splice — once you’re at the top level, there’s nothing above to splice into.

3.3 Syntax errors

Besides the basic syntax errors that can occur with a malformed grammar, there are a few other classes of situations that #lang brag will consider as syntax errors.

brag will raise a syntax error if the grammar:

doesn’t have any rules.
has a rule with the same left hand side as any other rule.
refers to rules that have not been defined. e.g. the following program:
#lang brag
foo: [bar]
should raise an error because bar has not been defined, even though foo refers to it in an optional pattern.
uses the token name EOF; the end-of-file token type is reserved for internal use by brag.
contains a rule that has no finite derivation. e.g. the following program:
#lang brag
infinite-a: "a" infinite-a
should raise an error because no finite sequence of tokens will satisfy infinite-a.

Otherwise, brag should be fairly tolerant and permit even ambiguous grammars.

3.4 Semantics

A program written in #lang brag produces a module that provides a few bindings. The most important of these is parse:

procedure
(parse [source-path] token-source) → syntax?
  source-path : any/c = #f
   token-source :
(or/c (sequenceof token)
      (-> token))

Parses a series of tokens according to the rules in the grammar, using the first rule of the grammar for the initial production. The parse must completely consume token-source. The optional source-path argument is used to enrich the syntax-location fields.

The token source can either be a sequence, or a 0-arity function that produces tokens.

A token in brag can be any of the following values:

a string
a symbol
an instance produced by token
an instance produced by the token constructors of br-parser-tools/lex
an instance of br-parser-tools/lex’s position-token whose position-token-token is a token.

A token whose type is either void or 'EOF terminates the source.

If parse succeeds, it will return a structured syntax object. The structure of the syntax object follows the overall structure of the rules in the BNF grammar. For each rule r and its associated pattern p, parse generates a syntax object #'(r p-value) where p-value’s structure follows a case analysis on p:

For implicit and explicit sequences of patterns p1, p2, ..., the corresponding values, spliced into the structure.
For terminals, the value of the token.
For rule identifiers: the associated parse value for the rule.
For choice patterns: the associated parse value for one of the matching subpatterns.
For quantified patterns and optional patterns: the corresponding values, spliced into the structure.

Thus, it’s only the presence of rule identifiers in a rule’s pattern that tells the parser to introduce nested structure into the syntax object.

If the grammar is ambiguous, brag will choose one of the possible parse results, though it doesn’t guarantee which.

If the parse cannot be performed successfully, or if a token in the token-source uses a type that isn’t mentioned in the grammar, then parse raises an instance of exn:fail:parsing.

procedure
(parse-to-datum [source] token-source) → list?
  source : any/c = #f
   token-source :
(or/c (sequenceof token)
      (-> token))

Same as parse, but the result is converted into a plain datum. Useful for testing or debugging a parser.

syntax
(make-rule-parser name)

Constructs a parser for the name of one of the non-terminals in the grammar.

For example, given the brag program "simple-arithmetic-grammar.rkt":

"simple-arithmetic-grammar.rkt"
#lang brag
expr : term ('+' term)*
term : factor ('*' factor)*
factor : INT

the following interaction shows how to extract a parser for terms.

> (require "simple-arithmetic-grammar.rkt")
> (define term-parse (make-rule-parser term))
> (define tokens (list (token 'INT 3)
                       "*"
                       (token 'INT 4)))
> (syntax->datum (parse tokens))
'(expr (term (factor 3) "*" (factor 4)))
> (syntax->datum (term-parse tokens))
'(term (factor 3) "*" (factor 4))
> (define another-token-sequence
    (list (token 'INT 1) "+" (token 'INT 2)
          "*" (token 'INT 3)))
> (syntax->datum (parse another-token-sequence))
'(expr (term (factor 1)) "+" (term (factor 2) "*" (factor 3)))
; Note that term-parse will break on another-token-sequence
; as it does not know what to do with the "+"
> (term-parse another-token-sequence)
Encountered parsing error near "+" (token '+) while parsing
'unknown [line=#f, column=#f, offset=#f]

value
all-token-types : (setof symbol?)

A set of all the token types used in a grammar.

For example:

> (require "simple-arithmetic-grammar.rkt")
> all-token-types
(set '* '+ 'INT)

4 Support API

(require brag/support)

package: brag-lib

The brag/support module provides functions to interact with brag programs. The most useful is the token function, which produces tokens to be parsed.

In addition to the exports shown below, the brag/support module also provides everything from brag/support, and everything from br-parser-tools/lex.

procedure
(token type
[ val
#:line line
#:column column
#:position position
#:span span
#:skip? skip?]) → token-struct?
  type : (or/c string? symbol?)
  val : any/c = #f
  line : (or/c exact-positive-integer? #f) = #f
  column : (or/c exact-nonnegative-integer? #f) = #f
  position : (or/c exact-positive-integer? #f) = #f
  span : (or/c exact-nonnegative-integer? #f) = #f
  skip? : boolean? = #f

Creates instances of token-structs.

The syntax objects produced by a parse will inject the value val in place of the token name in the grammar.

If #:skip? is true, then the parser will skip over it during a parse.

struct
(struct token-struct (type val position line column span skip?)
    #:extra-constructor-name make-token-struct
    #:transparent)
  type : symbol?
  val : any/c
  position : (or/c exact-positive-integer? #f)
  line : (or/c exact-nonnegative-integer? #f)
  column : (or/c exact-positive-integer? #f)
  span : (or/c exact-nonnegative-integer? #f)
  skip? : boolean?

The token structure type.

Rather than directly using the token-struct constructor, please use the helper function token to construct instances.

struct
(struct exn:fail:parsing exn:fail ( message
continuation-marks
srclocs)
    #:extra-constructor-name make-exn:fail:parsing)
  message : string?
  continuation-marks : continuation-mark-set?
  srclocs : (listof srcloc?)

The exception raised when parsing fails.

exn:fail:parsing implements Racket’s prop:exn:srcloc property, so if this exception reaches DrRacket’s default error handler, DrRacket should highlight the offending locations in the source.

procedure
(apply-port-proc proc [port]) → list?
proc : procedure?
port : (or/c string? input-port?) = (current-input-port)

Repeatedly apply proc to port, gathering the results into a list. port can be an input port or a string (which is converted to a string port). Useful for testing or debugging a lexer or tokenizer.

procedure
(apply-lexer lexer [port]) → list?
lexer : procedure?
port : (or/c string? input-port?) = (current-input-port)

Alias for apply-port-proc.

procedure
(apply-tokenizer-maker tokenizer-maker
[ port]) → list?
tokenizer-maker : procedure?
port : (or/c string? input-port?) = (current-input-port)

Repeatedly apply tokenizer-maker to port, gathering the resulting tokens into a list. port can be an input port or a string (which is converted to a string port).

procedure
(trim-ends left-str str right-str) → string?
  left-str : string?
  str : string?
  right-str : string?

Remove left-str from the left side of str, and right-str from its right side. Intended as a helper function for from/to.

syntax
(:* re ...)

0 or more occurrences of any re pattern.

syntax
(:+ re ...)

1 or more occurrences of any re pattern.

syntax
(:? re ...)

0 or 1 occurrence of any re pattern.

syntax
(:= n re ...)

Exactly n occurrences of any re pattern, where n must be a literal exact, non-negative number.

syntax
(:>= n re ...)

At least n occurrences of any re pattern, where n must be a literal exact, non-negative number.

syntax
(:** n m re ...)

Between n and m (inclusive) occurrences of any re pattern, where n must be a literal exact, non-negative number, and m must be literally either #f, +inf.0, or an exact, non-negative number; a #f value for m is the same as +inf.0.

syntax
(:or re ...)

Same as (:union re ...).

syntax
(:: re ...)
syntax
(:seq re ...)

Both forms concatenate the res into a single, indivisible pattern. In other words, this matches all the res in order, whereas (:union re ...) matches any of the res.

syntax
(:& re ...)

Intersects the res.

syntax
(:- re ...)

The set difference of the res.

syntax
(:~ re ...)

Character-set complement, which each re must match exactly one character.

syntax
(:/ char-or-string ...)

Character ranges, matching characters between successive pairs of characters.

syntax
(from/to open close)

A string that is bounded by open and close. Matching is non-greedy (meaning, it stops at the first occurence of close). The resulting lexeme includes open and close. To remove them, see trim-ends.

syntax
(from/stop-before open close)

Like from/to, a string that is bounded by open and close, except that close is not included in the resulting lexeme. Matching is non-greedy (meaning, it stops at the first occurence of close).

4.1 Differences with ragg

This package is a fork of ragg. The most salient additions:

Cuts & splices.
Improved syntax for grammars, including comments and and quantified patterns.
Support for REPL interactions.
Some new conveniences, like parse-to-datum, apply-lexer, trim-ends, from/to, and from/stop-before.

top ← prev up next →