14.3 Decode
(require pollen/decode)  package: pollen
The doc export of a Pollen markup file is a simple X-expression. Decoding refers to any post-processing of this X-expression. The pollen/decode module provides tools for creating decoders.
The decode step can happen separately from the compilation of the file. But you can also attach a decoder to the markup file’s root node, so the decoding happens automatically when the markup is compiled, and thus automatically incorporated into doc. (Following this approach, you could also attach multiple decoders to different tags within doc.)
You can, of course, embed function calls within Pollen markup. But since markup is optimized for authors, decoding is useful for operations that can or should be moved out of the authoring layer.
One example is presentation and layout. For instance, decode-paragraphs is a decoder function that lets authors mark paragraphs in their source simply by using two carriage returns.
Another example is conversion of output into a particular data format. Most Pollen functions are optimized for HTML output, but one could write a decoder that targets another format.
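As a rough illustration of targeting another format, the sketch below turns em elements into Markdown-style emphasis strings. The em->markdown helper is hypothetical (it is not part of Pollen), and it assumes the children of each em are all strings.

```racket
#lang racket/base
(require pollen/decode txexpr)

;; Hypothetical converter: render an em element as a Markdown-style string.
;; Assumes every child of the em element is a string.
(define (em->markdown tx)
  (if (eq? (get-tag tx) 'em)
      (string-append "*" (apply string-append (get-elements tx)) "*")
      tx))

;; txexpr-proc sees every tagged X-expression; only em is rewritten.
(define result
  (decode '(p "Hello " (em "world") "!")
          #:txexpr-proc em->markdown))
;; result is '(p "Hello " "*world*" "!")
```

A complete converter would need a case for every tag in the document; this only shows the shape of the approach.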
procedure
(decode tagged-xexpr
        [#:txexpr-tag-proc txexpr-tag-proc
         #:txexpr-attrs-proc txexpr-attrs-proc
         #:txexpr-elements-proc txexpr-elements-proc
         #:txexpr-proc txexpr-proc
         #:block-txexpr-proc block-txexpr-proc
         #:inline-txexpr-proc inline-txexpr-proc
         #:string-proc string-proc
         #:entity-proc entity-proc
         #:cdata-proc cdata-proc
         #:exclude-tags tags-to-exclude
         #:exclude-attrs attrs-to-exclude]) → (or/c xexpr/c (listof xexpr/c))
tagged-xexpr : txexpr?
txexpr-tag-proc : (txexpr-tag? . -> . txexpr-tag?) = (λ (tag) tag)
txexpr-attrs-proc : (txexpr-attrs? . -> . txexpr-attrs?) = (λ (attrs) attrs)
txexpr-elements-proc : (txexpr-elements? . -> . txexpr-elements?) = (λ (elements) elements)
txexpr-proc : (txexpr? . -> . (or/c xexpr? (listof xexpr?))) = (λ (tx) tx)
block-txexpr-proc : (block-txexpr? . -> . (or/c xexpr? (listof xexpr?))) = (λ (tx) tx)
inline-txexpr-proc : (txexpr? . -> . (or/c xexpr? (listof xexpr?))) = (λ (tx) tx)
string-proc : (string? . -> . (or/c xexpr? (listof xexpr?))) = (λ (str) str)
entity-proc : ((or/c symbol? valid-char?) . -> . (or/c xexpr? (listof xexpr?))) = (λ (ent) ent)
cdata-proc : (cdata? . -> . (or/c xexpr? (listof xexpr?))) = (λ (cdata) cdata)
tags-to-exclude : (listof txexpr-tag?) = null
attrs-to-exclude : txexpr-attrs? = null
This function doesn’t do much on its own. Rather, it provides the hooks upon which harder-working functions can be hung.
Recall that in Pollen, all tags are functions. By default, the tagged-xexpr from a source file is tagged with root. So the typical way to use decode is to pass your decoding functions to it as arguments, and then define root to invoke decode. It will then be applied automatically to every doc during compile.
For instance, here’s how decode is attached to root in Butterick’s Practical Typography. There’s not much to it:
(define (root . items)
  (decode (txexpr 'root '() items)
          #:txexpr-elements-proc decode-paragraphs
          #:block-txexpr-proc (compose1 hyphenate wrap-hanging-quotes)
          #:string-proc (compose1 smart-quotes smart-dashes)
          #:exclude-tags '(style script)))
The hyphenate function is not part of Pollen, but rather the hyphenate package, which you can install separately.
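For experimenting without any extra packages, a pared-down variant of this root function might look like the following. It is only a sketch: it keeps the paragraph decoding and the tag exclusions, and drops the typography helpers.

```racket
#lang racket/base
(require pollen/decode txexpr)
(provide root)

;; Minimal root tag function: wrap paragraphs, leave style/script untouched.
(define (root . elements)
  (decode (txexpr 'root '() elements)
          #:txexpr-elements-proc decode-paragraphs
          #:exclude-tags '(style script)))

;; Calling root by hand, the way Pollen would when compiling a source file.
(define result (root "First para." "\n\n" "Second para."))
;; result is '(root (p "First para.") (p "Second para."))
```

In a real project this definition would normally live in pollen.rkt, so that every markup source picks it up.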
This illustrates another important point: even though decode presents an imposing list of arguments, you’re unlikely to use all of them at once. These represent possibilities, not requirements. For instance, let’s see what happens when decode is invoked without any of its optional arguments.
> (define tx '(root "I wonder" (em "why") "this works."))
> (decode tx)
'(root "I wonder" (em "why") "this works.")
Right — nothing. That’s because the default value for the decoding arguments is the identity function, (λ (x) x). So all the input gets passed through intact unless another action is specified.
The *-proc arguments of decode take procedures that are applied to specific categories of elements within txexpr.
The txexpr-tag-proc argument is a procedure that handles X-expression tags.
> (define tx '(p "I'm from a strange" (strong "namespace")))
; Tags are symbols, so a tag-proc should return a symbol
> (decode tx #:txexpr-tag-proc (λ (t) (string->symbol (format "ns:~a" t))))
'(ns:p "I'm from a strange" (ns:strong "namespace"))
The txexpr-attrs-proc argument is a procedure that handles lists of X-expression attributes. (The txexpr module, included at no extra charge with Pollen, includes useful helper functions for dealing with these attribute lists.)
> (define tx '(p ((id "first")) "If I only had a brain."))
; Attrs is a list, so cons is OK for simple cases
> (decode tx #:txexpr-attrs-proc (λ (attrs) (cons '[class "PhD"] attrs)))
'(p ((class "PhD") (id "first")) "If I only had a brain.")
Note that txexpr-attrs-proc will change the attributes of every tagged X-expression, even those that don’t have attributes. This is useful, because sometimes you want to add attributes where none existed before. But be careful, because the behavior may make your processing function overinclusive.
> (define tx '(div (p ((id "first")) "If I only had a brain.") (p "Me too.")))
; This will insert the new attribute everywhere
> (decode tx #:txexpr-attrs-proc (λ (attrs) (cons '[class "PhD"] attrs)))
'(div
  ((class "PhD"))
  (p ((class "PhD") (id "first")) "If I only had a brain.")
  (p ((class "PhD")) "Me too."))
; This will add the new attribute only to non-null attribute lists
> (decode tx #:txexpr-attrs-proc (λ (attrs) (if (null? attrs) attrs (cons '[class "PhD"] attrs))))
'(div (p ((class "PhD") (id "first")) "If I only had a brain.") (p "Me too."))
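A narrower test is also possible with the helper functions from the txexpr module. The sketch below uses attrs-have-key? to add a class only where none exists yet. The add-class-once name is made up for this example.

```racket
#lang racket/base
(require pollen/decode txexpr)

(define tx '(div (p ((id "first")) "If I only had a brain.")
                 (p ((class "MD")) "Me too.")))

;; Hypothetical attrs-proc: leave any existing class attribute alone.
(define (add-class-once attrs)
  (if (attrs-have-key? attrs 'class)
      attrs
      (cons '(class "PhD") attrs)))

(define result (decode tx #:txexpr-attrs-proc add-class-once))
;; result is
;; '(div ((class "PhD"))
;;       (p ((class "PhD") (id "first")) "If I only had a brain.")
;;       (p ((class "MD")) "Me too."))
```

Because the procedure only sees the attribute list and not the tag, a decision that depends on the tag itself belongs in txexpr-proc instead.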
The txexpr-elements-proc argument is a procedure that operates on the list of elements that represents the content of each tagged X-expression. Note that each element of an X-expression is subject to two passes through the decoder: once now, as a member of the list of elements, and also later, through its type-specific decoder (i.e., string-proc, entity-proc, and so on).
> (define tx '(div "Double" "\n" "toil" amp "trouble"))
; Every element gets doubled ...
> (decode tx #:txexpr-elements-proc (λ (es) (append-map (λ (e) (list e e)) es)))
'(div "Double" "Double" "\n" "\n" "toil" "toil" amp amp "trouble" "trouble")
; ... but only strings get capitalized
> (decode tx #:txexpr-elements-proc (λ (es) (append-map (λ (e) (list e e)) es))
             #:string-proc (λ (s) (string-upcase s)))
'(div "DOUBLE" "DOUBLE" "\n" "\n" "TOIL" "TOIL" amp amp "TROUBLE" "TROUBLE")
So why do you need txexpr-elements-proc? Because some types of element decoding depend on context, so the elements must be handled as a group. Paragraph decoding is one example: it is not merely a map across each element, because elements are removed and altered contextually:
> (define (paras tx) (decode tx #:txexpr-elements-proc decode-paragraphs))
; Context matters. Trailing whitespace is ignored ...
> (paras '(body "The first paragraph." "\n\n"))
'(body "The first paragraph.")
; ... but whitespace between strings is converted to a break.
> (paras '(body "The first paragraph." "\n\n" "And another."))
'(body (p "The first paragraph.") (p "And another."))
; A combination of both types
> (paras '(body "The first paragraph." "\n\n" "And another." "\n\n"))
'(body (p "The first paragraph.") (p "And another."))
The txexpr-proc, block-txexpr-proc, and inline-txexpr-proc arguments are procedures that operate on tagged X-expressions. If the X-expression meets the block-txexpr? test, it’s processed by block-txexpr-proc. Otherwise, it’s inline, so it’s processed by inline-txexpr-proc. (Careful, however — these aren’t mutually exclusive, because block-txexpr-proc operates on all the elements of a block, including other tagged X-expressions within.) Then both categories are processed by txexpr-proc.
> (define tx '(div "Please" (em "mind the gap") (h1 "Tuesdays only")))
> (define add-ns
    (λ (tx) (txexpr (string->symbol (format "ns:~a" (get-tag tx)))
                    (get-attrs tx)
                    (get-elements tx))))
; div and h1 are block elements, so this will only affect them
> (decode tx #:block-txexpr-proc add-ns)
'(ns:div "Please" (em "mind the gap") (ns:h1 "Tuesdays only"))
; em is an inline element, so this will only affect it
> (decode tx #:inline-txexpr-proc add-ns)
'(div "Please" (ns:em "mind the gap") (h1 "Tuesdays only"))
; this will affect all elements
> (decode tx #:block-txexpr-proc add-ns #:inline-txexpr-proc add-ns)
'(ns:div "Please" (ns:em "mind the gap") (ns:h1 "Tuesdays only"))
; as will this
> (decode tx #:txexpr-proc add-ns)
'(ns:div "Please" (ns:em "mind the gap") (ns:h1 "Tuesdays only"))
The string-proc, entity-proc, and cdata-proc arguments are procedures that operate on X-expressions that are strings, entities, and CDATA, respectively. Deliberately, the output contracts for these procedures accept any kind of X-expression (meaning, the procedure can change the X-expression type).
; A div with string, entity, and cdata elements
> (define tx `(div "Moe" amp 62 ,(cdata #f #f "3 > 2;")))
> (define rulify (λ (x) '(hr)))
; The rulify function is selectively applied to each
> (print (decode tx #:string-proc rulify))
(list 'div '(hr) 'amp 62 (cdata #f #f "3 > 2;"))
> (print (decode tx #:entity-proc rulify))
(list 'div "Moe" '(hr) '(hr) (cdata #f #f "3 > 2;"))
> (print (decode tx #:cdata-proc rulify))
'(div "Moe" amp 62 (hr))
Note that entities come in two flavors, symbolic and numeric, and entity-proc affects both. If you only want to affect one or the other, you can add a test within entity-proc. Symbolic entities can be detected with symbol?, and numeric entities with valid-char?:
> (define tx `(div amp 62))
> (define symbolic-detonate (λ (x) (if (symbol? x) 'BOOM x)))
> (print (decode tx #:entity-proc symbolic-detonate))
'(div BOOM 62)
> (define numeric-detonate (λ (x) (if (valid-char? x) 'BOOM x)))
> (print (decode tx #:entity-proc numeric-detonate))
'(div amp BOOM)
The five previous procedures — block-txexpr-proc, inline-txexpr-proc, string-proc, entity-proc, and cdata-proc — can return either a single X-expression, or a list of X-expressions, which will be spliced into the parent at the same point.
For instance, earlier we saw how to double elements by using txexpr-elements-proc. But you can accomplish the same thing on a case-by-case basis by returning a list of values:
; A div with string, entity, and inline-txexpr elements
> (define tx `(div "Axl" amp (span "Slash")))
> (define doubler (λ (x) (list x x)))
; The doubler function is selectively applied to each type of element
> (print (decode tx #:string-proc doubler))
'(div "Axl" "Axl" amp (span "Slash" "Slash"))
> (print (decode tx #:entity-proc doubler))
'(div "Axl" (amp amp) (span "Slash"))
> (print (decode tx #:inline-txexpr-proc doubler))
'(div "Axl" amp (span "Slash") (span "Slash"))
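Splicing is handy when one string needs to become several elements. As a sketch, the hypothetical dashes->entities procedure below splits each string on a double hyphen and puts an mdash entity between the pieces. Because the returned list always begins with a string, it cannot be mistaken for a tagged X-expression.

```racket
#lang racket/base
(require racket/list pollen/decode)

;; Hypothetical string-proc: "a--b" becomes the three elements "a" mdash "b".
;; A string without "--" comes back as a one-item list and is spliced unchanged.
(define (dashes->entities str)
  (add-between (regexp-split #rx"--" str) 'mdash))

(define result
  (decode '(p "wait--what" (em "ok"))
          #:string-proc dashes->entities))
;; result is '(p "wait" mdash "what" (em "ok"))
```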
Caution: when returning list values, it’s possible to trip over the unavoidable ambiguity between a txexpr? and a list of xexpr?s that happens to begin with a symbolic entity:
; An ambiguous expression
> (define amb '(guitar "player-name"))
> (and (txexpr-elements? amb) (txexpr? amb))
#t
; Ambiguity in context
> (define x '(gnr "Izzy" "Slash"))
> (define rockit (λ (str) (list 'guitar str)))
; Expecting '(gnr guitar "Izzy" guitar "Slash") from next line, but return value will be treated as tagged X-expression
> (decode x #:string-proc rockit)
'(gnr (guitar "Izzy") (guitar "Slash"))
; Changing the order makes it unambiguous
> (define rockit2 (λ (str) (list str 'guitar)))
> (decode x #:string-proc rockit2)
'(gnr "Izzy" guitar "Slash" guitar)
The tags-to-exclude argument is a list of tags that will be exempted from decoding. Though you could get the same result by testing the input within the individual decoding functions, that’s tedious and potentially slower.
> (define tx '(p "I really think" (em "italics") "should be lowercase."))
> (decode tx #:string-proc string-upcase)
'(p "I REALLY THINK" (em "ITALICS") "SHOULD BE LOWERCASE.")
> (decode tx #:string-proc string-upcase #:exclude-tags '(em))
'(p "I REALLY THINK" (em "italics") "SHOULD BE LOWERCASE.")
The tags-to-exclude argument is useful if you’re decoding source that’s destined to become HTML. According to the HTML spec, material within a <style> or <script> block needs to be preserved literally. In this example, if the CSS and JavaScript blocks are capitalized, they won’t work. So exclude '(style script), and problem solved.
> (define tx '(body (h1 ((class "Red")) "Let's visit Planet Telex.")
                    (style ((type "text/css")) ".Red {color: green;}")
                    (script ((type "text/javascript")) "var area = h * w;")))
> (decode tx #:string-proc string-upcase)
'(body
(h1 ((class "Red")) "LET'S VISIT PLANET TELEX.")
(style ((type "text/css")) ".RED {COLOR: GREEN;}")
(script ((type "text/javascript")) "VAR AREA = H * W;"))
> (decode tx #:string-proc string-upcase #:exclude-tags '(style script))
'(body
(h1 ((class "Red")) "LET'S VISIT PLANET TELEX.")
(style ((type "text/css")) ".Red {color: green;}")
(script ((type "text/javascript")) "var area = h * w;"))
Finally, the attrs-to-exclude argument works the same way as tags-to-exclude, but instead of excluding an element based on its tag, it excludes based on whether the element has a matching attribute/value pair.
> (define tx '(p (span "No attrs") (span ((id "foo")) "One attr")))
> (decode tx #:string-proc string-upcase)
'(p (span "NO ATTRS") (span ((id "foo")) "ONE ATTR"))
> (decode tx #:string-proc string-upcase #:exclude-attrs '((id "foo")))
'(p (span "NO ATTRS") (span ((id "foo")) "One attr"))
procedure
(decode-elements elements
                 [#:txexpr-tag-proc txexpr-tag-proc
                  #:txexpr-attrs-proc txexpr-attrs-proc
                  #:txexpr-elements-proc txexpr-elements-proc
                  #:txexpr-proc txexpr-proc
                  #:block-txexpr-proc block-txexpr-proc
                  #:inline-txexpr-proc inline-txexpr-proc
                  #:string-proc string-proc
                  #:entity-proc entity-proc
                  #:cdata-proc cdata-proc
                  #:exclude-tags tags-to-exclude
                  #:exclude-attrs attrs-to-exclude]) → (or/c xexpr/c (listof xexpr/c))
elements : txexpr-elements?
txexpr-tag-proc : (txexpr-tag? . -> . txexpr-tag?) = (λ (tag) tag)
txexpr-attrs-proc : (txexpr-attrs? . -> . txexpr-attrs?) = (λ (attrs) attrs)
txexpr-elements-proc : (txexpr-elements? . -> . txexpr-elements?) = (λ (elements) elements)
txexpr-proc : (txexpr? . -> . (or/c xexpr? (listof xexpr?))) = (λ (tx) tx)
block-txexpr-proc : (block-txexpr? . -> . (or/c xexpr? (listof xexpr?))) = (λ (tx) tx)
inline-txexpr-proc : (txexpr? . -> . (or/c xexpr? (listof xexpr?))) = (λ (tx) tx)
string-proc : (string? . -> . (or/c xexpr? (listof xexpr?))) = (λ (str) str)
entity-proc : ((or/c symbol? valid-char?) . -> . (or/c xexpr? (listof xexpr?))) = (λ (ent) ent)
cdata-proc : (cdata? . -> . (or/c xexpr? (listof xexpr?))) = (λ (cdata) cdata)
tags-to-exclude : (listof txexpr-tag?) = null
attrs-to-exclude : txexpr-attrs? = null
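Since decode-elements takes a list of elements rather than a whole tagged X-expression, one plausible use is inside a tag function, which receives its contents in exactly that form. The shout tag below is invented for this sketch.

```racket
#lang racket/base
(require pollen/decode txexpr)

;; Hypothetical tag function: upcase every string inside the tag,
;; then wrap the decoded elements in a strong element.
(define (shout . elements)
  (txexpr 'strong '()
          (decode-elements elements #:string-proc string-upcase)))

(define result (shout "hey " '(em "you")))
;; result is '(strong "HEY " (em "YOU"))
```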
procedure
(block-txexpr? v) → boolean?
v : any/c
This predicate affects the behavior of other functions. For instance, decode-paragraphs knows that block elements in the markup shouldn’t be wrapped in a p tag. So if you introduce a new block element called bloq without configuring it as a block, misbehavior will follow:
> (define (paras tx) (decode tx #:txexpr-elements-proc decode-paragraphs))
> (paras '(body "I want to be a paragraph." "\n\n" (bloq "But not me.")))
'(body (p "I want to be a paragraph.") (p (bloq "But not me.")))
; Wrong: bloq should not be wrapped
To change how this test works, use a setup submodule as described in How to override setup values:
(module setup racket/base
  (provide (all-defined-out))
  (require pollen/setup)
  (define block-tags (cons 'bloq default-block-tags)))
After that change, the result will be:
'(body (p "I want to be a paragraph.") (bloq "But not me."))
The default block tags are:
root address article aside blockquote body canvas dd div dl fieldset figcaption figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr li main nav noscript ol output p pre section table tfoot ul video
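A quick way to check how a given expression will be classified is to call the predicate directly. This sketch assumes it is run with the default block tags, meaning no setup submodule overrides them.

```racket
#lang racket/base
(require pollen/decode)

;; div appears in the default block tags; em does not.
(define div-is-block (block-txexpr? '(div "A block")))
(define em-is-block (block-txexpr? '(em "Inline")))

;; The argument contract is any/c, so non-txexpr values simply test false.
(define string-is-block (block-txexpr? "just a string"))
```

Under these assumptions, div-is-block is #t and the other two are #f.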
procedure
(merge-newlines elements) → (listof xexpr?)
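elements : (listof xexpr?)
Merges sequential newline strings within elements into a single string. As the example shows, the merge also applies inside nested tagged X-expressions.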
> (merge-newlines '(p "\n" "\n" "foo" "\n" "\n\n" "bar" (em "\n" "\n" "\n")))
'(p "\n\n" "foo" "\n\n\n" "bar" (em "\n\n\n"))
procedure
(decode-linebreaks elements [linebreaker]) → (listof xexpr?)
elements : (listof xexpr?)
linebreaker : (or/c #f xexpr? (xexpr? xexpr? . -> . (or/c #f xexpr?))) = '(br)
The linebreak separator is controlled by setup:linebreak-separator, and defaults to "\n".
The linebreaker argument can either be #f (which will delete the linebreaks), an X-expression (which will replace the linebreaks), or a function that takes two X-expressions and returns one. This function will receive the previous and next elements, to make contextual substitution possible.
> (decode-linebreaks '(div "Two items:" "\n" (em "Eggs") "\n" (em "Bacon")))
'(div "Two items:" (br) (em "Eggs") (br) (em "Bacon"))
> (decode-linebreaks '(div "Two items:" "\n" (em "Eggs") "\n" (em "Bacon")) #f)
'(div "Two items:" (em "Eggs") (em "Bacon"))
> (decode-linebreaks '(div "Two items:" "\n" (div "Eggs") "\n" (div "Bacon")))
'(div "Two items:" (div "Eggs") (div "Bacon"))
> (decode-linebreaks '(div "Two items:" "\n" (em "Eggs") "\n" (em "Bacon"))
                     (λ (prev next)
                       (if (and (txexpr? prev) (member "Eggs" prev))
                           '(egg-br)
                           '(br))))
'(div "Two items:" (br) (em "Eggs") (egg-br) (em "Bacon"))
procedure
(decode-paragraphs elements
                   [paragraph-wrapper
                    #:linebreak-proc linebreak-proc
                    #:force? force-paragraph?]) → (listof xexpr?)
elements : (listof xexpr?)
paragraph-wrapper : (or/c txexpr-tag? ((listof xexpr?) . -> . txexpr?)) = 'p
linebreak-proc : ((listof xexpr?) . -> . (listof xexpr?)) = decode-linebreaks
force-paragraph? : boolean? = #f
What counts as a paragraph? Any elements that are either a) explicitly set apart with a paragraph separator, or b) adjacent to a block-txexpr? (in which case the paragraph-ness is implied).
The paragraph separator is controlled by setup:paragraph-separator, and defaults to "\n\n".
> (decode-paragraphs '("Explicit para" "\n\n" "Explicit para"))
'((p "Explicit para") (p "Explicit para"))
> (decode-paragraphs '("Explicit para" "\n\n" "Explicit para" "\n" "Explicit line"))
'((p "Explicit para") (p "Explicit para" (br) "Explicit line"))
> (decode-paragraphs '("Implied para" (div "Block") "Implied para"))
'((p "Implied para") (div "Block") (p "Implied para"))
If an element is already a block, it will not be wrapped as a paragraph (because the wrapping would be superfluous). Consequently, if the paragraph separator occurs between two blocks, it is ignored (as in the example below using two sequential div blocks). Likewise, the paragraph separator is ignored if it occurs between a block and a non-block (because a paragraph break is already implied).
; The explicit "\n\n" makes no difference in these cases
> (decode-paragraphs '((div "First block") "\n\n" (div "Second block")))
'((div "First block") (div "Second block"))
> (decode-paragraphs '((div "First block") (div "Second block")))
'((div "First block") (div "Second block"))
> (decode-paragraphs '("Para" "\n\n" (div "Block")))
'((p "Para") (div "Block"))
> (decode-paragraphs '("Para" (div "Block")))
'((p "Para") (div "Block"))
The paragraph-wrapper argument can be either a tag, or a function that takes a list of elements and returns one tagged X-expression. This function will receive the elements of the paragraph, to make contextual wrapping possible.
> (decode-paragraphs '("First para" "\n\n" "Second para") 'ns:p)
'((ns:p "First para") (ns:p "Second para"))
> (decode-paragraphs '("First para" "\n\n" "Second para") (λ (elems) `(ns:p ,@elems "!?!")))
'((ns:p "First para" "!?!") (ns:p "Second para" "!?!"))
The linebreak-proc argument lets you use a linebreaking procedure other than the usual decode-linebreaks.
> (decode-paragraphs '("First para" "\n\n" "Second para" "\n" "Second line")
                     #:linebreak-proc (λ (x) (decode-linebreaks x '(newline))))
'((p "First para") (p "Second para" (newline) "Second line"))
The #:force? option will wrap a paragraph tag around elements, even if no explicit or implicit paragraph breaks are found. The #:force? option is useful for when you want to guarantee that you always get a list of blocks.
> (decode-paragraphs '("This" (span "will not be") "a paragraph"))
'("This" (span "will not be") "a paragraph")
> (decode-paragraphs '("But this" (span "will be") "a paragraph") #:force? #t)
'((p "But this" (span "will be") "a paragraph"))