4 Search (SXPath)

The W3C "XPath" standard describes a standardized way to perform searches in XML documents. For instance, the XPath string "/A/B/C" (thank you, Wikipedia) describes a search for a C element whose parent is a B element whose parent is an A element that is the root of the document.

The sxpath function performs a similar search over SXML data, using either the standard XPath strings or a list of Racket values.

> (require sxml/sxpath)
> ((sxpath "/A/B/C")
   '(*TOP* (A (B (C)))))

'((C))

> ((sxpath '(A B C))
   '(*TOP* (A (B (C)))))

'((C))

> ((sxpath "//p[contains(@class, 'blue')]/text()")
   '(*TOP*
     (body
      (p (@ (class "blue")) "P1")
      (p (@ (class "blue green")) "P2")
      (p (@ (class "red")) "P3"))))

'("P1" "P2")

(This documentation desperately needs more examples.)

Let’s consider the following XML document:

"<AAA>
  <BBB>
     <CCC/>
     <www> www content <xxx/><www>
     <zzz/>
  </BBB>
  <XXX>
     <DDD> content in ccc
     </DDD>
  </XXX>
</AAA>"

If we use Neil Van Dyke’s HTML parser, we might parse this into the following SXML document:

> (define example-doc
    '(*TOP*
      (aaa "\n" "  "
        (bbb "\n" "     "
          (ccc) "\n" "     "
          (www " www content " (xxx) (www "\n" "     " (zzz) "\n" "  ")))
        "\n" "  "
        (xxx "\n" "     " (ddd " content in ccc \n" "     ") "\n" "  ")
        "\n")))

procedure

(sxpath path [ns-bindings])  (-> (or/c node nodeset?) nodeset?)

  path : (or/c list? string?)
  ns-bindings : (listof (cons/c symbol? string?)) = '()
Given a representation of a path, produces a procedure that accepts an SXML document and returns a list of matches. Path representations are interpreted according to the following rewrite rules.

(sxpath '())                        ⇒ (node-join)
(sxpath (cons path-component0
              path-components))     ⇒ (node-join (sxpath1 path-component0)
                                                 (sxpath path-components))

(sxpath1 '//)                       ⇒ (sxml:descendant-or-self sxml:node?)
(sxpath1 `(equal? ,x))              ⇒ (select-kids (node-equal? x))
(sxpath1 `(eq? ,x))                 ⇒ (select-kids (node-eq? x))
(sxpath1 `(*or* ,p ...))            ⇒ (select-kids (ntype-names?? `(,p ...)))
(sxpath1 `(*not* ,p ...))           ⇒ (select-kids
                                       (sxml:complement
                                        (ntype-names?? `(,p ...))))
(sxpath1 `(ns-id:* ,x))             ⇒ (select-kids (ntype-namespace-id?? x))
(sxpath1 symbol)                    ⇒ (select-kids (ntype?? symbol))
(sxpath1 string)                    ⇒ (txpath string)
(sxpath1 procedure)                 ⇒ procedure
(sxpath1 `(,symbol ,reducer ...))   ⇒ (sxpath1 `((,symbol) ,reducer ...))
(sxpath1 `(,path ,reducer ...))     ⇒ (node-reduce (sxpath path)
                                                   (sxpathr reducer) ...)

(sxpathr number)                    ⇒ (node-pos number)
(sxpathr path)                      ⇒ (sxml:filter (sxpath path))
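
As a rough illustration of the rules for symbols, a path made only of symbols can be expanded by hand into a node-join of select-kids steps. This is a sketch of the rewrite rules as stated, not necessarily how the library implements sxpath internally. It assumes that (require sxml) provides every procedure used; table-doc, by-hand, via-sxpath and via-hand are names made up for the example.

```racket
#lang racket/base
(require sxml) ; assumption: the top-level sxml module exports all names used here

;; A small document, made up for this example.
(define table-doc
  '(*TOP*
    (table
     (tr (td "a") (td "b"))
     (tr (td "c") (td "d")))))

;; (sxpath '(table tr td)) expanded by hand: each symbol becomes
;; (select-kids (ntype?? symbol)), and the steps are chained with node-join.
(define by-hand
  (node-join (select-kids (ntype?? 'table))
             (select-kids (ntype?? 'tr))
             (select-kids (ntype?? 'td))))

(define via-sxpath ((sxpath '(table tr td)) table-doc))
(define via-hand (by-hand table-doc))

via-sxpath                   ; '((td "a") (td "b") (td "c") (td "d"))
(equal? via-sxpath via-hand) ; #t
```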

To extract the xxx’s inside the aaa from the example document:

> ((sxpath '(aaa xxx)) example-doc)

'((xxx "\n" "     " (ddd " content in ccc \n" "     ") "\n" "  "))

To extract all cells from an HTML table:

> (define table
    `(*TOP*
      (table
       (tr (td "a") (td "b"))
       (tr (td "c") (td "d")))))
> ((sxpath '(table tr td)) table)

'((td "a") (td "b") (td "c") (td "d"))

To extract all cells anywhere in a document:

> (define table
    `(*TOP*
      (div
       (p (table
           (tr (td "a") (td "b"))
           (tr (td "c") (td "d"))))
       (table
        (tr (td "e"))))))
> ((sxpath '(// td)) table)

'((td "a") (td "b") (td "c") (td "d") (td "e"))

One result may be nested in another one:

> (define doc
    `(*TOP*
      (div
       (p (div "3")
          (div (div "4"))))))
> ((sxpath '(// div)) doc)

'((div (p (div "3") (div (div "4")))) (div "3") (div (div "4")) (div "4"))

There’s also a string-based syntax, txpath. As shown in the grammar above, sxpath assumes that any strings in the path are expressed using the txpath syntax.

So, for instance, the prior example could be rewritten using a string:

> (define doc
    `(*TOP*
      (div
       (p (div "3")
          (div (div "4"))))))
> ((sxpath "//div") doc)

'((div (p (div "3") (div (div "4")))) (div "3") (div (div "4")) (div "4"))

More generally, a list of steps in the s-expression syntax corresponds to the same steps written one after another in the txpath string, with "/" separating child steps.
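
To make the correspondence concrete, the sketch below runs the same search once as a list and once as a string. It assumes (require sxml) provides sxpath; from-list and from-string are illustrative names.

```racket
#lang racket/base
(require sxml) ; assumption: sxpath is available from the top-level sxml module

(define doc
  '(*TOP*
    (div
     (p (i "3")
        (froogy (i "4"))))))

;; The list '(// p i) and the string "//p/i" describe the same search:
;; any p element in the document, then its direct i children.
(define from-list ((sxpath '(// p i)) doc))
(define from-string ((sxpath "//p/i") doc))

from-list                      ; '((i "3"))
(equal? from-list from-string) ; #t
```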

So, to find all italics that appear at top level within a paragraph:

> (define doc
    `(*TOP*
      (div
       (p (i "3")
          (froogy (i "4"))))))
> ((sxpath "//p/i") doc)

'((i "3"))

Handling of namespaces in sxpath is a bit surprising. In particular, it appears to me that sxpath’s model is that namespaces must appear fully expanded in the matched source. For instance:

> ((sxpath "//ns:p" `((ns . "http://example.com")))
   '(*TOP* (html (http://example.com:body
                  (http://example.com:p "first para")
                  (http://example.com:p
                   "second para containing"
                   (http://example.com:p "third para") "inside it")))))

'((http://example.com:p "first para")
  (http://example.com:p
   "second para containing"
   (http://example.com:p "third para")
   "inside it")
  (http://example.com:p "third para"))

But the corresponding example where the source document contains a namespace shortcut does not match in the same way. That is:

> ((sxpath "//ns:p" `((ns . "http://example.com")))
   '(*TOP* (@ (*NAMESPACES* (ns "http://example.com")))
           (html (ns:body (ns:p "first para")
                          (ns:p "second para containing"
                                (ns:p "third para") "inside it")))))

'()

It produces the empty list. Instead, you must pretend that the shortcut is actually the namespace. Thus:

> ((sxpath "//ns:p" `((ns . "ns")))
   '(*TOP* (@ (*NAMESPACES* (ns "http://example.com")))
           (html (ns:body (ns:p "first para")
                          (ns:p "second para containing"
                                (ns:p "third para") "inside it")))))

'((ns:p "first para")
  (ns:p "second para containing" (ns:p "third para") "inside it")
  (ns:p "third para"))

Ah well.

procedure

(txpath xpath-location-path [ns-bindings])

  (-> (or/c node nodeset?) nodeset?)
  xpath-location-path : string?
  ns-bindings : (listof (cons/c symbol? string?)) = '()
Like sxpath, but only accepts an XPath query in string form, using the standard XPath syntax.

Deprecated; use sxpath instead.

An sxml-converter is a function
(-> (or/c node nodeset?)
    nodeset?)
that is, it takes nodes or nodesets to nodesets. An sxml-converter-as-predicate is an sxml-converter used as a predicate; a return value of '() indicates false.
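
A small sketch of the converter-as-predicate idea: the procedure returned by sxpath has the converter shape described above, so its result can be read as a truth value. The name has-class? is made up for this example, and (require sxml) is assumed to provide sxpath.

```racket
#lang racket/base
(require sxml) ; assumption: sxpath is available from the top-level sxml module

;; has-class? (a hypothetical name) selects the class attribute of a node.
;; Used as a predicate, a non-empty result counts as true and '() as false.
(define has-class? (sxpath '(@ class)))

(define with-class (has-class? '(p (@ (class "blue")) "P1")))
(define without-class (has-class? '(p "P3")))

with-class    ; '((class "blue")), treated as true
without-class ; '(), treated as false
```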

procedure

(nodeset? v)  boolean?

  v : any/c
Returns #t if v is a list of nodes (that is, a list that does not start with a symbol).

Examples:
> (nodeset? '(p "blah"))

#f

> (nodeset? '((p "blah") (br) "more"))

#t

procedure

(as-nodeset v)  nodeset?

  v : any/c
If v is a nodeset, returns v, otherwise returns (list v).

Examples:
> (as-nodeset '(p "blah"))

'((p "blah"))

> (as-nodeset '((p "blah") (br) "more"))

'((p "blah") (br) "more")

procedure

(node-eq? v)  (-> any/c boolean?)

  v : any/c
Curried eq?.

procedure

(node-equal? v)  (-> any/c boolean?)

  v : any/c
Curried equal?.
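
The difference between the two curried predicates shows up when a node has the same contents as another but is a different object. In this sketch, item, same-shape and the other defined names are illustrative, and (require sxml) is assumed to provide the procedures.

```racket
#lang racket/base
(require sxml) ; assumption: node-eq?, node-equal?, select-kids come from sxml

(define item (list 'li "b"))       ; a freshly allocated node
(define same-shape (list 'li "b")) ; equal contents, but a different object

(define eq-same ((node-eq? item) item))             ; #t: the very same object
(define eq-copy ((node-eq? item) same-shape))       ; #f: different objects
(define equal-copy ((node-equal? item) same-shape)) ; #t: same structure

;; node-equal? can serve as the predicate for select-kids.
(define picked
  ((select-kids (node-equal? '(li "b"))) '(ul (li "a") (li "b"))))

picked ; '((li "b"))
```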

procedure

(node-pos n)  sxml-converter

  n : (or/c exact-positive-integer? exact-negative-integer?)
Returns a converter that selects the nth element (counting from 1, not 0) of a nodelist and returns it as a singleton nodelist. If n is negative, it selects from the right: -1 selects the last node, and so forth.

Examples:
> ((node-pos 2) '((a) (b) (c) (d) (e)))

'((b))

> ((node-pos -1) '((a) (b) (c)))

'((c))

procedure

(sxml:filter pred)  sxml-converter

  pred : sxml-converter-as-predicate

procedure

(sxml:complement pred)  sxml-converter-as-predicate

  pred : sxml-converter-as-predicate
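
Neither procedure is described here, so the following is only a sketch of how they appear to behave: sxml:filter keeps the members of a nodeset that satisfy pred, and sxml:complement negates a predicate. The names nodes, paragraphs and others are made up, and (require sxml) is assumed to provide the procedures.

```racket
#lang racket/base
(require sxml) ; assumption: sxml:filter, sxml:complement, ntype?? come from sxml

(define nodes '((p "a") (br) (p "b")))

;; Keep only the p elements.
(define paragraphs ((sxml:filter (ntype?? 'p)) nodes))

;; Keep everything that is not a p element.
(define others ((sxml:filter (sxml:complement (ntype?? 'p))) nodes))

paragraphs ; '((p "a") (p "b"))
others     ; '((br))
```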

procedure

(select-kids pred)  sxml-converter

  pred : sxml-converter-as-predicate
Returns a converter that selects an (ordered) subset of the children of the given node (or the children of the members of the given nodelist) satisfying pred.

Examples:
> ((select-kids (ntype?? 'p)) '(p "blah"))

'()

> ((select-kids (ntype?? '*text*)) '(p "blah"))

'("blah")

> ((select-kids (ntype?? 'p)) (list '(p "blah") '(br) '(p "blahblah")))

'()

procedure

(select-first-kid pred)

  (-> (or/c node nodeset?) (or/c node #f))
  pred : sxml-converter-as-predicate
Like select-kids but returns only the first one, or #f if none.
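
A short sketch contrasting the two on a single node. Here row, first-cell, no-header and all-cells are illustrative names, and (require sxml) is assumed to provide the procedures.

```racket
#lang racket/base
(require sxml) ; assumption: select-first-kid, select-kids, ntype?? come from sxml

(define row '(tr (td "a") (td "b")))

;; select-first-kid returns a node (or #f), not a nodeset.
(define first-cell ((select-first-kid (ntype?? 'td)) row))
(define no-header ((select-first-kid (ntype?? 'th)) row))

;; select-kids returns every matching child as a nodeset.
(define all-cells ((select-kids (ntype?? 'td)) row))

first-cell ; '(td "a")
no-header  ; #f
all-cells  ; '((td "a") (td "b"))
```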

procedure

(node-self pred)  sxml-converter

  pred : sxml-converter-as-predicate
Returns a function that, when applied to node, returns (list node) if (pred node) is neither #f nor '(), and otherwise returns '().

Examples:
> ((node-self (ntype?? 'p)) '(p "blah"))

'((p "blah"))

> ((node-self (ntype?? 'p)) '(br))

'()

procedure

(node-join selector)  sxml-converter

  selector : sxml-converter

procedure

(node-reduce converter)  sxml-converter

  converter : sxml-converter

procedure

(node-or converter)  sxml-converter

  converter : sxml-converter

procedure

(node-closure converter)  sxml-converter

  converter : sxml-converter
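
These four combinators are not described here, so the sketch below only illustrates how they appear to behave. It assumes that node-join and node-reduce accept several converters (as the sxpath rewrite rules use them), that node-or does the same, and that node-closure takes a predicate such as (ntype?? 'div). All defined names are made up for the example.

```racket
#lang racket/base
(require sxml) ; assumption: all combinators below are exported by sxml

(define table-node '(table (tr (td "a") (td "b")) (tr (td "c"))))

;; node-join chains steps, applying each one to every node selected so far.
(define cells
  ((node-join (select-kids (ntype?? 'tr))
              (select-kids (ntype?? 'td)))
   table-node))

;; node-reduce feeds the whole result of one converter to the next.
(define second-cell
  ((node-reduce (select-kids (ntype?? 'td))
                (node-pos 2))
   '(tr (td "a") (td "b"))))

;; node-or appends the results of each converter, in the order given.
(define b-then-i
  ((node-or (select-kids (ntype?? 'b))
            (select-kids (ntype?? 'i)))
   '(p (i "1") (b "2") (i "3"))))

;; node-closure collects matching descendants at any depth.
(define nested-divs
  ((node-closure (ntype?? 'div))
   '(div (p (div "3") (div (div "4"))))))

cells       ; '((td "a") (td "b") (td "c"))
second-cell ; '((td "b"))
b-then-i    ; '((b "2") (i "1") (i "3"))
nested-divs ; '((div "3") (div (div "4")) (div "4"))
```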

XPath axes and accessors.

The following procedures depend explicitly on the root node.

procedure

((sxml:parent pred) root)  sxml-converter

  pred : sxml-converter-as-predicate
  root : node

procedure

(node-parent root)  sxml-converter

  root : node
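
A sketch of looking up a parent, under the assumption that the node passed in is the very object found inside the root document (here it is taken from the document with sxpath), since the lookup appears to rely on object identity. The names doc, first-cell, row, no-match and any-parent are illustrative.

```racket
#lang racket/base
(require sxml) ; assumption: sxml:parent, node-parent, sxpath come from sxml

(define doc
  '(*TOP*
    (table
     (tr (td "a") (td "b"))
     (tr (td "c")))))

;; Take a node out of the document itself rather than building a copy.
(define first-cell (car ((sxpath '(// td)) doc)))

;; The parent, kept only if it satisfies the predicate.
(define row (((sxml:parent (ntype?? 'tr)) doc) first-cell))
(define no-match (((sxml:parent (ntype?? 'table)) doc) first-cell))

;; node-parent takes only the root and applies no predicate.
(define any-parent ((node-parent doc) first-cell))

row        ; '((tr (td "a") (td "b")))
no-match   ; '()
any-parent ; '((tr (td "a") (td "b")))
```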

procedure

((sxml:ancestor pred) root)  sxml-converter

  pred : sxml-converter-as-predicate
  root : node

procedure

((sxml:ancestor-or-self pred) root)  sxml-converter

  pred : sxml-converter-as-predicate
  root : node

procedure

((sxml:following pred) root)  sxml-converter

  pred : sxml-converter-as-predicate
  root : node

procedure

((sxml:following-sibling pred) root)  sxml-converter

  pred : sxml-converter-as-predicate
  root : node
Gosh, I wish these functions were documented.
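
Going by the sxml:preceding-sibling example later in this section, sxml:following-sibling can presumably be used the same way. The sketch below assumes it mirrors that behavior, returning the siblings that come after the given nodes and satisfy the predicate; doc and after-img are illustrative names.

```racket
#lang racket/base
(require sxml) ; assumption: sxml:following-sibling, sxpath come from sxml

(define doc '(*TOP* (div (p "foo") (p "bar")
                         (img "baz") (p "quux"))))

;; p elements that are siblings of the img element and come after it.
(define after-img
  (((sxml:following-sibling (ntype?? 'p)) doc) ((sxpath '(// img)) doc)))

after-img ; '((p "quux"))
```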

procedure

((sxml:preceding pred) root)  sxml-converter

  pred : sxml-converter-as-predicate
  root : node
Given a predicate and a root node, returns a procedure that accepts a nodeset and returns all nodes that appear before the given nodes in document order, filtered using the predicate.

Here’s an example:

> (((sxml:preceding (ntype?? 'www)) example-doc) ((sxpath `(aaa xxx)) example-doc))

'((www "\n" "     " (zzz) "\n" "  ")
  (www " www content " (xxx) (www "\n" "     " (zzz) "\n" "  ")))

procedure

((sxml:preceding-sibling pred) root)  sxml-converter

  pred : sxml-converter-as-predicate
  root : node
Given a predicate and a root node, returns a procedure that accepts a nodeset and returns all nodes that are preceding siblings (in document order) of the given nodes.

> (define doc '(*TOP* (div (p "foo") (p "bar")
                           (img "baz") (p "quux"))))
> (((sxml:preceding-sibling (ntype?? 'p)) doc) ((sxpath '(// img)) doc))

'((p "bar") (p "foo"))