4 Search (SXPath)
The W3C "XPath" standard defines a way to search XML documents. For instance, the XPath string "/A/B/C" (thank you, Wikipedia) describes a search for a C element whose parent is a B element whose parent is an A element that is the root of the document.
The sxpath function performs a similar search over SXML data, using either the standard XPath strings or a list of Racket values.
> (require sxml/sxpath)
> ((sxpath "/A/B/C") '(*TOP* (A (B (C)))))
'((C))
> ((sxpath '(A B C)) '(*TOP* (A (B (C)))))
'((C))
> ((sxpath "//p[contains(@class, 'blue')]/text()")
   '(*TOP* (body (p (@ (class "blue")) "P1")
                 (p (@ (class "blue green")) "P2")
                 (p (@ (class "red")) "P3"))))
'("P1" "P2")
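The list form can reach text and attribute values too. Here is a small sketch that reuses the document from the contains() example; the names colors-doc, p-texts and p-classes are made up for illustration, and it assumes the sxml main module re-exports sxpath.

```racket
#lang racket/base
(require sxml)

;; Same document as in the contains() example (name is illustrative).
(define colors-doc
  '(*TOP* (body (p (@ (class "blue")) "P1")
                (p (@ (class "blue green")) "P2")
                (p (@ (class "red")) "P3"))))

;; // visits every node, p keeps child elements named p,
;; *text* keeps the string children of those elements.
(define p-texts ((sxpath '(// p *text*)) colors-doc))
;; p-texts is '("P1" "P2" "P3")

;; @ selects the attribute list, class picks the attribute,
;; *text* extracts its value.
(define p-classes ((sxpath '(// p @ class *text*)) colors-doc))
;; p-classes is '("blue" "blue green" "red")
```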
(This documentation desperately needs more examples.)
Consider the following XML document:
"<AAA>
<BBB>
<CCC/>
<www> www content <xxx/><www>
<zzz/>
</BBB>
<XXX>
<DDD> content in ccc
</DDD>
</XXX>
</AAA>"
If we use Neil Van Dyke’s html parser, we might parse this into the following sxml document:
> (define example-doc
    '(*TOP*
      (aaa "\n" " "
           (bbb "\n" " "
                (ccc) "\n" " "
                (www " www content " (xxx) (www "\n" " " (zzz) "\n" " ")))
           "\n" " "
           (xxx "\n" " " (ddd " content in ccc \n" " ") "\n" " ")
           "\n")))
procedure
(sxpath path [ns-bindings]) → (-> (or/c node nodeset?) nodeset?)
path : (or/c list? string?)
ns-bindings : (listof (cons/c symbol? string?)) = '()
[Grammar table: rewriting rules for sxpath path forms, each written as path form ⇒ converter.]
To extract the xxx’s inside the aaa from the example document:
> ((sxpath '(aaa xxx)) example-doc)
'((xxx "\n" " " (ddd " content in ccc \n" " ") "\n" " "))
To extract all cells from an HTML table:
> (define table
    `(*TOP* (table (tr (td "a") (td "b"))
                   (tr (td "c") (td "d")))))
> ((sxpath '(table tr td)) table)
'((td "a") (td "b") (td "c") (td "d"))
To extract all cells anywhere in a document:
> (define table
    `(*TOP* (div (p (table (tr (td "a") (td "b"))
                           (tr (td "c") (td "d"))))
                 (table (tr (td "e"))))))
> ((sxpath '(// td)) table)
'((td "a") (td "b") (td "c") (td "d") (td "e"))
One result may be nested in another one:
> (define doc `(*TOP* (div (p (div "3") (div (div "4"))))))
> ((sxpath '(// div)) doc)
'((div (p (div "3") (div (div "4")))) (div "3") (div (div "4")) (div "4"))
There’s also a string-based syntax, txpath. As shown in the grammar above, sxpath assumes that any strings in the path are expressed using the txpath syntax.
So, for instance, the prior example could be rewritten using a string:
> (define doc `(*TOP* (div (p (div "3") (div (div "4"))))))
> ((sxpath "//div") doc)
'((div (p (div "3") (div (div "4")))) (div "3") (div (div "4")) (div "4"))
More generally, lists in the s-expression syntax correspond to string concatenation in the txpath syntax.
So, to find all italics that appear at top level within a paragraph:
> (define doc `(*TOP* (div (p (i "3") (froogy (i "4"))))))
> ((sxpath "//p/i") doc)
'((i "3"))
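To make the correspondence concrete, here is the same query in both syntaxes, with one list element per step of the string. This is a sketch: the names from-string and from-list are illustrative, and it assumes (require sxml) provides sxpath.

```racket
#lang racket/base
(require sxml)

(define doc '(*TOP* (div (p (i "3") (froogy (i "4"))))))

;; String (txpath) form, as in the example above.
(define from-string ((sxpath "//p/i") doc))

;; List form: //, then child p, then child i.
(define from-list ((sxpath '(// p i)) doc))

;; Both should select only the i directly inside the p: '((i "3"))
```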
Handling of namespaces in sxpath is a bit surprising. In particular, it appears to me that sxpath’s model is that namespaces must appear fully expanded in the matched source. For instance:
> ((sxpath "//ns:p" `((ns . "http://example.com"))) '(*TOP* (html (http://example.com:body (http://example.com:p "first para") (http://example.com:p "second para containing" (http://example.com:p "third para") "inside it")))))
'((http://example.com:p "first para")
(http://example.com:p
"second para containing"
(http://example.com:p "third para")
"inside it")
(http://example.com:p "third para"))
But the corresponding example where the source document contains a namespace shortcut does not match in the same way. That is:
> ((sxpath "//ns:p" `((ns . "http://example.com"))) '(*TOP* (@ (*NAMESPACES* (ns "http://example.com"))) (html (ns:body (ns:p "first para") (ns:p "second para containing" (ns:p "third para") "inside it")))))
'()
It produces the empty list. Instead, you must pretend that the shortcut is actually the namespace. Thus:
> ((sxpath "//ns:p" `((ns . "ns"))) '(*TOP* (@ (*NAMESPACES* (ns "http://example.com"))) (html (ns:body (ns:p "first para") (ns:p "second para containing" (ns:p "third para") "inside it")))))
'((ns:p "first para")
(ns:p "second para containing" (ns:p "third para") "inside it")
(ns:p "third para"))
Ah well.
procedure
(txpath xpath-location-path [ns-bindings]) → (-> (or/c node nodeset?) nodeset?)
xpath-location-path : string?
ns-bindings : (listof (cons/c symbol? string?)) = '()
Deprecated; use sxpath instead.
procedure
(as-nodeset v) → nodeset?
v : any/c
> (as-nodeset '(p "blah"))
'((p "blah"))
> (as-nodeset '((p "blah") (br) "more"))
'((p "blah") (br) "more")
procedure
(node-equal? v) → (-> any/c boolean?)
v : any/c
procedure
(node-pos n) → sxml-converter
n : (or/c exact-positive-integer? exact-negative-integer?)
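A quick sketch of node-pos on a hand-written nodeset. Positions count from 1, and a negative position counts from the end. The names items, first-item, second-item and last-item are illustrative only.

```racket
#lang racket/base
(require sxml)

;; A nodeset is just a list of nodes.
(define items '((li "a") (li "b") (li "c")))

(define first-item  ((node-pos 1) items))   ; '((li "a"))
(define second-item ((node-pos 2) items))   ; '((li "b"))
(define last-item   ((node-pos -1) items))  ; '((li "c"))
```

Note that the result is always a nodeset, even when it holds a single node.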
procedure
(sxml:filter pred) → sxml-converter
pred : sxml-converter-as-predicate
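As a rough illustration, sxml:filter keeps the members of a nodeset that satisfy the predicate; it tests the nodes themselves and does not look at their children. The names nodes and paragraphs are made up for this sketch.

```racket
#lang racket/base
(require sxml)

(define nodes (list '(p "blah") '(br) '(p "blahblah")))

;; Keep only the nodes whose tag is p.
(define paragraphs ((sxml:filter (ntype?? 'p)) nodes))
;; paragraphs is '((p "blah") (p "blahblah"))
```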
procedure
pred : sxml-converter-as-predicate
procedure
(select-kids pred) → sxml-converter
pred : sxml-converter-as-predicate
> ((select-kids (ntype?? 'p)) '(p "blah"))
'()
> ((select-kids (ntype?? '*text*)) '(p "blah"))
'("blah")
> ((select-kids (ntype?? 'p)) (list '(p "blah") '(br) '(p "blahblah")))
'()
procedure
(select-first-kid pred) → (-> (or/c node nodeset?) (or/c node #f))
pred : sxml-converter-as-predicate
procedure
(node-self pred) → sxml-converter
pred : sxml-converter-as-predicate
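The difference from select-kids is easiest to see on a single node: node-self tests the node itself, while select-kids tests its children. A small sketch, with illustrative names:

```racket
#lang racket/base
(require sxml)

(define node '(p "blah"))

;; The node itself is a p, so node-self returns it (wrapped in a nodeset).
(define self-match ((node-self (ntype?? 'p)) node))
;; self-match is '((p "blah"))

;; None of its children is a p, so select-kids returns nothing.
(define kid-match ((select-kids (ntype?? 'p)) node))
;; kid-match is '()
```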
procedure
(node-join selector) → sxml-converter
selector : sxml-converter
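Here is a sketch of node-join chaining two selectors, so that the second runs on the results of the first. It assumes node-join accepts several selectors, as it does in the underlying SXPath library, even though the signature above shows only one. The names a-then-b and joined are illustrative.

```racket
#lang racket/base
(require sxml)

(define doc '(*TOP* (a (b "x") (c "y"))))

;; First select child a elements, then child b elements of those.
(define a-then-b
  (node-join (select-kids (ntype?? 'a))
             (select-kids (ntype?? 'b))))

(define joined (a-then-b doc))
;; joined is '((b "x")), the same nodes that (sxpath '(a b)) selects
```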
procedure
(node-reduce converter) → sxml-converter
converter : sxml-converter
procedure
(node-or converter) → sxml-converter
converter : sxml-converter
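A sketch of node-or combining the results of two converters applied to the same node. As with node-join, this assumes node-or accepts several converters, which is how the underlying SXPath library defines it. The names row and cells are made up.

```racket
#lang racket/base
(require sxml)

(define row '(tr (th "k") (td "v") (note "n")))

;; Collect th children, then td children, of the same node.
(define cells
  ((node-or (select-kids (ntype?? 'th))
            (select-kids (ntype?? 'td)))
   row))
;; cells is '((th "k") (td "v"))
```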
procedure
(node-closure converter) → sxml-converter
converter : sxml-converter
procedure
(sxml:attribute pred) → sxml-converter
pred : sxml-converter-as-predicate
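For illustration, sxml:attribute applies the predicate to the attribute nodes of an element, that is, to the entries of its (@ ...) list. The names para and class-attr are hypothetical.

```racket
#lang racket/base
(require sxml)

(define para '(p (@ (class "blue") (id "intro")) "P1"))

;; Select the attribute node named class.
(define class-attr ((sxml:attribute (ntype?? 'class)) para))
;; class-attr is '((class "blue"))
```

The result is a nodeset of attribute nodes, not of attribute values.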
procedure
(sxml:child pred) → sxml-converter
pred : sxml-converter-as-predicate
value
value
procedure
(sxml:descendant pred) → sxml-converter
pred : sxml-converter-as-predicate
procedure
(sxml:descendant-or-self pred) → sxml-converter
pred : sxml-converter-as-predicate
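The two descendant converters differ only in whether the starting node is itself a candidate. A sketch, with illustrative names:

```racket
#lang racket/base
(require sxml)

(define outer '(div (p (div "3"))))

;; Only nodes strictly below outer are tested.
(define below ((sxml:descendant (ntype?? 'div)) outer))
;; below is '((div "3"))

;; outer itself is tested as well, and comes first.
(define below-or-self ((sxml:descendant-or-self (ntype?? 'div)) outer))
;; below-or-self is '((div (p (div "3"))) (div "3"))
```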
The following procedures depend explicitly on the root node.
procedure
((sxml:parent pred) root) → sxml-converter
pred : sxml-converter-as-predicate
root : node
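SXML nodes carry no parent pointers, which is why the root has to be supplied: the parent is found by searching from the root for the given node. A sketch under that reading, with illustrative names; note that the node passed in must be the very same object that occurs in the document, so it is obtained here with sxpath rather than written out again.

```racket
#lang racket/base
(require sxml)

(define doc '(*TOP* (div (p "foo") (p "bar") (img "baz") (p "quux"))))

;; Take the img node out of the document itself.
(define img (car ((sxpath '(// img)) doc)))

;; Find the element that contains it, searching from doc.
(define img-parent (((sxml:parent (ntype?? '*)) doc) img))
;; img-parent is '((div (p "foo") (p "bar") (img "baz") (p "quux")))
```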
procedure
(node-parent root) → sxml-converter
root : node
procedure
((sxml:ancestor pred) root) → sxml-converter
pred : sxml-converter-as-predicate
root : node
procedure
((sxml:ancestor-or-self pred) root) → sxml-converter
pred : sxml-converter-as-predicate
root : node
procedure
((sxml:following pred) root) → sxml-converter
pred : sxml-converter-as-predicate
root : node
procedure
((sxml:following-sibling pred) root) → sxml-converter
pred : sxml-converter-as-predicate
root : node
procedure
((sxml:preceding pred) root) → sxml-converter
pred : sxml-converter-as-predicate
root : node
Here’s an example:
> (((sxml:preceding (ntype?? 'www)) example-doc) ((sxpath `(aaa xxx)) example-doc))
'((www "\n" " " (zzz) "\n" " ")
(www " www content " (xxx) (www "\n" " " (zzz) "\n" " ")))
procedure
((sxml:preceding-sibling pred) root) → sxml-converter
pred : sxml-converter-as-predicate
root : node
> (define doc '(*TOP* (div (p "foo") (p "bar") (img "baz") (p "quux"))))
> (((sxml:preceding-sibling (ntype?? 'p)) doc) ((sxpath '(// img)) doc))
'((p "bar") (p "foo"))