SXML in Racket: Tools for XML and HTML
License: LGPLv3
Web: http://www.neilvandyke.org/racket/sxml-intro/
1 Introduction
SXML is a computer representation of XML, originally defined by Oleg Kiselyov for the Scheme programming language. SXML is one of the representations of XML and HTML that is used by Racket tools. This document first clears up some confusion about the various representations, and then lists Racket tools available for SXML.
People writing SXML-related packages for Racket might wish to link to this document from the documentation of their package. For example:
This package fooifies @seclink["top" #:doc '(lib "sxml-intro/sxml-intro.scrbl") #:indirect? #true]{SXML} for great justice. You know what you are doing.
Authors of SXML-related Racket packages should also have Neil add their SXML-related package to this document.
2 Confusion
First, to get some confusion out of the way... For historical reasons, there are a few different representations of XML and HTML that different Racket-based tools support:
“SXML” —
The Scheme XML representation described and formalized by Oleg Kiselyov. The documents can be found at http://pobox.com/~oleg/ftp/Scheme/SXML.html.
“SXML/xexp” and “xexp” and “SHTML” —
SXML dialect(s) that are often practically interchangeable with strictly-standard-compliant SXML. See Appendix: SXML/xexp.
“xexpr” —
An s-expression representation of XML used by some core Racket tools, including being the default representation used by the Racket Web Server. The “xexpr” format appears similar to SXML, but there are technical reasons that unifying the two formats has seemed impractical. Among the differences, the arbitrary nested lists permitted by SXML make SXML more amenable to efficient splicing. See the XML: Parsing and Writing documentation.
“xml” —
An opaque representation of XML, supported by core Racket. See the XML: Parsing and Writing documentation.
“html” —
An opaque representation of HTML, supported by core Racket. See the HTML: Parsing Library documentation.
Only the first two bulleted items above are considered “SXML” and discussed further in this document.
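One visible difference between the “xexpr” and SXML formats is how attributes are written: an xexpr element holds its attributes in a bare list of pairs, while SXML marks the attribute list with the @ symbol. The following is a minimal sketch of that one difference only. The names xexpr-attrs? and xexpr->sxml are hypothetical helpers written for this illustration and are not provided by any package mentioned here; values other than elements and strings are passed through unchanged.

```racket
#lang racket/base
;; Sketch only: `xexpr-attrs?` and `xexpr->sxml` are hypothetical helpers,
;; not part of the xml library or the sxml package.
(require racket/match)

;; An xexpr attribute list is a list of (symbol string) pairs.
(define (xexpr-attrs? v)
  (and (list? v)
       (for/and ([a (in-list v)])
         (match a
           [(list (? symbol?) (? string?)) #t]
           [_ #f]))))

(define (xexpr->sxml x)
  (match x
    ;; Element with an explicit attribute list:
    ;; ((k v) ...) becomes (@ (k v) ...); an empty list is dropped.
    [(list (? symbol? tag) (? xexpr-attrs? attrs) kids ...)
     (if (null? attrs)
         (cons tag (map xexpr->sxml kids))
         (list* tag (cons '@ attrs) (map xexpr->sxml kids)))]
    ;; Element without an attribute list.
    [(list (? symbol? tag) kids ...)
     (cons tag (map xexpr->sxml kids))]
    ;; Strings and anything else are left as they are.
    [_ x]))
```

For example, (xexpr->sxml '(a ((href "x")) "t" (b "u"))) yields (a (@ (href "x")) "t" (b "u")).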
3 Tools
The original SXML tools by Kiselyov, et al., are all provided by the Racket sxml package:
SSAX —
Parsing XML to SXML.
SRL —
Serialization of SXML to XML.
SXPath —
XPath-like querying of SXML.
SXSLT —
XSLT-like transformation language for SXML.
SXML procedures —
Low-level procedures for working with SXML.
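As a rough illustration of how three of these tools fit together, the sketch below parses a small XML string with SSAX, queries it with SXPath, and serializes it again with SRL. It assumes the sxml package is installed (for example with raco pkg install sxml); the sample document and the names doc, item-texts, and xml-string are made up for this example.

```racket
#lang racket/base
;; Sketch only: assumes the `sxml` package is installed.
(require sxml)

;; SSAX: parse XML text from an input port into SXML. The second argument
;; is the list of namespace prefix assignments; '() means none.
(define doc
  (ssax:xml->sxml
   (open-input-string
    "<menu><item id=\"1\">tea</item><item id=\"2\">milk</item></menu>")
   '()))

;; SXPath: compile a path into a procedure, then apply it to the document.
;; This path selects the text children of each item under menu.
(define item-texts ((sxpath '(menu item *text*)) doc))

;; SRL: serialize the SXML back to XML, returned here as a string.
(define xml-string (srl:sxml->xml doc))
```

The parsed document is rooted at a *TOP* node, which is why the SXPath query starts at menu rather than at item.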
General-purpose SXML-related tools by others include:
sxml-match —
Pattern-matching of SXML, by Jim Bender. (Currently only in PLaneT.)
html-parsing —
Permissively parsing HTML to SXML.
html-writing —
Writing HTML from SXML.
html-template —
Writing HTML from SXML, using a compile-time template language embedded in Racket code, instead of run-time s-expressions.
rws-html-template —
Using compile-time HTML templates with the Racket Web Server.
webscraperhelper —
Aid for developing SXPath queries from SXML examples.
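For a sense of how html-parsing and html-writing pair up, here is a minimal sketch. It assumes both packages are installed, and the sample markup and the names parsed and written are made up for this example.

```racket
#lang racket/base
;; Sketch only: assumes the `html-parsing` and `html-writing` packages
;; are installed.
(require html-parsing html-writing)

;; Permissive parse: the unclosed <b> element is tolerated rather than
;; rejected, and the result is an SXML/xexp value.
(define parsed (html->xexp "<p>Hello <b>world</p>"))

;; Write an SXML/xexp fragment out as an HTML string.
(define written (xexp->html '(p "Hello " (b "world"))))
```

The value in parsed can be handed to the SXPath tools from the sxml package, since html->xexp produces SXML/xexp.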
Packages that use SXML but are not general-purpose SXML tools:
kitco —
Getting financial price data, including in SXML format.
If you’ve written an SXML-related Racket tool not listed here, please contact Neil.
4 Appendix: SXML/xexp
“SXML/xexp” and “xexp” were intended to be temporary names, to acknowledge that some tools were not strictly compliant with SXML. The names were to be dropped once the SXML standard adopted the extensions, or once the tools were modified to not use them. For example, the html-parsing package currently provides an html->xexp procedure, and a later version of that package might eventually provide a strictly-compliant html->sxml procedure. (In retrospect, the different names might have caused more confusion than simply calling things “SXML” even when not strictly compliant.)
The known differences of some “SXML/xexp” tools from strict SXML may include:
Input must be ordered as in SXML first normal form (1NF). For example, any attributes list must precede child elements.
SXML keyword symbols may be lowercase; for example, '*TOP* may be written as '*top*. This difference is actually due to early ambiguity in the standard, since some Scheme implementations at the time had case-insensitive symbols.
Special '& syntax for character entity references, especially in handwritten HTML encoded in SXML. The syntax is '(& val), where val is the symbolic name of the character as a symbol, or an integer with the numeric value of the character. For example, '(& rArr) or '(& 151). An early version of html-parsing or HtmlPrag also permitted val to be a string.
Racket character literals can be used like string literals, for XML CDATA.
When representing HTML, SXML content within the HTML script element that is not CDATA may be treated as “CDATA faux HTML”, or PCDATA, to support languages like Ractive.js. (This was added to the html-writing package, around version 4.0.)
The above are itemized to document the known differences, not to encourage their use.
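Two of the differences listed above, lowercase keyword symbols and character literals used as CDATA, are mechanical enough to normalize with a short tree walk. The sketch below is an illustration under assumptions, not something any of the listed packages provides: keyword-table and xexp->strict-sxml are hypothetical names, the table covers only the keyword symbols named in it, and the '& entity syntax is left untouched because this document does not give a strict SXML equivalent for it.

```racket
#lang racket/base
;; Sketch only: `keyword-table` and `xexp->strict-sxml` are hypothetical
;; names, not part of html-parsing, html-writing, or sxml.
(require racket/match)

;; Lowercase keyword symbols mapped to their uppercase spellings.
(define keyword-table
  (hash '*top* '*TOP*
        '*comment* '*COMMENT*
        '*pi* '*PI*
        '*entity* '*ENTITY*
        '*namespaces* '*NAMESPACES*))

(define (xexp->strict-sxml x)
  (match x
    ;; A character used as CDATA becomes a one-character string.
    [(? char? c) (string c)]
    ;; Any list headed by a symbol: fix the head if it is a lowercase
    ;; keyword, then walk the remaining items.
    [(list (? symbol? head) rest ...)
     (cons (hash-ref keyword-table head head)
           (map xexp->strict-sxml rest))]
    ;; Strings and everything else pass through unchanged.
    [_ x]))
```

For example, (xexp->strict-sxml '(*top* (p "a" #\b))) yields (*TOP* (p "a" "b")).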
5 History
- Version 1:3 —
2016-04-03 Added mention of CDATA faux HTML to SXML/xexp appendix.
- Version 1:2 —
2016-02-28 Updated webscraperhelper, since moved from PLaneT to new package system.
Added mention of kitco package.
- Version 1:1 —
2016-02-25 Replaced web-server-xexp PLaneT package with rws-html-template.
Added mention of SHTML.
- Version 1:0 —
2016-02-25 Initial release.
6 Legal
This document is Copyright 2016 Neil Van Dyke. Licensed under the GNU Lesser General Public License, version 3 (LGPLv3).