html-parsing: Permissive Parsing of HTML to SXML
(require html-parsing)    package: html-parsing
1 Introduction
2 Interface
> (html->xexp
   (string-append
    "<html><head><title></title><title>whatever</title></head>"
    "<body> <a href=\"url\">link</a><p align=center>"
    "<ul compact style=\"aa\"> <p>BLah<!-- comment <comment> -->"
    " <i> italic <b> bold <tt> ened</i> still &lt; bold </b>"
    "</body><P> But not done yet..."))
(*TOP* (html (head (title) (title "whatever"))
             (body "\n"
                   (a (@ (href "url")) "link")
                   (p (@ (align "center"))
                      (ul (@ (compact) (style "aa")) "\n"))
                   (p "BLah"
                      (*COMMENT* " comment <comment> ")
                      " "
                      (i " italic " (b " bold " (tt " ened")))
                      "\n"
                      "still < bold "))
             (p " But not done yet...")))
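The result is ordinary nested list data, so it can be walked with plain list operations. As a minimal sketch of that idea, the code below collects the href value of every a element from the xexp printed in the example above. The helpers attr-value and collect-hrefs are hypothetical names introduced here for illustration; they are not part of html-parsing. The xexp is quoted directly so the sketch runs with racket/base alone.

```racket
#lang racket/base

;; xexp copied from the example output above; in practice this would be
;; the value returned by html->xexp.
(define doc
  '(*TOP* (html (head (title) (title "whatever"))
                (body "\n"
                      (a (@ (href "url")) "link")
                      (p (@ (align "center"))
                         (ul (@ (compact) (style "aa")) "\n"))
                      (p "BLah"
                         (*COMMENT* " comment <comment> ")
                         " "
                         (i " italic " (b " bold " (tt " ened")))
                         "\n"
                         "still < bold "))
                (p " But not done yet..."))))

;; Hypothetical helper: return the value of attribute `name` on element
;; `elem`, or #f if the attribute is absent or has no value, as with
;; (compact) above.
(define (attr-value elem name)
  (define attrs
    (for/or ([child (in-list (cdr elem))])
      (and (pair? child) (eq? (car child) '@) (cdr child))))
  (define entry (and attrs (assq name attrs)))
  (and entry (pair? (cdr entry)) (cadr entry)))

;; Hypothetical helper: collect the href of every `a` element, in
;; document order.
(define (collect-hrefs x)
  (cond
    [(not (pair? x)) '()]      ; strings and other atoms
    [(eq? (car x) '@) '()]     ; attribute lists contain no elements
    [else
     (define here
       (let ([href (and (eq? (car x) 'a) (attr-value x 'href))])
         (if href (list href) '())))
     (append here (apply append (map collect-hrefs (cdr x))))]))

(collect-hrefs doc) ; => '("url")
```

The walk skips attribute lists explicitly so that an attribute which happens to be named a is not mistaken for an element.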
3 History
- Version 6:0 —
2018-05-22 Fix to permit p elements as children of blockquote elements. Incrementing the major version number because this breaks behavior that had stood for 17 years, but it seems an appropriate change for modern HTML, and it fixes a particular real-world problem. (Thanks to Sorawee Porncharoenwase for reporting.)
- Version 5:0 —
2018-05-15 In a breaking change to the handling of invalid HTML, most named character entity references that are invalid because (possibly among other reasons) they are not terminated by a semicolon are now treated as literal strings (including the ampersand), rather than as named character entities. For example, parser input string "<p>A&B Co.<p>" will now parse as (p "A&B Co.") rather than as (p "A" (& B) " Co."). (Thanks to Greg Hendershott for suggesting this, and for discussing it.)
To support historical quirks handling, five early HTML named character entities (specifically, amp, apos, lt, gt, quot) do not need to be terminated with a semicolon, and will even be recognized if followed immediately by an alphabetic character. For example, "<p>a&ltz</p>" will now parse as (p "a<z"), rather than as (p (& ltz)).
Invalid character entity references that are terminated by EOF rather than semicolon may now be parsed as literal strings, rather than as entity references.
- Version 4:3 —
2016-12-15 Error message “%html-parsing:parse-html: invalid input type:” now abbreviates the invalid value, to avoid possibly huge messages. (Thanks to John B. Clements.)
- Version 4:2 —
2016-03-02 Tweaked info.rkt, filenames.
- Version 4:1 —
2016-02-25 Updated deps.
Documentation tweaks.
- Version 4:0 —
2016-02-21 Moving from PLaneT to new package system.
Moved unit tests into main source file.
- Version 3:0 —
2015-04-24 Numeric character entities now parse to Racket strings instead of Racket characters, to bring SXML/xexp back closer to SXML. (Thanks to John Clements for reporting.)
- Version 2:0 —
2012-06-13 Converted to McFly.
- Version 0.3 —
Version 1:2 — 2011-08-27 Converted test suite from Testeez to Overeasy.
- Version 0.2 —
Version 1:1 — 2011-08-27 Fixed embarrassing bug due to code tidying. (Thanks to Danny Yoo for reporting.)
- Version 0.1 —
Version 1:0 — 2011-08-21 Part of forked development from HtmlPrag, parser originally written 2001-04.
4 Legal
Copyright 2003–2012, 2015, 2016, 2018 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.