html-parsing: Permissive Parsing of HTML to SXML

Version 6:0

Neil Van Dyke

License: LGPLv3
Web: http://www.neilvandyke.org/racket/html-parsing/

1 Introduction

The html-parsing library provides a permissive HTML parser. The parser is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. html-parsing emits SXML/xexp, so that conventional HTML may be processed with XML tools such as SXPath. Like Oleg Kiselyov’s SSAX-based HTML parser, html-parsing provides a permissive tokenizer, but html-parsing extends this by attempting to recover syntactic structure.

html-parsing’s parsing is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully rather than yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages contain syntax errors that would defeat a strict or validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web browsers’ interpretation of the structure of erroneous HTML.
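As a rough illustration of this permissiveness, the sketch below feeds a deliberately malformed fragment to the html->xexp procedure documented in the Interface section, once as a string and once through an input port. The input string and the names broken-html, from-string, and from-port are invented for this example, and the sketch assumes the html-parsing package is installed. The exact recovered structure is printed rather than stated here.

```racket
#lang racket/base
(require html-parsing) ; assumes the html-parsing package is installed

;; Invented, deliberately malformed input: neither p nor b is ever closed.
(define broken-html "<p>unclosed <b>bold")

;; html->xexp accepts either a string or an input port.
(define from-string (html->xexp broken-html))
(define from-port (html->xexp (open-input-string broken-html)))

;; No parse error is raised; both calls return ordinary list structure.
(printf "~s\n" from-string)
(printf "same result from port? ~a\n" (equal? from-string from-port))
```

Whatever structure the parser recovers, the result is plain list data that ordinary list operations can inspect.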

html-parsing also has some support for XHTML: XML namespace qualifiers are accepted, but they are stripped from the resulting SXML/xexp. Note that valid XHTML input might be better handled by a validating XML parser like Kiselyov’s SSAX.

2 Interface

procedure
(html->xexp input) → xexp
input : (or/c input-port? string?)

Parse HTML permissively from input, which is either an input port or a string, and emit an SXML/xexp equivalent or approximation. To borrow and slightly modify an example from Kiselyov’s discussion of his HTML parser:

> (html->xexp
   (string-append
    "<html><head><title></title><title>whatever</title></head>"
    "<body> <a href=\"url\">link</a><p align=center>"
    "<ul compact style=\"aa\"> <p>BLah<!-- comment <comment> -->"
    " <i> italic <b> bold <tt> ened</i> still &lt; bold </b>"
    "</body><P> But not done yet..."))

(*TOP* (html (head (title) (title "whatever"))
             (body "\n"
                   (a (@ (href "url")) "link")
                   (p (@ (align "center"))
                      (ul (@ (compact) (style "aa")) "\n"))
                   (p "BLah"
                      (*COMMENT* " comment <comment> ")
                      " "
                      (i " italic " (b " bold " (tt " ened")))
                      "\n"
                      "still < bold "))
             (p " But not done yet...")))

Note that, in the emitted SXML/xexp, the text token "still < bold" is not inside the b element. This is one quirk of old Web browsers’ handling of invalid HTML that this parser does not try to emulate.
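Since the emitted SXML/xexp is ordinary list structure, simple extraction tasks can be done with plain Racket as well as with tools such as SXPath. The following is a minimal sketch, not part of html-parsing: collect-hrefs is a hypothetical helper that walks an xexp and gathers the href values of a elements. To keep it self-contained, the xexp is quoted directly from the example output rather than produced by calling the parser.

```racket
#lang racket/base
(require racket/list racket/match)

;; The xexp shown in the example output, quoted as literal data.
(define example-xexp
  '(*TOP* (html (head (title) (title "whatever"))
                (body "\n"
                      (a (@ (href "url")) "link")
                      (p (@ (align "center"))
                         (ul (@ (compact) (style "aa")) "\n"))
                      (p "BLah"
                         (*COMMENT* " comment <comment> ")
                         " "
                         (i " italic " (b " bold " (tt " ened")))
                         "\n"
                         "still < bold "))
                (p " But not done yet..."))))

;; collect-hrefs is a hypothetical helper, not provided by html-parsing.
;; It returns the href values of all a elements, in document order.
(define (collect-hrefs x)
  (match x
    ;; An a element with an attribute list: take its href values,
    ;; then keep searching its children.
    [(list 'a (list '@ attrs ...) children ...)
     (append (for/list ([attr (in-list attrs)]
                        #:when (and (pair? attr)
                                    (eq? (car attr) 'href)
                                    (pair? (cdr attr))))
               (cadr attr))
             (append-map collect-hrefs children))]
    ;; Attribute lists and comments contain no elements.
    [(list (or '@ '*COMMENT*) _ ...) '()]
    ;; Any other element, or *TOP*: search everything after the tag.
    [(list (? symbol?) children ...) (append-map collect-hrefs children)]
    ;; Strings and any other atoms carry no links.
    [_ '()]))

(collect-hrefs example-xexp) ; → '("url")
```

The guard on (pair? (cdr attr)) skips valueless attributes such as (compact), which appear in the example output.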

3 History

Version 6:0 — 2018-05-22
- Fix to permit p elements as children of blockquote elements. The major version number is incremented because this breaks behavior that had stood for 17 years, but it seems an appropriate change for modern HTML, and it fixes a particular real-world problem. (Thanks to Sorawee Porncharoenwase for reporting.)
Version 5:0 — 2018-05-15
- In a breaking change to the handling of invalid HTML, most named character entity references that are invalid because (possibly among multiple reasons) they are not terminated by a semicolon are now treated as literal strings (including the ampersand), rather than as named character entities. For example, parser input string "A&B Co." will now parse as (p "A&B Co.") rather than as (p "A" (& B) " Co."). (Thanks to Greg Hendershott for suggesting this, and discussing.)
- For support of historical quirks handling, five early HTML named character entities (specifically, amp, apos, lt, gt, quot) do not need to be terminated with a semicolon, and will even be recognized if followed immediately by an alphabetic character. For example, "a&ltz" will now parse as (p "a<z"), rather than as (p (& ltz)).
- Invalid character entity references that are terminated by EOF rather than semicolon may now be parsed as literal strings, rather than as entity references.
Version 4:3 — 2016-12-15
- Error message “%html-parsing:parse-html: invalid input type:” now abbreviates the invalid value, to avoid possibly huge messages. (Thanks to John B. Clements.)
Version 4:2 — 2016-03-02
- Tweaked info.rkt, filenames.
Version 4:1 — 2016-02-25
- Updated deps.
- Documentation tweaks.
Version 4:0 — 2016-02-21
- Moving from PLaneT to new package system.
- Moved unit tests into main source file.
Version 3:0 — 2015-04-24
- Numeric character entities now parse to Racket strings instead of Racket characters, to bring SXML/xexp back closer to SXML. (Thanks to John Clements for reporting.)
Version 2:0 — 2012-06-13
- Converted to McFly.
Version 0.3 — Version 1:2 — 2011-08-27
- Converted test suite from Testeez to Overeasy.
Version 0.2 — Version 1:1 — 2011-08-27
- Fixed embarrassing bug due to code tidying. (Thanks to Danny Yoo for reporting.)
Version 0.1 — Version 1:0 — 2011-08-21
- Part of forked development from HtmlPrag, parser originally written 2001-04.

4 Legal

Copyright 2003–2012, 2015, 2016, 2018 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.
