html-parsing: Permissive Parsing of HTML to SXML
6:0

Neil Van Dyke

1 Introduction

The html-parsing library provides a permissive HTML parser. The parser is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. html-parsing emits SXML/xexp, so that conventional HTML may be processed with XML tools such as SXPath. Like Oleg Kiselyov’s SSAX-based HTML parser, html-parsing provides a permissive tokenizer, but html-parsing extends this by attempting to recover syntactic structure.
html-parsing is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully rather than yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages contain syntax errors that would defeat a strict or validating parser. html-parsing’s error handling is intended to generally emulate how popular Web browsers interpret the structure of erroneous HTML.
html-parsing also has some support for XHTML: XML namespace qualifiers are accepted, but they are stripped from the resulting SXML/xexp. Note that valid XHTML input might be better handled by a validating XML parser like Kiselyov’s SSAX.
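
As a rough illustration of the information-extraction use case, the sketch below walks an SXML/xexp tree and collects the href of every a element. It uses ordinary list traversal rather than SXPath. The sample tree is hand-written in the shape that html-parsing emits, not produced by the parser, and the helper names attr-value and hrefs are hypothetical and not part of the library.

```racket
#lang racket/base
(require racket/list racket/match)

;; Hand-written sample data in the SXML/xexp shape; in real use this
;; value would come from a call to html->xexp.
(define doc
  '(*TOP* (html (head (title "whatever"))
                (body (a (@ (href "url")) "link")
                      (p (@ (align "center"))
                         (a (@ (href "other") (class "x")) "more"))))))

;; attr-value (hypothetical helper): return the string value of
;; attribute `name` on element `elem`, or #f if it is absent or has
;; no value (as with a bare attribute like compact).
(define (attr-value elem name)
  (match elem
    [(list _ (list '@ attrs ...) _ ...)
     (match (assq name attrs)
       [(list _ (? string? v)) v]
       [_ #f])]
    [_ #f]))

;; hrefs (hypothetical helper): collect, in document order, the href
;; of every a element anywhere in the tree.
(define (hrefs x)
  (cond
    [(not (pair? x)) '()]          ; text nodes carry no links
    [(eq? (car x) '@) '()]         ; do not descend into attribute lists
    [else
     (define here
       (let ([v (and (eq? (car x) 'a) (attr-value x 'href))])
         (if v (list v) '())))
     (append here (append-map hrefs (cdr x)))]))

(hrefs doc)
```

Because the tree is ordinary list data, the same traversal style works for any element or attribute name. For larger queries, a tool such as SXPath is likely the more convenient choice.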

2 Interface

procedure

(html->xexp input) → xexp

  input : (or/c input-port? string?)
Parse HTML permissively from input, which is either an input port or a string, and emit an SXML/xexp equivalent or approximation. To borrow and slightly modify an example from Kiselyov’s discussion of his HTML parser:
> (html->xexp
   (string-append
    "<html><head><title></title><title>whatever</title></head>"
    "<body> <a href=\"url\">link</a><p align=center>"
    "<ul compact style=\"aa\"> <p>BLah<!-- comment <comment> -->"
    " <i> italic <b> bold <tt> ened</i> still &lt; bold </b>"
    "</body><P> But not done yet..."))
  (*TOP* (html (head (title) (title "whatever"))
               (body "\n"
                     (a (@ (href "url")) "link")
                     (p (@ (align "center"))
                        (ul (@ (compact) (style "aa")) "\n"))
                     (p "BLah"
                        (*COMMENT* " comment <comment> ")
                        " "
                        (i " italic " (b " bold " (tt " ened")))
                        "\n"
                        "still < bold "))
               (p " But not done yet...")))
Note that, in the emitted SXML/xexp, the text token "still < bold" is not inside the b element. Old Web browsers would keep that text inside the b element; this is one quirk of their handling of invalid HTML that this parser does not try to emulate.
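
A minimal sketch of calling html->xexp on both kinds of input, assuming the html-parsing package is installed. The sample markup and the helper parse-html-file are made up for illustration and are not part of the library.

```racket
#lang racket/base
(require html-parsing)  ; assumes the html-parsing package is installed

;; Illustrative markup with a deliberately unclosed b element.
(define markup "<p>Hello <b>world</p>")

;; The same markup supplied as a string and as an input port.
(define from-string (html->xexp markup))
(define from-port (html->xexp (open-input-string markup)))

;; parse-html-file (hypothetical helper): parse the HTML file at `path`
;; by handing html->xexp the opened input port.
(define (parse-html-file path)
  (call-with-input-file path html->xexp))

from-string
from-port
```

Passing a port lets the caller parse a file or network stream without first reading it into a string.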

3 History

4 Legal

Copyright 2003–2012, 2015, 2016, 2018 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.