2:0
webscraperhelper: Generating SXPath Queries from SXML Examples
1 Introduction
The webscraperhelper package is intended as a programmer’s aid for crafting
SXPath queries to extract information (e.g., news items, prices) from HTML Web
pages that have been parsed by the html-parsing package.
webscraperhelper accepts an example
SXML document and an example “goal” subtree of the document, and yields
up to three different SXPath queries. A generated query can often be
incorporated into a Web-scraping program as-is, for extracting information from
documents with very similar formatting. Generated queries can also be used as
starting points for hand-crafted queries.
For example, given the SXML document doc:
(define doc
  '(*TOP* (html (head (title "My Title"))
                (body (@ (bgcolor "white"))
                      (p "Summary: This is a document.")
                      (div (@ (id "ResultsSection"))
                           (h2 "Results")
                           (p "These are the results.")
                           (table (@ (id "ResultTable"))
                                  (tr (td (b "Input:"))
                                      (td "2 + 2"))
                                  (tr (td (b "Output:"))
                                      (td "Four")))
                           (p "Lookin' good!"))))))
evaluating the expression:
> (webscraperhelper '(td "Four") doc)
will display generated queries like:
Absolute SXPath:          (html body div table (tr 2) (td 2))
Absolute SXPath with IDs: (html body
                           (div (@ (equal? (id "ResultsSection"))))
                           (table (@ (equal? (id "ResultTable"))))
                           (tr 2) (td 2))
Relative SXPath with IDs: (// (table (@ (equal? (id "ResultTable"))))
                           (tr 2) (td 2))
The queries can then be compiled with the sxpath procedure of the SXPath library:
> (define query
    (sxpath '(// (table (@ (equal? (id "ResultTable"))))
                (tr 2) (td 2))))
> (query doc)
((td "Four"))
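To make the positional steps in the absolute query less mysterious, here is a minimal sketch of how such a path could be derived from an example goal. It is not the package's implementation: find-abs-path, step-for, child-elements, and element? are hypothetical names introduced for illustration, and the sketch assumes the goal is matched with equal? and that a step gets a 1-based index only when several siblings share its tag.

```racket
#lang racket/base

;; Simplification: any list headed by a symbol other than @ counts as an
;; element, so comment and processing-instruction nodes are not special-cased.
(define (element? x)
  (and (pair? x) (symbol? (car x)) (not (eq? (car x) '@))))

;; Child elements of node, skipping strings and the (@ ...) attribute list.
(define (child-elements node)
  (filter element? (cdr node)))

;; Step for child among its siblings: the bare tag when no sibling shares the
;; tag, otherwise (tag n) with a 1-based index among same-tag siblings.
(define (step-for child siblings)
  (let ((same (filter (lambda (s) (eq? (car s) (car child))) siblings)))
    (if (null? (cdr same))
        (car child)
        (list (car child)
              (+ 1 (- (length same) (length (memq child same))))))))

;; Depth-first search for the first subtree equal? to goal.
;; Returns a list of steps, or #f when goal does not occur below node.
(define (find-abs-path goal node)
  (let ((kids (child-elements node)))
    (let loop ((rest kids))
      (cond ((null? rest) #f)
            ((equal? (car rest) goal)
             (list (step-for (car rest) kids)))
            ((find-abs-path goal (car rest))
             => (lambda (tail) (cons (step-for (car rest) kids) tail)))
            (else (loop (cdr rest)))))))

(define sample
  '(*TOP* (html (body (table (tr (td "Input:") (td "2 + 2"))
                             (tr (td "Output:") (td "Four")))))))

(find-abs-path '(td "Four") sample)
```

For this sample the sketch yields the steps html, body, table, (tr 2), (td 2), which has the same shape as the "Absolute SXPath" line displayed by webscraperhelper above.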
webscraperhelper comes with an advertising jingle (with apologies to greasy ground
bovine additive Americana):
Webscraperhelper
helps a programmer
scrape the
Web a great deal!
This package was originally written for R5RS Scheme with SRFI-11
and SRFI-16.
2 Interactive Interface
In this version, the “interactive” interface is a procedure
intended to be invoked manually from a REPL.
(webscraperhelper goal sxml [ids]) → void/c
  goal : any/c
  sxml : sxml?
  ids : (listof symbol?) = '(id)
Displays SXPath queries that yield the SXML node goal from the document sxml.
goal is the desired SXML element node.
sxml is the document in SXML First Normal Form (1NF). Some nested
nodelists emitted by SXML transformation tools, such as attributes nested in
extra list levels, are not permitted.
The optional ids is a list of name symbols for element attributes that can be
treated as unique identifiers. If ids is not given, then the default is '(id).
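As an illustration of what ids controls, the sketch below shows how an element carrying one of the named attributes could be turned into an ID-qualified step of the form seen in the generated queries. It is only a sketch under assumptions: id-step and element-attributes are hypothetical helpers, not procedures of this package, and the input is assumed to be in 1NF, with the (@ ...) list directly after the tag.

```racket
#lang racket/base

;; Attributes of an SXML element in 1NF: the (@ ...) list, when present,
;; is the first item after the tag.
(define (element-attributes elem)
  (if (and (pair? (cdr elem))
           (pair? (cadr elem))
           (eq? (car (cadr elem)) '@))
      (cdr (cadr elem))
      '()))

;; Hypothetical helper: return a step such as
;; (table (@ (equal? (id "ResultTable")))) when elem carries an attribute
;; whose name is in ids, else #f. The attribute pair is reused, not copied,
;; so the result shares structure with elem.
(define (id-step elem ids)
  (let loop ((attrs (element-attributes elem)))
    (cond ((null? attrs) #f)
          ((memq (car (car attrs)) ids)
           (list (car elem) (list '@ (list 'equal? (car attrs)))))
          (else (loop (cdr attrs))))))

(id-step '(table (@ (border "1") (id "ResultTable")) (tr (td "x"))) '(id))
(id-step '(div (@ (name "prices")) (p "Widget")) '(id name))
(id-step '(div (@ (name "prices")) (p "Widget")) '(id))
```

The first call yields (table (@ (equal? (id "ResultTable")))). The second yields a step qualified by the name attribute because name is listed in ids, and the third yields #f because the default-style list '(id) does not cover it.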
3 Programmatic Interface
The following procedures were exposed only for tinkering, and were
documented badly.
(find-wsh-path goal sxml) → (or/c #f wsh-path?)
  goal : any/c
  sxml : sxml?
Yields a wsh-path? to goal within sxml, or #f if no path could be found. The yielded path might share
structure with sxml.
(wsh-path->sxpath-abs path) → any
  path : any/c
(wsh-path->sxpath-absids+relids path) → any
  path : any/c
(wsh-path->sxpath-abs+absids+relids path) → any
  path : any/c
Translate a wsh-path? to various SXPath queries. The yielded SXPath query lists
should be considered immutable, as they might share structure with the original
SXML from which path was generated, or multiple queries might share structure with
each other.
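A possible way to combine these procedures is sketched below. Two things here are assumptions rather than documented facts: that the package is installed and its procedures are provided by (require webscraperhelper), and the helper show-queries, which is hypothetical. Because the documented result contract is any, the sketch does not assume how many values a translator returns and simply prints whatever comes back.

```racket
#lang racket/base
;; Assumption: the package is installed and this module path provides the
;; procedures documented in this section.
(require webscraperhelper)

(define doc
  '(*TOP* (html (body (table (@ (id "ResultTable"))
                             (tr (td "Input:") (td "2 + 2"))
                             (tr (td "Output:") (td "Four")))))))

;; Hypothetical helper: call a translator and print each value it returns,
;; without assuming how many values that is.
(define (show-queries translate path)
  (call-with-values
   (lambda () (translate path))
   (lambda queries
     (for-each (lambda (q) (write q) (newline)) queries))))

(define path (find-wsh-path '(td "Four") doc))

(if path
    ;; The queries are only printed, never mutated, since they may share
    ;; structure with doc or with each other.
    (begin
      (show-queries wsh-path->sxpath-abs path)
      (show-queries wsh-path->sxpath-abs+absids+relids path))
    (displayln "goal not found in document"))
```

Checking the result of find-wsh-path for #f before translating matters, since the translators are documented only for a wsh-path? argument.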
4 History
Version 2:0 — 2016-02-28
Version 1:2 — 2009-03-14
Version 1:1 — 2009-02-24
Version 1:0 — 2005-07-04
Version 0.2 — 2004-08-16
Version 0.1 — 2004-07-31
5 Legal
Copyright 2004, 2005, 2009, 2016 Neil Van Dyke. This program is Free
Software; you can redistribute it and/or modify it under the terms of the GNU
Lesser General Public License as published by the Free Software Foundation;
either version 3 of the License, or (at your option) any later version. This
program is distributed in the hope that it will be useful, but without any
warranty; without even the implied warranty of merchantability or fitness for a
particular purpose. See http://www.gnu.org/licenses/ for details. For other
licenses and consulting, please contact the author.