lxml.html

Package html

The lxml.html tool set for HTML handling.

Classes
  CheckboxGroup
Represents a group of checkboxes (<input type=checkbox>) that have the same name.
  CheckboxValues
Represents the values of the checked checkboxes in a group of checkboxes with the same name.
  Classes
Provides access to an element's class attribute as a set-like collection.
  FieldsDict
  FormElement
Represents a <form> element.
  HTMLParser
An HTML parser that is configured to return lxml.html Element objects.
  HtmlComment
  HtmlElement
  HtmlElementClassLookup
A lookup scheme for HTML Element classes.
  HtmlEntity
  HtmlMixin
  HtmlProcessingInstruction
  InputElement
Represents an <input> element.
  InputGetter
An accessor that represents all the input fields in a form.
  InputMixin
Mix-in for all input elements (input, select, and textarea)
  LabelElement
Represents a <label> element.
  MultipleSelectOptions
Represents all the selected options in a <select multiple> element.
  RadioGroup
This object represents several <input type=radio> elements that have the same name.
  SelectElement
<select> element. You can get the name with .name.
  TextareaElement
<textarea> element. You can get the name with .name and get/set the value with .value
  XHTMLParser
An XML parser that is configured to return lxml.html Element objects.
  _MethodFunc
An object that represents a method on an element as a function; the function takes either an element or an HTML string. It returns whatever the function normally returns, or if the function works in-place (and so returns None) it returns a serialized form of the resulting document.
Functions

Element(*args, **kw)
Create a new HTML Element.

__bytes_replace_meta_content_type(...)
sub(repl, string[, count = 0]) --> newstring Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

__fix_docstring(s)

__str_replace_meta_content_type(...)
sub(repl, string[, count = 0]) --> newstring Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

_contains_block_level_tag(el)

_element_name(el)

_iter_css_imports(...)
finditer(string[, pos[, endpos]]) --> iterator. Return an iterator over all non-overlapping matches for the RE pattern in string. For each match, the iterator returns a match object.

_iter_css_urls(...)
finditer(string[, pos[, endpos]]) --> iterator. Return an iterator over all non-overlapping matches for the RE pattern in string. For each match, the iterator returns a match object.

_looks_like_full_html_bytes(...)
match(string[, pos[, endpos]]) --> match object or None. Matches zero or more characters at the beginning of the string

_looks_like_full_html_unicode(...)
match(string[, pos[, endpos]]) --> match object or None. Matches zero or more characters at the beginning of the string

_nons(tag)

_parse_meta_refresh_url(...)
search(string[, pos[, endpos]]) --> match object or None. Scan through string looking for a match, and return a corresponding match object instance. Return None if no position in the string matches.

_transform_result(typ, result)
Convert the result back into the input type.

_unquote_match(s, pos)

document_fromstring(html, parser=None, ensure_head_body=False, **kw)

fragment_fromstring(html, create_parent=False, base_url=None, parser=None, **kw)
Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.

fragments_fromstring(html, no_leading_text=False, base_url=None, parser=None, **kw)
Parses several HTML elements, returning a list of elements.

fromstring(html, base_url=None, parser=None, **kw)
Parse the html, returning a single element/document.

html_to_xhtml(html)
Convert all tags in an HTML tree to XHTML by moving them to the XHTML namespace.

open_http_urllib(method, url, values)

open_in_browser(doc, encoding=None)
Open the HTML document in a web browser, saving it to a temporary file to open it. Note that this does not delete the file after use. This is mainly meant for debugging.

parse(filename_or_url, parser=None, base_url=None, **kw)
Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use parse(...).getroot() to get the document root.

submit_form(form, extra_values=None, open_http=None)
Helper function to submit a form. Returns a file-like object, as from urllib.urlopen(). This object also has a .geturl() function, which shows the URL if there were any redirects.

tostring(doc, pretty_print=False, include_meta_content_type=False, encoding=None, method='html', with_tail=True, doctype=None)
Return an HTML string representation of the document.

xhtml_to_html(xhtml)
Convert all tags in an XHTML tree to HTML by removing their XHTML namespace.
Variables
  XHTML_NAMESPACE = 'http://www.w3.org/1999/xhtml'
  __package__ = 'lxml.html'
  _archive_re = re.compile(r'[^ ]+')
  _class_xpath = descendant-or-self::*[@class and contains(conca...
  _collect_string_content = string()
  _forms_xpath = descendant-or-self::form|descendant-or-self::x:...
  _id_xpath = descendant-or-self::*[@id=$id]
  _label_xpath = //label[@for=$id]|//x:label[@for=$id]
  _options_xpath = descendant-or-self::option|descendant-or-self...
  _rel_links_xpath = descendant-or-self::a[@rel]|descendant-or-s...
  find_class = <lxml.html._MethodFunc object>
  find_rel_links = <lxml.html._MethodFunc object>
  html_parser = <lxml.html.HTMLParser object>
  iterlinks = <lxml.html._MethodFunc object>
  make_links_absolute = <lxml.html._MethodFunc object>
  resolve_base_href = <lxml.html._MethodFunc object>
  rewrite_links = <lxml.html._MethodFunc object>
  xhtml_parser = <lxml.html.XHTMLParser object>
Function Details

Element(*args, **kw)

Create a new HTML Element.

This can also be used for XHTML documents.

fragment_fromstring(html, create_parent=False, base_url=None, parser=None, **kw)

Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.

If create_parent is true (or is a tag name), a parent node is created to encapsulate the HTML in a single element. In this case, leading or trailing text is also allowed, as are multiple elements in the parse result.

Passing a base_url will set the document's base_url attribute (and the tree's docinfo.URL).
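The difference that create_parent makes can be sketched as follows. The input strings are made up for illustration, and the broad `except Exception` is a simplification; lxml raises its own parser error type here.

```python
from lxml import html

# A single element parses directly and is returned as-is.
p = html.fragment_fromstring('<p>Hello</p>')

# Leading text plus two elements needs a wrapper; passing a tag name supplies one.
mixed = 'intro <b>one</b><i>two</i>'
wrapped = html.fragment_fromstring(mixed, create_parent='div')
child_tags = [child.tag for child in wrapped]

# Without create_parent, the same input is rejected.
try:
    html.fragment_fromstring(mixed)
    rejected = False
except Exception:
    rejected = True
```

Here `wrapped` is the newly created parent, with the leading text stored in its `.text` attribute.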

fragments_fromstring(html, no_leading_text=False, base_url=None, parser=None, **kw)

Parses several HTML elements, returning a list of elements.

The first item in the list may be a string. If no_leading_text is true, leading text is an error, and the result is always a list containing only elements.

base_url will set the document's base_url attribute (and the tree's docinfo.URL).
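A short sketch of the leading-text behaviour, using a made-up input string. As before, catching plain `Exception` is a simplification of the specific parser error lxml raises.

```python
from lxml import html

# Text before the first element is returned as a leading string item.
parts = html.fragments_fromstring('lead <b>one</b> mid <i>two</i>')
leading = parts[0]
element_tags = [item.tag for item in parts[1:]]

# Text between elements is kept as the tail of the preceding element.
tail_of_first = parts[1].tail

# With no_leading_text=True the same input raises instead.
try:
    html.fragments_fromstring('lead <b>one</b>', no_leading_text=True)
    raised = False
except Exception:
    raised = True
```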

fromstring(html, base_url=None, parser=None, **kw)

Parse the html, returning a single element/document.

This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document.

base_url will set the document's base_url attribute (and the tree's docinfo.URL).
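To illustrate what "minimally parse" can mean in practice, the sketch below feeds three made-up inputs to fromstring. The wrapper element chosen for the multi-element case is an observation about lxml's behaviour rather than something this page documents, so treat it as an assumption.

```python
from lxml import html

# A full document comes back as the <html> root element.
full = html.fromstring(
    '<html><head><title>t</title></head><body><p>x</p></body></html>')

# A single element comes back as that element, with no <html>/<body> wrapper.
single = html.fromstring('<p>x</p>')

# Several sibling elements come back inside one wrapper element.
multi = html.fromstring('<p>a</p><p>b</p>')

print(full.tag, single.tag, multi.tag, len(multi))
```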

parse(filename_or_url, parser=None, base_url=None, **kw)

Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use parse(...).getroot() to get the document root.

You can override the base URL with the base_url keyword. This is most useful when parsing from a file-like object.
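A small sketch of parsing from a file-like object with an explicit base URL. The markup and the example.com URL are placeholders, and make_links_absolute() is called without arguments on the assumption that it falls back to the element's base_url.

```python
import io
from lxml import html

# An in-memory file-like object has no URL of its own.
source = io.BytesIO(b'<html><body><a href="page.html">link</a></body></html>')

# parse() returns a tree; getroot() gives the <html> element.
tree = html.parse(source, base_url='http://example.com/dir/')
root = tree.getroot()

# The base URL supplied above is used to resolve relative links.
root.make_links_absolute()
href = root.xpath('//a/@href')[0]
print(href)
```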

submit_form(form, extra_values=None, open_http=None)

Helper function to submit a form. Returns a file-like object, as from urllib.urlopen(). This object also has a .geturl() function, which shows the URL if there were any redirects.

You can use this like:

form = doc.forms[0]
form.inputs['foo'].value = 'bar' # etc
response = submit_form(form)
doc = parse(response).getroot()
doc.make_links_absolute(response.geturl())

To change the HTTP requester, pass a function as open_http keyword argument that opens the URL for you. The function must have the following signature:

open_http(method, URL, values)

The method is one of 'GET' or 'POST', the URL is the target URL as a string, and the values are a sequence of (name, value) tuples with the form data.
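The open_http hook can be sketched with a stand-in requester that records its arguments instead of touching the network. The function name `fake_open_http`, the form markup, and the example.com URL are all hypothetical; a real requester would return a file-like response object.

```python
from lxml import html

captured = {}

def fake_open_http(method, url, values):
    # Hypothetical requester: record the call instead of sending a request.
    captured['method'] = method
    captured['url'] = url
    captured['values'] = list(values)
    return None

# A single <form> fragment; base_url lets the relative action be resolved.
form = html.fromstring(
    '<form action="/search" method="post">'
    '<input name="q" value="lxml"></form>',
    base_url='http://example.com/')

html.submit_form(form, open_http=fake_open_http)
print(captured)
```

Swapping in a function like this is also a convenient way to test form handling without a server.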

tostring(doc, pretty_print=False, include_meta_content_type=False, encoding=None, method='html', with_tail=True, doctype=None)

Return an HTML string representation of the document.

Note: if include_meta_content_type is true, this will create a <meta http-equiv="Content-Type" ...> tag in the head. Regardless of the value of include_meta_content_type, any existing <meta http-equiv="Content-Type" ...> tag will be removed.

The encoding argument controls the output encoding (defaults to ASCII, with &#...; character references for any characters outside of ASCII). Note that you can pass the name 'unicode' as the encoding argument to serialise to a Unicode string.

The method argument defines the output method. It defaults to 'html', but can also be 'xml' for xhtml output, or 'text' to serialise to plain text without markup.

To leave out the tail text of the top-level element that is being serialised, pass with_tail=False.

The doctype option allows passing in a plain string that will be serialised before the XML tree. Note that passing in non well-formed content here will make the XML output non well-formed. Also, an existing doctype in the document tree will not be removed when serialising an ElementTree instance.

Example:

>>> from lxml import html
>>> root = html.fragment_fromstring('<p>Hello<br>world!</p>')

>>> html.tostring(root)
'<p>Hello<br>world!</p>'
>>> html.tostring(root, method='html')
'<p>Hello<br>world!</p>'

>>> html.tostring(root, method='xml')
'<p>Hello<br/>world!</p>'

>>> html.tostring(root, method='text')
'Helloworld!'

>>> html.tostring(root, method='text', encoding='unicode')
u'Helloworld!'

>>> root = html.fragment_fromstring('<div><p>Hello<br>world!</p>TAIL</div>')
>>> html.tostring(root[0], method='text', encoding='unicode')
u'Helloworld!TAIL'

>>> html.tostring(root[0], method='text', encoding='unicode', with_tail=False)
u'Helloworld!'

>>> doc = html.document_fromstring('<p>Hello<br>world!</p>')
>>> html.tostring(doc, method='html', encoding='unicode')
u'<html><body><p>Hello<br>world!</p></body></html>'

>>> print(html.tostring(doc, method='html', encoding='unicode',
...          doctype='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"'
...                  ' "http://www.w3.org/TR/html4/strict.dtd">'))
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><body><p>Hello<br>world!</p></body></html>

Variables Details

_class_xpath

Value:
descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), concat(' ', $class_name, ' '))]
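The space-padding in this expression is what makes it match whole class tokens rather than substrings. A rough sketch, with made-up markup, that evaluates the same expression directly and compares it with find_class (which is presumably built on it):

```python
from lxml import html

doc = html.fromstring(
    '<div><p class="note big">a</p><p class="notebook">b</p></div>')

# Same expression as above; $class_name is supplied as an XPath variable.
expr = ("descendant-or-self::*[@class and contains(concat(' ', "
        "normalize-space(@class), ' '), concat(' ', $class_name, ' '))]")

# 'note' matches the first <p> only; 'notebook' is a different token.
via_xpath = [el.text for el in doc.xpath(expr, class_name='note')]
via_find_class = [el.text for el in doc.find_class('note')]
print(via_xpath, via_find_class)
```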

_forms_xpath

Value:
descendant-or-self::form|descendant-or-self::x:form

_options_xpath

Value:
descendant-or-self::option|descendant-or-self::x:option

_rel_links_xpath

Value:
descendant-or-self::a[@rel]|descendant-or-self::x:a[@rel]