lxml_html_clean package

lxml_html_clean.clean module

A cleanup tool for HTML.

Removes unwanted tags and content. See the Cleaner class for details.

exception lxml_html_clean.clean.AmbiguousURLWarning

Bases: LXMLHTMLCleanWarning

add_note(): Exception.add_note(note) – add a note to the exception

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

args

exception lxml_html_clean.clean.LXMLHTMLCleanWarning

Bases: Warning

add_note(): Exception.add_note(note) – add a note to the exception

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

args

class lxml_html_clean.clean.Cleaner(**kw)

Bases: object

Instances clean the document of each of the possible offending elements. The cleaning is controlled by attributes; you can override attributes in a subclass, or set them in the constructor.

scripts:

Removes any <script> tags.

javascript:

Removes any Javascript, like an onclick attribute. Also removes stylesheets as they could contain Javascript.

comments:

Removes any comments.

style:

Removes any style tags.

inline_style:

Removes any style attributes. Defaults to the value of the style option.

links:

Removes any <link> tags.

meta:

Removes any <meta> tags.

page_structure:

Removes structural parts of a page: <head>, <html>, <title>.

processing_instructions:

Removes any processing instructions.

embedded:

Removes any embedded objects (Flash, iframes).

frames:

Removes any frame-related tags.

forms:

Removes any form tags.

annoying_tags:

Removes tags that aren’t wrong, but are annoying: <blink> and <marquee>.

remove_tags:

A list of tags to remove. Only the tags will be removed; their content will get pulled up into the parent tag.

kill_tags:

A list of tags to kill. Killing also removes the tag’s content, i.e. the whole subtree, not just the tag itself.

allow_tags:

A list of tags to include (by default, all tags are included).

remove_unknown_tags:

Remove any tags that aren’t standard parts of HTML.

safe_attrs_only:

If true, only include ‘safe’ attributes (specifically the list from the feedparser HTML sanitisation web site).

safe_attrs:

A set of attribute names to override the default list of attributes considered ‘safe’ (when safe_attrs_only=True).

add_nofollow:

If true, then any <a> tags will have rel="nofollow" added to them.

host_whitelist:

A list or set of hosts that you can use for embedded content (for content like <object>, <link rel="stylesheet">, etc). You can also implement/override the method allow_embedded_url(el, url) or allow_element(el) to implement more complex rules for what can be embedded. Anything that passes this test will be shown, regardless of the value of (for instance) embedded.

Note that this parameter might not work as intended if you do not make the links absolute before doing the cleaning.

Note that you may also need to set whitelist_tags.

Note that URLs are parsed via functions from urllib.parse and no input validation is performed.

whitelist_tags:

A set of tags that can be included with host_whitelist. The default is iframe and embed; you may wish to include other tags like script, or you may want to implement allow_embedded_url for more control. Set to None to include all tags.

This modifies the document in place.
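A minimal usage sketch of the options above, assuming the package is installed. The sample markup and the helper name clean_sample are made up for illustration; the import is guarded so the snippet does nothing when the package is missing.

```python
try:
    from lxml_html_clean.clean import Cleaner
except ImportError:  # package not installed: the demo below is skipped
    Cleaner = None

# Illustrative input only.
SAMPLE = (
    '<div>'
    '<script>alert(1)</script>'
    '<p onclick="steal()">Hello <b>bold</b> <span>world</span></p>'
    '<table><tr><td>secret</td></tr></table>'
    '</div>'
)


def clean_sample(html):
    """Clean `html` with a Cleaner configured through constructor keywords."""
    cleaner = Cleaner(
        scripts=True,         # drop <script> elements (already the default)
        javascript=True,      # drop JavaScript such as onclick (already the default)
        remove_tags=['span'],  # drop the tag, keep its content
        kill_tags=['table'],   # drop the tag together with its subtree
    )
    return cleaner.clean_html(html)


if Cleaner is not None:
    print(clean_sample(SAMPLE))
```

The difference between remove_tags and kill_tags shows up in the result: the text inside <span> survives, while everything inside <table> is gone.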

_decode_css_unicode_escapes(style)

Decode CSS Unicode escape sequences like \69 or \000069 to their actual character values. This prevents bypassing security checks using CSS escape sequences.

CSS escape syntax: backslash followed by 1-6 hex digits, optionally followed by a whitespace character.
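The escape syntax described above can be sketched with the standard library alone. This is a stand-alone illustration, not the library's own code: the function name is hypothetical, the pattern is the one listed for _css_unicode_escape_re, and mapping null, surrogate, and out-of-range code points to U+FFFD is an assumption.

```python
import re

# Same pattern as the _css_unicode_escape_re class attribute.
CSS_UNICODE_ESCAPE_RE = re.compile(r'\\([0-9a-fA-F]{1,6})\s?')


def decode_css_unicode_escapes(style):
    """Replace CSS escapes such as \\69 or \\000069 with the character they encode."""
    def replace(match):
        code_point = int(match.group(1), 16)
        # Assumption: invalid code points become the replacement character.
        if code_point == 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return '\ufffd'
        return chr(code_point)

    return CSS_UNICODE_ESCAPE_RE.sub(replace, style)


print(decode_css_unicode_escapes(r'width: \65 xpression(alert(1))'))
```

Decoding first means that later checks see the plain word rather than its escaped spelling.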

_find_comments(string, pos=0, endpos=9223372036854775807)

Return an iterator over all non-overlapping matches for the RE pattern in string.

For each match, the iterator returns a match object.

_has_sneaky_javascript(style)

Depending on the browser, stuff like e x p r e s s i o n(...) can get interpreted, or expre/* stuff */ssion(...). This checks for attempts to do stuff like this.

Typically the response will be to kill the entire style; if you have just a bit of Javascript in the style another rule will catch that and remove only the Javascript from the style; this catches more sneaky attempts.
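One way to picture this kind of check is to normalize the style text before searching it. The sketch below is a simplified stand-in under stated assumptions: the function name and the two searched substrings are illustrative choices, and the real method may test for more than this.

```python
import re

CSS_COMMENT_RE = re.compile(r'/\*.*?\*/', re.DOTALL)  # same pattern as _comments_re
WHITESPACE_RE = re.compile(r'\s+')


def looks_like_sneaky_javascript(style):
    """Return True if `style` hides an expression(...) or javascript: payload."""
    normalized = CSS_COMMENT_RE.sub('', style)      # expre/* x */ssion -> expression
    normalized = normalized.replace('\\', '')       # drop stray backslashes
    normalized = WHITESPACE_RE.sub('', normalized)  # e x p r ... -> expr...
    normalized = normalized.lower()
    # Assumption: these two markers are enough for the illustration.
    return 'expression(' in normalized or 'javascript:' in normalized


print(looks_like_sneaky_javascript('width: e x p r e s s i o n(alert(1))'))
print(looks_like_sneaky_javascript('color: red'))
```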

_kill_elements(doc, condition, iterate=None)

_remove_javascript_link(link)

_remove_sneaky_css_comments(style)

Look for suspicious code in CSS comments and, if found, remove the entire comment from the given style.

Browsers might parse <style> as an ordinary HTML tag in some specific context and that might cause code in CSS comments to run.
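A rough sketch of the idea, using only the standard library. The comment pattern is the one listed for _comments_re; the "looks like a tag" heuristic and the function name are assumptions made for this example and may differ from what the library actually tests.

```python
import re

CSS_COMMENT_RE = re.compile(r'/\*.*?\*/', re.DOTALL)  # same pattern as _comments_re
TAG_LIKE_RE = re.compile(r'</?[a-zA-Z]')              # hypothetical heuristic


def remove_suspicious_css_comments(style):
    """Drop CSS comments that appear to contain markup; keep harmless ones."""
    def replace(match):
        comment = match.group(0)
        return '' if TAG_LIKE_RE.search(comment) else comment

    return CSS_COMMENT_RE.sub(replace, style)


print(remove_suspicious_css_comments('/* header */ p { margin: 0 }'))
print(remove_suspicious_css_comments('p { color: red } /* </style><script>alert(1)</script> */'))
```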

_substitute_comments(repl, string, count=0): Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

allow_element(el)

Decide whether an element is configured to be accepted or rejected.

Parameters:

el – an element.

Returns:

true to accept the element or false to reject/discard it.

allow_embedded_url(el, url)

Decide whether a URL that was found in an element’s attributes or text is configured to be accepted or rejected.

Parameters:

el – an element.
url – a URL found on the element.

Returns:

true to accept the URL and false to reject it.
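To illustrate why host_whitelist needs absolute links, here is a stand-alone sketch of a host check built on urllib.parse. The function name and the scheme restriction are assumptions for this example, not a description of the method's actual body.

```python
from urllib.parse import urlsplit


def url_host_is_allowed(url, host_whitelist):
    """Return True if `url` is an http(s) URL whose host is in `host_whitelist`."""
    parts = urlsplit(url)  # no validation beyond what urlsplit itself does
    if parts.scheme not in ('http', 'https'):
        return False       # relative URLs have an empty scheme and no host
    return parts.hostname in host_whitelist


ALLOWED = {'www.youtube.com'}
print(url_host_is_allowed('https://www.youtube.com/embed/abc', ALLOWED))
print(url_host_is_allowed('/embed/abc', ALLOWED))
```

A relative URL such as /embed/abc carries no host at all, so a check of this kind can only reject it.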

allow_follow(anchor): Override to suppress rel=”nofollow” on some anchors.

clean_html(html)

kill_conditional_comments(doc): IE conditional comments basically embed HTML that the parser doesn’t normally see. We can’t allow anything like that, so we’ll kill any comments that could be conditional.

_comments_re = re.compile('/\\*.*?\\*/', re.DOTALL)

_css_unicode_escape_re = re.compile('\\\\([0-9a-fA-F]{1,6})\\s?')

_tag_link_attrs = {'a': 'href', 'applet': ['code', 'object'], 'embed': 'src', 'iframe': 'src', 'layer': 'src', 'link': 'href', 'script': 'src'}

add_nofollow = False

allow_tags = ()

annoying_tags = True

comments = True

embedded = True

forms = True

frames = True

host_whitelist = ()

inline_style = None

javascript = True

kill_tags = ()

links = True

meta = True

page_structure = True

processing_instructions = True

remove_tags = ()

remove_unknown_tags = True

safe_attrs = frozenset({'abbr', 'accept', 'accept-charset', 'accesskey', 'action', 'align', 'alt', 'aria-activedescendant', 'aria-atomic', 'aria-autocomplete', 'aria-braillelabel', 'aria-brailleroledescription', 'aria-busy', 'aria-checked', 'aria-colcount', 'aria-colindex', 'aria-colindextext', 'aria-colspan', 'aria-controls', 'aria-current', 'aria-describedby', 'aria-description', 'aria-details', 'aria-disabled', 'aria-dropeffect', 'aria-errormessage', 'aria-expanded', 'aria-flowto', 'aria-grabbed', 'aria-haspopup', 'aria-hidden', 'aria-invalid', 'aria-keyshortcuts', 'aria-label', 'aria-labelledby', 'aria-level', 'aria-live', 'aria-modal', 'aria-multiline', 'aria-multiselectable', 'aria-orientation', 'aria-owns', 'aria-placeholder', 'aria-posinset', 'aria-pressed', 'aria-readonly', 'aria-relevant', 'aria-required', 'aria-roledescription', 'aria-rowcount', 'aria-rowindex', 'aria-rowindextext', 'aria-rowspan', 'aria-selected', 'aria-setsize', 'aria-sort', 'aria-valuemax', 'aria-valuemin', 'aria-valuenow', 'aria-valuetext', 'axis', 'border', 'cellpadding', 'cellspacing', 'char', 'charoff', 'charset', 'checked', 'cite', 'class', 'clear', 'color', 'cols', 'colspan', 'compact', 'coords', 'datetime', 'dir', 'disabled', 'enctype', 'for', 'frame', 'headers', 'height', 'href', 'hreflang', 'hspace', 'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'media', 'method', 'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt', 'readonly', 'rel', 'rev', 'role', 'rows', 'rowspan', 'rules', 'scope', 'selected', 'shape', 'size', 'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type', 'usemap', 'valign', 'value', 'vspace', 'width'})

safe_attrs_only = True

scripts = True

style = False

whitelist_tags = {'embed', 'iframe'}

lxml_html_clean.clean._break_text(text, max_width, break_character)

lxml_html_clean.clean._find_image_dataurls(string, pos=0, endpos=9223372036854775807): Return a list of all non-overlapping matches of pattern in string.

lxml_html_clean.clean._get_authority_from_url(url)

lxml_html_clean.clean._has_javascript_scheme(s)

lxml_html_clean.clean._insert_break(word, width, break_character)

lxml_html_clean.clean._is_unsafe_image_type(string, pos=0, endpos=9223372036854775807)

Scan through string looking for a match, and return a corresponding match object instance.

Return None if no position in the string matches.

lxml_html_clean.clean._link_text(text, link_regexes, avoid_hosts, factory)

lxml_html_clean.clean._looks_like_tag_content(string, pos=0, endpos=9223372036854775807)

Scan through string looking for a match, and return a corresponding match object instance.

Return None if no position in the string matches.

lxml_html_clean.clean._possibly_malicious_schemes(string, pos=0, endpos=9223372036854775807): Return a list of all non-overlapping matches of pattern in string.

lxml_html_clean.clean._replace_css_import(repl, string, count=0): Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

lxml_html_clean.clean._replace_css_javascript(repl, string, count=0): Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

lxml_html_clean.clean._substitute_whitespace(repl, string, count=0): Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

lxml_html_clean.clean.autolink(el, link_regexes=[re.compile('(?P<body>https?://(?P<host>[a-z0-9._-]+)(?:/[/\\-_.,a-z0-9%&?;=~]*)?(?:\\([/\\-_.,a-z0-9%&?;=~]*\\))?)', re.IGNORECASE), re.compile('mailto:(?P<body>[a-z0-9._-]+@(?P<host>[a-z0-9_.-]+[a-z]))', re.IGNORECASE)], avoid_elements=['textarea', 'pre', 'code', 'head', 'select', 'a'], avoid_hosts=[re.compile('^localhost', re.IGNORECASE), re.compile('\\bexample\\.(?:com|org|net)$', re.IGNORECASE), re.compile('^127\\.0\\.0\\.1$')], avoid_classes=['nolink'])

Turn any URLs into links.

It will search for links identified by the given regular expressions (by default mailto and http(s) links).

It won’t link text in an element in avoid_elements, or an element with a class in avoid_classes. It won’t link to anything with a host that matches one of the regular expressions in avoid_hosts (by default localhost, 127.0.0.1, and the example.com, example.org, and example.net domains).

If you pass in an element, the element’s tail will not be substituted, only the contents of the element.
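The default avoid_hosts patterns from the signature can be tried in isolation. This sketch copies those three patterns; the helper name host_is_avoided is made up for the example.

```python
import re

# Patterns copied from the default avoid_hosts argument.
AVOID_HOSTS = [
    re.compile(r'^localhost', re.IGNORECASE),
    re.compile(r'\bexample\.(?:com|org|net)$', re.IGNORECASE),
    re.compile(r'^127\.0\.0\.1$'),
]


def host_is_avoided(host):
    """Return True if any avoid pattern matches `host`."""
    return any(pattern.search(host) for pattern in AVOID_HOSTS)


for host in ('localhost', 'www.example.com', '127.0.0.1', 'python.org'):
    print(host, host_is_avoided(host))
```

Because the patterns are applied with search, ^localhost also matches any host that merely starts with "localhost".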

lxml_html_clean.clean.autolink_html(html, *args, **kw)

Turn any URLs into links.

It will search for links identified by the given regular expressions (by default mailto and http(s) links).

It won’t link text in an element in avoid_elements, or an element with a class in avoid_classes. It won’t link to anything with a host that matches one of the regular expressions in avoid_hosts (by default localhost, 127.0.0.1, and the example.com, example.org, and example.net domains).

If you pass in an element, the element’s tail will not be substituted, only the contents of the element.
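A short usage sketch, assuming the package is installed; the sample text is made up, and the import is guarded so the snippet does nothing when the package is missing.

```python
try:
    from lxml_html_clean.clean import autolink_html
except ImportError:  # package not installed: the demo below is skipped
    autolink_html = None

# Illustrative input: one linkable URL and one on an avoided host.
TEXT = '<p>Docs live at http://python.org/doc today, not at http://localhost/x</p>'

if autolink_html is not None:
    # With the default arguments, only the python.org URL should become a link.
    print(autolink_html(TEXT))
```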

lxml_html_clean.clean.clean_html(html)

lxml_html_clean.clean.fromstring(data): Enhanced fromstring function that removes ASCII control chars before passing the input to the original lxml.html.fromstring.
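The preprocessing step can be pictured as a single substitution. The exact set of characters the library strips is not stated here, so the character class below (the C0 controls except tab, line feed, and carriage return, plus DEL) is an assumption, as is the function name.

```python
import re

# Assumption: tab (\x09), line feed (\x0a) and carriage return (\x0d) are kept.
CONTROL_CHARS_RE = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')


def strip_ascii_control_chars(data):
    """Remove ASCII control characters that could hide a scheme such as java\\x00script."""
    return CONTROL_CHARS_RE.sub('', data)


print(repr(strip_ascii_control_chars('<a href="java\x00script:x">ok\n</a>')))
```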

lxml_html_clean.clean.word_break(el, max_width=40, avoid_elements=['pre', 'textarea', 'code'], avoid_classes=['nobreak'], break_character='\u200b')

Breaks any long words found in the body of the text (not attributes).

Doesn’t affect any of the tags in avoid_elements, by default <textarea>, <pre>, and <code>.

Breaks words by inserting U+200B, the Unicode Zero Width Space character (&#8203; in HTML). This generally takes up no space in rendering, but does copy as a space, and in monospace contexts usually takes up space.

See http://www.cs.tut.fi/~jkorpela/html/nobr.html for a discussion
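The effect on plain text can be sketched without the library. This simplified version breaks at fixed intervals and works on a string rather than an element tree; the function name is hypothetical and the library's own choice of break positions may differ.

```python
import re

ZERO_WIDTH_SPACE = '\u200b'


def break_long_words(text, max_width=40, break_character=ZERO_WIDTH_SPACE):
    """Insert `break_character` into every run of non-space text longer than `max_width`."""
    def split_word(match):
        word = match.group(0)
        pieces = [word[i:i + max_width] for i in range(0, len(word), max_width)]
        return break_character.join(pieces)

    # Only words of more than max_width characters are touched.
    long_word = re.compile(r'\S{%d,}' % (max_width + 1))
    return long_word.sub(split_word, text)


broken = break_long_words('a' * 100)
print(broken.count(ZERO_WIDTH_SPACE))
```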

lxml_html_clean.clean.word_break_html(html, *args, **kw)