Cleaning up HTML
================

The module ``lxml_html_clean`` provides a ``Cleaner`` class for cleaning up
HTML pages. It supports removing embedded or script content, special tags,
CSS style annotations and much more.

Note: the HTML Cleaner in ``lxml_html_clean`` is **not** considered
appropriate **for security sensitive environments**. See e.g. bleach for an
alternative.

Say, you have an overburdened web page from a hideous source which contains
lots of content that upsets browsers and tries to run unnecessary code on the
client side:

.. sourcecode:: pycon

    >>> html = '''\
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ... a link
    ... another link
    ... a paragraph
    ... secret EVIL!
    ... of EVIL!
    ...
    ...
    ... Password:
    ...
    ... annoying EVIL!
    ... spam spam SPAM!
    ...
    ...
    ... '''

To remove all the superfluous content from this unparsed document, use the
``clean_html`` function:

.. sourcecode:: pycon

    >>> from lxml_html_clean import clean_html
    >>> print(clean_html(html))
    a link another link
    a paragraph
    secret EVIL!
    of EVIL! Password: annoying EVIL!spam spam SPAM!

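
As a smaller illustration of the same call, the following sketch runs
``clean_html`` with its default settings on a short, hand-written snippet.
The snippet itself is hypothetical and not part of the page above, and the
fallback import is an assumption for setups where an older lxml release still
bundles the cleaner as ``lxml.html.clean``:

```python
# Minimal sketch: clean a small snippet with the default cleaner settings.
try:
    from lxml_html_clean import clean_html
except ImportError:
    # Assumption: an older lxml release that still ships lxml.html.clean.
    from lxml.html.clean import clean_html

# Hypothetical input: one inline event handler and one script element.
snippet = (
    '<div>'
    '<p onclick="evil_function()">hello</p>'
    '<script>alert(1)</script>'
    '</div>'
)

# A string goes in and a string comes out; the input is not modified.
cleaned = clean_html(snippet)
print(cleaned)
```

With the defaults, the script element and the ``onclick`` attribute should be
gone from the result, while the paragraph text is kept.
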
The ``Cleaner`` class supports several keyword arguments to control exactly
which content is removed:

.. sourcecode:: pycon

    >>> from lxml_html_clean import Cleaner

    >>> cleaner = Cleaner(page_structure=False, links=False)
    >>> print(cleaner.clean_html(html))
    a link another link
    a paragraph
    secret EVIL!
    of EVIL! Password: annoying EVIL! spam spam SPAM!

    >>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
    ...                   page_structure=False, safe_attrs_only=False)
    >>> print(cleaner.clean_html(html))
    a link another link
    a paragraph
    secret EVIL!
    of EVIL! Password: annoying EVIL! spam spam SPAM!

To control the removal of CSS styles, set the ``style`` and/or
``inline_style`` keyword arguments to ``True`` when creating a ``Cleaner``
instance. If neither option is enabled, only ``@import`` rules are
automatically removed from CSS content.

You can also whitelist some otherwise dangerous content with
``Cleaner(host_whitelist=['www.youtube.com'])``, which would allow
embedded media from YouTube, while still filtering out embedded media
from other sites.

See the docstring of ``Cleaner`` for the details of what can be cleaned.


autolink
--------

In addition to cleaning up malicious HTML, ``lxml_html_clean``
contains functions to do other things to your HTML. This includes
autolinking::

    autolink(doc, ...)

    autolink_html(html, ...)

This finds anything that looks like a link (e.g.,
``http://example.com``) in the *text* of an HTML document, and
turns it into an anchor. It avoids making bad links.

Links in the elements ``