'''
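Result...
You can customize which elements get cleaned, and so on.
@SørenLøvborg: Cleaner also supports a whitelist, via `allow_tags`.
Note that this uses a blacklist approach to filter out the evil bits rather than a whitelist, and only a whitelist approach can guarantee safety.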
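To make the blacklist/whitelist distinction in the comment above concrete, here is a minimal sketch of the two policies as plain tag-name checks. The tag sets and function names are made up for illustration and are not lxml's defaults; this only shows the policy difference and is not a sanitizer.

```python
# Hypothetical tag sets, chosen only to illustrate the two policies.
BLACKLIST = {'script', 'iframe', 'object', 'embed'}
WHITELIST = {'b', 'i', 'em', 'strong', 'p', 'br'}

def passes_blacklist(tag_name):
    # Blacklist: anything we did not think of in advance is let through.
    return tag_name.lower() not in BLACKLIST

def passes_whitelist(tag_name):
    # Whitelist: anything we did not explicitly approve is rejected.
    return tag_name.lower() in WHITELIST

for name in ['b', 'script', 'svg', 'marquee']:
    print(name, passes_blacklist(name), passes_whitelist(name))
```

A tag such as `svg` that nobody remembered to list slips past the blacklist but is stopped by the whitelist, which is the failure mode the comment is warning about.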
2> bryan..:
Here is a simple solution using BeautifulSoup:
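from bs4 import BeautifulSoup

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

def sanitize_html(value):
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(value, 'html.parser')
    for tag in soup.find_all(True):
        if tag.name not in VALID_TAGS:
            # Hide the tag itself but still render its children.
            tag.hidden = True
    return soup.decode_contents()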
If you want to remove the contents of the invalid tags as well, substitute tag.extract() for tag.hidden.
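You may also want to look into lxml and Tidy.
This isn't safe! See Chris Dost's answer: http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/812785#812785
You might also want to restrict which attributes are allowed. To do so, just add this to the solution above: `valid_attrs = 'href src'.split()` `for ...: ...` `tag.attrs = {attr: val for attr, val in tag.attrs.items() if attr in valid_attrs}` hth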
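For comparison, here is a rough sketch of the same idea (whitelisted tags plus whitelisted attributes) using only the standard library's `html.parser` instead of BeautifulSoup. The class name, `VALID_ATTRS`, and `SAFE_URL_PREFIXES` are assumptions for the example, not part of the answer above, and this has not been audited as a real sanitizer.

```python
from html import escape
from html.parser import HTMLParser

VALID_TAGS = {'strong', 'em', 'p', 'ul', 'li', 'br', 'a'}
VALID_ATTRS = {'href', 'title'}                      # hypothetical attribute whitelist
SAFE_URL_PREFIXES = ('http://', 'https://', 'mailto:')  # hypothetical scheme whitelist

class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in VALID_TAGS:
            return  # drop the tag itself; its text content is still emitted
        kept = []
        for name, value in attrs:
            value = value or ''
            if name not in VALID_ATTRS:
                continue
            # Only keep links whose scheme is explicitly approved.
            if name == 'href' and not value.lower().startswith(SAFE_URL_PREFIXES):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append('<%s%s>' % (tag, ''.join(kept)))

    def handle_startendtag(self, tag, attrs):
        # Treat <br/> like <br> so no stray closing tag is emitted.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in VALID_TAGS:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        # Text is always re-escaped, so it can never turn back into markup.
        self.out.append(escape(data, quote=False))

def sanitize_html(value):
    parser = WhitelistSanitizer()
    parser.feed(value)
    parser.close()
    return ''.join(parser.out)

print(sanitize_html('<p onclick="evil()">Hi <a href="https://example.com" onmouseover="x()">link</a></p>'))
```

Because the output is rebuilt from parser events rather than edited in place, anything the handlers do not explicitly emit (comments, unknown tags, unlisted attributes) simply never reaches the result.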
3> 小智..:
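The above solutions via Beautiful Soup will not work. You might be able to hack something together with Beautiful Soup that goes beyond them, because Beautiful Soup gives you access to the parse tree. For a while I thought I would try to solve the problem properly, but it's a week-long project or so, and I don't have a free week anytime soon.
Just to be specific: not only will Beautiful Soup throw exceptions on some parsing errors which the above code doesn't catch, but there are also plenty of very real XSS vulnerabilities that go undetected, such as: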
<script>
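One well-known member of this class of bypass is a tag nested inside another tag, so that removing the inner one reassembles the outer one. The sketch below uses a deliberately naive regex stripper as a stand-in (it is not the BeautifulSoup code above), and the payload is an illustrative assumption rather than the exact one from this answer.

```python
import re

# Naive single-pass filter: delete anything that looks like a script tag.
SCRIPT_TAG = re.compile(r'</?script\b[^>]*>', re.I)

def strip_script_once(html):
    return SCRIPT_TAG.sub('', html)

# Removing the inner tags glues the outer fragments into a fresh <script>.
payload = '<scr<script>ipt>alert(1)</scr</script>ipt>'
print(strip_script_once(payload))
```

The filter did exactly what it was told and the output is still a live script element, which is why a single removal pass cannot be trusted on its own.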
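Probably the best thing you can do is instead to escape every < as &lt;, prohibiting all HTML, and then use a restricted subset like Markdown to render formatting properly. In particular, you can also go back and re-introduce common bits of HTML with a regex. Here's what the process looks like, roughly: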
import re
from markdown import markdown

_lt_ = re.compile('<')
_tc_ = '~(lt)~'   # or whatever, so long as markdown doesn't mangle it.
_tc_pat_ = re.escape(_tc_)   # escaped, so the parentheses match literally
_ok_ = re.compile(_tc_pat_ + '(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
_sqrt_ = re.compile(_tc_pat_ + 'sqrt>', re.I)     # just to give an example of extending
_endsqrt_ = re.compile(_tc_pat_ + '/sqrt>', re.I) # html syntax with your own elements.
_tcre_ = re.compile(_tc_pat_)

def sanitize(text):
    text = _lt_.sub(_tc_, text)        # neutralize every '<' with a placeholder
    text = markdown(text)
    text = _ok_.sub(r'<\1>', text)     # re-admit the whitelisted bare tags
    text = _sqrt_.sub(r'√', text)
    text = _endsqrt_.sub(r'', text)
    return _tcre_.sub('&lt;', text)    # any placeholder left over stays escaped
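I haven't tested that code yet, so there may be bugs. But you see the general idea: you have to blacklist all HTML in general before you whitelist the OK stuff.
If you want to try this, first add: `import re` and `from markdown import markdown`. If you don't have markdown, you can try easy_install.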
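A stripped-down sketch of the escape-everything-then-re-allow flow described in this answer, using only the standard library. It skips the Markdown step entirely and uses `html.escape` in place of the placeholder token, so it is a simplification under those assumptions; the function name `escape_then_allow` is made up for the example.

```python
import html
import re

# Only bare tags are re-admitted; a tag carrying any attribute stays escaped.
_OK = re.compile(
    r'&lt;(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))&gt;', re.I)

def escape_then_allow(text):
    escaped = html.escape(text)       # step 1: prohibit all HTML
    return _OK.sub(r'<\1>', escaped)  # step 2: re-introduce the approved bits

print(escape_then_allow('<b>bold</b> <script>alert(1)</script>'))
print(escape_then_allow('<<b>script>alert(1)'))
```

Since nothing is ever deleted, there is no removal step for nested markup to exploit: whatever fails to match the whitelist pattern just remains escaped text.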
4> Jochen Ritze..:
This is what I use in my own project. The acceptable_elements/attributes come from feedparser, and BeautifulSoup does the work.
from BeautifulSoup import BeautifulSoup
acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img',
'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol',
'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
'thead', 'tr', 'tt', 'u', 'ul', 'var']
acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
'colspan', 'color', 'compact', 'coords', 'datetime', 'dir',
'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt',
'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
'usemap', 'valign', 'value', 'vspace', 'width']
def clean_html(fragment):
    while True:
        soup = BeautifulSoup(fragment)
        removed = False
        for tag in soup.findAll(True):  # find all tags
            if tag.name not in acceptable_elements:
                tag.extract()  # remove the bad ones
                removed = True
            else:  # it might have bad attributes
                # a better way to get all attributes?
                for attr in tag._getAttrMap().keys():
                    if attr not in acceptable_attributes:
                        del tag[attr]
        # turn it back to html
        fragment = unicode(soup)
        if removed:
            # we removed tags, and a tricky attacker could exploit that!
            # we need to reparse the html until it stops changing
            continue  # next round
        return fragment
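The "reparse until it stops changing" loop is the key trick in the function above. Here is a minimal standalone sketch of that fixed-point idea, with a simple regex tag stripper standing in for the BeautifulSoup pass; the names and the `max_rounds` guard are assumptions for the example.

```python
import re

TAG = re.compile(r'<[^<>]*>')

def strip_tags_once(fragment):
    # One pass: delete everything that currently looks like a tag.
    return TAG.sub('', fragment)

def strip_until_stable(fragment, max_rounds=50):
    # Repeat the pass until the output equals the input (a fixed point).
    for _ in range(max_rounds):
        cleaned = strip_tags_once(fragment)
        if cleaned == fragment:
            return cleaned
        fragment = cleaned
    raise ValueError('input did not stabilise within max_rounds')

payload = '<<script></script>script>alert(1)<<script></script>/script>'
print(strip_tags_once(payload))     # one pass reassembles a script element
print(strip_until_stable(payload))  # looping removes it as well
```

Each pass can only shorten the string, so the loop always terminates; the round limit is just a defensive cap.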
Some small tests to make sure this behaves correctly:
tests = [ #text should work
('this is text
but this too', 'this is text
but this too'),
# make sure we cant exploit removal of tags
('<script> alert("Haha, I hacked your page."); </script>', ''),
# try the same trick with attributes, gives an Exception
('load="alert("Haha, I hacked your page.");">1
', Exception),
# no tags should be skipped
('', ''),
# leave valid tags but remove bad attributes
('1