I am trying to clean up and XSS-proof some HTML input from the client side. I am using Python 2.6 and Beautiful Soup. I parse the input, strip all tags and attributes not in a whitelist, and transform the tree back to a string.
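For concreteness, a condensed sketch of this kind of whitelist stripper (written against the Beautiful Soup 4 API for clarity, although the question itself uses the 3.x series; `ALLOWED_TAGS` and `ALLOWED_ATTRS` are illustrative placeholders rather than my actual whitelists):

```python
from bs4 import BeautifulSoup

ALLOWED_TAGS = set(['a', 'b', 'em', 'i', 'p', 'strong'])
ALLOWED_ATTRS = {'a': set(['href', 'title'])}

def sanitize(html):
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(True):      # every element node in the tree
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()                 # drop the tag but keep its children
        else:
            allowed = ALLOWED_ATTRS.get(tag.name, set())
            tag.attrs = dict((k, v) for k, v in tag.attrs.items()
                             if k in allowed)
    return soup.decode()                 # serialise the tree back to Unicode
```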
However...
```
>>> unicode(BeautifulSoup('text < text'))
u'text < text'
```
That does not look like valid HTML to me. Combined with my tag stripper, it opens the door to all sorts of nastiness:
```
>>> print BeautifulSoup('<<script></script>script>alert("xss")<<script></script>script>').prettify()
<
script>alert("xss")<
script>
```

The `<script></script>` pairs get removed by the stripper, and what remains is not only an XSS attack but even valid HTML.
The obvious solution is to replace all `<` characters by `&lt;` that, after parsing, are found not to belong to a tag (and similarly for `>`, `&`, `'` and `"`). But the Beautiful Soup documentation only mentions the parsing of entities, not the production of them. Of course I can run a replace over all `NavigableString` nodes, but since I might miss something, I'd rather let some tried and tested code do the work.
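For illustration, that manual fallback might look like the following; a minimal sketch against the Beautiful Soup 3 API, not tried-and-tested code (for instance, `cgi.escape` also re-escapes `&`, so any entities already present in the text would end up double-escaped):

```python
import cgi
from BeautifulSoup import BeautifulSoup, NavigableString

def escape_text_nodes(soup):
    for node in soup.findAll(text=True):   # every NavigableString
        # quote=True also escapes double quotes; single quotes are
        # still missed, which is exactly the kind of gap I worry about.
        escaped = cgi.escape(unicode(node), quote=True)
        node.replaceWith(NavigableString(escaped))
    return soup
```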
Why doesn't Beautiful Soup escape `<` (and other magic characters) by default, and how do I make it do that?
N.B. I have also looked at `lxml.html.clean`. It seems to work on a blacklist basis rather than a whitelist, so it does not look very safe to me. Tags can be whitelisted, but attributes cannot, and it allows too many attributes for my taste (for example `tabindex`). Also, it throws an `AssertionError` on some of my input. Not good.
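For reference, a minimal usage sketch of that interface, not a recommendation (`allow_tags` requires `remove_unknown_tags=False`, and attribute filtering is limited to the all-or-nothing `safe_attrs_only` switch rather than a per-tag whitelist):

```python
from lxml.html.clean import Cleaner

cleaner = Cleaner(allow_tags=['a', 'b', 'em', 'i', 'p', 'strong'],
                  remove_unknown_tags=False,
                  safe_attrs_only=True)
# tabindex is let through by lxml's built-in safe list --
# the very complaint made above.
print(cleaner.clean_html('<p tabindex="1">hello</p>'))
```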
Suggestions for other ways to clean up HTML are also most welcome. I am hardly the only one in the world trying to do this, yet there seems to be no standard solution.
I know this is 3.5yrs after your original question, but you can use the `formatter='html'` argument to `prettify()`, `encode()`, or `decode()` to produce well-formed HTML.