
Remove HTML tags not on an allowed list from a Python string

Author: 有风吹过best | 2023-08-28 16:59

How to solve "Remove HTML tags not on an allowed list from a Python string": 6 answers selected for you.

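I have a string containing text and HTML. I want to remove or otherwise disable some of the HTML tags, for example in this input:

a link another link
a paragraph
secret EVIL!
Password:
annoying EVIL!
spam spam SPAM!
'''

Result...

a link
another link
a paragraph
secret EVIL!
of EVIL!
Password:
annoying EVIL!
spam spam SPAM!

You can customize which elements get cleaned, and so on.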

        
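@SørenLøvborg: Cleaner also supports a whitelist, via `allow_tags`. 
Note that this uses a blacklist approach to filter out the evil bits rather than a whitelist, and only a whitelist approach can guarantee safety. 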
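
To make the difference between the two policies in the comment above concrete, here is a minimal sketch that only decides whether a tag name is permitted. The two tag sets and both function names are hypothetical, chosen purely for illustration; this is not a sanitizer.

```python
# Hypothetical tag lists, chosen only to illustrate the two policies.
BLOCKED_TAGS = {"script", "iframe", "style"}
ALLOWED_TAGS = {"p", "em", "strong", "ul", "li", "br"}


def blocklist_permits(tag_name):
    """Blacklist policy: anything not explicitly blocked gets through."""
    return tag_name.lower() not in BLOCKED_TAGS


def allowlist_permits(tag_name):
    """Whitelist policy: only explicitly listed tags get through."""
    return tag_name.lower() in ALLOWED_TAGS


for name in ("p", "script", "object"):
    print(name, blocklist_permits(name), allowlist_permits(name))
# p True True
# script False False
# object True False
```

The `object` row is the point: a tag nobody thought to block passes the blacklist, while the whitelist rejects it by default.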

2> bryan..：
Here is a simple solution using BeautifulSoup:

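from bs4 import BeautifulSoup

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

def sanitize_html(value):
    # Name the parser explicitly so results do not depend on what is installed.
    soup = BeautifulSoup(value, 'html.parser')

    # find_all(True) matches every tag in the document.
    for tag in soup.find_all(True):
        if tag.name not in VALID_TAGS:
            # Hide the tag's own markup but keep rendering its contents.
            tag.hidden = True

    # decode_contents() returns the resulting markup as a str.
    return soup.decode_contents()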


If you also want to remove the contents of the invalid tags, substitute tag.extract() for tag.hidden = True.

You might also look into using lxml and Tidy.

        
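This is not safe! See Chris Drost's answer: http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/812785#812785 
You may also want to restrict the use of attributes. To do so, just add this to the solution above: `valid_attrs = 'href src'.split()`, then inside the `for ...:` loop, `tag.attrs = [(attr, val) for attr, val in tag.attrs if attr in valid_attrs]`. Hope that helps. 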
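
For readers without BeautifulSoup, the same idea (keep only allowed tags, and only allowed attributes on them) can be sketched with the standard library's `html.parser`. This is a sketch under stated assumptions, not the approach of any answer here: the `ALLOWED` policy, the `AllowlistSanitizer` class, and the `sanitize` function are hypothetical names introduced for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy for illustration: tag name -> attributes it may keep.
ALLOWED = {
    "p": set(),
    "em": set(),
    "strong": set(),
    "a": {"href", "title"},
}


class AllowlistSanitizer(HTMLParser):
    """Re-emit allowlisted tags and attributes; escape every piece of text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself; its text content is still emitted
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in ALLOWED[tag]
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Escaping text means a stray "<" can never turn into markup later.
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = AllowlistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="x()">hi <span>there</span> '
               '<a href="/x" onclick="y()">link</a></p>'))
```

Because the output is rebuilt from parser events instead of by deleting pieces of the input, removed tags cannot reassemble into new ones. The sketch does not inspect attribute values, so a `javascript:` URL in `href` would still pass and needs a separate check.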

3> 小智..：
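The solutions above that rely on Beautiful Soup will not work. You might be able to hack something together with Beautiful Soup that goes beyond them, because Beautiful Soup gives you access to the parse tree. At some point I may try to solve the problem properly, but that is a week-long project or so, and I don't have a free week coming up.

To be specific: not only will Beautiful Soup throw exceptions for some parsing errors that the code above doesn't catch, but there are also plenty of very real XSS vulnerabilities that go undetected, such as: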

<script>


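Probably the best thing you can do is to escape every < as &lt;, which prohibits all HTML, and then use a restricted subset such as Markdown to render formatting properly. In particular, you can also go back and reintroduce common bits of HTML with a regular expression. Here is roughly what the process looks like: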

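import re
from markdown import markdown

_lt_ = re.compile('<')
_tc_ = '~(lt)~'   # or whatever, so long as markdown doesn't mangle it.
_tc_src_ = re.escape(_tc_)  # escape it: the parentheses are regex metacharacters
_ok_ = re.compile(_tc_src_ + '(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
_sqrt_ = re.compile(_tc_src_ + 'sqrt>', re.I)     # just to give an example of extending
_endsqrt_ = re.compile(_tc_src_ + '/sqrt>', re.I) # html syntax with your own elements.
_tcre_ = re.compile(_tc_src_)

def sanitize(text):
    text = _lt_.sub(_tc_, text)       # neutralize every '<' with a placeholder
    text = markdown(text)             # render the restricted Markdown subset
    text = _ok_.sub(r'<\1>', text)    # restore the whitelisted tags
    text = _sqrt_.sub(r'√', text)
    text = _endsqrt_.sub(r'', text)
    return _tcre_.sub('&lt;', text)   # every remaining '<' stays escaped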


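I haven't tested that code yet, so there may be bugs. But you get the general idea: you have to blacklist all HTML first, and only then whitelist the bits that are OK.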
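
The escape-everything-then-restore idea can also be tried without Markdown. The sketch below is an assumption-laden variant, not the answer's code: it swaps the placeholder trick for `html.escape`, and `SAFE_TAGS` and `escape_then_allow` are hypothetical names. Only bare tags with no attributes are restored.

```python
import html
import re

# Hypothetical set of tags to restore; attributes are never restored.
SAFE_TAGS = ("b", "i", "em", "strong", "p", "br", "code")

# Matches an escaped bare tag such as "&lt;b&gt;" or "&lt;/em&gt;".
_ESCAPED_TAG = re.compile(
    r"&lt;(/?(?:%s))&gt;" % "|".join(SAFE_TAGS), re.IGNORECASE
)


def escape_then_allow(text):
    # Step 1: prohibit all HTML by escaping &, <, > and quotes.
    escaped = html.escape(text, quote=True)
    # Step 2: turn the escaped form of allowed bare tags back into real tags.
    return _ESCAPED_TAG.sub(r"<\1>", escaped)


print(escape_then_allow("<b>bold</b> <script>alert(1)</script>"))
# <b>bold</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

A tag carrying attributes, such as `<b onclick="x()">`, does not match the pattern and therefore stays escaped.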

        
If you are trying this out, start with `import re` and `from markdown import markdown`. If you don't have markdown, you can try easy_install. 

4> Jochen Ritze..：
This is what I use in my own project. The acceptable_elements and acceptable_attributes lists come from feedparser, and BeautifulSoup does the work.

from bs4 import BeautifulSoup

acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
      'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
      'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
      'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img',
      'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol',
      'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
      'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
      'thead', 'tr', 'tt', 'u', 'ul', 'var']

acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
  'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
  'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
  'colspan', 'color', 'compact', 'coords', 'datetime', 'dir',
  'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
  'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
  'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt',
  'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
  'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
  'usemap', 'valign', 'value', 'vspace', 'width']

def clean_html(fragment):
    while True:
        soup = BeautifulSoup(fragment, 'html.parser')
        removed = False
        for tag in soup.find_all(True):  # find all tags
            if tag.name not in acceptable_elements:
                tag.extract()  # remove the bad ones
                removed = True
            else:  # it might have bad attributes
                # copy the names first so we can delete while iterating
                for attr in list(tag.attrs):
                    if attr not in acceptable_attributes:
                        del tag[attr]

        # turn it back to html
        fragment = str(soup)

        if removed:
            # we removed tags, and tricky input could exploit that!
            # we need to reparse the html until it stops changing
            continue  # next round

        return fragment


A few small tests to make sure this behaves correctly:

tests = [   #text should work
            ('this is text
but this too', 'this is textbut this too'),
            # make sure we cant exploit removal of tags
            ('<script> alert("Haha, I hacked your page."); </script>', ''),
            # try the same trick with attributes, gives an Exception
            ('load="alert("Haha, I hacked your page.");">1',  Exception),
             # no tags should be skipped
            ('', ''),
            # leave valid tags but remove bad attributes
            ('1

', '1'),
]

for text, out in tests:
    try:
        res = clean_html(text)
        assert res == out, "%s => %s != %s" % (text, res, out)
    except out as e:
        assert isinstance(e, out), "Wrong exception %r" % e

        
This is not safe! See Chris Drost's answer: http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/812785#812785 
@THC4k: Sorry, I forgot to mention that I had to modify the example. Here is one that works: `<<script>script> alert("Haha, I hacked your page."); <<script>script>` 
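
The reason nested payloads like this matter is that deleting a tag can splice the surrounding characters into a new tag. The sketch below demonstrates only that failure mode and the "repeat until nothing changes" remedy the answers rely on. The payload, the pattern, and both function names are hypothetical, and a regex like this is not a real sanitizer.

```python
import re

# Deliberately naive: matches only bare <script> and </script> tags.
_SCRIPT_TAG = re.compile(r"</?script>", re.IGNORECASE)


def strip_once(text):
    """Delete matching tags in a single pass."""
    return _SCRIPT_TAG.sub("", text)


def strip_until_stable(text):
    """Repeat the single pass until the output stops changing."""
    while True:
        stripped = strip_once(text)
        if stripped == text:
            return text
        text = stripped


# Hypothetical payload: removing the inner tags reassembles the outer ones.
payload = "<scr<script>ipt>alert(1)</scr</script>ipt>"

print(strip_once(payload))          # <script>alert(1)</script>
print(strip_until_stable(payload))  # alert(1)
```

A single pass turns harmless-looking input into a working script tag, which is why the loop-until-stable versions above reparse their own output.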

5> chuangbo..：
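Bleach does this better, with more useful options. It is built on html5lib and is production-ready. See the documentation for the bleach.clean function. Its default configuration escapes unsafe tags, for example:

import bleach
bleach.clean("<script>evil</script> example")
# '&lt;script&gt;evil&lt;/script&gt; example'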

        

6> Kiran Jonnal..：
I modified Bryan's BeautifulSoup solution to address the problem raised by Chris Drost. It is a little crude, but it does the job:

from bs4 import BeautifulSoup, Comment

VALID_TAGS = {'strong': [],
              'em': [],
              'p': [],
              'ol': [],
              'ul': [],
              'li': [],
              'br': [],
              'a': ['href', 'title']
              }

def sanitize_html(value, valid_tags=VALID_TAGS):
    soup = BeautifulSoup(value, 'html.parser')
    # Drop HTML comments entirely.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Some markup can be crafted to slip through BeautifulSoup's parser, so
    # we run this repeatedly until it generates the same output twice.
    newoutput = soup.decode_contents()
    while True:
        oldoutput = newoutput
        soup = BeautifulSoup(newoutput, 'html.parser')
        for tag in soup.find_all(True):
            if tag.name not in valid_tags:
                # Hide the tag's own markup but keep its contents.
                tag.hidden = True
            else:
                # Keep only the attributes allowed for this tag.
                tag.attrs = {attr: val for attr, val in tag.attrs.items()
                             if attr in valid_tags[tag.name]}
        newoutput = soup.decode_contents()
        if oldoutput == newoutput:
            break
    return newoutput


Edit: Updated to support valid attributes.



    

    

    