The story:
When parsing HTML, BeautifulSoup treats the class attribute as a multi-valued attribute and handles it in a special way:
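"Remember that a single tag can have multiple values for its "class" attribute. When you search for a tag that matches a certain CSS class, you're matching against any of its CSS classes."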
Also, a quote from the built-in HTMLTreeBuilder, which BeautifulSoup uses as the base for other tree builder classes such as HTMLParserTreeBuilder:
    # The HTML standard defines these attributes as containing a
    # space-separated list of values, not a single value. That is,
    # class="foo bar" means that the 'class' attribute has two values,
    # 'foo' and 'bar', not the single value 'foo bar'. When we
    # encounter one of these attributes, we will parse its value into
    # a list of values if possible. Upon output, the list will be
    # converted back into a string.
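To make the behavior described in that comment concrete, here is a minimal sketch (the markup and variable names are made up for illustration). It shows class being parsed into a list, an ordinary attribute such as id staying a single string, and the list being joined back into a string on output:

```python
from bs4 import BeautifulSoup

# Illustrative markup: "class" is multi-valued, "id" is not.
html = '<p class="foo bar" id="first second">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

class_value = p["class"]  # parsed into a list of values
id_value = p["id"]        # left as one string, spaces and all
round_trip = str(p)       # the list is joined back into a string on output

print(class_value)
print(id_value)
print(round_trip)
```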
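The question:
How can I configure BeautifulSoup to handle class as a usual single-valued attribute? In other words, I don't want it to handle class specially; I want it treated as a regular attribute.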
FYI, here is one of the use cases where this would be helpful:
BeautifulSoup returns empty list when searching by compound class names
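For context, this is roughly how a compound class search behaves with the default multi-valued handling. The sketch below assumes a reasonably recent bs4 release (older releases may behave differently) and uses made-up markup. A compound string only matches when it equals the attribute value exactly, so changing the order of the class names yields an empty list, while a CSS selector does not depend on the order:

```python
from bs4 import BeautifulSoup

# Illustrative markup with a compound class name.
html = '<p class="body strikeout">text</p>'
soup = BeautifulSoup(html, "html.parser")

# Matching any single CSS class works.
by_single = soup.find_all("p", class_="strikeout")

# The exact string value of the attribute also matches.
by_exact = soup.find_all("p", class_="body strikeout")

# The same classes in a different order do not match.
by_reordered = soup.find_all("p", class_="strikeout body")

# A CSS selector matches regardless of class order.
by_selector = soup.select("p.strikeout.body")

print(len(by_single), len(by_exact), len(by_reordered), len(by_selector))
```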
What I've tried:
I've actually made it work by creating a custom tree builder class and removing class from the list of specially handled attributes:
    from bs4 import BeautifulSoup
    from bs4.builder._htmlparser import HTMLParserTreeBuilder


    class MyBuilder(HTMLParserTreeBuilder):
        def __init__(self):
            super(MyBuilder, self).__init__()
            # BeautifulSoup, please don't treat "class" specially
            self.cdata_list_attributes["*"].remove("class")


    soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())
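What I don't like about this approach is that it is quite "unnatural" and "magical", and it involves importing the "private" internal _htmlparser module. I hope there is a simpler way.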
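Note: I want to keep all other HTML-parsing-related features, which means I don't want to parse the HTML with "xml"-only features (which could have been another workaround).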
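"What I don't like about this approach is that it is quite "unnatural" and "magical", and it involves importing the "private" internal _htmlparser module. I hope there is a simpler way."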
Yes, there is: you can import it from bs4.builder instead:
    from bs4 import BeautifulSoup
    from bs4.builder import HTMLParserTreeBuilder


    class MyBuilder(HTMLParserTreeBuilder):
        def __init__(self):
            super(MyBuilder, self).__init__()
            # BeautifulSoup, please don't treat "class" as a list
            self.cdata_list_attributes["*"].remove("class")


    soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())
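One caveat, as far as I can tell: cdata_list_attributes appears to be shared with the builder class defaults, so calling remove() on it in place may also change how other soups in the same process treat class, and may fail the second time the builder is instantiated. A hedged variant that works on a private copy could look like this (SingleValuedClassBuilder is a hypothetical name, not part of bs4):

```python
import copy

from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder


class SingleValuedClassBuilder(HTMLParserTreeBuilder):
    """Hypothetical builder that treats "class" as a plain string."""

    def __init__(self):
        super().__init__()
        # Copy first so the defaults shared with other builders stay untouched.
        attrs = copy.deepcopy(self.cdata_list_attributes)
        # Rebuild the universal ("*") entry without "class".
        attrs["*"] = [name for name in attrs["*"] if name != "class"]
        self.cdata_list_attributes = attrs


data = '<p class="foo bar">text</p>'

custom = BeautifulSoup(data, "html.parser", builder=SingleValuedClassBuilder())
default = BeautifulSoup(data, "html.parser")

custom_class = custom.p["class"]    # a single string with this builder
default_class = default.p["class"]  # still a list with the stock builder

print(repr(custom_class), default_class)
```

Because the copy is made per instance, creating the builder repeatedly should be safe, and soups built with the stock html.parser builder keep the usual list behavior.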
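If this is important enough that you don't want to repeat yourself, put the builder in its own module and register it with register_treebuilders_from() so that it takes precedence.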
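As a side note, newer Beautiful Soup releases (4.8.0 and later, I believe) accept a multi_valued_attributes keyword argument on the BeautifulSoup constructor, which may make a custom builder unnecessary. A minimal sketch, assuming such a version is installed:

```python
from bs4 import BeautifulSoup

data = '<p class="foo bar">text</p>'

# Assumption: this bs4 version supports multi_valued_attributes.
# Passing None turns off multi-valued handling for every attribute,
# not only "class".
soup = BeautifulSoup(data, "html.parser", multi_valued_attributes=None)

class_value = soup.p["class"]                   # one plain string
matches = soup.find_all("p", class_="foo bar")  # compared as a whole string

print(repr(class_value), len(matches))
```

Note that this disables list handling for all attributes (rel, headers, and so on), which is broader than removing class alone, and that a search for class_="foo" no longer matches because the value is compared as the single string "foo bar".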