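Question:
BeautifulSoup provides very limited support for CSS selectors. For example, the only supported pseudo-class is nth-of-type, and it accepts only numeric values; arguments like even or odd are not allowed.
Is it possible to extend BeautifulSoup's CSS selectors, or to have it use lxml.cssselect internally as the underlying CSS selection mechanism?
Let's look at a sample problem/use case. Find only the even rows in the following HTML: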
1 |
2 |
3 |
4 |
With lxml.html and lxml.cssselect, this is easy to do with :nth-of-type(even):
from lxml.html import fromstring
from lxml.cssselect import CSSSelector

tree = fromstring(data)
sel = CSSSelector('tr:nth-of-type(even)')
print([e.text_content().strip() for e in sel(tree)])
However, in BeautifulSoup:
print(soup.select("tr:nth-of-type(even)"))
raises an error:
NotImplementedError: Only numeric values are currently supported for the nth-of-type pseudo-class.
Note that we can work around this with .find_all():
print([row.get_text(strip=True) for index, row in enumerate(soup.find_all("tr"), start=1) if index % 2 == 0])
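The enumerate-based workaround can be wrapped in a small helper that also accepts "odd" or a single position. This is only a sketch: the name filter_nth is hypothetical, and the list of strings stands in for the rows that soup.find_all("tr") would return.

```python
def filter_nth(items, spec):
    """Keep the items whose 1-based position matches `spec`.

    `spec` may be "even", "odd", or a positive integer (or its string form).
    """
    if spec in ("even", "odd"):
        wanted_remainder = 0 if spec == "even" else 1
        return [item for index, item in enumerate(items, start=1)
                if index % 2 == wanted_remainder]
    position = int(spec)  # raises ValueError for any other kind of value
    if position < 1:
        raise ValueError("position must be at least 1")
    return [item for index, item in enumerate(items, start=1)
            if index == position]


# Stand-ins for the text of the four table rows in the example.
rows = ["1", "2", "3", "4"]
even_rows = filter_nth(rows, "even")
odd_rows = filter_nth(rows, "odd")
third_row = filter_nth(rows, 3)
```

With BeautifulSoup this would be called as filter_nth(soup.find_all("tr"), "even"). Note that it counts rows in document order across the whole tree rather than per parent, so it only agrees with the CSS nth-of-type semantics when all rows share one parent.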
Martin Valgu.. 8
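After checking the source code, it seems that BeautifulSoup does not provide any convenient point in its interface to extend or monkey-patch its existing functionality in this regard. Using functionality from lxml is not possible either, since BeautifulSoup only uses lxml during parsing and then builds its own objects from the parsing results. The lxml objects are not preserved and cannot be accessed later.
That being said, with enough determination and with the flexibility and introspection capabilities of Python, anything is possible. You can even modify the internals of a BeautifulSoup method at run-time: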
import inspect
import re
import textwrap

import bs4.element


def replace_code_lines(source, start_token, end_token,
                       replacement, escape_tokens=True):
    """Replace the source code between `start_token` and `end_token`
    in `source` with `replacement`. The `start_token` portion is included
    in the replaced code. If `escape_tokens` is True (default),
    escape the tokens to avoid them being treated as a regular expression."""

    if escape_tokens:
        start_token = re.escape(start_token)
        end_token = re.escape(end_token)

    def replace_with_indent(match):
        indent = match.group(1)
        return textwrap.indent(replacement, indent)

    return re.sub(r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(start_token, end_token),
                  replace_with_indent, source, flags=re.MULTILINE)


# Get the source code of the Tag.select() method
src = textwrap.dedent(inspect.getsource(bs4.element.Tag.select))

# Replace the relevant part of the method
start_token = "if pseudo_type == 'nth-of-type':"
end_token = "else"
replacement = """\
if pseudo_type == 'nth-of-type':
    try:
        if pseudo_value in ("even", "odd"):
            pass
        else:
            pseudo_value = int(pseudo_value)
    except:
        raise NotImplementedError(
            'Only numeric values, "even" and "odd" are currently '
            'supported for the nth-of-type pseudo-class.')
    if isinstance(pseudo_value, int) and pseudo_value < 1:
        raise ValueError(
            'nth-of-type pseudo-class value must be at least 1.')
    class Counter(object):
        def __init__(self, destination):
            self.count = 0
            self.destination = destination

        def nth_child_of_type(self, tag):
            self.count += 1
            if pseudo_value == "even":
                return not bool(self.count % 2)
            elif pseudo_value == "odd":
                return bool(self.count % 2)
            elif self.count == self.destination:
                return True
            elif self.count > self.destination:
                # Stop the generator that's sending us
                # these things.
                raise StopIteration()
            return False
    checker = Counter(pseudo_value).nth_child_of_type
"""
new_src = replace_code_lines(src, start_token, end_token, replacement)

# Compile it and execute it in the target module's namespace
exec(new_src, bs4.element.__dict__)

# Monkey patch the target method
bs4.element.Tag.select = bs4.element.select
This is the section of code being modified.
Of course, this is anything but elegant or reliable. I don't expect it to be seriously used anywhere.
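To see the source-rewriting step in isolation, here is a minimal sketch that applies the same regex replacement to a toy function held in a string instead of to bs4. The describe function and its strings are hypothetical stand-ins, and the source is a string literal rather than the output of inspect.getsource().

```python
import re
import textwrap


def replace_code_lines(source, start_token, end_token, replacement):
    """Replace the block that starts at `start_token` and runs up to the next
    line at the same indentation that begins with `end_token`."""
    start_token = re.escape(start_token)
    end_token = re.escape(end_token)

    def replace_with_indent(match):
        # Re-indent the replacement to the depth of the block it replaces.
        return textwrap.indent(replacement, match.group(1))

    pattern = r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(start_token, end_token)
    return re.sub(pattern, replace_with_indent, source, flags=re.MULTILINE)


# Hypothetical stand-in for a library function; this is not bs4 code.
toy_source = '''\
def describe(value):
    if value == "number":
        result = "numeric only"
    else:
        result = "unknown"
    return result
'''

# New body for the `if` branch, written at indentation level zero.
replacement = '''\
if value in ("number", "even", "odd"):
    result = "supported"
'''

patched_source = replace_code_lines(
    toy_source, 'if value == "number":', "else", replacement)

# Execute the patched source in a fresh namespace and pull out the function.
namespace = {}
exec(patched_source, namespace)
describe = namespace["describe"]
```

The else branch is left untouched because the lookahead stops the match just before it. The same fragility applies here as in the bs4 case: if the target source is ever reworded, the tokens no longer match and re.sub silently returns the source unchanged.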