Extending CSS selectors in BeautifulSoup

Author: 135369一生真爱_890 | 2023-09-09 16:30

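Question:

BeautifulSoup provides very limited support for CSS selectors. For instance, the only supported pseudo-class is nth-of-type, and it accepts only numeric values; arguments such as even or odd are not allowed.

Is it possible to extend BeautifulSoup's CSS selectors, or to have it use lxml.cssselect internally as the underlying CSS selection mechanism?

Let's look at an example problem/use case: find only the even rows (tr elements) of an HTML table whose markup is held in data in the snippets below.

With lxml.html and lxml.cssselect this is easy to do with :nth-of-type(even):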

from lxml.html import fromstring
from lxml.cssselect import CSSSelector

tree = fromstring(data)

sel = CSSSelector('tr:nth-of-type(even)')

print([e.text_content().strip() for e in sel(tree)])

But in BeautifulSoup:

print(soup.select("tr:nth-of-type(even)"))

raises an error:

NotImplementedError: Only numeric values are currently supported for the nth-of-type pseudo-class.

Note that we can work around this with .find_all():

print([row.get_text(strip=True) for index, row in enumerate(soup.find_all("tr"), start=1) if index % 2 == 0])
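
The enumerate-based workaround counts rows across the whole document, whereas :nth-of-type counts each element among its own siblings, so the two agree only when every row shares one parent. A minimal sketch of a per-parent version follows. The names select_nth_of_type and parent_of are hypothetical, introduced here for illustration and not part of BeautifulSoup; the function accepts any sequence so it can be tried without a parsed document.

```python
from collections import defaultdict


def select_nth_of_type(items, spec, parent_of=lambda item: None):
    """Return the items whose 1-based position matches `spec`.

    `spec` is "even", "odd", or a positive integer. `parent_of` maps an
    item to a hashable key for its parent; positions are counted
    separately for each key (hypothetical helper, not a bs4 API).
    """
    if spec not in ("even", "odd"):
        spec = int(spec)
        if spec < 1:
            raise ValueError("position must be at least 1")

    counts = defaultdict(int)  # running position per parent key
    selected = []
    for item in items:
        key = parent_of(item)
        counts[key] += 1
        position = counts[key]
        if spec == "even":
            matches = position % 2 == 0
        elif spec == "odd":
            matches = position % 2 == 1
        else:
            matches = position == spec
        if matches:
            selected.append(item)
    return selected


rows = ["r1", "r2", "r3", "r4", "r5"]
print(select_nth_of_type(rows, "even"))
```

With BeautifulSoup this might be called as select_nth_of_type(soup.find_all("tr"), "even", parent_of=lambda row: id(row.parent)), relying on find_all() returning the rows in document order.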

Martin Valgu.. 8

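After checking the source code, it seems that BeautifulSoup does not provide any convenient point in its interface to extend or patch its existing functionality in this regard. Using the functionality from lxml is not possible either, since BeautifulSoup only uses lxml during parsing and uses the parsing results to create its own objects from them. The lxml objects are not preserved and cannot be accessed later.

That being said, with enough determination and with the flexibility and introspection capabilities of Python, anything is possible. You can even modify the internals of a BeautifulSoup method at run time: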

import inspect
import re
import textwrap

import bs4.element


def replace_code_lines(source, start_token, end_token,
                       replacement, escape_tokens=True):
    """Replace the source code between `start_token` and `end_token`
    in `source` with `replacement`. The `start_token` portion is included
    in the replaced code. If `escape_tokens` is True (default),
    escape the tokens to avoid them being treated as a regular expression."""

    if escape_tokens:
        start_token = re.escape(start_token)
        end_token = re.escape(end_token)

    def replace_with_indent(match):
        indent = match.group(1)
        return textwrap.indent(replacement, indent)

    return re.sub(r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(start_token, end_token),
                  replace_with_indent, source, flags=re.MULTILINE)


# Get the source code of the Tag.select() method
src = textwrap.dedent(inspect.getsource(bs4.element.Tag.select))

# Replace the relevant part of the method
start_token = "if pseudo_type == 'nth-of-type':"
end_token = "else"
replacement = """\
if pseudo_type == 'nth-of-type':
    try:
        if pseudo_value in ("even", "odd"):
            pass
        else:
            pseudo_value = int(pseudo_value)
    except (TypeError, ValueError):
        raise NotImplementedError(
            'Only numeric values, "even" and "odd" are currently '
            'supported for the nth-of-type pseudo-class.')
    if isinstance(pseudo_value, int) and pseudo_value < 1:
        raise ValueError(
            'nth-of-type pseudo-class value must be at least 1.')
    class Counter(object):
        def __init__(self, destination):
            self.count = 0
            self.destination = destination

        def nth_child_of_type(self, tag):
            self.count += 1
            if pseudo_value == "even":
                return not bool(self.count % 2)
            elif pseudo_value == "odd":
                return bool(self.count % 2)
            elif self.count == self.destination:
                return True
            elif self.count > self.destination:
                # Stop the generator that's sending us
                # these things.
                raise StopIteration()
            return False
    checker = Counter(pseudo_value).nth_child_of_type
"""
new_src = replace_code_lines(src, start_token, end_token, replacement)

# Compile it and execute it in the target module's namespace
exec(new_src, bs4.element.__dict__)
# Monkey patch the target method
bs4.element.Tag.select = bs4.element.select



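This is the section of code being modified.

Of course, this is everything but elegant and reliable. I don't envision this being seriously used anywhere.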
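
To see the source-rewriting step in isolation, here is a small sketch that applies the same regular expression to a toy function instead of bs4, so it does not depend on any particular BeautifulSoup release. The classify function, TOY_SOURCE and REPLACEMENT are made up for this illustration, and replace_code_lines is a condensed copy of the helper above that always escapes its tokens.

```python
import re
import textwrap


def replace_code_lines(source, start_token, end_token, replacement):
    """Replace the block that starts at `start_token` and ends just before
    the next line at the same indentation that begins with `end_token`."""
    start = re.escape(start_token)
    end = re.escape(end_token)

    def replace_with_indent(match):
        # Re-indent the replacement to the indentation of the matched block.
        return textwrap.indent(replacement, match.group(1))

    pattern = r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(start, end)
    return re.sub(pattern, replace_with_indent, source, flags=re.MULTILINE)


# Toy stand-in for the method source that inspect.getsource() would return.
TOY_SOURCE = '''\
def classify(value):
    if value == 'a':
        result = 'first'
    else:
        result = 'other'
    return result
'''

REPLACEMENT = '''\
if value in ('a', 'b'):
    result = 'first or second'
'''

patched = replace_code_lines(TOY_SOURCE, "if value == 'a':", "else", REPLACEMENT)

# Compile the rewritten source into a scratch namespace and fetch the function.
namespace = {}
exec(patched, namespace)
classify = namespace["classify"]

print(patched)
print(classify("b"), "|", classify("z"))
```

The lookahead keeps the else line out of the match, so only the if branch is swapped and the rest of the function is left as it was.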





    

    

    