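Using Python 3 and BeautifulSoup 4, I would like to extract from an HTML page only the text that is marked off by the comment directly above it. An example:
<!--UNIQUE COMMENT--> I would like to get this text <!--SECOND UNIQUE COMMENT--> I would also like to find this text
I have found various ways to extract a page's text or its comments, but none that does what I am looking for. Any help would be greatly appreciated.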
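You just need to iterate over all the comments, check whether each one is among the entries you want, and then display the text of the element that follows it, like this: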
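from bs4 import BeautifulSoup, Comment

html = """
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
"""

soup = BeautifulSoup(html, 'lxml')

# Select only the comment nodes in the parsed tree.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        # next_element is the text node that immediately follows the comment.
        print(comment.next_element.strip())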
This displays the following:
I would like to get this text
I would also like to find this text
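If the text after a comment may itself contain markup, `next_element` only returns the first node. A minimal sketch of one way to handle that, assuming the section for each comment runs until the next comment at the same level: the helper name `text_after_comments`, the `sample` string, and its `<b>` tag are hypothetical, and the built-in `html.parser` is used here instead of `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, Comment


def text_after_comments(html, wanted):
    """Map each wanted comment to the text between it and the next comment."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        label = comment.strip()
        if label not in wanted:
            continue
        parts = []
        for sibling in comment.next_siblings:
            if isinstance(sibling, Comment):
                break  # the next comment ends this section
            # Text nodes are str subclasses; tags expose get_text().
            parts.append(sibling if isinstance(sibling, str) else sibling.get_text())
        # Collapse runs of whitespace and trim the ends.
        result[label] = " ".join("".join(parts).split())
    return result


# Hypothetical sample input with inline markup after the first comment.
sample = """<p>p tag text</p>
<!--UNIQUE COMMENT--> I would like to get <b>this</b> text
<!--SECOND UNIQUE COMMENT--> I would also like to find this text"""

sections = text_after_comments(sample, {"UNIQUE COMMENT", "SECOND UNIQUE COMMENT"})
for label, text in sections.items():
    print(f"{label}: {text}")
```

With this sample, each comment label should map to its following sentence with the `<b>` markup flattened into plain text. Returning a dictionary rather than printing keeps the association between comment and text, which is handy when the comments act as field names.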