Extracting text from an HTML file using Python

Author: 和谐啄木鸟 | 2023-09-02 13:38

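I want to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I want something more robust than regular expressions, which may fail on poorly formed HTML. I have seen many people recommend Beautiful Soup, but I ran into a few problems using it. For one, it picked up unwanted text, such as JavaScript source. It also did not interpret HTML entities. For example, I would expect &#39; in the HTML source to be converted to an apostrophe in the text, just as if I had pasted the browser content into Notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown, which then has to be converted to plain text. It comes with no examples or documentation, but the code looks clean.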

Related questions:

Filter out HTML tags and resolve entities in Python

Convert XML/HTML entities into a Unicode string in Python

PeYoTlL.. 129

The best piece of code I found for extracting text without getting JavaScript or other unwanted content:

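from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
# name the parser explicitly so BeautifulSoup does not warn about guessing one
soup = BeautifulSoup(html, "html.parser")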

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You just have to install BeautifulSoup first:

pip install beautifulsoup4

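Instead of `soup.get_text()` I used `soup.body.get_text()`, so that I do not get any text from the `<head>` element, such as the title. (5 upvotes)

For Python 3, use `from urllib.request import urlopen` (5 upvotes)

The kill-script bit, saviour!! (3 upvotes)

What if we want to select some lines, say, just line 3? (2 upvotes)

After going through a lot of Stack Overflow answers, I feel this is the best option for me. One problem I ran into was that lines were joined together in some cases. I was able to overcome it by adding a separator in the get_text function: `text = soup.get_text(separator=' ')` (2 upvotes)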
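
The whitespace clean-up from the answer above, together with the comment asking how to pick out a single line, can be tried without any network access. This is only a sketch: `clean_text` is a hypothetical helper name (not part of BeautifulSoup), the sample string is made up, and it assumes the text has already been extracted, for example with `soup.get_text()`.

```python
def clean_text(text):
    """Strip each line, split on double spaces, and drop empty pieces."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


raw = "  Headline one  Headline two \n\n   Body text \n"
cleaned = clean_text(raw)

# Lists are zero-indexed, so index 2 is the third line of the cleaned text.
third_line = cleaned.splitlines()[2]
print(third_line)
```

Indexing into `splitlines()` raises `IndexError` when the page has fewer lines than expected, so real code would probably want a length check first.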

RexE.. 124

html2text is a Python program that does a pretty good job at this.

Amazing! Its author is Aaron Swartz (RIP).

It is GPL 3.0, which means it may be incompatible

Has anyone found an alternative to html2text because of GPL 3.0?

3> Shatu..：

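Note: NLTK no longer supports the clean_html function.

Original answer below, with an alternative in the comments section.

Use NLTK

I wasted 4-5 hours fixing the issues with html2text. Luckily, I came across NLTK.
It works magically.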

import nltk   
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"    
html = urlopen(url).read()    
raw = nltk.clean_html(html)  
print(raw)

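@alexanderlukanin13 From the source: `raise NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")`

Apparently, clean_html is no longer supported: https://github.com/nltk/nltk/commit/39a303e5ddc4cdb1a0b00a3be426239b1c24c8bb

Sometimes that is enough :)

I want to upvote this a thousand times. I was stuck in regex hell, but now I see the wisdom of NLTK.

Importing a library as heavy as nltk for such a simple task would be too much.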

4> xperroni..：

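I found myself facing the same problem today. I wrote a very simple HTML parser that strips all markup from incoming content and returns the remaining text with only minimal formatting.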

from html.parser import HTMLParser  # Python 3; on Python 2.7 use: from HTMLParser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:  # on any parsing failure, log it and fall back to the raw input
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

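This seems to be the most straightforward way of doing this in Python (2.7) using only the default modules. It is really silly, though, because this is such a commonly needed thing and there is no good reason why the default HTMLParser module does not include a parser for it.

I don't think this converts HTML characters to Unicode, right? For example, `&amp;` won't be converted to `&`, right?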
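
On the entity question in the comment above: as far as I know, on Python 3.5 and later `html.parser.HTMLParser` defaults to `convert_charrefs=True`, so the text handed to `handle_data` already has character references decoded. On Python 2.7 that is not the case. A minimal sketch for Python 3, where `_TextCollector` and `collect_text` are hypothetical names used only for illustration:

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes; entity decoding is left to HTMLParser itself."""

    def __init__(self):
        # convert_charrefs=True is the default on Python 3.5+; spelled out for clarity.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def collect_text(markup):
    parser = _TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(collect_text("<p>Tom &amp; Jerry&#39;s</p>"))
```

This sketch does no whitespace handling and does not skip script or style content; it only shows where the decoding happens.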

5> 小智..：

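Here is a version of xperroni's answer that is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-HTML inverse converter.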

"""
HTML <-> text conversions.
"""
from html.parser import HTMLParser        # Python 2: from HTMLParser import HTMLParser
from html.entities import name2codepoint  # Python 2: from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        # convert_charrefs=False keeps handle_entityref/handle_charref in use on Python 3
        HTMLParser.__init__(self, convert_charrefs=False)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = chr(name2codepoint[name])  # unichr on Python 2
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            # hexadecimal references may start with 'x' or 'X'
            n = int(name[1:], 16) if name.lower().startswith('x') else int(name)
            self._buf.append(chr(n))  # unichr on Python 2

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, but not javascript and stylesheets.
    """
    parser = _HTMLToText()
    # HTMLParseError was removed in Python 3.5, so feed() is called directly
    parser.feed(html)
    parser.close()
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)
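
For the escaping half of `text_to_html`, the Python 3 standard library offers `html.escape`, which may be enough when URL linking is not needed. The following is a sketch under that assumption: `escape_with_breaks` is a hypothetical helper, and the newline handling follows what the docstring describes rather than what the function body above does.

```python
import html


def escape_with_breaks(text):
    """Escape &, <, > and quotes, then turn newlines into <br> tags."""
    escaped = html.escape(text, quote=True)
    return escaped.replace("\n", "<br>\n")


print(escape_with_breaks('a < b & "c"\nnext'))
```

Escaping first and adding `<br>` afterwards matters here; doing it the other way round would escape the inserted tags as well.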

6> Floyd..：

I know there are already a lot of answers, but the most elegant and pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ''.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

To avoid a warning, specify a parser for BeautifulSoup to use: `text = ''.join(BeautifulSoup(some_html_string, "lxml").findAll(text=True))`

7> GeekTantra..：

You can also use the html2text method in the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

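To install stripogram, run sudo easy_install stripogram

According to [its pypi page](http://pypi.python.org/pypi/stripogram), this module is deprecated: "Unless you have some historical reason for using this package, I'd advise against it!"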

8> Nuncjo..：

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide which tags to keep:

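from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
# keep maps each tag to preserve to the list of attributes to keep on it
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)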

9> PyNEwbie..：

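PyParsing does a great job. The PyParsing wiki was killed, so here is another location with examples of PyParsing usage (example link). One reason to invest a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual that is inexpensive.

Having said that, I use BeautifulSoup a lot, and the entity issues are not that hard to deal with; you can convert them before you run BeautifulSoup.

Good luck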
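
The entity conversion mentioned in this answer can be done with the standard library alone. A minimal sketch, assuming Python 3.4 or later, where `html.unescape` is available; the sample string is made up for illustration:

```python
import html

sample = "It&#39;s 5 &lt; 6 &amp; that&rsquo;s fine"

# unescape() resolves numeric references (&#39;) and named ones (&lt;, &amp;, &rsquo;).
decoded = html.unescape(sample)
print(decoded)
```

Note that decoding before parsing turns `&lt;` into a literal `<`, which a parser may then read as markup, so decoding the extracted text afterwards is often the safer order.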
