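I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?
For example:
I get back:
&#x01ce;
which represents "ǎ", an "a" with a tone mark. Its code point is the 16-bit value 0x01ce (hexadecimal). I want to convert the HTML entity into the value u'\u01ce'.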
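Python has the htmlentitydefs module, but it doesn't include a function to unescape HTML entities.
Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website. It works with decimal, hexadecimal, and named entities: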

    import re, htmlentitydefs

    ##
    # Removes HTML or XML character references and entities from a text string.
    #
    # @param text The HTML (or XML) source text.
    # @return The plain text, as a Unicode string, if necessary.

    def unescape(text):
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    pass
            else:
                # named entity
                try:
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    pass
            return text # leave as is
        return re.sub("&#?\w+;", fixup, text)

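The function above is Python 2 code (`unichr`, `htmlentitydefs`). As a rough sketch only, and not something from Lundh's site, the same idea could be ported to Python 3 using `chr` and `html.entities.name2codepoint`. The extra `OverflowError` handling and the uppercase `X` check are additions made for this sketch:

```python
import re
from html.entities import name2codepoint


def unescape(text):
    """Replace character references and named entities with their characters."""
    def fixup(m):
        ref = m.group(0)
        if ref.startswith("&#"):
            # numeric character reference, hexadecimal or decimal
            try:
                if ref[2] in "xX":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                pass  # not a valid number or code point
        else:
            # named entity
            try:
                return chr(name2codepoint[ref[1:-1]])
            except KeyError:
                pass  # unknown name
        return ref  # leave as is

    return re.sub(r"&#?\w+;", fixup, text)


print(ascii(unescape("&#x01ce; &#462; &copy; &nosuch;")))
```

Anything that cannot be resolved is left in the text unchanged, the same as in the Python 2 version.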
The standard library's own HTMLParser has an undocumented method, unescape(), which does exactly what you would expect:

    import HTMLParser
    h = HTMLParser.HTMLParser()
    h.unescape('&copy; 2010')   # u'\xa9 2010'
    h.unescape('&#169; 2010')   # u'\xa9 2010'

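The `HTMLParser` module name above is Python 2 only. If the same script has to run on both Python 2 and Python 3, one possible approach (a sketch, not part of the answer above) is to bind a single `unescape` name at import time:

```python
try:
    # Python 3.4 and later: documented module-level function
    from html import unescape
except ImportError:
    # Python 2 fallback: the undocumented HTMLParser method
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

named = unescape('&copy; 2010')
numeric = unescape('&#169; 2010')
print(ascii(named), ascii(numeric))
```

Both calls should give the same string, since `&copy;` and `&#169;` refer to the same character.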
Use the built-in unichr; BeautifulSoup isn't necessary:

    >>> entity = '&#x01ce'
    >>> unichr(int(entity[3:], 16))
    u'\u01ce'

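The one-liner assumes a hexadecimal reference with no trailing semicolon. A slightly more forgiving version might look like the sketch below, written for Python 3 (`chr` instead of `unichr`). The helper name `char_ref_to_text` is made up for this example:

```python
def char_ref_to_text(entity):
    """Convert one numeric character reference, e.g. '&#x01ce;' or '&#462;'."""
    body = entity.strip()
    if not body.startswith("&#"):
        raise ValueError("not a numeric character reference: %r" % entity)
    # drop the leading '&#' and the optional trailing ';'
    body = body[2:].rstrip(";")
    if body[:1] in ("x", "X"):
        return chr(int(body[1:], 16))  # hexadecimal form
    return chr(int(body))              # decimal form


print(ascii(char_ref_to_text("&#x01ce")))
print(ascii(char_ref_to_text("&#462;")))
```

It only handles a single numeric reference, not named entities or references embedded in longer text.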
An alternative, if you have lxml:

    >>> import lxml.html
    >>> lxml.html.fromstring('&#x01ce').text
    u'\u01ce'

If you are on Python 3.4 or newer, you can simply use html.unescape:

    import html
    s = html.unescape(s)

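For illustration, here is a small example of `html.unescape` applied to the three kinds of reference discussed in this thread. The sample strings are just test inputs chosen for this sketch:

```python
import html

samples = ["&#x01ce;", "&#462;", "&copy; 2010", "&amp;lt;"]
decoded = [html.unescape(s) for s in samples]

for raw, text in zip(samples, decoded):
    print(raw, "->", ascii(text))
```

Note that decoding is a single pass: `&amp;lt;` becomes `&lt;`, not `<`.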
You can find an answer here: Getting international characters from a web page?
EDIT: It seems that BeautifulSoup doesn't convert entities written in hexadecimal form. That can be fixed:

    import copy, re
    from BeautifulSoup import BeautifulSoup

    hexentityMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
    # replace hexadecimal character reference by decimal one
    hexentityMassage += [(re.compile('&#x([^;]+);'),
                         lambda m: '&#%d;' % int(m.group(1), 16))]

    def convert(html):
        return BeautifulSoup(html,
            convertEntities=BeautifulSoup.HTML_ENTITIES,
            markupMassage=hexentityMassage).contents[0].string

    html = '<html>&#x01ce;&#462;</html>'
    print repr(convert(html))
    # u'\u01ce\u01ce'

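The core of that fix is the regex substitution that rewrites hexadecimal references as decimal ones. Taken on its own, without BeautifulSoup, it could be sketched like this in Python 3. The function name `hex_refs_to_decimal` is made up here, and the pattern is restricted to hex digits, which is slightly stricter than the `[^;]+` used above:

```python
import re

# matches '&#x...;' with one or more hexadecimal digits
_HEX_REF = re.compile(r"&#[xX]([0-9a-fA-F]+);")


def hex_refs_to_decimal(markup):
    """Rewrite hexadecimal character references as decimal ones."""
    return _HEX_REF.sub(lambda m: "&#%d;" % int(m.group(1), 16), markup)


print(hex_refs_to_decimal("<html>&#x01ce;&#462;</html>"))
```

Decimal references and everything else in the markup pass through untouched.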
EDIT: The unescape() function mentioned by @dF, which uses the htmlentitydefs standard module and unichr(), might be more appropriate in this case.
This function should help you get it right and convert entities back to Unicode characters.

    import re, htmlentitydefs

    def unescape(text):
        """Removes HTML or XML character references
        and entities from a text string.
        @param text The HTML (or XML) source text.
        @return The plain text, as a Unicode string, if necessary.
        from Fredrik Lundh
        2008-01-03: input only unicode characters string.
        http://effbot.org/zone/re-sub.htm#unescape-html
        """
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    print "Value Error"
                    pass
            else:
                # named entity
                # reescape the reserved characters.
                try:
                    if text[1:-1] == "amp":
                        text = "&amp;"
                    elif text[1:-1] == "gt":
                        text = "&gt;"
                    elif text[1:-1] == "lt":
                        text = "&lt;"
                    else:
                        print text[1:-1]
                        text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    print "keyerror"
                    pass
            return text # leave as is
        return re.sub("&#?\w+;", fixup, text)

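On Python 3.4+, the same "decode everything except the reserved characters" behaviour could probably be built on top of `html.unescape`. This is only a sketch under that assumption. The name `unescape_keep_reserved` and the `RESERVED` set are invented for this example:

```python
import html
import re

# named entities that should stay escaped so the result is still safe markup
RESERVED = {"amp", "lt", "gt"}
_ENTITY = re.compile(r"&(#?\w+);")


def unescape_keep_reserved(text):
    """Decode entities, but leave &amp; &lt; &gt; as they are."""
    def fixup(m):
        if m.group(1) in RESERVED:
            return m.group(0)  # keep the reserved entity escaped
        return html.unescape(m.group(0))

    return _ENTITY.sub(fixup, text)


print(ascii(unescape_keep_reserved("&lt;b&gt; &copy; &amp; &#x01ce;")))
```

One caveat: numeric references to the reserved characters, such as `&#60;`, are still decoded by this sketch, so it would need an extra check if that matters for your input.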