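As for the "vice versa" (which I needed myself; it led me to this question, which didn't help, and then to another site that had the answer):
    u'some string'.encode('ascii', 'xmlcharrefreplace')
will return a plain string with any non-ASCII characters turned into XML (HTML) entities.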
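A minimal Python 3 sketch of the same call, for readers not on Python 2. The sample string and variable names are illustrative assumptions. In Python 3 `encode` returns `bytes`, so a `.decode('ascii')` is added here to get a `str` back.

```python
# Python 3 sketch: non-ASCII characters become numeric character references.
text = "Curé, 5 €"  # hypothetical sample string

encoded = text.encode("ascii", "xmlcharrefreplace")  # bytes in Python 3
as_str = encoded.decode("ascii")                      # back to str

print(as_str)  # Cur&#233;, 5 &#8364;
```

Note that this only touches non-ASCII characters; ASCII characters such as `&`, `<` and `>` pass through unchanged, so markup-significant characters still need a separate escaping step.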
You need to have BeautifulSoup (the code below uses the BeautifulSoup 3 API on Python 2).

    from BeautifulSoup import BeautifulStoneSoup
    import cgi

    def HTMLEntitiesToUnicode(text):
        """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
        text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
        return text

    def unicodeToHTMLEntities(text):
        """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
        text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
        return text

    text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

    uni = HTMLEntitiesToUnicode(text)
    htmlent = unicodeToHTMLEntities(uni)

    print uni
    print htmlent
    # &, ®, <, >, ¢, £, ¥, €, §, ©
    # &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
Update for Python 2.7 and BeautifulSoup4

Unescape - Unicode HTML to unicode with HTMLParser (Python 2.7 standard lib):
    >>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
    >>> from HTMLParser import HTMLParser
    >>> htmlparser = HTMLParser()
    >>> unescaped = htmlparser.unescape(escaped)
    >>> unescaped
    u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
    >>> print unescaped
    Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Unescape - Unicode HTML to unicode with bs4 (BeautifulSoup4):
    >>> html = '''Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'''
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(html)
    >>> soup.text
    u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
    >>> print soup.text
    Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Escape - Unicode to unicode HTML with bs4 (BeautifulSoup4):
    >>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
    >>> from bs4.dammit import EntitySubstitution
    >>> escaper = EntitySubstitution()
    >>> escaped = escaper.substitute_html(unescaped)
    >>> escaped
    u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
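As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings, but note that quote encoding is off by default in that function (quote=False), so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, though, the function won't escape single quotes ("'"). (Because of these issues, the function has been deprecated since version 3.2.)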
It's been suggested to use html.escape(s) instead of cgi.escape(s) (new in version 3.2). Also, html.unescape(s) was introduced in version 3.4.
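To make the quoting difference concrete, here is a small sketch using `html.escape` on Python 3. The sample string and variable names are assumptions chosen for illustration. Unlike `cgi.escape`, `html.escape` defaults to `quote=True` and escapes both double and single quotes.

```python
import html

sample = """<a href="x">it's</a>"""  # hypothetical input

with_quotes = html.escape(sample)                  # quote=True is the default
without_quotes = html.escape(sample, quote=False)  # only &, < and > are escaped

print(with_quotes)     # &lt;a href=&quot;x&quot;&gt;it&#x27;s&lt;/a&gt;
print(without_quotes)  # &lt;a href="x"&gt;it's&lt;/a&gt;
```

Keeping the default is the safer choice when the result ends up inside an HTML attribute value.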
So in Python 3.4 you can:
Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
And html.unescape(text) to convert HTML entities back to their plain-text representation.
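The two calls above can be sketched as a round trip. The helper names `to_entities` and `from_entities` and the sample text are hypothetical, introduced only for this example; it assumes Python 3.4 or later.

```python
import html

def to_entities(text):
    """Escape &, <, >, quotes, then turn non-ASCII characters into numeric references."""
    escaped = html.escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

def from_entities(text):
    """Convert named and numeric character references back to plain text."""
    return html.unescape(text)

original = "Curé & «Grâce» < 5 €"  # hypothetical sample text
encoded = to_entities(original)

print(encoded)                 # Cur&#233; &amp; &#171;Gr&#226;ce&#187; &lt; 5 &#8364;
print(from_entities(encoded))  # Curé & «Grâce» < 5 €
```

Escaping with `html.escape` first matters: `xmlcharrefreplace` alone leaves `&`, `<` and `>` untouched.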