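I'm using BeautifulSoup to scrape a website. The page renders fine in my browser:
Oxfam International’s report entitled “Offside! http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271
In particular, the single and double quotes look fine. They look like HTML symbols rather than ASCII, though strangely, when I view the source in FF3 they appear to be normal ASCII.
Unfortunately, when I scrape the page I get something like this: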
    u'Oxfam International\xe2€™s report entitled \xe2€œOffside!
Oops, I mean this:
    u'Oxfam International\xe2€™s report entitled \xe2€œOffside!
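The page's meta data indicates 'iso-8859-1' encoding. I've tried different encodings, played with third-party unicode->ascii and html->ascii functions, and looked at the MS/iso-8859-1 discrepancy, but the fact of the matter is that ™ has nothing to do with a single quote, and I can't seem to turn the unicode + HTML symbol combination into the right ASCII or HTML symbol. With my limited knowledge, that is why I'm asking for help.
I would be perfectly happy with an ASCII double quote (").
My concern with the following is that other unusual symbols may be decoded incorrectly as well:
    \xe2€™
Below is some Python to reproduce what I'm seeing, followed by the things I've tried.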
    import twill
    from twill import get_browser
    from twill.commands import go
    from BeautifulSoup import BeautifulSoup as BSoup

    url = 'http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271'
    twill.commands.go(url)
    soup = BSoup(twill.commands.get_browser().get_html())
    ps = soup.body("p")
    p = ps[52]

    >>> p
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 22: ordinal not in range(128)

    >>> p.string
    u'Oxfam International\xe2€™s report entitled \xe2€œOffside! \r\n'
http://groups.google.com/group/comp.lang.python/browse_frm/thread/9b7bb3f621b4b8e4/3b00a890cf3a5e46?q=htmlentitydefs&rnum=3&hl=en#3b00a890cf3a5e46
http://www.fourmilab.ch/webtools/demoroniser/
http://www.crummy.com/software/BeautifulSoup/documentation.html
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
    >>> AsciiDammit.asciiDammit(p.decode())
    u'Oxfam International\xe2€™s report entitled \xe2€œOffside!

    >>> handle_html_entities(p.decode())
    u'Oxfam International\xe2\u20ac\u2122s report entitled \xe2\u20ac\u0153Offside!

    >>> unicodedata.normalize('NFKC', p.decode()).encode('ascii','ignore')
    'Oxfam International€™s report entitled €œOffside!

    >>> htmlStripEscapes(p.string)
    u'Oxfam International\xe2TMs report entitled \xe2Offside!
EDIT:
I've tried using a different BS parser:
    import html5lib
    bsoup_parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("beautifulsoup"))
    soup = bsoup_parser.parse(twill.commands.get_browser().get_html())
    ps = soup.body("p")
    ps[55].decode()
which gives me this:
u'Oxfam International\xe2\u20ac\u2122s report entitled \xe2\u20ac\u0153Offside!
The best-case decoding seems to give me the same result:
    >>> unicodedata.normalize('NFKC', p.decode()).encode('ascii','ignore')
    'Oxfam InternationalTMs report entitled Offside!
EDIT 2:
I'm running Mac OS X 4 with FF 3.0.7 and Firebug.
Python 2.5 (wow, I can't believe I didn't state this from the beginning).
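That's one seriously messed-up page, encoding-wise :-)
There's nothing really wrong with your approach at all. I would probably tend to do the conversion before passing the HTML to BeautifulSoup, just because I'm persnickety: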
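    import urllib
    from BeautifulSoup import BeautifulSoup

    html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
    # Decode the raw bytes using the encoding the meta tag claims
    h = html.decode('iso-8859-1')
    soup = BeautifulSoup(h)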
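In this case, though, the page's meta tag is lying about the encoding. The page is actually in utf-8. Firefox's Page Info shows the real encoding, and you can see this charset in the response headers returned by the server: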
    curl -i http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271
    HTTP/1.1 200 OK
    Connection: close
    Date: Tue, 10 Mar 2009 13:14:29 GMT
    Server: Microsoft-IIS/6.0
    X-Powered-By: ASP.NET
    Set-Cookie: COMPANYID=271;path=/
    Content-Language: en-US
    Content-Type: text/html; charset=UTF-8
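The charset in that Content-Type header can also be read programmatically instead of by eye. What follows is only a rough sketch, written for Python 3 rather than the Python 2.5 used in the question. The helper names `charset_from_content_type` and `decode_body` are made up for illustration, and the sample bytes stand in for the real response body.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset named in a Content-Type header value, lowercased.

    Falls back to `default` (an assumption of this sketch) when the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset from the HTTP header."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Header value copied from the curl output; the body is a stand-in sample.
header = "text/html; charset=UTF-8"
raw = "Oxfam International\u2019s report entitled \u201cOffside!".encode("utf-8")

text = decode_body(raw, header)
print(charset_from_content_type(header))
print(text)
```

The point of the sketch is that the header, not the meta tag, drives the decode, so a page that mislabels itself in its markup is still read correctly.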
If you do your decode using 'utf-8', it will work for you (or, at least, it did for me):
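    import urllib
    from BeautifulSoup import BeautifulSoup

    html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
    # Decode with the charset the server actually sends, not the meta tag's
    h = html.decode('utf-8')
    soup = BeautifulSoup(h)
    ps = soup.body("p")
    p = ps[52]
    print p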
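For what it's worth, the `\xe2\u20ac\u2122` sequence in the question is consistent with UTF-8 bytes having been decoded as Windows-1252: a right single quote is the three bytes E2 80 99 in UTF-8, and those bytes read as â, € and ™ in Windows-1252. Here is a small sketch of that mechanism, in Python 3 rather than the question's Python 2.5. It also shows one way to undo the damage after the fact and to fold the curly quotes down to ASCII, which is what the question asks for. The `ASCII_QUOTES` table is an assumption of this sketch and only covers the four common curly quotes.

```python
# The text as the page author intended it, with curly quotes.
original = "Oxfam International\u2019s report entitled \u201cOffside!"
raw = original.encode("utf-8")  # what the server actually sends

# Decoding UTF-8 bytes as Windows-1252 reproduces the garbled output.
garbled = raw.decode("cp1252")
print(ascii(garbled))

# Reversing the wrong decode recovers the intended text. This round trip
# raises an error if the text contains bytes that cp1252 leaves undefined
# (0x9D, the last byte of a closing double quote, is one of them).
repaired = garbled.encode("cp1252").decode("utf-8")

# Hypothetical mapping of curly quotes to their plain ASCII equivalents.
ASCII_QUOTES = {
    0x2018: "'",  # left single quote
    0x2019: "'",  # right single quote
    0x201C: '"',  # left double quote
    0x201D: '"',  # right double quote
}
ascii_text = repaired.translate(ASCII_QUOTES)
print(ascii_text)
```

Decoding with the right charset in the first place is still the cleaner fix. The round trip is only a fallback for text that has already been decoded incorrectly.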