My code:
import re
import requests
from lxml import etree

url = 'http://weixin.sogou.com/gzhjs?openid=oIWsFt__d2wSBKMfQtkFfeVq_u8I&ext=2JjmXOu9jMsFW8Sh4E_XmC0DOkcPpGX18Zm8qPG7F0L5ffrupfFtkDqSOm47Bv9U'
r = requests.get(url)
items = r.json()['items']
Without encode('utf-8'):
etree.fromstring(items[0])
Output:
ValueError                                Traceback (most recent call last)
in ()
----> 1 etree.fromstring(items[0])

lxml.etree.pyx in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)()

parser.pxi in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102435)()

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
With encode('utf-8'):
etree.fromstring(items[0].encode('utf-8'))
Output:
File "", line unknown
XMLSyntaxError: CData section not finished
?????????:???I??, line 1, column 281
I don't know how to parse this XML.
As a workaround, you can remove the encoding attribute from the XML declaration before passing the string to etree.fromstring:
xml = re.sub(r'\bencoding="[-\w]+"', '', items[0], count=1)
root = etree.fromstring(xml)
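To try the workaround without hitting the network, here is a minimal sketch. The sample string is a made-up stand-in for items[0] (the real payload is not shown in the question), and the gbk declaration and the helper name strip_encoding are assumptions for illustration only.

```python
import re
from lxml import etree

# Hypothetical stand-in for items[0]: a text string (not bytes) that
# still carries an encoding declaration in its XML prolog.
sample = ('<?xml version="1.0" encoding="gbk"?>'
          '<DOCUMENT><title><![CDATA[标题]]></title></DOCUMENT>')

def strip_encoding(xml_text):
    """Remove the first encoding="..." attribute from the text."""
    return re.sub(r'\bencoding="[-\w]+"', '', xml_text, count=1)

# Parsing the text as-is is what triggers the ValueError in the question.
try:
    etree.fromstring(sample)
except ValueError as exc:
    print('as-is:', exc)

# With the attribute removed, lxml accepts the text string.
root = etree.fromstring(strip_encoding(sample))
title = root.findtext('title')
print('title:', title)
```

Note that the pattern only matches a double-quoted value; a declaration written as encoding='gbk' would pass through unchanged.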
UPDATE after seeing @Lea's comment on the question:
Specify a parser with an explicit encoding:
xml = r.json()['items'][0].encode('utf-8')  # 'items' is a list, so encode one element
root = etree.fromstring(xml, parser=etree.XMLParser(encoding='utf-8'))
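The parser-based approach can be sketched the same way. lxml documents the encoding option of XMLParser as overriding the document encoding, so the bytes should be decoded as UTF-8 even though the declaration names something else. The sample string and its gbk declaration are again assumptions, not the real response, and the behavior is worth verifying against the lxml version in use.

```python
from lxml import etree

# Hypothetical stand-in for items[0]; the declared encoding (gbk) does
# not match the bytes we are about to produce (UTF-8).
sample = ('<?xml version="1.0" encoding="gbk"?>'
          '<DOCUMENT><title><![CDATA[标题]]></title></DOCUMENT>')

# Encode to bytes, then tell the parser explicitly how to decode them.
data = sample.encode('utf-8')
parser = etree.XMLParser(encoding='utf-8')
root = etree.fromstring(data, parser=parser)

title = root.findtext('title')
print('title:', title)
```

Compared with the regex workaround, this leaves the document text untouched and keeps the decoding decision in one place, the parser.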