lxml cannot parse XML, with or without encoding to utf-8 [python]

My code:

import re
import requests
from lxml import etree

url = 'http://weixin.sogou.com/gzhjs?openid=oIWsFt__d2wSBKMfQtkFfeVq_u8I&ext=2JjmXOu9jMsFW8Sh4E_XmC0DOkcPpGX18Zm8qPG7F0L5ffrupfFtkDqSOm47Bv9U'

r = requests.get(url)

items = r.json()['items']

    without encode('utf-8'):

etree.fromstring(items[0]) output:

ValueError                                
Traceback (most recent call last)
 in ()
----> 1 etree.fromstring(items[0])

lxml.etree.pyx in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)()

parser.pxi in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102435)()

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

    with encode('utf-8'):

etree.fromstring(items[0].encode('utf-8')) output:

  File "<string>", line unknown
XMLSyntaxError: CData section not finished
?????????:???I??, line 1, column 281

I have no idea how to parse this XML.



Answer 1, by falsetru:

As a workaround, you can remove the encoding attribute from the XML declaration before passing the string to etree.fromstring:

xml = re.sub(r'\bencoding="[-\w]+"', '', items[0], count=1)
root = etree.fromstring(xml)
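
To make the workaround easier to try without the network call, here is a minimal sketch. The `sample` string is a hypothetical stand-in for `items[0]` (the real payload is not shown in the question), and `strip_encoding_declaration` is a helper name introduced only for illustration. It assumes the declaration uses double quotes, as the regular expression above does.

```python
import re
from lxml import etree

# Hypothetical stand-in for items[0]: a str whose XML declaration names an encoding.
sample = ('<?xml version="1.0" encoding="gbk"?>'
          '<DOCUMENT><title><![CDATA[你好]]></title></DOCUMENT>')


def strip_encoding_declaration(text):
    """Remove the first encoding="..." pseudo-attribute from the text."""
    return re.sub(r'\bencoding="[-\w]+"', '', text, count=1)


# Passing the str as-is is expected to raise the ValueError quoted in the question.
try:
    etree.fromstring(sample)
    raised = False
except ValueError:
    raised = True

# With the encoding attribute removed, lxml accepts the str directly.
cleaned = strip_encoding_declaration(sample)
root = etree.fromstring(cleaned)
title = root.findtext('title')
```

Because the input is already a decoded str, the declared encoding carries no information at this point, which is why dropping it is safe here.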

UPDATE after seeing the comment by @Lea on the question:

Specify a parser with an explicit encoding:

xml = r.json()['items'][0].encode('utf-8')  # 'items' is a list; encode its first element
root = etree.fromstring(xml, parser=etree.XMLParser(encoding='utf-8'))
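
The same idea as a self-contained sketch: the str is encoded to UTF-8 bytes, and the parser is told to read those bytes as UTF-8 regardless of what the declaration says. The `sample` payload (declaring gbk) and the helper name `parse_as_utf8` are assumptions made for illustration, not part of the original question.

```python
from lxml import etree

# Hypothetical stand-in for items[0]: the declaration says gbk,
# but the text arrives as a Python str (it came out of JSON).
sample = ('<?xml version="1.0" encoding="gbk"?>'
          '<DOCUMENT><title><![CDATA[微信]]></title></DOCUMENT>')


def parse_as_utf8(text):
    """Encode the str as UTF-8 and parse it with a parser forced to UTF-8."""
    data = text.encode('utf-8')
    # The encoding argument overrides the encoding named in the declaration.
    parser = etree.XMLParser(encoding='utf-8')
    return etree.fromstring(data, parser=parser)


root = parse_as_utf8(sample)
title = root.findtext('title')
```

This likely explains the "CData section not finished" error in the question: without the override, bytes encoded as UTF-8 would be decoded using the encoding named in the declaration, which can corrupt multi-byte characters.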
