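I'm trying to parse some HTML with XPath. Following the simplified XML example below, I want to match the string 'Text 1' and then grab the contents of the corresponding content node.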
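<doc>
    <block>
        <title>Text 1</title>
        <content>Stuff I want</content>
    </block>
    <block>
        <title>Text 2</title>
        <content>Stuff I don't want</content>
    </block>
</doc>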
My Python code throws an error:
>>> from lxml import etree
>>>
>>> tree = etree.XML("<doc><block><title>Text 1</title><content>Stuff I want</content></block><block><title>Text 2</title><content>Stuff I don't want</content></block></doc>")
>>>
>>> # get all titles
... tree.xpath('//title/text()')
['Text 1', 'Text 2']
>>>
>>> # match 'Text 1'
... tree.xpath('//title/text()="Text 1"')
True
>>>
>>> # Follow parent from selected nodes
... tree.xpath('//title/text()/../..//text()')
['Text 1', 'Stuff I want', 'Text 2', "Stuff I don't want"]
>>>
>>> # Follow parent from selected node
... tree.xpath('//title/text()="Text 1"/../..//text()')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 1330, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:14542)
  File "xpath.pxi", line 287, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:90093)
  File "xpath.pxi", line 209, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:89446)
  File "xpath.pxi", line 194, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:89281)
lxml.etree.XPathEvalError: Invalid type
Is this possible in XPath? Do I need to express what I want to do in a different way?
Is this what you want?
//title[text()='Text 1']/../content/text()
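As a rough illustration, the expression above can be tried against the sample document with lxml. This is only a sketch: the wrapper element names `doc` and `block` are assumptions made for the example, and any document with title and content as siblings should behave the same way. The key difference from the failing attempt in the question is that the test sits inside a predicate, so the result is still a node-set and the path can continue with `/..`. In the question's expression, the trailing steps end up applied to a value that is not a node-set, which lxml reports as "Invalid type".

```python
from lxml import etree

# Sample document; the element names "doc" and "block" are assumed for illustration.
XML = (
    "<doc>"
    "<block><title>Text 1</title><content>Stuff I want</content></block>"
    "<block><title>Text 2</title><content>Stuff I don't want</content></block>"
    "</doc>"
)
tree = etree.XML(XML)

# The predicate keeps only the title whose text is 'Text 1'; from there the
# path steps up to the parent and down to the sibling content element.
wanted = tree.xpath("//title[text()='Text 1']/../content/text()")
print(wanted)  # ['Stuff I want']
```

The result is a list because `text()` selects nodes; an empty list means no title matched.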
Use:
string(/*/*/title[. = 'Text 1']/following-sibling::content)
Compared with the currently accepted solution by Johannes Weiß, this offers at least two improvements:
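1. It avoids the very expensive "//" abbreviation, which usually causes the whole XML document to be scanned. "//" should be avoided whenever the structure of the XML document is known in advance.
2. It never steps back up to the parent, so the location step "/.." is not needed.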
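A minimal sketch of this variant in lxml, again assuming a sample document whose wrapper elements are named `doc` and `block` (those names are illustrative; the expression itself only relies on title sitting two levels below the root). Because the whole expression is wrapped in `string()`, lxml returns a single string rather than a list of nodes.

```python
from lxml import etree

# Sample document; the element names "doc" and "block" are assumed for illustration.
XML = (
    "<doc>"
    "<block><title>Text 1</title><content>Stuff I want</content></block>"
    "<block><title>Text 2</title><content>Stuff I don't want</content></block>"
    "</doc>"
)
tree = etree.XML(XML)

# Explicit child steps (/*/*/title) replace "//", and following-sibling::
# reaches content directly from title without stepping up to the parent.
result = tree.xpath("string(/*/*/title[. = 'Text 1']/following-sibling::content)")
print(result)  # Stuff I want

# string() of an empty node-set is the empty string, so a title with no
# match yields '' instead of raising an error.
missing = tree.xpath("string(/*/*/title[. = 'Text 3']/following-sibling::content)")
print(repr(missing))  # ''
```

One consequence of this design is that "no match" and "matched an empty content element" both come back as an empty string, so callers who need to tell the two apart may prefer the node-set form without `string()`.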