I want an XPath that gets all the text contained in a specific node and its child nodes.
In the example below, I am trying to get: "Neil Carmichael (Stroud) (Con):"
So far, my code only manages to get one part of it from the following markup:
<p>
<a class="anchor" name="qn_o0"> </a>
<a class="anchor" name="160210-0001.htm_wqn0"> </a>
<a class="anchor" name="160210109000034"> </a>
1. <a class="anchor" name="160210109000555"> </a>
<b><b>Neil Carmichael</b>
"(Stroud) (Con):"
</b>
"What assessment he has made of the value to the economy in Scotland of UK membership of the single market. [903484]"
</p>
Any help is welcome!
Answer 0 (score: 2)
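Stop your XPath at /b so that it returns the <b> element rather than the text nodes inside <b>. You can then call text_content() on each element to get the expected text output, for example: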
from lxml import html
raw = '''<p>
<a class="anchor" name="qn_o0"> </a>
<a class="anchor" name="160210-0001.htm_wqn0"> </a>
<a class="anchor" name="160210109000034"> </a>
1. <a class="anchor" name="160210109000555"> </a>
<b><b>Neil Carmichael</b>
"(Stroud) (Con):"
</b>
"What assessment he has made of the value to the economy in Scotland of UK membership of the single market. [903484]"
</p>'''
root = html.fromstring(raw)
result = root.xpath('//p/b')
# repr() makes the newlines inside the element visible
print(repr(result[0].text_content()))
# output :
# 'Neil Carmichael\n"(Stroud) (Con):"\n'
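To see why stopping at /b matters, here is a minimal sketch that contrasts the two queries. The shortened markup and the variable names are assumptions made for illustration; the question's actual XPath is not shown, so /text() here only stands in for a query that selects text nodes directly.

```python
from lxml import html

# Shortened, hypothetical version of the markup from the question
snippet = '<p>1. <b><b>Neil Carmichael</b> (Stroud) (Con):</b> What assessment he has made.</p>'
root = html.fromstring(snippet)

# text() only selects text nodes that are direct children of the outer <b>,
# so the name inside the nested <b> is missed
direct_text = root.xpath('//p/b/text()')

# Selecting the element and calling text_content() also collects descendant text
all_text = root.xpath('//p/b')[0].text_content()

print(direct_text)
print(all_text)
```

Here direct_text holds only the "(Stroud) (Con):" part, while all_text also includes the name from the nested element.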
As an alternative to text_content(), you can use the XPath string() function, optionally combined with normalize-space():
print(result[0].xpath('string(normalize-space())'))
# output :
# Neil Carmichael "(Stroud) (Con):"
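If the page contains several such paragraphs, the same approach can be applied to every match. The following is only a sketch: the helper name bold_text, the wrapping <div>, and the second speaker are made up for illustration and are not part of the original page.

```python
from lxml import html

def bold_text(markup):
    """Return the whitespace-normalized text of each <b> that is a direct child of a <p>."""
    root = html.fromstring(markup)
    # normalize-space() with no argument works on the string value of the
    # context node, which includes the text of all descendants
    return [b.xpath('normalize-space()') for b in root.xpath('//p/b')]

# Hypothetical sample; the second paragraph is invented for the example
sample = '''<div>
<p>1. <b><b>Neil Carmichael</b>
(Stroud) (Con):
</b> First question text.</p>
<p>2. <b><b>Jane Example</b>
(Sampletown) (Lab):
</b> Second question text.</p>
</div>'''

speakers = bold_text(sample)
print(speakers)
```

Using normalize-space() rather than text_content() is a design choice here: it collapses the line breaks left over from the markup, so each entry comes back as a single clean line.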