Question

I want an XPath that gets all the text contained in a particular node and its child nodes.

In the example below, I am trying to get: "Neil Carmichael (Stroud) (Con):"


So far I have managed to get only one part of it, using the following code:

<p>
<a class="anchor" name="qn_o0"> </a>
<a class="anchor" name="160210-0001.htm_wqn0"> </a>
<a class="anchor" name="160210109000034"> </a>
1. <a class="anchor" name="160210109000555"> </a>
    <b><b>Neil Carmichael</b>
     "(Stroud) (Con):"
    </b>
    "What assessment he has made of the value to the economy in Scotland of UK membership of the single market. [903484]"
</p>

Any help is welcome!

Answer 1

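Stop your XPath at /b so that it returns the <b> element rather than the text nodes inside <b>. You can then call text_content() on each element to get the expected text output, for example: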

from lxml import html

raw = '''<p>
<a class="anchor" name="qn_o0"> </a>
<a class="anchor" name="160210-0001.htm_wqn0"> </a>
<a class="anchor" name="160210109000034"> </a>
1. <a class="anchor" name="160210109000555"> </a>
    <b><b>Neil Carmichael</b>
     "(Stroud) (Con):"
    </b>
    "What assessment he has made of the value to the economy in Scotland of UK membership of the single market. [903484]"
</p>'''

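root = html.fromstring(raw)
# Select the outer <b> element itself, not the text nodes inside it
result = root.xpath('//p/b')
# repr() makes the embedded newlines and spaces visible
print(repr(result[0].text_content()))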

# output :
# 'Neil Carmichael\n     "(Stroud) (Con):"\n    '
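
To see why ending the XPath in text() returns only one part, the following minimal sketch compares the two approaches. The `snippet` string is a shortened, hypothetical version of the question's markup, and the variable names are illustrative only.

```python
from lxml import html

# Shortened version of the question's markup (an assumption, for brevity).
snippet = (
    '<p>1. <b><b>Neil Carmichael</b> (Stroud) (Con):</b> '
    'What assessment he has made ... [903484]</p>'
)
root = html.fromstring(snippet)

# text() matches only the text nodes that are direct children of the
# outer <b>, so the name inside the nested <b> is left out.
direct_text = root.xpath('//p/b/text()')

# Stopping at the element and calling text_content() also collects
# the text of every descendant element.
full_text = root.xpath('//p/b')[0].text_content()

print(direct_text)  # [' (Stroud) (Con):']
print(full_text)    # Neil Carmichael (Stroud) (Con):
```

If the XPath matches several <b> elements, the same text_content() call can be applied to each item of the result list.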

As an alternative to text_content(), you can use the XPath string() function, optionally combined with normalize-space():

print(result[0].xpath('string(normalize-space())'))
# output :
# Neil Carmichael "(Stroud) (Con):"
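
For readers unsure what the two XPath functions do, here is a rough pure-Python approximation using only the standard library. It is a sketch of the semantics, not how lxml implements them: `string_value` and `normalize_space` are hypothetical helper names, and the snippet is a shortened version of the question's markup that happens to be well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# Shortened markup from the question, keeping the line breaks inside <b>.
snippet = (
    '<p>1. <b><b>Neil Carmichael</b>\n     (Stroud) (Con):\n    </b> '
    'What assessment he has made ... [903484]</p>'
)
outer_b = ET.fromstring(snippet).find('b')


def string_value(element):
    """Approximate XPath string(): join all descendant text in document order."""
    return ''.join(element.itertext())


def normalize_space(text):
    """Approximate XPath normalize-space(): collapse whitespace runs, then trim."""
    # XPath treats only space, tab, CR and LF as whitespace here.
    return re.sub(r'[ \t\r\n]+', ' ', text).strip(' ')


print(repr(string_value(outer_b)))
print(normalize_space(string_value(outer_b)))
```

The first call keeps the newlines and indentation from the markup, while the second reduces them to single spaces, which is the difference between the two outputs shown in the answer.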

XPath: how to get the text of child nodes and self
