Question

I'm using XPath to scrape a web page, but I'm having trouble with one part of the code:

<div class="description">
   here's the page description
   <span> some other text</span>
   <span> another tag </span>
</div>

I use this code to get the value from the element:

description = tree.xpath('//div[@class="description"]/text()')

I can find the div I'm looking for, but I only want the text "here's the page description", not the content of the inner span tags.

Does anyone know how to get only the text in the root node, and not the content of its child nodes?

Answer 1

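The expression you are currently using already matches only the top-level text child nodes. You can wrap it in normalize-space() to strip the extra newlines and spaces from the text. Note that normalize-space() converts the node set to a string using only its first text node: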

>>> from lxml.html import fromstring
>>> data = """
... <div class="description">
...    here's the page description
...    <span> some other text</span>
...    <span> another tag </span>
... </div>
... """
>>> root = fromstring(data)
>>> root.xpath('normalize-space(//div[@class="description"]/text())')
"here's the page description"
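
If you would rather keep the original expression, which returns a list, the whitespace-only entries can be filtered out in Python. This is a minimal sketch assuming the same markup as in the question; the names `pieces` and `own_text` are illustrative and not part of lxml:

```python
from lxml.html import fromstring

data = """
<div class="description">
   here's the page description
   <span> some other text</span>
   <span> another tag </span>
</div>
"""
root = fromstring(data)

# text() yields every direct text child of the div, including the
# whitespace-only nodes that sit between and after the <span> elements.
pieces = root.xpath('//div[@class="description"]/text()')

# Strip each piece and drop the ones that are empty after stripping.
own_text = [p.strip() for p in pieces if p.strip()]
print(own_text)  # ["here's the page description"]
```

This form may be preferable when the div also has meaningful text after one of its child elements, since every direct text node is kept rather than only the first.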

To get the full text of a node, including the text of its child nodes, use the .text_content() method:

node = tree.xpath('//div[@class="description"]')[0]
print(node.text_content())
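
To make the difference concrete, the sketch below contrasts the element's `.text` attribute (only the text before the first child element) with `.text_content()`, again assuming the markup from the question. The names `leading` and `collapsed` are illustrative, and collapsing whitespace with `split()`/`join()` is just one possible cleanup:

```python
from lxml.html import fromstring

data = """
<div class="description">
   here's the page description
   <span> some other text</span>
   <span> another tag </span>
</div>
"""
node = fromstring(data).xpath('//div[@class="description"]')[0]

# .text holds only the text that precedes the first child element.
leading = node.text.strip()
print(leading)    # here's the page description

# text_content() concatenates the text of the node and all of its
# descendants, keeping the original newlines and indentation.
full_text = node.text_content()

# split() with no arguments splits on any run of whitespace, so joining
# with single spaces collapses the layout whitespace.
collapsed = " ".join(full_text.split())
print(collapsed)  # here's the page description some other text another tag
```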
