Question

I want to extract some HTML elements using Python 3 and the HTML parser provided by lxml.

Consider this HTML:

<!DOCTYPE html>
<html>
  <body>
    <span class="foo">
      <span class="bar">bar</span>
      foo
    </span>
  </body>
</html>

Consider this program:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import html
tree = html.fromstring('html from above')
bars = tree.xpath("//span[@class='bar']")
print(bars)
print(html.tostring(bars[0], encoding="unicode"))

In a browser, the query selector "span.bar" selects only the span element, which is what I want. However, the program above produces:

[<Element span at 0x7f5dd89a4048>]
<span class="bar">bar</span>foo

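It looks like my XPath does not actually behave like the query selector, and picks up the sibling text node next to the span element. How can I adjust the XPath to select only the bar element, and not the text "foo"?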

Answer 1

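Note that the XML tree model in lxml (and in the standard module xml.etree) has the concept of a tail. A text node that comes directly after an element (a.k.a. its following-sibling text node) is stored as that element's tail. So your XPath correctly returns only the span element, but in the tree model that element has a tail, which holds the text 'foo', and tostring() serializes the tail along with the element.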
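To make the tail concept concrete, here is a minimal sketch that parses the HTML from the question and inspects the selected element. The variable names `doc` and `bar` are only illustrative and do not appear in the original program.

```python
from lxml import html

# The HTML from the question, inlined so the example is self-contained.
doc = """<!DOCTYPE html>
<html>
  <body>
    <span class="foo">
      <span class="bar">bar</span>
      foo
    </span>
  </body>
</html>"""

tree = html.fromstring(doc)
bar = tree.xpath("//span[@class='bar']")[0]

# .text is the text inside the element itself.
print(repr(bar.text))

# .tail is the text that follows the element's closing tag,
# including the surrounding whitespace from the source.
print(repr(bar.tail))
print(repr(bar.tail.strip()))
```

The XPath result is still a single element; the trailing text only shows up because it is attached to that element as `tail`.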

As a workaround, assuming you do not need to use the tree model any further, simply clear the tail before printing:

>>> bars[0].tail = ''
>>> print(html.tostring(bars[0], encoding="unicode"))
<span class="bar">bar</span>
