Question

I have this HTML code:

<a name="apple"></a>
<h3> header1 </h3>
<p> some text </p>
<p> some text1 </p>
<a name="orange"></a>
<h3> header2 </h3>
<p> some text 2 </p>

I want to retrieve the text that follows each header tag, using the following code:

for header in tree.iter('h3'):
    paragraph = header.xpath('(.//following::p)[1]')
    if (header.text=="apple"):
        print "%s: %s" % (header.text, paragraph[0].text)

This does not work when there are multiple <p> tags. How can I find out how many <p> tags follow a header and retrieve all of them?

I am using Python 2.7 and XPath.

Answer 1

It may be easier to use lxml's itersiblings(): work on siblings rather than descendants, and then handle the descendants of those siblings if necessary.

You could try something like this:

>>> for heading in root.iter("h3"):
...     print "----", heading
...     for sibling in heading.itersiblings():
...         if sibling.tag == 'h3':
...             break
...         print sibling
... 
---- <Element h3 at 0x1880470>
<Element p at 0x18800b0>
<Element p at 0x1880110>
<Element a at 0x1880170>
---- <Element h3 at 0x1880050>
<Element p at 0x18801d0>
>>>
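
The loop above prints every sibling, including the <a> anchor. As a minimal sketch of how the same idea could answer the original question (how many <p> tags follow each header, and what they contain), the sibling walk can be restricted to <p> elements. The helper names paragraphs_after and text_of are hypothetical, and the sample HTML is the question's snippet wrapped in <html><body> so that it parses as one document:

```python
from lxml import etree

# The question's HTML, wrapped in <html><body> (an assumption made here
# so that the snippet parses as a single document).
HTML = """
<html><body>
<a name="apple"></a>
<h3> header1 </h3>
<p> some text </p>
<p> some text1 </p>
<a name="orange"></a>
<h3> header2 </h3>
<p> some text 2 </p>
</body></html>
"""


def paragraphs_after(heading):
    """Return the <p> siblings between `heading` and the next <h3>."""
    paragraphs = []
    for sibling in heading.itersiblings():
        if sibling.tag == "h3":
            break  # the next section starts here
        if sibling.tag == "p":
            paragraphs.append(sibling)
    return paragraphs


def text_of(element):
    """Return the element's full text content with outer whitespace removed."""
    return "".join(element.itertext()).strip()


root = etree.HTML(HTML)
sections = {}
for heading in root.iter("h3"):
    sections[text_of(heading)] = [text_of(p) for p in paragraphs_after(heading)]

for title, texts in sorted(sections.items()):
    print("%s: %d paragraph(s): %s" % (title, len(texts), texts))
```

The number of paragraphs under a header is then simply the length of the list returned by paragraphs_after.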

If you want to use XPath, you can use the EXSLT set extension available in lxml (through the "http://exslt.org/sets" namespace). The idea is roughly the same as above:

select all following siblings (following-sibling::*),
but exclude (set:difference()) the following <h3> siblings (following-sibling::h3) and (the | XPath operator) all siblings that come after them (following-sibling::h3/following-sibling::*).

It can be used like this:

>>> following_siblings_untilh3 = lxml.etree.XPath("""
...         set:difference(
...             following-sibling::*,
...             (following-sibling::h3|following-sibling::h3/following-sibling::*))""",
...         namespaces={"set": "http://exslt.org/sets"})
>>> 
>>> for heading in root.iter("h3"):
...     print "----", heading
...     for e in following_siblings_untilh3(heading): print e
... 
---- <Element h3 at 0x1880470>
<Element p at 0x18800b0>
<Element p at 0x1880110>
<Element a at 0x1880170>
---- <Element h3 at 0x1880050>
<Element p at 0x18801d0>
>>>
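
For reference, here is a self-contained variant of the same set:difference() idea, narrowed to <p> elements so that it counts the paragraphs under each header directly. It is only a sketch: the name paragraphs_until_next_h3 is hypothetical, the HTML is the question's snippet wrapped in <html><body>, and it assumes the EXSLT sets functions are available in the installed lxml build, as in the session above.

```python
from lxml import etree

# The question's HTML, wrapped in <html><body> (an assumption made here
# so that the snippet parses as a single document).
HTML = """
<html><body>
<a name="apple"></a>
<h3> header1 </h3>
<p> some text </p>
<p> some text1 </p>
<a name="orange"></a>
<h3> header2 </h3>
<p> some text 2 </p>
</body></html>
"""

# All following <p> siblings, minus the <p> siblings that come after
# any later <h3>. What remains belongs to the current header.
paragraphs_until_next_h3 = etree.XPath(
    """
    set:difference(
        following-sibling::p,
        following-sibling::h3/following-sibling::p)
    """,
    namespaces={"set": "http://exslt.org/sets"},
)

root = etree.HTML(HTML)
counts = {}
texts = {}
for heading in root.iter("h3"):
    paragraphs = paragraphs_until_next_h3(heading)
    title = heading.text.strip()
    counts[title] = len(paragraphs)
    texts[title] = [p.text.strip() for p in paragraphs]

for title in sorted(counts):
    print("%s: %d paragraph(s)" % (title, counts[title]))
```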

I'm sure this can be simplified. (I haven't found a following-sibling-or-self::h3 ...)
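
One possible simplification, offered as a sketch and not as the approach above, avoids EXSLT and uses plain XPath 1.0: a <p> belongs to a given header exactly when it has the same number of preceding <h3> siblings as that header, counting the header itself. The example below assumes all headers and paragraphs are siblings under one parent, as in the question's HTML, and passes the count through an XPath variable, here named position (a hypothetical name), which lxml accepts as a keyword argument to xpath().

```python
from lxml import etree

# The question's HTML, wrapped in <html><body> (an assumption made here
# so that the snippet parses as a single document).
HTML = """
<html><body>
<a name="apple"></a>
<h3> header1 </h3>
<p> some text </p>
<p> some text1 </p>
<a name="orange"></a>
<h3> header2 </h3>
<p> some text 2 </p>
</body></html>
"""

root = etree.HTML(HTML)
result = {}
for heading in root.iter("h3"):
    # Number of <h3> siblings up to and including this heading.
    position = heading.xpath("count(preceding-sibling::h3)") + 1
    # Following <p> siblings that see exactly that many <h3> before them,
    # i.e. the ones located before the next <h3>.
    paragraphs = heading.xpath(
        "following-sibling::p[count(preceding-sibling::h3) = $position]",
        position=position,
    )
    result[heading.text.strip()] = [p.text.strip() for p in paragraphs]

for title in sorted(result):
    print("%s: %s" % (title, result[title]))
```

This trades the set difference for a count() comparison on every candidate paragraph, which is easy to read but may be slower on very long sibling lists.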
