Question

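I've run into a problem with an XPath query. I have to parse a div that is divided into an unknown number of "sections". Each section is introduced by an h5 containing the section name. The list of possible section titles is known, and each title can appear only once. In addition, each section may contain some br tags. So, let's say I want to extract the text under "SecondHeader".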

HTML

<div class="some-class">
 <h5>FirstHeader</h5>
  text1
 <h5>SecondHeader</h5>
  text2a<br>
  text2b
 <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
 <h5>FourthHeader</h5>
  text4
</div>

Expected result (for the SecondHeader section)

['text2a', 'text2b']

Query #1

//text()[following-sibling::h5/text()='ThirdHeader']

Result #1

['text1', 'text2a', 'text2b']

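That is obviously a bit too much, so I decided to restrict the result to the content between the selected header and the header before it.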

Query #2

//text()[following-sibling::h5/text()='ThirdHeader' and preceding-sibling::h5/text()='SecondHeader']

Result #2

['text2a', 'text2b']

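This produces the expected result. However, I can't use it: I don't know whether both SecondHeader and ThirdHeader will exist in the parsed page. The query has to rely on a single section header only.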

Query #3

//text()[following-sibling::h5/text()='ThirdHeader' and not[preceding-sibling::h5/text()='ThirdHeader']]

Result #3

[]

Could you tell me what I'm doing wrong? I tested this in Google Chrome.

Answer 1

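If all the h5 elements and text nodes are siblings and you need to group them by section, one possible option is simply to select text nodes by the number of preceding h5 siblings.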

Example using lxml (in Python)

>>> import lxml.html
>>> s = '''
... <div class="some-class">
...  <h5>FirstHeader</h5>
...   text1
...  <h5>SecondHeader</h5>
...   text2a<br>
...   text2b
...  <h5>ThirdHeader</h5>
...   text3a<br>
...   text3b<br>
...   text3c<br>
...  <h5>FourthHeader</h5>
...   text4
... </div>'''
>>> doc = lxml.html.fromstring(s)
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=1)
['\n  text1\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=2)
['\n  text2a', '\n  text2b\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=3)
['\n  text3a', '\n  text3b', '\n  text3c', '\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=4)
['\n  text4\n']
>>>
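
The counting approach above needs the section's position, while the question only knows the header name. A minimal sketch of one way to bridge that gap, assuming the same markup as in the question: look the header up by name, count its preceding h5 siblings, then reuse the count-based query. The helper name `section_by_count` and the `HTML` constant are made up for this illustration, and whitespace-only text nodes are dropped in Python rather than in XPath.

```python
import lxml.html

HTML = """
<div class="some-class">
 <h5>FirstHeader</h5>
  text1
 <h5>SecondHeader</h5>
  text2a<br>
  text2b
 <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
 <h5>FourthHeader</h5>
  text4
</div>"""


def section_by_count(doc, header):
    """Return the stripped text nodes of the section named `header`.

    Hypothetical helper: returns [] when the header is not on the page.
    """
    headers = doc.xpath("//h5[normalize-space()=$name]", name=header)
    if not headers:
        return []
    # count() yields a float in lxml; +1 turns it into the 1-based position.
    position = int(headers[0].xpath("count(preceding-sibling::h5)")) + 1
    nodes = doc.xpath(
        "//text()[count(preceding-sibling::h5)=$count]", count=position
    )
    # Drop whitespace-only nodes such as the '\n ' after the last <br>.
    return [node.strip() for node in nodes if node.strip()]


doc = lxml.html.fromstring(HTML)
print(section_by_count(doc, "SecondHeader"))
print(section_by_count(doc, "MissingHeader"))
```

Checking for a missing header in Python avoids a subtle trap: if the lookup were folded into a single XPath expression, a missing header would give a count of 0 and the query would silently return the first section.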

Answer 2

You should be able to just test the first preceding sibling h5...

//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]
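
To try this query outside the browser, here is a small sketch using lxml in Python against the markup from the question. The helper name `texts_under` and the `$header` variable are assumptions made for the example; stripping and filtering whitespace-only nodes is done in Python and is not part of the XPath itself.

```python
import lxml.html

HTML = """
<div class="some-class">
 <h5>FirstHeader</h5>
  text1
 <h5>SecondHeader</h5>
  text2a<br>
  text2b
 <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
 <h5>FourthHeader</h5>
  text4
</div>"""

# preceding-sibling is a reverse axis, so h5[1] is the NEAREST preceding h5.
QUERY = "//text()[preceding-sibling::h5[1][normalize-space()=$header]]"


def texts_under(doc, header):
    """Return stripped, non-empty text nodes whose nearest preceding h5 is `header`."""
    nodes = doc.xpath(QUERY, header=header)
    return [node.strip() for node in nodes if node.strip()]


doc = lxml.html.fromstring(HTML)
print(texts_under(doc, "SecondHeader"))
print(texts_under(doc, "NoSuchHeader"))
```

Only one header name is needed, and a header that is absent from the page simply yields an empty list. As for Query #3 in the question: `not[...]` is parsed as a child element named `not` followed by a predicate, and text nodes have no children, so the predicate never holds; the XPath function is written `not(...)`.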

XPath - extracting text between two nodes
