Question

I'm trying to read a specific part of a web page via XPath. The page isn't very well-formed, but I can't change that...

<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>

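I want to extract the text of the individual items, i.e. the text between the header divs (e.g. "Here is the text of the first item."). So far I have been using this XPath expression: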

//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]

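However, I can't hard-code the name of the ending item, because in the pages I want to scrape the items appear in different orders (e.g. "First item" may be followed by "Third item").

Any help on how to adjust my XPath query would be greatly appreciated.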

Answer 1

//*[@class='header' and contains(text(),'First item')]/following::text()[1]

selects the first text node after <div class="header">First item</div>.

//*[@class='header' and contains(text(),'Second item')]/following::text()[1]

selects the first text node after <div class="header">Second item</div>, and so on.

EDIT: Sorry, this doesn't work for the <strong> case. I will update my answer.

EDIT2: Using @Michiel's part. It looks like OMG, but it works:

//div[@class='textfield'][1]//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[not(self::strong) and not(self::span)][1][contains(text(),'First item')]] or not(//*[preceding::*[@class='header' and contains(text(),'First item')]])]

It seems there should be a better solution to this problem :)
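
To illustrate why the first attempt stops short at the <strong> case, here is a minimal sketch using Python's standard xml.etree.ElementTree rather than an XPath engine. It assumes that an element's `tail` (the text directly after its closing tag) plays the role of `following::text()[1]` for this fragment; the names `FRAGMENT` and `first_following_text` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the first item from the question (assumption: same structure).
FRAGMENT = """<div class="textfield">
    <div class="header">First item</div>
    Here is the text of the <strong>first</strong> item.
    <div class="header">Second item</div>
</div>"""

container = ET.fromstring(FRAGMENT)

# First header div inside the container.
header = container.find("div[@class='header']")

# The text node right after the header ends where <strong> begins,
# so the rest of the sentence ("first item.") is not part of it.
first_following_text = header.tail

print(repr(first_following_text.strip()))
```

Only the part of the sentence before <strong> is returned, which is why the answer needed the second edit.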

Answer 2

Found it!

//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[@class='header'][1][contains(text(),'First item')]]]

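Indeed, Aleh, your solution doesn't work when there are tags inside the text.

Now the one remaining case is the last item, which has no element with class = header after it, so the query will include all text up to the end of the document. Ideas?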

Answer 3

For completeness, here is the final query, assembled from the various suggestions throughout this thread:

//*[
    @class='textfield' and position() = 1
]
//text() [
    preceding::*[
        @class='header' and contains(text(),'First item')
    ]
][
    following::*[
        preceding::*[
            @class='header'
        ][1][
            contains(text(),'First item')
        ]
    ]
]
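
As a cross-check on what the query is meant to return, the same grouping can be sketched outside XPath. This is not the approach from the thread: it is a plain tree walk with Python's standard xml.etree.ElementTree, and it assumes the headers are direct children of the first textfield div, as in the sample. `PAGE`, `texts_by_header` and `result` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Sample document from the question.
PAGE = """<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>"""


def texts_by_header(container):
    """Map each header's text to the text between it and the next header."""
    items = {}
    current = None  # header whose text is currently being collected
    for child in container:
        if child.get("class") == "header":
            current = "".join(child.itertext()).strip()
            # Text directly after the header's closing tag.
            items[current] = [child.tail or ""]
        elif current is not None:
            # Inline markup such as <strong> or <span>: keep its text and its tail.
            items[current].append("".join(child.itertext()))
            items[current].append(child.tail or "")
    # Join the pieces and collapse runs of whitespace.
    return {name: " ".join("".join(parts).split()) for name, parts in items.items()}


root = ET.fromstring(PAGE)
# Only the first textfield div is walked, so the footer is never reached.
first_textfield = root.find("div[@class='textfield']")
result = texts_by_header(first_textfield)

for name, text in result.items():
    print(name, "->", text)
```

Because the walk stays inside the first textfield div, the last item ends at that div's closing tag and does not pick up the footer, and no item names have to be hard-coded.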

Extracting text between nodes via XPath
