Question

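I am trying to parse (article) text using only XPath.

I want the text of all direct children and of all nested descendants, except for the following nodes/tags: <script>, <ul class="pager pagenav">, <style>.

Example HTML to match with XPath: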

<section class="entry-content">
    want this article text
    <script>dont want this</script>
    more text i want
    <p>want this text too</p>
    <any>also this</any>
    <style>dont want this either</style>
    <ul class="pager pagenav">nope, dont want this <a>Prev Next</a></ul>
</section>

Currently, I have something like this:

    result = tree.xpath('//section[@class="entry-content"]/*[not(descendant-or-self::script or self::ul[@class="pager pagenav"] or self::style)]/../descendant-or-self::text()')

...but it doesn't quite work.

Answer 1

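Use child::node() to match both regular element children and text children:

child::node() selects all children of the context node, whatever their node type.

self:: then filters out unwanted elements by name: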

//section[@class="entry-content"]/child::node()[not(self::script or self::ul or self::style)]/descendant-or-self::text()
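As a minimal sketch of how this expression might be applied, assuming lxml is the library behind the `tree.xpath` call in the question: the helper name `extract_text` and the whitespace stripping are illustrative additions, not part of the original answer.

```python
from lxml import html

# Sample markup copied from the question.
SAMPLE = """<section class="entry-content">
    want this article text
    <script>dont want this</script>
    more text i want
    <p>want this text too</p>
    <any>also this</any>
    <style>dont want this either</style>
    <ul class="pager pagenav">nope, dont want this <a>Prev Next</a></ul>
</section>"""

# The expression from the answer, unchanged.
XPATH = (
    '//section[@class="entry-content"]/child::node()'
    '[not(self::script or self::ul or self::style)]'
    '/descendant-or-self::text()'
)


def extract_text(markup):
    """Hypothetical helper: return the wanted text pieces, stripped of whitespace."""
    tree = html.fromstring(markup)
    pieces = tree.xpath(XPATH)
    # Drop whitespace-only text nodes that sit between elements.
    return [piece.strip() for piece in pieces if piece.strip()]


texts = extract_text(SAMPLE)
print(texts)
```

Note that self::ul excludes every ul child of the section. To skip only the pager, the predicate from the question, self::ul[@class="pager pagenav"], can be used in its place.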
