Question

I am building a web crawler in Python. The HTML I am parsing seems to have some strings sitting directly inside the parent tag. It looks like this:

<div class="chapter-content3">
<noscript>...stuff here filtered successfully</noscript>
<center>...stuff here filtered successfully</center>
<h4>..stuff here shows</h4>
<p>...stuff here shows</p>
<br>
"this stuff here doesnt show"
<br>
"this neither"
 <p>..stuff here shows</p>
 </div>

My XPath looks like this:

//div[@class="chapter-content3"]/*[not(self::noscript) and not(self::center) and not(@class="row")]

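This returns everything except the strings that sit directly inside the <div>.

How should I build the XPath so that it returns everything, including the strings directly inside the parent <div>?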

Answer 1

Almost correct. Here:

//div[@class="chapter-content3"]/*[
   not(self::noscript) and not(self::center) and not(@class="row")
]

* selects only actual elements. You want to select all nodes, which would be

//div[@class="chapter-content3"]//node()[
   not(self::noscript) and not(self::center) and not(@class="row")
]
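
To see the difference between * and node() in practice, here is a minimal sketch using lxml. The markup is a simplified, well-formed stand-in for the question's HTML (an assumption, as are the names BASE, PREDICATE and describe), and the same predicate is applied with three different node tests.

```python
from lxml import etree

# Simplified, well-formed stand-in for the question's markup (assumption).
doc = etree.fromstring(
    '<div class="chapter-content3">'
    '<noscript>ad</noscript>'
    '<center>banner</center>'
    '<h4>title</h4>'
    '<br/>'
    'loose text'
    '<br/>'
    '<p>last</p>'
    '</div>'
)

BASE = '//div[@class="chapter-content3"]'
PREDICATE = '[not(self::noscript) and not(self::center) and not(@class="row")]'


def describe(nodes):
    """Render elements as <tag> and text nodes as their plain string value."""
    return [str(n) if isinstance(n, str) else f"<{n.tag}>" for n in nodes]


# '*' matches element children only, so the loose text is skipped.
elements_only = describe(doc.xpath(BASE + '/*' + PREDICATE))
# 'node()' on the child axis also matches text nodes directly inside the div.
child_nodes = describe(doc.xpath(BASE + '/node()' + PREDICATE))
# '//node()' walks every descendant, not just the direct children.
all_descendants = describe(doc.xpath(BASE + '//node()' + PREDICATE))

print(elements_only)   # ['<h4>', '<br>', '<br>', '<p>']
print(child_nodes)     # ['<h4>', '<br>', 'loose text', '<br>', '<p>']
print(all_descendants)
```

One thing worth checking on real data: with the descendant form (//node()), the text nodes inside noscript and center are returned as well, because those text nodes are not themselves noscript or center elements. If only the direct children of the div are wanted, the child-axis form (/node()) is the narrower choice.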

Or, a little shorter:

//div[@class="chapter-content3"]//node()[
   not(self::noscript or self::center or @class="row")
]
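
The shorter predicate is the same condition rewritten with De Morgan's law: not(A) and not(B) and not(C) is equivalent to not(A or B or C). A quick sketch with lxml, on a made-up sample document (the markup and the helper name keys are assumptions), suggests both forms select the same nodes:

```python
from lxml import etree

# Hypothetical sample; the span with class "row" exercises the third condition.
doc = etree.fromstring(
    '<div class="chapter-content3">'
    '<noscript>ad</noscript>'
    '<center>banner</center>'
    '<span class="row">nav</span>'
    '<h4>title</h4>'
    'loose text'
    '</div>'
)

BASE = '//div[@class="chapter-content3"]//node()'
long_form = BASE + '[not(self::noscript) and not(self::center) and not(@class="row")]'
short_form = BASE + '[not(self::noscript or self::center or @class="row")]'


def keys(nodes):
    """Comparable view of a result list: tag names for elements, strings for text."""
    return [str(n) if isinstance(n, str) else n.tag for n in nodes]


long_result = keys(doc.xpath(long_form))
short_result = keys(doc.xpath(short_form))
print(long_result == short_result)  # True
```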

Or, a different way to think about it: all text nodes except those that have the wrong ancestors:

//div[@class="chapter-content3"]//text()[
   not(ancestor::noscript or ancestor::center or ancestor::*/@class="row")
]
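
As a rough illustration of the text-node approach, the sketch below runs that expression with lxml against a made-up, well-formed sample (the markup and variable names are assumptions, not the asker's real page). Text under noscript, center, or anything with class "row" is dropped; everything else is returned, whether it is nested in a child element or sits directly inside the div.

```python
from lxml import etree

# Hypothetical sample with both nested text and text directly inside the div.
doc = etree.fromstring(
    '<div class="chapter-content3">'
    '<noscript>ad</noscript>'
    '<center>banner</center>'
    '<div class="row">nav</div>'
    '<h4>title</h4>'
    '<p>first <b>bold</b> rest</p>'
    '<br/>'
    'loose text'
    '<br/>'
    '<p>last</p>'
    '</div>'
)

expr = (
    '//div[@class="chapter-content3"]//text()['
    'not(ancestor::noscript or ancestor::center or ancestor::*/@class="row")'
    ']'
)

# lxml returns text nodes as str subclasses; convert them to plain strings.
texts = [str(t) for t in doc.xpath(expr)]
print(texts)
# ['title', 'first ', 'bold', ' rest', 'loose text', 'last']

# Joining the pieces gives the visible chapter text in document order.
chapter = ''.join(texts)
```

On real HTML the result will usually also contain whitespace-only text nodes between tags, so stripping and filtering empty strings afterwards is likely to be needed.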

How do I retrieve nested and non-nested children from a parent HTML element using XPath?
