Question

I have this HTML:

<p>
<!-- templateDebugMode: start template: articleLists/indexHeadline.html -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1195.shtml">News item 1</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1194.shtml">News item 2</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
November 29th 2017 : <a href="https://www.example.org/news/1191.shtml">News item 3</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html -->
</p>

I tried splitting on <br>, but after following this answer I found the comments to be a more reliable anchor:

//*[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
   [following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]

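This gives me the links (which I could get with a simpler expression) but not the date (text) that precedes each one. How can I extract that for each entry? The data I want to collect for each news item is: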

Date
Link
Link text

Answer 1

You can get the required output with the following XPath expressions:

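Date:
//p/text()[normalize-space()]

Link:
//p/a/@href

Link text:
//p/a/text()

normalize-space() is used for the date so that whitespace-only text nodes (the line breaks between the comments) are not returned.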
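As a rough illustration, the three expressions can be evaluated and zipped into one record per news item. This is only a sketch: it assumes Python with the third-party lxml library, the names PAGE, dates, links, titles and items are made up for the example, and the date filter is normalize-space() so that whitespace-only text nodes are skipped.

```python
from lxml import etree

# The markup from the question, embedded so the sketch is self-contained.
PAGE = """<p>
<!-- templateDebugMode: start template: articleLists/indexHeadline.html -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1195.shtml">News item 1</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1194.shtml">News item 2</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
November 29th 2017 : <a href="https://www.example.org/news/1191.shtml">News item 3</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html -->
</p>"""

root = etree.HTML(PAGE)

# Non-blank text nodes directly under <p>; trim whitespace and the trailing " :".
dates = [t.strip().rstrip(":").strip()
         for t in root.xpath("//p/text()[normalize-space()]")]
links = root.xpath("//p/a/@href")
titles = root.xpath("//p/a/text()")

# The three lists are parallel, so zip() pairs them up item by item.
items = list(zip(dates, links, titles))

for date, link, title in items:
    print(date, link, title)
```

Zipping relies on every row having exactly one date and one link; if a row could be missing either, the lists would drift out of step.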

If you still want to use the comments in the XPath:

Date:
//p/text()[normalize-space()]
          [preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
          [following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]

Links:
//p/a[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
     [following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]/@href

Link text:
//p/a[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
     [following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]/text()
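To keep the date, link and link text together per news item, one option is to select each comment-delimited link first and then look up the nearest non-blank text node before it. The following is a sketch under assumptions, not a definitive implementation: it uses Python with the third-party lxml library, and PAGE, START, END and records are hypothetical names chosen for the example.

```python
from lxml import etree

# The markup from the question, embedded so the sketch is self-contained.
PAGE = """<p>
<!-- templateDebugMode: start template: articleLists/indexHeadline.html -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1195.shtml">News item 1</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1194.shtml">News item 2</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
November 29th 2017 : <a href="https://www.example.org/news/1191.shtml">News item 3</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html -->
</p>"""

START = "start template: articleLists/indexHeadline.html"
END = "end template: articleLists/indexHeadline.html"

root = etree.HTML(PAGE)

# Links whose nearest comment on each side is a start/end marker.
# $start and $end are XPath variables, supplied as keyword arguments.
anchors = root.xpath(
    "//p/a[preceding-sibling::comment()[1][contains(., $start)]]"
    "[following-sibling::comment()[1][contains(., $end)]]",
    start=START, end=END,
)

records = []
for a in anchors:
    # On a reverse axis, [1] picks the closest node, i.e. this link's own date.
    date_nodes = a.xpath("preceding-sibling::text()[normalize-space()][1]")
    date = date_nodes[0].strip().rstrip(":").strip() if date_nodes else None
    records.append({"date": date, "link": a.get("href"), "text": a.text})

for record in records:
    print(record)
```

Because each date is looked up relative to its own link, a row without a date yields None for that record rather than shifting the following rows.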

XPath to select text between comments
