XPath: selecting text between comments

Posted: 2018-01-09 18:35:02

Tags: xpath

I have this HTML:

<p>
<!-- templateDebugMode: start template: articleLists/indexHeadline.html -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1195.shtml">News item 1</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1194.shtml">News item 2</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
November 29th 2017 : <a href="https://www.example.org/news/1191.shtml">News item 3</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html -->
</p>

I tried splitting on <br>, but after following this answer I found the comments to be a more reliable delimiter:

//*[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
   [following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]

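This gives me the link (which I could get with a simpler expression) but not the date (the text) that precedes it. How can I extract that for each entry? The data I want to collect for each news item is:

  • Date
  • Link
  • Link text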

1 Answer:

Answer 0 (score: 0)

You can get the desired output with the following XPath expressions:

//p/text()[normalize-space()] # for date (skips whitespace-only text nodes)
//p/a/@href # for link
//p/a/text() # for link text

If you still want to use the comments in the XPath:

//p/text()[normalize-space()]
          [preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
          [following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]  # for date

//p/a[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
     [following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]/@href  # for links

//p/a[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
     [following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]/text()  # for link text
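
The expressions above return dates, links, and link texts as three separate node lists, so the fields still have to be paired up per news item. As a rough sketch of that pairing step, the following does not use XPath at all: it relies on Python's standard-library html.parser and treats the "templateCell: articleRow" comments as record boundaries. The class name ArticleRowParser and the dictionary keys are made up for illustration, and the code assumes each date is followed by a " :" separator, as in the sample from the question.

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
SAMPLE = """<p>
<!-- templateDebugMode: start template: articleLists/indexHeadline.html -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1195.shtml">News item 1</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1194.shtml">News item 2</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
November 29th 2017 : <a href="https://www.example.org/news/1191.shtml">News item 3</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html -->
</p>"""

ROW_MARKER = "templateCell: articleRow"


class ArticleRowParser(HTMLParser):
    """Hypothetical helper: collects one dict per articleRow comment pair."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished records
        self.current = None   # record being built, or None outside a row
        self.in_link = False  # True while inside an <a> element

    def handle_comment(self, data):
        if ROW_MARKER not in data:
            return  # ignore the outer template comments
        if "start template:" in data:
            self.current = {"date": "", "link": None, "text": ""}
        elif "end template:" in data and self.current is not None:
            # Assumption: the date ends with a " :" separator, as in the sample.
            date = self.current["date"].strip().rstrip(":").strip()
            self.current["date"] = date
            self.rows.append(self.current)
            self.current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.current is not None:
            self.current["link"] = dict(attrs).get("href")
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.current is None:
            return
        if self.in_link:
            self.current["text"] += data
        elif self.current["link"] is None:
            # Text that appears before the link inside a row is the date.
            self.current["date"] += data


parser = ArticleRowParser()
parser.feed(SAMPLE)
parser.close()
rows = parser.rows

for row in rows:
    print(row["date"], row["link"], row["text"], sep=" | ")
```

Keeping the three fields in one record avoids relying on the three XPath result lists lining up index by index, which would break if an entry were missing its link or its date.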