I have this HTML:
<p>
<!-- templateDebugMode: start template: articleLists/indexHeadline.html -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1195.shtml">News item 1</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1194.shtml">News item 2</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
November 29th 2017 : <a href="https://www.example.org/news/1191.shtml">News item 3</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html -->
</p>
I tried splitting on <br>, but after reading this answer I found the comments to be a more reliable anchor:
//*[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
[following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]
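This gives me the links (which I could get with a simpler expression) but not the date (text) that precedes each one. How can I extract that for each entry? The data I want to collect for each news item is: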
Answer 0 (score: 0)
You can use the following XPath expressions to get the desired output:
//p/text()[normalize-space()] # for date (skips whitespace-only text nodes)
//p/a/@href # for link
//p/a/text() # for link text
If you still want to use the comments in your XPath:
//p/text()[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
[following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]] # for date
//p/a[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
[following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]/@href # for links
//p/a[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
[following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]/text() # for link text
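The expressions above return dates, links, and link texts as three separate lists, so they still have to be matched up per entry. As a rough cross-check, here is a minimal sketch that does the pairing without XPath, using Python's standard-library html.parser and treating the articleRow start/end comments as row boundaries. The names RowCollector and extract_rows are made up for this illustration, and the trimming of the trailing " : " assumes every date is formatted as in the sample.

```python
from html.parser import HTMLParser

HTML = """<p>
<!-- templateDebugMode: start template: articleLists/indexHeadline.html -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1195.shtml">News item 1</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
December 18th 2017 : <a href="https://www.example.org/news/1194.shtml">News item 2</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- templateDebugMode: start template: articleLists/indexHeadline.html - templateCell: articleRow -->
November 29th 2017 : <a href="https://www.example.org/news/1191.shtml">News item 3</a><br>
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html - templateCell: articleRow -->
<!-- /templateDebugMode: end template: articleLists/indexHeadline.html -->
</p>"""


class RowCollector(HTMLParser):
    """Hypothetical helper: one record per articleRow comment pair."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # record being filled, or None outside a row
        self._in_link = False  # True while inside the row's <a> element

    def handle_comment(self, data):
        # Only the per-row comments matter; the outer template comments are skipped.
        if "templateCell: articleRow" not in data:
            return
        if "start template" in data:
            self._row = {"date": "", "href": None, "text": ""}
        elif "end template" in data and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self._row is not None:
            self._row["href"] = dict(attrs).get("href")
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._row is None:
            return
        if self._in_link:
            self._row["text"] += data
        elif self._row["href"] is None:
            # Text seen before the link is the date, e.g. "December 18th 2017 : "
            self._row["date"] += data


def extract_rows(html):
    """Return a list of (date, href, link text) tuples, one per row."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [
        # Assumes the date is followed by " : " as in the sample markup.
        (row["date"].strip().rstrip(":").strip(), row["href"], row["text"].strip())
        for row in parser.rows
    ]


if __name__ == "__main__":
    for date, href, text in extract_rows(HTML):
        print(date, href, text, sep=" | ")
```

Keeping the three fields in one record per row avoids relying on the three XPath result lists having the same length and order, which can break if a row is ever missing its link or date.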