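I need to extract text from HTML pages, but some of them contain unwanted text that appears after a marker paragraph ('---------'). For example, here is a sample of HTML page 1: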
...
<p> This is correct text. Everything after it is wrong</p>
<p>---------</p>
<p><strong>This is wrong text</strong></p>
<p> This is wrong another text</p>
...
And a sample of HTML page 2:
...
<p> This is correct text. Everything after it is wrong</p>
<p> This text is also valid </p>
<p> This is another correct text</p>
...
So if the page contains '---------', I need to grab only the text before it; if it does not, I need to grab everything. As described in "Get text followed by certain text", I can use:
//p[following-sibling::p[contains(.,'---------')]][1]/text()
for the first example. But is it possible to handle both cases with a single XPath?
Answer 0 (score: 1)
//p[ not(contains(.,'---------'))
and not(preceding-sibling::p[contains(.,'---------')])]//text()
will return
This is correct text. Everything after it is wrong
for the first case, and
This is correct text. Everything after it is wrong
This text is also valid
This is another correct text
for the second case, as requested.
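
If it helps to see the selection rule outside of XPath, here is a minimal Python sketch of the same idea: keep every <p> until the first one that contains the marker. It is not the XPath expression itself. It uses only the standard library's xml.etree.ElementTree, which does not support the preceding-sibling axis, so the rule is written as a loop instead. The function name texts_before_marker and the wrapping <div> are my own assumptions for illustration, and the sketch assumes the fragment is well-formed XML with the <p> elements as siblings.

```python
import xml.etree.ElementTree as ET

MARKER = "---------"


def texts_before_marker(fragment):
    """Return the text of each <p> that comes before the first marker <p>.

    Hypothetical helper: assumes `fragment` is well-formed XML whose
    <p> elements are siblings.
    """
    # Wrap the fragment so the sibling <p> elements share one parent.
    root = ET.fromstring("<div>" + fragment + "</div>")
    kept = []
    for p in root.findall("p"):
        # itertext() also collects text inside nested tags such as <strong>.
        text = "".join(p.itertext())
        if MARKER in text:
            break  # the marker paragraph and everything after it are dropped
        kept.append(text.strip())
    return kept


page1 = """
<p> This is correct text. Everything after it is wrong</p>
<p>---------</p>
<p><strong>This is wrong text</strong></p>
<p> This is wrong another text</p>
"""

page2 = """
<p> This is correct text. Everything after it is wrong</p>
<p> This text is also valid </p>
<p> This is another correct text</p>
"""

print(texts_before_marker(page1))
print(texts_before_marker(page2))
```

Unlike the XPath, which returns individual text nodes, this returns one whitespace-stripped string per paragraph. Real-world HTML is often not well-formed XML, so an HTML-aware parser with full XPath support would probably be a better fit there.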