Question

I need to get text from HTML pages, but some of them contain unwanted text that comes after a certain marker in the page ('---------'). Example of HTML page 1:

...
<p> This is correct text. Everything after it is wrong</p>
<p>---------</p>
<p><strong>This is wrong text</strong></p>
<p> This is wrong another text</p>
...

Example of HTML page 2:

...
<p> This is correct text. Everything after it is wrong</p>
<p> This text is also valid </p>
<p> This is another correct text</p>
...

So if the page contains '---------', I need to grab only the text before it; if it does not, I need to grab everything. As described in "Get text followed by certain text", I can use:

//p[following-sibling::p[contains(.,'---------')]][1]/text()

for the first example. But is it possible to handle both cases with a single XPath?

Answer 1

//p[    not(contains(.,'---------')) 
    and not(preceding-sibling::p[contains(.,'---------')])]//text()

will return

This is correct text. Everything after it is wrong

for the first case, and

This is correct text. Everything after it is wrong
This text is also valid
This is another correct text

for the second case, as required.
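
To see why one expression covers both pages, here is a rough Python sketch of the same selection logic: keep each p until a p containing the separator shows up, and keep everything if it never does. It uses only the standard library, whose ElementTree XPath subset does not support contains() or the preceding-sibling axis, so the predicate is mirrored in plain Python rather than evaluated as XPath. The wrapping div, the function name text_before_separator, and the SEPARATOR constant are assumptions made for illustration (the original markup around the paragraphs is elided). It also returns one stripped string per p, whereas //text() yields individual text nodes.

```python
import xml.etree.ElementTree as ET

SEPARATOR = "---------"  # marker text from the question


def text_before_separator(container):
    """Return the text of each <p> child that precedes the first separator <p>.

    If no <p> contains the separator, every <p> is returned.
    """
    texts = []
    for p in container.findall("p"):  # direct <p> children, in document order
        # Joining itertext() gives the element's full string value,
        # which is what "." means inside contains(., ...) in XPath.
        value = "".join(p.itertext())
        if SEPARATOR in value:
            break  # this <p> and all later siblings are excluded
        texts.append(value.strip())
    return texts


# The <div> wrapper is hypothetical; the question does not show the parent element.
page1 = ET.fromstring(
    "<div>"
    "<p> This is correct text. Everything after it is wrong</p>"
    "<p>---------</p>"
    "<p><strong>This is wrong text</strong></p>"
    "<p> This is wrong another text</p>"
    "</div>"
)
page2 = ET.fromstring(
    "<div>"
    "<p> This is correct text. Everything after it is wrong</p>"
    "<p> This text is also valid </p>"
    "<p> This is another correct text</p>"
    "</div>"
)

print(text_before_separator(page1))
print(text_before_separator(page2))
```

ElementTree only parses well-formed XML, so real-world HTML would need an HTML-aware parser with full XPath 1.0 support, where the expression from the answer can be used directly.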
