Question

I am trying to extract text from the following structure:

<p class="id1"> Title or something </p>    
<p> Text text text </p>
<p> More text </p>
<p class="id2"> Something else </p>

When I use:

text_info <- xpathSApply(PARSED, "//p", xmlValue)

the result is:

[1] 'Title or something'
[2] 'Text text text'
[3] 'More text'
[4] 'Something else'

I only want the text inside the <p> elements that have no class:

[1] 'Text text text'
[2] 'More text'

I am using the following code, but it takes a long time, and I have a lot of text to process:

text_info <- setdiff(xpathSApply(PARSED, "//p", xmlValue), xpathSApply(PARSED, "//p[@class]", xmlValue))

Is there a way to extract only the <p> elements without a class using a single xpathSApply call?

Answer 1

You can use not() in the XPath expression.

xpathSApply(doc, "//p[not(@class)]", xmlValue, trim = TRUE)
# [1] "Text text text" "More text"

This selects the <p> elements that do not have a class attribute.
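As a rough sketch of how the single-pass predicate compares with the setdiff() approach from the question, the snippet below runs both on a small sample. The sample HTML is an assumption: it is the question's structure with one unclassed paragraph repeated on purpose, and the variable names (html, no_class, via_setdiff) are illustrative only.

```r
library(XML)

# Hypothetical sample: the question's structure plus a repeated unclassed paragraph.
html <- '<p class="id1"> Title or something </p>
<p> Text text text </p>
<p> More text </p>
<p> More text </p>
<p class="id2"> Something else </p>'
doc <- htmlParse(html, asText = TRUE)

# Single pass: the predicate keeps only <p> nodes that have no class attribute.
no_class <- xpathSApply(doc, "//p[not(@class)]", xmlValue, trim = TRUE)

# Two-pass approach from the question, shown for comparison.
via_setdiff <- setdiff(
  xpathSApply(doc, "//p", xmlValue, trim = TRUE),
  xpathSApply(doc, "//p[@class]", xmlValue, trim = TRUE)
)

print(no_class)
print(via_setdiff)
```

Besides needing two queries, setdiff() works on the extracted strings and not on the nodes, so it collapses duplicate paragraphs and would also drop an unclassed paragraph whose text happens to match a classed one. The not(@class) predicate filters the nodes themselves and keeps every match in document order.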

Data:

library(XML)
doc <- htmlParse('<p class="id1"> Title or something </p>
<p> Text text text </p>
<p> More text </p>
<p class="id2"> Something else </p>')