Question

I am scraping data from a website that looks like this:

<div class="content">
  <blockquote>
    <div>
      Do not select this.
    </div>
    How do I select only this…
    <br />
    and this…
    <br />
    and this in a single node?
  </blockquote>
</div>

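Suppose a snippet like this appears 20 times on a single page. I want to get all of the text inside each <blockquote>, but ignore everything inside child nodes such as the inner div.

So I use: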

html %>%
  html_nodes(xpath = "//*[@class='content']/blockquote/text()[normalize-space()]")

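However, this splits "How do I select only this…", "and this…" and "and this in a single node?" into separate elements of the resulting xml_nodeset.

What should I do to join all of these text nodes into one, so that I get back the same 20 elements (or just one element, if this example is all I have)?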

Answer 1

You can use the xml_remove() function to delete nodes that you select with CSS or XPath.

library(rvest)
library(xml2) # xml_remove() comes from xml2; recent rvest versions do not attach it

text <- '<div class="content">
  <blockquote>
    <div>
      Do not select this.
    </div>
    How do I select only this…
    <br />
    and this…
    <br />
    and this in a single node?
  </blockquote>
</div>'

myhtml <- read_html(text)

#select the nodes you don't want to select
do_not_select <- myhtml %>%
    html_nodes("blockquote>div") #using css

#remove those nodes
xml_remove(do_not_select)

The result still contains the surrounding whitespace, which you can strip afterwards.

#sample result
myhtml %>%
    html_text()
[1] "\n  \n    \n    How do I select only this…\n    \n    and this…\n    \n    and this in a single node?\n  \n"
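The whitespace cleanup mentioned above could look roughly like this. It is only a sketch: it uses xml2 directly, selects each <blockquote> rather than the whole document so that a page with 20 snippets would yield 20 strings, and the names page and quotes are made up for illustration.

```r
library(xml2)

page <- read_html('<div class="content">
  <blockquote>
    <div>
      Do not select this.
    </div>
    How do I select only this…
    <br />
    and this…
    <br />
    and this in a single node?
  </blockquote>
</div>')

# Drop the unwanted inner <div> elements, as in the answer above
xml_remove(xml_find_all(page, "//blockquote/div"))

# One string per <blockquote>
quotes <- xml_text(xml_find_all(page, "//*[@class='content']/blockquote"))

# Collapse every run of whitespace (including newlines) to one space, then trim
quotes <- trimws(gsub("\\s+", " ", quotes))
quotes
```

Note that xml_remove() modifies the parsed document in place, so re-read the HTML if the removed nodes are needed later.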

Answer 2

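You can try the XPath below to join all of the child text nodes. Note that string-join() is an XPath 2.0 function, so it needs an XPath 2.0 processor; xml2 and rvest are built on libxml2, which implements XPath 1.0 only: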

"string-join(//*[@class='content']/blockquote/text()[normalize-space()], ' ')"

Output

How do I select only this… and this… and this in a single node?
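Where only XPath 1.0 is available, as with xml2 and rvest, a similar join can be done on the R side instead. The following is a sketch rather than a definitive solution: it selects each <blockquote>, collects its direct text children with the question's own predicate, and pastes them together. The helper join_text_children and the second sample snippet are invented for illustration.

```r
library(xml2)

page <- read_html('<div class="content">
  <blockquote>
    <div>
      Do not select this.
    </div>
    How do I select only this…
    <br />
    and this…
    <br />
    and this in a single node?
  </blockquote>
</div>
<div class="content">
  <blockquote>
    <div>Skip me too.</div>
    First line
    <br />
    second line
  </blockquote>
</div>')

blockquotes <- xml_find_all(page, "//*[@class='content']/blockquote")

# Hypothetical helper: join the non-blank direct text children of one node
join_text_children <- function(node) {
  parts <- xml_text(xml_find_all(node, "./text()[normalize-space()]"))
  paste(trimws(parts), collapse = " ")
}

# One joined string per <blockquote>
joined <- vapply(blockquotes, join_text_children, character(1))
joined
```

Because the XPath only matches text() nodes that are direct children of the <blockquote>, the inner div is skipped without having to remove it from the document.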

XPath: select and concatenate all text nodes
