XPath: select and concatenate all text nodes

Time: 2018-06-12 16:33:10

Tags: r xpath web-scraping rvest

I am scraping data from a website whose markup looks like this:

<div class="content">
  <blockquote>
    <div>
      Do not select this.
    </div>
    How do I select only this…
    <br />
    and this…
    <br />
    and this in a single node?
  </blockquote>
</div>

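Suppose a snippet like this appears 20 times on a single page. I want to get all of the text inside <blockquote>, but ignore everything inside child nodes such as the inner div.

So I use: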

html %>%
  html_nodes(xpath = "//*[@class='content']/blockquote/text()[normalize-space()]")

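However, this splits "How do I select only this…and this…and this in a single node?" into separate elements of the resulting xml_nodeset.

What should I do to concatenate all of these text nodes into one, so that I get back 20 elements (or just one element, if I only have this example)?

2 answers:

Answer 0 (score: 2)

You can use the xml_remove() function to remove nodes selected with CSS or XPath.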

library(rvest)
library(xml2) # xml_remove() comes from xml2; attach it explicitly in case rvest does not

text <- '<div class="content">
  <blockquote>
    <div>
      Do not select this.
    </div>
    How do I select only this…
    <br />
    and this…
    <br />
    and this in a single node?
  </blockquote>
</div>'

myhtml <- read_html(text)

#select the nodes you don't want to select
do_not_select <- myhtml %>%
    html_nodes("blockquote>div") #using css

#remove those nodes
xml_remove(do_not_select)

You can strip the leftover whitespace afterwards.

#sample result
myhtml %>%
    html_text()
[1] "\n  \n    \n    How do I select only this…\n    \n    and this…\n    \n    and this in a single node?\n  \n"
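
The whitespace cleanup mentioned above is not shown in the answer. A minimal sketch of one way to do it, assuming base R's gsub() and trimws() are acceptable; the variable names page, quotes and cleaned are made up for this illustration:

```r
library(rvest)
library(xml2)

# Same sample markup as in the question
page <- read_html('<div class="content">
  <blockquote>
    <div>
      Do not select this.
    </div>
    How do I select only this…
    <br />
    and this…
    <br />
    and this in a single node?
  </blockquote>
</div>')

# Drop the unwanted inner <div> from every blockquote (modifies page in place)
xml_remove(html_nodes(page, "blockquote > div"))

# One string per <blockquote>, so 20 blockquotes would give 20 elements
quotes <- page %>%
  html_nodes(".content > blockquote") %>%
  html_text()

# Collapse each run of whitespace to a single space and trim both ends
cleaned <- trimws(gsub("\\s+", " ", quotes))
cleaned
```

Because html_text() is called on the blockquote nodes rather than on the whole document, the result keeps one element per blockquote.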

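Answer 1 (score: 1)

You can try the XPath below to concatenate all of the text pieces. Note that string-join() is an XPath 2.0 function, while xml2/rvest evaluates XPath 1.0 (via libxml2), so this expression needs an XPath 2.0 processor: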

"string-join(//*[@class='content']/blockquote/text()[normalize-space()], ' ')"

Output

How do I select only this… and this… and this in a single node?
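
For use from rvest, the same join can also be done on the R side, which avoids relying on XPath 2.0 support. A minimal sketch, assuming the page holds several .content blocks; the two-block sample HTML and the names snippet, page, blocks and joined are hypothetical, introduced only for this illustration:

```r
library(rvest)
library(xml2)

# Hypothetical page: the question's snippet repeated twice
snippet <- '<div class="content">
  <blockquote>
    <div>
      Do not select this.
    </div>
    How do I select only this…
    <br />
    and this…
    <br />
    and this in a single node?
  </blockquote>
</div>'
page <- read_html(paste0("<html><body>", snippet, snippet, "</body></html>"))

# One node per blockquote
blocks <- html_nodes(page, xpath = "//*[@class='content']/blockquote")

# For each blockquote, take only its direct, non-blank text nodes
# (this skips the inner <div>) and join them with single spaces
joined <- vapply(seq_along(blocks), function(i) {
  parts <- html_nodes(blocks[[i]], xpath = "./text()[normalize-space()]")
  paste(trimws(html_text(parts)), collapse = " ")
}, character(1))

joined
```

The relative XPath ./text() is evaluated per blockquote, so joined has exactly one element for each blockquote on the page.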