Question

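I am scraping newspaper articles and am struggling to figure out how to exclude multiple nodes. The R help says that :not() accepts a sequence of simple selectors. I tried the following: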

zeit_url <- read_html("http://www.zeit.de/wissen/gesundheit/2017-09/aids-hiv-neuinfektionen-europa-virus-gesundheit")

article <- zeit_url %>%
    html_nodes('.article-page>:not(.ad-container, .cardstack)') %>%
    html_text()

Separating the two nodes with a comma does not work. Any suggestions on how to correctly specify a sequence of selectors inside :not()?

I have spent a lot of time looking for an answer, but I am new to R (and HTML), so if this is obvious, thank you for your patience.

Answer 1

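library(rvest)

# Keep the URL on a single line; a line break inside the string would become part of the URL.
zeit_url <- read_html("http://www.zeit.de/wissen/gesundheit/2017-09/aids-hiv-neuinfektionen-europa-virus-gesundheit")

# Chain one :not() per class instead of putting a comma-separated list inside a single :not().
article <- zeit_url %>%
           html_nodes(".article-page>:not(.ad-container):not(.cardstack)") %>%
           html_text()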
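The chained form can be checked without fetching the live page. The sketch below runs the same selector against a small made-up HTML fragment; the fragment and its text are assumptions for illustration, and only the class names come from the question. As far as I know, the reason for chaining is that in CSS Selectors Level 3 each :not() takes a single simple selector, so "not A and not B" is written as two consecutive :not() parts.

```r
library(rvest)

# Hypothetical stand-in for the article markup (not the real page).
page <- read_html('
  <div class="article-page">
    <p>First paragraph.</p>
    <div class="ad-container">Advertisement</div>
    <p>Second paragraph.</p>
    <div class="cardstack">Related cards</div>
  </div>')

# Select direct children of .article-page that carry neither class.
kept <- page %>%
  html_nodes(".article-page > :not(.ad-container):not(.cardstack)") %>%
  html_text()

print(kept)
```

Only the two paragraph nodes should remain in kept; the advertisement and card stack children are filtered out by the two :not() parts.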
