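Using rvest

Date: 2017-07-26 11:04:22

Tags: r rvest

I'm trying to learn the rvest package, but the documentation and examples online are either very basic or very complex. I can't find out how to use the follow_link function in a loop to walk through several pages. Maybe I just don't understand its logic...

Here is a simplified example of what I tried: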

library(rvest)

url <-
  "https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500"

s <- html_session(url)

liste <- list()
for (i in 1:2) {
  data <-
    s %>%
    read_html() %>%
    html_nodes("#mw-whatlinkshere-list li")

  result <- c(liste, data)

  s <- s %>% 
    follow_link(xpath = "//a[text()='next 500']/@href")

}

I also tried to avoid jump_link, like this. It works better, but I'm not sure it is the best and fastest solution:

liste <- c()
while (!is.na(url)) {
  data <-
    url %>%
    read_html() %>%
    html_nodes("#mw-whatlinkshere-list li")


  liste <- c(liste, data)

  url <- url %>% 
    read_html() %>% 
    html_node(xpath = "//a[text()='next 500']") %>% 
    html_attr("href") %>% 
    paste0("https://www.wikidata.org", .) 

  print(url)


}

Any suggestions are welcome and appreciated.

1 answer:

Answer 0 (score: 1)

Try this:

library(rvest)

url <- "https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500"
s   <- html_session(url)

liste <- list()
for (i in 1:2) {
  data <-
    s %>%
    read_html() %>%
    html_nodes("#mw-whatlinkshere-list li")

  # There was a mistake here: the original assigned to `result`,
  # so `liste` was never updated and the results were lost.
  liste <- c(liste, data)

  # follow_link() needs an 'a' element, not its 'href' value. Besides,
  # there are two 'next 500' links on the page. They are the same, but
  # you have to pick one.
  s <- s %>%
    follow_link(xpath = "//a[text()='next 500'][1]")
}
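
As a side note on the while-loop variant from the question: it never stops, because paste0() turns a missing href into the string "https://www.wikidata.orgNA", which is not NA, and it also downloads every page twice. Below is a minimal sketch of one way around both problems. The helper names next_page_url and collect_items, the fetch argument and the max_pages cap are hypothetical and only there for illustration; the selectors are the ones from the question.

```r
library(rvest)

# Hypothetical helper: absolute URL of the first 'next 500' link on a
# parsed page, or NA_character_ when the page has no such link.
next_page_url <- function(page, base = "https://www.wikidata.org") {
  href <- page %>%
    html_node(xpath = "//a[text()='next 500']") %>%  # html_node() keeps the first match
    html_attr("href")                                # NA when nothing matched
  if (is.na(href)) NA_character_ else paste0(base, href)
}

# Hypothetical helper: walk the pages, parsing each one only once.
# `fetch` can be swapped out (an assumption of this sketch) so the loop
# can be tried without network access; `max_pages` is a safety cap.
collect_items <- function(url, fetch = read_html, max_pages = Inf) {
  pages <- list()
  while (!is.na(url) && length(pages) < max_pages) {
    page <- fetch(url)
    # Store each page's node set as one list element.
    pages[[length(pages) + 1]] <- html_nodes(page, "#mw-whatlinkshere-list li")
    url <- next_page_url(page)  # becomes NA on the last page, ending the loop
  }
  pages
}
```

Something like collect_items(url, max_pages = 2) should then mirror the two iterations of the for loop above, and dropping max_pages would keep going until no 'next 500' link is left. Note that in rvest 1.0.0 and later, html_session() and follow_link() were renamed to session() and session_follow_link().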