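I'm trying to learn the rvest package, but the documentation and examples on the web are either very basic or very complex. I can't find out how to use the follow_link function in a loop to browse through several pages. Perhaps I don't understand its logic at all...
Here is a simplified example of my attempt: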
library(rvest)
url <- "https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500"
s <- html_session(url)
liste <- list()
for (i in 1:2) {
  data <-
    s %>%
    read_html() %>%
    html_nodes("#mw-whatlinkshere-list li")
  result <- c(liste, data)
  s <- s %>%
    follow_link(xpath = "//a[text()='next 500']/@href")
}
I also tried to avoid jump_link, like this. It works better, but I'm not sure it is the best or the fastest solution:
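liste <- c()
while (!is.na(url)) {
  data <-
    url %>%
    read_html() %>%
    html_nodes("#mw-whatlinkshere-list li")
  liste <- c(liste, data)
  href <- url %>%
    read_html() %>%
    html_node(xpath = "//a[text()='next 500']") %>%
    html_attr("href")
  # href is NA when the page has no 'next 500' link. Pasting the host onto NA
  # would give a non-NA string, so keep url as NA to let the loop end.
  url <- if (is.na(href)) NA else paste0("https://www.wikidata.org", href)
  print(url)
}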
Any suggestions are welcome and much appreciated.
Answer 0 (score: 1)
Try this:
library(rvest)
url <- "https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500"
s <- html_session(url)
liste <- list()
for (i in 1:2) {
  data <-
    s %>%
    read_html() %>%
    html_nodes("#mw-whatlinkshere-list li")
  # There was a mistake here: you were assigning to 'result' on every
  # iteration, so 'liste' never accumulated anything.
  liste <- c(liste, data)
  # Here you have to pass an 'a' tag, not an 'href' value. Besides,
  # there are two 'next 500' tags on the page. They point to the same URL,
  # but you have to pick one; the parentheses make [1] apply to the whole
  # result set rather than to each parent element.
  s <- s %>%
    follow_link(xpath = "(//a[text()='next 500'])[1]")
}
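If you would rather not hard-code the number of pages, the loop can stop by itself when there is no 'next 500' link left. Below is a minimal sketch of that idea, not a definitive implementation. The helper names `next_page_url` and `collect_items` and the `max_pages` safety cap are hypothetical and do not come from rvest. It assumes the hrefs on the page are root-relative, as in the question. Note also that rvest 1.0 and later rename `html_session()` to `session()` and `follow_link()` to `session_follow_link()`; the older names are kept here to match the code above.

```r
library(rvest)

# Hypothetical helper: return the absolute URL of the first 'next 500' link
# in a parsed page, or NA when the page has no such link.
# Assumption: the href is root-relative, so prefixing the host is enough.
next_page_url <- function(page, base = "https://www.wikidata.org") {
  href <- page %>%
    html_node(xpath = "(//a[text()='next 500'])[1]") %>%
    html_attr("href")
  if (is.na(href)) NA_character_ else paste0(base, href)
}

# Hypothetical helper: collect the list items page by page until there is
# no next link, or until max_pages pages have been read.
collect_items <- function(url, max_pages = 2) {
  items <- list()
  pages_read <- 0
  while (!is.na(url) && pages_read < max_pages) {
    page <- read_html(url)  # download and parse each page only once
    items <- c(items, html_nodes(page, "#mw-whatlinkshere-list li"))
    url <- next_page_url(page)
    pages_read <- pages_read + 1
  }
  items
}
```

Calling `collect_items(url)` with the `url` defined above should read at most two pages. Raising `max_pages` lets it continue until the last page, at the cost of many more requests to the server.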