Following multiple links with rvest::follow_link()

Date: 2017-04-19 16:01:31

Tags: r rvest

The following code goes to the Accepted articles page of the R Journal and downloads the first article whose link text contains "package":

library(rvest)
library(magrittr)
url_stem <- html_session("https://journal.r-project.org/archive/accepted/")
url_paper <- follow_link(url_stem, "package") %>%
  follow_link("package") -> url_article
download.file(url_article$url, destfile = "article.pdf")

What I want is to take a given set of words and download every article that contains one or more of the matching words.

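Since follow_link() takes a single expression, I tried looping over the search terms, keeping in mind that the function returns an error when it finds no matching link.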

library(rvest)
library(magrittr)
url_stem <- html_session("https://journal.r-project.org/archive/accepted/")
search_terms <- c("package", "model", "linear")
tryCatch(
  for (i in search_terms) {
    url_paper <- follow_link(url_stem, search_terms[i]) %>%
      follow_link(search_terms[i]) -> url_article
    # I don't know how to name the files article1.pdf, article2.pdf, ...
    download.file(url_article$url, destfile = "article.pdf")
  }
)

I get the following error:

Error in if (!any(match)) { : missing value where TRUE/FALSE needed

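This post is not helpful, because it addresses the case of tags. The problem looks simple and could probably be solved more simply, but that may be because the R Journal website is very clean. Some websites are messier.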

1 answer:

Answer 0: (score: 1)

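If this is the actual problem you are trying to solve (finding R Journal entries that mention 'package') rather than a smaller example of a larger scraping task on another site, then you can do this: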

library(xml2)
library(stringi)
library(tidyverse)

doc <- xml_ns_strip(read_xml("https://journal.r-project.org/rss.atom"))

xml_find_all(doc, "//entry[contains(., 'ackage')]") %>% 
  map_chr(~{
    xml_find_first(.x, ".//link") %>% 
      xml_attr("href") %>% 
      stri_replace_last_fixed("/index.html", "") %>% 
      stri_replace_last_regex("/(RJ-.*)$", "/$1/$1.pdf")
  })

##  [1] "https://journal.r-project.org/archive/2017/RJ-2017-003/RJ-2017-003.pdf"
##  [2] "https://journal.r-project.org/archive/2017/RJ-2017-005/RJ-2017-005.pdf"
##  [3] "https://journal.r-project.org/archive/2017/RJ-2017-006/RJ-2017-006.pdf"
##  [4] "https://journal.r-project.org/archive/2017/RJ-2017-008/RJ-2017-008.pdf"
##  [5] "https://journal.r-project.org/archive/2017/RJ-2017-010/RJ-2017-010.pdf"
##  [6] "https://journal.r-project.org/archive/2017/RJ-2017-011/RJ-2017-011.pdf"
##  [7] "https://journal.r-project.org/archive/2017/RJ-2017-015/RJ-2017-015.pdf"
##  [8] "https://journal.r-project.org/archive/2017/RJ-2017-012/RJ-2017-012.pdf"
##  [9] "https://journal.r-project.org/archive/2017/RJ-2017-016/RJ-2017-016.pdf"
## [10] "https://journal.r-project.org/archive/2017/RJ-2017-014/RJ-2017-014.pdf"
## [11] "https://journal.r-project.org/archive/2017/RJ-2017-018/RJ-2017-018.pdf"
## [12] "https://journal.r-project.org/archive/2017/RJ-2017-019/RJ-2017-019.pdf"
## [13] "https://journal.r-project.org/archive/2017/RJ-2017-021/RJ-2017-021.pdf"
## [14] "https://journal.r-project.org/archive/2017/RJ-2017-022/RJ-2017-022.pdf"
## [15] "https://journal.r-project.org/archive/2016/RJ-2016-031/RJ-2016-031.pdf"
## [16] "https://journal.r-project.org/archive/2016/RJ-2016-032/RJ-2016-032.pdf"
## [17] "https://journal.r-project.org/archive/2016/RJ-2016-033/RJ-2016-033.pdf"
## [18] "https://journal.r-project.org/archive/2016/RJ-2016-034/RJ-2016-034.pdf"
## [19] "https://journal.r-project.org/archive/2016/RJ-2016-036/RJ-2016-036.pdf"
## [20] "https://journal.r-project.org/archive/2016/RJ-2016-041/RJ-2016-041.pdf"
## [21] "https://journal.r-project.org/archive/2016/RJ-2016-043/RJ-2016-043.pdf"
## [22] "https://journal.r-project.org/archive/2016/RJ-2016-045/RJ-2016-045.pdf"
## [23] "https://journal.r-project.org/archive/2016/RJ-2016-046/RJ-2016-046.pdf"
## [24] "https://journal.r-project.org/archive/2016/RJ-2016-047/RJ-2016-047.pdf"
## [25] "https://journal.r-project.org/archive/2016/RJ-2016-048/RJ-2016-048.pdf"
## [26] "https://journal.r-project.org/archive/2016/RJ-2016-050/RJ-2016-050.pdf"
## [27] "https://journal.r-project.org/archive/2016/RJ-2016-052/RJ-2016-052.pdf"
## [28] "https://journal.r-project.org/archive/2016/RJ-2016-054/RJ-2016-054.pdf"
## [29] "https://journal.r-project.org/archive/2016/RJ-2016-055/RJ-2016-055.pdf"
## [30] "https://journal.r-project.org/archive/2016/RJ-2016-056/RJ-2016-056.pdf"
## [31] "https://journal.r-project.org/archive/2016/RJ-2016-057/RJ-2016-057.pdf"
## [32] "https://journal.r-project.org/archive/2016/RJ-2016-058/RJ-2016-058.pdf"
## [33] "https://journal.r-project.org/archive/2016/RJ-2016-059/RJ-2016-059.pdf"
## [34] "https://journal.r-project.org/archive/2016/RJ-2016-060/RJ-2016-060.pdf"
## [35] "https://journal.r-project.org/archive/2016/RJ-2016-062/RJ-2016-062.pdf"

The RSS feed is a much more tractable source to scrape.
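To save the files under distinct names, one option is to loop over the URL vector itself and let each download fail on its own. What follows is only a sketch: `download_all()` and its `downloader` argument are hypothetical names, not part of rvest or xml2, and the argument exists only so the function can be tried without network access.

```r
# Hypothetical helper (not from the original answer): download every URL in
# `urls` into `dest_dir`, naming each file after the last path component,
# e.g. ".../RJ-2017-003/RJ-2017-003.pdf" becomes "RJ-2017-003.pdf".
# `downloader` defaults to utils::download.file().
download_all <- function(urls, dest_dir = ".",
                         downloader = utils::download.file) {
  vapply(urls, function(u) {
    dest <- file.path(dest_dir, basename(u))
    # tryCatch() sits inside the loop, so one failed download is reported
    # and skipped instead of aborting the remaining ones.
    tryCatch({
      downloader(u, destfile = dest, mode = "wb")  # "wb" keeps PDFs intact
      TRUE
    }, error = function(e) {
      message("Skipping ", u, ": ", conditionMessage(e))
      FALSE
    })
  }, logical(1), USE.NAMES = FALSE)
}
```

Assuming the character vector printed above is stored in a variable, say `pdf_urls`, then `ok <- download_all(pdf_urls)` returns one TRUE/FALSE per URL. Iterating over the values also avoids a likely cause of the error in the question: `for (i in search_terms)` already yields each term, so `search_terms[i]` indexes an unnamed vector by name and gives NA, which is consistent with "missing value where TRUE/FALSE needed".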

Even if this is not the specific task, I think this line:

xml_find_all(doc, "//entry[contains(., 'ackage')]")

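is ultimately what you are after. It finds all entry tags that have that string anywhere in their descendants. You can use XPath boolean logic inside the [] (i.e. logically chain multiple contains() calls).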
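A minimal sketch of chaining several contains() calls with `or`, built from a vector of search terms. The small in-memory feed and its titles are made up for illustration. It has no namespace, so xml_ns_strip() is not needed here. Note that contains() is case-sensitive, and this simple string building assumes the terms contain no single quotes.

```r
library(xml2)

# A tiny stand-in for the real feed; titles and URLs are invented.
feed <- read_xml('<feed>
  <entry><title>A package for linear models</title>
    <link href="https://example.org/a"/></entry>
  <entry><title>Plotting maps</title>
    <link href="https://example.org/b"/></entry>
  <entry><title>Mixed model diagnostics</title>
    <link href="https://example.org/c"/></entry>
</feed>')

search_terms <- c("package", "model")

# Build "contains(., 'package') or contains(., 'model')" from the vector.
predicate <- paste(sprintf("contains(., '%s')", search_terms),
                   collapse = " or ")
xpath <- sprintf("//entry[%s]", predicate)

# Entries matching at least one term, then the href of each entry's link.
hits  <- xml_find_all(feed, xpath)
hrefs <- xml_attr(xml_find_first(hits, ".//link"), "href")
```

With this toy feed, `hrefs` holds the links of the first and third entries. Replacing `" or "` with `" and "` would instead require every term to appear in an entry.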