Question

I am trying to scrape some pages using rvest and purrr::map. However, I am not sure how to handle failing links with purrr::safely. Given the following code:

library(rvest)
library(purrr)
urls <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
             "https://en.wikipedia.org/wiki/Rome",
             "lkjsadajf")
h <- urls %>% map(~{
  Sys.sleep(sample(seq(1, 3, by=0.001), 1))
  read_html(.x)})

I get the following understandable error:

Error: 'lkjsadajf' does not exist in current working directory ('/home/user').

How can I use purrr::safely, or any other error-handling function, to produce a list containing the HTML of all working urls and NA for the ones that do not work?

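Edit

As an extension to the question above: the safely function produces a nested list. How can I process the output of safely so that it can be handled by rvest::html_nodes?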

Answer 1

One option is to wrap read_html with safely and set otherwise to NULL or NA:

library(rvest)  # read_html, html_nodes, html_text
library(dplyr)
library(purrr)
safe_html <- safely(read_html, otherwise = NULL)
h <- urls %>% 
       map(~{
         Sys.sleep(sample(seq(1, 3, by=0.001), 1))
         safe_html(.x)})
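
To make the shape of the output concrete, here is a minimal offline sketch of what a safely-wrapped function returns. It uses base R's log as a stand-in for read_html (an assumption made purely so the example runs without network access); safe_log and out are illustrative names, not part of the original answer.

```r
library(purrr)

# Stand-in for read_html (assumption): log() errors on non-numeric input
safe_log <- safely(log, otherwise = NULL)

# One input that succeeds, one that fails
out <- map(list(10, "a"), safe_log)

# Each element is a two-slot list: `result` and `error`
str(out[[1]])

# On failure, `result` holds the `otherwise` value and `error` holds the condition
is.null(out[[2]]$result)
inherits(out[[2]]$error, "error")
```

This nesting is why the elements of h cannot be passed to html_nodes directly: the parsed document sits in the result slot of each element.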

We can then discard the NULL elements and carry on:

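# Drop failed requests (NULL result), then build one data frame per page
discard(h, ~ is.null(.x$result)) %>%
  map_df(~ .x$result %>% {
    # Page title
    a <- html_nodes(., "#firstHeading") %>%
      html_text()

    # Table-of-contents entries
    b <- html_nodes(., ".toctext") %>%
      html_text()

    # Pad the shorter column with NA so the two can be bound side by side
    rowr::cbind.fill(a, b, fill = NA)
  })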
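
If the goal is a flat list with NA in place of the failed urls, as the question asks, one possible approach is purrr::possibly, or extracting the result slot from the safely output. The sketch below runs offline: fake_read is a hypothetical stand-in for read_html, and pages, nested and flat are illustrative names.

```r
library(purrr)

# Hypothetical stand-in for read_html (assumption) so the sketch runs offline
fake_read <- function(x) {
  if (!grepl("^https?://", x)) stop("'", x, "' does not exist")
  paste("<html>", x, "</html>")
}

urls <- list("https://en.wikipedia.org/wiki/Rome", "lkjsadajf")

# possibly() returns the value itself, or `otherwise` when the call errors
pages <- map(urls, possibly(fake_read, otherwise = NA))

# With safely(), extracting the `result` slot flattens the nested list
nested <- map(urls, safely(fake_read, otherwise = NA))
flat <- map(nested, "result")

identical(pages, flat)
```

With the real read_html, the non-NA elements of such a list would be the parsed documents, so the NA entries would still need to be filtered out or skipped before calling html_nodes.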

R: using purrr::safely to handle failed URLs when web scraping
