Error handling in web scraping using tryCatch

Date: 2016-09-27 09:52:36

Tags: r

I have two links:

> primary_link
[1] "https://en.wikipedia.org/wiki/Kahaani_(2012_film)"
> secondary_link
[1] "https://en.wikipedia.org/wiki/Kahaani"

For the primary link I get an error:

read_html(primary_link)
Error in open.connection(x, "rb") : HTTP error 404.

But the secondary link I am able to read perfectly fine.

Using tryCatch I am trying to write an error handler of the form: if the primary link errors out, try the secondary link.

The code I am trying is this:

web_page <- tryCatch(read_html(primary_link),finally = read_html(secondary_link))

Any help is greatly appreciated.

3 Answers:

Answer 0 (score: 2)

You can also use the http_error function to determine whether a page is accessible; it returns TRUE if an error occurs.

primary_link <- "https://en.wikipedia.org/wiki/Kahaani_(2012_film)"
secondary_link <- "https://en.wikipedia.org/wiki/Kahaani"

library(httr)
urls <- c(primary_link, secondary_link)

sapply(urls, http_error, config(followlocation = 0L), USE.NAMES = F)
###[1]  TRUE FALSE
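
Building on that check, here is a minimal sketch (an illustration under the assumptions above, not part of the original answer) of falling back to the secondary link before parsing with rvest:

library(httr)
library(rvest)

# Sketch: read whichever of the two URLs does not return an HTTP error.
# Assumes primary_link and secondary_link as defined above.
target <- if (http_error(primary_link)) secondary_link else primary_link
web_page <- read_html(target)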

Answer 1 (score: 1)

If you want to go this route, then I think the appropriate pattern is to make the second read_html call inside tryCatch's error handler, so it only runs when the first link errors out:
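
A minimal sketch of that pattern, assuming rvest's read_html and the two links defined above:

library(rvest)

# Sketch: if reading the primary link throws an error (e.g. HTTP 404),
# the error handler falls back to the secondary link.
web_page <- tryCatch(
  read_html(primary_link),
  error = function(e) read_html(secondary_link)
)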

Answer 2 (score: 1)

tryCatch can make for some contorted code; an alternative can now be found in the purrr package. Also, since you will no doubt use this code more than once, you should wrap it in a function:

library(purrr)
library(httr)

primary_link <- "https://en.wikipedia.org/wiki/Kahaani_(2012_film)"
secondary_link <- "https://en.wikipedia.org/wiki/Kahaani"

GET_alt <- function(url_1, url_2, .verbose = TRUE) {

  # this wraps httr::GET in exception handling code in the
  # event the site is completely inaccessible and not just
  # issuing 40x errors
  sGET <- purrr::safely(GET)

  res <- sGET(url_1)

  # Now, check whether it had a severe error or just didn't
  # retrieve the content successfully, and fetch the alternate
  # URL if so ("||" short-circuits, so status_code() is only
  # called when a result actually exists)
  if (is.null(res$result) || (status_code(res$result) != 200)) {
    if (.verbose) message("Using alternate URL")
    res <- sGET(url_2)
  }

  # I'd do other error handling here besides just issuing a
  # warning, but I have no idea what you're doing, so we'll
  # just issue a warning
  if (!is.null(res$result)) {
    warn_for_status(res$result)
  }

  return(res$result)

}

GET_alt(primary_link, secondary_link)
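
As an illustrative follow-up (my assumption, not part of the original answer), the httr response returned by GET_alt can then be parsed into an rvest document:

library(rvest)

res <- GET_alt(primary_link, secondary_link)

# Parse the HTML body of the successful response, if there is one.
if (!is.null(res)) {
  web_page <- read_html(content(res, as = "text", encoding = "UTF-8"))
}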
