Using R

Date: 2018-05-31 12:54:28

Tags: r web-scraping download

I have a number of links to PDF files that I want to download with download.file inside a for loop. My solution works fine, but it stops as soon as it hits an error (quite a few of the files fail). I would like to add something to my download.file call that tells R to skip a file if the download produces an error, and to print a message with the name of the page where the error occurred.

I found that tryCatch might be a good solution here, but I am not entirely sure where to put it (I have tried many placements, and none of them worked).

Here is my code:

library(rvest)  # provides read_html, html_nodes, html_attr, html_text

for (i in seq_along(files)) {

  # Reads the html links
  html <- read_html(files[i])
  reads_name <- html_nodes(html, 'h1')
  name <- trimws(html_text(reads_name))

  # Extracts the pdf link from all links that the webpage contains
  webpagelinks <- html_attr(html_nodes(html, "a"), "href")
  extract_pdf_link <- webpagelinks[grepl("\\.pdf", webpagelinks)]

  # downloads the pdf file from the pdf link, here is where I get the error
  download.file(extract_pdf_link, destfile = paste0(name, "_paper.pdf"),
                mode = "wb")

  skip_with_message = simpleError('Did not work out')
  tryCatch(print(name), error = function(e) skip_with_message)

}

Any suggestions on how to solve this?

Many thanks!

1 Answer:

Answer 0 (score: 1)

Put the download.file call inside tryCatch. For example:

files <- c("http://foo.com/bar.pdf",
           "http://www.orimi.com/pdf-test.pdf",
           "http://bar.com/foo.pdf")
oldw <- getOption("warn")
options(warn = -1)
for (file in files) {
    tryCatch(download.file(file, tempfile(), mode = "wb", quiet = FALSE),
             error = function(e) print(paste(file, 'did not work out')))
}
options(warn = oldw)

I turn warnings off with options(warn = -1) at the start to suppress extraneous warning messages, and restore the previous setting at the end. This will give you output like:
# trying URL 'http://foo.com/bar.pdf'
# [1] "http://foo.com/bar.pdf did not work out"
# trying URL 'http://www.orimi.com/pdf-test.pdf'
# Content type 'application/pdf' length 20597 bytes (20 KB)
# ==================================================
# downloaded 20 KB

# trying URL 'http://bar.com/foo.pdf'
# [1] "http://bar.com/foo.pdf did not work out"