Question

This code attempts to download a page that does not exist:

url <- "https://en.wikipedia.org/asdfasdfasdf"
status_code <- download.file(url, destfile = "output.html", method = "libcurl")

It fails with a 404 error:

trying URL 'https://en.wikipedia.org/asdfasdfasdf'
Error in download.file(url, destfile = "output.html", method = "libcurl") :
  cannot open URL 'https://en.wikipedia.org/asdfasdfasdf'
In addition: Warning message:
In download.file(url, destfile = "output.html", method = "libcurl") :
  cannot open URL 'https://en.wikipedia.org/asdfasdfasdf': HTTP status was '404 Not Found'

But the status_code variable still contains 0, even though the documentation for download.file states that the returned value is:

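An (invisible) integer code, 0 for success and non-zero for failure. For the "wget" and "curl" methods this is the status code returned by the external program. The "internal" method can return 1, but will in most cases throw an error.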

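The result is the same if I use wget or curl as the download method. What am I missing here? Is the only option to call warnings() and parse the output?

I have seen other questions about using download.file, but none (that I could find) that actually retrieve the HTTP status code.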

Answer 1

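Your best option is probably to use the cURL library directly rather than through the download.file wrapper, which does not expose the full functionality of cURL. One way to do this is with the RCurl package (other packages such as httr, or system calls, can achieve the same thing). Using cURL directly gives you access to the cURL info, including the response code. For example: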

library(RCurl)
curl = getCurlHandle()
x = getURL("https://en.wikipedia.org/asdfasdfasdf", curl = curl)
write(x, 'output.html')
getCurlInfo(curl)$response.code
# [1] 404

Although the first option above is much cleaner, if you really want to use download.file, one possible way is to capture the warning with withCallingHandlers:

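# `url` is the address defined in the question
try(withCallingHandlers(
  download.file(url, destfile = "output.html", method = "libcurl"),
  warning = function(w) {
    # keep only the text that follows "HTTP status was "
    my.warning <<- sub(".+HTTP status was ", "", conditionMessage(w))
  }),
  silent = TRUE)

cat(my.warning)
# '404 Not Found'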
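
The warning-based approach can be wrapped in a small reusable function. The following is a minimal sketch, not a definitive implementation: the names parse_http_status and download_with_status are hypothetical, and it assumes download.file keeps reporting the status in a warning of the form "HTTP status was '404 Not Found'". That message format is not a documented interface and may differ between R versions and download methods.

```r
# Hypothetical helper: pull the three-digit HTTP status out of a warning
# message such as "cannot open URL '...': HTTP status was '404 Not Found'".
# Returns NA when the message does not mention an HTTP status.
parse_http_status <- function(msg) {
  m <- regmatches(msg, regexpr("HTTP status was '[0-9]{3}", msg))
  if (length(m) == 0) return(NA_integer_)
  as.integer(sub(".*'", "", m))
}

# Hypothetical wrapper: run download.file, silence its warning and error,
# and report whether it succeeded plus any HTTP status seen in a warning.
download_with_status <- function(url, destfile, ...) {
  status <- NA_integer_
  result <- tryCatch(
    withCallingHandlers(
      download.file(url, destfile = destfile, ...),
      warning = function(w) {
        parsed <- parse_http_status(conditionMessage(w))
        if (!is.na(parsed)) status <<- parsed
        invokeRestart("muffleWarning")
      }
    ),
    error = function(e) NULL  # a failed download yields NULL instead of stopping
  )
  list(ok = isTRUE(result == 0), http_status = status)
}

# Usage (needs network access):
# download_with_status("https://en.wikipedia.org/asdfasdfasdf",
#                      "output.html", method = "libcurl")
```

Keeping the parsing in its own function makes it easy to check against sample messages without touching the network.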

Answer 2

If you don't mind using a different approach, you could try GET from the httr package:

url_200 <- "https://en.wikipedia.org/wiki/R_(programming_language)"
url_404 <- "https://en.wikipedia.org/asdfasdfasdf"

# OK
raw_200 <- httr::GET(url_200)
raw_200$status_code
#> [1] 200

# Not found
raw_404 <- httr::GET(url_404)
raw_404$status_code
#> [1] 404

Created on 2019-01-02 by the reprex package (v0.2.1)
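
If the goal is still to save the page to disk, as in the question, httr can do that within the same request. This is a minimal sketch, assuming httr is installed; fetch_to_file is a hypothetical helper name, while GET, write_disk, status_code and http_error are httr functions:

```r
# Hypothetical helper: download `url` to `destfile` and return the HTTP status code.
fetch_to_file <- function(url, destfile) {
  # write_disk() writes the response body to the file instead of keeping it in memory
  resp <- httr::GET(url, httr::write_disk(destfile, overwrite = TRUE))
  code <- httr::status_code(resp)
  if (httr::http_error(resp)) {
    # For a 4xx/5xx response the file holds the server's error page, so discard it
    unlink(destfile)
  }
  code
}

# Usage (needs network access):
# fetch_to_file("https://en.wikipedia.org/asdfasdfasdf", "output.html")
```

Deleting the file on an error status is a design choice of this sketch; drop the unlink call if the error page itself is worth keeping.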

How can I capture the HTTP error code from a download.file request?
