R httr: downloading files from FTP fails with error 421 "Too many connections from your internet address"

Date: 2017-07-18 12:53:43

Tags: r ftp httr

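Edit - short question: does httr have a finalizer that closes FTP connections?

I am using the httr package to download climate projection files from the FTP server of the NASA NEX project.

My script is: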

library(httr)

var  <- c("pr", "tasmin", "tasmax")
rcp  <- c("rcp45", "rcp85")
mod  <- c("inmcm4", "GFDL-CM3")
year <- seq(2040, 2080, 1)

for (v in var) {
  for (r in rcp) {
    # directory on the FTP server for this scenario and variable
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', r, '/day/atmos/', v, '/r1i1p1/v1.0/')
    for (m in mod) {
      for (y in year) {
        nfile    <- paste0(v, '_day_BCSD_', r, '_r1i1p1_', m, '_', y, '.nc')
        url1     <- paste0(url, nfile)
        destfile <- paste0('mypath', r, '/', v, '/', nfile)
        GET(url = url1,
            authenticate(user = 'NEXGDDP', password = '', type = "basic"),
            write_disk(path = destfile, overwrite = FALSE))
        Sys.sleep(0.5)
      }
    }
  }
}

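After a while the server drops my connection with the following error: "421 Too many connections from your internet address".

I read here that this is caused by the number of open connections and that I should close them on each iteration (I am not sure this really makes sense!). Is there a way to close the FTP connection with the httr package?

2 Answers:

Answer 0: (score: 2)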

Proposed solution (short answer)

Set the maximum number of connections to the FTP server for httr:
> config(CURLOPT_MAXCONNECTS=5)
<request>
Options:
* CURLOPT_MAXCONNECTS: 5

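Explanation

Preamble:

The httr package is a wrapper around curl. This matters because httr abstracts the curl interface. Here, we want to change httr's behavior by modifying curl's configuration through that abstraction.

  • By default, httr handles automatic connection sharing across requests to the same website (curl handles are managed automatically), maintains cookies across requests, and uses an up-to-date root-level SSL certificate store.

In this case we do not control the FTP server, only the client's requests to it. We can therefore change curl's default behavior through httr::config to reduce the number of simultaneous FTP requests.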
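On the original question of closing the connection: httr keeps one curl handle per scheme, host and port in an internal pool and reuses it across requests. The following is a minimal sketch, assuming the installed httr version exports handle_find() and handle_reset() (see ?handle_pool). Whether dropping the pooled handle makes the server release the connection any sooner is an assumption and has not been verified against the NEX server.

```r
library(httr)

# Host used in the question; only scheme, host and port identify the pooled handle.
ftp_root <- "ftp://ftp.nccs.nasa.gov/"

# Look up (or lazily create) the pooled handle for this host.
h <- handle_find(ftp_root)
print(class(h))

# Drop the pooled handle so the next request to this host starts from a fresh one.
handle_reset(ftp_root)
```

In a download loop, handle_reset() could be called every few files instead of after every file, so that connection reuse is not lost entirely.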

Querying httr's curl FTP options

To retrieve the current options, we can run:

> httr_options("ftp")
                       httr                         libcurl    type
49              ftp_account             CURLOPT_FTP_ACCOUNT  string
50  ftp_alternative_to_user CURLOPT_FTP_ALTERNATIVE_TO_USER  string
51  ftp_create_missing_dirs CURLOPT_FTP_CREATE_MISSING_DIRS integer
52           ftp_filemethod          CURLOPT_FTP_FILEMETHOD integer
53     ftp_response_timeout    CURLOPT_FTP_RESPONSE_TIMEOUT integer
54         ftp_skip_pasv_ip        CURLOPT_FTP_SKIP_PASV_IP integer
55              ftp_ssl_ccc             CURLOPT_FTP_SSL_CCC integer
56             ftp_use_eprt            CURLOPT_FTP_USE_EPRT integer
57             ftp_use_epsv            CURLOPT_FTP_USE_EPSV integer
58             ftp_use_pret            CURLOPT_FTP_USE_PRET integer
59                  ftpport                 CURLOPT_FTPPORT  string
60               ftpsslauth              CURLOPT_FTPSSLAUTH integer
196            tftp_blksize            CURLOPT_TFTP_BLKSIZE integer 

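To access the libcurl documentation, we can call curl_docs("CURLOPT_FTP_ACCOUNT").

Modifying the httr request configuration

You can modify the global curl configuration with httr's set_config(), or wrap your request in with_config(). In this case we want to limit the maximum number of connections to the FTP server.

So: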

httr_options("max")
                    httr                      libcurl    type
95  max_recv_speed_large CURLOPT_MAX_RECV_SPEED_LARGE  number
96  max_send_speed_large CURLOPT_MAX_SEND_SPEED_LARGE  number
97           maxconnects          CURLOPT_MAXCONNECTS integer
98           maxfilesize          CURLOPT_MAXFILESIZE integer
99     maxfilesize_large    CURLOPT_MAXFILESIZE_LARGE  number
100            maxredirs            CURLOPT_MAXREDIRS integer 

We can now look up curl_docs("CURLOPT_MAXCONNECTS"), which is the option we want.

Now we have to set it.

> config(CURLOPT_MAXCONNECTS=5)
<request>
Options:
* CURLOPT_MAXCONNECTS: 5
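
A call to config() on its own only builds a configuration object; it takes effect once it is passed to set_config(), with_config(), or a request. The sketch below shows both routes, assuming the short httr name maxconnects from the httr_options("max") table is the spelling httr expects. libcurl documents CURLOPT_MAXCONNECTS as the size of its connection cache, so whether this setting alone prevents the 421 error on this server is not something this sketch verifies. The names url1 and destfile are taken from the question's script.

```r
library(httr)

# httr_options("max") lists the short httr name for CURLOPT_MAXCONNECTS as `maxconnects`.
cfg <- config(maxconnects = 5)

# Option 1: apply the setting to every later httr request in this session.
set_config(cfg)

# Option 2: apply it only while one expression runs.
# (Commented out so the sketch does not touch the network.)
# with_config(cfg, GET(url1, authenticate("NEXGDDP", ""), write_disk(destfile)))

# Remove the global setting again when finished.
reset_config()
```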

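Reference: https://cran.r-project.org/web/packages/httr/httr.pdf

Alternative RCurl approach

I know this is somewhat redundant; I include it to offer another approach. Why? Because of network bandwidth there is a subtle issue here: running several simultaneous FTP sessions may be slower than running them serially. My alternative is to run the R script below, or to use curl directly from the Unix shell command line.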

require(RCurl)
require(stringr)
opts = curlOptions(userpwd = "NEXGDDP:", netrc = TRUE)

rcpDir  = c("rcp45", "rcp85")
varDir  = c("pr", "tasmin", "tasmax")

for (rcp in rcpDir ) {
  for (var in varDir ) {
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', rcp, '/day/atmos/', var, '/r1i1p1/v1.0/')
    print(url)
    filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE, .opts = opts)
    filelist <- unlist(str_split(filenames, "\n"))
    filelist <- filelist[!filelist == ""]
    filesavg <- str_detect(filelist,
                          "inmcm4_20[4-8]0|GFDL-CM3_20[4-8]0")
    filesavg <- filelist[filesavg]
    filesavg
    urlsavg <- str_c(url, filesavg)

    for (file in seq_along(urlsavg)) {
      fname <- str_c("data/", filesavg[file])
      if (!file.exists(fname)) {
        print(urlsavg[file])
        bin <- getBinaryURL(urlsavg[file], .opts = opts)
        writeBin(bin, fname)
        Sys.sleep(1)
      }
    }
  }
}

Code output

> require(RCurl)
> require(stringr)
> opts = curlOptions(userpwd = "NEXGDDP:", netrc = TRUE)
> rcpDir  = c("rcp45", "rcp85")
> varDir  = c("pr", "tasmin", "tasmax")
> for (rcp in rcpDir ) {
+   for (var in varDir ) {
+     url <- paste0( 'ftp://ftp.nccs.nasa.gov/BCSD/', rcp, '/day/atmos/', var, '/r1i1p1/v1.0/', sep = '')
+     print(url)
+     filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE, .opts = opts)
+     filelist <- unlist(str_split(filenames, "\n"))
+     filelist <- filelist[!filelist == ""]
+     filesavg <- str_detect(filelist,
+                           "inmcm4_20[4-8]0|GFDL-CM3_20[4-8]0")
+     filesavg <- filelist[filesavg]
+     filesavg
+     urlsavg <- str_c(url, filesavg)
+ 
+     for (file in seq_along(urlsavg)) {
+       fname <- str_c("data/", filesavg[file])
+       if (!file.exists(fname)) {
+         print(urlsavg[file])
+         bin <- getBinaryURL(urlsavg[file], .opts = opts)
+         writeBin(bin, fname)
+         Sys.sleep(1)
+       }
+     }
+   }
+ }
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"

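Answer 1: (score: 1)

(Not sure this should be an answer, but I could not fit all of this in a comment.)

To sum up, here are two alternative solutions that combine my approach with the one proposed by Technophobe. I am posting the final code for both in case it helps anyone who runs into the same problem.

httr approach: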

library(httr)
# configure a proxy, in case you are on an office/university network
set_config(use_proxy(url = 'http://~in_case_you_need_a_proxy', port = paste_here_port_no))
# limit the number of simultaneous connections as suggested by Technophobe
# default is 5
# note: config() on its own only builds a config object; wrap it in set_config() to apply it
config(CURLOPT_MAXCONNECTS = 3)

var  <- c("pr", "tasmax", "tasmin")
rcp  <- c("rcp45", "rcp85")
mod  <- c("inmcm4", "GFDL-CM3")
year <- c(seq(2036, 2050, 1), seq(2061, 2080, 1))
for (v in var) {
  for (r in rcp) {
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', r, '/day/atmos/', v, '/r1i1p1/v1.0/')
    for (m in mod) {
      for (y in year) {
        nfile    <- paste0(v, '_day_BCSD_', r, '_r1i1p1_', m, '_', y, '.nc')
        url1     <- paste0(url, nfile)
        destfile <- paste0('D:/destination_path/', r, '/', v, '/', nfile)
        GET(url = url1,
            authenticate(user = 'NEXGDDP', password = '', type = "basic"),
            write_disk(path = destfile, overwrite = FALSE))
        gc()
        Sys.sleep(1)
      }
    }
  }
}

Alternative approach using RCurl:

library(RCurl)
opts <- curlOptions(proxy = 'http://~in_case_you_need_a_proxy:paste_here_port_no', userpwd = "NEXGDDP:", netrc = TRUE)

var  <- c("pr", "tasmax", "tasmin")
rcp  <- c("rcp45", "rcp85")
mod  <- c("inmcm4", "GFDL-CM3")
year <- c(seq(2036, 2050, 1), seq(2061, 2080, 1))
for (v in var) {
  for (r in rcp) {
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', r, '/day/atmos/', v, '/r1i1p1/v1.0/')
    for (m in mod) {
      for (y in year) {
        nfile    <- paste0(v, '_day_BCSD_', r, '_r1i1p1_', m, '_', y, '.nc')
        url1     <- paste0(url, nfile)
        destfile <- paste0('D:/destination_path/', r, '/', v, '/', nfile)
        # download the file into memory, then write it to disk
        bin <- getBinaryURL(url1, .opts = opts)
        writeBin(bin, destfile)
        Sys.sleep(1)
        gc()
      }
    }
  }
}

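Both approaches were tested and work. The second may still run into the 421 error, but only rarely (I downloaded more than 900 files, about 600 GB in total). I hope this is a useful reference for others working in this field.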
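
For the occasional 421 that still gets through, one option is to retry the failed download after a pause. The sketch below uses only base R. The helper name retry_with_backoff and its arguments are hypothetical and not part of httr or RCurl, and the sketch assumes a refused transfer surfaces as an R error from the download call.

```r
# Hypothetical helper: call `fetch()` and, if it throws an error, wait and
# try again, doubling the wait after each failed attempt.
retry_with_backoff <- function(fetch, max_tries = 5, first_wait = 2) {
  wait <- first_wait
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt == max_tries) {
      stop(result)  # give up and re-signal the last error
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result),
            "; retrying in ", wait, " s")
    Sys.sleep(wait)
    wait <- wait * 2
  }
}

# Usage inside the RCurl loop (commented out: needs network access):
# bin <- retry_with_backoff(function() getBinaryURL(url1, .opts = opts))
```

Doubling the wait gives the server time to expire stale connections before the next attempt, instead of hitting it again at a fixed short interval.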