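Edit - short question: does httr have a finalizer that closes FTP connections?
I am using the httr package to download climate projection files from the FTP server of the NASA NEX project.
My script is: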
library(httr)

var  <- c("pr", "tasmin", "tasmax")
rcp  <- c("rcp45", "rcp85")
mod  <- c("inmcm4", "GFDL-CM3")
year <- seq(2040, 2080, 1)

for (v in var) {
  for (r in rcp) {
    # Remote directory for this scenario and variable
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', r, '/day/atmos/', v, '/r1i1p1/v1.0/')
    for (m in mod) {
      for (y in year) {
        nfile <- paste0(v, '_day_BCSD_', r, '_r1i1p1_', m, '_', y, '.nc')
        url1 <- paste0(url, nfile)
        destfile <- paste0('mypath', r, '/', v, '/', nfile)
        # Download one file straight to disk
        GET(url = url1,
            authenticate(user = 'NEXGDDP', password = '', type = "basic"),
            write_disk(path = destfile, overwrite = FALSE))
        Sys.sleep(0.5)
      }
    }
  }
}
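After a while the server drops my connection with the following error: "421 Too many connections from your internet address".
I read here that this is caused by the number of open connections, and that I should close them at each iteration (I am not sure that really makes sense!).
Is there a way to close the FTP connection with the httr package?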
Answer 0 (score: 2)
Suggested solution - set the maximum number of connections to the FTP server for httr:
> config(CURLOPT_MAXCONNECTS=5)
<request>
Options:
* CURLOPT_MAXCONNECTS: 5
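The httr package is a wrapper around the curl package. This matters because it abstracts the curl interface. In this case we want to change httr's behaviour by changing curl's configuration through that abstraction.
By default, httr handles automatic connection sharing across requests to the same website (curl handles are managed automatically), keeps cookies across requests, and uses an up-to-date root-level SSL certificate store. Here we do not control the FTP server, only the client's requests to it. We can therefore change curl's default behaviour through httr::config to reduce the number of simultaneous FTP requests.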
To retrieve the current options, we can run the following command:
> httr_options("ftp")
     httr                     libcurl                          type
49   ftp_account              CURLOPT_FTP_ACCOUNT              string
50   ftp_alternative_to_user  CURLOPT_FTP_ALTERNATIVE_TO_USER  string
51   ftp_create_missing_dirs  CURLOPT_FTP_CREATE_MISSING_DIRS  integer
52   ftp_filemethod           CURLOPT_FTP_FILEMETHOD           integer
53   ftp_response_timeout     CURLOPT_FTP_RESPONSE_TIMEOUT     integer
54   ftp_skip_pasv_ip         CURLOPT_FTP_SKIP_PASV_IP         integer
55   ftp_ssl_ccc              CURLOPT_FTP_SSL_CCC              integer
56   ftp_use_eprt             CURLOPT_FTP_USE_EPRT             integer
57   ftp_use_epsv             CURLOPT_FTP_USE_EPSV             integer
58   ftp_use_pret             CURLOPT_FTP_USE_PRET             integer
59   ftpport                  CURLOPT_FTPPORT                  string
60   ftpsslauth               CURLOPT_FTPSSLAUTH               integer
196  tftp_blksize             CURLOPT_TFTP_BLKSIZE             integer
To open the libcurl documentation for an option, we can call curl_docs("CURLOPT_FTP_ACCOUNT").
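httr request configuration
You can modify httr's global curl configuration with set_config(), or wrap an individual request in with_config(). In this case we want to limit the maximum number of connections to the FTP server.
Therefore: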
> httr_options("max")
     httr                  libcurl                       type
95   max_recv_speed_large  CURLOPT_MAX_RECV_SPEED_LARGE  number
96   max_send_speed_large  CURLOPT_MAX_SEND_SPEED_LARGE  number
97   maxconnects           CURLOPT_MAXCONNECTS           integer
98   maxfilesize           CURLOPT_MAXFILESIZE           integer
99   maxfilesize_large     CURLOPT_MAXFILESIZE_LARGE     number
100  maxredirs             CURLOPT_MAXREDIRS             integer
We can now look up curl_docs("CURLOPT_MAXCONNECTS") - this is the option we want.
Now we have to set it.
> config(CURLOPT_MAXCONNECTS=5)
<request>
Options:
* CURLOPT_MAXCONNECTS: 5
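Calling config() on its own only builds a configuration object, so it still has to be attached to a request or to the session. The sketch below shows one way this might look. It assumes httr accepts the short option name from the httr column of httr_options("max"), and fetch_file is a hypothetical helper, not part of httr.

```r
library(httr)

# Option name taken from the "httr" column of httr_options("max") (assumption).
conn_limit <- config(maxconnects = 5)

# Hypothetical helper: download one file with the limit applied to this request only.
fetch_file <- function(url, destfile) {
  GET(url,
      conn_limit,
      authenticate(user = "NEXGDDP", password = "", type = "basic"),
      write_disk(path = destfile, overwrite = FALSE))
}

# Alternatives, left commented out so nothing runs on load:
# set_config(conn_limit)                               # apply to all later requests
# with_config(conn_limit, fetch_file(url, destfile))   # apply while one call runs
```

Which of the three forms fits best depends on whether the limit should hold for one request, one block of code, or the whole session.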
Reference: https://cran.r-project.org/web/packages/httr/httr.pdf
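I know this is somewhat redundant; I include it to offer an alternative approach. Why? Because of network bandwidth there is a subtle issue here: running several FTP sessions at the same time may be slower than running them serially. My alternative is to run the R script below, or to use curl directly from the Unix shell command line.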
require(RCurl)
require(stringr)

# Credentials are passed as "user:password" (empty password here)
opts <- curlOptions(userpwd = "NEXGDDP:", netrc = TRUE)
rcpDir <- c("rcp45", "rcp85")
varDir <- c("pr", "tasmin", "tasmax")

for (rcp in rcpDir) {
  for (var in varDir) {
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', rcp, '/day/atmos/', var, '/r1i1p1/v1.0/')
    print(url)
    # List the remote directory and split the listing into file names
    filenames <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE, .opts = opts)
    filelist <- unlist(str_split(filenames, "\n"))
    filelist <- filelist[filelist != ""]
    # Keep the two models of interest for the years 2040, 2050, ..., 2080
    filesavg <- filelist[str_detect(filelist, "inmcm4_20[4-8]0|GFDL-CM3_20[4-8]0")]
    urlsavg <- str_c(url, filesavg)
    for (file in seq_along(urlsavg)) {
      fname <- str_c("data/", filesavg[file])
      # Skip files that were already downloaded
      if (!file.exists(fname)) {
        print(urlsavg[file])
        bin <- getBinaryURL(urlsavg[file], .opts = opts)
        writeBin(bin, fname)
        Sys.sleep(1)
      }
    }
  }
}
Running the script prints each directory URL followed by the files it downloads:
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"
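The log above lists one file per decade because the pattern in the script only matches years ending in 0. The base R sketch below checks this on a few made-up file names (the names, including the CCSM4 one, are illustrative and not taken from the server listing) and shows a hypothetical variant, all_years, that would match every year from 2040 to 2080.

```r
# Illustrative file names following the naming scheme seen in the log (not a real listing).
candidates <- c(
  "pr_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc",
  "pr_day_BCSD_rcp45_r1i1p1_inmcm4_2041.nc",
  "pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc",
  "pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2090.nc",
  "pr_day_BCSD_rcp45_r1i1p1_CCSM4_2050.nc"
)

# Pattern used in the script: two models, years 2040, 2050, ..., 2080 only.
decades <- "inmcm4_20[4-8]0|GFDL-CM3_20[4-8]0"
selected <- candidates[grepl(decades, candidates)]

# Hypothetical variant: same models, every year from 2040 through 2080.
all_years <- "(inmcm4|GFDL-CM3)_20([4-7][0-9]|80)"
selected_all <- candidates[grepl(all_years, candidates)]

print(selected)
print(selected_all)
```

With the first pattern only the 2040 and 2080 entries survive; the variant also keeps 2041 while still rejecting 2090 and the other model.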
Answer 1 (score: 1)
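(Not sure this should be an answer, but I could not fit all of this in a comment.)
To sum up, here are two alternative solutions that combine my approach with the one suggested by Technophobe. I am posting the final code for both in case it helps anyone who runs into the same problem.
The httr approach: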
library(httr)

# Configure a proxy, in case you are on an office/university network
set_config(use_proxy(url = 'http://~in_case_you_need_a_proxy', port = paste_here_port_no))

# Limit the number of simultaneous connections, as suggested by Technophobe (default is 5).
# config() alone only builds the setting; set_config() applies it to later requests.
set_config(config(maxconnects = 3))

var  <- c("pr", "tasmax", "tasmin")
rcp  <- c("rcp45", "rcp85")
mod  <- c("inmcm4", "GFDL-CM3")
year <- c(seq(2036, 2050, 1), seq(2061, 2080, 1))

for (v in var) {
  for (r in rcp) {
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', r, '/day/atmos/', v, '/r1i1p1/v1.0/')
    for (m in mod) {
      for (y in year) {
        nfile <- paste0(v, '_day_BCSD_', r, '_r1i1p1_', m, '_', y, '.nc')
        url1 <- paste0(url, nfile)
        destfile <- paste0('D:/destination_path/', r, '/', v, '/', nfile)
        GET(url = url1,
            authenticate(user = 'NEXGDDP', password = '', type = "basic"),
            write_disk(path = destfile, overwrite = FALSE))
        gc()          # run garbage collection between downloads
        Sys.sleep(1)  # pause before the next request
      }
    }
  }
}
The RCurl approach:
library(RCurl)

# Proxy is optional; credentials are passed as "user:password" (empty password here)
opts <- curlOptions(proxy = 'http://~in_case_you_need_a_proxy:paste_here_port_no',
                    userpwd = "NEXGDDP:", netrc = TRUE)

var  <- c("pr", "tasmax", "tasmin")
rcp  <- c("rcp45", "rcp85")
mod  <- c("inmcm4", "GFDL-CM3")
year <- c(seq(2036, 2050, 1), seq(2061, 2080, 1))

for (v in var) {
  for (r in rcp) {
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', r, '/day/atmos/', v, '/r1i1p1/v1.0/')
    for (m in mod) {
      for (y in year) {
        nfile <- paste0(v, '_day_BCSD_', r, '_r1i1p1_', m, '_', y, '.nc')
        url1 <- paste0(url, nfile)
        destfile <- paste0('D:/destination_path/', r, '/', v, '/', nfile)
        # Fetch the file into memory, then write it to disk
        bin <- getBinaryURL(url1, .opts = opts)
        writeBin(bin, destfile)
        Sys.sleep(1)
        gc()
      }
    }
  }
}
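Both approaches have been tested and work. The second one may still run into the 421 error, but only rarely (I downloaded more than 900 files, about 600 GB in total). I hope this is a useful reference for others working in this field.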
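Since the 421 error can still show up now and then, one option is to retry a failed download after a growing pause instead of stopping the whole run. The base R sketch below is only an illustration under that assumption: build_jobs and with_retry are hypothetical helpers, and the URL template is copied from the scripts above.

```r
# Hypothetical helper: one row per (variable, scenario, model, year) combination.
build_jobs <- function(var, rcp, mod, year) {
  jobs <- expand.grid(year = year, mod = mod, rcp = rcp, var = var,
                      stringsAsFactors = FALSE)
  jobs$file <- sprintf("%s_day_BCSD_%s_r1i1p1_%s_%d.nc",
                       jobs$var, jobs$rcp, jobs$mod, as.integer(jobs$year))
  jobs$url <- sprintf("ftp://ftp.nccs.nasa.gov/BCSD/%s/day/atmos/%s/r1i1p1/v1.0/%s",
                      jobs$rcp, jobs$var, jobs$file)
  jobs
}

# Hypothetical helper: call fetch() until it succeeds or max_tries is reached,
# doubling the pause after each failure (wait, 2 * wait, 4 * wait, ...).
with_retry <- function(fetch, max_tries = 5, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) {
      Sys.sleep(wait * 2^(attempt - 1))
    }
  }
  stop(result)  # re-signal the last error once all attempts are used up
}

jobs <- build_jobs(var = c("pr", "tasmax"), rcp = "rcp45",
                   mod = c("inmcm4", "GFDL-CM3"), year = 2040:2041)
print(jobs$url[1])
```

In either script above, the download call for one file could then be wrapped as with_retry(function() ...), leaving the surrounding loops unchanged.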