How can I download the contents of a web page, find all the files listed on it with a particular extension, and then download each of them? For example, I want to download all the netCDF files (extension *.nc4) from this page: https://data.giss.nasa.gov/impacts/agmipcf/agmerra/
I was advised to look at the RCurl package, but I could not figure out how to do this with it.
Answer 0 (score: 1)
library(stringr)
# Get the content of the page
thepage = readLines('https://data.giss.nasa.gov/impacts/agmipcf/agmerra/')
# Find the lines that contain names of netCDF files
nc4.lines <- grep('\\.nc4', thepage)
# Subset the original dataset leaving only those lines
thepage <- thepage[nc4.lines]
# Locate the file names: from the leading "A" of the name up to the closing quote of the href
str.loc <- str_locate(thepage, 'A.*nc4"')
# Extract the substring, dropping the trailing quote
file.list <- substring(thepage, str.loc[, 1], str.loc[, 2] - 1)
# Download all files (mode = "wb" keeps the binary netCDF files intact)
for (ifile in file.list) {
  download.file(paste0("https://data.giss.nasa.gov/impacts/agmipcf/agmerra/", ifile),
                destfile = ifile, method = "libcurl", mode = "wb")
}
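If matching the raw HTML with regular expressions feels brittle, here is a minimal alternative sketch using the rvest package (my assumption; the original answer only uses stringr and base R). It parses the page, collects the href of every link, and keeps those ending in .nc4:

library(rvest)

base_url <- "https://data.giss.nasa.gov/impacts/agmipcf/agmerra/"

# Parse the directory listing and collect the href attribute of every <a> tag
page  <- read_html(base_url)
links <- html_attr(html_nodes(page, "a"), "href")

# Keep only the links that end in .nc4
nc4_files <- grep("\\.nc4$", links, value = TRUE)

# Download each file into the working directory (mode = "wb" for binary files)
for (f in nc4_files) {
  download.file(paste0(base_url, f), destfile = basename(f), mode = "wb")
}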