My question is: how do I download all of the files on a website in R? I know how to download them one at a time, but not all at once. For example:
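Downloading one file by hand is easy enough, e.g. something along these lines (a minimal sketch only; the file name is just for illustration):

## one-at-a-time download of a single named zip file
url <- "http://www2.census.gov/geo/docs/maps-data/data/rel/t00t10/"
download.file(paste0(url, "TAB2000_TAB2010_ST_01_v2.zip"),
              destfile = "TAB2000_TAB2010_ST_01_v2.zip")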
Answer 0 (score: 11)
I tested this on a small subset (3) of the 56 files on the page, and it works fine.
## your base url
url <- "http://www2.census.gov/geo/docs/maps-data/data/rel/t00t10/"
## query the url to get all the file names ending in '.zip'
zips <- XML::getHTMLLinks(
    url,
    xpQuery = "//a/@href['.zip'=substring(., string-length(.) - 3)]"
)
## create a new directory 'myzips' to hold the downloads
dir.create("myzips")
## save the current directory path for later
wd <- getwd()
## change working directory for the download
setwd("myzips")
## create all the new files
file.create(zips)
## download them all
lapply(paste0(url, zips), function(x) download.file(x, basename(x)))
## reset working directory to original
setwd(wd)
All the zip files are now in the myzips directory and can be processed further. As an alternative to lapply(), you could also use a for() loop.
## download them all
for(u in paste0(url, zips)) download.file(u, basename(u))
And of course, setting quiet = TRUE may be nice, since we're downloading 56 files.
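For instance, the for() loop above with the per-file console output suppressed (a minimal variant of the same call):

## download them all quietly
for(u in paste0(url, zips)) download.file(u, basename(u), quiet = TRUE)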
Answer 1 (score: 5)
A slightly different approach.
library(rvest)
library(httr)
library(pbapply)
library(stringi)

URL <- "http://www2.census.gov/geo/docs/maps-data/data/rel/t00t10/"

## read the directory listing and keep the link targets ending in '.zip'
pg <- read_html(URL)
zips <- grep("zip$", html_attr(html_nodes(pg, "a[href^='TAB']"), "href"), value=TRUE)

## download each zip to disk, with a progress bar courtesy of pbapply
invisible(pbsapply(zips, function(zip_file) {
  GET(URL %s+% zip_file, write_disk(zip_file))
}))
You get a progress bar for free, plus built-in "caching" (write_disk won't overwrite files that have already been downloaded). You can weave in Richard's excellent code for the dir creation & file checking.
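One way to weave that in might look like the following sketch (untested; it reuses the 'myzips' folder name from the first answer and skips files that already exist):

## hypothetical combination with the dir creation & file checking
if (!dir.exists("myzips")) dir.create("myzips")
invisible(pbsapply(zips, function(zip_file) {
  dest <- file.path("myzips", zip_file)
  if (!file.exists(dest)) GET(URL %s+% zip_file, write_disk(dest))
}))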
Answer 2 (score: 0)
If you are able to use Python 3, I managed to get this code working for a similar census site. It's not very elegant, since I had to hard-code all the state codes, but it does the job:
import wget

root_url = 'https://www2.census.gov/geo/docs/maps-data/data/rel/t00t10/TAB2000_TAB2010_ST_'

# two-digit state codes that appear in the zip file names on the census site
states = ["01", "02", "04", "05", "06", "08", "09", "10",
          "11", "12", "13", "15", "16", "17", "18",
          "19", "20", "21", "22", "23", "24", "25", "26",
          "27", "28", "29", "30", "31", "32", "33", "34",
          "35", "36", "37", "38", "39", "40", "41", "42",
          "44", "45", "46", "47", "48", "49", "50", "51",
          "53", "54", "55", "56", "72"]
ext = '_v2.zip'

# download each state's zip file into the current directory
for state in states:
    print("downloading state number " + state)
    fname = state + ext
    wget.download(root_url + fname)