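For a given dataset, I have roughly 5 to 20 zip files, each of which may contain hundreds of CSVs. I would like to read all of the CSVs with fread without extracting them from the zip files. At the moment I download the zip files, extract them, and then process the CSVs, but this takes a lot of disk space and RAM.
Here is some sample data (borrowed from another SO question):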
write.csv(data.frame(x = 1:2, y = 1:2), tf1 <- tempfile(fileext = ".csv"))
write.csv(data.frame(x = 2:3, y = 2:3), tf2 <- tempfile(fileext = ".csv"))
write.csv(data.frame(x = 3:4, y = 3:4), tf3 <- tempfile(fileext = ".csv"))
zip(zipfile1 <- tempfile(fileext = ".zip"), files = c(tf1, tf2))
zip(zipfile2 <- tempfile(fileext = ".zip"), files = c(tf1, tf3))
zip(zipfile3 <- tempfile(fileext = ".zip"), files = c(tf2, tf3))
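Existing approach:
library(data.table)
# Extract every archive into the working directory, then read all the CSVs
for (i in dir(pattern = "\\.zip$")) {
  unzip(i)
}
lapply(list.files(pattern = "\\.csv$"), fread)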
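If extraction cannot be avoided, the disk footprint of the approach above can at least be limited by handling one archive at a time in a scratch directory and deleting the extracted files afterwards. The following is only a sketch: `read_zip_via_tempdir` and its `reader` argument are hypothetical names introduced for illustration, and base `read.csv` stands in for `fread` so that the example has no package dependencies.

```r
# Hypothetical helper (not from the original post): extract one archive into a
# scratch directory, read its CSVs, and remove the extracted files on exit.
read_zip_via_tempdir <- function(zip_path, reader = read.csv) {
  scratch <- tempfile("unzipped_")
  dir.create(scratch)
  # Delete the scratch directory once the function returns, even on error
  on.exit(unlink(scratch, recursive = TRUE), add = TRUE)
  unzip(zip_path, exdir = scratch)
  csv_files <- list.files(scratch, pattern = "\\.csv$",
                          recursive = TRUE, full.names = TRUE)
  lapply(csv_files, reader)
}

# Small demo archive; zip() relies on an external zip program being installed
demo_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:2, y = 1:2), demo_csv, row.names = FALSE)
demo_zip <- tempfile(fileext = ".zip")
zip(demo_zip, files = demo_csv, flags = "-jq")  # -j drops paths, -q is quiet

result <- read_zip_via_tempdir(demo_zip)
```

Passing `reader = data.table::fread` should give the same behavior with `fread`, since at that point the files exist on disk. Only one archive's contents are extracted at any time, which addresses disk space but not RAM.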
This is what I would like to do:
library(rio)
lapply(list.files(pattern = "\\.zip$"), import, fread = TRUE)
This gives the following output:
[[1]]
  V1 x y
1  1 2 2
2  2 3 3

[[2]]
  V1 x y
1  1 1 1
2  2 2 2

[[3]]
  V1 x y
1  1 1 1
2  2 2 2

Warning messages:
1: In parse_zip(file) :
  Zip archive contains multiple files. Attempting first file.
2: In parse_zip(file) :
  Zip archive contains multiple files. Attempting first file.
3: In parse_zip(file) :
  Zip archive contains multiple files. Attempting first file.
It seems that only the first CSV in each zip file is read. I have searched extensively but have not found a solution.
Answer 0 (score: 0)
library(stringr)
# First list the contents of the archive without extracting it:
list_of_txts <- unzip("your.zip", list = TRUE)[, 1]
# Keep only the CSV entries (I originally used this for ".xml" files)
list_of_txts <- list_of_txts[str_detect(list_of_txts, "\\.csv$")]
# Then loop over the entries, reading each one through a connection instead of unzipping:
final_data <- vector("list", length(list_of_txts))
for (i in seq_along(list_of_txts)) {
  conn <- unz("your.zip", list_of_txts[i])
  # Use a reader that accepts connections. This worked with readr::read_csv();
  # base read.csv() also takes a connection. Depending on the data.table
  # version, fread() may reject a connection object.
  final_data[[i]] <- read.csv(conn)
}
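The loop above handles a single archive. Since the question involves several zip files, the same idea can be wrapped in a function and applied to each of them. This is a minimal sketch using base R only: `read_zip_csvs` is a hypothetical helper name, and the demo archive is throwaway data created just for the example.

```r
# Hypothetical helper (not from the original answer): read every CSV entry of
# one zip archive through unz() connections, without extracting to disk.
read_zip_csvs <- function(zip_path) {
  entries <- unzip(zip_path, list = TRUE)$Name
  csv_entries <- entries[grepl("\\.csv$", entries, ignore.case = TRUE)]
  # read.csv() opens the unz() connection itself and closes it when done
  out <- lapply(csv_entries, function(entry) read.csv(unz(zip_path, entry)))
  names(out) <- basename(csv_entries)
  out
}

# Small demo archive; zip() relies on an external zip program being installed
csv1 <- tempfile(fileext = ".csv")
csv2 <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:2, y = 1:2), csv1, row.names = FALSE)
write.csv(data.frame(x = 3:4, y = 3:4), csv2, row.names = FALSE)
demo_zip <- tempfile(fileext = ".zip")
zip(demo_zip, files = c(csv1, csv2), flags = "-jq")  # -j drops paths, -q is quiet

tables <- read_zip_csvs(demo_zip)     # one data frame per CSV entry
combined <- do.call(rbind, tables)    # stack them; columns must match
```

For several archives, something like `lapply(list.files(pattern = "\\.zip$"), read_zip_csvs)` would return one list of tables per zip file. If `fread` itself is required, one possibility is `fread(cmd = paste("unzip -p", shQuote(zip_path), shQuote(entry)))`, assuming a command-line `unzip` is available on the PATH; that streams the entry to `fread` without writing it into the working directory.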