R: Reading multiple files from a zip without extracting

Time: 2016-11-10 20:43:56

Tags: r csv io zip bigdata

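For a given dataset, I have roughly 5 to 20 zip files, each of which may contain hundreds of CSVs. I would like to read all of the CSVs with fread without extracting them from the zip files. Currently I download the zip files, extract them, and then process the CSVs, but this takes a lot of disk space and RAM.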

Here is some sample data (borrowed from another SO question):

write.csv(data.frame(x = 1:2, y = 1:2), tf1 <- tempfile(fileext = ".csv"))
write.csv(data.frame(x = 2:3, y = 2:3), tf2 <- tempfile(fileext = ".csv"))
write.csv(data.frame(x = 3:4, y = 3:4), tf3 <- tempfile(fileext = ".csv"))
zip(zipfile <- tempfile(fileext = ".zip"), files = c(tf1, tf2))
zip(zipfile <- tempfile(fileext = ".zip"), files = c(tf1, tf3))
zip(zipfile <- tempfile(fileext = ".zip"), files = c(tf2, tf3))

Existing approach:

library(data.table)  # provides fread
for (i in dir(pattern = "\\.zip$"))
    unzip(i)
lapply(list.files(pattern = "\\.csv$"), fread)  # pattern is a regex, not a glob

This is what I would like to do:

library(rio)
lapply(list.files(pattern = "\\.zip$"), import, fread = TRUE)

This produces the following output:

[[1]]
  V1 x y
1  1 2 2
2  2 3 3

[[2]]
  V1 x y
1  1 1 1
2  2 2 2

[[3]]
  V1 x y
1  1 1 1
2  2 2 2

Warning messages:
1: In parse_zip(file) :
  Zip archive contains multiple files. Attempting first file.
2: In parse_zip(file) :
  Zip archive contains multiple files. Attempting first file.
3: In parse_zip(file) :
  Zip archive contains multiple files. Attempting first file.

It seems that only the first CSV in each zip file is read. I have searched extensively but have not found a solution.

1 answer:

Answer 0 (score: 0)

library(stringr)
# First, list the contents of the archive without extracting it:

list_of_txts <- unzip("your.zip", list = TRUE)[, 1]
# The answer originally matched ".xml"; match ".csv" since you are looking for CSV files
list_of_txts <- list_of_txts[str_detect(list_of_txts, "\\.csv$")]

# Then loop over the entries without unzipping:

final_data <- vector("list", length(list_of_txts))
for (i in seq_along(list_of_txts)) {
  conn <- unz("your.zip", list_of_txts[i])
  # Replace read.csv with the reader you want to use; this worked with readr::read_csv().
  # fread() may reject a connection object, depending on the data.table version.
  final_data[[i]] <- read.csv(conn)
}
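
The loop above handles a single archive. As a rough sketch of how the same idea could be extended to every zip file in a directory, something like the following may work. The helper name read_zip_csvs and its reader argument are made up for illustration; only base R functions (unzip, unz, read.csv) are used.

```r
# Hypothetical helper (not part of any package): read every CSV entry of one
# zip archive into a named list of data frames. Each entry is streamed through
# an unz() connection, so nothing is extracted to disk.
read_zip_csvs <- function(zip_path, reader = read.csv, ...) {
  # List the archive contents without extracting
  entries <- unzip(zip_path, list = TRUE)$Name
  # Keep only entries whose name ends in .csv
  entries <- entries[grepl("\\.csv$", entries, ignore.case = TRUE)]
  # read.csv opens an unopened connection itself and closes it when done
  out <- lapply(entries, function(entry) reader(unz(zip_path, entry), ...))
  names(out) <- entries
  out
}

# Apply the helper to every archive in the working directory
zips <- list.files(pattern = "\\.zip$")
all_data <- lapply(zips, read_zip_csvs)
names(all_data) <- zips
```

The result is a list with one element per archive, each holding one data frame per CSV. If fread specifically is required, one option that should avoid the connection problem is fread(cmd = paste("unzip -p", shQuote(zip_path), shQuote(entry))), assuming the unzip command-line tool is available on the PATH; that variant is untested here.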