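How can I combine multiple CSV files into one big file without loading the actual files into the environment?

Asked: 2016-05-25 17:36:40

Tags: r csv merge bigdata

Is there a way to combine multiple CSV files into one large file without using the read.csv / read_csv functions?

I want to merge all the tables (CSVs) in a folder into a single CSV file, since each table represents a separate month. The folder looks like this:

list.files(folder)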

 [1] "2013-07 - Citi Bike trip data.csv" "2013-08 - Citi Bike trip data.csv" "2013-09 - Citi Bike trip data.csv"
 [4] "2013-10 - Citi Bike trip data.csv" "2013-11 - Citi Bike trip data.csv" "2013-12 - Citi Bike trip data.csv"
 [7] "2014-01 - Citi Bike trip data.csv" "2014-02 - Citi Bike trip data.csv" "2014-03 - Citi Bike trip data.csv"
[10] "2014-04 - Citi Bike trip data.csv" "2014-05 - Citi Bike trip data.csv" "2014-06 - Citi Bike trip data.csv"
[13] "2014-07 - Citi Bike trip data.csv" "2014-08 - Citi Bike trip data.csv" "201409-citibike-tripdata.csv"     
[16] "201410-citibike-tripdata.csv"      "201411-citibike-tripdata.csv"      "201412-citibike-tripdata.csv"     
[19] "201501-citibike-tripdata.csv"      "201502-citibike-tripdata.csv"      "201503-citibike-tripdata.csv"     
[22] "201504-citibike-tripdata.csv"      "201505-citibike-tripdata.csv"      "201506-citibike-tripdata.csv"     
[25] "201507-citibike-tripdata.csv"      "201508-citibike-tripdata.csv"      "201509-citibike-tripdata.csv"     
[28] "201510-citibike-tripdata.csv"      "201511-citibike-tripdata.csv"      "201512-citibike-tripdata.csv"     
[31] "201601-citibike-tripdata.csv"      "201602-citibike-tripdata.csv"      "201603-citibike-tripdata.csv"     

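I tried the following and got `data`, a large list of 33 elements taking up 3.6 GB. However, the whole process takes a while. Since the website is updated every month, the growing volume of data will make the merge even slower. Can someone help me combine all the data files without loading them into the environment? The data source can be found here: https://s3.amazonaws.com/tripdata/index.html

filenames <- list.files(folder, full.names = TRUE)
data <- lapply(filenames, read_csv)

The data looks like this, which is not the form I want. I would like one big table with all the information merged together.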

> head(data)
[[1]]
Source: local data frame [843,416 x 15]

   tripduration           starttime            stoptime start station id      start station name start station latitude
          (int)              (time)              (time)            (int)                   (chr)                  (dbl)
1           634 2013-07-01 00:00:00 2013-07-01 00:10:34              164         E 47 St & 2 Ave               40.75323
2          1547 2013-07-01 00:00:02 2013-07-01 00:25:49              388        W 26 St & 10 Ave               40.74972
3           178 2013-07-01 00:01:04 2013-07-01 00:04:02              293   Lafayette St & E 8 St               40.73029
4          1580 2013-07-01 00:01:06 2013-07-01 00:27:26              531  Forsyth St & Broome St               40.71894
5           757 2013-07-01 00:01:10 2013-07-01 00:13:47              382 University Pl & E 14 St               40.73493
6           861 2013-07-01 00:01:23 2013-07-01 00:15:44              511      E 14 St & Avenue B               40.72939
7           550 2013-07-01 00:01:59 2013-07-01 00:11:09              293   Lafayette St & E 8 St               40.73029
8           288 2013-07-01 00:02:16 2013-07-01 00:07:04              224   Spruce St & Nassau St               40.71146
9           766 2013-07-01 00:02:16 2013-07-01 00:15:02              432       E 7 St & Avenue A               40.72622
10          773 2013-07-01 00:02:23 2013-07-01 00:15:16              173      Broadway & W 49 St               40.76065
..          ...                 ...                 ...              ...                     ...                    ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
  station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)

[[2]]
Source: local data frame [1,001,958 x 15]

   tripduration           starttime            stoptime start station id        start station name start station latitude
          (int)              (time)              (time)            (int)                     (chr)                  (dbl)
1           664 2013-08-01 00:00:00 2013-08-01 00:11:04              449           W 52 St & 9 Ave               40.76462
2          2115 2013-08-01 00:00:01 2013-08-01 00:35:16              254           W 11 St & 6 Ave               40.73532
3           385 2013-08-01 00:00:03 2013-08-01 00:06:28              460        S 4 St & Wythe Ave               40.71286
4           653 2013-08-01 00:00:10 2013-08-01 00:11:03              398  Atlantic Ave & Furman St               40.69165
5           954 2013-08-01 00:00:11 2013-08-01 00:16:05              319       Park Pl & Church St               40.71336
6           145 2013-08-01 00:00:37 2013-08-01 00:03:02              521           8 Ave & W 31 St               40.75045
7           331 2013-08-01 00:01:25 2013-08-01 00:06:56             2000  Front St & Washington St               40.70255
8           194 2013-08-01 00:01:26 2013-08-01 00:04:40              313 Washington Ave & Park Ave               40.69610
9           598 2013-08-01 00:01:40 2013-08-01 00:11:38              528           2 Ave & E 31 St               40.74291
10          360 2013-08-01 00:01:45 2013-08-01 00:07:45              500        Broadway & W 51 St               40.76229
..          ...                 ...                 ...              ...                       ...                    ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
  station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)

4 Answers:

Answer 0 (score: 0)

You don't need to load each CSV into R. Combine the CSVs outside of R, then load the single combined file in one go. On Windows the same thing can be done from the command prompt (solution from here). If you have access to Unix commands (solution from here), this shell one-liner does the job:

nawk 'FNR==1 && NR!=1{next;}{print}' *.csv > master.csv
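
To see why that one-liner keeps only one header: in awk, `FNR` is the line number within the current file and `NR` is the line number across all input, so `FNR==1 && NR!=1` is true on the first line of every file except the first. Below is a minimal sketch on two made-up files (`a.csv` and `b.csv` are hypothetical, as is the temporary directory). It uses plain `awk` on the assumption that `nawk` may not be installed, and it writes the result under a name the `*.csv` glob cannot match, so a rerun does not read its own output.

```shell
#!/bin/sh
# Demo on made-up data: two CSV files that share the same header line.
workdir=$(mktemp -d)
printf 'id,value\n1,a\n2,b\n' > "$workdir/a.csv"
printf 'id,value\n3,c\n' > "$workdir/b.csv"

# FNR==1 && NR!=1 matches the first line of every file except the first,
# so repeated headers are skipped and every other line is printed.
awk 'FNR==1 && NR!=1 {next} {print}' "$workdir"/*.csv > "$workdir/master.out"

cat "$workdir/master.out"
```

Because awk streams line by line, memory use should stay small however large the monthly files get. Note that this assumes every file has exactly one header line and the same column order.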

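Answer 1 (score: 0)

You have a list of data frames. So if you want to bind those data frames into one big data frame, do this:

dplyr::bind_rows(data)

On the other hand, you can use cat to concatenate the CSVs themselves outside of R (as mentioned above). You can also run that from inside R: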

setwd(folder)
system("cat *.csv > full.csv")

The only problem is that the column header is repeated for every file you concatenate, which is probably not what you want.
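
One way around the repeated headers, sketched under a few assumptions: every file has a single header line, all files share the same columns, and each file ends with a newline (otherwise the last row of one file would run into the first row of the next). The `merge_csv` helper and the sample file names and values are made up for illustration; only `head`, `tail`, and shell redirection are used.

```shell
#!/bin/sh
# merge_csv OUTPUT INPUT... : hypothetical helper, not part of any tool.
merge_csv() {
  out=$1
  shift
  # Take the header line from the first input only.
  head -n 1 "$1" > "$out"
  # Append the data rows of every input, skipping each file's first line.
  for f in "$@"; do
    tail -n +2 "$f" >> "$out"
  done
}

# Demo on made-up data; the file names contain spaces like the real ones.
demo=$(mktemp -d)
printf 'tripduration,bikeid\n634,1\n' > "$demo/2013-07 - sample.csv"
printf 'tripduration,bikeid\n664,2\n' > "$demo/2013-08 - sample.csv"

# The output name does not end in .csv, so the glob cannot pick it up.
merge_csv "$demo/full.txt" "$demo"/*.csv
cat "$demo/full.txt"
```

Quoting `"$f"` and `"$@"` matters here because names such as "2013-07 - Citi Bike trip data.csv" contain spaces. Saved as a script, this could presumably be run from R with `system()` in the same way as the `cat` call above.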

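Answer 2 (score: 0)

You can use CMD and simply type:

C:\yourdirWhereCsvfilesExist> copy *.csv combinedfile.csv

You will then have a file named combinedfile.csv that contains all of the data. Note that copy concatenates the files as they are, so each file's header row is repeated in the output.

Hope this helps!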

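Answer 3 (score: 0)

I would use this:

library(data.table)

# Read every CSV under `path` with fread and stack them into one data.table.
multmerge <- function(path) {
  filenames <- list.files(path = path, pattern = "\\.csv$", full.names = TRUE)
  rbindlist(lapply(filenames, fread))
}

path <- "C:/Users/kkk/Desktop/test/test1"
mergeA <- multmerge(path)
# fwrite is data.table's fast writer and does not add a row-name column.
fwrite(mergeA, "mergeA.csv")

This solution was posted under a different thread as a way of merging multiple files.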