Is there a way to combine multiple CSV files into one master file without using the read.csv / read_csv functions?
I want to merge all the tables (CSVs) in a folder into a single CSV file, since each table represents one month of data. The folder looks like this:
list.files(folder)
[1] "2013-07 - Citi Bike trip data.csv" "2013-08 - Citi Bike trip data.csv" "2013-09 - Citi Bike trip data.csv"
[4] "2013-10 - Citi Bike trip data.csv" "2013-11 - Citi Bike trip data.csv" "2013-12 - Citi Bike trip data.csv"
[7] "2014-01 - Citi Bike trip data.csv" "2014-02 - Citi Bike trip data.csv" "2014-03 - Citi Bike trip data.csv"
[10] "2014-04 - Citi Bike trip data.csv" "2014-05 - Citi Bike trip data.csv" "2014-06 - Citi Bike trip data.csv"
[13] "2014-07 - Citi Bike trip data.csv" "2014-08 - Citi Bike trip data.csv" "201409-citibike-tripdata.csv"
[16] "201410-citibike-tripdata.csv" "201411-citibike-tripdata.csv" "201412-citibike-tripdata.csv"
[19] "201501-citibike-tripdata.csv" "201502-citibike-tripdata.csv" "201503-citibike-tripdata.csv"
[22] "201504-citibike-tripdata.csv" "201505-citibike-tripdata.csv" "201506-citibike-tripdata.csv"
[25] "201507-citibike-tripdata.csv" "201508-citibike-tripdata.csv" "201509-citibike-tripdata.csv"
[28] "201510-citibike-tripdata.csv" "201511-citibike-tripdata.csv" "201512-citibike-tripdata.csv"
[31] "201601-citibike-tripdata.csv" "201602-citibike-tripdata.csv" "201603-citibike-tripdata.csv"
I tried the following and got `data`, a large list of 33 elements taking up about 3.6 GB. However, the whole process takes quite a while, and since the site is updated every month, the growing volume of data will only make the merge slower. Can someone therefore help me combine all the data files without loading them into the environment? The data source can be found here: https://s3.amazonaws.com/tripdata/index.html.
library(readr)  # read_csv() comes from the readr package
filenames <- list.files(folder, full.names = TRUE)
data <- lapply(filenames, read_csv)
The resulting object looks like this, which is not the form I want. I would like one big table with all the information combined.
> head(data)
[[1]]
Source: local data frame [843,416 x 15]
tripduration starttime stoptime start station id start station name start station latitude
(int) (time) (time) (int) (chr) (dbl)
1 634 2013-07-01 00:00:00 2013-07-01 00:10:34 164 E 47 St & 2 Ave 40.75323
2 1547 2013-07-01 00:00:02 2013-07-01 00:25:49 388 W 26 St & 10 Ave 40.74972
3 178 2013-07-01 00:01:04 2013-07-01 00:04:02 293 Lafayette St & E 8 St 40.73029
4 1580 2013-07-01 00:01:06 2013-07-01 00:27:26 531 Forsyth St & Broome St 40.71894
5 757 2013-07-01 00:01:10 2013-07-01 00:13:47 382 University Pl & E 14 St 40.73493
6 861 2013-07-01 00:01:23 2013-07-01 00:15:44 511 E 14 St & Avenue B 40.72939
7 550 2013-07-01 00:01:59 2013-07-01 00:11:09 293 Lafayette St & E 8 St 40.73029
8 288 2013-07-01 00:02:16 2013-07-01 00:07:04 224 Spruce St & Nassau St 40.71146
9 766 2013-07-01 00:02:16 2013-07-01 00:15:02 432 E 7 St & Avenue A 40.72622
10 773 2013-07-01 00:02:23 2013-07-01 00:15:16 173 Broadway & W 49 St 40.76065
.. ... ... ... ... ... ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)
[[2]]
Source: local data frame [1,001,958 x 15]
tripduration starttime stoptime start station id start station name start station latitude
(int) (time) (time) (int) (chr) (dbl)
1 664 2013-08-01 00:00:00 2013-08-01 00:11:04 449 W 52 St & 9 Ave 40.76462
2 2115 2013-08-01 00:00:01 2013-08-01 00:35:16 254 W 11 St & 6 Ave 40.73532
3 385 2013-08-01 00:00:03 2013-08-01 00:06:28 460 S 4 St & Wythe Ave 40.71286
4 653 2013-08-01 00:00:10 2013-08-01 00:11:03 398 Atlantic Ave & Furman St 40.69165
5 954 2013-08-01 00:00:11 2013-08-01 00:16:05 319 Park Pl & Church St 40.71336
6 145 2013-08-01 00:00:37 2013-08-01 00:03:02 521 8 Ave & W 31 St 40.75045
7 331 2013-08-01 00:01:25 2013-08-01 00:06:56 2000 Front St & Washington St 40.70255
8 194 2013-08-01 00:01:26 2013-08-01 00:04:40 313 Washington Ave & Park Ave 40.69610
9 598 2013-08-01 00:01:40 2013-08-01 00:11:38 528 2 Ave & E 31 St 40.74291
10 360 2013-08-01 00:01:45 2013-08-01 00:07:45 500 Broadway & W 51 St 40.76229
.. ... ... ... ... ... ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)
Answer 0 (score: 0)
You don't need to load each CSV into R. Combine the CSVs outside of R, then load the single combined file all at once. If you have access to unix commands (solution from here), this awk one-liner does the job:

nawk 'FNR==1 && NR!=1{next;}{print}' *.csv > master.csv

The FNR==1 && NR!=1 condition skips the header row of every file after the first, so master.csv ends up with a single header. (For the Windows command prompt, see the copy-based approach in Answer 2 below.)
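Once the files are combined, a single read back into R replaces the 33 separate reads. A minimal sketch, assuming the master.csv produced above and the data.table package:

library(data.table)
trips <- fread("master.csv")  # one fast read of the combined file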
Answer 1 (score: 0)
You have a list of data frames. So if you want to bind those data frames into one big data frame, do the following:
dplyr::bind_rows(data)
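For example, assuming data is the list produced by lapply(filenames, read_csv) in the question, a minimal sketch of the full round trip might be:

library(dplyr)
library(readr)

all_trips <- bind_rows(data)            # stack the 33 data frames into one
write_csv(all_trips, "all_trips.csv")   # write the merged table back out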
Alternatively, you can concatenate the CSVs themselves outside of R using cat (as mentioned above), and you can call it from inside R like this:
setwd(folder)
system("cat *.csv > full.csv")
The only problem is that the column headers of every file you concatenate will be repeated, which is probably not what you want.
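If that matters, one way around it (a sketch, assuming GNU head/tail are available and that every file shares the same header) is to keep only the first file's header and append the rest without theirs:

setwd(folder)
# take the header row from the first monthly file (name taken from the listing above)
system("head -n 1 '2013-07 - Citi Bike trip data.csv' > combined.out")
# append every file without its first line; the .out extension keeps the output
# from being re-matched by the *.csv glob
system("tail -q -n +2 *.csv >> combined.out")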
Answer 2 (score: 0)
You can use CMD and simply write:
C:\yourdirWhereCsvfilesExist\copy *.csv combinedfile.csv
You will then have a single file called combinedfile.csv containing all of the data.
Hope this helps!
Answer 3 (score: 0)
I would use this:
library(data.table)

# read every CSV in `path` with fread() and stack the results into one data.table
multmerge <- function(path){
  filenames <- list.files(path = path, full.names = TRUE)
  rbindlist(lapply(filenames, fread))
}

path <- "C:/Users/kkk/Desktop/test/test1"
mergeA <- multmerge(path)
write.csv(mergeA, "mergeA.csv")
This solution was posted in a different thread as a way to merge multiple files.
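If memory is the main concern (the merged data is already around 3.6 GB), a possible low-memory variant, not part of the original answer, is to stream each file through fread() and append it to a single output file with fwrite(), so only one month is held in memory at a time. append_all below is a hypothetical helper name:

library(data.table)

# append every CSV under `path` into one output file, one file at a time
append_all <- function(path, out = "combined_trips.csv") {
  filenames <- list.files(path = path, full.names = TRUE, pattern = "\\.csv$")
  for (i in seq_along(filenames)) {
    dt <- fread(filenames[i])
    # the header is written only for the first file; fwrite(append = TRUE)
    # omits column names when the target file already exists
    fwrite(dt, out, append = (i > 1))
  }
  invisible(out)
}

append_all("C:/Users/kkk/Desktop/test/test1")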