加快大型数据集的循环

时间:2016-01-07 21:00:50

标签: r for-loop vectorization subset

我正在对Citibike数据进行分析,并在帮助下创建了一个完美运行的代码 - 它提取了从自行车最后停止的不同站点开始的所有自行车行程(即所有{{ {1}}与之前的start.station.id不同,这意味着自行车是用卡车移动的。然而,月度数据集非常大,夏季月份包含超过100万次单独旅行(你可以找到它们)在这里:Citibike data

以下是数据集的快照:


head(Nov2015, n = 20)
   tripduration          starttime           stoptime start.station.id                start.station.name
1          1110 11/1/2015 00:00:00 11/1/2015 00:18:31              537           Lexington Ave & E 24 St
2          1094 11/1/2015 00:00:01 11/1/2015 00:18:15              537           Lexington Ave & E 24 St
3           520 11/1/2015 00:00:05 11/1/2015 00:08:45              536                   1 Ave & E 30 St
4           753 11/1/2015 00:00:15 11/1/2015 00:12:48              229                    Great Jones St
5           353 11/1/2015 00:00:22 11/1/2015 00:06:15              285                Broadway & E 14 St
6          1285 11/1/2015 00:00:22 11/1/2015 00:21:48              268             Howard St & Centre St
7           477 11/1/2015 00:00:25 11/1/2015 00:08:23              379                   W 31 St & 7 Ave
8           362 11/1/2015 00:00:28 11/1/2015 00:06:30              407              Henry St & Poplar St
9          2316 11/1/2015 00:00:37 11/1/2015 00:39:14              147          Greenwich St & Warren St
10          627 11/1/2015 00:00:42 11/1/2015 00:11:10              521                   8 Ave & W 31 St
11         2304 11/1/2015 00:00:44 11/1/2015 00:39:08              147          Greenwich St & Warren St
12         1471 11/1/2015 00:01:04 11/1/2015 00:25:35              281 Grand Army Plaza & Central Park S
13         1484 11/1/2015 00:01:36 11/1/2015 00:26:21              281 Grand Army Plaza & Central Park S
14          284 11/1/2015 00:01:36 11/1/2015 00:06:20              247            Perry St & Bleecker St
15          886 11/1/2015 00:01:39 11/1/2015 00:16:25              492                   W 33 St & 7 Ave
16          886 11/1/2015 00:01:42 11/1/2015 00:16:28              492                   W 33 St & 7 Ave
17         1379 11/1/2015 00:01:44 11/1/2015 00:24:44              512                   W 29 St & 9 Ave
18          179 11/1/2015 00:01:47 11/1/2015 00:04:47              319              Fulton St & Broadway
19          309 11/1/2015 00:01:51 11/1/2015 00:07:00              160           E 37 St & Lexington Ave
20          616 11/1/2015 00:02:08 11/1/2015 00:12:24              479                   9 Ave & W 45 St
   start.station.latitude start.station.longitude end.station.id         end.station.name
1                40.74026               -73.98409            531   Forsyth St & Broome St
2                40.74026               -73.98409            531   Forsyth St & Broome St
3                40.74144               -73.97536            498       Broadway & W 32 St
4                40.72743               -73.99379            328  Watts St & Greenwich St
5                40.73455               -73.99074            151 Cleveland Pl & Spring St
6                40.71911               -73.99973            476          E 31 St & 3 Ave
7                40.74916               -73.99160            546     E 30 St & Park Ave S
8                40.70047               -73.99145            310      State St & Smith St
9                40.71542               -74.01122            441          E 52 St & 2 Ave
10               40.75097               -73.99444            285       Broadway & E 14 St
11               40.71542               -74.01122            441          E 52 St & 2 Ave
12               40.76440               -73.97371            367  E 53 St & Lexington Ave
13               40.76440               -73.97371            367  E 53 St & Lexington Ave
14               40.73535               -74.00483            453          W 22 St & 8 Ave
15               40.75020               -73.99093            377         6 Ave & Canal St
16               40.75020               -73.99093            377         6 Ave & Canal St
17               40.75007               -73.99839            445       E 10 St & Avenue A
18               40.71107               -74.00945            264     Maiden Ln & Pearl St
19               40.74824               -73.97831            362       Broadway & W 37 St
20               40.76019               -73.99126            440          E 45 St & 3 Ave
   end.station.latitude end.station.longitude bikeid   usertype birth.year gender
1              40.71894             -73.99266  22545 Subscriber       1981      2
2              40.71894             -73.99266  23959 Subscriber       1980      1
3              40.74855             -73.98808  22251 Subscriber       1988      1
4              40.72406             -74.00966  15869 Subscriber       1981      1
5              40.72210             -73.99725  21645 Subscriber       1987      1
6              40.74394             -73.97966  14788   Customer         NA      0
7              40.74445             -73.98304  21128 Subscriber       1962      2
8              40.68927             -73.98913  21016 Subscriber       1978      1
9              40.75601             -73.96742  24117 Subscriber       1988      2
10             40.73455             -73.99074  17048 Subscriber       1986      2
11             40.75601             -73.96742  18241 Subscriber       1984      1
12             40.75828             -73.97069  24223   Customer         NA      0
13             40.75828             -73.97069  16779   Customer         NA      0
14             40.74475             -73.99915  17272 Subscriber       1976      1
15             40.72244             -74.00566  15008 Subscriber       1981      1
16             40.72244             -74.00566  23019 Subscriber       1982      1
17             40.72741             -73.98142  23843 Subscriber       1962      2
18             40.70706             -74.00732  22538 Subscriber       1981      1
19             40.75173             -73.98754  22042 Subscriber       1988      1
20             40.75255             -73.97283  22699 Subscriber       1982      1

我用来提取“隐藏的”自行车运动的代码并将它们放入一个连贯的data.frame:

end.station.id

因为它使用for循环,所以提取需要很长时间。有没有办法加快这个过程或以避免使用for循环的方式重写代码?

输出应保持完全相同。即:

head(output)
  bikeid end.station.id start.station.id        diff.time           stoptime          starttime
1  22545            520              529 24.8166666666667 11/2/2015 08:38:22 11/2/2015 09:03:11
2  22545            520              517 537.483333333333 11/2/2015 09:39:19 11/2/2015 18:36:48
3  22545           2004             3230 563.066666666667 11/2/2015 22:06:27 11/3/2015 07:29:31
4  22545            296             3236 471.783333333333 11/4/2015 23:40:29 11/5/2015 07:32:16
5  22545            520              449 43.4166666666667 11/9/2015 08:24:06 11/9/2015 09:07:31
6  22545            359              519 30.7166666666667 11/9/2015 09:14:46 11/9/2015 09:45:29

1 个答案:

答案 0 :(得分:0)

使用data.table的此解决方案需要几分钟时间。

使用shift函数,并通过bikeid将当前行的数据添加到现有data.table。

然后我们过滤!is.na(end.station.id)&(end.station.id!= start.station.id),然后删除不需要的列,然后设置列顺序。

   Excel = actxGetRunningServer('excel.application'); 
   set(Excel, 'Visible', 1);

   Workbooks = Excel.Workbooks;
   Workbook = Excel.Workbooks.Open('C:\Users\...test.xlsx');
   curr_sheet = get(Workbook,'ActiveSheet');
   rngObj = ('A1:C3')
   rngObj.Copy
   Sheets = Excel.ActiveWorkBook.Sheets;
   new_sheet = Sheets.Add;
   new_sheet.PasteSpecial; %This is where I am stuck!