并行化for循环分析data.frame

时间:2018-03-08 15:49:46

标签: r parallel-processing data.table data-mining

这些天我一直在使用8M寄存器的data.frame,我需要改进分析这些数据的循环。

我将描述我试图解决的问题的每个过程。 首先,我必须按三个字段ClientID,Date和Time按升序排列所有data.frame。 (校验) 然后,使用排列的data.frame,我必须操作每个观察之间的差异,只有在ClientID相同时才能进行。例如:

ClientID|Date(YMD)|Time(HMS)
A|20120101|110000
A|20120101|111500
A|20120101|120000
B|20120202|010000
B|20120202|012030

根据数据,我想获得的结果如下:

ClientID|Date(YMD)|Time(HMS)|Difference(minutes)
A|20120101|110000|0.00
A|20120101|111500|15.00
A|20120101|120000|45.00
B|20120202|010000|0
B|20120202|012030|20.30

现在的问题是,用8M观测的data.frame分析所有这些,需要3天。我希望我可以并行化这个过程。我的想法是,data.frame可以按簇进行分段,但是这种分段可能是有序的而不是随机的,然后使用库foreach或其他库,可以通过集群进行分析并将其设置为可用的核心数量。例如:

Cluster|ClientID|Date(YMD)|Time(HMS)
CORE 1|
1|A|20120101|110000
1|A|20120101|111500
1|A|20120101|120000
CORE 2|
2|B|20120202|010000
2|B|20120202|012030

1 个答案:

答案 0 :(得分:2)

我不建议尝试并行化这个。使用data.table包并使用以整数格式存储的时间,这将花费相当多的时间。

生成一些示例数据

library(data.table)

## Generate Data
RowCount <- 8e6
GroupCount <-1e4

DT <- data.table(ClientID = paste0("Client ",sample.int(GroupCount,size = RowCount, replace = TRUE)),
                 Time = sample.int(12,size = RowCount, replace = TRUE)*900)

DT[, Time := cumsum(Time), keyby = .(ClientID)]
DT[, Time := as.POSIXct(Time, tz = "UTC", origin = "1970-01-01 00:00:00")]

print(DT)

给出

            ClientID                Time
      1:    Client 1 1970-01-01 02:30:00
      2:    Client 1 1970-01-01 04:00:00
      3:    Client 1 1970-01-01 05:30:00
      4:    Client 1 1970-01-01 07:00:00
      5:    Client 1 1970-01-01 10:00:00
     ---                                
7999996: Client 9999 1970-02-20 18:15:00
7999997: Client 9999 1970-02-20 18:30:00
7999998: Client 9999 1970-02-20 21:00:00
7999999: Client 9999 1970-02-20 22:45:00
8000000: Client 9999 1970-02-21 00:30:00

计算时差

system.time({
  ## Create a integer column that stores time as the number of seconds midnight on 1970
  DT[,Time_Unix := as.integer(Time)]

  ## Order by ClientID then Time_Unix
  setkey(DT, ClientID, Time_Unix)

  ## Calculate Elapsed Time in minutes between rows, grouped by ClientID
  DT[, Elapsed_Minutes := (Time_Unix - shift(Time_Unix, n = 1L, type = "lag", fill = NA))/60L, keyby = .(ClientID)]

  ## Clean up the integer time
  DT[,Time_Unix := NULL]
})

...

   user  system elapsed 
  0.416   0.025   0.442 

结果:

print(DT)

...

            ClientID                Time Elapsed_Minutes
      1:    Client 1 1970-01-01 02:30:00              NA
      2:    Client 1 1970-01-01 04:00:00              90
      3:    Client 1 1970-01-01 05:30:00              90
      4:    Client 1 1970-01-01 07:00:00              90
      5:    Client 1 1970-01-01 10:00:00             180
     ---                                                
7999996: Client 9999 1970-02-20 18:15:00             135
7999997: Client 9999 1970-02-20 18:30:00              15
7999998: Client 9999 1970-02-20 21:00:00             150
7999999: Client 9999 1970-02-20 22:45:00             105
8000000: Client 9999 1970-02-21 00:30:00             105