Improving the efficiency of finding the first occurrence of an event

Date: 2013-11-19 22:27:38

Tags: r optimization data.table

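I am working with a large time-series data.table, 60 *B*illion rows x 50 columns.

For three specific columns, I would like to add a corresponding T/F column that flags the first occurrence of each event, by idCol.

In other words, for ColumnA, the new column would be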

DT[, flag.ColumnA :=  dateCol==min(dateCol)
   , by=list(idCol, ColumnA)]

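HOWEVER: there are often ties for min(dateCol), and ties must be resolved so that exactly one element is flagged TRUE and the rest FALSE. This led to the following approach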

## Set key to {idCol, dateCol} so that the first row in each group
##   is the unique element in that group that should be set to TRUE
setkey(DT, idCol, dateCol)
DT[, flag.ColumnA := FALSE]
DT[, { DT[ .I[[1L]], flag.ColumnA := TRUE] }  # braces here are just for easier reading
   , by=list(idCol, ColumnA)]

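The problem is that the second method increases the run time by more than 3x, while the first method already takes over an hour per column (on a relatively fast box).

I also considered resolving the ties manually in method 1, but that turned out slower than either of the two methods above.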
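To make the tie problem concrete, here is a minimal sketch on a toy table (the table `toy` and its values are made up for illustration, not taken from the real data). Method 1 flags every row tied for the minimum date, whereas marking the first row of each group after sorting flags exactly one.

```r
library(data.table)

# Hypothetical toy data: two rows of group (ID_01, BT) tie on the earliest date
toy <- data.table(
  idCol   = c("ID_01", "ID_01", "ID_01", "ID_01"),
  dateCol = as.Date(c("2013-06-01", "2013-06-01", "2013-06-22", "2013-06-15")),
  ColumnA = c("BT", "BT", "BT", "CK")
)

# Method 1: every row tied for the minimum date is flagged
toy[, flag.M1 := dateCol == min(dateCol), by = list(idCol, ColumnA)]

# One flag per group: sort by id and date, then mark the first row of each group
setkey(toy, idCol, dateCol)
first_rows <- toy[, .I[1L], by = list(idCol, ColumnA)]$V1
toy[, flag.first := FALSE]
toy[first_rows, flag.first := TRUE]

# Number of TRUE flags per group under each approach
counts <- toy[, list(n.M1 = sum(flag.M1), n.first = sum(flag.first)),
              by = list(idCol, ColumnA)]
print(counts)
```

Any group where `n.M1` exceeds 1 is a group where method 1 over-flags.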

Any suggestions on how to accomplish this more efficiently? Sample data is below.


Sample of the expected output

DT["ID_01"] [ColumnA %in% c("BT", "CK", "MH")] [order(ColumnA, dateCol)]

    idCol    dateCol ColumnA ColumnB flag.ColumnA.M1 flag.ColumnA.M2
 1: ID_01 2013-06-01      BT     xxx            TRUE            TRUE <~~ M1 is WRONG, M2 is correct
 2: ID_01 2013-06-01      BT     www            TRUE           FALSE <~~ M1 is WRONG, M2 is correct
 3: ID_01 2013-06-01      BT     yyy            TRUE           FALSE <~~ M1 is WRONG, M2 is correct
 4: ID_01 2013-06-22      BT     xxx           FALSE           FALSE
 5: ID_01 2013-11-23      BT     yyy           FALSE           FALSE
 6: ID_01 2013-11-30      BT     zzz           FALSE           FALSE
 7: ID_01 2013-06-15      CK     www            TRUE            TRUE
 8: ID_01 2013-06-15      CK     uuu            TRUE           FALSE
 9: ID_01 2013-06-15      CK     www            TRUE           FALSE
10: ID_01 2013-06-29      CK     zzz           FALSE           FALSE
11: ID_01 2013-10-12      CK     vvv           FALSE           FALSE
12: ID_01 2013-11-02      CK     uuu           FALSE           FALSE
13: ID_01 2013-06-22      MH     uuu            TRUE            TRUE
14: ID_01 2013-06-22      MH     xxx            TRUE           FALSE
15: ID_01 2013-06-22      MH     zzz            TRUE           FALSE
16: ID_01 2013-08-24      MH     ttt           FALSE           FALSE
17: ID_01 2013-09-07      MH     xxx           FALSE           FALSE
18: ID_01 2013-09-14      MH     zzz           FALSE           FALSE
19: ID_01 2013-09-21      MH     vvv           FALSE           FALSE
20: ID_01 2013-11-30      MH     ttt           FALSE           FALSE

Sample data

library(data.table)

# increase N for realistic test
N <- 2e4  # N should be large, as certain methods will seem fast but won't scale

ids   <- sprintf("ID_%02d", seq(5))
A     <- apply(expand.grid(LETTERS, LETTERS), 1, paste0, collapse="")
B     <- paste0(letters, letters, letters)[20:26]
dates <- seq.Date(as.Date("2013-06-01"), as.Date("2013-12-01"), by=7)

set.seed(1)
DT <- data.table( dateCol=sample(dates, N, TRUE)
                , idCol  =sample(ids,   N, TRUE)
                , ColumnA=sample(A,     N, TRUE)
                , ColumnB=sample(B,     N, TRUE)
                , key="idCol")


{
  cat("\n==========\nMETHOD ONE:\n")
  print(system.time({
       DT[, flag.ColumnA.M1 :=  dateCol==min(dateCol)
          , by=list(idCol, ColumnA)]}))
  cat("\n\n==========\nMETHOD TWO:\n")
  print(system.time({
      setkey(DT, idCol, dateCol)
      DT[, flag.ColumnA.M2 := FALSE]
      DT[, { DT[ .I[[1L]], flag.ColumnA.M2 := TRUE] }  # braces here are just for easier reading
         , by=list(idCol, ColumnA)]}))
}

## For Example, looking at ID_01, at a few select values of ColumnA: 
DT["ID_01"] [ColumnA %in% c("BT", "CK", "MH")] [order(ColumnA, dateCol)]

2 Answers:

Answer 0: (score: 4)

Simply using which.min solves the problem, because it returns only the first minimum:

DT[, flag := FALSE]
DT[DT[, .I[which.min(dateCol)], by = list(idCol, ColumnA)]$V1, flag := TRUE]

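For your small data sample this is instantaneous for me (too fast for system.time to measure), and for N=1e7 it is 1.5 times faster than method 1. I did not test larger N.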
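A quick sanity check, as a sketch on made-up data (the table `chk` is hypothetical): since which.min returns the index of the first minimum, each group should end up with exactly one TRUE, sitting on a row that carries the group's earliest date.

```r
library(data.table)

set.seed(42)
# Hypothetical data with only five distinct dates, so ties are frequent
chk <- data.table(
  idCol   = sample(c("ID_01", "ID_02"), 200, TRUE),
  ColumnA = sample(c("AA", "BB", "CC"), 200, TRUE),
  dateCol = sample(seq.Date(as.Date("2013-06-01"), by = 7, length.out = 5), 200, TRUE)
)

chk[, flag := FALSE]
idx <- chk[, .I[which.min(dateCol)], by = list(idCol, ColumnA)]$V1
chk[idx, flag := TRUE]

# Per group: how many rows are flagged, and how many flagged rows sit at the minimum date
res <- chk[, list(n.flag   = sum(flag),
                  n.at.min = sum(flag & dateCol == min(dateCol))),
           by = list(idCol, ColumnA)]
print(res)
```

Note that this approach does not require the table to be keyed or sorted by date.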

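Answer 1: (score: 3)

I would use set instead of := within [.data.table. This will be faster again once set can create columns.

Something like this (keyed by id and date to ensure the ordering is correct)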

system.time({
set.seed(1)
DT <- data.table( dateCol=sample(dates, N, TRUE)
                  , idCol  =sample(ids,   N, TRUE)
                  , ColumnA=sample(A,     N, TRUE)
                  , ColumnB=sample(B,     N, TRUE)
                  , key=c("idCol", "dateCol"))
ll <- lapply(c('ColumnA','ColumnB'), function(cc) DT[,.I[1],by = c('idCol',cc)][['V1']])

flags <- c('flagA','flagB')
DT[, (flags) := FALSE]

jflag <- match(flags, names(DT), nomatch=0)

for(jj in seq_along(jflag)){
  set(DT, i = ll[[jj]], j = jflag[jj], value = TRUE)

}
})


# See this is lightning fast (even incorporating the creation of the data.table)
##   user  system elapsed 
##   0.02    0.00    0.02 




DT["ID_01"] [ColumnA %in% c("BT", "CK", "MH")] [order(ColumnA, dateCol)]
    idCol    dateCol ColumnA ColumnB flagA flagB
 1: ID_01 2013-06-01      BT     xxx  TRUE  TRUE
 2: ID_01 2013-06-01      BT     www FALSE FALSE
 3: ID_01 2013-06-01      BT     yyy FALSE FALSE
 4: ID_01 2013-06-22      BT     xxx FALSE FALSE
 5: ID_01 2013-11-23      BT     yyy FALSE FALSE
 6: ID_01 2013-11-30      BT     zzz FALSE FALSE
 7: ID_01 2013-06-15      CK     www  TRUE FALSE
 8: ID_01 2013-06-15      CK     uuu FALSE FALSE
 9: ID_01 2013-06-15      CK     www FALSE FALSE
10: ID_01 2013-06-29      CK     zzz FALSE FALSE
11: ID_01 2013-10-12      CK     vvv FALSE FALSE
12: ID_01 2013-11-02      CK     uuu FALSE FALSE
13: ID_01 2013-06-22      MH     uuu  TRUE FALSE
14: ID_01 2013-06-22      MH     xxx FALSE FALSE
15: ID_01 2013-06-22      MH     zzz FALSE FALSE
16: ID_01 2013-08-24      MH     ttt FALSE FALSE
17: ID_01 2013-09-07      MH     xxx FALSE FALSE
18: ID_01 2013-09-14      MH     zzz FALSE FALSE
19: ID_01 2013-09-21      MH     vvv FALSE FALSE
20: ID_01 2013-11-30      MH     ttt FALSE FALSE

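Other possibilities include doing everything in a single sweep

For example

(This would be faster still if you knew the column numbers of the added columns each time, which in your case you should; there would then be no need to match the flag columns against the names of the data.table.)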

lapply(c('A','B'), function(LL){
    cn <- sprintf('Column%s',LL)
    fl <- sprintf('flag%s',LL)
    DT[, (fl) :=FALSE]
    is <- DT[,.I[1],by =c('idCol',cn)][['V1']]
    jm <- match(fl, names(DT), nomatch=0)
    set(DT, i=is, j=jm, value=TRUE)
    invisible()
  })
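As a sketch of the same idea without the match step: in current data.table versions, set() also accepts a column name for j, so the lookup of column numbers can be dropped. The helper name `flag_first` and the toy table below are made up for illustration, and the helper assumes the table is already sorted by idCol and dateCol.

```r
library(data.table)

# Hypothetical helper: flag the first row of each (idCol, col) group by reference.
# Assumes DT is already sorted by idCol and dateCol (e.g. via its key).
flag_first <- function(DT, col, flag) {
  DT[, (flag) := FALSE]                                # create/reset the flag column
  rows <- DT[, .I[1L], by = c("idCol", col)][["V1"]]   # first row index per group
  set(DT, i = rows, j = flag, value = TRUE)            # j given by name, no match() needed
  invisible(DT)
}

# Toy data, keyed so that the first row of each group has the earliest date
toy <- data.table(
  idCol   = c("ID_01", "ID_01", "ID_01", "ID_02"),
  dateCol = as.Date(c("2013-06-01", "2013-06-01", "2013-06-22", "2013-06-08")),
  ColumnA = c("BT", "BT", "CK", "BT"),
  ColumnB = c("xxx", "www", "xxx", "xxx"),
  key     = c("idCol", "dateCol")
)

for (LL in c("A", "B")) {
  flag_first(toy, paste0("Column", LL), paste0("flag", LL))
}
print(toy)
```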

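## The same thing as an explicit loop. DT2 is assumed to be a data.table with
## the same columns as DT, including the flag columns (e.g. a copy of DT).
## cols is not defined in the original; it is assumed to pair up with flags.
cols <- c('ColumnA', 'ColumnB')
for(ff in seq_along(flags)){
  is <- DT2[,.I[1], by =c('idCol',cols[ff])][['V1']]
  set(DT2, i = is, j = jflag[ff], value = TRUE)
}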