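I am working with a large time-series data.table: 60 Billion rows x 50 columns.
For three specific columns, I would like to add a corresponding T/F column
indicating the first occurrence of each event, per idCol.
In other words, for ColumnA, the new column would be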
DT[, flag.ColumnA := dateCol==min(dateCol)
, by=list(idCol, ColumnA)]
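However, min(dateCol) often has ties, and ties must be resolved so that exactly one element is flagged TRUE and the rest are FALSE. This led to the following approach: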
## Set key to {idCol, dateCol} so that the first row in each group
## is the unique element in that group that should be set to TRUE
setkey(DT, idCol, dateCol)
DT[, flag.ColumnA := FALSE]
DT[, { DT[ .I[[1L]], flag.ColumnA := TRUE] } # braces here are just for easier reading
, by=list(idCol, ColumnA)]
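To make the tie problem concrete, here is a minimal sketch on hypothetical toy data (the table `toy` and the column names `flag.M1` and `flag.first` are made up for illustration). It flags the first row of each keyed group with `seq_len(.N) == 1L`, which is only one possible way to express "first row per group" and is not the approach used in the code above.

```r
library(data.table)

# Hypothetical toy data: one id, one event, two rows tied on the earliest date
toy <- data.table(
  idCol   = "ID_01",
  dateCol = as.Date(c("2013-06-01", "2013-06-01", "2013-06-22")),
  ColumnA = "BT"
)

# Comparing against min(dateCol) flags every row tied at the minimum
toy[, flag.M1 := dateCol == min(dateCol), by = list(idCol, ColumnA)]

# After keying by id and date, the first row of each group is an earliest row,
# so flagging only that row resolves the tie
setkey(toy, idCol, dateCol)
toy[, flag.first := seq_len(.N) == 1L, by = list(idCol, ColumnA)]

print(toy)
```

On this toy table the first flag marks both tied rows, while the second marks exactly one row per (idCol, ColumnA) group.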
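The problem is that the second method increases the run time by more than 3x, and the first method already takes over an hour per column (on a relatively fast box).
I also considered resolving the ties manually in method 1, but that was slower than both of the methods above.
Any suggestions on how to accomplish this more efficiently? Sample data below.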
DT["ID_01"] [ColumnA %in% c("BT", "CK", "MH")] [order(ColumnA, dateCol)]
idCol dateCol ColumnA ColumnB flag.ColumnA.M1 flag.ColumnA.M2
1: ID_01 2013-06-01 BT xxx TRUE TRUE <~~ M1 is WRONG, M2 is correct
2: ID_01 2013-06-01 BT www TRUE FALSE <~~ M1 is WRONG, M2 is correct
3: ID_01 2013-06-01 BT yyy TRUE FALSE <~~ M1 is WRONG, M2 is correct
4: ID_01 2013-06-22 BT xxx FALSE FALSE
5: ID_01 2013-11-23 BT yyy FALSE FALSE
6: ID_01 2013-11-30 BT zzz FALSE FALSE
7: ID_01 2013-06-15 CK www TRUE TRUE
8: ID_01 2013-06-15 CK uuu TRUE FALSE
9: ID_01 2013-06-15 CK www TRUE FALSE
10: ID_01 2013-06-29 CK zzz FALSE FALSE
11: ID_01 2013-10-12 CK vvv FALSE FALSE
12: ID_01 2013-11-02 CK uuu FALSE FALSE
13: ID_01 2013-06-22 MH uuu TRUE TRUE
14: ID_01 2013-06-22 MH xxx TRUE FALSE
15: ID_01 2013-06-22 MH zzz TRUE FALSE
16: ID_01 2013-08-24 MH ttt FALSE FALSE
17: ID_01 2013-09-07 MH xxx FALSE FALSE
18: ID_01 2013-09-14 MH zzz FALSE FALSE
19: ID_01 2013-09-21 MH vvv FALSE FALSE
20: ID_01 2013-11-30 MH ttt FALSE FALSE
library(data.table)
# increase N for a realistic test
N <- 2e4 # N should be large: some methods seem fast on small data but won't scale
ids <- sprintf("ID_%02d", seq(5))
A <- apply(expand.grid(LETTERS, LETTERS), 1, paste0, collapse="")
B <- paste0(letters, letters, letters)[20:26]
dates <- seq.Date(as.Date("2013-06-01"), as.Date("2013-12-01"), by=7)
set.seed(1)
DT <- data.table( dateCol=sample(dates, N, TRUE)
, idCol =sample(ids, N, TRUE)
, ColumnA=sample(A, N, TRUE)
, ColumnB=sample(B, N, TRUE)
, key="idCol")
{
cat("\n==========\nMETHOD ONE:\n")
print(system.time({
DT[, flag.ColumnA.M1 := dateCol==min(dateCol)
, by=list(idCol, ColumnA)]}))
cat("\n\n==========\nMETHOD TWO:\n")
print(system.time({
setkey(DT, idCol, dateCol)
DT[, flag.ColumnA.M2 := FALSE]
DT[, { DT[ .I[[1L]], flag.ColumnA.M2 := TRUE] } # braces here are just for easier reading
, by=list(idCol, ColumnA)]}))
}
## For Example, looking at ID_01, at a few select values of ColumnA:
DT["ID_01"] [ColumnA %in% c("BT", "CK", "MH")] [order(ColumnA, dateCol)]
Answer 0 (score: 4)
Simply using which.min solves the problem:
DT[, flag := FALSE]
DT[DT[, .I[which.min(dateCol)], by = list(idCol, ColumnA)]$V1, flag := TRUE]
For your small data sample this is instantaneous for me (too fast for system.time to measure), and for N=1e7 it is 1.5x faster than method 1. I did not test larger N.
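The reason this resolves ties is that `which.min` returns the position of the first minimum only, so each group contributes exactly one row number through `.I`. A minimal sketch on hypothetical toy data (the table `toy` and the variable `first_rows` are illustrative names, not part of the question):

```r
library(data.table)

# Hypothetical toy data with ties on the earliest date in both groups
toy <- data.table(
  idCol   = c("ID_01", "ID_01", "ID_01", "ID_02", "ID_02"),
  ColumnA = c("BT", "BT", "BT", "CK", "CK"),
  dateCol = as.Date(c("2013-06-22", "2013-06-01", "2013-06-01",
                      "2013-06-15", "2013-06-15"))
)

# One row number per group: the first row holding the minimum date
first_rows <- toy[, .I[which.min(dateCol)], by = list(idCol, ColumnA)]$V1

# Flag those rows only
toy[, flag := FALSE]
toy[first_rows, flag := TRUE]

print(toy)
```

Note that this version does not require the table to be keyed by date, since the minimum is located within each group directly.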
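Answer 1 (score: 3)
I would use set in place of := within [.data.table. This will be faster again once set can create columns.
Something like this (keyed by id and date to ensure the ordering is correct):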
system.time({
set.seed(1)
DT <- data.table( dateCol=sample(dates, N, TRUE)
, idCol =sample(ids, N, TRUE)
, ColumnA=sample(A, N, TRUE)
, ColumnB=sample(B, N, TRUE)
, key=c("idCol", "dateCol"))
ll <- lapply(c('ColumnA','ColumnB'), function(cc) DT[,.I[1],by = c('idCol',cc)][['V1']])
flags <- c('flagA','flagB')
DT[, (flags) := FALSE]
jflag <- match(flags, names(DT), nomatch=0)
for(jj in seq_along(jflag)){
set(DT, i = ll[[jj]], j = jflag[jj], value = TRUE)
}
})
# This is lightning fast (even including the creation of the data.table)
## user system elapsed
## 0.02 0.00 0.02
DT["ID_01"] [ColumnA %in% c("BT", "CK", "MH")] [order(ColumnA, dateCol)]
idCol dateCol ColumnA ColumnB flagA flagB
1: ID_01 2013-06-01 BT xxx TRUE TRUE
2: ID_01 2013-06-01 BT www FALSE FALSE
3: ID_01 2013-06-01 BT yyy FALSE FALSE
4: ID_01 2013-06-22 BT xxx FALSE FALSE
5: ID_01 2013-11-23 BT yyy FALSE FALSE
6: ID_01 2013-11-30 BT zzz FALSE FALSE
7: ID_01 2013-06-15 CK www TRUE FALSE
8: ID_01 2013-06-15 CK uuu FALSE FALSE
9: ID_01 2013-06-15 CK www FALSE FALSE
10: ID_01 2013-06-29 CK zzz FALSE FALSE
11: ID_01 2013-10-12 CK vvv FALSE FALSE
12: ID_01 2013-11-02 CK uuu FALSE FALSE
13: ID_01 2013-06-22 MH uuu TRUE FALSE
14: ID_01 2013-06-22 MH xxx FALSE FALSE
15: ID_01 2013-06-22 MH zzz FALSE FALSE
16: ID_01 2013-08-24 MH ttt FALSE FALSE
17: ID_01 2013-09-07 MH xxx FALSE FALSE
18: ID_01 2013-09-14 MH zzz FALSE FALSE
19: ID_01 2013-09-21 MH vvv FALSE FALSE
20: ID_01 2013-11-30 MH ttt FALSE FALSE
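Stripped down to a single column, the pattern above can be sketched as follows. The toy table and the names `toy` and `rows` are hypothetical; the flag column is created with := first and then filled by reference with set.

```r
library(data.table)

# Hypothetical toy data, keyed by id and date so the first row of each
# group is an earliest row
toy <- data.table(
  idCol   = c("ID_01", "ID_01", "ID_02"),
  dateCol = as.Date(c("2013-06-01", "2013-06-01", "2013-06-15")),
  ColumnA = c("BT", "BT", "CK"),
  key     = c("idCol", "dateCol")
)

# Create the flag column, then collect the first row number of each group
toy[, flagA := FALSE]
rows <- toy[, .I[1], by = c("idCol", "ColumnA")][["V1"]]

# Update only those rows by reference
set(toy, i = rows, j = "flagA", value = TRUE)

print(toy)
```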
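Other possibilities include doing everything in one sweep.
For example (this would be faster still if you knew the column numbers of the added columns each time, which you should in your case; then there is no need to match the flag columns against the names of the data.table):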
lapply(c('A','B'), function(LL){
cn <- sprintf('Column%s',LL)
fl <- sprintf('flag%s',LL)
DT[, (fl) :=FALSE]
is <- DT[,.I[1],by =c('idCol',cn)][['V1']]
jm <- match(fl, names(DT), nomatch=0)
set(DT, i=is, j=jm, value=TRUE)
invisible()
})
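or
# assumes cols <- c('ColumnA','ColumnB') and that DT2 is a keyed data.table
# that already has the flag columns at positions jflag
for(ff in seq_along(flags)){
is <- DT2[,.I[1], by =c('idCol',cols[ff])][['V1']]
set(DT2, i = is, j = jflag[ff], value = TRUE)
}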
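A self-contained sketch of this loop, under the assumption that the flag columns are appended at the end of the table so their positions are known without calling match. The toy data and the names `toy`, `cols`, `flags`, `first_col` and `jflag` are illustrative only.

```r
library(data.table)

# Hypothetical toy data, keyed by id and date
toy <- data.table(
  idCol   = c("ID_01", "ID_01", "ID_01", "ID_02"),
  dateCol = as.Date(c("2013-06-01", "2013-06-01", "2013-06-22", "2013-06-15")),
  ColumnA = c("BT", "BT", "BT", "CK"),
  ColumnB = c("xxx", "www", "xxx", "zzz"),
  key     = c("idCol", "dateCol")
)

cols  <- c("ColumnA", "ColumnB")
flags <- c("flagA", "flagB")

# New columns are appended after the existing ones, so their positions
# are known in advance
first_col <- ncol(toy) + 1L
toy[, (flags) := FALSE]
jflag <- first_col + seq_along(flags) - 1L

for (ff in seq_along(flags)) {
  # first row number of each (idCol, event) group in keyed order
  is <- toy[, .I[1], by = c("idCol", cols[ff])][["V1"]]
  set(toy, i = is, j = jflag[ff], value = TRUE)
}

print(toy)
```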