How to avoid the optimization warning in data.table

Time: 2013-04-21 15:11:29

Tags: r data.table

I have the following code:

> dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")
> dt
    a  b  c  d
 1: 3  1 11 21
 2: 3  2 12 22
 3: 3  3 13 23
 4: 3  4 14 24
 5: 3  5 15 25
 6: 4  6 16 26
 7: 4  7 17 27
 8: 4  8 18 28
 9: 4  9 19 29
10: 4 10 20 30
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d))'
Starting dogroups ... done dogroups in 0 secs
   a  b  c   d
1: 3 15 65 115
2: 4 40 90 140
> dt[,c(count=.N,lapply(.SD,sum)),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(count = .N, lapply(.SD, sum))'
Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
done dogroups in 0 secs
   a count  b  c   d
1: 3     5 15 65 115
2: 4     5 40 90 140

How can I avoid the scary "very inefficient" warning?

I could add a count column before aggregating:

> dt$count <- 1
> dt
    a  b  c  d count
 1: 3  1 11 21     1
 2: 3  2 12 22     1
 3: 3  3 13 23     1
 4: 3  4 14 24     1
 5: 3  5 15 25     1
 6: 4  6 16 26     1
 7: 4  7 17 27     1
 8: 4  8 18 28     1
 9: 4  9 19 29     1
10: 4 10 20 30     1
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d), sum(count))'
Starting dogroups ... done dogroups in 0 secs
   a  b  c   d count
1: 3 15 65 115     5
2: 4 40 90 140     5

But this doesn't look very elegant...

2 answers:

Answer 0: (score: 2)

One way I can think of is to assign count by reference:

dt.out <- dt[, lapply(.SD,sum), by = a]
dt.out[, count := dt[, .N, by=a][, N]]
# alternatively: count := table(dt$a)

#    a  b  c   d count
# 1: 3 15 65 115     5
# 2: 4 40 90 140     5

Edit 1: I still think this is just a message, not a warning. But if you still want to avoid it, do the following:

dt.out[, count := as.numeric(dt[, .N, by=a][, N])]
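
The text in question is printed alongside the other verbose diagnostics ("Finding groups ...", "Starting dogroups ..."), which suggests it only shows up when datatable.verbose is TRUE. A minimal sketch, under that assumption, of switching verbose mode off around the original call (the names old and res are illustrative):

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Assumption: the "named list" text is a verbose-mode diagnostic, not a warning()
old <- options(datatable.verbose = FALSE)  # remember the previous setting
res <- dt[, c(count = .N, lapply(.SD, sum)), by = "a"]
options(old)                               # restore the previous setting
```

This only hides the diagnostic; it does not change how j is evaluated for each group.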

Edit 2: Very interesting. Doing the equivalent of a multiple := assignment does not produce the same message.

dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# Detected that j uses these columns: a 
# Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
# Detected that j uses these columns: <none> 
# Optimization is on but j left unchanged as '.N'
# Starting dogroups ... done dogroups in 0 secs
# Detected that j uses these columns: N 
# Assigning to all 2 rows
# Direct plonk of unnamed RHS, no copy.
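
The count assignments above rely on dt[, .N, by=a] returning the groups in the same order as dt.out. A minimal sketch that matches on the value of a instead, assuming merge() is acceptable here (the names sums and counts are illustrative, not part of the answer):

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

sums   <- dt[, lapply(.SD, sum), by = a]    # per-group sums of b, c and d
counts <- dt[, list(count = .N), by = a]    # per-group row counts

# Combine on the grouping column rather than on row position
dt.out <- merge(counts, sums, by = "a")
```

This costs a second grouping pass and a join, so it trades some speed for not depending on group order.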

Answer 1: (score: 2)

This solution removes the message about the named elements. But you have to put the names back afterwards.

require(data.table)
options(datatable.verbose = TRUE)

dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")

dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]

Output:

> dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(.N, unname(lapply(.SD, sum)))'
Starting dogroups ... done dogroups in 0.001 secs
   a V1 V2 V3  V4
1: 3  5 15 65 115
2: 4  5 40 90 140
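
The renaming step mentioned above could look like the following sketch. It assumes the result columns come back in the order: grouping column, count, then the .SD columns in their original order (the name res is illustrative):

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

res <- dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]

# Replace all column names by reference; .SD holds every column except
# the grouping column `a`, so the summed columns follow the count column
setnames(res, c("a", "count", setdiff(names(dt), "a")))
```

setnames() modifies res in place, so no copy of the aggregated table is made.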