" .SD"的data.table帮助函数显示如何选择每个组的第一行:
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT
DT[, .N, by=x] # number of rows in each group
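This works well for me, but it breaks when I use all columns to define the groups. I don't understand why, so I'm wondering whether it is a bug. For example: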
# Selecting by n-1 columns works:
DT[, .SD[1], by=c("x", "y", "v", "a")]
x y v a b
1: b 1 1 1 9
2: b 3 1 2 8
3: b 6 1 3 7
4: a 1 2 4 6
5: a 3 2 5 5
6: a 6 1 6 4
7: c 1 1 7 3
8: c 3 2 8 2
9: c 6 2 9 1
# The result of selecting by all columns is not what I expected:
DT[, .SD[1], by=c("x", "y", "v", "a", "b")]
Empty data.table (0 rows) of 5 cols: x,y,v,a,b
Answer 0 (score: 2)
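As @christoph commented, .SD does not contain the grouping columns (I believe this is for efficiency, so that duplicated group values are not stored). You can verify this as follows: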
unique(DT[, .(name = names(.SD)), by=c('x','v')]$name)
# [1] "y" "a" "b"
unique(DT[, .(name = names(.SD)), by=c('x','v','a')]$name)
# [1] "y" "b"
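So if you group by all columns, there is nothing left in .SD. For your specific case, you can use unique and pass the grouping variables to its by argument, which removes duplicates based on the by columns: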
unique(DT, by=c('x','v'))
# x v y a b
#1: b 1 1 1 9
#2: a 2 1 4 6
#3: a 1 6 6 4
#4: c 1 1 7 3
#5: c 2 3 8 2
unique(DT, by=c('x','v','y','a','b'))
# x v y a b
#1: b 1 1 1 9
#2: b 1 3 2 8
#3: b 1 6 3 7
#4: a 2 1 4 6
#5: a 2 3 5 5
#6: a 1 6 6 4
#7: c 1 1 7 3
#8: c 2 3 8 2
#9: c 2 6 9 1
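If you would rather keep the "first row per group" form than switch to unique, one possible alternative (not part of the answer above, just a sketch) is to collect row numbers with .I and subset the table afterwards. This avoids .SD entirely, so it should not matter that .SD has no columns left. The names cols, first_idx and first_rows are only illustrative.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3),
                 v = c(1, 1, 1, 2, 2, 1, 1, 2, 2),
                 y = c(1, 3, 6), a = 1:9, b = 9:1)

# Group by every column and keep the row number of the first row in each group.
# .I holds row positions within DT, so it does not depend on the contents of .SD.
cols <- names(DT)
first_idx <- DT[, .I[1], by = cols]$V1

# Subset the original table by those row numbers.
first_rows <- DT[first_idx]
first_rows
```

Because every row of this DT is distinct, each row is its own group and first_rows has the same nine rows as DT. With duplicated rows, only the first occurrence of each would be kept.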