Question

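I have a data.table with about 30 columns and 100 million rows. The data contains several blocks of rows, where every row in a block has the same value in three particular columns that I care about. Here is an illustrative example, where the columns of interest are Time, Fruit and Colour: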

dt <- data.table(Time = c(100, rep(101, 4), rep(102, 2), 103:105), 
                   Ref = 1:10, 
                   Fruit = c(rep('banana', 2), 'apple', rep('banana', 2), 
                             rep('orange', 2), 'banana', rep('apple', 2)), 
                   Colour = c('green', 'yellow', 'red', rep('yellow', 2), 
                              rep('blue', 2), 'red', 'green', 'red'), 
                   Price = c(rep(1, 3), 2, 4, 3, 1, rep(5, 3)))
dt

#    Time Ref  Fruit Colour Price
# 1:  100   1 banana  green     1
# 2:  101   2 banana yellow     1
# 3:  101   3  apple    red     1
# 4:  101   4 banana yellow     2
# 5:  101   5 banana yellow     4
# 6:  102   6 orange   blue     3
# 7:  102   7 orange   blue     1
# 8:  103   8 banana    red     5
# 9:  104   9  apple  green     5
#10:  105  10  apple    red     5

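This example contains two blocks. The first consists of rows 4 and 5 (101-banana-yellow) and the second of rows 6 and 7 (102-orange-blue). Note that although row 2 matches rows 4 and 5 on Time, Fruit and Colour, I don't want to include it in the block, because row 3 differs from rows 2, 4 and 5 and breaks the chain of consecutively matching rows.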

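Once the blocks are found, I want to collapse each block so that, for most columns, only the value from the last row of the block is kept, while for other columns the values from all rows of the block are summed. In this example Ref should show the last value and Price should be summed, so my desired output is: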

#    Time Ref  Fruit Colour Price
# 1:  100   1 banana  green     1
# 2:  101   2 banana yellow     1
# 3:  101   3  apple    red     1
# 4:  101   5 banana yellow     6
# 5:  102   7 orange   blue     4
# 6:  103   8 banana    red     5
# 7:  104   9  apple  green     5
# 8:  105  10  apple    red     5

I tried using data.table's by functionality, but I couldn't get the desired output:

byMethod <- dt[, list(Ref = tail(Ref, 1), Price = sum(Price)), by = list(Time, Fruit, Colour)]
setcolorder(byMethod, c('Time', 'Ref', 'Fruit', 'Colour', 'Price'))
byMethod

#    Time Ref  Fruit Colour Price
# 1:  100   1 banana  green     1
# 2:  101   5 banana yellow     7
# 3:  101   3  apple    red     1
# 4:  102   7 orange   blue     4
# 5:  103   8 banana    red     5
# 6:  104   9  apple  green     5
# 7:  105  10  apple    red     5

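This works for the 102-orange-blue block in the example, but it does not give the result I want for the 101-banana-yellow block, because it includes row 2 in that block when I don't want it to.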

Can anyone help me?

Answer 1

Is this fast enough?

#create an index
dt[,i:=.I]
#group adjacent indices together
dt[, g:=cumsum(c(1, (diff(i) > 1))), by=list(Time, Fruit, Colour)]
#sum prices
dt[, list(Ref=tail(Ref, 1), Price=sum(Price)), 
   by=list(Time, Fruit, Colour, g)]

#    Time  Fruit Colour g Ref Price
# 1:  100 banana  green 1   1     1
# 2:  101 banana yellow 1   2     1
# 3:  101  apple    red 1   3     1
# 4:  101 banana yellow 2   5     6
# 5:  102 orange   blue 1   7     4
# 6:  103 banana    red 1   8     5
# 7:  104  apple  green 1   9     5
# 8:  105  apple    red 1  10     5
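
To see why the g column separates row 2 from rows 4 and 5, it can help to look at the intermediate columns. The sketch below rebuilds only the three grouping columns from the question's example (Ref and Price are left out for brevity, which is my simplification) and prints i and g for the rows with Time 101. Within the 101-banana-yellow group the row indices are 2, 4, 5, so the jump from 2 to 4 gives diff(i) > 1 and starts a new value of g.

```r
library(data.table)

# Grouping columns only, copied from the question's example
dt <- data.table(Time = c(100, rep(101, 4), rep(102, 2), 103:105),
                 Fruit = c(rep('banana', 2), 'apple', rep('banana', 2),
                           rep('orange', 2), 'banana', rep('apple', 2)),
                 Colour = c('green', 'yellow', 'red', rep('yellow', 2),
                            rep('blue', 2), 'red', 'green', 'red'))

# Row index, then a counter that increases whenever the indices
# inside one (Time, Fruit, Colour) group are not adjacent
dt[, i := .I]
dt[, g := cumsum(c(1, diff(i) > 1)), by = list(Time, Fruit, Colour)]

# For rows 2, 3, 4, 5 the value of g is 1, 1, 2, 2
print(dt[Time == 101])
```

Because g only counts gaps within a single (Time, Fruit, Colour) combination, it has to be used together with those three columns in by, as the answer does.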

Answer 2

rleid() is now implemented in 1.9.5, see #686. From NEWS:

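7) rleid(), a convenience function for generating a run-length type id column to be used in grouping operations, is now implemented. Closes #686. Check the ?rleid examples section for usage scenarios.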

With this, we can now do:

require(data.table) ## 1.9.5+
dt[, rleid := rleid(Time, Fruit, Colour)]
dt[, .(Ref = Ref[.N], Price = sum(Price)), by=.(Time, Fruit, Colour, rleid)]
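
The question mentions about 30 columns, most of which should keep the value from the last row of each block. Listing them all in j gets tedious, so here is one possible sketch of a more general variant, assuming a data.table version that provides rleid(). The names run and sum_cols are mine, introduced for illustration: run holds the run id (named differently from the function to avoid confusion), and sum_cols lists the columns to be summed. It first picks the last row of every run with the .I[.N] idiom, then overwrites the summed columns.

```r
library(data.table)  # assumes a version that provides rleid()

dt <- data.table(Time = c(100, rep(101, 4), rep(102, 2), 103:105),
                 Ref = 1:10,
                 Fruit = c(rep('banana', 2), 'apple', rep('banana', 2),
                           rep('orange', 2), 'banana', rep('apple', 2)),
                 Colour = c('green', 'yellow', 'red', rep('yellow', 2),
                            rep('blue', 2), 'red', 'green', 'red'),
                 Price = c(rep(1, 3), 2, 4, 3, 1, rep(5, 3)))

sum_cols <- "Price"  # hypothetical: columns to sum; all others keep the last value

# Run id: increases whenever Time, Fruit or Colour changes from one row to the next
dt[, run := rleid(Time, Fruit, Colour)]

# Last row of every run, with all columns
res <- dt[dt[, .I[.N], by = run]$V1]

# Per-run sums, in the same run order as res
sums <- dt[, lapply(.SD, sum), by = run, .SDcols = sum_cols][, sum_cols, with = FALSE]
res[, (sum_cols) := sums]

# Drop the helper column; the original column order is preserved
res[, run := NULL]
print(res)
```

On the example this gives the eight rows of the desired output, in the original column order. Note that dt itself keeps the helper column run, since := modifies it by reference.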

Summarizing only consecutive rows in a data.table
