Question

I have a sequence that looks like this

  id ep value
1  1  1     a
2  1  2     a
3  1  3     b
4  1  4     d
5  2  1     a
6  2  2     a
7  2  3     c
8  2  4     e

What I would like to do is reduce it to

      id    ep  value     n  time total
1      1     0      a     2    20    40
2      1     1      b     1    10    40
3      1     2      d     1    10    40
4      2     0      a     2    20    40
5      2     1      c     1    10    40
6      2     2      e     1    10    40

dplyr seems to work fine

short = df %>%
  group_by(id) %>%
  mutate(grp = cumsum(value != lag(value, default = value[1]))) %>%
  count(id, grp, value) %>%
  mutate(time = n * 10) %>%
  group_by(id) %>%
  mutate(total = sum(time))

However, my dataset is really big and this takes forever.

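Question 1

Could someone help me translate this line into data.table code?

Question 2

I am also interested in going back to the long format, and I wonder what the most efficient solution is in terms of speed.

Currently, I am using this line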

short[rep(1:nrow(short), short$n), ] %>% 
  select(-n, -time, -total) %>% 
  group_by(id) %>% 
  mutate(ep = 1:n())

Any suggestions?

df = structure(list(id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("1", "2"), class = "factor"), ep = structure(c(1L, 
2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"), 
value = structure(c(1L, 1L, 2L, 4L, 1L, 1L, 3L, 5L), .Label = c("a", 
"b", "c", "d", "e"), class = "factor")), .Names = c("id", 
"ep", "value"), row.names = c(NA, -8L), class = "data.frame")

Answer 1

One option is to use rleid from data.table, which assigns a new group id every time the value changes

library(data.table)
short1 <- setDT(df)[,  .N,.(id, grp = rleid(value), value)
           ][,  time := N*10
            ][, c('total', 'ep') :=  .(sum(time), seq_len(.N) - 1), id
             ][, grp := NULL][]
short1
#   id value N time total ep
#1:  1     a 2   20    40  0
#2:  1     b 1   10    40  1
#3:  1     d 1   10    40  2
#4:  2     a 2   20    40  0
#5:  2     c 1   10    40  1
#6:  2     e 1   10    40  2
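
To make the grouping step more concrete, here is a minimal sketch of what rleid computes compared with the cumsum/lag idiom from the question. The vector `value` is made-up illustration data, not the asker's data, and the sketch assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical example vector: "a" appears in two separate runs
value <- c("a", "a", "b", "d", "a")

# rleid() numbers consecutive runs of equal values, starting at 1
run_id <- rleid(value)

# The cumsum idiom counts how many times the value has changed so far,
# so it starts at 0 instead of 1
changed <- value != shift(value, fill = value[1])
grp <- cumsum(changed)

# Both encode the same runs; they differ only by the starting offset
stopifnot(identical(run_id, grp + 1L))
print(data.frame(value, run_id, grp))
```

Because the ids label runs rather than distinct values, the second run of "a" gets its own id, which is what lets the count be taken per run.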

Expanding back to the 'long' format would be

short1[rep(seq_len(.N), N), -c('N', 'time', 'total', 'ep'), 
             with = FALSE][, ep1 := seq_len(.N), id][]
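
As a sanity check on the expansion, the following sketch collapses a small table and expands it again, then confirms that the result matches the input. It rebuilds the example data with plain character and integer columns instead of factors (an assumption made for brevity), and the name `long1` is introduced here for illustration only.

```r
library(data.table)

# Example data from the question, rebuilt without factors for brevity
df <- data.table(
  id    = rep(c("1", "2"), each = 4),
  ep    = rep(1:4, times = 2),
  value = c("a", "a", "b", "d", "a", "a", "c", "e")
)

# Collapse: one row per run of identical values within each id
short1 <- df[, .N, by = .(id, grp = rleid(value), value)][, grp := NULL][]

# Expand: repeat each row N times, then renumber the episodes within id
long1 <- short1[rep(seq_len(.N), N), .(id, value)][, ep := seq_len(.N), by = id][]

# The round trip should reproduce the original columns
stopifnot(identical(long1$id, df$id),
          identical(long1$ep, df$ep),
          identical(long1$value, df$value))
```

Repeating row indices in `i` avoids any per-group work during the expansion; only the final episode counter is computed by group.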

A direct translation of the dplyr code to data.table would be

setDT(df)[, grp := cumsum(value != shift(value, fill = value[1])), id
   ][, .(N= .N), .(id, grp, value)
    ][, time := N*10
     ][, c('total', 'ep') :=  .(sum(time), seq_len(.N) - 1), id
       ][, grp := NULL][]
#   id value N time total ep
#1:  1     a 2   20    40  0
#2:  1     b 1   10    40  1
#3:  1     d 1   10    40  2
#4:  2     a 2   20    40  0
#5:  2     c 1   10    40  1
#6:  2     e 1   10    40  2
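
One thing to keep in mind with the code above: setDT converts df in place and `:=` modifies by reference, so the `grp` column is added to df itself. If the original data frame should stay untouched, a possible workaround (a sketch on a made-up three-row data frame, not part of the original answer) is to work on a copy created with as.data.table:

```r
library(data.table)

# Hypothetical small data frame for illustration
df <- data.frame(id = c(1, 1, 2), value = c("a", "a", "b"))

# as.data.table() returns a copy, so df itself stays a plain data.frame
dt <- as.data.table(df)

# := adds the column by reference, but only to the copy
dt[, grp := rleid(value), by = id]

stopifnot(!"grp" %in% names(df), "grp" %in% names(dt))
```

The copy costs extra memory, so on a very large table converting in place with setDT may still be preferable when the original object is not needed afterwards.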
