Distinct sequence patterns with counts

Date: 2017-12-22 12:16:24

Tags: r dplyr data.table

I have a sequence that looks like this

  id ep value
1  1  1     a
2  1  2     a
3  1  3     b
4  1  4     d
5  2  1     a
6  2  2     a
7  2  3     c
8  2  4     e

What I would like to do is reduce it to

      id    ep  value     n  time total
1      1     0      a     2    20    40
2      1     1      b     1    10    40
3      1     2      d     1    10    40
4      2     0      a     2    20    40
5      2     1      c     1    10    40
6      2     2      e     1    10    40

dplyr seems to work fine

short = df %>%
  group_by(id) %>%
  mutate(grp = cumsum(value != lag(value, default = value[1]))) %>%
  count(id, grp, value) %>%
  mutate(time = n*10) %>%
  group_by(id) %>%
  mutate(total = sum(time))

However, my data set is really large and this takes forever.

Question 1

Can someone help me translate this line into data.table code?

Question 2

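I am also interested in going back to the long format, and I would like to know what the most efficient solution is in terms of speed.

Currently, I am using this line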

short[rep(1:nrow(short), short$n), ] %>% 
  select(-n, -time, -total) %>% 
  group_by(id) %>% 
  mutate(ep = 1:n())

Any suggestions?

df = structure(list(id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("1", "2"), class = "factor"), ep = structure(c(1L, 
2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"), 
value = structure(c(1L, 1L, 2L, 4L, 1L, 1L, 3L, 5L), .Label = c("a", 
"b", "c", "d", "e"), class = "factor")), .Names = c("id", 
"ep", "value"), row.names = c(NA, -8L), class = "data.frame")

1 answer:

Answer 0 (score: 1)

One option is to use rleid from data.table

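library(data.table)
# setDT() converts df to a data.table by reference (df itself is changed)
short1 <- setDT(df)[,  .N,.(id, grp = rleid(value), value)   # rows per run of equal values
           ][,  time := N*10                                  # time = 10 per row in the run
            ][, c('total', 'ep') :=  .(sum(time), seq_len(.N) - 1), id  # per-id total, 0-based run index
             ][, grp := NULL][]                               # drop the helper column
short1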
#   id value N time total ep
#1:  1     a 2   20    40  0
#2:  1     b 1   10    40  1
#3:  1     d 1   10    40  2
#4:  2     a 2   20    40  0
#5:  2     c 1   10    40  1
#6:  2     e 1   10    40  2
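
To make the role of rleid concrete, here is a minimal sketch on a plain character vector holding the same values as the example data. The names value, run_id and by_hand are only for illustration. It also builds the ids by hand with cumsum, the same idea as the lag/shift comparison in the question.

```r
library(data.table)

value <- c("a", "a", "b", "d", "a", "a", "c", "e")

# rleid() gives every run of consecutive equal values its own integer id
run_id <- rleid(value)
print(run_id)
# [1] 1 1 2 3 4 4 5 6

# The same ids built by hand: start at 1, add 1 wherever the value changes
by_hand <- cumsum(c(TRUE, value[-1] != value[-length(value)]))
print(all(run_id == by_hand))
# [1] TRUE
```

Note that rleid runs over the whole column and knows nothing about id. Grouping by id as well, as in the code above, is what keeps a run from spanning two ids.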

Getting back to the 'long' format would be

short1[rep(seq_len(.N), N), -c('N', 'time', 'total', 'ep'), 
             with = FALSE][, ep1 := seq_len(.N), id][]

A direct translation of the dplyr code to data.table would be

setDT(df)[, grp := cumsum(value != shift(value, fill = value[1])), id
   ][, .(N= .N), .(id, grp, value)
    ][, time := N*10
     ][, c('total', 'ep') :=  .(sum(time), seq_len(.N) - 1), id
       ][, grp := NULL][]
#   id value N time total ep
#1:  1     a 2   20    40  0
#2:  1     b 1   10    40  1
#3:  1     d 1   10    40  2
#4:  2     a 2   20    40  0
#5:  2     c 1   10    40  1
#6:  2     e 1   10    40  2