I have a sequence that looks like this:
  id ep value
1  1  1     a
2  1  2     a
3  1  3     b
4  1  4     d
5  2  1     a
6  2  2     a
7  2  3     c
8  2  4     e
What I would like to do is reduce it to:
  id ep value n time total
1  1  0     a 2   20    40
2  1  1     b 1   10    40
3  1  2     d 1   10    40
4  2  0     a 2   20    40
5  2  1     c 1   10    40
6  2  2     e 1   10    40
This dplyr code seems to work fine:
short = df %>% group_by(id) %>%
mutate(grp = cumsum(value != lag(value, default = value[1]))) %>%
count(id, grp, value) %>% mutate(time = n*10) %>% group_by(id) %>%
mutate(total = sum(time))
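However, my database is really big and this takes forever.

Question 1

Can someone help me translate this line into data.table code?

Question 2

I am also interested in going back to the long format, and I am wondering what the most efficient solution is in terms of speed.
Currently, I am using this line: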
short[rep(1:nrow(short), short$n), ] %>%
select(-n, -time, -total) %>%
group_by(id) %>%
mutate(ep = 1:n())
Any suggestions?
df = structure(list(id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("1", "2"), class = "factor"), ep = structure(c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
value = structure(c(1L, 1L, 2L, 4L, 1L, 1L, 3L, 5L), .Label = c("a",
"b", "c", "d", "e"), class = "factor")), .Names = c("id",
"ep", "value"), row.names = c(NA, -8L), class = "data.frame")
Answer 0 (score: 1)

One option is to use rleid from data.table:
library(data.table)
short1 <- setDT(df)[, .N,.(id, grp = rleid(value), value)
][, time := N*10
][, c('total', 'ep') := .(sum(time), seq_len(.N) - 1), id
][, grp := NULL][]
short1
#    id value N time total ep
# 1:  1     a 2   20    40  0
# 2:  1     b 1   10    40  1
# 3:  1     d 1   10    40  2
# 4:  2     a 2   20    40  0
# 5:  2     c 1   10    40  1
# 6:  2     e 1   10    40  2
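
To see why rleid can stand in for the cumsum/lag construction, here is a minimal sketch on toy vectors (x, id and value below are made up for illustration, not taken from the question's data). rleid assigns a new integer each time the value changes, which is the run index that the grouping needs. Note that rleid(value) in the code above is computed over the whole table rather than per id; the runs still end up separated per id only because id is also one of the grouping columns. Passing both vectors to rleid is an alternative that makes this explicit.

```r
library(data.table)

# Toy vector (illustrative only): a new run id starts at every change of value
x <- c("a", "a", "b", "a", "a")
run_x <- rleid(x)
run_x
# [1] 1 1 2 3 3

# With several vectors, a run breaks when ANY of them changes
id    <- c(1, 1, 2, 2)
value <- c("a", "a", "a", "b")
run_value    <- rleid(value)      # ignores the id boundary
run_id_value <- rleid(id, value)  # breaks at the id boundary as well
run_value
# [1] 1 1 1 2
run_id_value
# [1] 1 1 2 3
```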
Expanding back to the 'long' format would be:
short1[rep(seq_len(.N), N), -c('N', 'time', 'total', 'ep'),
with = FALSE][, ep1 := seq_len(.N), id][]
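
As a sanity check, the short-to-long expansion can be verified as a round trip. The sketch below rebuilds the question's data with plain character and integer columns (an assumption; the original df uses factors) and confirms that expanding the run-length table gives back the original rows. The names df_toy, short_toy and long_toy are hypothetical and used only here.

```r
library(data.table)

# Toy copy of the question's data; plain character/integer columns are assumed
df_toy <- data.table(
  id    = rep(c("1", "2"), each = 4),
  ep    = rep(1:4, times = 2),
  value = c("a", "a", "b", "d", "a", "a", "c", "e")
)

# Short format: one row per run of identical values within an id
short_toy <- df_toy[, .N, by = .(id, grp = rleid(value), value)][, grp := NULL]

# Long format: repeat each row N times, then rebuild the within-id counter
long_toy <- short_toy[rep(seq_len(.N), N), .(id, value)]
long_toy[, ep := seq_len(.N), by = id]
setcolorder(long_toy, c("id", "ep", "value"))

# Compare column by column with the starting data
same_id    <- identical(long_toy$id, df_toy$id)
same_ep    <- identical(long_toy$ep, df_toy$ep)
same_value <- identical(long_toy$value, df_toy$value)
```

If all three comparisons are TRUE, the expansion is lossless for the columns kept; time and total are not needed for the round trip because they are derived from N.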
A direct translation of the dplyr code to data.table would be:
setDT(df)[, grp := cumsum(value != shift(value, fill = value[1])), id
][, .(N= .N), .(id, grp, value)
][, time := N*10
][, c('total', 'ep') := .(sum(time), seq_len(.N) - 1), id
][, grp := NULL][]
#    id value N time total ep
# 1:  1     a 2   20    40  0
# 2:  1     b 1   10    40  1
# 3:  1     d 1   10    40  2
# 4:  2     a 2   20    40  0
# 5:  2     c 1   10    40  1
# 6:  2     e 1   10    40  2