I have a very large dataset that, simplified, looks like this:
row. member_id entry_id comment_count timestamp
1 1 a 4 2008-06-09 12:41:00
2 1 b 1 2008-07-14 18:41:00
3 1 c 3 2008-07-17 15:40:00
4 2 d 12 2008-06-09 12:41:00
5 2 e 50 2008-09-18 10:22:00
6 3 f 0 2008-10-03 13:36:00
I can compute a running total of the counts per member with the following code:
transform(df, aggregated_count = ave(comment_count, member_id, FUN = cumsum))
But I want the cumulative data lagged by 1; in other words, I want cumsum to ignore the current row. The result should be:
row. member_id entry_id comment_count timestamp previous_comments
1 1 a 4 2008-06-09 12:41:00 0
2 1 b 1 2008-07-14 18:41:00 4
3 1 c 3 2008-07-17 15:40:00 5
4 2 d 12 2008-06-09 12:41:00 0
5 2 e 50 2008-09-18 10:22:00 12
6 3 f 0 2008-10-03 13:36:00 0
Any ideas on how I can do this in R? Maybe even with a lag greater than 1?
Reproducible data:
# dput(df)
structure(list(member_id = c(1L, 1L, 1L, 2L, 2L, 3L), entry_id = c("a",
"b", "c", "d", "e", "f"), comment_count = c(4L, 1L, 3L, 12L,
50L, 0L), timestamp = c("2008-06-09 12:41:00", "2008-07-14 18:41:00",
"2008-07-17 15:40:00", "2008-06-09 12:41:00", "2008-09-18 10:22:00",
"2008-10-03 13:36:00")), .Names = c("member_id", "entry_id",
"comment_count", "timestamp"), row.names = c("1", "2", "3", "4",
"5", "6"), class = "data.frame")
Answer 0: (score: 10)
You can use lag from dplyr and change n to get a larger lag (the argument is named n, not k):
library(dplyr)
df %>%
  group_by(member_id) %>%
  mutate(previous_comments = lag(cumsum(comment_count), n = 1, default = 0))
# member_id entry_id comment_count timestamp previous_comments
#1 1 a 4 2008-06-09 12:41:00 0
#2 1 b 1 2008-07-14 18:41:00 4
#3 1 c 3 2008-07-17 15:40:00 5
#4 2 d 12 2008-06-09 12:41:00 0
#5 2 e 50 2008-09-18 10:22:00 12
#6 3 f 0 2008-10-03 13:36:00 0
Or use data.table:
library(data.table)
setDT(df)[, previous_comments := c(0, cumsum(comment_count[-.N])), by = member_id]
Answer 1: (score: 9)
You can use 0 for the first element and drop the last element with head(x, -1):
transform(df, previous_comments=ave(comment_count, member_id,
FUN = function(x) cumsum(c(0, head(x, -1)))))
# member_id entry_id comment_count timestamp previous_comments
#1 1 a 4 2008-06-09 12:41:00 0
#2 1 b 1 2008-07-14 18:41:00 4
#3 1 c 3 2008-07-17 15:40:00 5
#4 2 d 12 2008-06-09 12:41:00 0
#5 2 e 50 2008-09-18 10:22:00 12
#6 3 f 0 2008-10-03 13:36:00 0
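The question also asks about lags greater than 1. One way to generalize the base R approach is sketched below; lagged_cumsum and the column name lag2_comments are hypothetical names introduced only for this illustration, and the data frame is a trimmed copy of the question's data. The idea is to prepend k zeros to the running total and cut the result back to the group's length, which also behaves sensibly when a group has fewer than k rows.

```r
# Hypothetical helper (not part of the answers above):
# cumulative sum of x, lagged by k rows, with 0 where no earlier rows exist.
lagged_cumsum <- function(x, k = 1) {
  # Prepend k zeros to the running total, then truncate to the original length.
  c(rep(0, k), cumsum(x))[seq_along(x)]
}

# Trimmed copy of the question's data (only the columns needed here).
df <- data.frame(
  member_id     = c(1L, 1L, 1L, 2L, 2L, 3L),
  comment_count = c(4L, 1L, 3L, 12L, 50L, 0L)
)

# Lag 1 reproduces previous_comments from the question.
df$previous_comments <- ave(df$comment_count, df$member_id, FUN = lagged_cumsum)

# Lag 2: total of comments up to two rows back within each member.
df$lag2_comments <- ave(df$comment_count, df$member_id,
                        FUN = function(x) lagged_cumsum(x, k = 2))
```

With k = 1 this should match the output shown above; with k = 2 only the third row of member 1 has a non-zero value, because no other row has two earlier rows in its group.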
Answer 2: (score: 4)
Just subtract comment_count from the result of ave:
transform(df,
aggregated_count = ave(comment_count, member_id, FUN = cumsum) - comment_count)
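Why the subtraction works: cumsum includes the current row, so removing the current value leaves the sum of all earlier rows. A small check on a single group, using the first member's counts from the question (the variable names are made up for this sketch):

```r
# Counts for member_id 1 from the question.
x <- c(4L, 1L, 3L)

inclusive <- cumsum(x)      # running total including the current row
exclusive <- inclusive - x  # running total of the earlier rows only

# Same values as the "prepend 0, drop last" approach from the other answer.
shifted <- cumsum(c(0, head(x, -1)))
```

Note that this shortcut only covers a lag of exactly 1; a larger lag needs one of the shifting approaches.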