I have a dataset that looks like this:
library(data.table)
set.seed(10)
n_rows <- 50
data <- data.table(id = 1:n_rows,
                   timestamp = Sys.Date() + as.difftime(1:n_rows, units = "days"),
                   subject = sample(letters[1:4], n_rows, replace = TRUE),
                   response = sample(3, n_rows, replace = TRUE)
                   )
head(data, 10)
id timestamp subject response
1: 1 2016-05-17 c 2
2: 2 2016-05-18 b 3
3: 3 2016-05-19 b 1
4: 4 2016-05-20 c 2
5: 5 2016-05-21 a 1
6: 6 2016-05-22 a 2
7: 7 2016-05-23 b 2
8: 8 2016-05-24 b 2
9: 9 2016-05-25 c 2
10: 10 2016-05-26 b 2
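I need to do some grouping with an operation that keeps a running count, by date, of the occurrences of each response.
The following grouping generates the nth_test column.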
new_vars <- data[, .(id, timestamp, nth_test = 1:.N, response), by=.(subject)]
subject id timestamp nth_test response
1: c 1 2016-05-17 1 2
2: c 4 2016-05-20 2 2
3: c 9 2016-05-25 3 2
4: c 11 2016-05-27 4 1
5: c 12 2016-05-28 5 1
6: c 14 2016-05-30 6 2
7: c 22 2016-06-07 7 2
8: c 26 2016-06-11 8 2
9: c 31 2016-06-16 9 3
10: c 36 2016-06-21 10 1
But I can't figure out how to create the columns resp_1, resp_2 and resp_3 as shown below.
subject id timestamp nth_test response resp_1 resp_2 resp_3
1: c 1 2016-05-17 1 2 0 1 0
2: c 4 2016-05-20 2 2 0 2 0
3: c 9 2016-05-25 3 2 0 3 0
4: c 11 2016-05-27 4 1 1 3 0
5: c 12 2016-05-28 5 1 2 3 0
6: c 14 2016-05-30 6 2 2 4 0
7: c 22 2016-06-07 7 2 2 5 0
8: c 26 2016-06-11 8 2 2 6 0
9: c 31 2016-06-16 9 3 2 6 1
10: c 36 2016-06-21 10 1 3 6 1
Cheers
Answer 0 (score: 3)
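We can try the following. For each distinct response value, cumsum(x == response) gives a running count of that value within each subject:
# distinct response values, in ascending order
Un1 <- unique(sort(data$response))
# add nth_test plus one running-count column per response value, by subject
data[, c("nth_test", paste("resp", Un1, sep="_")) := c(list(1:.N),
         lapply(Un1, function(x) cumsum(x==response))) , .(subject)]
# show the rows for subject "c" in date order
data[order(subject, timestamp)][subject=="c"]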
# id timestamp subject response nth_test resp_1 resp_2 resp_3
# 1: 1 2016-05-17 c 2 1 0 1 0
# 2: 4 2016-05-20 c 2 2 0 2 0
# 3: 9 2016-05-25 c 2 3 0 3 0
# 4: 11 2016-05-27 c 1 4 1 3 0
# 5: 12 2016-05-28 c 1 5 2 3 0
# 6: 14 2016-05-30 c 2 6 2 4 0
# 7: 22 2016-06-07 c 2 7 2 5 0
# 8: 26 2016-06-11 c 2 8 2 6 0
# 9: 31 2016-06-16 c 3 9 2 6 1
#10: 36 2016-06-21 c 1 10 3 6 1
#11: 39 2016-06-24 c 1 11 4 6 1
#12: 40 2016-06-25 c 1 12 5 6 1
#13: 44 2016-06-29 c 2 13 5 7 1
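To make the cumsum(x == response) step easier to follow, here is a minimal base R sketch. The vector below simply copies the first six responses of subject "c" from the output above; it is an illustration of the idea, not part of the original answer.

```r
# First six responses of subject "c", in date order (copied from the output above)
response <- c(2, 2, 2, 1, 1, 2)

# Comparing against a single value yields a logical vector;
# cumsum() counts TRUE as 1 and FALSE as 0, giving a running count
resp_1 <- cumsum(response == 1)
resp_2 <- cumsum(response == 2)
resp_3 <- cumsum(response == 3)

print(resp_1)  # 0 0 0 1 2 2
print(resp_2)  # 1 2 3 3 3 4
print(resp_3)  # 0 0 0 0 0 0
```

These values match the resp_1, resp_2 and resp_3 columns in the first six rows for subject "c". Because the counts are cumulative, they depend on the rows already being in date order within each subject, which is the case for the example data.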
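Answer 1 (score: 0)
I wanted to see what it would look like if the cummax / cumsum were done with the data.table in long form (which may be more efficient in some configurations):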
data[order(subject, timestamp)
  ][, rCnt := 1:.N, .(subject, response)
  ][, responseStr := sprintf('%s_%s', 'resp', response)
  ][, dcast(.SD, id + timestamp + subject + response ~ responseStr, value.var='rCnt', fill=0)
  ][, melt(.SD, id.vars=c('id', 'timestamp', 'subject', 'response'))
  ][order(subject, timestamp)
  ][, value := cummax(value), .(subject, variable)
  ][, nth_test := 1:.N, .(subject, variable)
  ][, dcast(.SD, id + timestamp + subject + response + nth_test ~ variable, value.var='value')
  ][order(subject, timestamp)
  ][subject == 'c'
  ]
id timestamp subject response nth_test resp_1 resp_2 resp_3
1: 1 2016-05-17 c 2 1 0 1 0
2: 4 2016-05-20 c 2 2 0 2 0
3: 9 2016-05-25 c 2 3 0 3 0
4: 11 2016-05-27 c 1 4 1 3 0
5: 12 2016-05-28 c 1 5 2 3 0
6: 14 2016-05-30 c 2 6 2 4 0
7: 22 2016-06-07 c 2 7 2 5 0
8: 26 2016-06-11 c 2 8 2 6 0
9: 31 2016-06-16 c 3 9 2 6 1
10: 36 2016-06-21 c 1 10 3 6 1
11: 39 2016-06-24 c 1 11 4 6 1
12: 40 2016-06-25 c 1 12 5 6 1
13: 44 2016-06-29 c 2 13 5 7 1
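Both approaches rely on the rows being in date order within each subject. As a rough sketch of that point, the example below uses a small made-up table (the name dt and its values are hypothetical, not the question's data) that starts out unsorted, sorts it with setorder(), and then builds the running counts with a plain loop over the response values.

```r
library(data.table)

# Small made-up table, deliberately NOT in date order (hypothetical values)
dt <- data.table(
  subject   = c("c", "c", "a", "c", "a"),
  timestamp = as.Date("2016-05-17") + c(0, 3, 4, 8, 5),
  response  = c(2L, 2L, 1L, 1L, 3L)
)

# Sort in place so cumulative counts follow the timeline within each subject
setorder(dt, subject, timestamp)

# Running test number per subject
dt[, nth_test := seq_len(.N), by = subject]

# One running-count column per distinct response value
for (r in sort(unique(dt$response))) {
  col <- paste0("resp_", r)
  dt[, (col) := cumsum(response == r), by = subject]
}

print(dt)
```

After sorting, the rows are the two "a" rows followed by the three "c" rows, so resp_1 comes out as 1 1 0 0 1, resp_2 as 0 0 1 2 2, and resp_3 as 0 1 0 0 0.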