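I have the following dataset, which I would like to group and aggregate per sequence. Each sequence should be split into the events that occur within the first 7 days after the first date of the sequence, and all later events should be combined into a separate group. Basically, my biggest challenge is to find the first date in each sequence, add 7 days, and flag all dates in that sequence that fall into this window.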
structure(list(`Sequence ID` = c("1_0_0", "1_0_0", "1_0_0", "1_0_0",
"1_0_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_2_0",
"1_2_1", "1_2_1", "1_2_1", "1_2_1", "1_2_2"), Date = c("02.12.2015 20:16",
"03.12.2015 20:17", "02.12.2015 20:44", "03.12.2015 09:32", "03.12.2015 09:33",
"07.12.2015 08:18", "08.12.2015 19:40", "08.12.2015 19:43", "22.12.2015 18:22",
"22.12.2015 18:23", "23.12.2015 14:18", "05.01.2016 11:35", "05.01.2016 13:21",
"05.01.2016 13:22", "05.01.2016 13:22", "04.08.2016 08:25"),
StimuliA = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 0L), StimuliB = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L), Response = c(1L,
1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L
)), .Names = c("Sequence ID", "Date", "StimuliA", "StimuliB",
"Response"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-16L), spec = structure(list(cols = structure(list(`Sequence ID` = structure(list(), class = c("collector_character",
"collector")), Date = structure(list(), class = c("collector_character",
"collector")), StimuliA = structure(list(), class = c("collector_integer",
"collector")), StimuliB = structure(list(), class = c("collector_integer",
"collector")), Response = structure(list(), class = c("collector_integer",
"collector")), X6 = structure(list(), class = c("collector_skip",
"collector")), X7 = structure(list(), class = c("collector_skip",
"collector")), X8 = structure(list(), class = c("collector_skip",
"collector")), X9 = structure(list(), class = c("collector_skip",
"collector")), X10 = structure(list(), class = c("collector_skip",
"collector"))), .Names = c("Sequence ID", "Date", "StimuliA",
"StimuliB", "Response", "X6", "X7", "X8", "X9", "X10")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
This could be a possible output, where Group 0 sums up all values of the first 7 days and Group 1 sums up the values that occur later.
Sequence ID Group Date StimuliA StimuliB Response
1_0_0 0 02.12.2015 20:16 0 0 5
1_0_0 1 09.12.2015 20:16 0 0 0
1_1_0 0 07.12.2015 08:18 1 0 2
1_1_0 1 14.12.2015 08:18 0 0 2
1_2_0 0 23.12.2015 14:18 1 0 0
1_2_0 1 30.12.2015 14:18 0 0 0
1_2_1 0 05.01.2016 11:35 0 1 3
1_2_1 1 12.01.2016 11:35 0 0 0
1_2_2 0 04.08.2016 08:25 0 1 0
1_2_2 1 11.08.2016 08:25 0 0 0
I would try to do this with the following code, but I need some input on how to identify the values before and after the 7 days.
library(dplyr)
# convert the date into POSIXct format
df$Date <- as.POSIXct(strptime(df$Date, "%d.%m.%Y %H:%M"))
# arrange the data frame by sequence and date
df <- arrange(df, `Sequence ID`, Date)
# identify the values before and after 7 days (open step: this should create a Group column)
# aggregate all the event log rows by sequence and group
df <- aggregate(cbind(StimuliA, StimuliB, Response) ~ `Sequence ID` + Group, data = df, sum)
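Answer 0 (score: 1)
The following data.table code returns the values aggregated by sequence and by period, i.e., the first seven days within each sequence versus anything later: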
library(data.table)
# copy and coerce to data.table
data.table(DF)[
# make syntactically valid column names
, setnames(.SD, make.names(names(.SD)))][
# transform character date-time to date
, Date := as.Date(lubridate::dmy_hm(Date))][
# create Group variable for the first 7 days and beyond within each sequence
, Initial.Period := Date %between% (min(Date) + c(0L, 6L)), by = Sequence.ID][
# aggregate by sequence and date range
, .(Min.Date = min(Date), Response = sum(Response)), by = .(Sequence.ID, Initial.Period)]
   Sequence.ID Initial.Period   Min.Date Response
1:       1_0_0           TRUE 2015-12-02        5
2:       1_1_0           TRUE 2015-12-07        2
3:       1_1_0          FALSE 2015-12-22        2
4:       1_2_0           TRUE 2015-12-23        0
5:       1_2_1           TRUE 2016-01-05        3
6:       1_2_2           TRUE 2016-08-04        0
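To make the window logic concrete, here is a minimal base R sketch of what the `%between%` step does for a single sequence. The dates are hypothetical (chosen to sit on both sides of the boundary, not all taken from the sample data), and the two comparisons assume the default inclusive bounds of `%between%`.

```r
# Hypothetical dates for one sequence; the first one defines the start of the window
dates <- as.Date(c("2015-12-07", "2015-12-08", "2015-12-13",
                   "2015-12-14", "2015-12-22"))

# First day plus six more days: seven calendar days, both ends inclusive
window <- min(dates) + c(0L, 6L)

# Same test that `dates %between% window` performs with inclusive bounds
initial_period <- dates >= window[1] & dates <= window[2]

data.frame(dates, initial_period)
```

With this definition the seventh calendar day (2015-12-13) still counts as part of the initial period, while 2015-12-14 does not.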
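Note that the result differs from the possible output shown in the question because the provided sample data are ambiguous or inconsistent:
- The date-times are converted to class Date rather than POSIXct.
- Initial.Period is used as a more telling column name for the first 7 days, to avoid the generic and ambiguous name Group.
- The StimuliA and StimuliB columns are omitted because they are not consistent within a sequence and the OP did not specify how to handle this case.
- Min.Date is the minimum date found in the data for each sequence and period, not the computed start of the period.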
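If exact timestamps matter and the stimuli columns should be summed as well, a variant along the following lines might be closer to the output sketched in the question. This is only a sketch under stated assumptions: the time zone (UTC) is not given in the question, "7 days" is read as 7 * 24 hours after the first event of a sequence, the stimuli are simply summed, and the names `Group`, `Start` and `res` are illustrative. Only a subset of the sample data is used to keep the example short.

```r
library(data.table)

# Subset of the question's sample data (sequences 1_1_0 and 1_2_0 only)
DF <- data.frame(
  `Sequence ID` = c("1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_2_0"),
  Date = c("07.12.2015 08:18", "08.12.2015 19:40", "08.12.2015 19:43",
           "22.12.2015 18:22", "22.12.2015 18:23", "23.12.2015 14:18"),
  StimuliA = c(1L, 0L, 0L, 0L, 0L, 1L),
  StimuliB = c(0L, 0L, 0L, 0L, 0L, 0L),
  Response = c(0L, 1L, 1L, 1L, 1L, 0L),
  check.names = FALSE, stringsAsFactors = FALSE
)

dt <- as.data.table(DF)
setnames(dt, make.names(names(dt)))  # "Sequence ID" becomes Sequence.ID

# Keep the time of day; tz = "UTC" is an assumption, the question states none
dt[, Date := as.POSIXct(Date, format = "%d.%m.%Y %H:%M", tz = "UTC")]

# Group 0: less than 7 full days after the first event of the sequence, Group 1: later
dt[, Group := as.integer(difftime(Date, min(Date), units = "days") >= 7),
   by = Sequence.ID]

# Sum the stimuli and response columns per sequence and group
res <- dt[, c(.(Start = min(Date)), lapply(.SD, sum)),
          by = .(Sequence.ID, Group),
          .SDcols = c("StimuliA", "StimuliB", "Response")]
res
```

As in the answer's code, `Start` is the earliest observed timestamp per group rather than a computed period start, and groups without any events do not appear in the result.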