In R

Time: 2017-08-07 11:11:27

Tags: r date aggregate

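I have the following data set, which I want to group and aggregate per sequence. Each sequence should be split into the events that occur within the first 7 days after the first date, with all later events combined into a separate group. Basically, my biggest challenge is to find the first date in a sequence, add 7 days, and flag all dates in that sequence that fall into this category.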

structure(list(`Sequence ID` = c("1_0_0", "1_0_0", "1_0_0", "1_0_0", 
"1_0_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_2_0", 
"1_2_1", "1_2_1", "1_2_1", "1_2_1", "1_2_2"), Date = c("02.12.2015 20:16", 
"03.12.2015 20:17", "02.12.2015 20:44", "03.12.2015 09:32", "03.12.2015 09:33", 
"07.12.2015 08:18", "08.12.2015 19:40", "08.12.2015 19:43", "22.12.2015 18:22", 
"22.12.2015 18:23", "23.12.2015 14:18", "05.01.2016 11:35", "05.01.2016 13:21", 
"05.01.2016 13:22", "05.01.2016 13:22", "04.08.2016 08:25"), 
    StimuliA = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 
    0L, 0L, 0L, 0L, 0L), StimuliB = c(0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L), Response = c(1L, 
    1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L
    )), .Names = c("Sequence ID", "Date", "StimuliA", "StimuliB", 
"Response"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-16L), spec = structure(list(cols = structure(list(`Sequence ID` = structure(list(), class = c("collector_character", 
"collector")), Date = structure(list(), class = c("collector_character", 
"collector")), StimuliA = structure(list(), class = c("collector_integer", 
"collector")), StimuliB = structure(list(), class = c("collector_integer", 
"collector")), Response = structure(list(), class = c("collector_integer", 
"collector")), X6 = structure(list(), class = c("collector_skip", 
"collector")), X7 = structure(list(), class = c("collector_skip", 
"collector")), X8 = structure(list(), class = c("collector_skip", 
"collector")), X9 = structure(list(), class = c("collector_skip", 
"collector")), X10 = structure(list(), class = c("collector_skip", 
"collector"))), .Names = c("Sequence ID", "Date", "StimuliA", 
"StimuliB", "Response", "X6", "X7", "X8", "X9", "X10")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

This could be a possible output, where Group 0 sums all values within the first 7 days and Group 1 sums the values that occur later.

Sequence ID Group   Date           StimuliA StimuliB    Response
1_0_0         0   02.12.2015 20:16    0         0           5
1_0_0         1   09.12.2015 20:16    0         0           0
1_1_0         0   07.12.2015 08:18    1         0           2
1_1_0         1   14.12.2015 08:18    0         0           2
1_2_0         0   23.12.2015 14:18    1         0           0
1_2_0         1   30.12.2015 14:18    0         0           0
1_2_1         0   05.01.2016 11:35    0         1           3
1_2_1         1   12.01.2016 11:35    0         0           0
1_2_2         0   04.08.2016 08:25    0         1           0
1_2_2         1   11.08.2016 08:25    0         0           0

I would try to achieve this with the following code, but I need some input on how to identify the values before and after 7 days.

library(dplyr)

# convert the date to POSIXct format
df$Date <- as.POSIXct(df$Date, format = "%d.%m.%Y %H:%M")

# arrange the data frame by sequence and date
df <- arrange(df, `Sequence ID`, Date)

# identify the values before and after 7 days (missing: this step should create Group)

# aggregate all the event log rows by sequence and group
df <- aggregate(cbind(StimuliA, StimuliB, Response) ~ `Sequence ID` + Group, data = df, FUN = sum)
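
The missing step could be sketched in base R as follows. This is only a minimal illustration under a few assumptions: it uses the five rows of sequence 1_1_0 from the data above, renames the ID column to the syntactic name Sequence.ID, assumes the time zone is UTC (the question names none), and reads "the first 7 days" as less than 7 * 24 hours after the first event of the sequence.

```r
# Rows of sequence 1_1_0 from the question, with a syntactic column name
df <- data.frame(
  Sequence.ID = "1_1_0",
  Date = c("07.12.2015 08:18", "08.12.2015 19:40", "08.12.2015 19:43",
           "22.12.2015 18:22", "22.12.2015 18:23"),
  Response = c(0L, 1L, 1L, 1L, 1L)
)

# Parse the date-times; tz = "UTC" is an assumption
df$Date <- as.POSIXct(df$Date, format = "%d.%m.%Y %H:%M", tz = "UTC")

# First timestamp of each sequence (in seconds), repeated on every row
first <- ave(as.numeric(df$Date), df$Sequence.ID, FUN = min)

# Group 0: less than 7 * 24 hours after the first event; Group 1: later
df$Group <- as.integer(as.numeric(df$Date) - first >= 7 * 24 * 3600)

# Sum Response per sequence and group
result <- aggregate(Response ~ Sequence.ID + Group, data = df, FUN = sum)
```

For this sequence, the three December 7 and 8 events fall into Group 0 and the two December 22 events into Group 1, which matches the Response sums of 2 and 2 in the possible output.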

1 Answer:

Answer 0 (score: 1)

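The following data.table code returns the values aggregated by sequence and by whether an event falls within the first seven days of its sequence or later: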

library(data.table)
# copy and coerce to data.table
data.table(DF)[
  # make syntactically valid column names
  , setnames(.SD, make.names(names(.SD)))][
    # transform character date-time to date
    , Date := as.Date(lubridate::dmy_hm(Date))][
      # create Group variable for the first 7 days and beyond within each sequence
      , Initial.Period := Date %between% (min(Date) + c(0L, 6L)), by = Sequence.ID][
        # aggregate by sequence and date range
        , .(Min.Date = min(Date), Response = sum(Response)), by = .(Sequence.ID, Initial.Period)]
   Sequence.ID Initial.Period   Min.Date Response
1:       1_0_0           TRUE 2015-12-02        5
2:       1_1_0           TRUE 2015-12-07        2
3:       1_1_0          FALSE 2015-12-22        2
4:       1_2_0           TRUE 2015-12-23        0
5:       1_2_1           TRUE 2016-01-05        3
6:       1_2_2           TRUE 2016-08-04        0

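Note that the result differs from the possible output shown in the question because the question is ambiguous or inconsistent with the sample data provided: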

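  • The sample data contain date-times, but the OP consistently uses the terms date and days in the specification. Therefore, the code uses Date instead of POSIXct.
  • I deliberately chose Initial.Period as a more telling column name for the first 7 days, to avoid the generic and ambiguous name Group.
  • The StimuliA and StimuliB columns are omitted from the aggregation because they are not consistent within a sequence and the OP did not specify how to handle this case.
  • Min.Date is the minimum date found in the data for each sequence and period, not the computed start of the period.
  • The result only shows aggregates for data that are present in the data set. The possible output contains more rows because it includes all possible combinations of sequence and period, with missing values filled with zeros.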
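
If the zero-filled rows from the possible output are wanted, one way to get them might be a join against the full grid of combinations. The sketch below is not part of the original answer; it assumes an aggregated table shaped like the first three rows of the result above, and the names agg, grid and full are made up for illustration.

```r
library(data.table)

# First three rows of the aggregated result shown above (Min.Date omitted)
agg <- data.table(
  Sequence.ID    = c("1_0_0", "1_1_0", "1_1_0"),
  Initial.Period = c(TRUE, TRUE, FALSE),
  Response       = c(5L, 2L, 2L)
)

# All combinations of sequence and period
grid <- CJ(Sequence.ID = unique(agg$Sequence.ID),
           Initial.Period = c(TRUE, FALSE))

# Join the aggregate onto the grid; combinations without data get NA
full <- agg[grid, on = .(Sequence.ID, Initial.Period)]

# Replace the missing sums with zero
full[is.na(Response), Response := 0L]
```

Here full has one row per sequence and period, so sequence 1_0_0 gains a later-period row with a Response of 0.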