Creating 2 new fields using data.table

Time: 2018-06-08 23:12:26

Tags: r data.table

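I am looking for some advice on making the R code below produce the desired result. Any help would be much appreciated.

Below are the logic, the code, the sample data, and the desired output. The current code does not produce the desired result. I can get the desired result with a for loop, but it takes too long.

Logic: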

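  1. If the row is the first one encountered for an ID, then group = 1 and groupdate = date.
  2. Otherwise, if it is not the first row of the ID and date - previous date > 10 or date - previous groupdate > 10, then group = previous group # + 1 and groupdate = date.
  3. Otherwise, if it is not the first row of the ID and date - previous date <= 10 or date - previous groupdate <= 10, then group = previous group # and groupdate = previous groupdate.

Sample code: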

    df1 <- read.table(header=T,text='ID  DATE        ITEM
               1   1/1/2014    P1
               1   1/15/2014   P2
               1   1/20/2015   P3
               1   1/22/2015   P4
               1   3/10/2015   P5
               2   1/13/2015   P1
               2   1/20/2015   P2
               2   1/28/2015   P3
               2   2/28/2015   P4
               2   3/20/2015   P5')
    df1
    
    library(data.table)
    setDT(df1)[, GROUP:={
         dt <- as.Date(DATE, "%m/%d/%Y")
         gr1 <-cumsum((dt-shift(dt, fill=dt[1L]))>10)+1L; list(gr1)} ,
            by =  ID]
    df1[, GROUPDATE := DATE[1L] , by = .(GROUP, ID)]
    
    df1
    

    ===============

    # Desired output.
    ID  DATE        ITEM    GROUP   GROUPDATE
    1   1/1/2014    P1  1   1/1/2014
    1   1/15/2014   P2  2   1/15/2014
    1   1/20/2014   P3  2   1/15/2014
    1   1/22/2014   P4  2   1/15/2014
    1   3/10/2015   P5  3   3/10/2015
    2   1/13/2015   P1  1   1/13/2015
    2   1/20/2015   P2  1   1/13/2015
    2   1/28/2015   P3  2   1/28/2015
    2   2/28/2015   P4  3   2/28/2015
    2   3/20/2015   P5  4   3/20/2015
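
For reference, the logic above can be read as a single sequential pass over the rows of each ID. The following is only a sketch of such a loop, not the code actually used in the question: `assign_groups` and `max_gap` are hypothetical names, and the input is assumed to be a Date vector for one ID, sorted in ascending order.

```r
# Sequential reference implementation of the grouping rules (hypothetical helper).
# Assumes `dates` is a Date vector for a single ID, sorted in ascending order.
assign_groups <- function(dates, max_gap = 10) {
  n <- length(dates)
  grp <- integer(n)
  gdt <- dates                      # default: each row is its own group date
  for (i in seq_len(n)) {
    if (i == 1L) {
      grp[i] <- 1L                  # rule 1: first row of the ID opens group 1
    } else if (dates[i] - dates[i - 1L] > max_gap ||
               dates[i] - gdt[i - 1L] > max_gap) {
      grp[i] <- grp[i - 1L] + 1L    # rule 2: open a new group, keep gdt[i] = dates[i]
    } else {
      grp[i] <- grp[i - 1L]         # rule 3: stay in the current group
      gdt[i] <- gdt[i - 1L]         # and carry the group date forward
    }
  }
  data.frame(GROUP = grp, GROUPDATE = gdt)
}

# Dates of ID 1 and ID 2 as they appear in the desired output
id1 <- as.Date(c("1/1/2014", "1/15/2014", "1/20/2014", "1/22/2014", "3/10/2015"),
               format = "%m/%d/%Y")
id2 <- as.Date(c("1/13/2015", "1/20/2015", "1/28/2015", "2/28/2015", "3/20/2015"),
               format = "%m/%d/%Y")
res1 <- assign_groups(id1)
res2 <- assign_groups(id2)
```

A loop like this is easy to check against the desired output, but it runs row by row, which is presumably why it becomes slow on large data.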
    

2 answers:

Answer 0 (score: 1)

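This answer describes two approaches, one iterative and the other using a non-equi join.

Although both return the desired output for the sample dataset provided by the OP, I feel they should be tested more thoroughly with a larger dataset.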

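Iterative approach

The OP has requested to increase the group counter by one, for each ID,

  1. if there is a gap of more than 10 days between the current date and the previous date, i.e., the date in the previous row, or
  2. if there is a gap of more than 10 days between the current date and the first day of the group (the "groupdate").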

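Condition (1) can be computed directly in one go. Unfortunately, the first day of the group referenced in condition (2) may change while condition (2) is being evaluated. Therefore, several passes may be required to find the final solution.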

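The answer below evaluates condition (1) first and then uses this provisional solution to evaluate condition (2) iteratively until no more changes in the group assignment are observed.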
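
To see why condition (1) alone may not be enough, consider the first three dates of ID 2 in the sample data. The base R check below is only an illustration; the variable names are made up for this example.

```r
# First three dates of ID 2 from the sample data
d <- as.Date(c("2015-01-13", "2015-01-20", "2015-01-28"))

# Condition (1): gap to the previous row; both gaps are within 10 days
gap_prev <- as.integer(diff(d))

# Condition (2): gap to the first date of the would-be group
gap_first <- as.integer(d - d[1])

gap_prev   # 7 8
gap_first  # 0 7 15
```

The third row is only 8 days after its predecessor but 15 days after the first row, so it has to open a new group even though condition (1) is not triggered.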

The OP has provided two different data tables in the question. Here, we take the second one, which contains the expected result:

    library(data.table)
    df2 <- fread("ID  DATE        ITEM    GROUP   GROUPDATE
    1   1/1/2014    P1  1   1/1/2014
    1   1/15/2014   P2  2   1/15/2014
    1   1/20/2014   P3  2   1/15/2014
    1   1/22/2014   P4  2   1/15/2014
    1   3/10/2015   P5  3   3/10/2015
    2   1/13/2015   P1  1   1/13/2015
    2   1/20/2015   P2  1   1/13/2015
    2   1/28/2015   P3  2   1/28/2015
    2   2/28/2015   P4  3   2/28/2015
    2   3/20/2015   P5  4   3/20/2015")
    

The DATE and GROUPDATE columns need to be coerced to class Date beforehand.

    cols <- c("DATE", "GROUPDATE")
    df2[, (cols) := lapply(.SD, as.IDate, "%m/%d/%Y"), .SDcols = cols]
    

Note that the rows are implicitly assumed to be ordered by DATE within each ID.
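
If that ordering cannot be taken for granted, it can be enforced first. A minimal sketch, assuming DATE has already been converted to a date class; `demo` is a made-up table used only for illustration.

```r
library(data.table)

# Made-up, deliberately unsorted example table
demo <- data.table(
  ID   = c(2L, 1L, 1L),
  DATE = as.IDate(c("2015-01-13", "2014-01-15", "2014-01-01"))
)

# Sort by reference: first by ID, then by DATE within each ID
setorder(demo, ID, DATE)

# Check that no ID has out-of-order dates
sorted_ok <- demo[, !is.unsorted(as.integer(DATE)), by = ID][, all(V1)]
```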

Now, we can start with the computations. In order to compare the results, I have chosen different column names.

    dt <- copy(df2) # just for convenience to easily switch between df1 and df2
    # create group count for gaps of more than 10 days
    dt[, grp := cumsum((DATE - shift(DATE, fill = 0)) > 10L), by = ID]
    # set group date
    dt[, gdt := first(DATE), by = .(ID, grp)]
    # update group count according to condition (2)
    tmp <- dt[, cumsum((DATE - shift(gdt, fill = 0)) > 10L), by = ID][, V1]
    # repeat as long as there are changes in group counts
    while (dt[, any(grp != tmp)]) {
      # complete update of group count
      dt[, grp := tmp]
      # set new group date
      dt[, gdt := first(DATE), by = .(ID, grp)]
      # update group count according to condition (2)
      tmp <- dt[, cumsum((DATE - shift(gdt, fill = 0)) > 10L), by = ID][, V1]
    }
    dt
    
        ID       DATE ITEM GROUP  GROUPDATE grp        gdt
     1:  1 2014-01-01   P1     1 2014-01-01   1 2014-01-01
     2:  1 2014-01-15   P2     2 2014-01-15   2 2014-01-15
     3:  1 2014-01-20   P3     2 2014-01-15   2 2014-01-15
     4:  1 2014-01-22   P4     2 2014-01-15   2 2014-01-15
     5:  1 2015-03-10   P5     3 2015-03-10   3 2015-03-10
     6:  2 2015-01-13   P1     1 2015-01-13   1 2015-01-13
     7:  2 2015-01-20   P2     1 2015-01-13   1 2015-01-13
     8:  2 2015-01-28   P3     2 2015-01-28   2 2015-01-28
     9:  2 2015-02-28   P4     3 2015-02-28   3 2015-02-28
    10:  2 2015-03-20   P5     4 2015-03-20   4 2015-03-20
    

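Non-equi join

The conditions can be rewritten: a row belongs to a group if the gap to the previous row is not more than 10 days and the gap to the first row of the group is not more than 10 days.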

    dt <- copy(df2)
    # append row id 
    dt[, rn := .I]
    tmp <- 
      # non-equi join to find all rows which lie within a 10 days interval
      dt[dt[, .(ID, start = DATE, end = DATE + 10L)], 
         on = .(ID, DATE >= start, DATE <= end)][
           # aggregate for row id to find the earliest start date
           , min(DATE), by = .(ID, rn)][
             # create group id (starting at one for each ID)
             , rleid(V1), by = ID]
    # append group id to original data.table
    dt[, grp := tmp$V1][
      # set group date
      , gdt := first(DATE), by = .(ID, grp)]
    dt
    
        ID       DATE ITEM GROUP  GROUPDATE rn grp        gdt
     1:  1 2014-01-01   P1     1 2014-01-01  1   1 2014-01-01
     2:  1 2014-01-15   P2     2 2014-01-15  2   2 2014-01-15
     3:  1 2014-01-20   P3     2 2014-01-15  3   2 2014-01-15
     4:  1 2014-01-22   P4     2 2014-01-15  4   2 2014-01-15
     5:  1 2015-03-10   P5     3 2015-03-10  5   3 2015-03-10
     6:  2 2015-01-13   P1     1 2015-01-13  6   1 2015-01-13
     7:  2 2015-01-20   P2     1 2015-01-13  7   1 2015-01-13
     8:  2 2015-01-28   P3     2 2015-01-28  8   2 2015-01-28
     9:  2 2015-02-28   P4     3 2015-02-28  9   3 2015-02-28
    10:  2 2015-03-20   P5     4 2015-03-20 10   4 2015-03-20
    
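Step-by-step explanation

The idea is to find the "groupdate" a particular row belongs to in one operation. This is achieved by finding all possible 10-day intervals a row may belong to and aggregating by taking the earliest start date for each row. The result is then used to create the group ids with the rleid() function.

The expression below creates a helper data.table containing the 10-day intervals:
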
    dt[, .(ID, start = DATE, end = DATE + 10L)]
    
        ID      start        end
     1:  1 2014-01-01 2014-01-11
     2:  1 2014-01-15 2014-01-25
     3:  1 2014-01-20 2014-01-30
     4:  1 2014-01-22 2014-02-01
     5:  1 2015-03-10 2015-03-20
     6:  2 2015-01-13 2015-01-23
     7:  2 2015-01-20 2015-01-30
     8:  2 2015-01-28 2015-02-07
     9:  2 2015-02-28 2015-03-10
    10:  2 2015-03-20 2015-03-30
    

The non-equi join finds all rows which lie within the 10-day intervals:

    dt[dt[, .(ID, start = DATE, end = DATE + 10L)], 
         on = .(ID, DATE >= start, DATE <= end)]
    
        ID       DATE ITEM GROUP  GROUPDATE rn     DATE.1
     1:  1 2014-01-01   P1     1 2014-01-01  1 2014-01-11
     2:  1 2014-01-15   P2     2 2014-01-15  2 2014-01-25
     3:  1 2014-01-15   P3     2 2014-01-15  3 2014-01-25
     4:  1 2014-01-15   P4     2 2014-01-15  4 2014-01-25
     5:  1 2014-01-20   P3     2 2014-01-15  3 2014-01-30
     6:  1 2014-01-20   P4     2 2014-01-15  4 2014-01-30
     7:  1 2014-01-22   P4     2 2014-01-15  4 2014-02-01
     8:  1 2015-03-10   P5     3 2015-03-10  5 2015-03-20
     9:  2 2015-01-13   P1     1 2015-01-13  6 2015-01-23
    10:  2 2015-01-13   P2     1 2015-01-13  7 2015-01-23
    11:  2 2015-01-20   P2     1 2015-01-13  7 2015-01-30
    12:  2 2015-01-20   P3     2 2015-01-28  8 2015-01-30
    13:  2 2015-01-28   P3     2 2015-01-28  8 2015-02-07
    14:  2 2015-02-28   P4     3 2015-02-28  9 2015-03-10
    15:  2 2015-03-20   P5     4 2015-03-20 10 2015-03-30
    

For instance, the interval [2014-01-15, 2014-01-25] includes rows 2, 3, and 4. On the other hand, row 4 (DATE: 2014-01-22) belongs to three different intervals: [2014-01-15, 2014-01-25], [2014-01-20, 2014-01-30], and [2014-01-22, 2014-02-01].
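
The interval membership described here can be verified with plain vector comparisons. This is only an illustrative base R check on the dates of ID 1, not part of the join itself; all variable names are made up.

```r
# DATE column for ID 1 (rows 1 to 5 of the sample data)
dates <- as.Date(c("2014-01-01", "2014-01-15", "2014-01-20",
                   "2014-01-22", "2015-03-10"))
start <- dates
end   <- dates + 10

# Rows covered by the interval that starts at row 2
rows_in_interval_2 <- which(dates >= start[2] & dates <= end[2])

# Intervals (identified by their starting row) that cover row 4
intervals_for_row_4 <- which(start <= dates[4] & end >= dates[4])
```

Both results are rows 2, 3, and 4, in line with the join output above.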

Now, we pick the interval with the earliest start date for each row:

    dt[dt[, .(ID, start = DATE, end = DATE + 10L)], 
       on = .(ID, DATE >= start, DATE <= end)][
         , min(DATE), by = .(ID, rn)]
    
        ID rn         V1
     1:  1  1 2014-01-01
     2:  1  2 2014-01-15
     3:  1  3 2014-01-15
     4:  1  4 2014-01-15
     5:  1  5 2015-03-10
     6:  2  6 2015-01-13
     7:  2  7 2015-01-13
     8:  2  8 2015-01-20
     9:  2  9 2015-02-28
    10:  2 10 2015-03-20
    

Subsequent rows with the same V1 belong to the same group. Thus, we can create the group id using the rleid() function.
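
As a small illustration of what rleid() does, the sketch below applies it to the V1 values shown above for ID 2 and to a made-up character vector.

```r
library(data.table)

# Earliest interval start per row, as shown above for ID 2
v1 <- as.IDate(c("2015-01-13", "2015-01-13", "2015-01-20",
                 "2015-02-28", "2015-03-20"))
grp <- rleid(v1)   # 1 1 2 3 4

# rleid() numbers runs, not distinct values: a value that reappears
# after a break starts a new run
runs <- rleid(c("a", "a", "b", "a"))   # 1 1 2 3
```

Because the rows are sorted by DATE within each ID, equal V1 values are always adjacent here, so run ids and group ids coincide.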

    tmp <- 
      dt[dt[, .(ID, start = DATE, end = DATE + 10L)], 
         on = .(ID, DATE >= start, DATE <= end)][
           , min(DATE), by = .(ID, rn)][
             , rleid(V1), by = ID]
    
    tmp
    
        ID V1
     1:  1  1
     2:  1  2
     3:  1  2
     4:  1  2
     5:  1  3
     6:  2  1
     7:  2  1
     8:  2  2
     9:  2  3
    10:  2  4
    

The final steps are

    # append group id to original data.table
    dt[, grp := tmp$V1][
      # set group date
      , gdt := first(DATE), by = .(ID, grp)]
    

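Answer 1 (score: 0)

I think the simplest approach is to capture the date-related logic in a vectorized function, see groupDates(x) below. I have used recursive logic for the function; there may be a better way to implement it.

For the field that counts the changes, you can use cumsum.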
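
The counting step can be sketched in isolation. The logical vector below is written out by hand for ID 1 of the sample data (TRUE where a row is more than 10 days after the previous row's group date); the variable names are illustrative only.

```r
# TRUE marks a row that is more than 10 days after the previous group date
# (values written out by hand for ID 1 in the sample data)
is_new_group <- c(FALSE, TRUE, FALSE, FALSE, TRUE)

# Running count of changes, shifted so that the first group is 1
group <- cumsum(is_new_group) + 1L
group  # 1 2 2 2 3
```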

    df1 <- read.table(header=T,text='ID  DATE        ITEM
               1   1/1/2014    P1
               1   1/15/2014   P2
               1   1/20/2014   P3
               1   1/22/2014   P4
               1   3/10/2015   P5
               2   1/13/2015   P1
               2   1/20/2015   P2
               2   1/28/2015   P3
               2   2/28/2015   P4
               2   3/20/2015   P5')

    library(data.table)
    dt <- as.data.table(df1)

    # safe.ifelse() is not defined in the original answer; it is assumed here
    # to be the common class-preserving wrapper around ifelse(), which would
    # otherwise strip the Date class
    safe.ifelse <- function(cond, yes, no) {
      structure(ifelse(cond, yes, no), class = class(yes))
    }

    # replace each date by the first date of its group, recursing until no
    # positive gap between consecutive group dates is below 10 days
    groupDates <- function(x) {
      x.prev <- c(head(x, 1), head(x, -1))
      x.diff <- abs(c(0, diff(x)))
      x.diff.big <- x.diff > 10 | x.diff == 0
      x.diff.prev.big <- c(TRUE, head(x.diff.big, -1))
      x[!x.diff.big & x.diff.prev.big] <- NA
      x <- safe.ifelse(!is.na(x), x, x.prev)
      d <- diff(x)
      if(min(d[d > 0]) < 10) {
        groupDates(x)
      }
      else {
        x
      }
    }

    dt[, date := as.Date(DATE, format = "%m/%d/%Y")]
    dt[, group.date := groupDates(date), ID]
    dt[, previous.date := shift(group.date, fill = first(date)), ID]
    dt[, group.i := 1:.N, ID]
    dt[, previous.date.interval := abs(date -previous.date) > 10, ID]
    dt[, group := cumsum(previous.date.interval) + 1L, ID]
    dt[, .(ID, DATE, group.date, group)]
    #    ID      DATE group.date group
    # 1:  1  1/1/2014 2014-01-01     1
    # 2:  1 1/15/2014 2014-01-15     2
    # 3:  1 1/20/2014 2014-01-15     2
    # 4:  1 1/22/2014 2014-01-15     2
    # 5:  1 3/10/2015 2015-03-10     3
    # 6:  2 1/13/2015 2015-01-13     1
    # 7:  2 1/20/2015 2015-01-13     1
    # 8:  2 1/28/2015 2015-01-28     2
    # 9:  2 2/28/2015 2015-02-28     3
    #10:  2 3/20/2015 2015-03-20     4