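Looking for some advice on writing R code that produces the desired result below. Any help would be greatly appreciated.
Below are the logic, code, sample data, and desired output. The current code does not produce the desired result. I can get the desired result with a for loop, but it takes too long.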
df1 <- read.table(header=T,text='ID DATE ITEM
1 1/1/2014 P1
1 1/15/2014 P2
1 1/20/2015 P3
1 1/22/2015 P4
1 3/10/2015 P5
2 1/13/2015 P1
2 1/20/2015 P2
2 1/28/2015 P3
2 2/28/2015 P4
2 3/20/2015 P5')
df1
library(data.table)
setDT(df1)[, GROUP:={
dt <- as.Date(DATE, "%m/%d/%Y")
gr1 <-cumsum((dt-shift(dt, fill=dt[1L]))>10)+1L; list(gr1)} ,
by = ID]
df1[, GROUPDATE := DATE[1L] , by = .(GROUP, ID)]
df1
===============
# Desired output.
ID DATE ITEM GROUP GROUPDATE
1 1/1/2014 P1 1 1/1/2014
1 1/15/2014 P2 2 1/15/2014
1 1/20/2014 P3 2 1/15/2014
1 1/22/2014 P4 2 1/15/2014
1 3/10/2015 P5 3 3/10/2015
2 1/13/2015 P1 1 1/13/2015
2 1/20/2015 P2 1 1/13/2015
2 1/28/2015 P3 2 1/28/2015
2 2/28/2015 P4 3 2/28/2015
2 3/20/2015 P5 4 3/20/2015
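Answer 0 (score: 1)
This answer describes two approaches, one iterative and one using a non-equi join.
Although both return the desired output for the sample dataset provided by the OP, I feel they should be tested more thoroughly with larger datasets.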
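The OP asks for the group counter to be incremented within each ID whenever (1) the gap to the previous row exceeds 10 days, or (2) the gap to the first date of the current group exceeds 10 days.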
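Condition (1) can be computed directly in one go. Unfortunately, the first day of the group referenced in condition (2) may change while condition (2) is being evaluated. Finding the final solution may therefore take several passes.
The answer below evaluates condition (1) and uses this interim solution to evaluate condition (2) iteratively until no more changes in the group assignment are observed.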
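For testing on larger data, a plain loop that applies the two conditions literally can serve as a reference to compare faster approaches against. The following is only a minimal sketch under that reading of the rule: the function name group_reference and the max_gap parameter are hypothetical and not part of the original post, and the input is assumed to be the dates of a single ID sorted in ascending order.

```r
# Hypothetical reference implementation (not from the original post).
# Assumes `dates` holds the dates of one ID, sorted ascending.
group_reference <- function(dates, max_gap = 10L) {
  days <- as.integer(dates)        # whole days since 1970-01-01
  grp <- integer(length(days))
  current <- 0L
  group_start <- NA_integer_
  for (i in seq_along(days)) {
    new_group <- i == 1L ||
      days[i] - days[i - 1L] > max_gap ||   # condition (1): gap to previous row
      days[i] - group_start > max_gap       # condition (2): gap to group's first date
    if (new_group) {
      current <- current + 1L
      group_start <- days[i]
    }
    grp[i] <- current
  }
  grp
}

# ID 1 of the expected-result table
id1 <- as.Date(c("2014-01-01", "2014-01-15", "2014-01-20",
                 "2014-01-22", "2015-03-10"))
group_reference(id1)    # 1 2 2 2 3

# made-up chain of evenly spaced dates
chain <- as.Date(c("2015-01-01", "2015-01-08", "2015-01-15", "2015-01-22"))
group_reference(chain)  # 1 1 2 2
```

A chain like the second call, where every gap to the previous row is small but the distance to the group's first date keeps growing, is the kind of input worth checking any vectorised solution against.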
The OP has provided two different data tables in his question. Here, we take the second one, which contains the expected result:
library(data.table)
df2 <- fread("ID DATE ITEM GROUP GROUPDATE
1 1/1/2014 P1 1 1/1/2014
1 1/15/2014 P2 2 1/15/2014
1 1/20/2014 P3 2 1/15/2014
1 1/22/2014 P4 2 1/15/2014
1 3/10/2015 P5 3 3/10/2015
2 1/13/2015 P1 1 1/13/2015
2 1/20/2015 P2 1 1/13/2015
2 1/28/2015 P3 2 1/28/2015
2 2/28/2015 P4 3 2/28/2015
2 3/20/2015 P5 4 3/20/2015")
The DATE and GROUPDATE columns need to be coerced to class Date beforehand.
cols <- c("DATE", "GROUPDATE")
df2[, (cols) := lapply(.SD, as.IDate, "%m/%d/%Y"), .SDcols = cols]
Note the implicit assumption that the rows are ordered by DATE within each ID.
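If the input is not guaranteed to be sorted, the ordering can be established after the dates have been parsed; sorting the raw month/day/year strings would not give chronological order. The following is a minimal base R sketch with made-up rows (the data frame raw is hypothetical). With a data.table, setorder(dt, ID, DATE) would be the in-place equivalent.

```r
# made-up, unsorted sample rows
raw <- data.frame(
  ID   = c(2, 1, 1, 1),
  DATE = c("1/13/2015", "1/15/2014", "1/1/2014", "1/2/2014"),
  stringsAsFactors = FALSE
)

# parse first, then sort by ID and by the parsed date
raw$DATE <- as.Date(raw$DATE, format = "%m/%d/%Y")
sorted <- raw[order(raw$ID, raw$DATE), ]

# check that dates never decrease within an ID
is_sorted <- all(tapply(as.integer(sorted$DATE), sorted$ID,
                        function(d) !is.unsorted(d)))
is_sorted
```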
Now we can start the computation. To make the results easy to compare, I have chosen different column names.
dt <- copy(df2) # just for convenience to easily switch between df1 and df2
# create group count for gaps of more than 10 days
dt[, grp := cumsum((DATE - shift(DATE, fill = 0)) > 10L), by = ID]
# set group date
dt[, gdt := first(DATE), by = .(ID, grp)]
# update group count according to condition (2)
tmp <- dt[, cumsum((DATE - shift(gdt, fill = 0)) > 10L), by = ID][, V1]
# repeat as long as there are changes in group counts
while (dt[, any(grp != tmp)]) {
# complete update of group count
dt[, grp := tmp]
# set new group date
dt[, gdt := first(DATE), by = .(ID, grp)]
# update group count according to condition (2)
tmp <- dt[, cumsum((DATE - shift(gdt, fill = 0)) > 10L), by = ID][, V1]
}
dt
    ID       DATE ITEM GROUP  GROUPDATE grp        gdt
 1:  1 2014-01-01   P1     1 2014-01-01   1 2014-01-01
 2:  1 2014-01-15   P2     2 2014-01-15   2 2014-01-15
 3:  1 2014-01-20   P3     2 2014-01-15   2 2014-01-15
 4:  1 2014-01-22   P4     2 2014-01-15   2 2014-01-15
 5:  1 2015-03-10   P5     3 2015-03-10   3 2015-03-10
 6:  2 2015-01-13   P1     1 2015-01-13   1 2015-01-13
 7:  2 2015-01-20   P2     1 2015-01-13   1 2015-01-13
 8:  2 2015-01-28   P3     2 2015-01-28   2 2015-01-28
 9:  2 2015-02-28   P4     3 2015-02-28   3 2015-02-28
10:  2 2015-03-20   P5     4 2015-03-20   4 2015-03-20
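For the second approach, the conditions can be rewritten: a row belongs to a group if the gap to the previous row does not exceed 10 days and the gap to the first row of the group does not exceed 10 days.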
dt <- copy(df2)
# append row id
dt[, rn := .I]
tmp <-
# non-equi join to find all rows which lie within a 10 days interval
dt[dt[, .(ID, start = DATE, end = DATE + 10L)],
on = .(ID, DATE >= start, DATE <= end)][
# aggregate for row id to find the earliest start date
, min(DATE), by = .(ID, rn)][
# create group id (starting at one for each ID)
, rleid(V1), by = ID]
# append group id to original data.table
dt[, grp := tmp$V1][
# set group date
, gdt := first(DATE), by = .(ID, grp)]
dt
    ID       DATE ITEM GROUP  GROUPDATE rn grp        gdt
 1:  1 2014-01-01   P1     1 2014-01-01  1   1 2014-01-01
 2:  1 2014-01-15   P2     2 2014-01-15  2   2 2014-01-15
 3:  1 2014-01-20   P3     2 2014-01-15  3   2 2014-01-15
 4:  1 2014-01-22   P4     2 2014-01-15  4   2 2014-01-15
 5:  1 2015-03-10   P5     3 2015-03-10  5   3 2015-03-10
 6:  2 2015-01-13   P1     1 2015-01-13  6   1 2015-01-13
 7:  2 2015-01-20   P2     1 2015-01-13  7   1 2015-01-13
 8:  2 2015-01-28   P3     2 2015-01-28  8   2 2015-01-28
 9:  2 2015-02-28   P4     3 2015-02-28  9   3 2015-02-28
10:  2 2015-03-20   P5     4 2015-03-20 10   4 2015-03-20
Step-by-step explanation
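The idea is to find, in a single operation, the "group date" that a particular row belongs to. This is done by finding all the 10-day intervals a row could belong to and then aggregating to take the earliest start date for each row. The result is then used to create the groups with the rleid() function.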
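As a small illustration of that last step, rleid() from data.table assigns a new id each time the value changes from one element to the next. The vector below is made up for the example and mirrors the earliest start dates found for ID 1:

```r
library(data.table)

# earliest interval start per row (illustrative values for ID 1)
earliest <- as.IDate(c("2014-01-01", "2014-01-15", "2014-01-15",
                       "2014-01-15", "2015-03-10"))

# consecutive equal values share one run id
grp <- rleid(earliest)
grp
```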
The expression below creates a helper data.table containing the 10-day intervals:
dt[, .(ID, start = DATE, end = DATE + 10L)]
    ID      start        end
 1:  1 2014-01-01 2014-01-11
 2:  1 2014-01-15 2014-01-25
 3:  1 2014-01-20 2014-01-30
 4:  1 2014-01-22 2014-02-01
 5:  1 2015-03-10 2015-03-20
 6:  2 2015-01-13 2015-01-23
 7:  2 2015-01-20 2015-01-30
 8:  2 2015-01-28 2015-02-07
 9:  2 2015-02-28 2015-03-10
10:  2 2015-03-20 2015-03-30
The non-equi join finds all rows that lie within a 10-day interval:
dt[dt[, .(ID, start = DATE, end = DATE + 10L)],
on = .(ID, DATE >= start, DATE <= end)]
    ID       DATE ITEM GROUP  GROUPDATE rn     DATE.1
 1:  1 2014-01-01   P1     1 2014-01-01  1 2014-01-11
 2:  1 2014-01-15   P2     2 2014-01-15  2 2014-01-25
 3:  1 2014-01-15   P3     2 2014-01-15  3 2014-01-25
 4:  1 2014-01-15   P4     2 2014-01-15  4 2014-01-25
 5:  1 2014-01-20   P3     2 2014-01-15  3 2014-01-30
 6:  1 2014-01-20   P4     2 2014-01-15  4 2014-01-30
 7:  1 2014-01-22   P4     2 2014-01-15  4 2014-02-01
 8:  1 2015-03-10   P5     3 2015-03-10  5 2015-03-20
 9:  2 2015-01-13   P1     1 2015-01-13  6 2015-01-23
10:  2 2015-01-13   P2     1 2015-01-13  7 2015-01-23
11:  2 2015-01-20   P2     1 2015-01-13  7 2015-01-30
12:  2 2015-01-20   P3     2 2015-01-28  8 2015-01-30
13:  2 2015-01-28   P3     2 2015-01-28  8 2015-02-07
14:  2 2015-02-28   P4     3 2015-02-28  9 2015-03-10
15:  2 2015-03-20   P5     4 2015-03-20 10 2015-03-30
In this result, DATE holds the start of the interval and DATE.1 its end, while rn identifies the row that falls into the interval.
For example, the interval [2014-01-15, 2014-01-25] includes rows 2, 3, and 4. Row 4 (DATE: 2014-01-22), on the other hand, belongs to three different intervals: [2014-01-15, 2014-01-25], [2014-01-20, 2014-01-30], and [2014-01-22, 2014-02-01].
Now, for each row, we pick the interval with the earliest start date:
dt[dt[, .(ID, start = DATE, end = DATE + 10L)],
on = .(ID, DATE >= start, DATE <= end)][
, min(DATE), by = .(ID, rn)]
    ID rn         V1
 1:  1  1 2014-01-01
 2:  1  2 2014-01-15
 3:  1  3 2014-01-15
 4:  1  4 2014-01-15
 5:  1  5 2015-03-10
 6:  2  6 2015-01-13
 7:  2  7 2015-01-13
 8:  2  8 2015-01-20
 9:  2  9 2015-02-28
10:  2 10 2015-03-20
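The "earliest interval start" step can also be cross-checked without a join. The following is only a base R sketch for the five dates of ID 2; the variable names dates and earliest_start are made up for the illustration:

```r
# dates of ID 2 from the sample data
dates <- as.Date(c("2015-01-13", "2015-01-20", "2015-01-28",
                   "2015-02-28", "2015-03-20"))

# for each row, the earliest start of any 10-day interval [start, start + 10]
# that contains the row's date
earliest_num <- vapply(seq_along(dates), function(i) {
  covering <- dates[dates <= dates[i] & dates[i] <= dates + 10]
  as.numeric(min(covering))
}, numeric(1))
earliest_start <- as.Date(earliest_num, origin = "1970-01-01")
earliest_start
```

The values agree with rows 6 to 10 of the V1 column. Note that the earliest start of the third date (2015-01-20) is not itself the first date of a group; only the run structure of these values is used to derive the group id.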
Consecutive rows with the same V1 belong to the same group. Therefore, we can use the rleid() function to create the group ids.
tmp <-
dt[dt[, .(ID, start = DATE, end = DATE + 10L)],
on = .(ID, DATE >= start, DATE <= end)][
, min(DATE), by = .(ID, rn)][
, rleid(V1), by = ID]
tmp
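    ID rn        gdt grp
 1:  1  1 2014-01-01   1
 2:  1  2 2014-01-15   2
 3:  1  3 2014-01-15   2
 4:  1  4 2014-01-15   2
 5:  1  5 2015-03-10   3
 6:  2  6 2015-01-13   1
 7:  2  7 2015-01-13   1
 8:  2  8 2015-01-20   2
 9:  2  9 2015-02-28   3
10:  2 10 2015-03-20   4
Note that with the code exactly as written above, tmp contains only the columns ID and V1, where V1 holds the group id shown here as grp.
The final steps are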
# append group id to original data.table
dt[, grp := tmp$V1][
# set group date
, gdt := first(DATE), by = .(ID, grp)]
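Answer 1 (score: 0)
I think the simplest approach is to capture the date-related logic in a vectorised function; see groupDates(x) below. I have used recursive logic for that function, and there may be a better way to implement it.
For computing the field that counts the changes, you can use cumsum.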
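The cumsum idea can be sketched in isolation: a logical vector that marks where a new group starts, once accumulated, yields a running group number. The flags below are made up to match ID 1 of the sample data:

```r
# TRUE marks a row that starts a new group (illustrative values)
starts_new_group <- c(FALSE, TRUE, FALSE, FALSE, TRUE)

# running count of group starts; + 1L so that numbering begins at 1
group <- cumsum(starts_new_group) + 1L
group
```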
df1 <- read.table(header=T,text='ID DATE ITEM
1 1/1/2014 P1
1 1/15/2014 P2
1 1/20/2014 P3
1 1/22/2014 P4
1 3/10/2015 P5
2 1/13/2015 P1
2 1/20/2015 P2
2 1/28/2015 P3
2 2/28/2015 P4
2 3/20/2015 P5')
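library(data.table)
dt <- as.data.table(df1)
# safe.ifelse() is not defined in the original answer; this is an assumed
# helper that behaves like ifelse() but keeps the class (here Date) of `yes`
safe.ifelse <- function(cond, yes, no) {
structure(ifelse(cond, yes, no), class = class(yes))
}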
groupDates <- function(x) {
x.prev <- c(head(x, 1), head(x, -1))
x.diff <- abs(c(0, diff(x)))
x.diff.big <- x.diff > 10 | x.diff == 0
x.diff.prev.big <- c(TRUE, head(x.diff.big, -1))
x[!x.diff.big & x.diff.prev.big] <- NA
x <- safe.ifelse(!is.na(x), x, x.prev)
d <- diff(x)
if(min(d[d > 0]) < 10) {
groupDates(x)
}
else {
x
}
}
dt[, date := as.Date(DATE, format = "%m/%d/%Y")]
dt[, group.date := groupDates(date), ID]
dt[, previous.date := shift(group.date, fill = first(date)), ID]
dt[, group.i := 1:.N, ID]
dt[, previous.date.interval := abs(date -previous.date) > 10, ID]
dt[, group := cumsum(previous.date.interval) + 1L, ID]
dt[, .(ID, DATE, group.date, group)]
# ID DATE group.date group
# 1: 1 1/1/2014 2014-01-01 1
# 2: 1 1/15/2014 2014-01-15 2
# 3: 1 1/20/2014 2014-01-15 2
# 4: 1 1/22/2014 2014-01-15 2
# 5: 1 3/10/2015 2015-03-10 3
# 6: 2 1/13/2015 2015-01-13 1
# 7: 2 1/20/2015 2015-01-13 1
# 8: 2 1/28/2015 2015-01-28 2
# 9: 2 2/28/2015 2015-02-28 3
#10: 2 3/20/2015 2015-03-20 4