我想找到每个ID的重叠日期,并创建一个具有重叠日期的新行,并合并行的字符(char)。我的数据可能具有> 2重叠并且需要> 2个字符组合。例如。 ERM
数据:
execute block
returns (
ORDERNUMBER varchar(80),
LINECODE integer,
ORDERQTY integer,
ORDERNUMBER1 varchar(80),
COMPONENTCODE integer,
ISSUEQTY integer)
as
declare variable tmp_str varchar(20);
BEGIN
tmp_str='0'; /* Set temp variable to unavalable value */
FOR
select
assemblylines.ordernumber,
assemblylines.linecode,
assemblylines.orderqty,
assemblylineissues.ordernumber,
assemblylineissues.componentcode,
assemblylineissues.issueqty
from assemblylineissues
right outer join assemblylines on (assemblylineissues.headersysuniqueid = assemblylines.sysuniqueid)
INTO :ORDERNUMBER,
:LINECODE,
:ORDERQTY,
:ORDERNUMBER1,
:COMPONENTCODE,
:ISSUEQTY
DO
BEGIN
if (:tmp_str=:ordernumber) then
orderqty = null; /* Set ORDER QTY to null if this is not the first selected record of ASSEMBLYLINES.ORDERNUMBER */
tmp_str = :ordernumber;
SUSPEND;
END
END
输出我想:
ID date1 date2 char
15 2003-04-05 2003-05-06 E
15 2003-04-20 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2007-02-22 I
17 2005-04-15 2014-05-19 C
17 2007-05-15 2008-02-05 I
17 2008-02-05 2012-02-14 M
17 2010-06-07 2011-02-14 V
17 2010-09-22 2014-05-19 P
17 2012-02-28 2013-03-04 R
我尝试过: 我尝试过使用以下行从当前行中减去日期2:
ID date1 date2 char
15 2003-04-05 2003-04-20 E
15 2003-04-20 2003-05-06 ER
15 2003-05-06 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2005-04-15 I
17 2005-04-15 2007-02-22 IC
17 2005-04-15 2007-05-15 C
17 2007-05-15 2008-02-05 CI
17 2008-02-05 2012-02-14 CM
17 2010-06-07 2011-02-14 CV
17 2010-09-22 2014-05-19 CP
17 2012-02-28 2013-03-04 CR
17 2014-05-19 2014-05-19 P
17 2010-06-07 2012-02-14 MV
17 2010-09-22 2011-02-14 VP
17 2012-02-28 2013-03-04 RP
然后确定行之间的重叠:
df$diff <- c(NA,df[2:nrow(tdf), "date1"] - df[1:(nrow(df)-1), "date2"])
然后我选择了那些有overlap.final == 1并将它们放入另一个数据帧并找到每个ID的重叠。
然而,我已经意识到这太简单和有缺陷,因为它只选择顺序发生的重叠(使用第一步中的日期差异)。我需要做的是为每个ID取一系列日期并循环遍历每个组合以确定是否存在重叠然后,如果是,则记录开始和结束日期并创建一个新字符“char”,表示什么是在这两个日期合并。我想我需要一个循环才能做到这一点。
我尝试创建一个循环来查找date1和date 2之间的重叠间隔
df$overlap[which(df$diff<1)] <-1
df$overlap.up <- c(df$overlap[2:(nrow(df))], "NA")
df$overlap.final[which(df$overlap==1 | df$overlap.up==1)] <- 1
然而,这不起作用。
在确定重叠间隔后,我需要为每个新的开始和结束日期创建新行,并为表示重叠的新字符创建新行。
我尝试识别重叠的另一个循环版本:
df <- df[which(!duplicated(df$ ID)),]
for (i in 1:nrow(df)) {
tmp <- length(which(df $ID[i] & (df$date1[i] >df$date1 & df$date1[i]< df$date2) | (df$date2[i] < df$date2& df$date2[i]> df$date1))) >0
df$int[i]<- tmp
}
答案 0 :(得分:9)
首先,我们为每个data.table
创建所有可能间隔的ID
。
所有可能的间隔表示我们获取ID
的所有开始日期和结束日期,并将它们组合在已排序的向量tmp
中。唯一值表示时间轴上ID
的所有给定间隔的所有可能交叉点(或中断)。对于以后的加入,中断会在每行一个时间间隔内重新排列start
和end
列:
library(data.table)
options(datatable.print.class = TRUE)
breaks <- DT[, {
tmp <- unique(sort(c(date1, date2)))
.(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = ID]
breaks
ID start end <int> <IDat> <IDat> 1: 15 2003-04-05 2003-04-20 2: 15 2003-04-20 2003-05-06 3: 15 2003-05-06 2003-06-20 4: 16 2001-01-02 2002-03-04 5: 17 2003-03-05 2005-04-15 6: 17 2005-04-15 2007-02-22 7: 17 2007-02-22 2007-05-15 8: 17 2007-05-15 2008-02-05 9: 17 2008-02-05 2010-06-07 10: 17 2010-06-07 2010-09-22 11: 17 2010-09-22 2011-02-14 12: 17 2011-02-14 2012-02-14 13: 17 2012-02-14 2012-02-28 14: 17 2012-02-28 2013-03-04 15: 17 2013-03-04 2014-05-19
然后,执行非equi join ,从而在连接条件下同时聚合值(by = .EACHI
被每个i 分组称为,请参阅this answer有更详细的解释):
DT[breaks, on = .(ID, date1 <= start, date2 >= end), paste(char, collapse = ""),
by = .EACHI, allow.cartesian = TRUE]
ID date1 date2 V1 <int> <IDat> <IDat> <char> 1: 15 2003-04-05 2003-04-20 E 2: 15 2003-04-20 2003-05-06 ER 3: 15 2003-05-06 2003-06-20 R 4: 16 2001-01-02 2002-03-04 M 5: 17 2003-03-05 2005-04-15 I 6: 17 2005-04-15 2007-02-22 IC 7: 17 2007-02-22 2007-05-15 C 8: 17 2007-05-15 2008-02-05 CI 9: 17 2008-02-05 2010-06-07 CM 10: 17 2010-06-07 2010-09-22 CMV 11: 17 2010-09-22 2011-02-14 CMVP 12: 17 2011-02-14 2012-02-14 CMP 13: 17 2012-02-14 2012-02-28 CP 14: 17 2012-02-28 2013-03-04 CPR 15: 17 2013-03-04 2014-05-19 CP
结果与OP发布的预期结果不同,但绘制数据表明上述结果显示所有可能的重叠:
library(ggplot2)
ggplot(DT) + aes(y = char, yend = char, x = date1, xend = date2) +
geom_segment() + facet_wrap("ID", ncol = 1L)
library(data.table)
DT <- fread(
"ID date1 date2 char
15 2003-04-05 2003-05-06 E
15 2003-04-20 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2007-02-22 I
17 2005-04-15 2014-05-19 C
17 2007-05-15 2008-02-05 I
17 2008-02-05 2012-02-14 M
17 2010-06-07 2011-02-14 V
17 2010-09-22 2014-05-19 P
17 2012-02-28 2013-03-04 R"
)
cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
答案 1 :(得分:1)
您添加到问题中的for
- 循环以及包含的比较是一个良好的开端。在日期比较中应该是一些额外的括号(
和)
。此for
- 循环方法自动考虑数据框中的新行。因此,您可以在char
列中获得三个,四个和更多字符的字符串。
df = as.data.frame(list('ID'=c(15, 15, 16, 17, 17, 17, 17, 17, 17, 17),
'date1'=as.Date(c('2003-04-05', '2003-04-20', '2001-01-02', '2003-03-05', '2005-04-15', '2007-05-15', '2008-02-05', '2010-06-07', '2010-09-22', '2012-02-28')),
'date2'=as.Date(c('2003-05-06', '2003-06-20', '2002-03-04', '2007-02-22', '2014-05-19', '2008-02-05', '2012-02-14', '2011-02-14', '2014-05-19', '2013-03-04')),
'char'=c('E', 'R', 'M', 'I', 'C', 'I', 'M', 'V', 'P', 'R')),
stringsAsFactors=FALSE)
迭代所有行(原始data.frame中存在的行)并将它们与所有当前行进行比较。
nrow_init = nrow(df)
for (i in 1:(nrow(df)-1)) {
print(i)
## get rows of df that have overlapping dates
## (1:nrow(df))>i :: consider only rows below the current row to avoid double processing of two row-pairs
## (!grepl(df$char[i],df$char)) :: prevent double letters
## Because we call nrow(df) each time (and not save it as a variable once in the beginning), we consider also new rows here. Therefore, we do not need the specific procedure for comparing 3 or more rows.
loc = ((1:nrow(df))>i) & (!grepl(df$char[i],df$char)) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date1>df$date1[i]) & (df$date1<df$date2[i])) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)) | ((df$date2<df$date2[i]) & (df$date2>df$date1[i])))
## Uncomment this line, if you want to compare only two rows each and not more
# loc = ((1:nrow(df))<=nrow_init) & ((1:nrow(df))>i) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)))
## proceed only of at least one duplicate row was found
if (sum(loc) > 0) {
# build new rows
# pmax and pmin do element-wise min and max calculation; df$date1[i] and df$date2[i] are automatically extended to the length of df$date1[loc] and df$date2[loc], respectively
df_append = as.data.frame(list('ID'=df$ID[loc],
'date1'=pmax(df$date1[i],df$date1[loc]),
'date2'=pmin(df$date2[i],df$date2[loc]),
'char'=paste0(df$char[i],df$char[loc])))
## append new rows
df = rbind(df, df_append)
}
}
## create a new column and sort the characters in it
## idea for sort: https://stackoverflow.com/a/5904854/4612235
df$sort_char = df$char
for (i in 1:nrow(df)) df$sort_char[i] = paste(sort(unlist(strsplit(df$sort_char[i], ""))), collapse = "")
## remove duplicates
df = df[!duplicated(df[c('ID', 'date1', 'date2', 'sort_char')]),]
## remove additional column
df$sort_char = NULL
ID date1 date2 char
15 2003-04-05 2003-05-06 E
15 2003-04-20 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2007-02-22 I
17 2005-04-15 2014-05-19 C
17 2007-05-15 2008-02-05 I
17 2008-02-05 2012-02-14 M
17 2010-06-07 2011-02-14 V
17 2010-09-22 2014-05-19 P
17 2012-02-28 2013-03-04 R
15 2003-04-20 2003-05-06 ER
17 2005-04-15 2007-02-22 IC
17 2007-05-15 2008-02-05 CI
17 2008-02-05 2012-02-14 CM
17 2010-06-07 2011-02-14 CV
17 2010-09-22 2014-05-19 CP
17 2012-02-28 2013-03-04 CR
17 2010-06-07 2011-02-14 MV
17 2010-09-22 2012-02-14 MP
17 2010-06-07 2011-02-14 MCV
17 2010-09-22 2012-02-14 MCP
17 2010-09-22 2011-02-14 VP
17 2010-09-22 2011-02-14 VCP
17 2010-09-22 2011-02-14 VMP
17 2010-09-22 2011-02-14 VMCP
17 2012-02-28 2013-03-04 PR
17 2012-02-28 2013-03-04 PCR