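I have two datasets (df1 and df2), both of which contain date-time values. I want to produce the "objective output" shown below. When merging the two datasets by c("id1", "id2"), I want to leave NA wherever the times do not overlap.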
DF1
id1 id2 click_timing
1 11 2015-02-03 01:00:00
1 11 2015-02-03 02:00:00
1 12 2015-02-03 03:00:00
1 12 2015-02-03 04:00:00
1 13 2015-02-03 05:10:00
2 34 2015-02-03 03:00:00
2 34 2015-02-03 04:00:00
2 36 2015-02-03 01:00:00
...
DF2
id1 id2 start end
1 11 2015-02-03 00:20:00 2015-02-03 00:40:00
1 11 2015-02-03 00:50:00 2015-02-03 01:20:00
1 13 2015-02-03 01:10:00 2015-02-03 01:40:00
1 13 2015-02-03 04:50:00 2015-02-03 05:30:00
2 34 2015-02-03 03:50:00 2015-02-03 04:10:00
...
Objective output
id1 id2 click_timing start end
1 11 NA 2015-02-03 00:20:00 2015-02-03 00:40:00
1 11 2015-02-03 01:00:00 2015-02-03 00:50:00 2015-02-03 01:20:00
1 11 2015-02-03 02:00:00 NA NA
1 12 2015-02-03 03:00:00 NA NA
1 12 2015-02-03 04:00:00 NA NA
1 13 NA 2015-02-03 01:10:00 2015-02-03 01:40:00
1 13 2015-02-03 05:10:00 2015-02-03 04:50:00 2015-02-03 05:30:00
2 34 2015-02-03 03:00:00 NA NA
2 34 2015-02-03 04:00:00 2015-02-03 03:50:00 2015-02-03 04:10:00
2 36 2015-02-03 01:00:00 NA NA
...
Answer 0: (score: 1)
Tough problem! I think you have to precompute the intersection between each click_timing value and each time span (start and end) by manually looping over all click_timing values, and then use the resulting match index as an additional join field:
df1 <- data.frame(id1=c(1,1,1,1,1,2,2,2), id2=c(11,11,12,12,13,34,34,36), click_timing=as.POSIXct(c('2015-02-03 01:00:00','2015-02-03 02:00:00','2015-02-03 03:00:00','2015-02-03 04:00:00','2015-02-03 05:10:00','2015-02-03 03:00:00','2015-02-03 04:00:00','2015-02-03 01:00:00')) );
df2 <- data.frame(id1=c(1,1,1,1,2), id2=c(11,11,13,13,34), start=as.POSIXct(c('2015-02-03 00:20:00','2015-02-03 00:50:00','2015-02-03 01:10:00','2015-02-03 04:50:00','2015-02-03 03:50:00')), end=as.POSIXct(c('2015-02-03 00:40:00','2015-02-03 01:20:00','2015-02-03 01:40:00','2015-02-03 05:30:00','2015-02-03 04:10:00')) );
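# For each df1 row, find the index of the first df2 row with the same ids whose [start, end] span contains click_timing (NA if there is none)
m <- sapply(seq_len(nrow(df1)), function(i) which(df1$id1[i]==df2$id1 & df1$id2[i]==df2$id2 & df1$click_timing[i]>=df2$start & df1$click_timing[i]<=df2$end)[1] );
# Full outer join on the ids plus the match index, then drop the helper column m (column 3)
merge(cbind(df1,m=m),cbind(df2,m=seq_len(nrow(df2))),by=c('id1','id2','m'),all=TRUE)[-3];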
## id1 id2 click_timing start end
## 1 1 11 <NA> 2015-02-03 00:20:00 2015-02-03 00:40:00
## 2 1 11 2015-02-03 01:00:00 2015-02-03 00:50:00 2015-02-03 01:20:00
## 3 1 11 2015-02-03 02:00:00 <NA> <NA>
## 4 1 12 2015-02-03 04:00:00 <NA> <NA>
## 5 1 12 2015-02-03 03:00:00 <NA> <NA>
## 6 1 13 <NA> 2015-02-03 01:10:00 2015-02-03 01:40:00
## 7 1 13 2015-02-03 05:10:00 2015-02-03 04:50:00 2015-02-03 05:30:00
## 8 2 34 2015-02-03 04:00:00 2015-02-03 03:50:00 2015-02-03 04:10:00
## 9 2 34 2015-02-03 03:00:00 <NA> <NA>
## 10 2 36 2015-02-03 01:00:00 <NA> <NA>
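If a single click_timing value intersects multiple start and end pairs, this solution selects the match that occurs earliest in df2 (that is, the one with the lowest row index).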
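The tie-breaking rule comes from indexing the result of which() with [1]. A minimal sketch on hypothetical toy data (the names click, intervals, hits, first_hit and no_hit are illustrative and not part of the answer) shows both the multiple-match case and the no-match case:

```r
# Hypothetical toy data: one click that falls inside two overlapping intervals.
click <- as.POSIXct("2015-02-03 01:00:00", tz = "UTC")
intervals <- data.frame(
  start = as.POSIXct(c("2015-02-03 00:50:00", "2015-02-03 00:55:00"), tz = "UTC"),
  end   = as.POSIXct(c("2015-02-03 01:20:00", "2015-02-03 01:05:00"), tz = "UTC")
)

# which() returns every matching row index, in row order.
hits <- which(click >= intervals$start & click <= intervals$end)  # 1 2

# Taking [1] keeps only the lowest row index.
first_hit <- hits[1]  # 1

# With no match, which() returns integer(0), and integer(0)[1] is NA.
# That NA is what leaves unmatched df1 rows without a partner in the merge.
no_hit <- which(click > intervals$end)[1]  # NA
```

If all overlapping spans should be kept instead of only the first, the loop would need to return every index rather than hits[1], which this sketch does not attempt.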
Answer 1: (score: 1)
Recreate the initial data frames and do some minor preparation:
library(data.table)
library(lubridate)
df1<- fread("id1,id2,click_timing
1,11,2015-02-03 01:00:00
1,11,2015-02-03 02:00:00
1,12,2015-02-03 03:00:00
1,12,2015-02-03 04:00:00
1,13,2015-02-03 05:10:00
2,34,2015-02-03 03:00:00
2,34,2015-02-03 04:00:00
2,36,2015-02-03 01:00:00")
# adding a redundant click_timing2 column to use as the end range for further foverlaps() function
df1[, click_timing2:= click_timing]
df1[,c("click_timing", "click_timing2"):= list(parse_date_time(click_timing, "%Y-%m-%d %T"), parse_date_time(click_timing2, "%Y-%m-%d %T"))]
df2<- fread("id1,id2,start,end
1,11,2015-02-03 00:20:00,2015-02-03 00:40:00
1,11,2015-02-03 00:50:00,2015-02-03 01:20:00
1,13,2015-02-03 01:10:00,2015-02-03 01:40:00
1,13,2015-02-03 04:50:00,2015-02-03 05:30:00
2,34,2015-02-03 03:50:00,2015-02-03 04:10:00")
df2[,c("start","end") := list(parse_date_time(start, "%Y-%m-%d %T"), parse_date_time(end, "%Y-%m-%d %T"))]
setkey(df2, id1, id2, start, end)
Solution:
df3<- foverlaps(df1, df2, by.x=c("id1", "id2", "click_timing", "click_timing2"),
by.y = c("id1", "id2", "start", "end"), type="within")
objective_output<- merge(df3, df2, by = c("id1", "id2", "start", "end"), all = T)
# deleting redundant click_timing2 column
objective_output[,click_timing2:= NULL]
# reordering columns
setcolorder(objective_output, c(1,2,5,3,4))
#setting key using all columns and thus reordering all rows
setkey(objective_output)
objective_output
#id1 id2 click_timing start end
# 1: 1 11 2015-02-03 02:00:00 <NA> <NA>
# 2: 1 11 <NA> 2015-02-03 00:20:00 2015-02-03 00:40:00
# 3: 1 11 2015-02-03 01:00:00 2015-02-03 00:50:00 2015-02-03 01:20:00
# 4: 1 12 2015-02-03 03:00:00 <NA> <NA>
# 5: 1 12 2015-02-03 04:00:00 <NA> <NA>
# 6: 1 13 <NA> 2015-02-03 01:10:00 2015-02-03 01:40:00
# 7: 1 13 2015-02-03 05:10:00 2015-02-03 04:50:00 2015-02-03 05:30:00
# 8: 2 34 2015-02-03 03:00:00 <NA> <NA>
# 9: 2 34 2015-02-03 04:00:00 2015-02-03 03:50:00 2015-02-03 04:10:00
#10: 2 36 2015-02-03 01:00:00 <NA> <NA>
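To illustrate why the duplicated click_timing2 column is needed, here is a minimal sketch on hypothetical integer data (the tables x and y and the columns t1 and t2 are made up for illustration). foverlaps() expects an interval on both sides, so a single point in time is passed as a zero-length interval whose start and end are the same value:

```r
library(data.table)

# Hypothetical points: one inside the interval, one outside it.
x <- data.table(id = c(1L, 1L), t1 = c(5L, 50L))
x[, t2 := t1]  # zero-length interval [t1, t1]

# Hypothetical interval table; foverlaps() requires y to be keyed,
# with the interval start and end as the last two key columns.
y <- data.table(id = 1L, start = 0L, end = 10L)
setkey(y, id, start, end)

# type = "within" keeps matches where [t1, t2] lies inside [start, end].
# With the default nomatch = NA, unmatched x rows are kept with NA start/end.
res <- foverlaps(x, y, by.x = c("id", "t1", "t2"), type = "within")
res
```

Rows of y that no point falls into are not returned by this call, which is why the answer follows it with a merge against df2 using all = T.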