I have the following data:
household  person  start_time  end_time
1          1       07:45:00    21:45:00
1          2       09:45:00    17:45:00
1          3       22:45:00    23:45:00
1          4       08:45:00    01:45:00
1          1       23:50:00    24:00:00
2          1       07:45:00    21:45:00
2          2       16:45:00    22:45:00
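I want to add a column that flags overlapping times between members of the same household.
The indicator is 1 if a person's start-to-end interval intersects that of any other member of the household, and 0 otherwise.
In the example above, the times of the first, second and fourth persons in household 1 intersect, so their indicator is 1, while the third and fifth rows do not intersect with anyone else in that household.
Output: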
household  person  start_time  end_time  overlap
1          1       07:45:00    21:45:00  1
1          2       09:45:00    17:45:00  1
1          3       22:45:00    23:45:00  0
1          4       08:45:00    01:45:00  1
1          1       23:50:00    24:00:00  0
2          1       07:45:00    21:45:00  1
2          2       16:45:00    22:45:00  1
Data in dput format:
structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L, 2L), PERNO = c(1,
1, 1, 1, 1, 1), arr = structure(c(30300, 35280, 37200, 32400,
34200, 39600), class = c("hms", "difftime"), units = "secs"),
dep = structure(c(34200, 36300, 61800, 33600, 37800, 50400
), class = c("hms", "difftime"), units = "secs")), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
SAMPN = 1:2, PERNO = c(1, 1), .rows = list(1:3, 4:6)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
Answer 0 (score: 0)
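Tidyverse solution
Here is a solution in tidyverse syntax. The basic idea is the same. We do a many-to-many join within each household (SAMPN in your current sample data) and drop the cases where a person is compared with themselves (PERNO). We check each pair for overlap, then collapse to a single record per household and person. Note that if all records have the same PERNO, this code will not work, because the filter step removes every row.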
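library(dplyr)

compare <-
  df %>%
  # pair every record with every record from the same household
  left_join(df %>%
              rename(compare_PERNO = PERNO,
                     compare_arr = arr,
                     compare_dep = dep), by = "SAMPN") %>%
  # drop the pairs that compare a person with themselves
  filter(PERNO != compare_PERNO) %>%
  # two intervals overlap when each one starts before the other ends
  mutate(overlap = arr <= compare_dep & dep >= compare_arr) %>%
  # one row per household and person: 1 if any pair overlapped
  group_by(SAMPN, PERNO) %>%
  summarize(overlap = max(overlap))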
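The overlap condition used in the mutate() step can be tried on its own. A minimal sketch in base R, assuming times are plain seconds since midnight; `intervals_overlap` is a hypothetical helper name and not part of dplyr:

```r
# Hypothetical helper: TRUE when the closed intervals [a_start, a_end]
# and [b_start, b_end] share at least one point.
intervals_overlap <- function(a_start, a_end, b_start, b_end) {
  a_start <= b_end & a_end >= b_start
}

# 07:45-21:45 against 09:45-17:45 (seconds since midnight)
first_pair <- intervals_overlap(27900, 78300, 35100, 63900)   # TRUE
# 22:45-23:45 against 07:45-21:45
second_pair <- intervals_overlap(81900, 85500, 27900, 78300)  # FALSE
```

Because both comparisons use <= and >=, two intervals that only touch at an endpoint count as overlapping; switch to < and > if that is not what you want.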
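SQL solution with household grouping
Grouping the data by household actually makes this problem slightly easier. Again, I am using SQL to solve it. In the inner SQL statement I do a many-to-many match that pairs every member of a household with every other member, dropping all cases where a person is matched with themselves. Then, in the outer SQL statement, we reduce the records to one per household and person, indicating whether that person overlaps with anyone.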
df <- data.frame(
  household = c(rep(1, 5), rep(2, 2)),
  person = c(1:5, 1:2),
  start_time = as.Date(c("2017-05-31", "2018-01-14", "2019-02-03", "2018-01-19", "2019-04-17",
                         "2018-02-03", "2018-03-03"),
                       format = "%Y-%m-%d"),
  end_time = as.Date(c("2018-01-17", "2018-01-20", "2019-04-15", "2018-02-20", "2019-05-17",
                       "2019-03-03", "2019-03-03"),
                     format = "%Y-%m-%d"))
library(sqldf)
compare <- sqldf(
  "
  SELECT * FROM (
    SELECT L.* ,
      CASE WHEN L.start_time <= R.end_time AND L.end_time >= R.start_time THEN 1
      ELSE 0 END AS overlap
    FROM df as L
    LEFT JOIN df as R ON L.household = R.household
    WHERE L.person != R.person
  )
  GROUP BY household, person
  HAVING overlap = MAX(overlap)
  "
)
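For readers without sqldf, the same join, filter and collapse steps can be sketched in base R. This is an illustration under assumptions, not the answer's own code: the toy data uses numeric times (seconds since midnight), and the column names ending in `_other` are made up for the example.

```r
# Assumed toy data: numeric times in seconds since midnight
df <- data.frame(
  household  = c(1, 1, 1, 2, 2),
  person     = c(1, 2, 3, 1, 2),
  start_time = c(27900, 35100, 81900, 27900, 60300),
  end_time   = c(78300, 63900, 85500, 78300, 81900)
)

# Copy of the data with renamed columns, used as the "other member" side
other <- df
names(other) <- c("household", "person_other", "start_other", "end_other")

# Many-to-many self-join within household
pairs <- merge(df, other, by = "household")
# Drop the pairs that match a person with themselves
pairs <- pairs[pairs$person != pairs$person_other, ]
# Apply the overlap condition to each remaining pair
pairs$overlap <- as.integer(pairs$start_time <= pairs$end_other &
                              pairs$end_time >= pairs$start_other)
# Collapse to one row per household and person
result <- aggregate(overlap ~ household + person, data = pairs, FUN = max)
result <- result[order(result$household, result$person), ]
```

As with the SQL version, a household with a single member produces no pairs and therefore drops out of the result.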
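SQL solution without household grouping
Here is a SQL solution to your problem. I do a keyless many-to-many merge to compare every row with every other row (but not with itself), then collapse the large data frame to a single record per ID that records whether a match was found. Your data was not posted in a reproducible form (use the dput function in R), so I used an example dataset. If you cannot adapt this to your exact data, please post reproducible data and I will help.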
df <- data.frame(
  id = 1:3,
  start_time = as.Date(c("2017-05-31", "2018-01-14", "2018-02-03"), format = "%Y-%m-%d"),
  end_time = as.Date(c("2018-01-17", "2018-01-20", "2018-04-17"), format = "%Y-%m-%d"))
library(sqldf)
compare <- sqldf(
  "
  SELECT * FROM (
    SELECT L.* ,
      CASE WHEN L.start_time <= R.end_time AND L.end_time >= R.start_time THEN 1
      ELSE 0 END AS overlap
    FROM df as L
    CROSS JOIN df as R
    WHERE L.id != R.id
  )
  GROUP BY id
  HAVING overlap = MAX(overlap)
  "
)
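The all-against-all comparison can also be written as a matrix operation in base R. The following is only a sketch using the same three example intervals; the variable names `s`, `e` and `ov` are arbitrary.

```r
# Same three example intervals as above
df <- data.frame(
  id = 1:3,
  start_time = as.Date(c("2017-05-31", "2018-01-14", "2018-02-03")),
  end_time   = as.Date(c("2018-01-17", "2018-01-20", "2018-04-17"))
)

# Work on numeric day counts so outer() compares plain numbers
s <- as.numeric(df$start_time)
e <- as.numeric(df$end_time)

# ov[i, j] is TRUE when interval i overlaps interval j
ov <- outer(s, e, "<=") & outer(e, s, ">=")
# Every interval overlaps itself, so ignore the diagonal
diag(ov) <- FALSE
# 1 if the row overlaps at least one other row, 0 otherwise
df$overlap <- as.integer(rowSums(ov) > 0)
```

Here ids 1 and 2 overlap each other and id 3 overlaps nobody. The matrix has one cell per pair of rows, so this suits small data better than large tables.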
Answer 1 (score: 0)
I tried a tidyverse solution:
library(tidyverse)
df = structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L, 2L),
PERNO = c(1:3, 1:3), arr = structure(c(30300, 35280, 37200, 32400,
34200, 39600), class = c("hms", "difftime"), units = "secs"),
dep = structure(c(34200, 36300, 61800, 33600, 37800, 50400), class = c("hms", "difftime"), units = "secs")), class = c("grouped_df","tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(SAMPN = 1:2, PERNO = c(1, 1), .rows = list(1:3, 4:6)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
Then I added:
df %>% group_by(SAMPN) %>%
mutate(
arr_min = mapply(function(x) min(arr[-x]), 1:n()),
dep_max = mapply(function(x) max(dep[-x]), 1:n()),
overlap = ifelse(arr<arr_min | dep>dep_max, 0, 1)
)
You will get:
SAMPN PERNO arr dep arr_min dep_max overlap
<int> <int> <time> <time> <dbl> <dbl> <dbl>
1 1 1 08:25 09:30 35280 61800 0
2 1 2 09:48 10:05 30300 61800 1
3 1 3 10:20 17:10 30300 36300 0
4 2 1 09:00 09:20 34200 50400 0
5 2 2 09:30 10:30 32400 50400 1
6 2 3 11:00 14:00 32400 37800 0
You basically compare the current arr and dep with arr_min (the min(arr) value excluding the current case) and dep_max (the max(dep) value excluding the current case).
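One caveat with the min/max comparison: a row can lie between the earliest arrival and the latest departure of the other rows and still fall into a gap between them. In the output above, row 2 (09:48 to 10:05) is flagged 1 although it touches neither 08:25-09:30 nor 10:20-17:10. A pairwise check avoids this. A minimal sketch in base R, assuming times in seconds since midnight; `any_overlap` is a hypothetical helper name:

```r
# Same arr/dep values as above, in seconds since midnight
df <- data.frame(
  SAMPN = c(1, 1, 1, 2, 2, 2),
  PERNO = c(1, 2, 3, 1, 2, 3),
  arr   = c(30300, 35280, 37200, 32400, 34200, 39600),
  dep   = c(34200, 36300, 61800, 33600, 37800, 50400)
)

# Hypothetical helper: for each row, 1 if it overlaps any other row
any_overlap <- function(arr, dep) {
  vapply(seq_along(arr), function(i) {
    as.integer(any(arr[i] <= dep[-i] & dep[i] >= arr[-i]))
  }, integer(1))
}

# Apply the helper household by household
df$overlap <- NA_integer_
for (h in unique(df$SAMPN)) {
  idx <- df$SAMPN == h
  df$overlap[idx] <- any_overlap(df$arr[idx], df$dep[idx])
}
```

With this sample data every row gets 0, since no two intervals within a household actually intersect.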