Defining an indicator for overlapping time intervals

Time: 2019-09-06 22:38:30

Tags: r dataframe

I have

 household       person     start time   end time
     1           1          07:45:00    21:45:00
     1           2          09:45:00    17:45:00
     1           3          22:45:00    23:45:00
     1           4          08:45:00    01:45:00
     1           1          23:50:00    24:00:00
     2           1          07:45:00    21:45:00
     2           2          16:45:00    22:45:00

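I want to add a column that identifies overlapping times between members of the same household.

The indicator is 1 if a person's start-to-end interval intersects that of any other member of the household, and 0 otherwise.

In the example above, the intervals of the first, second and fourth persons in the first household intersect, so their indicator is 1, whereas the third and fifth rows do not intersect with anyone else in that household.

Output: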

 household       person     start time   end time      overlap
      1           1          07:45:00    21:45:00           1
      1           2          09:45:00    17:45:00           1
      1           3          22:45:00    23:45:00           0
      1           4          08:45:00    01:45:00           1
      1           1          23:50:00    24:00:00           0     
      2           1          07:45:00    21:45:00           1
      2           2          16:45:00    22:45:00           1

Data in dput format:

         structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L, 2L), PERNO = c(1, 
         1, 1, 1, 1, 1), arr = structure(c(30300, 35280, 37200, 32400, 
         34200, 39600), class = c("hms", "difftime"), units = "secs"), 
         dep = structure(c(34200, 36300, 61800, 33600, 37800, 50400
), class = c("hms", "difftime"), units = "secs")), class =  c("grouped_df", 
        "tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
          SAMPN = 1:2, PERNO = c(1, 1), .rows = list(1:3, 4:6)), row.names = c(NA, 
       -2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))

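2 Answers:

Answer 0: (score: 0)

Tidyverse solution

Here is a solution in tidyverse syntax. The basic idea is the same as in the SQL solutions that follow. We do a many-to-many join on household (SAMPN in your current example data) and drop the cases where a person is compared with themselves (PERNO). We check each pair for overlap, then collapse to a single record per household and person. Note that this code will break if all records have the same PERNO.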

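library(dplyr)

compare <- 
  df %>% 
  # pair every record with every record from the same household
  left_join(df %>% 
              rename(compare_PERNO = PERNO, 
                     compare_arr = arr,
                     compare_dep = dep), by = "SAMPN") %>% 
  # drop comparisons of a person with themselves
  filter(PERNO != compare_PERNO) %>% 
  # two intervals overlap when each starts no later than the other ends
  mutate(overlap = arr <= compare_dep & dep >= compare_arr) %>% 
  # collapse to one record per household and person
  group_by(SAMPN, PERNO) %>% 
  summarize(overlap = max(overlap))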
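
The overlap condition in the `mutate()` step is the usual closed-interval test: two intervals intersect when each one starts no later than the other ends. As a minimal sketch, it can be tried in isolation in base R. The helper names `intervals_overlap()` and `to_secs()` are hypothetical and only for illustration, times are expressed as seconds since midnight, and every interval is assumed to end after it starts (no midnight crossing).

```r
# Hypothetical helper (not from the answer): closed-interval overlap test.
# Assumes start <= end for both intervals.
intervals_overlap <- function(start_a, end_a, start_b, end_b) {
  start_a <= end_b & end_a >= start_b
}

# Hypothetical helper: convert hours and minutes to seconds since midnight.
to_secs <- function(h, m) h * 3600 + m * 60

# 07:45-21:45 against 09:45-17:45: the second interval lies inside the first
intervals_overlap(to_secs(7, 45), to_secs(21, 45), to_secs(9, 45), to_secs(17, 45))
# TRUE

# 22:45-23:45 against 07:45-21:45: the first starts after the second ends
intervals_overlap(to_secs(22, 45), to_secs(23, 45), to_secs(7, 45), to_secs(21, 45))
# FALSE
```

Because the function uses only vectorized comparisons, it also works on whole columns at once, which is how the `mutate()` call uses the same expression.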

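SQL solution with household grouping

Grouping the data by household actually makes this problem slightly easier. Again, I am using SQL to solve it. In the inner SQL statement I do a many-to-many join that matches every member of a household with every other member, dropping all cases where a person is matched with themselves. Then, in the outer SQL statement, we reduce the result to one record per household and person, indicating whether that person overlaps with anyone.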


df = data.frame(
  household = c(rep(1,5), rep(2,2)),
  person = c(1:5, 1:2),
  start_time=as.Date(c("2017-05-31","2018-01-14", "2019-02-03", "2018-01-19", "2019-04-17",
                       "2018-02-03", "2018-03-03"), 
                     format="%Y-%m-%d"),
  end_time=as.Date(c("2018-01-17", "2018-01-20", "2019-04-15", "2018-02-20", "2019-05-17", 
                     "2019-03-03", "2019-03-03"), 
                   format="%Y-%m-%d"))

library(sqldf)
compare <- sqldf(
  "
  SELECT * FROM (
    SELECT L.* , 
      CASE WHEN L.start_time <= R.end_time AND L.end_time >= R.start_time THEN 1
      ELSE 0 END AS overlap
    FROM df as L
    LEFT JOIN df as R ON L.household = R.household
    WHERE L.person != R.person 
  ) 
  GROUP BY household, person
  HAVING overlap = MAX(overlap)
  "
)
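
For readers without sqldf, the same join-then-collapse idea can be sketched in base R with `merge()` and `aggregate()`. This is only an illustrative equivalent, not the answer's code: the small data frame below is made up for the example, and the `_other` suffix and the names `pairs` and `result` are assumptions.

```r
# Made-up example data (not the asker's): two households
df <- data.frame(
  household  = c(1, 1, 1, 2, 2),
  person     = c(1, 2, 3, 1, 2),
  start_time = as.Date(c("2018-01-01", "2018-01-10", "2018-03-01",
                         "2018-02-01", "2018-02-15")),
  end_time   = as.Date(c("2018-01-15", "2018-01-20", "2018-03-10",
                         "2018-02-10", "2018-02-20"))
)

# Self-join on household: every row is paired with every row of the same household
pairs <- merge(df, df, by = "household", suffixes = c("", "_other"))

# Drop the pairs that compare a person with themselves
pairs <- pairs[pairs$person != pairs$person_other, ]

# Closed-interval overlap test for each pair
pairs$overlap <- as.integer(
  pairs$start_time <= pairs$end_time_other &
    pairs$end_time >= pairs$start_time_other
)

# Collapse to one row per household and person: 1 if any pair overlapped
result <- aggregate(overlap ~ household + person, data = pairs, FUN = max)
result <- result[order(result$household, result$person), ]
result$overlap
# 1 1 0 0 0
```

As with the SQL version, a household with a single member produces no pairs and therefore drops out of the result.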

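SQL solution without household grouping

Here is a SQL solution to your problem. I perform a keyless many-to-many join to compare every row with every other row (but not with itself), then collapse the large data frame to a single record per ID that records whether a match was found. Your data was not posted in a reproducible form (use the dput function in R), so I used an example dataset. If you cannot adapt this to your exact data, please post reproducible data and I will help.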

df = data.frame(
  id = 1:3,
  start_time=as.Date(c("2017-05-31","2018-01-14", "2018-02-03"), format="%Y-%m-%d"),
  end_time=as.Date(c("2018-01-17", "2018-01-20", "2018-04-17"), format="%Y-%m-%d"))


library(sqldf)
compare <- sqldf(
  "
  SELECT * FROM (
    SELECT L.* , 
      CASE WHEN L.start_time <= R.end_time AND L.end_time >= R.start_time THEN 1
      ELSE 0 END AS overlap
    FROM df as L
    CROSS JOIN df as R 
    WHERE L.id != R.id 
  ) 
  GROUP BY ID
  HAVING overlap = MAX(overlap)
  "
)
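
The keyless cross join can also be pictured as an n-by-n matrix of pairwise comparisons. The following base R sketch builds that matrix with `outer()` for the same three example rows; the variable names `starts`, `ends`, `ov` and `overlap` are chosen here for illustration and are not part of the answer.

```r
# Same three example intervals as in the data frame above
starts <- as.Date(c("2017-05-31", "2018-01-14", "2018-02-03"))
ends   <- as.Date(c("2018-01-17", "2018-01-20", "2018-04-17"))

# ov[i, j] is TRUE when interval i overlaps interval j
idx <- seq_along(starts)
ov <- outer(idx, idx, function(i, j) starts[i] <= ends[j] & ends[i] >= starts[j])

# An interval always overlaps itself, so ignore the diagonal
diag(ov) <- FALSE

# 1 if the row overlaps at least one other row, else 0
overlap <- as.integer(rowSums(ov) > 0)
overlap
# 1 1 0
```

The matrix grows quadratically with the number of rows, just like the cross join, so this is best suited to small groups.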


Answer 1: (score: 0)

I tried a tidyverse solution:

library(tidyverse)
df = structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L, 2L), 
     PERNO = c(1:3, 1:3), arr = structure(c(30300, 35280, 37200, 32400, 
              34200, 39600), class = c("hms", "difftime"), units = "secs"), 
     dep = structure(c(34200, 36300, 61800, 33600, 37800, 50400), class = c("hms", "difftime"), units = "secs")), class =  c("grouped_df","tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(SAMPN = 1:2, PERNO = c(1, 1), .rows = list(1:3, 4:6)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))

Then I added:

df %>% group_by(SAMPN) %>% 
   mutate(
     arr_min = mapply(function(x) min(arr[-x]), 1:n()),  
     dep_max = mapply(function(x) max(dep[-x]), 1:n()),
     overlap = ifelse(arr<arr_min | dep>dep_max, 0, 1)
   ) 

You will get:

  SAMPN PERNO arr    dep    arr_min dep_max overlap
  <int> <int> <time> <time>   <dbl>   <dbl>   <dbl>
1     1     1 08:25  09:30    35280   61800       0
2     1     2 09:48  10:05    30300   61800       1
3     1     3 10:20  17:10    30300   36300       0
4     2     1 09:00  09:20    34200   50400       0
5     2     2 09:30  10:30    32400   50400       1
6     2     3 11:00  14:00    32400   37800       0

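You basically compare the current arr and dep with arr_min (the min(arr) value excluding the current row) and dep_max (the max(dep) value excluding the current row).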
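
Note that this comparison tests whether a row lies inside the overall span of the other rows in its household, which is not the same as intersecting at least one of them. Row 2 (09:48 to 10:05), for example, sits between 09:30 and 10:20 and touches neither neighbour, yet it is flagged 1. A stricter pairwise check could be sketched as follows in base R; the helper `pairwise_overlap()` is hypothetical, and the vectors simply repeat the arr, dep and SAMPN values from the data above as plain seconds.

```r
# Values copied from the example data, as seconds since midnight
arr   <- c(30300, 35280, 37200, 32400, 34200, 39600)
dep   <- c(34200, 36300, 61800, 33600, 37800, 50400)
SAMPN <- c(1, 1, 1, 2, 2, 2)

# Hypothetical helper: for each row index in one household, check whether
# its interval intersects the interval of any other row in that household.
pairwise_overlap <- function(idx) {
  sapply(idx, function(i) {
    others <- setdiff(idx, i)
    as.integer(any(arr[i] <= dep[others] & dep[i] >= arr[others]))
  })
}

# Apply per household and put the results back in the original row order
overlap <- unsplit(lapply(split(seq_along(arr), SAMPN), pairwise_overlap), SAMPN)
overlap
# 0 0 0 0 0 0
```

With this data every indicator is 0, because within each household each interval ends before the next one begins.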