Create a DF of the number of times each pair of people took a break together

Time: 2018-10-19 13:50:14

Tags: sql r

I have a data frame of people's break times. It has their employee ID, punch-in time, and punch-out time.

library(tidyverse)
breaks %>%
 head()

  EmployeeID         PunchInTime        PunchOutTime
1     105210 2018-10-19 07:57:07 2018-10-19 08:31:52
2     106556 2018-10-19 06:31:03 2018-10-19 07:04:27
3     100412 2018-10-19 06:29:42 2018-10-19 06:46:18
4     101917 2018-10-19 06:25:05 2018-10-19 08:01:03
5     102508 2018-10-19 06:04:02 2018-10-19 06:22:54
6     100859 2018-10-19 06:00:20 2018-10-19 06:35:33

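I'm still exploring the data, but I'm interested in the various ways breaks overlap. The end goal is to study groups of people who take breaks together. To get there, I'd like to create an adjacency matrix (in the network-analysis sense). For now I just want to find out whether two breaks overlap at all, but I think it would also be useful to see whether a pair's overlap exceeds ten minutes.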

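This is one of those problems that is so tricky (for me) that I don't even know where to start. I tried one R strategy, with limited success. I tried to spread the employee IDs into columns holding the break intervals (using lubridate's interval function). I didn't really have a next step in mind, and it didn't work anyway. Still, it does technically run. Here is the code.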

library(lubridate)
> breaks %>% 
+   mutate(
+     BreakInterval = interval(PunchInTime, PunchOutTime)
+   ) %>%
+   select(
+     EmployeeID,
+     BreakInterval
+     ) %>% 
+   group_by(EmployeeID) %>%
+   mutate(BreakNoPerEmployee = row_number()) %>%
+   spread(EmployeeID, BreakInterval) -> mutations
> View(mutations)
Error in validObject(.Object) : 
  invalid class “Interval” object: Inconsistent lengths: spans = 378, start 
dates = 79002

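The second R strategy I'm considering is some kind of for loop, but I can't think through the logic for building the pairwise overlap counts. It seems this might be easier in SQL with a subquery/self-join (since that is where the data is stored in the first place). I have experience with both SQL and R and could use either, but I'm more experienced with R.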

1 answer:

Answer 0: (score: 1)

Probably not the most elegant solution, but here is an attempt. It performs a cross join, so it can blow up quickly if you have a lot of data.
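To get a feel for how quickly a cross join grows, here is a minimal base-R sketch on made-up data (the column names `id1` and `id2` are illustrative only). Joining n rows against n rows gives n^2 rows, and keeping each unordered pair once roughly halves that.

```r
# Toy data: six "breaks", identified only by a running number (made up).
n  <- 6
b1 <- data.frame(id1 = seq_len(n))
b2 <- data.frame(id2 = seq_len(n))

# merge() with by = NULL returns the Cartesian product of the two inputs.
pairs <- merge(b1, b2, by = NULL)
nrow(pairs)       # n^2 = 36

# Keep each unordered pair once and drop self-pairs.
pairs_once <- pairs[pairs$id1 < pairs$id2, ]
nrow(pairs_once)  # n * (n - 1) / 2 = 15
```

The error message in the question mentions 79002 start dates. If that is the number of break rows (an assumption), a full cross join would have 79002^2 rows, roughly 6.2 billion, so filtering to a single day or shift before joining is probably necessary.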

Solution

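library(tidyverse)
library(lubridate)  # interval(), int_overlaps()

# number each employee's breaks, then ungroup so the columns can be renamed safely
breaks <- breaks %>% group_by(EmployeeID) %>% mutate(break_no = row_number()) %>% ungroup()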

b1 <- breaks %>% 
  setNames(paste0(names(.), "1"))

b2 <- breaks %>% 
  setNames(paste0(names(.), "2"))

# create a paired comparison for each break
breaks_merge <- merge(b1, b2, by = NULL) %>% 
  # filter depending on your end goal, might be a good sanity check
  filter(EmployeeID1 != EmployeeID2) %>% 
  mutate(int_b1 = interval(PunchInTime1, PunchOutTime1),
         int_b2 = interval(PunchInTime2, PunchOutTime2),
         breaks_overlap = int_overlaps(int_b1, int_b2))

# adjacency matrix a little awkward because of multiple employees with multiple breaks
breaks_adj <- breaks_merge %>%
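  # drop the raw timestamp and interval columns ("^[Punch|int]" was a character class, not an alternation)
  select(-matches("^(Punch|int)")) %>%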
  unite("Emp1_break", EmployeeID1, break_no1, sep = "_") %>%
  unite("Emp2_break", EmployeeID2, break_no2, sep = "_") %>%
  spread(Emp2_break, breaks_overlap)

> breaks_adj
  Emp1_break 100412_1 100859_1 101917_1 102508_1 105210_1 106556_1
1   100412_1       NA     TRUE     TRUE    FALSE    FALSE     TRUE
2   100859_1     TRUE       NA     TRUE     TRUE    FALSE     TRUE
3   101917_1     TRUE     TRUE       NA    FALSE     TRUE     TRUE
4   102508_1    FALSE     TRUE    FALSE       NA    FALSE    FALSE
5   105210_1    FALSE    FALSE     TRUE    FALSE       NA    FALSE
6   106556_1     TRUE     TRUE     TRUE    FALSE    FALSE       NA
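The matrix above has one row and column per individual break. To get at the question's title (how many times each pair of employees took a break together), the break-level overlaps can be summed per employee pair. The following is a minimal base-R sketch on made-up data, not the answer author's code: break times are plain minutes, the employee labels and the `start`/`end` columns are hypothetical, and `xtabs()` does the counting.

```r
# Made-up breaks: start and end in minutes since the start of the shift.
breaks_toy <- data.frame(
  EmployeeID = c("A", "B", "C", "A", "B"),
  start      = c(0, 10, 40, 100, 105),
  end        = c(30, 20, 50, 120, 130),
  stringsAsFactors = FALSE
)

# Same suffix-and-cross-join idea as in the solution.
b1 <- setNames(breaks_toy, paste0(names(breaks_toy), "1"))
b2 <- setNames(breaks_toy, paste0(names(breaks_toy), "2"))
pairs <- merge(b1, b2, by = NULL)
pairs <- pairs[pairs$EmployeeID1 != pairs$EmployeeID2, ]

# Two breaks overlap when each one starts before the other ends.
pairs$overlap <- as.integer(pairs$start1 < pairs$end2 & pairs$start2 < pairs$end1)

# Sum overlapping break pairs per employee pair: an employee-level adjacency matrix.
adj <- xtabs(overlap ~ EmployeeID1 + EmployeeID2, data = pairs)
adj
```

In this toy data A and B overlap twice and C overlaps with nobody, so `adj["A", "B"]` and `adj["B", "A"]` are both 2 and every other cell is 0. The result is symmetric because the cross join contains both orderings of each pair.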

Data

breaks <- structure(list(EmployeeID = c(105210L, 106556L, 100412L, 101917L, 
102508L, 100859L), PunchInTime = structure(c(1539961027, 1539955863, 
1539955782, 1539955505, 1539954242, 1539954020), class = c("POSIXct", 
"POSIXt"), tzone = ""), PunchOutTime = structure(c(1539963112, 
1539957867, 1539956778, 1539961263, 1539955374, 1539956133), class = c("POSIXct", 
"POSIXt"), tzone = "")), .Names = c("EmployeeID", "PunchInTime", 
"PunchOutTime"), row.names = c(NA, -6L), class = "data.frame")

Note

Also, if you work with lubridate intervals inside dplyr pipelines, there are some unexpected errors you may encounter.
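One way to sidestep Interval columns altogether is to compare the punch times directly. This also gives the overlap length, which covers the "more than ten minutes" variant from the question. The sketch below uses only base R; `overlap_minutes()` and `utc()` are hypothetical helpers rather than lubridate functions, and the timestamps are taken from the `head()` output in the question, read as UTC for the sake of the example.

```r
# Overlap length of two breaks in minutes; 0 when they do not overlap.
# Hypothetical helper, vectorised over its arguments.
overlap_minutes <- function(in1, out1, in2, out2) {
  latest_start <- pmax(as.numeric(in1), as.numeric(in2))    # seconds since epoch
  earliest_end <- pmin(as.numeric(out1), as.numeric(out2))
  pmax(0, earliest_end - latest_start) / 60
}

# Parse timestamps in a fixed time zone (UTC is an assumption for this example).
utc <- function(x) as.POSIXct(x, tz = "UTC")

# Three pairs: 101917 vs 106556, 100412 vs 102508, 101917 vs 105210.
mins <- overlap_minutes(
  in1  = utc(c("2018-10-19 06:25:05", "2018-10-19 06:29:42", "2018-10-19 06:25:05")),
  out1 = utc(c("2018-10-19 08:01:03", "2018-10-19 06:46:18", "2018-10-19 08:01:03")),
  in2  = utc(c("2018-10-19 06:31:03", "2018-10-19 06:04:02", "2018-10-19 07:57:07")),
  out2 = utc(c("2018-10-19 07:04:27", "2018-10-19 06:22:54", "2018-10-19 08:31:52"))
)

mins > 0    # overlap at all:        TRUE FALSE TRUE
mins >= 10  # at least ten minutes:  TRUE FALSE FALSE
```

Because the result is an ordinary numeric column, it can be created inside `mutate()` and passed through `spread()` without the class-validity problems that Interval objects can trigger.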