I have a data frame of people's break times. It contains their employee ID, punch-in time, and punch-out time.
library(tidyverse)
breaks %>%
head()
EmployeeID PunchInTime PunchOutTime
1 105210 2018-10-19 07:57:07 2018-10-19 08:31:52
2 106556 2018-10-19 06:31:03 2018-10-19 07:04:27
3 100412 2018-10-19 06:29:42 2018-10-19 06:46:18
4 101917 2018-10-19 06:25:05 2018-10-19 08:01:03
5 102508 2018-10-19 06:04:02 2018-10-19 06:22:54
6 100859 2018-10-19 06:00:20 2018-10-19 06:35:33
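I'm still exploring the data, but I'm interested in the different ways breaks overlap. The end goal is to study groups of people who take breaks together. To get there, I'd like to create an adjacency matrix (in the network analysis sense). For now I just want to figure out whether breaks overlap at all, but I also think it would be useful to see whether a pair overlaps for more than ten minutes.
This is one of those problems that is (for me) tricky enough that I don't even know how to start. I tried one R strategy, with limited returns: I attempted to spread the employee IDs into columns holding the break intervals (built with lubridate's interval function). I didn't really have a next step, and it didn't work anyway. That said, the pipeline does technically run. Here is the code.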
library(lubridate)
> breaks %>%
+ mutate(
+ BreakInterval = interval(PunchInTime, PunchOutTime)
+ ) %>%
+ select(
+ EmployeeID,
+ BreakInterval
+ ) %>%
+ group_by(EmployeeID) %>%
+ mutate(BreakNoPerEmployee = row_number()) %>%
+ spread(EmployeeID, BreakInterval) -> mutations
> View(mutations)
Error in validObject(.Object) :
invalid class “Interval” object: Inconsistent lengths: spans = 378, start
dates = 79002
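The second R strategy I'm considering is some kind of for loop, but I can't think through the logic for producing pairwise overlap counts. It seems like this might be easier in SQL with a subquery or self-join (the data is originally stored there anyway). I have experience with both SQL and R and am fine with either, but I'm more experienced with R.
Answer 0 (score: 1)
Probably not the most elegant solution, but here's an attempt. It performs a cross join, so the number of rows grows with the square of the number of breaks and can blow up quickly if you have a lot of data: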
library(tidyverse)
library(lubridate)  # interval() and int_overlaps()
# number each employee's breaks, then ungroup before renaming the columns
breaks <- breaks %>%
  group_by(EmployeeID) %>%
  mutate(break_no = row_number()) %>%
  ungroup()
b1 <- breaks %>%
  setNames(paste0(names(.), "1"))
b2 <- breaks %>%
  setNames(paste0(names(.), "2"))
# create a paired comparison for each break
breaks_merge <- merge(b1, b2, by = NULL) %>%
  # filter depending on your end goal, might be a good sanity check
  filter(EmployeeID1 != EmployeeID2) %>%
  mutate(int_b1 = interval(PunchInTime1, PunchOutTime1),
         int_b2 = interval(PunchInTime2, PunchOutTime2),
         breaks_overlap = int_overlaps(int_b1, int_b2))
# adjacency matrix a little awkward because of multiple employees with multiple breaks
breaks_adj <- breaks_merge %>%
  # drop the Punch* and int_* columns ("^(Punch|int)" is a group, not a character class)
  select(-matches("^(Punch|int)")) %>%
  unite("Emp1_break", EmployeeID1, break_no1, sep = "_") %>%
  unite("Emp2_break", EmployeeID2, break_no2, sep = "_") %>%
  spread(Emp2_break, breaks_overlap)
> breaks_adj
Emp1_break 100412_1 100859_1 101917_1 102508_1 105210_1 106556_1
1 100412_1 NA TRUE TRUE FALSE FALSE TRUE
2 100859_1 TRUE NA TRUE TRUE FALSE TRUE
3 101917_1 TRUE TRUE NA FALSE TRUE TRUE
4 102508_1 FALSE TRUE FALSE NA FALSE FALSE
5 105210_1 FALSE FALSE TRUE FALSE NA FALSE
6 106556_1 TRUE TRUE TRUE FALSE FALSE NA
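The question also asks whether a pair overlaps for more than ten minutes. The following is a minimal base R sketch of one way to get there, not part of the original answer: it reuses the same cross-join idea and computes the shared time as the later punch-in subtracted from the earlier punch-out. The column names `overlap_mins` and `overlap_10min` and the `long_overlaps` table are made up for illustration, and the sample times are rebuilt in UTC for reproducibility (the time zone does not affect durations).

```r
# Sample data rebuilt from the dput() output in this answer (UTC assumed)
breaks <- data.frame(
  EmployeeID   = c(105210L, 106556L, 100412L, 101917L, 102508L, 100859L),
  PunchInTime  = as.POSIXct(c(1539961027, 1539955863, 1539955782,
                              1539955505, 1539954242, 1539954020),
                            origin = "1970-01-01", tz = "UTC"),
  PunchOutTime = as.POSIXct(c(1539963112, 1539957867, 1539956778,
                              1539961263, 1539955374, 1539956133),
                            origin = "1970-01-01", tz = "UTC")
)

# Cross join every break with every break, then drop same-employee pairs
b1 <- setNames(breaks, paste0(names(breaks), "1"))
b2 <- setNames(breaks, paste0(names(breaks), "2"))
pairs <- merge(b1, b2, by = NULL)
pairs <- pairs[pairs$EmployeeID1 != pairs$EmployeeID2, ]

# Work in seconds since the epoch to keep the arithmetic simple
in1  <- as.numeric(pairs$PunchInTime1)
out1 <- as.numeric(pairs$PunchOutTime1)
in2  <- as.numeric(pairs$PunchInTime2)
out2 <- as.numeric(pairs$PunchOutTime2)

# Shared time = earlier punch-out minus later punch-in, floored at zero
overlap_secs <- pmax(0, pmin(out1, out2) - pmax(in1, in2))
pairs$overlap_mins  <- overlap_secs / 60
pairs$overlap_10min <- pairs$overlap_mins > 10  # hypothetical threshold column

# Pairs that shared more than ten minutes of break time
long_overlaps <- pairs[pairs$overlap_10min,
                       c("EmployeeID1", "EmployeeID2", "overlap_mins")]
```

Because every pair appears twice (once in each direction), `long_overlaps` lists each overlapping pair in both orders; filter on `EmployeeID1 < EmployeeID2` if only one row per pair is wanted.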
breaks <- structure(list(EmployeeID = c(105210L, 106556L, 100412L, 101917L,
102508L, 100859L), PunchInTime = structure(c(1539961027, 1539955863,
1539955782, 1539955505, 1539954242, 1539954020), class = c("POSIXct",
"POSIXt"), tzone = ""), PunchOutTime = structure(c(1539963112,
1539957867, 1539956778, 1539961263, 1539955374, 1539956133), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("EmployeeID", "PunchInTime",
"PunchOutTime"), row.names = c(NA, -6L), class = "data.frame")
Also, if you play around with lubridate intervals inside dplyr, there are some unexpected errors you may encounter.
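Since intervals inside data frames can be fragile, one alternative is to skip them entirely and build an employee-by-employee matrix from plain numeric times. The sketch below is an assumption-laden illustration rather than the answer's method: it uses `outer()` for the break-level comparison and a membership matrix (the hypothetical `member`) to collapse multiple breaks per employee into a count of overlapping break pairs. It is still quadratic in the number of breaks, so the same caveat about large data applies.

```r
# Sample data rebuilt from the dput() output in this answer (UTC assumed)
breaks <- data.frame(
  EmployeeID   = c(105210L, 106556L, 100412L, 101917L, 102508L, 100859L),
  PunchInTime  = as.POSIXct(c(1539961027, 1539955863, 1539955782,
                              1539955505, 1539954242, 1539954020),
                            origin = "1970-01-01", tz = "UTC"),
  PunchOutTime = as.POSIXct(c(1539963112, 1539957867, 1539956778,
                              1539961263, 1539955374, 1539956133),
                            origin = "1970-01-01", tz = "UTC")
)

start <- as.numeric(breaks$PunchInTime)
end   <- as.numeric(breaks$PunchOutTime)

# overlap_secs[i, j] = seconds shared by break i and break j (0 if disjoint)
overlap_secs <- pmax(outer(end, end, pmin) - outer(start, start, pmax), 0)

# Break-level 0/1 matrix: 1 where two breaks share any time at all
overlaps <- (overlap_secs > 0) * 1

# Membership matrix: rows are breaks, columns are employees
emp    <- as.character(breaks$EmployeeID)
ids    <- sort(unique(emp))
member <- outer(emp, ids, "==") * 1

# Collapse to employees: adj[a, b] = number of overlapping break pairs
adj <- t(member) %*% overlaps %*% member
dimnames(adj) <- list(ids, ids)
diag(adj) <- 0  # ignore an employee's overlap with their own breaks
```

To apply the ten-minute rule instead, replace `overlap_secs > 0` with `overlap_secs > 600`. The resulting `adj` is a symmetric numeric matrix, which is the shape most network packages expect for a weighted undirected graph.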