I have a dataset like this:
id data time moreData
<int> <int> <dttm> <dbl>
1 1 4 2017-05-12 18:34:20 4450
2 2 4 2017-05-12 18:37:07 2800
3 3 4 2017-05-12 18:37:10 1900
4 4 4 2017-05-12 18:37:59 1950
5 5 4 2017-05-12 18:38:40 2500
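It contains timestamps. You can think of the data as requests to a website, and I want to approximate "sessions".
In other words, I want rows 1, 2, ..., n to be grouped together if the time difference between row i and row i + 1 is less than 1 minute.
The data above would therefore be grouped into {1} and {2,3,4,5}.
Note that this is not a duplicate of the other questions that ask about grouping within fixed time intervals - I don't care how large the time difference between the first and last element is, I only care about the difference between adjacent rows.
How can I do this?
Sample data: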
structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860,
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198,
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343,
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493,
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id",
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Answer 0 (score: 2)
You can use the difftime function from base R.
Code:
# Wanted time difference in minutes
wantedDiff <- 1
timeDiff <- abs(difftime(df$time[-nrow(df)],
df$time[-1],
units = "mins"))
df$group <- cumsum(c(0, as.numeric(timeDiff >= wantedDiff)))
Result:
   id user                time group
1   1    4 2017-05-12 19:34:20     0
2   2    4 2017-05-12 19:37:07     1
3   3    4 2017-05-12 19:37:10     1
4   4    4 2017-05-12 19:39:32     2
5   5    4 2017-05-12 19:39:33     2
6   6    4 2017-05-12 19:39:57     2
7   7    4 2017-05-12 19:39:58     2
8   8    4 2017-05-12 19:40:00     2
9   9    4 2017-05-12 19:41:49     3
10 10    4 2017-05-12 19:41:52     3
11 11    4 2017-05-12 19:42:19     3
12 12    4 2017-05-12 19:42:20     3
13 13    4 2017-05-12 19:42:23     3
14 14    4 2017-05-12 19:42:23     3
15 15    4 2017-05-12 19:43:24     4
16 16    4 2017-05-12 19:43:25     4
17 17    4 2017-05-12 19:43:27     4
18 18    4 2017-05-12 19:44:52     5
19 19    4 2017-05-12 19:44:53     5
20 20    4 2017-05-12 19:44:55     5
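Explanation:
1. difftime computes the absolute time difference between each row and the previous one, in the requested units (minutes here). The result (timeDiff) looks like this:
Time differences in mins
 [1] 2.78333333 0.05000000 2.36666667 0.01666667 0.40000000 0.01666667 0.03333333 1.81666667 0.05000000 0.45000000
[11] 0.01666667 0.05000000 0.00000000 1.01666667 0.01666667 0.03333333 1.41666667 0.01666667 0.03333333
2. Each difference is checked against wantedDiff (greater than or equal), and this logical output is converted to numeric.
3. cumsum accumulates the numeric output: every 1 adds +1, i.e. switches to a new group.
Data: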
df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860,
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198,
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343,
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493,
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id",
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
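The explanation above can be traced by hand on a much smaller input. The following is a minimal sketch in base R; the `times` vector and the names `gaps` and `starts_new` are made up for illustration and are not part of the answer's data.

```r
# Hypothetical toy timestamps (seconds after the epoch), invented for this sketch
times <- as.POSIXct(c(0, 30, 50, 200, 210, 400), origin = "1970-01-01", tz = "UTC")

# Gap between each row and the previous one, in minutes
# (later minus earlier, so no abs() is needed for sorted input)
gaps <- as.numeric(difftime(times[-1], times[-length(times)], units = "mins"))

# TRUE where the gap is at least one minute, i.e. where a new group starts
starts_new <- gaps >= 1

# Running count of group starts; the leading 0 puts the first row in group 0
group <- cumsum(c(0, starts_new))
group
# [1] 0 0 0 1 1 2
```

Only the gaps of 150 and 190 seconds reach the one-minute threshold, so the counter increases exactly twice.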
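Answer 1 (score: 1)
Here is a solution using an extended version of the sample dataset. The key parts of this approach are using lubridate::ymd_hms to convert the strings into date-times that support arithmetic, and then using lag to determine whether each time is within one minute of the previous row. You can then use a for loop to create the groups, incrementing the group number each time you reach a row that is not within one minute of the previous row. This could certainly be tightened up a bit, and I would be curious to know whether anyone can do it without resorting to a for loop and bind_cols!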
library(tidyverse)
tbl <- tibble(
  id = 1:8,
  time = c("2017-05-12 18:34:20",
           "2017-05-12 18:37:07",
           "2017-05-12 18:37:10",
           "2017-05-12 18:37:59",
           "2017-05-12 18:38:40",
           "2017-05-12 18:40:40",
           "2017-05-12 18:40:49",
           "2017-05-12 18:43:40"
  )
)
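tbl2 <- tbl %>%
  mutate(time = lubridate::ymd_hms(time)) %>%
  # Seconds since the previous row; explicit units keep the comparison with 60 valid,
  # and the first row is compared with itself (separation 0)
  mutate(separation = as.numeric(difftime(time, lag(time, default = first(time)), units = "secs"))) %>%
  mutate(onemin = separation <= 60)

# The first row always opens group 1
group_ids <- 1
for (i in 2:nrow(tbl2)) {
  if (tbl2$onemin[i] == FALSE) {
    # More than a minute since the previous row: start a new group
    group_ids[i] <- group_ids[i - 1] + 1
  } else {
    group_ids[i] <- group_ids[i - 1]
  }
}

tbl2 %>%
  bind_cols(., group = group_ids) %>%
  select(id, time, group)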
# A tibble: 8 x 3
id time group
<int> <dttm> <dbl>
1 1 2017-05-12 18:34:20 1.00
2 2 2017-05-12 18:37:07 2.00
3 3 2017-05-12 18:37:10 2.00
4 4 2017-05-12 18:37:59 2.00
5 5 2017-05-12 18:38:40 2.00
6 6 2017-05-12 18:40:40 3.00
7 7 2017-05-12 18:40:49 3.00
8 8 2017-05-12 18:43:40 4.00
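As for doing this without a for loop: one possible sketch, not the answerer's own code, is to mark the rows that start a group and take a cumulative sum. The column names `gap_secs` and `new_group` and the variable `res` are made up for illustration; the data is the same eight rows as above.

```r
library(dplyr)
library(lubridate)

tbl <- tibble(
  id = 1:8,
  time = ymd_hms(c("2017-05-12 18:34:20", "2017-05-12 18:37:07",
                   "2017-05-12 18:37:10", "2017-05-12 18:37:59",
                   "2017-05-12 18:38:40", "2017-05-12 18:40:40",
                   "2017-05-12 18:40:49", "2017-05-12 18:43:40"))
)

res <- tbl %>%
  mutate(
    # Seconds since the previous row (NA for the first row)
    gap_secs = as.numeric(difftime(time, lag(time), units = "secs")),
    # A row starts a group if it is the first row or the gap exceeds 60 seconds
    new_group = is.na(gap_secs) | gap_secs > 60,
    # Counting the group starts so far yields the group number
    group = cumsum(new_group)
  ) %>%
  select(id, time, group)

res$group
# [1] 1 2 2 2 2 3 3 4
```

The group numbers match the loop version; the only difference is that `group` is an integer column here rather than a double.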
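Answer 2 (score: 1)
One possible solution is to use the lag function from the dplyr package and cumsum from base R.
The approach is:
1. Compute difftime, the time difference in seconds between each row and the previous one; if difftime exceeds 60, the row starts a new group (newgroup).
2. Apply cumsum to newgroup to get the group number for each row.
The code is: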
#data
library(dplyr)
df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L),
time = structure(c(1494606860,1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198,
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343,
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493,
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id",
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
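df %>%
  # Seconds since the previous row; the first row has no predecessor, so use 0
  mutate(difftime = as.numeric(difftime(time, lag(time), units = "secs")),
         difftime = coalesce(difftime, 0)) %>%
  # 1 marks a row that starts a new group
  mutate(newgroup = ifelse(difftime > 60, 1, 0)) %>%
  mutate(group = factor(cumsum(newgroup))) %>%
  select(id, user, time, group)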
#Result
id user time group
<int> <int> <dttm> <fctr>
1 1 4 2017-05-12 17:34:20 0
2 2 4 2017-05-12 17:37:07 1
3 3 4 2017-05-12 17:37:10 1
4 4 4 2017-05-12 17:39:32 2
5 5 4 2017-05-12 17:39:33 2
6 6 4 2017-05-12 17:39:57 2
7 7 4 2017-05-12 17:39:58 2
8 8 4 2017-05-12 17:40:00 2
9 9 4 2017-05-12 17:41:49 3
10 10 4 2017-05-12 17:41:52 3
11 11 4 2017-05-12 17:42:19 3
12 12 4 2017-05-12 17:42:20 3
13 13 4 2017-05-12 17:42:23 3
14 14 4 2017-05-12 17:42:23 3
15 15 4 2017-05-12 17:43:24 4
16 16 4 2017-05-12 17:43:25 4
17 17 4 2017-05-12 17:43:27 4
18 18 4 2017-05-12 17:44:52 5
19 19 4 2017-05-12 17:44:53 5
20 20 4 2017-05-12 17:44:55 5
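The sample data contains a single user. If the real data held several users, the same lag and cumsum idea could presumably be applied per user, so that a gap is only measured between requests from the same user. This is a sketch under that assumption; the `requests` data and the `gap_secs` and `session` columns are invented for illustration.

```r
library(dplyr)

# Hypothetical two-user data, invented only for this sketch
requests <- tibble(
  user = c(4L, 4L, 4L, 7L, 7L),
  time = as.POSIXct(c(0, 20, 200, 10, 30), origin = "1970-01-01", tz = "UTC")
)

sessions <- requests %>%
  arrange(user, time) %>%
  group_by(user) %>%
  mutate(
    # Seconds since this user's previous request (NA for the user's first request)
    gap_secs = as.numeric(difftime(time, lag(time), units = "secs")),
    # Session counter restarts at 0 for every user
    session = cumsum(coalesce(gap_secs > 60, FALSE))
  ) %>%
  ungroup()

sessions$session
# [1] 0 0 1 0 0
```

Grouping before calling lag keeps the first request of one user from being compared with the last request of another.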