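My goal is to find the time elapsed between each user's first and second use of a service. Users who have used the service only once should be excluded.
For example, I have a sample dataset, test: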
> test
            start_time user_id
1  2018-01-17 22:10:21       1
2  2018-01-17 22:10:08       2
3  2018-01-18 07:02:36       3
4  2018-01-18 07:24:18       4
5  2018-01-18 15:08:45       2
6  2018-01-18 15:26:57       1
7  2018-01-18 15:37:47       1
8  2018-01-18 20:12:43       3
9  2018-01-18 20:01:08       2
10 2018-01-18 22:42:02       2
I can compute the differences one at a time with difftime:
output$time_lapse[1] <- abs(difftime(test$start_time[1], test$start_time[6]))
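But this takes far too long on a large dataset. How can I do this for all users at once with data.table or dplyr?
The expected output for the test dataset above, in hours, is: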
> output
  user_id time_lapse
1       1   17.27667
2       2   16.97694
3       3   13.16861
4       4         NA
Any suggestions would be greatly appreciated! Here is the sample data:
> dput(test)
structure(list(start_time = structure(c(1516255821, 1516255808,
1516287756, 1516289058, 1516316925, 1516318017, 1516318667, 1516335163,
1516334468, 1516344122), class = c("POSIXct", "POSIXt"), tzone = ""),
user_id = c(1, 2, 3, 4, 2, 1, 1, 3, 2, 2)), .Names = c("start_time",
"user_id"), row.names = c(NA, 10L), class = "data.frame")
Answer 0: (score: 1)
Here is a data.table approach:
library(data.table)
test <- structure(list(start_time = structure(c(1516255821, 1516255808,
1516287756, 1516289058, 1516316925, 1516318017, 1516318667, 1516335163,
1516334468, 1516344122), class = c("POSIXct", "POSIXt"), tzone = ""),
user_id = c(1, 2, 3, 4, 2, 1, 1, 3, 2, 2)), .Names = c("start_time",
"user_id"), row.names = c(NA, 10L), class = "data.frame")
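setDT(test)
# Sort within each user so that positions 1 and 2 are the two earliest uses
setorder(test, user_id, start_time)
# Fix the units so every group is reported in hours; single-use users get NA
test[, .(time_lapse = difftime(start_time[2], start_time[1], units = "hours")), by = user_id]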
#    user_id     time_lapse
# 1:       1 17.27667 hours
# 2:       2 16.97694 hours
# 3:       3 13.16861 hours
# 4:       4       NA hours
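The question asks for single-use users to be excluded, whereas the output above keeps user 4 with an NA. One possible variation is sketched below; it is an assumption about what the asker wants, not part of the original answer. It sorts explicitly so the result does not depend on row order, and it returns nothing for groups with fewer than two rows so that data.table drops them. The sample data is rebuilt with tz = "UTC" (my choice, only to make the example self-contained), and the name res is illustrative.

```r
library(data.table)

# Sample data from the question, rebuilt from the epoch seconds in dput(test).
# tz = "UTC" is an assumption; it does not affect the time differences.
test <- data.table(
  start_time = as.POSIXct(
    c(1516255821, 1516255808, 1516287756, 1516289058, 1516316925,
      1516318017, 1516318667, 1516335163, 1516334468, 1516344122),
    origin = "1970-01-01", tz = "UTC"
  ),
  user_id = c(1, 2, 3, 4, 2, 1, 1, 3, 2, 2)
)

# Sort by user and time, then compute the gap between the two earliest uses.
# A group whose j expression evaluates to NULL is dropped from the result,
# so users with a single use do not appear at all.
res <- test[order(user_id, start_time)][
  ,
  if (.N >= 2L) {
    .(time_lapse = as.numeric(difftime(start_time[2], start_time[1], units = "hours")))
  },
  by = user_id
]

print(res)
```

Converting with as.numeric() gives a plain numeric column in hours, which is usually easier to aggregate or plot than a difftime column.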
Answer 1: (score: 0)
A dplyr approach:
df = structure(list(start_time = structure(c(1516255821, 1516255808,
1516287756, 1516289058, 1516316925, 1516318017, 1516318667, 1516335163,
1516334468, 1516344122), class = c("POSIXct", "POSIXt"), tzone = ""),
user_id = c(1, 2, 3, 4, 2, 1, 1, 3, 2, 2)), .Names = c("start_time",
"user_id"), row.names = c(NA, 10L), class = "data.frame")
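library(dplyr)
df %>%
  arrange(start_time) %>%    # sort so that lag() refers to the previous use in time
  group_by(user_id) %>%
  mutate(dif = difftime(start_time, lag(start_time), units = "hours")) %>%
  filter(row_number() == 2)  # keep each user's second use; single-use users drop out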
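Result (start_time is printed in the local time zone because tzone is empty, so the clock times differ from those in the question; the differences are unaffected):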
# A tibble: 3 x 3
# Groups:   user_id [3]
  start_time          user_id dif
  <dttm>                <dbl> <time>
1 2018-01-18 21:08:45       2 16.97694 hours
2 2018-01-18 21:26:57       1 17.27667 hours
3 2018-01-19 02:12:43       3 13.16861 hours
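If one row per user is preferred, with NA for single-use users as in the question's expected output, a summarise()-based variant may be worth considering. This is a sketch under that assumption rather than the answerer's method; the names df and res are illustrative and the data is rebuilt with tz = "UTC" only to keep the example self-contained.

```r
library(dplyr)

# Sample data from the question, rebuilt from the epoch seconds in dput(test).
# tz = "UTC" is an assumption; it does not affect the time differences.
df <- data.frame(
  start_time = as.POSIXct(
    c(1516255821, 1516255808, 1516287756, 1516289058, 1516316925,
      1516318017, 1516318667, 1516335163, 1516334468, 1516344122),
    origin = "1970-01-01", tz = "UTC"
  ),
  user_id = c(1, 2, 3, 4, 2, 1, 1, 3, 2, 2)
)

# For each user, take the second-earliest and the earliest timestamps.
# sort(start_time)[2] is NA when a user has only one row, so time_lapse is NA.
res <- df %>%
  group_by(user_id) %>%
  summarise(
    time_lapse = as.numeric(
      difftime(sort(start_time)[2], min(start_time), units = "hours")
    )
  )

print(res)
```

Appending filter(!is.na(time_lapse)) to the pipeline would remove the single-use users instead of keeping them as NA.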