Group rows by the time gap between consecutive rows

Time: 2018-02-07 22:01:15

Tags: r

I have a dataset like this:

      id  data                time moreData
   <int> <int>              <dttm>    <dbl>
 1     1     4 2017-05-12 18:34:20     4450
 2     2     4 2017-05-12 18:37:07     2800
 3     3     4 2017-05-12 18:37:10     1900
 4     4     4 2017-05-12 18:37:59     1950
 5     5     4 2017-05-12 18:38:40     2500

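It contains timestamps. Think of the rows as requests to a website; I want to approximate "sessions".

In other words, I want to group rows 1, 2, ..., n together if the time difference between row i and row i + 1 is less than 1 minute.

So the data above would be grouped into {1} and {2,3,4,5}.

Note that this is not a duplicate of the other questions that ask about grouping within a predefined time interval: I don't care how large the time difference between the first and last element is, I only care about the difference between adjacent rows.

How can I do this?

Sample data: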

structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860, 
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

3 answers:

Answer 0 (score: 2)

You can use the difftime function from base R.

Code:

# Wanted time difference in minutes
wantedDiff <- 1
timeDiff <- abs(difftime(df$time[-nrow(df)], 
                         df$time[-1], 
                         units = "mins"))
df$group <- cumsum(c(0, as.numeric(timeDiff >= wantedDiff)))

Result:

   id user                time group
1   1    4 2017-05-12 19:34:20     0
2   2    4 2017-05-12 19:37:07     1
3   3    4 2017-05-12 19:37:10     1
4   4    4 2017-05-12 19:39:32     2
5   5    4 2017-05-12 19:39:33     2
6   6    4 2017-05-12 19:39:57     2
7   7    4 2017-05-12 19:39:58     2
8   8    4 2017-05-12 19:40:00     2
9   9    4 2017-05-12 19:41:49     3
10 10    4 2017-05-12 19:41:52     3
11 11    4 2017-05-12 19:42:19     3
12 12    4 2017-05-12 19:42:20     3
13 13    4 2017-05-12 19:42:23     3
14 14    4 2017-05-12 19:42:23     3
15 15    4 2017-05-12 19:43:24     4
16 16    4 2017-05-12 19:43:25     4
17 17    4 2017-05-12 19:43:27     4
18 18    4 2017-05-12 19:44:52     5
19 19    4 2017-05-12 19:44:53     5
20 20    4 2017-05-12 19:44:55     5

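Explanation:

  • Use difftime to calculate the absolute time difference between each row and the previous row
    • The units of the difference can be specified here
    • The output (timeDiff) looks like this: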
Time differences in mins
 [1] 2.78333333 0.05000000 2.36666667 0.01666667 0.40000000 0.01666667 0.03333333 1.81666667 0.05000000 0.45000000
[11] 0.01666667 0.05000000 0.00000000 1.01666667 0.01666667 0.03333333 1.41666667 0.01666667 0.03333333
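  • Test whether each time difference is greater than or equal to wantedDiff and convert the logical result to numeric
  • Apply cumsum to the numeric result (each 1 adds +1, i.e. switches to a new group)

Data: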
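
As a small worked example of the steps above, here is a sketch applied to the five timestamps shown in the question. The time zone is fixed to UTC purely as an assumption for reproducibility, and diff() is used as a shorthand for the row-to-row difference instead of the two-vector difftime call.

```r
# The five timestamps from the question (UTC is an assumption for reproducibility)
times <- as.POSIXct(c("2017-05-12 18:34:20", "2017-05-12 18:37:07",
                      "2017-05-12 18:37:10", "2017-05-12 18:37:59",
                      "2017-05-12 18:38:40"), tz = "UTC")

# Gap between each row and the previous one, in minutes
gaps <- as.numeric(diff(times), units = "mins")

# A gap of 1 minute or more starts a new group; the first row opens group 0
group <- cumsum(c(0, gaps >= 1))
group
# 0 1 1 1 1
```

Only the first gap (167 seconds) reaches one minute, so the rows fall into {1} and {2,3,4,5}, as requested in the question.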

df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860, 
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

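Answer 1 (score: 1)

Here is a solution using an extended sample dataset. The key parts of this approach are using lubridate::ymd_hms to convert the strings into date-times that support arithmetic, then using lag to determine whether each time is within one minute of the previous row. You can then build the groups with a for loop, incrementing the group number each time you reach a row that is not within one minute of the previous one. This could certainly be tightened up a bit, and I'd be interested to see whether anyone can do it without resorting to a for loop and bind_cols!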

library(tidyverse)
library(lubridate)  # provides ymd_hms(); older tidyverse versions do not attach it
tbl <- tibble(
  id = 1:8,
  time = c("2017-05-12 18:34:20",
           "2017-05-12 18:37:07",
           "2017-05-12 18:37:10",
           "2017-05-12 18:37:59",
           "2017-05-12 18:38:40",
           "2017-05-12 18:40:40",
           "2017-05-12 18:40:49",
           "2017-05-12 18:43:40"
           )
)

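tbl2 <- tbl %>%
  mutate(time = ymd_hms(time)) %>%
  # Gap to the previous row in seconds; the first row has no predecessor, so its gap is 0
  mutate(separation = as.numeric(difftime(time, lag(time, default = first(time)), units = "secs"))) %>%
  mutate(onemin = separation <= 60)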

group_ids = 1
for (i in 2:nrow(tbl2)){
  if (tbl2$onemin[i] == FALSE){
    group_ids[i] <- group_ids[i - 1] +1
  } else
  group_ids[i] <- group_ids[i - 1]
}

tbl2 %>%
  bind_cols(., group = group_ids) %>%
  select(id, time, group)

# A tibble: 8 x 3
     id time                group
  <int> <dttm>              <dbl>
1     1 2017-05-12 18:34:20  1.00
2     2 2017-05-12 18:37:07  2.00
3     3 2017-05-12 18:37:10  2.00
4     4 2017-05-12 18:37:59  2.00
5     5 2017-05-12 18:38:40  2.00
6     6 2017-05-12 18:40:40  3.00
7     7 2017-05-12 18:40:49  3.00
8     8 2017-05-12 18:43:40  4.00
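
On the closing question of this answer, one way to avoid the for loop and bind_cols is to take the cumulative sum of a "new group" flag inside mutate. The following is only a sketch under that idea: the column names gap_secs and new_group are made up for illustration, and the timestamps are parsed with as.POSIXct in UTC instead of ymd_hms to keep the example dependent on dplyr alone.

```r
library(dplyr)

tbl <- tibble(
  id = 1:8,
  time = as.POSIXct(c("2017-05-12 18:34:20", "2017-05-12 18:37:07",
                      "2017-05-12 18:37:10", "2017-05-12 18:37:59",
                      "2017-05-12 18:38:40", "2017-05-12 18:40:40",
                      "2017-05-12 18:40:49", "2017-05-12 18:43:40"),
                    tz = "UTC")
)

res <- tbl %>%
  mutate(
    # Seconds since the previous row; NA for the first row
    gap_secs = as.numeric(difftime(time, lag(time), units = "secs")),
    # The first row, and every row more than 60 s after its predecessor, opens a group
    new_group = is.na(gap_secs) | gap_secs > 60,
    # Running count of group openings gives the group number
    group = cumsum(new_group)
  ) %>%
  select(id, time, group)

res
```

With this data the group column comes out as 1, 2, 2, 2, 2, 3, 3, 4, the same grouping as the loop produces.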

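Answer 2 (score: 1)

One possible solution is to use the lag function from the dplyr package together with cumsum from base R.

The approach is:

  • Find the time difference (in seconds) between each row and the previous one
  • If difftime exceeds 60, the row starts a new group (newgroup)
  • Apply cumsum to newgroup to get the group number for each row.

The code is: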

    #data
    library(dplyr)
df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
     4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), 
     time = structure(c(1494606860,1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
          1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
           1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
          1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
          "user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
              "tbl", "data.frame"))



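# Seconds since the previous row (units fixed explicitly); the first row gets 0
df %>% mutate(difftime = as.numeric(difftime(time, lag(time), units = "secs"))) %>%
       mutate(difftime = ifelse(is.na(difftime), 0, difftime)) %>%
       mutate(newgroup = ifelse(difftime > 60, 1, 0)) %>%
       mutate(group = factor(cumsum(newgroup))) %>%
       select(id, user, time, group)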

    #Result
      id  user time                group 
   <int> <int> <dttm>              <fctr>
 1     1     4 2017-05-12 17:34:20 0     
 2     2     4 2017-05-12 17:37:07 1     
 3     3     4 2017-05-12 17:37:10 1     
 4     4     4 2017-05-12 17:39:32 2     
 5     5     4 2017-05-12 17:39:33 2     
 6     6     4 2017-05-12 17:39:57 2     
 7     7     4 2017-05-12 17:39:58 2     
 8     8     4 2017-05-12 17:40:00 2     
 9     9     4 2017-05-12 17:41:49 3     
10    10     4 2017-05-12 17:41:52 3     
11    11     4 2017-05-12 17:42:19 3     
12    12     4 2017-05-12 17:42:20 3     
13    13     4 2017-05-12 17:42:23 3     
14    14     4 2017-05-12 17:42:23 3     
15    15     4 2017-05-12 17:43:24 4     
16    16     4 2017-05-12 17:43:25 4     
17    17     4 2017-05-12 17:43:27 4     
18    18     4 2017-05-12 17:44:52 5     
19    19     4 2017-05-12 17:44:53 5     
20    20     4 2017-05-12 17:44:55 5
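
One detail worth checking is the boundary case. The difftime-based answer starts a new group when the gap is at least one minute (>=), while the code in this answer does so only when the gap exceeds 60 seconds (>). The question asks to keep rows together when the gap is less than 1 minute, so the two rules disagree for a gap of exactly 60 seconds. A small sketch with made-up timestamps (not taken from the original data) shows the difference:

```r
# Hypothetical timestamps: the second gap is exactly 60 seconds
times <- as.POSIXct("2017-05-12 18:00:00", tz = "UTC") + c(0, 10, 70, 80)

# Gaps between consecutive rows, in seconds
gaps <- as.numeric(diff(times), units = "secs")   # 10 60 10

# New group when the gap is at least 60 s (matches "less than 1 minute stays together")
group_ge <- cumsum(c(0, gaps >= 60))   # 0 0 1 1

# New group only when the gap is strictly greater than 60 s
group_gt <- cumsum(c(0, gaps > 60))    # 0 0 0 0
```

Neither sample dataset in this thread contains a gap of exactly 60 seconds, so both rules give the same groups on the data shown.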