R:计算指定时间范围内不同类别的数量

时间:2017-01-17 09:12:31

标签: r data.table dplyr distinct-values

这是一些虚拟数据:

  user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple

我想为每个user_id计算指定时间段内不同categories的数量(例如过去7天,14天),包括当前订单

解决方案如下所示:

 user_id       date category distinct_7 distinct_14
      27 2016-01-01    apple          1           1
      27 2016-01-03    apple          1           1
      27 2016-01-05     pear          2           2
      27 2016-01-07     plum          3           3
      27 2016-01-10    apple          3           3
      27 2016-01-14     pear          3           3
      27 2016-01-16     plum          3           3
      11 2016-01-01    apple          1           1
      11 2016-01-03     pear          2           2
      11 2016-01-05     pear          2           2
      11 2016-01-07     pear          2           2
      11 2016-01-10    apple          2           2
      11 2016-01-14    apple          2           2
      11 2016-01-16    apple          1           2

我发布了类似问题herehere,但没有一个问题指的是计算指定时间段内的累积唯一值。非常感谢你的帮助!

3 个答案:

答案 0 :(得分:3)

以下是两个data.table解决方案,一个包含两个嵌套lapply,另一个使用非equi连接

第一个是一个相当笨拙的data.table解决方案,但它重现了预期的答案。它可以在任意数量的时间范围内工作。 (虽然@ alistaire的简明tidyverse解决方案在他的评论中也可以修改。

它使用两个嵌套的lapply。第一个循环在时间范围内,第二个循环在日期上​​。临时结果与原始数据相结合,然后从长格式转换为宽格式,以便我们以每个时间帧的单独列结束。

library(data.table)
tmp <- rbindlist(
  lapply(c(7L, 14L), 
         function(ldays) rbindlist(
           lapply(unique(dt$date), 
                  function(ldate) {
                    dt[between(date, ldate - ldays, ldate), 
                       .(distinct = sprintf("distinct_%02i", ldays), 
                         date = ldate, 
                         N = uniqueN(category)), 
                       by = .(user_id)]
                  })
         )
  )
)
dcast(tmp[dt, on=c("user_id", "date")], 
      ... ~ distinct, value.var = "N")[order(-user_id, date, category)] 
#          date user_id category distinct_07 distinct_14
# 1: 2016-01-01      27    apple           1           1
# 2: 2016-01-03      27    apple           1           1
# 3: 2016-01-05      27     pear           2           2
# 4: 2016-01-07      27     plum           3           3
# 5: 2016-01-10      27    apple           3           3
# 6: 2016-01-14      27     pear           3           3
# 7: 2016-01-16      27     plum           3           3
# 8: 2016-01-01      11    apple           1           1
# 9: 2016-01-03      11     pear           2           2
#10: 2016-01-05      11     pear           2           2
#11: 2016-01-07      11     pear           2           2
#12: 2016-01-10      11    apple           2           2
#13: 2016-01-14      11    apple           2           2
#14: 2016-01-16      11    apple           1           2

以下是使用data.table&#39; 非等联接而非第二个lapply的变体following a suggestion by @Frank

tmp <- rbindlist(
  lapply(c(7L, 14L), 
         function(ldays) {
           dt[.(user_id = user_id, dago = date - ldays, d = date), 
              on=.(user_id, date >= dago, date <= d), 
              .(distinct = sprintf("distinct_%02i", ldays),
                N = uniqueN(category)), 
              by = .EACHI]
         }
  )
)[, date := NULL]
# 
dcast(tmp[dt, on=c("user_id", "date")], 
      ... ~ distinct, value.var = "N")[order(-user_id, date, category)] 

数据:

dt <- fread("user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple")
dt[, date := as.IDate(date)]
BTW:过去7天,14天中的措词有些误导,因为时间段实际上包括8天和15天,分别为

答案 1 :(得分:2)

U建议使用runner软件包。您可以在具有runner功能的运行窗口上使用任何R功能。下面的代码获取延迟的输出,即过去7天+当前和过去14天+当前(当前8天和15天):

df <- read.table(
  text = "  user_id       date category
       27 2016-01-01    apple
  27 2016-01-03    apple
  27 2016-01-05     pear
  27 2016-01-07     plum
  27 2016-01-10    apple
  27 2016-01-14     pear
  27 2016-01-16     plum
  11 2016-01-01    apple
  11 2016-01-03     pear
  11 2016-01-05     pear
  11 2016-01-07     pear
  11 2016-01-10    apple
  11 2016-01-14    apple
  11 2016-01-16    apple", header = TRUE, colClasses = c("integer", "Date", "character"))



library(dplyr)
library(runner)
df %>%
  group_by(user_id) %>%
  mutate(distinct_7  = runner(category, k = 7 + 1, idx = date, 
                              f = function(x) length(unique(x))),
         distinct_14 = runner(category, k = 14 + 1, idx = date, 
                              f = function(x) length(unique(x))))

packagefunction文档中的更多信息。

答案 2 :(得分:1)

在tidyverse中,您可以使用map_int迭代一组值并简化为整数àsapplyvapply。通过比较或帮助n_distinct计算对象子集的length(unique(...))(如between)的不同事件,并从当天减去适当数量的最小值设置,然后设置

library(tidyverse)

df %>% group_by(user_id) %>% 
    mutate(distinct_7  = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])), 
           distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))

## Source: local data frame [14 x 5]
## Groups: user_id [2]
## 
##    user_id       date category distinct_7 distinct_14
##      <int>     <date>   <fctr>      <int>       <int>
## 1       27 2016-01-01    apple          1           1
## 2       27 2016-01-03    apple          1           1
## 3       27 2016-01-05     pear          2           2
## 4       27 2016-01-07     plum          3           3
## 5       27 2016-01-10    apple          3           3
## 6       27 2016-01-14     pear          3           3
## 7       27 2016-01-16     plum          3           3
## 8       11 2016-01-01    apple          1           1
## 9       11 2016-01-03     pear          2           2
## 10      11 2016-01-05     pear          2           2
## 11      11 2016-01-07     pear          2           2
## 12      11 2016-01-10    apple          2           2
## 13      11 2016-01-14    apple          2           2
## 14      11 2016-01-16    apple          1           2