这是一些虚拟数据:
user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple
我想为每个user_id
计算指定时间段内不同categories
的数量(例如过去7天,14天),包括当前订单
解决方案如下所示:
user_id date category distinct_7 distinct_14
27 2016-01-01 apple 1 1
27 2016-01-03 apple 1 1
27 2016-01-05 pear 2 2
27 2016-01-07 plum 3 3
27 2016-01-10 apple 3 3
27 2016-01-14 pear 3 3
27 2016-01-16 plum 3 3
11 2016-01-01 apple 1 1
11 2016-01-03 pear 2 2
11 2016-01-05 pear 2 2
11 2016-01-07 pear 2 2
11 2016-01-10 apple 2 2
11 2016-01-14 apple 2 2
11 2016-01-16 apple 1 2
答案 0 :(得分:3)
以下是两个data.table
解决方案,一个包含两个嵌套lapply
,另一个使用非equi连接。
第一个是一个相当笨拙的data.table
解决方案,但它重现了预期的答案。它可以在任意数量的时间范围内工作。 (虽然@ alistaire的简明tidyverse
解决方案在他的评论中也可以修改。
它使用两个嵌套的lapply
。第一个循环在时间范围内,第二个循环在日期上。临时结果与原始数据相结合,然后从长格式转换为宽格式,以便我们以每个时间帧的单独列结束。
library(data.table)
tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) rbindlist(
lapply(unique(dt$date),
function(ldate) {
dt[between(date, ldate - ldays, ldate),
.(distinct = sprintf("distinct_%02i", ldays),
date = ldate,
N = uniqueN(category)),
by = .(user_id)]
})
)
)
)
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
# date user_id category distinct_07 distinct_14
# 1: 2016-01-01 27 apple 1 1
# 2: 2016-01-03 27 apple 1 1
# 3: 2016-01-05 27 pear 2 2
# 4: 2016-01-07 27 plum 3 3
# 5: 2016-01-10 27 apple 3 3
# 6: 2016-01-14 27 pear 3 3
# 7: 2016-01-16 27 plum 3 3
# 8: 2016-01-01 11 apple 1 1
# 9: 2016-01-03 11 pear 2 2
#10: 2016-01-05 11 pear 2 2
#11: 2016-01-07 11 pear 2 2
#12: 2016-01-10 11 apple 2 2
#13: 2016-01-14 11 apple 2 2
#14: 2016-01-16 11 apple 1 2
以下是使用data.table
&#39; 非等联接而非第二个lapply
的变体following a suggestion by @Frank:
tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) {
dt[.(user_id = user_id, dago = date - ldays, d = date),
on=.(user_id, date >= dago, date <= d),
.(distinct = sprintf("distinct_%02i", ldays),
N = uniqueN(category)),
by = .EACHI]
}
)
)[, date := NULL]
#
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
数据:
dt <- fread("user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple")
dt[, date := as.IDate(date)]
BTW:过去7天,14天中的措词有些误导,因为时间段实际上包括8天和15天,分别为。
答案 1 :(得分:2)
U建议使用runner软件包。您可以在具有runner
功能的运行窗口上使用任何R功能。下面的代码获取延迟的输出,即过去7天+当前和过去14天+当前(当前8天和15天):
df <- read.table(
text = " user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple", header = TRUE, colClasses = c("integer", "Date", "character"))
library(dplyr)
library(runner)
df %>%
group_by(user_id) %>%
mutate(distinct_7 = runner(category, k = 7 + 1, idx = date,
f = function(x) length(unique(x))),
distinct_14 = runner(category, k = 14 + 1, idx = date,
f = function(x) length(unique(x))))
答案 2 :(得分:1)
在tidyverse中,您可以使用map_int
迭代一组值并简化为整数àsapply
或vapply
。通过比较或帮助n_distinct
计算对象子集的length(unique(...))
(如between
)的不同事件,并从当天减去适当数量的最小值设置,然后设置
library(tidyverse)
df %>% group_by(user_id) %>%
mutate(distinct_7 = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])),
distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))
## Source: local data frame [14 x 5]
## Groups: user_id [2]
##
## user_id date category distinct_7 distinct_14
## <int> <date> <fctr> <int> <int>
## 1 27 2016-01-01 apple 1 1
## 2 27 2016-01-03 apple 1 1
## 3 27 2016-01-05 pear 2 2
## 4 27 2016-01-07 plum 3 3
## 5 27 2016-01-10 apple 3 3
## 6 27 2016-01-14 pear 3 3
## 7 27 2016-01-16 plum 3 3
## 8 11 2016-01-01 apple 1 1
## 9 11 2016-01-03 pear 2 2
## 10 11 2016-01-05 pear 2 2
## 11 11 2016-01-07 pear 2 2
## 12 11 2016-01-10 apple 2 2
## 13 11 2016-01-14 apple 2 2
## 14 11 2016-01-16 apple 1 2