I'm new to R, and I have a dataset containing client numbers and dates for a few thousand events. The data is formatted as follows:
data <- data.frame("Client"=c(rep(1, 4), rep(2, 3), rep(3, 2)), "Date"=as.Date(c("2015-11-20", "2015-12-04", "2016-01-08", "2016-04-07", "2015-12-19", "2016-02-02", "2016-02-21", "2016-01-04", "2016-02-12")), "Event"=rep(1, 9))
data
  Client       Date Event
1      1 2015-11-20     1
2      1 2015-12-04     1
3      1 2016-01-08     1
4      1 2016-04-07     1
5      2 2015-12-19     1
6      2 2016-02-02     1
7      2 2016-02-21     1
8      3 2016-01-04     1
9      3 2016-02-12     1
Given a set of reference dates,
refdates <- as.Date(c("2016-01-01", "2016-03-01"))
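For each client and each reference date in the set, I would like to count the number of events that occurred (1) in the 30 days after the reference date, (2) 0-30 days before the reference date, and (3) 31-60 days before the reference date.
I would like the output to be a data frame like this: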
  Client    RefDate post30 prior30 prior31.60
1      1 2016-01-01      1       1          1
2      1 2016-03-01      0       0          1
3      2 2016-01-01      0       1          0
4      2 2016-03-01      0       2          0
5      3 2016-01-01      1       0          0
6      3 2016-03-01      0       1          1
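I feel like I should be able to do this with plyr, but I'm a bit out of my depth. Can someone point me in the right direction?
Answer 0 (score: 3)
Here is a base R approach.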
do.call(rbind, lapply(refdates, FUN=function(i) {
aggregate(cbind("post30"=data$Date - i > -1 & data$Date - i < 31,
"prior30"=data$Date - i > -31 & data$Date - i < 0,
"prior31.60"=data$Date - i > -61 & data$Date - i < -30),
list(data$Client), FUN=sum)
}))
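Note that the aggregate call above labels the client column Group.1 and does not record which reference date each row belongs to. The following is a minimal sketch of one way to add both, assuming the same window boundaries as the code above (0 to 30 days after, 1 to 30 days before, 31 to 60 days before). The helper name count_windows is hypothetical and not part of the original answer.

```r
data <- data.frame(
  Client = c(rep(1, 4), rep(2, 3), rep(3, 2)),
  Date = as.Date(c("2015-11-20", "2015-12-04", "2016-01-08", "2016-04-07",
                   "2015-12-19", "2016-02-02", "2016-02-21",
                   "2016-01-04", "2016-02-12")),
  Event = rep(1, 9)
)
refdates <- as.Date(c("2016-01-01", "2016-03-01"))

# Hypothetical helper: counts events per client in each window for one reference date
count_windows <- function(ref) {
  d <- as.numeric(data$Date - ref)  # day offset of every event from the reference date
  res <- aggregate(
    cbind(post30     = d >= 0   & d <= 30,
          prior30    = d >= -30 & d <= -1,
          prior31.60 = d >= -60 & d <= -31),
    by = list(Client = data$Client),  # naming the list element names the grouping column
    FUN = sum
  )
  # Put the reference date next to the client column
  cbind(res["Client"], RefDate = ref, res[-1])
}

# Index by position so each element is passed as a length-one Date
out <- do.call(rbind, lapply(seq_along(refdates),
                             function(k) count_windows(refdates[k])))

# Sort to match the layout requested in the question
out <- out[order(out$Client, out$RefDate), ]
out
```

The result has the columns Client, RefDate, post30, prior30 and prior31.60, with one row per client and reference date.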
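Here is a quick breakdown:
- aggregate sums the logical values for each client within each time window, for one particular reference date.
- cbind lets us compute several windows at once and gives the output columns their names.
- lapply runs through the reference dates and calls aggregate for each one. This returns a list of data.frames.
- do.call takes that list of data.frames and rbinds them to create a single data.frame.
Answer 1 (score: 1)
I used dplyr in my example. You said the data has only a few thousand rows, so as long as the number of reference dates is not too large, this should not be too computationally expensive.
library(dplyr)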
data <- data.frame("Client"=c(rep(1, 4), rep(2, 3), rep(3, 2)), "Date"=as.Date(c("2015-11-20", "2015-12-04", "2016-01-08", "2016-04-07", "2015-12-19", "2016-02-02", "2016-02-21", "2016-01-04", "2016-02-12")), "Event"=rep(1, 9))
data
refdates <- as.Date(c("2016-01-01", "2016-03-01"))
data %>%
merge(refdates, all = TRUE) %>%
rename(RefDate = y) %>%
mutate(
post30 = ifelse(between(Date - RefDate, 1, 31), 1, 0),
prior30 = ifelse(between(Date - RefDate, -30, 0), 1, 0),
prior30.60 = ifelse(between(Date - RefDate, -60, -31), 1, 0)
) %>%
group_by(Client, RefDate) %>%
summarise(post30 = sum(post30),
prior30 = sum(prior30),
prior30.60 = sum(prior30.60)
)
This produces:
  Client    RefDate post30 prior30 prior30.60
   (dbl)     (date)  (dbl)   (dbl)      (dbl)
1      1 2016-01-01      1       1          1
2      1 2016-03-01      0       0          1
3      2 2016-01-01      0       1          0
4      2 2016-03-01      0       2          0
5      3 2016-01-01      1       0          0
6      3 2016-03-01      0       1          1
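The merge step works because the two inputs share no column names, so merge pairs every event with every reference date. A small sketch of that step on its own may make it easier to see. It assumes you are happy to wrap refdates in a one-column data frame and pass by = NULL to request the Cartesian product explicitly, which also avoids the later rename. The name pairs is just for illustration.

```r
data <- data.frame(
  Client = c(rep(1, 4), rep(2, 3), rep(3, 2)),
  Date = as.Date(c("2015-11-20", "2015-12-04", "2016-01-08", "2016-04-07",
                   "2015-12-19", "2016-02-02", "2016-02-21",
                   "2016-01-04", "2016-02-12")),
  Event = rep(1, 9)
)
refdates <- as.Date(c("2016-01-01", "2016-03-01"))

# by = NULL asks merge() for every combination of rows from both inputs
pairs <- merge(data, data.frame(RefDate = refdates), by = NULL)

nrow(pairs)   # 9 events x 2 reference dates = 18 rows
names(pairs)  # "Client" "Date" "Event" "RefDate"
```

The row count grows as events times reference dates, which is why this approach is fine for a few thousand events and a modest number of reference dates but could get heavy with many of either.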
Answer 2 (score: 1)
Using dplyr:
library(dplyr)
out <- data %>%
merge(refdates) %>%
rename(RefDate = y) %>%
group_by(Client, RefDate) %>%
mutate(Date.diff = Date - RefDate) %>%
summarise(post30 = sum(Date.diff < 30 & Date.diff > 0),
prior30 = sum(Date.diff < 0 & Date.diff > -30),
prior31.60 = sum(Date.diff < -30 & Date.diff > -60))
out
  Client    RefDate post30 prior30 prior31.60
   (dbl)     (date)  (int)   (int)      (int)
1      1 2016-01-01      1       1          1
2      1 2016-03-01      0       0          1
3      2 2016-01-01      0       1          0
4      2 2016-03-01      0       2          0
5      3 2016-01-01      1       0          0
6      3 2016-03-01      0       1          1
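One thing to watch with the strict comparisons (< and >) in the summarise call: an event exactly 0, 30 or 60 days from a reference date falls into no window at all. None of the sample events lands on such an edge, so the output above is unaffected, but real data might. The sketch below checks this on a handful of hypothetical day offsets and compares it with an inclusive variant, which is only one possible reading of the windows described in the question.

```r
# Hypothetical day offsets (event date minus reference date) on the window edges
d <- c(-60, -31, -30, -1, 0, 1, 30, 31)

# Strict comparisons, as in the summarise() call above
strict <- data.frame(
  diff       = d,
  post30     = d < 30  & d > 0,
  prior30    = d < 0   & d > -30,
  prior31.60 = d < -30 & d > -60
)

# Inclusive variant (an assumption about the intended windows)
inclusive <- data.frame(
  diff       = d,
  post30     = d >= 1   & d <= 30,
  prior30    = d >= -30 & d <= 0,
  prior31.60 = d >= -60 & d <= -31
)

# Offsets that are counted in no window under each rule
uncovered_strict    <- d[rowSums(strict[-1]) == 0]
uncovered_inclusive <- d[rowSums(inclusive[-1]) == 0]

uncovered_strict     # -60 -30 0 30 31
uncovered_inclusive  # 31
```

If the inclusive reading is the one you want, swapping the comparisons in summarise for >= and <= with these bounds should be enough.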