I have a large dataset with 1,008,412 observations; the columns are customer_id (int), visit_date (a Date, format "2010-04-04") and visit_spend (float). This date function is used for aggregation; it maps the weeks of interest onto the numeric range 13-65:
weekofperiod <- function(dt) {
  as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y")) - 2010)
}
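For example (repeating the definition above so the snippet is self-contained), the first and last days of the period map onto the ends of the 13-65 range:

```r
weekofperiod <- function(dt) {
  as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y")) - 2010)
}

weekofperiod("2010-04-04")  # week 13, the start of the range
weekofperiod("2011-03-31")  # week 65, the end of the range
```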
Each customer_id has a variable total number of visits over the 53-week period. For each customer_id, I want a summary of spend_per_week by weekofperiod(). The code below is functionally correct but very slow - any comments on making it faster? Also, aggregate() produces sparse output in which the weeks with no visits are missing, so I initialize spend_per_week to 0 and then manually assign the non-zero results from aggregate() row by row, to ensure the result always has 53 rows. Surely this can be done better?
Sample dataset rows look like this:
customer_id visit_date visit_spend
72 40 2011-03-15 18.38
73 40 2011-03-20 23.45
74 79 2010-04-07 150.87
75 79 2010-04-17 101.90
76 79 2010-05-02 111.90
Here is the code with the aggregate() call and the zero-filling for empty weeks:
for (cid in all_tt_cids) {
  print_pnq('Getting statistics for cid', cid)
  # Get row indices of the selected subset, for just this cid's records
  I <- which(tt$customer_id==cid & tt$visit_date<="2011-03-31")
  # (other code to compute other per-cid statistics)
  # spend_per_week (mode;mean;sd)
  # Aggregate spend_per_week, but beware this should be 0 for weeks with no visits
  spend_per_week <- data.frame(c(list('weekofperiod'=13:65), list('spendperweek'=0)))
  nonzero_spends_per_week <- aggregate(tt$visit_spend[I], list('weekofperiod'=weekofperiod(tt$visit_date[I])), FUN="sum")
  for (i in 1:nrow(nonzero_spends_per_week)) {
    spend_per_week[spend_per_week$weekofperiod==nonzero_spends_per_week[i,1],2] <- nonzero_spends_per_week[i,2]
  }
  colnames(spend_per_week)[2] <- 'spend_per_week'
  # (code to compute and store per-cid statistics on spend_per_week)
}
Answer 0 (score: 6)
Your biggest speedup will come from replacing the for loop. I can't tell from your example, since you overwrite each customer inside the loop, but here is one approach if you want to keep the information for all subjects. For testing, first define a function for the original method, then one for the new method without a loop:
weekofperiod <- function(dt) {
  as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y")) - 2010)
}
FastMethod <- function(tt) {
  tt$week = weekofperiod(tt$visit_date)
  spend_per_week.tmp = as.data.frame(tapply(tt$visit_spend, tt[,c(1,4)], sum))
  # tapply() yields NA for customer/week pairs with no visits; zero-fill them
  spend_per_week.tmp[is.na(spend_per_week.tmp)] = 0
  spend_per_week = data.frame(matrix(0, nrow=nrow(spend_per_week.tmp), ncol=length(13:65)))
  colnames(spend_per_week) = 13:65
  rownames(spend_per_week) = rownames(spend_per_week.tmp)
  spend_per_week[, colnames(spend_per_week.tmp)] = spend_per_week.tmp
  spend_per_week
}
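As a toy illustration of the tapply() step (`toy` and `tab` are made-up names): it cross-tabulates spend sums by customer and week, but customer/week pairs with no visits come back as NA rather than 0, so a zero-fill step may be needed:

```r
toy <- data.frame(customer_id = c(1, 1, 2),
                  week        = c(13, 13, 14),
                  spend       = c(10, 5, 20))
# One row per customer, one column per week; sums of spend in the cells
tab <- tapply(toy$spend, toy[, c("customer_id", "week")], sum)
# Pairs with no visits are NA; replace them with 0
tab[is.na(tab)] <- 0
```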
OrigMethod <- function(tt) {
  all_tt_cids = unique(tt$customer_id)
  for (cid in all_tt_cids) {
    # Get row indices of the selected subset, for just this cid's records
    I <- which(tt$customer_id==cid & tt$visit_date<="2011-03-31")
    # Aggregate spend_per_week, but beware this should be 0 for weeks with no visits
    spend_per_week <- data.frame(c(list('weekofperiod'=13:65), list('spendperweek'=0)))
    nonzero_spends_per_week <- aggregate(tt$visit_spend[I], list('weekofperiod'=weekofperiod(tt$visit_date[I])), FUN="sum")
    for (i in 1:nrow(nonzero_spends_per_week)) {
      spend_per_week[spend_per_week$weekofperiod==nonzero_spends_per_week[i,1],2] <- nonzero_spends_per_week[i,2]
    }
    colnames(spend_per_week)[2] <- 'spend_per_week'
  }
  spend_per_week
}
Now simulate a larger dataset to make the comparison easier:
n.row = 10^4
n.cust = 10^3
customer_id = 1:n.cust
dates = seq(as.Date('2010-04-01'), as.Date('2011-03-31'), by=1)
visit_date = sample(dates, n.row, replace=T)
visit_spend = runif(n.row, 0, 200)
tt = data.frame(customer_id, visit_date, visit_spend)
Finally, compare the two methods:
> system.time(FastMethod(tt))
   user  system elapsed
  0.082   0.001   0.083
> system.time(OrigMethod(tt))
   user  system elapsed
  4.505   0.007   4.514
This is already about 50x faster, and I bet you could make it even better with more optimization. Good luck!
Answer 1 (score: 3)
Here is an even faster approach using data.table, which is also easier to read.
FasterMethod <- function(tt){
  # LOAD LIBRARIES
  require(reshape2)
  require(data.table)
  tt <- transform(tt, week_of_period = weekofperiod(visit_date))
  # AGGREGATE SPEND BY CUSTOMER AND WEEK OF PERIOD
  tt <- data.table(tt)
  ans <- tt[,list(spend = sum(visit_spend)), 'customer_id, week_of_period']
  # RESHAPE TO CUSTOMER ID VS. WEEK OF PERIOD
  dcast(ans, customer_id ~ week_of_period, value.var = 'spend')
}
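To see what the dcast() step produces, here is a toy example (`ans2` and `wide` are made-up names); note that reshape2's dcast() leaves NA for missing customer/week combinations, which you may want to zero-fill:

```r
require(reshape2)

ans2 <- data.frame(customer_id    = c(1, 1, 2),
                   week_of_period = c(13, 14, 13),
                   spend          = c(10, 20, 5))
# Reshape long to wide: one row per customer, one column per week_of_period
wide <- dcast(ans2, customer_id ~ week_of_period, value.var = 'spend')
```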
We can benchmark this against FastMethod and OrigMethod using rbenchmark, and see that it is about 1.3x faster than FastMethod, for an overall speedup of around 70x:
library(rbenchmark)
benchmark(FastMethod(tt), FasterMethod(tt), replications = 40)
              test elapsed relative
    FastMethod(tt)   5.594 1.346654
  FasterMethod(tt)   4.154 1.000000
You can speed this up even further (about 2.5x relative to FastMethod) if you don't care about reshaping the final output to customer_id vs. week_of_period.