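Using R 3.1.1.
I have a data set containing transaction data. Each customer has purchased at least twice (I've already subsetted the original data). What I'd like to do is mark each transaction as either a "first time buyer" transaction or a "repeat buyer" transaction. The catch is that I'd like to define a "repeat buyer" transaction as one that falls within a certain time frame of a past transaction, so it's not as simple as marking the first transaction per customer as "first" and the rest as "repeat". If a customer hasn't purchased in over 1 year (52.25 weeks), I'd like him/her to be counted as a first-time buyer again!
The best way I could think of to accomplish this is horribly inefficient, I think (full disclosure: it's still running, so it could be wrong to boot). I'm using nested for loops... :(
Any suggestions on how to do this more efficiently? Thanks in advance for your help and advice! The code is commented throughout, so I'll let it speak for itself, but please let me know if anything is unclear!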
# Make sure repdata is ordered by date first
repdata <- repdata[order(repdata$date), ]
# Now loop through repdata and decide whether each purchase
# is by a first-time or a repeat buyer.
# Set the time frame to 1 year (52.25 weeks, since weeks are the units used below)
timeframe <- 52.25
# Add the new column to repdata that is filled in below
repdata$rpt52wk <- ""
# For each row in repdata, do the following
for (i in seq_along(repdata$date)) {
  # Assume this is a first purchase; set rpt52wk for [i] to "FIRST TIME BUYER"
  repdata$rpt52wk[i] <- "FIRST TIME BUYER"
  # Look at all previous transactions;
  # higher-indexed transactions can be ignored (the data is sorted ascending by date)
  for (j in seq_along(repdata$date[1:(i - 1)])) {
    # If a transaction is found in which the same member bought within the timeframe
    if (repdata$MEMBER_ID[i] == repdata$MEMBER_ID[j] &&
        difftime(repdata$date[i], repdata$date[j], units = "weeks") < timeframe) {
      # then this is a repeat buyer; set rpt52wk for [i] accordingly
      repdata$rpt52wk[i] <- "REPEAT BUYER"
    }
  }
}
Adding test data that fails, at least when I run it with the two solutions provided so far.
MEMBER_ID date
1 2011-04-13
2 2011-04-22
3 2011-04-17
3 2011-04-26
4 2011-04-13
4 2011-04-16
4 2011-04-16
5 2011-04-20
5 2011-04-13
5 2011-04-18
6 2011-04-13
7 2011-04-13
8 2011-04-25
8 2011-04-20
9 2011-04-14
10 2011-04-14
11 2011-04-18
12 2011-04-15
13 2011-04-15
14 2011-04-13
#TEST SET GENERATION:
library(lubridate)
MEMBER_ID <- c(1,2,3,3,4,4,4,5,5,5,6,7,8,8,9,10,11,12,13,14)
date <- ymd(c("2011-04-13 UTC", "2011-04-22 UTC", "2011-04-17 UTC", "2011-04-26 UTC",
"2011-04-13 UTC", "2011-04-16 UTC", "2011-04-16 UTC", "2011-04-20 UTC",
"2011-04-13 UTC", "2011-04-18 UTC", "2011-04-13 UTC", "2011-04-13 UTC",
"2011-04-25 UTC", "2011-04-20 UTC", "2011-04-14 UTC", "2011-04-14 UTC",
"2011-04-18 UTC", "2011-04-15 UTC", "2011-04-15 UTC", "2011-04-13 UTC"))
rm(repdata)
repdata <- data.frame(MEMBER_ID, date)
repdata
(Note that I found the code has a bug for i = 1. I'm going to ignore it for now rather than add another if statement to the for loop.)
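For reference, the i = 1 problem seems to come from `1:(i - 1)` evaluating to `c(1, 0)` when `i` is 1, so the inner loop still runs once and compares the first row with itself. A minimal sketch of the difference, using a made-up two-element `dates` vector (an assumption for illustration only) and `seq_len()` as one possible replacement:

```r
# Hypothetical two-row example, only to show the index behaviour at i = 1
dates <- as.Date(c("2011-04-13", "2011-04-16"))
i <- 1

# 1:(i - 1) is 1:0, i.e. c(1, 0); the 0 index is dropped, so one element remains
bad_idx <- seq_along(dates[1:(i - 1)])

# seq_len(0) is an empty vector, so a loop over it does not run at all
good_idx <- seq_len(i - 1)

length(bad_idx)
length(good_idx)
```

With `seq_len(i - 1)` in the inner loop, the first row would have no previous transactions to compare against and would keep its "FIRST TIME BUYER" label.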
Answer 0 (score: 0)
You can try using ddply.
First, generate the data set sorted by date and define a timeframe of 52 weeks.
#TEST SET GENERATION:
library(lubridate)
MEMBER_ID <- c(1,2,3,3,4,4,4,5,5,5,6,7,8,8,9,10,11,12,13,14)
date <- ymd(c("2011-04-13 UTC", "2011-04-22 UTC", "2011-04-17 UTC", "2011-04-26 UTC",
"2011-04-13 UTC", "2011-04-16 UTC", "2011-04-16 UTC", "2011-04-20 UTC",
"2011-04-13 UTC", "2011-04-18 UTC", "2011-04-13 UTC", "2011-04-13 UTC",
"2011-04-25 UTC", "2011-04-20 UTC", "2011-04-14 UTC", "2011-04-14 UTC",
"2011-04-18 UTC", "2011-04-15 UTC", "2011-04-15 UTC", "2011-04-13 UTC"))
rm(repdata)
repdata <- data.frame(MEMBER_ID, date)
repdata <- repdata[order(repdata$date),]
repdata
# define a timeframe of 52 weeks
timeframe <- as.difftime(52, units = "weeks")
Then adapt the following code:
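library(plyr)
# Within each member, keep the first row and any row more than `timeframe` after the previous one
first.buyers <- ddply(repdata, .(MEMBER_ID),
                      function(x) x[c(TRUE, diff(x$date) > timeframe), ])
first.buyers <- mutate(first.buyers, rpt52wk = "FIRST TIME BUYER")
# Rows of repdata with no match in first.buyers get NA, which marks them as repeat purchases
final <- merge(repdata, first.buyers, all = TRUE)
final[is.na(final$rpt52wk), "rpt52wk"] <- "REPEAT BUYER"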
We get the following result:
MEMBER_ID date rpt52wk
1 1 2011-04-13 FIRST TIME BUYER
2 2 2011-04-22 FIRST TIME BUYER
3 3 2011-04-17 FIRST TIME BUYER
4 3 2011-04-26 REPEAT BUYER
5 4 2011-04-13 FIRST TIME BUYER
6 4 2011-04-16 REPEAT BUYER
7 4 2011-04-16 REPEAT BUYER
8 5 2011-04-13 FIRST TIME BUYER
9 5 2011-04-18 REPEAT BUYER
10 5 2011-04-20 REPEAT BUYER
11 6 2011-04-13 FIRST TIME BUYER
12 7 2011-04-13 FIRST TIME BUYER
13 8 2011-04-20 FIRST TIME BUYER
14 8 2011-04-25 REPEAT BUYER
15 9 2011-04-14 FIRST TIME BUYER
16 10 2011-04-14 FIRST TIME BUYER
17 11 2011-04-18 FIRST TIME BUYER
18 12 2011-04-15 FIRST TIME BUYER
19 13 2011-04-15 FIRST TIME BUYER
20 14 2011-04-13 FIRST TIME BUYER
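ddply splits the data frame by MEMBER_ID and applies the function to each subset. Each subset is a data frame with a single MEMBER_ID and ordered dates. The first element always corresponds to a first-time buyer; for each following element, you have to determine whether the time elapsed since the previous transaction is greater than your threshold (if it is, the member can be considered a first-time buyer again).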
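To make the per-member logic concrete, here is a small sketch of what the anonymous function computes for one subset. The three dates are invented for illustration (they are not part of the test data) so that one gap exceeds the threshold:

```r
# Hypothetical ordered purchase dates for a single member
dates <- as.Date(c("2011-04-13", "2011-04-18", "2012-06-01"))
timeframe <- as.difftime(52, units = "weeks")

# The first purchase is always a first-time purchase; each later one counts
# as "first" again only if the gap to the previous purchase exceeds the threshold
is_first <- c(TRUE, diff(dates) > timeframe)
is_first
```

Here the second purchase comes 5 days after the first, so it is a repeat, while the third comes more than 52 weeks after the second, so the member is treated as a first-time buyer again.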
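In the code above, you should check that the time units are consistent when making the comparison diff(x$date) > timeframe (this depends on your date format).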
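One way to take the guesswork out of the units is to convert the gaps to plain numbers in a unit you choose before comparing. The sketch below assumes the dates are of class Date and uses two made-up dates one week apart:

```r
# Hypothetical pair of dates exactly one week apart
d <- as.Date(c("2011-04-13", "2011-04-20"))

gap <- diff(d)            # a difftime; for Date input R reports it in days
gap_units <- units(gap)

# Convert explicitly to weeks so the threshold can be a plain number of weeks
gap_weeks <- as.numeric(gap, units = "weeks")
within_timeframe <- gap_weeks < 52.25
```

Note that comparing a difftime against a bare number, as in `diff(d) > 52`, uses whatever units the difftime happens to carry (days here), which is easy to get wrong.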
Once you have found the first-time buyers, I think the next steps are fairly clear.
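If you would rather avoid the merge step (for example, because several rows can share the same MEMBER_ID and date), the same idea can be sketched in base R with `ave()`. This is an alternative to the ddply code above, not a restatement of it, and the small data frame below is invented for illustration:

```r
# Hypothetical data: member 3 buys twice within days, member 5 returns after more than a year
repdata <- data.frame(
  MEMBER_ID = c(3, 3, 5, 5, 5),
  date = as.Date(c("2011-04-17", "2011-04-26",
                   "2011-04-20", "2011-04-13", "2012-06-01"))
)

# Sort by member and date so that diff() measures the gap to the previous purchase
repdata <- repdata[order(repdata$MEMBER_ID, repdata$date), ]

# Gap in days to the member's previous purchase; Inf marks the very first one
gap_days <- ave(as.numeric(repdata$date), repdata$MEMBER_ID,
                FUN = function(d) c(Inf, diff(d)))

# 52.25 weeks expressed in days
threshold_days <- 52.25 * 7
repdata$rpt52wk <- ifelse(gap_days > threshold_days,
                          "FIRST TIME BUYER", "REPEAT BUYER")
repdata
```

Because the label is computed row by row on the sorted data, no join is needed and duplicate transactions on the same day are simply marked as repeats.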