Question

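Using R 3.1.1.

I have a dataset containing transaction data. Every customer has purchased at least twice (I have already subsetted the original data). What I want to do is label each transaction as either a "first time buyer" transaction or a "repeat buyer" transaction. The catch is that I want to define a "repeat buyer transaction" as one that falls within a specific timeframe of a past transaction, so it is not as simple as labelling each customer's first transaction as "first" and the rest as "repeat". If a customer has not purchased in over 1 year (52.25 weeks), I want him/her to be counted as a first time buyer again!

The best way I could think of to achieve this is, I believe, very inefficient (full disclosure: it is still running, so it may be wrong to boot). I am using nested for loops... :(

Any suggestions on how to do this more efficiently? Thanks in advance for your help and advice! The code is commented throughout, so I will let it speak for itself, but let me know if anything is unclear!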


#let's ensure the repdata is ordered by date first
repdata <- repdata[order(repdata$date), ]
  
#now, we loop through repdata and decide whether purchase 
#is a first time or repeat buyer

#setting time frame to 1 year (52.25 weeks as we use week as units below)
timeframe = 52.25
#add new column to repdata that we will use below
repdata$rpt52wk <- ""

#for each row in repdata, do the following
for(i in seq_along(repdata$date)) 
{
  #assume that this is a first purchase; set rpt52wk var for [i] to "FIRST TIME BUYER"
  repdata$rpt52wk[i] = "FIRST TIME BUYER"
  
  #look at all previous transactions 
  #we can ignore higher indexed transactions (we sorted the data, ascending by date)
  for (j in seq_along(repdata$date[1:(i-1)]))
  {
    #if a transaction is found in which the same member bought within the timeframe
    if (repdata$MEMBER_ID[i] == repdata$MEMBER_ID[j] &&
        difftime(repdata$date[i], repdata$date[j], units = "weeks") < timeframe)
    {
      #then this is a repeat buyer; set rpt var for [i] appropriately
      repdata$rpt52wk[i]="REPEAT BUYER"
    }
  }
}


Adding test data that fails, at least when I run it with the two solutions provided so far.

MEMBER_ID       date
      1 2011-04-13
      2 2011-04-22
      3 2011-04-17
      3 2011-04-26
      4 2011-04-13
      4 2011-04-16
      4 2011-04-16
      5 2011-04-20
      5 2011-04-13
      5 2011-04-18
      6 2011-04-13
      7 2011-04-13
      8 2011-04-25
      8 2011-04-20
      9 2011-04-14
     10 2011-04-14
     11 2011-04-18
     12 2011-04-15
     13 2011-04-15
     14 2011-04-13

#TEST SET GENERATION:
library(lubridate)
MEMBER_ID <- c(1,2,3,3,4,4,4,5,5,5,6,7,8,8,9,10,11,12,13,14)
date <- ymd(c("2011-04-13 UTC", "2011-04-22 UTC", "2011-04-17 UTC", "2011-04-26 UTC", 
          "2011-04-13 UTC", "2011-04-16 UTC", "2011-04-16 UTC", "2011-04-20 UTC", 
          "2011-04-13 UTC", "2011-04-18 UTC", "2011-04-13 UTC", "2011-04-13 UTC", 
          "2011-04-25 UTC", "2011-04-20 UTC", "2011-04-14 UTC", "2011-04-14 UTC", 
          "2011-04-18 UTC", "2011-04-15 UTC", "2011-04-15 UTC", "2011-04-13 UTC"))
rm(repdata)
repdata <- data.frame(MEMBER_ID, date)
repdata

(Note that I found the code has a bug at i = 1. I am going to ignore it for now rather than add another if statement to the for loop.)
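For reference, a minimal sketch of where the i = 1 problem appears to come from, and one possible way around it without an extra if statement. The vector x is a hypothetical stand-in for repdata$date; the point is only how 1:(i - 1) behaves when i is 1.

```r
i <- 1
x <- c("a", "b", "c")   # hypothetical stand-in for repdata$date

# 1:(i - 1) is 1:0, which counts down (1, 0) instead of being empty.
countdown <- 1:(i - 1)

# Indexing with c(1, 0) silently drops the 0, so one element survives
# and the inner loop runs once with j = 1 (row 1 compared with itself).
inner_old <- seq_along(x[1:(i - 1)])

# seq_len(0) is an empty vector, so a for loop over it runs zero times.
inner_new <- seq_len(i - 1)

print(countdown)
print(inner_old)
print(length(inner_new))
```

Under that reading, writing the inner loop as for (j in seq_len(i - 1)) would skip the comparison entirely for the first row.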

Answer 1

You can try using ddply.

First, generate the dataset ordered by date and define a timeframe of 52 weeks.

#TEST SET GENERATION:
library(lubridate)
MEMBER_ID <- c(1,2,3,3,4,4,4,5,5,5,6,7,8,8,9,10,11,12,13,14)
date <- ymd(c("2011-04-13 UTC", "2011-04-22 UTC", "2011-04-17 UTC", "2011-04-26 UTC", 
          "2011-04-13 UTC", "2011-04-16 UTC", "2011-04-16 UTC", "2011-04-20 UTC", 
          "2011-04-13 UTC", "2011-04-18 UTC", "2011-04-13 UTC", "2011-04-13 UTC", 
          "2011-04-25 UTC", "2011-04-20 UTC", "2011-04-14 UTC", "2011-04-14 UTC", 
          "2011-04-18 UTC", "2011-04-15 UTC", "2011-04-15 UTC", "2011-04-13 UTC"))
rm(repdata)
repdata <- data.frame(MEMBER_ID, date)
repdata <- repdata[order(repdata$date),]
repdata

# define a timeframe of 52 weeks
timeframe <- as.difftime(52, units = "weeks")

Then adapt the following code:

library(plyr)

first.buyers <- ddply(repdata, .(MEMBER_ID),
                  function(x) x[c(TRUE, diff(x$date) > timeframe),])
first.buyers <- mutate(first.buyers, rpt52wk = "FIRST TIME BUYER")

final <- merge(repdata,first.buyers, all = TRUE)
final[is.na(final$rpt52wk),"rpt52wk"] <- "REPEAT BUYER"

We get the following result:

   MEMBER_ID       date          rpt52wk
1          1 2011-04-13 FIRST TIME BUYER
2          2 2011-04-22 FIRST TIME BUYER
3          3 2011-04-17 FIRST TIME BUYER
4          3 2011-04-26     REPEAT BUYER
5          4 2011-04-13 FIRST TIME BUYER
6          4 2011-04-16     REPEAT BUYER
7          4 2011-04-16     REPEAT BUYER
8          5 2011-04-13 FIRST TIME BUYER
9          5 2011-04-18     REPEAT BUYER
10         5 2011-04-20     REPEAT BUYER
11         6 2011-04-13 FIRST TIME BUYER
12         7 2011-04-13 FIRST TIME BUYER
13         8 2011-04-20 FIRST TIME BUYER
14         8 2011-04-25     REPEAT BUYER
15         9 2011-04-14 FIRST TIME BUYER
16        10 2011-04-14 FIRST TIME BUYER
17        11 2011-04-18 FIRST TIME BUYER
18        12 2011-04-15 FIRST TIME BUYER
19        13 2011-04-15 FIRST TIME BUYER
20        14 2011-04-13 FIRST TIME BUYER

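ddply splits the data frame by MEMBER_ID and applies the function to each subset. Each subset is a data frame with a fixed MEMBER_ID and ordered dates. The first element always corresponds to a first time buyer; for each subsequent element, you have to determine whether the time elapsed since the previous transaction is greater than your threshold (if so, the member can be considered a first time buyer again).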
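The same per-member gap rule can also be sketched in base R without the merge step. This is only an illustration under stated assumptions: the toy data is made up, the dates are assumed to be of class Date, and timeframe_days and gap_days are hypothetical names, not part of the code above.

```r
# Hypothetical toy data; column names follow the question.
repdata <- data.frame(
  MEMBER_ID = c(1, 1, 1, 2),
  date = as.Date(c("2011-04-13", "2011-05-01", "2012-06-01", "2011-04-14"))
)

# Order by member, then by date, so each gap is to the previous purchase.
repdata <- repdata[order(repdata$MEMBER_ID, repdata$date), ]

timeframe_days <- 52 * 7  # assumption: 52 weeks expressed in days

# Days since the same member's previous purchase; NA on a member's first row.
gap_days <- ave(as.numeric(repdata$date), repdata$MEMBER_ID,
                FUN = function(d) c(NA, diff(d)))

# No earlier purchase, or a gap above the threshold, counts as first time.
repdata$rpt52wk <- ifelse(is.na(gap_days) | gap_days > timeframe_days,
                          "FIRST TIME BUYER", "REPEAT BUYER")
print(repdata)
```

Checking only the gap to the previous purchase is enough here, because if any earlier purchase falls inside the timeframe then the most recent one does too.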

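In the code above, you should check that the time units are consistent when making the comparison diff(x$date) > timeframe (this depends on the format of your dates).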
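One way to remove any doubt about units is to convert both sides to the same explicit unit before comparing. A small sketch, assuming the dates are of class Date; the three dates and the names gaps, gaps_weeks and exceeds are made up for illustration.

```r
# Hypothetical purchase dates for a single member.
d <- as.Date(c("2011-04-13", "2011-04-26", "2012-06-01"))

gaps <- diff(d)                                  # difftime between purchases
timeframe <- as.difftime(52, units = "weeks")

print(units(gaps))        # unit chosen by diff()
print(units(timeframe))   # unit chosen explicitly above

# Convert both sides to plain numbers in weeks, then compare.
gaps_weeks <- as.numeric(gaps, units = "weeks")
exceeds <- gaps_weeks > as.numeric(timeframe, units = "weeks")
print(exceeds)
```

With the comparison done on plain numbers in a named unit, the result no longer depends on which unit diff() happens to pick for a given date class.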

Once you have found the first time buyers, I think the next steps are fairly clear.
