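Change a data frame column value in a row based on other columns and their relationship to other rows in the data frame

Time: 2014-11-28 20:53:38

Tags: r dataframe

Using R 3.1.1.

I have a data set containing transaction data. Every customer has purchased at least twice (I have already subset the original data). What I want to do is label each transaction as either a "FIRST TIME BUYER" transaction or a "REPEAT BUYER" transaction. The catch is that I want to define a repeat-buyer transaction as one that falls within a specific time frame of a previous transaction, so it is not as simple as labeling each customer's first transaction "first" and the rest "repeat". If a customer has not purchased for more than 1 year (52.25 weeks), I want their next purchase to count as a first-time purchase again!

The best approach I have come up with is, I think, very inefficient (full disclosure: it is still running, so it may also be wrong to boot). I am using nested for loops... :(

Any suggestions on how to do this more efficiently? Thanks in advance for your help and advice! The code is commented throughout, so I'll let it speak for itself, but let me know if anything is unclear!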



#let's ensure the repdata is ordered by date first
repdata <- repdata[order(repdata$date), ]
  
#now, we loop through repdata and decide whether purchase 
#is a first time or repeat buyer

#setting time frame to 1 year (52.25 weeks as we use week as units below)
timeframe = 52.25
#add new column to repdata that we will use below
repdata$rpt52wk <- ""

#for each row in repdata, do the following
for(i in seq_along(repdata$date)) 
{
  #assume that this is a first purchase; set rpt52wk var for [i] to "FIRST TIME BUYER"
  repdata$rpt52wk[i] = "FIRST TIME BUYER"
  
  #look at all previous transactions 
  #we can ignore higher indexed transactions (we sorted the data, ascending by date)
  for (j in seq_along(repdata$date[1:(i-1)]))
  {
    #if a transaction is found in which the same member bought within the timeframe
    if (repdata$MEMBER_ID[i] == repdata$MEMBER_ID[j] &&
        difftime(repdata$date[i], repdata$date[j], units = "weeks") < timeframe)
    {
      #then this is a repeat buyer; set rpt var for [i] appropriately
      repdata$rpt52wk[i]="REPEAT BUYER"
    }
  }
}

Adding test data that fails, at least when I run it with the two solutions provided so far.

MEMBER_ID       date
      1 2011-04-13
      2 2011-04-22
      3 2011-04-17
      3 2011-04-26
      4 2011-04-13
      4 2011-04-16
      4 2011-04-16
      5 2011-04-20
      5 2011-04-13
      5 2011-04-18
      6 2011-04-13
      7 2011-04-13
      8 2011-04-25
      8 2011-04-20
      9 2011-04-14
     10 2011-04-14
     11 2011-04-18
     12 2011-04-15
     13 2011-04-15
     14 2011-04-13

#TEST SET GENERATION:
library(lubridate)
MEMBER_ID <- c(1,2,3,3,4,4,4,5,5,5,6,7,8,8,9,10,11,12,13,14)
date <- ymd(c("2011-04-13 UTC", "2011-04-22 UTC", "2011-04-17 UTC", "2011-04-26 UTC", 
          "2011-04-13 UTC", "2011-04-16 UTC", "2011-04-16 UTC", "2011-04-20 UTC", 
          "2011-04-13 UTC", "2011-04-18 UTC", "2011-04-13 UTC", "2011-04-13 UTC", 
          "2011-04-25 UTC", "2011-04-20 UTC", "2011-04-14 UTC", "2011-04-14 UTC", 
          "2011-04-18 UTC", "2011-04-15 UTC", "2011-04-15 UTC", "2011-04-13 UTC"))
rm(repdata)
repdata <- data.frame(MEMBER_ID, date)
repdata

(Note that I found the code has a bug for i = 1. I am going to ignore it for now rather than add another if statement inside the for loop.)
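The i = 1 problem most likely comes from the range `1:(i-1)`: for the first row it evaluates to `c(1, 0)` rather than an empty range, so row 1 ends up compared with itself. A minimal sketch of one way around it without an extra `if`, assuming the inner loop is switched to `seq_len(i - 1)` (the variable names below are illustrative only, not part of the original code):

```r
# Illustration only: how the inner loop's index range behaves for the first row.
i <- 1
buggy_range <- 1:(i - 1)       # counts down: 1, 0
safe_range  <- seq_len(i - 1)  # empty: integer(0)

visits <- 0
for (j in safe_range) visits <- visits + 1  # body never runs when i is 1

# For later rows both forms produce the same indices
i <- 4
same_later <- identical(1:(i - 1), seq_len(i - 1))
```

With `for (j in seq_len(i - 1))` the first transaction keeps its default "FIRST TIME BUYER" label, and later rows are unaffected.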

1 Answer:

Answer 0 (score: 0)

You can try using ddply.

First, generate the data set ordered by date, with a time frame of 52 weeks.

#TEST SET GENERATION:
library(lubridate)
MEMBER_ID <- c(1,2,3,3,4,4,4,5,5,5,6,7,8,8,9,10,11,12,13,14)
date <- ymd(c("2011-04-13 UTC", "2011-04-22 UTC", "2011-04-17 UTC", "2011-04-26 UTC", 
          "2011-04-13 UTC", "2011-04-16 UTC", "2011-04-16 UTC", "2011-04-20 UTC", 
          "2011-04-13 UTC", "2011-04-18 UTC", "2011-04-13 UTC", "2011-04-13 UTC", 
          "2011-04-25 UTC", "2011-04-20 UTC", "2011-04-14 UTC", "2011-04-14 UTC", 
          "2011-04-18 UTC", "2011-04-15 UTC", "2011-04-15 UTC", "2011-04-13 UTC"))
rm(repdata)
repdata <- data.frame(MEMBER_ID, date)
repdata <- repdata[order(repdata$date),]
repdata

# define a timeframe of 52 weeks
timeframe <- as.difftime(52, units = "weeks")

Then adapt the following code:

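library(plyr)

# Per member, keep the first purchase and any purchase that follows
# a gap longer than the timeframe (repdata must be sorted by date)
first.buyers <- ddply(repdata, .(MEMBER_ID),
                  function(x) x[c(TRUE, diff(x$date) > timeframe),])
first.buyers <- mutate(first.buyers, rpt52wk = "FIRST TIME BUYER")

# Rows of repdata without a match in first.buyers get NA, then "REPEAT BUYER"
final <- merge(repdata,first.buyers, all = TRUE)
final[is.na(final$rpt52wk),"rpt52wk"] <- "REPEAT BUYER"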

We get the following result:

   MEMBER_ID       date          rpt52wk
1          1 2011-04-13 FIRST TIME BUYER
2          2 2011-04-22 FIRST TIME BUYER
3          3 2011-04-17 FIRST TIME BUYER
4          3 2011-04-26     REPEAT BUYER
5          4 2011-04-13 FIRST TIME BUYER
6          4 2011-04-16     REPEAT BUYER
7          4 2011-04-16     REPEAT BUYER
8          5 2011-04-13 FIRST TIME BUYER
9          5 2011-04-18     REPEAT BUYER
10         5 2011-04-20     REPEAT BUYER
11         6 2011-04-13 FIRST TIME BUYER
12         7 2011-04-13 FIRST TIME BUYER
13         8 2011-04-20 FIRST TIME BUYER
14         8 2011-04-25     REPEAT BUYER
15         9 2011-04-14 FIRST TIME BUYER
16        10 2011-04-14 FIRST TIME BUYER
17        11 2011-04-18 FIRST TIME BUYER
18        12 2011-04-15 FIRST TIME BUYER
19        13 2011-04-15 FIRST TIME BUYER
20        14 2011-04-13 FIRST TIME BUYER

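ddply splits the data frame by MEMBER_ID and applies the function to each subset. Each subset is a data frame with a fixed MEMBER_ID and ordered dates. The first element always corresponds to a first-time buyer; for each subsequent element, you have to determine whether the time elapsed since the previous transaction is greater than your threshold (if so, the member can be treated as a first-time buyer again).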
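The same per-member gap check can also be sketched in base R with `ave()`, which avoids the merge step. This is only an illustration under stated assumptions: the toy data, the `timeframe_days` and `gap_days` names, and the use of `Date` values (so that gaps are in days) are mine, not part of the original data set.

```r
# Toy data (hypothetical): member 5 comes back after more than a year
repdata <- data.frame(
  MEMBER_ID = c(3, 3, 5, 5, 5),
  date = as.Date(c("2011-04-17", "2011-04-26",
                   "2011-04-20", "2011-04-13", "2012-06-01"))
)

# Sort by member, then by date, so each gap refers to the previous purchase
repdata <- repdata[order(repdata$MEMBER_ID, repdata$date), ]

# 52 weeks expressed in days, because as.numeric() on a Date counts days
timeframe_days <- 52 * 7

# Days since the member's previous purchase; Inf marks the first purchase
gap_days <- ave(as.numeric(repdata$date), repdata$MEMBER_ID,
                FUN = function(d) c(Inf, diff(d)))

repdata$rpt52wk <- ifelse(gap_days > timeframe_days,
                          "FIRST TIME BUYER", "REPEAT BUYER")
```

Using `Inf` as the gap for the first purchase plays the role of the leading `TRUE` in the ddply version: a member's first row always exceeds the threshold.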

In the code above, you should check that the time units are consistent in the comparison diff(x$date) > timeframe (this depends on your date format).
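One way to remove any doubt about units is to convert both sides to plain numbers in the same unit before comparing. A small sketch, assuming `Date` values (the dates and the `gaps`, `gaps_weeks`, `limit_weeks` names are made up for illustration):

```r
# Hypothetical purchase dates for a single member
d <- as.Date(c("2011-04-13", "2011-04-20", "2012-06-01"))

gaps <- diff(d)             # difftime between consecutive purchases
gap_units <- units(gaps)    # "days" for Date input

timeframe <- as.difftime(52, units = "weeks")

# Express both sides in weeks explicitly, then compare plain numbers
gaps_weeks  <- as.numeric(gaps, units = "weeks")
limit_weeks <- as.numeric(timeframe, units = "weeks")
is_first_again <- gaps_weeks > limit_weeks
```

Here the 7-day gap stays under the threshold, while the gap of more than a year exceeds it.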

Once you have found the first-time buyers, I think the next steps are fairly clear.