Question

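I have a very large dataset of customers, with the date (year) of each purchase. I would like R to give me:

the number of new customers per year, and
the percentage of each year's customers who also purchased in the previous year (n-1).

My data look like this: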

customer_id     year    
12001           2007 
12001           2008 
12001           2009
12002           2006
12002           2007
12003           2005
...             ...

Each customer makes various purchases over time.

The output I want looks like this:

# Table1
year    no. of new customers
2005          34
2006          25
2007          17
...          ...

Table 1 reports the number of unique new customers per year; and

# Table2
year    % of customers that also purchased at (year-1)
2005       25%
2006       17%
...        ...

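Table 2 shows that "of all customers recorded in 2005, 25% were also recorded in 2004; of all customers recorded in 2006, 17% were also recorded in 2005; etc."

I know the first part has been partially answered, but not for R. I could not find a similar question elsewhere.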

Answer 1

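Unless I have misunderstood something, the following may help (it assumes at most one row per customer per year, as in the sample data):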

tab = table(DF)
tab
#           year
#customer_id 2005 2006 2007 2008 2009 2010
#      12001    0    0    1    1    1    0
#      12002    0    1    1    0    0    0
#      12003    1    0    0    0    0    0
#      12004    1    0    1    0    0    0
#      12006    0    0    0    1    0    0
#      12007    0    0    0    1    1    0
#      12008    0    0    0    0    0    1

#new customers per year
as.data.frame(table(factor(colnames(tab)[max.col(tab, "first")], colnames(tab))))
#  Var1 Freq
#1 2005    2
#2 2006    1
#3 2007    1
#4 2008    2
#5 2009    0
#6 2010    1

#pct
as.data.frame(as.table((colSums((tab[, -1] == tab[, -ncol(tab)]) * (tab[, -1] == 1)) / colSums(tab[, -1])) * 100))
#  Var1      Freq
#1 2006   0.00000
#2 2007  33.33333
#3 2008  33.33333
#4 2009 100.00000
#5 2010   0.00000

where "DF" is:

DF = structure(list(customer_id = c(12001L, 12001L, 12001L, 12002L, 
12002L, 12003L, 12004L, 12004L, 12006L, 12007L, 12007L, 12008L
), year = c(2007L, 2008L, 2009L, 2006L, 2007L, 2005L, 2005L, 
2007L, 2008L, 2008L, 2009L, 2010L)), .Names = c("customer_id", 
"year"), class = "data.frame", row.names = c(NA, -12L))

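The customer-by-year table above holds one cell per customer-year pair and relies on each pair appearing at most once. For a very large dataset, or one with repeated purchases within a year, a long-format variant may be easier to work with. The following is a minimal sketch in base R, not the method used in this answer. The names `first_year`, `prev`, `both`, `active`, `retained` and `pct_retained` are illustrative, and the extra duplicated row is an assumption added to show that repeats are handled.

```r
# Same values as DF above, plus one repeated purchase (12001 in 2008),
# added here as an assumption to show that duplicates are handled
DF <- data.frame(
  customer_id = c(12001L, 12001L, 12001L, 12002L, 12002L, 12003L,
                  12004L, 12004L, 12006L, 12007L, 12007L, 12008L, 12001L),
  year = c(2007L, 2008L, 2009L, 2006L, 2007L, 2005L,
           2005L, 2007L, 2008L, 2008L, 2009L, 2010L, 2008L)
)

# Keep one row per customer-year so repeat purchases are not double counted
DF <- unique(DF)
years <- sort(unique(DF$year))

# Table 1: a customer is "new" in the first year they appear
first_year <- tapply(DF$year, DF$customer_id, min)
new_per_year <- table(factor(first_year, levels = years))

# Table 2: shift every purchase forward by one year; an inner join with the
# original rows keeps the customer-years that also had a purchase the year before
prev <- transform(DF, year = year + 1L)
both <- merge(DF, prev, by = c("customer_id", "year"))
active <- table(factor(DF$year, levels = years))
retained <- table(factor(both$year, levels = years))

# The earliest year is 0 by construction, because no earlier data exist
pct_retained <- 100 * as.vector(retained) / as.vector(active)
names(pct_retained) <- years

new_per_year
round(pct_retained, 1)
```

On this data the counts and percentages agree with the output shown above, with an extra 0 for 2005 because nothing is recorded for 2004.
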
Answer 2

Generate some sample data:

set.seed(31)
nSamples <- 5000
df <- data.frame(id = sample(12001:12100, nSamples, replace = TRUE),
                 year = sample(2001:2014, nSamples, replace = TRUE))

You can use a table to count the purchases each customer made in each year:

t_purchasePerYear<-table(df$year,df$id)

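Then you can get the year-over-year change in the number of active customers. Note that this is a net change (customers gained minus customers lost), not a count of first-time customers: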

nCustPerYear <- apply(t_purchasePerYear,1,function(x){sum(x>0)})
nCustPerYear
nYear = length(nCustPerYear)
nNewCustPerYear <- nCustPerYear[2:nYear] - nCustPerYear[1:(nYear-1)]
nNewCustPerYear
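
Because the difference above is a net change, it can be negative and does not count first-time customers. A minimal sketch of counting first-time customers from the same kind of year-by-customer table follows. The smaller sample size (300 rows) and the names `bought`, `first_idx` and `nFirstTime` are illustrative assumptions, chosen so that first purchases are spread over several years.

```r
set.seed(31)
# Smaller sample than above (an assumption for illustration) so that not
# every customer already appears in the first year
df <- data.frame(id = sample(12001:12100, 300, replace = TRUE),
                 year = sample(2001:2014, 300, replace = TRUE))

# Logical matrix: rows are years, columns are customers, TRUE = bought that year
bought <- unclass(table(df$year, df$id)) > 0

# Row index of each customer's first purchase year
first_idx <- apply(bought, 2, function(x) which(x)[1])

# Count customers by first purchase year, keeping years with no new customers
nFirstTime <- table(factor(rownames(bought)[first_idx],
                           levels = rownames(bought)))
nFirstTime
```

Each customer is counted exactly once, so the counts sum to the number of distinct customers.
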

Build a second table that flags the customers who bought both this year and the previous year:

t_didBuyThisYearAndLast <- t_purchasePerYear[2:nYear,]>0 & t_purchasePerYear[1:(nYear-1),]>0

Now get the number of customers who bought both this year and last year:

nBuyThisYearAndLast <- apply(t_didBuyThisYearAndLast,1,function(x){sum(x)})
nBuyThisYearAndLast

Divide by the number of customers in each year to get the percentage:

pcntBuyThisYearAndLast <- nBuyThisYearAndLast / nCustPerYear[2:nYear] *100
pcntBuyThisYearAndLast
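
As a sanity check, the same steps can be run on the small sample data from Answer 1. This is a sketch under one stated assumption: comparing adjacent rows of the table only means "this year versus last year" when no year is missing from the data, so the code checks for that first. The names `bought`, `active`, `both` and `pct` are illustrative.

```r
# Sample data copied from Answer 1
DF <- data.frame(
  customer_id = c(12001L, 12001L, 12001L, 12002L, 12002L, 12003L,
                  12004L, 12004L, 12006L, 12007L, 12007L, 12008L),
  year = c(2007L, 2008L, 2009L, 2006L, 2007L, 2005L,
           2005L, 2007L, 2008L, 2008L, 2009L, 2010L)
)

# Adjacent rows are only consecutive years if no year is missing
years <- sort(unique(DF$year))
stopifnot(all(diff(years) == 1))

# Logical matrix: rows are years, columns are customers
bought <- unclass(table(DF$year, DF$customer_id)) > 0
n <- nrow(bought)

# Customers active in each year, and those active in both a year and the one before
active <- rowSums(bought)
both <- rowSums(bought[2:n, , drop = FALSE] & bought[1:(n - 1), , drop = FALSE])

pct <- 100 * both / active[2:n]
round(pct, 1)
```

For 2006 to 2010 this gives 0, 33.3, 33.3, 100 and 0, matching the percentages in Answer 1.
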

R - Calculate the number of new customers per year and the percentage of customers who also purchased in the previous year
