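I want to compute the distinct count of customers who purchased from the company between each SKU's first purchase date and last purchase date. This comes after I have already computed, in SQL, the distinct number of customers for each given SKU (and found its first and last purchase dates).
I have code that solves this successfully; however, it uses a for loop and takes far too long, because there are thousands of SKUs. Here is a short example of what my SKU table looks like: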
SKUID <- c('123', '456', '789')
NumberOfCustomers <- c(204543, 92703, 305727)
SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')
SKUCount <- data.frame(SKUID, NumberOfCustomers,
SKUFirstPurchase, SKULastPurchase)
colnames(SKUCount) <- c('SKU', 'NumberOfCustomers',
'FirstPurchase', 'LastPurchase')
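I then have another table, about 6 million rows long, holding the distinct combinations of sale date and CustomerID, which I call OrderTable. I cannot compute the distinct count for each day and add those counts together, because that would double count customers who purchased on different dates. I have to recompute the distinct count for every FirstPurchase / LastPurchase combination that appears in the SKUCount table. From there, I use the following code to count the distinct customers within a given date range: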
library(dplyr)
for (i in 1:nrow(SKUCount))
{
SKUCount[i, c('DateCustomers')] <-
sapply(OrderTable %>%
filter(Date >= SKUCount[i,'FirstPurchase'],
Date <= SKUCount[i,'LastPurchase']) %>%
select(CustomerID),
function(x) length(unique(x)))
}
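As I noted earlier, this code works, but it is very slow (about 0.5 seconds per row). Is there a faster way to compute the distinct counts, or a smarter solution to my problem?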
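For reference, the double-counting problem described above can be reproduced on a toy table. This is only a minimal sketch: `toy_orders` and its values are made up for illustration and are not taken from the real OrderTable.

```r
library(dplyr)

# Toy order table; the dates and customer IDs are invented for illustration.
toy_orders <- data.frame(
  Date = as.Date(c("2014-06-02", "2014-06-03", "2015-05-02")),
  CustomerID = c("121", "124", "121")
)

# Distinct customers per day, then summed: customer 121 is counted on two days.
daily <- toy_orders %>%
  group_by(Date) %>%
  summarise(n = n_distinct(CustomerID))
sum_of_daily <- sum(daily$n)

# Distinct customers over the whole range: each customer is counted once.
true_distinct <- n_distinct(toy_orders$CustomerID)

print(c(sum_of_daily = sum_of_daily, true_distinct = true_distinct))
```

The summed daily counts come out larger than the true distinct count, which is why the count has to be recomputed for each date range.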
Answer 0 (score: 0)
Try this:
library("dplyr")
#First creating the datasets including OrderTable (please correct me if I got it wrong!):
SKUID <- c('123', '456', '789')
NumberOfCustomers <- c(204543, 92703, 305727)
SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')
SKUCount <- data.frame(SKUID, NumberOfCustomers,
SKUFirstPurchase, SKULastPurchase)
colnames(SKUCount) <- c('SKU', 'NumberOfCustomers',
'FirstPurchase', 'LastPurchase')
OrderTable <- data.frame(Date=c('2014-06-02', '2014-08-02', '2015-02-03', '2017-05-13'
,'2015-05-02', '2014-06-03', '2016-07-13', '2017-09-30', '2018-07-01', '2019-01-09'),
CustomerID=c('121','212','3434','24232','121','124','212','131','412','3634'))
#converting the date columns (character or factor) to Date
SKUCount$FirstPurchase<-as.Date(SKUCount$FirstPurchase,format = "%Y-%m-%d")
SKUCount$LastPurchase<-as.Date(SKUCount$LastPurchase,format = "%Y-%m-%d")
OrderTable$Date<-as.Date(OrderTable$Date,format = "%Y-%m-%d")
#defining a function, named FUN, which restricts Date in OrderTable to the
#range between the two date arguments (FirstPurchase and LastPurchase) and
#returns the distinct count of CustomerID values in that range:
FUN <- function(FirstPurchase, LastPurchase) {
  Rtrn <- OrderTable %>%
    filter(Date >= FirstPurchase,
           Date <= LastPurchase) %>%
    summarize(n_distinct(CustomerID))
  #Rtrn is a 1 x 1 data frame; return its single value as a number
  as.numeric(Rtrn)
}
SKUCount %>%
rowwise() %>%
mutate(DateCustomers= FUN(FirstPurchase,LastPurchase))
# Source: local data frame [3 x 5]
# Groups: <by row>
#
# # A tibble: 3 x 5
# SKU NumberOfCustomers FirstPurchase LastPurchase DateCustomers
# <fct> <dbl> <date> <date> <dbl>
# 1 123 204543 2014-05-02 2017-09-30 6
# 2 456 92703 2014-02-03 2018-07-01 7
# 3 789 305727 2016-05-13 2019-01-09 5
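The `rowwise()` version above still filters OrderTable once per SKU. As a possible alternative, here is a minimal sketch that uses a data.table non-equi join instead, which is a different technique from the one in this answer. The names `orders_dt`, `sku_dt`, and `counts` are hypothetical, the data is the same toy data as above, and whether it is actually faster on 6 million rows would need to be measured.

```r
library(data.table)

# Same toy data as above, built as data.tables (names are hypothetical).
orders_dt <- data.table(
  Date = as.Date(c("2014-06-02", "2014-08-02", "2015-02-03", "2017-05-13",
                   "2015-05-02", "2014-06-03", "2016-07-13", "2017-09-30",
                   "2018-07-01", "2019-01-09")),
  CustomerID = c("121", "212", "3434", "24232", "121",
                 "124", "212", "131", "412", "3634")
)
sku_dt <- data.table(
  SKU = c("123", "456", "789"),
  FirstPurchase = as.Date(c("2014-05-02", "2014-02-03", "2016-05-13")),
  LastPurchase = as.Date(c("2017-09-30", "2018-07-01", "2019-01-09"))
)

# Non-equi join: for each row of sku_dt, match the orders whose Date falls in
# [FirstPurchase, LastPurchase]. by = .EACHI evaluates j once per sku_dt row,
# so the result has one row per SKU, in sku_dt order. na.rm = TRUE makes a SKU
# with no matching orders count as 0 rather than 1.
counts <- orders_dt[
  sku_dt,
  on = .(Date >= FirstPurchase, Date <= LastPurchase),
  .(DateCustomers = uniqueN(CustomerID, na.rm = TRUE)),
  by = .EACHI
]

# Attach the counts by position, since counts follows the row order of sku_dt.
sku_dt[, DateCustomers := counts$DateCustomers]
print(sku_dt)
```

On the toy data this should reproduce the DateCustomers column shown above. If many SKUs share the same FirstPurchase / LastPurchase pair, computing the count once per unique pair and joining it back would reduce the work further.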