How can I implement a dynamic count in R without using a for loop?

Time: 2019-05-28 20:32:20

Tags: r

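I want to count the distinct customers who purchased from the company between the first and last purchase dates of each SKU. I have already computed, in SQL, the distinct number of customers for each SKU, along with each SKU's first and last purchase dates.

I have code that solves this problem; however, it uses a for loop and takes far too long, because there are thousands of SKUs. Here is a short example of what my SKU table looks like: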

SKUID <- c('123', '456', '789')
NumberOfCustomers <- c(204543, 92703, 305727)
SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')

SKUCount <- data.frame(SKUID, NumberOfCustomers, 
                       SKUFirstPurchase, SKULastPurchase)
colnames(SKUCount) <- c('SKU', 'NumberOfCustomers', 
                        'FirstPurchase', 'LastPurchase')

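I then have another table, which I call OrderTable, that is about 6 million rows long and is distinct by sale date and CustomerID. I cannot compute the distinct count for each day and add those counts together, because that would double count customers who purchased on different days. Instead, I have to recompute the distinct count for every FirstPurchase / LastPurchase combination that appears in the SKUCount table. From there, I use the following code to count the distinct customers within the given time frame: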

library(dplyr)

for (i in 1:nrow(SKUCount))
{
  SKUCount[i, c('DateCustomers')] <-
    sapply(OrderTable %>%
              filter(Date >= SKUCount[i,'FirstPurchase'],
                     Date <= SKUCount[i,'LastPurchase']) %>%
              select(CustomerID),
           function(x) length(unique(x)))
}

As I noted above, this code works, but it is very slow (about 0.5 seconds per row). Is there a faster way to compute the distinct counts, or a smarter approach to my problem?
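The double-counting concern can be seen on a tiny table. The following is a minimal sketch with made-up data (the `orders` data frame and its values are illustrative assumptions, not rows from OrderTable): summing per-day distinct counts overstates the number of distinct customers over the whole range.

```r
# Made-up orders: customer "A" buys on two different days.
orders <- data.frame(
  Date       = as.Date(c("2019-01-01", "2019-01-01", "2019-01-02")),
  CustomerID = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# Distinct customers per day, then summed: "A" is counted once per day.
daily        <- tapply(orders$CustomerID, format(orders$Date),
                       function(x) length(unique(x)))
sum_of_daily <- sum(daily)

# Distinct customers over the whole date range.
true_distinct <- length(unique(orders$CustomerID))

sum_of_daily   # 3
true_distinct  # 2
```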

1 Answer:

Answer 0: (score: 0)

Try this:

    library("dplyr")

    # First, create the datasets, including OrderTable
    # (please correct me if I got it wrong!):
    SKUID <- c('123', '456', '789')
    NumberOfCustomers <- c(204543, 92703, 305727)
    SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
    SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')

    SKUCount <- data.frame(SKUID, NumberOfCustomers,
                           SKUFirstPurchase, SKULastPurchase)
    colnames(SKUCount) <- c('SKU', 'NumberOfCustomers',
                            'FirstPurchase', 'LastPurchase')

    OrderTable <- data.frame(
      Date = c('2014-06-02', '2014-08-02', '2015-02-03', '2017-05-13',
               '2015-05-02', '2014-06-03', '2016-07-13', '2017-09-30',
               '2018-07-01', '2019-01-09'),
      CustomerID = c('121', '212', '3434', '24232', '121', '124', '212',
                     '131', '412', '3634'))

    # Convert the date columns (factor or character) to Date
    SKUCount$FirstPurchase <- as.Date(SKUCount$FirstPurchase, format = "%Y-%m-%d")
    SKUCount$LastPurchase <- as.Date(SKUCount$LastPurchase, format = "%Y-%m-%d")
    OrderTable$Date <- as.Date(OrderTable$Date, format = "%Y-%m-%d")

    # Define a function, FUN, that restricts OrderTable to the dates between
    # its two arguments (FirstPurchase and LastPurchase) and returns the
    # distinct count of CustomerIDs in that range:
    FUN <- function(FirstPurchase, LastPurchase) {
      Rtrn <- OrderTable %>%
        filter(Date >= FirstPurchase,
               Date <= LastPurchase) %>%
        summarize(n_distinct(CustomerID))
      as.numeric(Rtrn)
    }

Next, take the SKUCount dataset and create a variable named DateCustomers by applying the function FUN to each row:

    SKUCount %>%
      rowwise() %>%
      mutate(DateCustomers = FUN(FirstPurchase, LastPurchase))
    # Source: local data frame [3 x 5]
    # Groups: <by row>
    #
    # # A tibble: 3 x 5
    #   SKU   NumberOfCustomers FirstPurchase LastPurchase DateCustomers
    #   <fct>             <dbl> <date>        <date>               <dbl>
    # 1 123              204543 2014-05-02    2017-09-30               6
    # 2 456               92703 2014-02-03    2018-07-01               7
    # 3 789              305727 2016-05-13    2019-01-09               5
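Note that `rowwise()` still filters the whole of OrderTable once per SKU, so the work per row is similar to the original loop. If that is still too slow, one possible alternative is sketched below in base R. It is not the method used above and has not been benchmarked on the real data; the helper name `count_distinct_in_ranges` is hypothetical. The idea is to sort the orders by date once, record for every order the date of the same customer's previous order, and then count, inside each date range, only the orders whose previous order falls before the start of the range. It assumes dates are whole days.

```r
# Hypothetical helper, not part of the answer above.
count_distinct_in_ranges <- function(order_date, customer_id, first, last) {
  stopifnot(length(order_date) > 0,
            length(order_date) == length(customer_id))

  # Whole-day numbers keep the comparisons below cheap.
  d     <- as.numeric(as.Date(order_date))
  first <- as.numeric(as.Date(first))
  last  <- as.numeric(as.Date(last))
  id    <- match(customer_id, unique(customer_id))  # integer customer ids

  # Sort by customer, then date, and record each order's previous order date
  # for the same customer (-Inf if it is that customer's first order).
  o    <- order(id, d)
  d    <- d[o]
  id   <- id[o]
  n    <- length(d)
  prev <- c(-Inf, d[-n])
  prev[c(TRUE, id[-1] != id[-n])] <- -Inf

  # Re-sort by date so that every date range is a contiguous slice.
  o    <- order(d)
  d    <- d[o]
  prev <- prev[o]

  lo <- findInterval(first - 1, d) + 1  # first order on or after `first`
  hi <- findInterval(last, d)           # last order on or before `last`

  # Within a slice, each customer is counted exactly once: at the order
  # whose predecessor falls before the start of the range.
  mapply(function(l, h, a) if (h < l) 0L else sum(prev[l:h] < a),
         lo, hi, first)
}

# Same example data as above.
OrderTable <- data.frame(
  Date = c('2014-06-02', '2014-08-02', '2015-02-03', '2017-05-13',
           '2015-05-02', '2014-06-03', '2016-07-13', '2017-09-30',
           '2018-07-01', '2019-01-09'),
  CustomerID = c('121', '212', '3434', '24232', '121', '124', '212',
                 '131', '412', '3634'),
  stringsAsFactors = FALSE)

DateCustomers <- count_distinct_in_ranges(
  OrderTable$Date, OrderTable$CustomerID,
  first = c('2014-05-02', '2014-02-03', '2016-05-13'),
  last  = c('2017-09-30', '2018-07-01', '2019-01-09'))
DateCustomers
# 6 7 5
```

With the real tables, the call would take `SKUCount$FirstPurchase` and `SKUCount$LastPurchase` as `first` and `last`, and the result could be assigned to `SKUCount$DateCustomers`. If many SKUs share the same pair of dates, computing the count once per unique pair and joining it back would reduce the work further.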