Speeding up the calculation of time between transactions

Time: 2019-09-11 20:36:06

Tags: r

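I am developing a clustering model using an extension of RFM. One of the features I want to add is "regularity": the average time between transactions. I have a solution (below), but it is very slow and clunky.

The code builds a list of unique customer numbers, then loops over that list and subsets the data frame for each customer. The unique dates are sorted in descending order and passed to the days_calc function. That function takes the dates, copies the column, drops the first entry of the copy, and appends a null in the last position. It then subtracts column b from column a.

This produces the expected result, but it is very slow: about 3 hours for 30,000 customers. The final customer set is about 300,000, so I want to make it more practical.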

Example:

   CUST_CODE      SHOP_DATE           
   <fct>          <dttm>             
 1 CUST0000000031 2006-04-16 00:00:00
 2 CUST0000000068 2006-04-14 00:00:00
 3 CUST0000000068 2006-04-10 00:00:00
 4 CUST0000000131 2006-04-16 00:00:00
 5 CUST0000000164 2006-04-11 00:00:00
 6 CUST0000000180 2006-04-15 00:00:00
 7 CUST0000000180 2006-04-10 00:00:00
 8 CUST0000000324 2006-04-15 00:00:00
 9 CUST0000000324 2006-04-11 00:00:00
10 CUST0000000358 2006-04-14 00:00:00

Expected output:

$days.between
Time difference of NA secs

$days.between
Time differences in days
[1]  4 NA

$days.between
Time difference of NA secs

$days.between
Time difference of NA secs

$days.between
Time differences in days
[1]  5 NA

$days.between
Time differences in days
[1]  4 NA

$days.between
Time difference of NA secs

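Here the output is the difference between consecutive dates for each customer, or NA if the customer has only one transaction. Ideally I would like a list of integer vectors rather than difftime objects, but I don't know how to structure the data that way.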

# Function to calculate the cadence between each transaction date and the previous date for each customer
# takes an object with CUST_CODE and SHOP_DATE.
# Copies the list of dates removing the first entry and adding NULL as the last entry.
# This allows subtraction of columns.
# returns a vector of the days between the columns
days_calc = function(dates) {
    dates.list1 = dates[,"SHOP_DATE"]
    dates.list2 = dates.list1[-1,]
    dates.list2[nrow(dates.list2)+1,] = NULL

    df = data.frame(c(dates.list1, dates.list2))
    days.between = df %>%
        mutate(days.between = SHOP_DATE - SHOP_DATE.1) %>%
        select(days.between)
    return(as.vector(days.between))
}

# Prepares the data to go into days_calc function
# Slice the table into customer numbers and transaction dates
# Dates must be ordered in descending order to allow the days_calc function to work correctly.
dates = tbl[, c("CUST_CODE", "SHOP_DATE")] %>%
    group_by(CUST_CODE) %>%
    distinct(SHOP_DATE) %>%
    arrange(desc(SHOP_DATE))  %>%
    select(c(CUST_CODE,SHOP_DATE) )

# Generate a list of unique customer numbers to subset the customer table
customers = tbl[,"CUST_CODE"] %>%
    unique() %>%
    as.vector()

# Loop over the list of customers.
# Subset the overall table by each customer number
# Add the returned vector to a list of days_between vectors
days = c()
for( i in 1:length(customers)) {
    days = c(days, days_calc(dates %>% filter(CUST_CODE == customers[i])))
    if(i %% 50 == 0){
        print(paste(c(round((i / length(customers)*100), 2), "%"), collapse = ""))
  }
}

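1 answer:

Answer 0 (score: 0)

Here is a simple dplyr solution. Oddly enough, prepending an NA to the vector of date differences coerces it to numeric, which neatly takes care of the problem.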

library(dplyr)

result = df %>% 
  group_by(CUST_CODE) %>% 
  arrange(SHOP_DATE) %>%
  mutate(days_from_previous_shop = c(NA, diff(SHOP_DATE))) 

## full data frame result
result
# # A tibble: 10 x 3
# # Groups:   CUST_CODE [7]
#    CUST_CODE      SHOP_DATE           days_from_previous_shop
#    <fct>          <dttm>                                <dbl>
#  1 CUST0000000068 2006-04-10 00:00:00                      NA
#  2 CUST0000000180 2006-04-10 00:00:00                      NA
#  3 CUST0000000164 2006-04-11 00:00:00                      NA
#  4 CUST0000000324 2006-04-11 00:00:00                      NA
#  5 CUST0000000068 2006-04-14 00:00:00                       4
#  6 CUST0000000358 2006-04-14 00:00:00                      NA
#  7 CUST0000000180 2006-04-15 00:00:00                       5
#  8 CUST0000000324 2006-04-15 00:00:00                       4
#  9 CUST0000000031 2006-04-16 00:00:00                      NA
# 10 CUST0000000131 2006-04-16 00:00:00                      NA

## this seems nicer to me
filter(result, !is.na(days_from_previous_shop))
# # A tibble: 3 x 3
# # Groups:   CUST_CODE [3]
#   CUST_CODE      SHOP_DATE           days_from_previous_shop
#   <fct>          <dttm>                                <dbl>
# 1 CUST0000000068 2006-04-14 00:00:00                       4
# 2 CUST0000000180 2006-04-15 00:00:00                       5
# 3 CUST0000000324 2006-04-15 00:00:00                       4
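
One caveat with c(NA, diff(SHOP_DATE)): diff() on date-times picks its unit automatically (seconds, minutes, hours or days) from the size of the gaps, and c() then drops that unit. In the sample data every gap is a whole number of days, so the numbers come out as days, but a customer whose purchases are only hours apart would likely be reported in hours. A minimal sketch of one way to pin the unit, assuming dplyr is installed; the customer CUST0000000999 and the name result_days are made up for illustration:

```r
library(dplyr)

# Hypothetical sample: the second customer shops twice on the same day.
sample_df <- data.frame(
  CUST_CODE = c("CUST0000000068", "CUST0000000068",
                "CUST0000000999", "CUST0000000999"),
  SHOP_DATE = as.POSIXct(c("2006-04-14 00:00:00", "2006-04-10 00:00:00",
                           "2006-04-15 18:00:00", "2006-04-15 06:00:00"),
                         tz = "UTC")
)

result_days <- sample_df %>%
  arrange(CUST_CODE, SHOP_DATE) %>%
  group_by(CUST_CODE) %>%
  # difftime() with an explicit unit always returns days,
  # so every group is measured on the same scale.
  mutate(days_from_previous_shop = as.numeric(
    difftime(SHOP_DATE, dplyr::lag(SHOP_DATE), units = "days")
  )) %>%
  ungroup()

result_days$days_from_previous_shop
```

With this data the same-day customer gets a gap of 0.5 days rather than 12 (hours), and the first row of each customer is still NA.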

The output above uses this data:

df = read.table(text = "CUST_CODE      SHOP_DATE           
 1 CUST0000000031 '2006-04-16 00:00:00'
 2 CUST0000000068 '2006-04-14 00:00:00'
 3 CUST0000000068 '2006-04-10 00:00:00'
 4 CUST0000000131 '2006-04-16 00:00:00'
 5 CUST0000000164 '2006-04-11 00:00:00'
 6 CUST0000000180 '2006-04-15 00:00:00'
 7 CUST0000000180 '2006-04-10 00:00:00'
 8 CUST0000000324 '2006-04-15 00:00:00'
 9 CUST0000000324 '2006-04-11 00:00:00'
10 CUST0000000358 '2006-04-14 00:00:00'", header = TRUE)

df$SHOP_DATE = as.POSIXct(df$SHOP_DATE)
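
The question also asks for a list of integer vectors, one per customer, and ultimately for the average gap as a "regularity" feature. The following is a sketch of one possible base R route, not part of the original answer: split() partitions the dates by customer in a single pass, which avoids filtering the whole table once per customer. The names tbl_sample, gaps and regularity are illustrative, and the data is a hypothetical subset in the same shape as the question's table.

```r
# Hypothetical sample in the same shape as the question's data.
tbl_sample <- data.frame(
  CUST_CODE = c("CUST0000000031", "CUST0000000068", "CUST0000000068",
                "CUST0000000180", "CUST0000000180"),
  SHOP_DATE = as.POSIXct(c("2006-04-16", "2006-04-14", "2006-04-10",
                           "2006-04-15", "2006-04-10"), tz = "UTC")
)

# One Date vector per customer, built in a single pass over the table.
dates_by_customer <- split(as.Date(tbl_sample$SHOP_DATE), tbl_sample$CUST_CODE)

# Whole-day gaps between consecutive distinct shopping dates.
# Customers with a single date get NA, as in the question's expected output.
gaps <- lapply(dates_by_customer, function(d) {
  d <- sort(unique(d))
  if (length(d) < 2) NA_integer_ else as.integer(diff(d))
})

# "Regularity" feature: mean gap in days per customer
# (NA for customers with only one transaction date).
regularity <- vapply(gaps, mean, numeric(1))

gaps
regularity
```

Here gaps is a named list of integer vectors and regularity is a named numeric vector, so either can be joined back to the customer table by CUST_CODE.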