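I am developing a clustering model using an extended version of RFM. One of the features I want to add is "regularity": the average time between transactions. I have a solution (below), but it is very slow and clunky.
The code gets a list of unique customer numbers and loops over it, subsetting the data frame for each customer. The unique dates are sorted in descending order and then passed to the days_calc function. This function takes the dates, copies the column, drops the first entry of the copy, and appends a null in the last position. It then subtracts column b from column a.
This produces the expected result, but it is very slow: roughly 3 hours for 30,000 customers. The final customer set is about 300,000, so I would like to make it more practical.
Example: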
CUST_CODE SHOP_DATE
<fct> <dttm>
1 CUST0000000031 2006-04-16 00:00:00
2 CUST0000000068 2006-04-14 00:00:00
3 CUST0000000068 2006-04-10 00:00:00
4 CUST0000000131 2006-04-16 00:00:00
5 CUST0000000164 2006-04-11 00:00:00
6 CUST0000000180 2006-04-15 00:00:00
7 CUST0000000180 2006-04-10 00:00:00
8 CUST0000000324 2006-04-15 00:00:00
9 CUST0000000324 2006-04-11 00:00:00
10 CUST0000000358 2006-04-14 00:00:00
Expected output:
$days.between
Time difference of NA secs
$days.between
Time differences in days
[1] 4 NA
$days.between
Time difference of NA secs
$days.between
Time difference of NA secs
$days.between
Time differences in days
[1] 5 NA
$days.between
Time differences in days
[1] 4 NA
$days.between
Time difference of NA secs
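Here the output is the differences between dates for each customer; if a customer has only one transaction, it is NA. Ideally I would like a list of integer vectors rather than difftime objects, but I don't know how to structure the data that way.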
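One possible way to get a list of integer vectors is sketched here in base R. It assumes the gaps should be whole days and rebuilds a few rows of the sample data in a table called tbl; the name gaps_by_customer is made up for illustration. Unlike the expected output above, it returns a single NA for one-time customers and does not append a trailing NA for the others.

```r
# A few rows copied from the sample data above.
tbl <- data.frame(
  CUST_CODE = c("CUST0000000031", "CUST0000000068", "CUST0000000068",
                "CUST0000000180", "CUST0000000180"),
  SHOP_DATE = as.POSIXct(c("2006-04-16", "2006-04-14", "2006-04-10",
                           "2006-04-15", "2006-04-10"), tz = "UTC")
)

# Split the dates by customer, then take the gaps between consecutive
# unique dates, expressed as whole days.
gaps_by_customer <- lapply(
  split(tbl$SHOP_DATE, tbl$CUST_CODE),
  function(d) {
    d <- sort(unique(d))
    if (length(d) < 2) return(NA_integer_)  # only one transaction
    as.integer(round(as.numeric(diff(d), units = "days")))
  }
)

str(gaps_by_customer)
```

Because the work is done once per group rather than by filtering the whole table for every customer, this avoids the repeated subsetting in the loop.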
# Function to calculate the cadence between each transaction date and the previous date for each customer
# takes an object with CUST_CODE and SHOP_DATE.
# Copies the list of dates removing the first entry and adding NULL as the last entry.
# This allows subtraction of columns.
# returns a vector of the days between the columns
days_calc = function(dates) {
  dates.list1 = dates[,"SHOP_DATE"]
  dates.list2 = dates.list1[-1,]
  dates.list2[nrow(dates.list2)+1,] = NULL
  df = data.frame(c(dates.list1, dates.list2))
  days.between = df %>%
    mutate(days.between = SHOP_DATE - SHOP_DATE.1) %>%
    select(days.between)
  return(as.vector(days.between))
}
# Prepares the data to go into days_calc function
# Slice the table into customer numbers and transaction dates
# Dates must be ordered in descending order to allow the days_calc function to work correctly.
dates = tbl[, c("CUST_CODE", "SHOP_DATE")] %>%
  group_by(CUST_CODE) %>%
  distinct(SHOP_DATE) %>%
  arrange(desc(SHOP_DATE)) %>%
  select(c(CUST_CODE, SHOP_DATE))
# Generate a list of unique customer numbers to subset the customer table
customers = tbl[,"CUST_CODE"] %>%
  unique() %>%
  as.vector()
# Loop over the list of customers.
# Subset the overall table by each customer number
# Add the returned vector to a list of days_between vectors
days = c()
for (i in 1:length(customers)) {
  days = c(days, days_calc(dates %>% filter(CUST_CODE == customers[i])))
  if (i %% 50 == 0) {
    print(paste(c(round((i / length(customers)*100), 2), "%"), collapse = ""))
  }
}
Answer 0 (score: 0)
Here's a simple dplyr solution. Oddly, prepending an NA to the vector of date differences coerces it to numeric, which neatly solves that problem.
result = df %>%
  group_by(CUST_CODE) %>%
  arrange(SHOP_DATE) %>%
  mutate(days_from_previous_shop = c(NA, diff(SHOP_DATE)))
## full data frame result
result
# # A tibble: 10 x 3
# # Groups: CUST_CODE [7]
# CUST_CODE SHOP_DATE days_from_previous_shop
# <fct> <dttm> <dbl>
# 1 CUST0000000068 2006-04-10 00:00:00 NA
# 2 CUST0000000180 2006-04-10 00:00:00 NA
# 3 CUST0000000164 2006-04-11 00:00:00 NA
# 4 CUST0000000324 2006-04-11 00:00:00 NA
# 5 CUST0000000068 2006-04-14 00:00:00 4
# 6 CUST0000000358 2006-04-14 00:00:00 NA
# 7 CUST0000000180 2006-04-15 00:00:00 5
# 8 CUST0000000324 2006-04-15 00:00:00 4
# 9 CUST0000000031 2006-04-16 00:00:00 NA
# 10 CUST0000000131 2006-04-16 00:00:00 NA
## this seems nicer to me
filter(result, !is.na(days_from_previous_shop))
# # A tibble: 3 x 3
# # Groups: CUST_CODE [3]
# CUST_CODE SHOP_DATE days_from_previous_shop
# <fct> <dttm> <dbl>
# 1 CUST0000000068 2006-04-14 00:00:00 4
# 2 CUST0000000180 2006-04-15 00:00:00 5
# 3 CUST0000000324 2006-04-15 00:00:00 4
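The question's actual goal is one "regularity" value per customer, the average time between transactions. The sketch below shows one way to collapse the per-row gaps into that feature. It assumes dplyr is installed, that gaps are measured in days, and that customers with a single transaction should get NA; the names regularity, gap_days and mean_gap_days are invented for illustration.

```r
library(dplyr)

# The sample data from the question, with an explicit time zone.
df <- data.frame(
  CUST_CODE = c("CUST0000000031", "CUST0000000068", "CUST0000000068",
                "CUST0000000131", "CUST0000000164", "CUST0000000180",
                "CUST0000000180", "CUST0000000324", "CUST0000000324",
                "CUST0000000358"),
  SHOP_DATE = as.POSIXct(c("2006-04-16", "2006-04-14", "2006-04-10",
                           "2006-04-16", "2006-04-11", "2006-04-15",
                           "2006-04-10", "2006-04-15", "2006-04-11",
                           "2006-04-14"), tz = "UTC")
)

regularity <- df %>%
  distinct(CUST_CODE, SHOP_DATE) %>%   # one row per customer and date
  arrange(CUST_CODE, SHOP_DATE) %>%
  group_by(CUST_CODE) %>%
  # Gap to the previous date within each customer, always in days.
  mutate(gap_days = as.numeric(difftime(SHOP_DATE, lag(SHOP_DATE),
                                        units = "days"))) %>%
  # Average the gaps; customers with no gap at all get NA instead of NaN.
  summarise(mean_gap_days = if (all(is.na(gap_days))) NA_real_
                            else mean(gap_days, na.rm = TRUE))

regularity
```

The result has one row per customer, so it can be joined back onto the other RFM features by CUST_CODE.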
Using this data:
df = read.table(text = "CUST_CODE SHOP_DATE
1 CUST0000000031 '2006-04-16 00:00:00'
2 CUST0000000068 '2006-04-14 00:00:00'
3 CUST0000000068 '2006-04-10 00:00:00'
4 CUST0000000131 '2006-04-16 00:00:00'
5 CUST0000000164 '2006-04-11 00:00:00'
6 CUST0000000180 '2006-04-15 00:00:00'
7 CUST0000000180 '2006-04-10 00:00:00'
8 CUST0000000324 '2006-04-15 00:00:00'
9 CUST0000000324 '2006-04-11 00:00:00'
10 CUST0000000358 '2006-04-14 00:00:00'", header = T)
df$SHOP_DATE = as.POSIXct(df$SHOP_DATE)
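A note on the c(NA, diff(SHOP_DATE)) trick used above: diff() on date-times returns a difftime whose unit is picked automatically, and c() with a plain NA in front drops the class, leaving a bare number. The small base R demonstration below illustrates why it may be safer to name the unit explicitly when some gaps could be shorter than a day. The variable names are illustrative only.

```r
# Two shopping dates four days apart (taken from the sample data).
dates <- as.POSIXct(c("2006-04-10", "2006-04-14"), tz = "UTC")
gap <- diff(dates)

class(gap)         # "difftime"
units(gap)         # "days", chosen automatically from the size of the gap
class(c(NA, gap))  # "numeric": the leading NA makes c() drop the class

# With a gap under a day, diff() picks hours, so the bare number
# would silently change meaning.
same_day <- as.POSIXct(c("2006-04-10 09:00:00", "2006-04-10 15:00:00"),
                       tz = "UTC")
units(diff(same_day))  # "hours"

# Naming the unit keeps the result in days regardless of the gap size.
gap_days <- c(NA, as.numeric(diff(dates), units = "days"))
same_day_gap_days <- as.numeric(diff(same_day), units = "days")
```

For data like the sample, where every timestamp is at midnight and gaps are whole days, both approaches give the same numbers.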