我有以下格式的交易数据。
transaction <- data.frame(account_id=c('ID001','ID001','ID002','ID002','ID001','ID002'), transaction_date=c('2017-01-02','2017-01-03','2017-01-03','2017-01-05','2017-01-06','2017-01-08'))
我想标记哪个帐户在2017年1月1日和2017年12月31日连续三个月处于非活动状态。这是我的R代码。
library(dplyr)
list_account_id <- transaction %>% distinct(account_id)
list_account_id <- list_account_id$account_id
churn_label <- data.frame("account_id" = c('ID000'), "churn_date" = c(as.Date('9999/99/99')), "label"=c(0))
for (int_account_id in list_account_id){
start_date <- as.Date('2017-01-01')
break_while <- FALSE
while(!break_while){
end_date <- start_date+90
int_transaction <- transaction %>% filter(account_id == int_account_id)
int_transaction %>% filter(as.Date(transaction_date) <= end_date, as.Date(transaction_date) >= start_date) %>% summarise(n=n())
sum_ntransaction <- int_transaction %>% filter(as.Date(transaction_date) <= end_date, as.Date(transaction_date) >= start_date) %>% summarise(n=n())
if(sum_ntransaction$n == 0){
churn_label_temp <- data.frame("account_id" = c(int_account_id), "churn_date" = c(start_date), "label"=c(1))
churn_label <- rbind(churn_label, churn_label_temp)
break_while <- TRUE
}
if(end_date == as.Date('2017-12-31')){
churn_label_temp <- data.frame("account_id" = c(int_account_id), "churn_date" = c('9999/99/99'), "label"=c(0))
churn_label <- rbind(churn_label, churn_label_temp)
break_while <- TRUE
}
start_date <- start_date+1
}
}
我的代码的输出是
但是,我的代码太慢了。还有另一种方法吗?
答案 0 :(得分:2)
还要考虑90天!= 3个月!对于几个月的计算,我更喜欢lubridate
包中的%m+%运算符。
transaction <- data.frame(account_id=c('ID001','ID001','ID002','ID002','ID001','ID002'), transaction_date=c('2017-01-02','2017-01-03','2017-01-03','2017-01-05','2017-01-06','2017-01-08'))
library(tidyverse)
library(lubridate)
transaction %>%
#set transaction date as dates
mutate( transaction_date = as.Date( transaction_date ) ) %>%
#group by account id
group_by( account_id ) %>%
#arrange on date
arrange( transaction_date ) %>%
#inactive more than 3 months? Check if the next transaction date is larger than the currect transaction date + 3 months
mutate( inactive_label = ifelse( transaction_date %m+% months(3) > lead( transaction_date ), 0, 1 ) ) %>%
#also check the first and last row of each group (first not after 2017-01-01 + 3 months, last not before 31-12-2017 - 3 months)
mutate( inactive_label = ifelse( row_number() == 1 & transaction_date > as.Date("2017-01-01") %m+% months(3), 1, inactive_label ) ) %>%
mutate( inactive_label = ifelse( row_number() == n() & transaction_date %m+% months(3) < as.Date("2017-12-31") , 1, inactive_label ) )
# # A tibble: 6 x 3
# # Groups: account_id [2]
# account_id transaction_date inactive_label
# <fct> <date> <dbl>
# 1 ID001 2017-01-02 0
# 2 ID001 2017-01-03 0
# 3 ID002 2017-01-03 0
# 4 ID002 2017-01-05 0
# 5 ID001 2017-01-06 1
# 6 ID002 2017-01-08 1
答案 1 :(得分:1)
根据准确的客户流失定义,您可能只对最大交易日期感兴趣,并检查此交易距离受抚养人期限结束是否超过3个月。例如,您可以使用以下代码:
#Set up you timeline
end_dependent <- as.Date("2017-31-12", format= "%Y-%d-%m")
library(lubridate)
churndate <- end_dependent %m-% months(3)
# transaction date in date format
transaction$transaction_date <- as.Date(transaction$transaction_date,format= "%Y-%d-%m")
# Aggregate per account taking the maximum transaction date
transaction_per_account <- aggregate(list(max_transaction_date = transaction$transaction_date), by=list(account_id =transaction$account_id), max)
# Calculate a binary dependent variable for churn and churn date
transaction_per_account$churn <- ifelse(transaction_per_account$max_transaction_date <= churndate, 1, 0 )
ifelse(transaction_per_account$churn == 1, transaction_per_account$churn_date <- transaction_per_account$max_transaction_date, transaction_per_account$churn_date <- as.Date("9999-01-01", format= "%Y-%d-%m") )
这给出了输出:
> transaction_per_account
account_id max_transaction_date churn churn_date
1 ID001 2017-06-01 1 2017-06-01
2 ID002 2017-08-01 1 2017-08-01