I have a table of loan disbursements and customer repayments, which I have preprocessed to look like this:
customerID | balanceChange | trxDate | TYPE
242105 | 500 | 20170605 | loan
242105 | 1500 | 20170605 | loan
242105 | -1000 | 20170607 | payment
242111 | 500 | 20170605 | loan
242111 | -500 | 20170606 | payment
242111 | 500 | 20170607 | loan
242111 | -500 | 20170609 | payment
242151 | 500 | 20170605 | loan
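What I want to do is (1) count, for the loans issued on each day, how many have been repaid in full, and (2) work out how many days the customers needed to repay them.
The repayment rule is of course FIFO (first in, first out), so the earliest loan is repaid first.
For the example above, the solution is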
trxDate | nRepayments | timeGap(days)
20170605 | 2 | 1.5
20170606 | 0 | 0
20170607 | 1 | 2
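To explain why this is the solution: on 20170605, 4 loans were issued (2 to customerID 242105, and another 2 to 242111 and 242151), but only 2 of them were repaid (the 500 given to 242105 and the 500 given to 242111). timeGap is the average number of days each customer needed to repay (242105 repaid on 20170607, which is 2 days, while 242111 repaid on 20170606, which is 1 day), so (2 + 1) / 2 = 1.5.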
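As a quick check of the timeGap arithmetic for 20170605, here is a minimal sketch. It assumes trxDate is stored as a yyyymmdd string (an integer column would need as.character() first); the variable names are only illustrative.

```r
# Loans issued on 20170605 that were repaid: one for 242105, one for 242111
issued <- as.Date(c("20170605", "20170605"), format = "%Y%m%d")
repaid <- as.Date(c("20170607", "20170606"), format = "%Y%m%d")

# Date - Date gives a difftime in days; convert it to plain numbers
gap_days <- as.numeric(repaid - issued)
gap_days        # 2 1
mean(gap_days)  # 1.5
```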
I tried to compute nRepayments with the following R script (I figure that once I have this, timeGap should be a piece of cake).
library(dplyr)
library(data.table)

# Recoveries
data_loans_rec <- data_loans %>% arrange(customerID, trxDate) %>% as.data.table()
data_loans_rec[is.na(data_loans_rec)] <- 0
# Drop a leading payment (one with no loan before it), then re-index per customer
data_loans_rec <- data_loans_rec[, index := seq_len(.N), by = customerID][!(index == 1 & TYPE == "payment")][, index := seq_len(.N), by = customerID]
# Loans issued per day; n_loans_rec will hold the number of repaid loans per day
n_loans_given <- data_loans_rec[TYPE == "loan", .(nloans = .N), by = .(trxDate)][order(trxDate)]
n_loans_rec <- copy(n_loans_given)
n_loans_rec[, nloans := 0L]
unique_cust <- unique(data_loans_rec$customerID)

# Check repayment for every customer ================
for (i in seq_along(unique_cust)) {
  cur_cust <- unique_cust[i]
  # Plain vectors of this customer's loans, loan dates and payments
  list_loan <- data_loans_rec[customerID == cur_cust & TYPE == "loan", balanceChange]
  list_loan_time <- data_loans_rec[customerID == cur_cust & TYPE == "loan", trxDate]
  list_pay <- data_loans_rec[customerID == cur_cust & TYPE == "payment", balanceChange]
  sum_paid <- sum(abs(list_pay))  # 0 if there are no payments
  for (i_loantime in seq_along(list_loan_time)) {
    loan_curr <- list_loan[i_loantime]
    loan_left <- loan_curr - sum_paid
    if (loan_left <= 0) {
      # The remaining payments cover this loan: count it on its issue date
      n_loans_rec[trxDate == list_loan_time[i_loantime], nloans := nloans + 1L]
      sum_paid <- sum_paid - loan_curr
    } else {
      # FIFO: once one loan is not covered, no later loan is covered either
      break
    }
  }
}
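The idea is to build, for each customer, a list of loans, a list of loan dates and a list of payments. The best case is when the customer's total loan amount is equal to or less than (because of dirty data) the total amount paid, i.e. everything is repaid. The number of repayments then equals the number of loans issued to that customer. The average case is when the customer has made only partial payments. In that case I sum up the total amount paid and iterate over each loan given to the customer, accumulating the loan amounts as I go. Once the accumulated loan amount exceeds the amount paid, I know how many loans the customer's payments actually cover.
The problem is that I have millions of customers, and each of them has at least 5 loans and payments. Because I am using nested loops, the script takes hours to finish.
So I am asking here whether anyone has run into this problem and/or has a better, more efficient solution.
Thanks in advance!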
Answer 0 (score: 0)
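Your logic is quite complicated and I did not try to replicate it completely; my aim is only to give you some ideas on how to optimize it.
Also, as mentioned in the comments, you could try parallelizing, or use another programming language.
Anyway, since your data is already in a data.table, you can try to use as many operations on the whole table as possible, which will usually be faster than your big loop. Something like this.
I first compute the balance and the total of the payments made, per customer ID: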
data_loans_rec <- data_loans_rec[, balance := sum(balanceChange), by = customerID]
data_loans_rec <- data_loans_rec[, sumPayments := sum(balanceChange[TYPE == "payment"]), by = customerID]
With this, you already know that every customer with a balance of 0 has repaid everything:
data_loans_rec <- data_loans_rec[TYPE == "loan" & balance == 0, repaid := TRUE, by = list(customerID, index)]
If you have millions of customers, these operations will of course read a lot of data, but I would say data.table should handle them quickly.
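To see what these grouped assignments produce, here is a small self-contained sketch that rebuilds the example rows from the question and runs the steps above. Typing the table in by hand and storing trxDate as a number are assumptions made for illustration.

```r
library(data.table)

# Example rows copied from the question
data_loans_rec <- data.table(
  customerID    = c(242105, 242105, 242105, 242111, 242111, 242111, 242111, 242151),
  balanceChange = c(500, 1500, -1000, 500, -500, 500, -500, 500),
  trxDate       = c(20170605, 20170605, 20170607, 20170605,
                    20170606, 20170607, 20170609, 20170605),
  TYPE          = c("loan", "loan", "payment", "loan",
                    "payment", "loan", "payment", "loan")
)

# Grouped := adds the per-customer totals to every row, in place
data_loans_rec[, balance := sum(balanceChange), by = customerID]
data_loans_rec[, sumPayments := sum(balanceChange[TYPE == "payment"]), by = customerID]

# Loans of customers whose balance is 0 are all repaid; other rows stay NA
data_loans_rec[TYPE == "loan" & balance == 0, repaid := TRUE]

data_loans_rec[TYPE == "loan", .(customerID, trxDate, balance, sumPayments, repaid)]
```

Only the two loans of customer 242111 are flagged at this point. The by = list(customerID, index) grouping is left out in this sketch because assigning a constant does not need it.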
For the other customers, and only for the loan rows where you do not yet know whether they have been repaid, you can apply a function through data.table.
setRepaid <- function(balanceChange, sumPayments) {
  # note that here you get a vector for all the loans of a customer
  sumPay <- (-1) * sumPayments[1]
  if (sumPay == 0)
    return(rep(FALSE, length(balanceChange)))
  number_of_loans_paid <- 0
  for (i in seq_along(balanceChange)) {
    if (sum(balanceChange[1:i]) > sumPay)
      break
    number_of_loans_paid <- number_of_loans_paid + 1
  }
  return(c(rep(TRUE, number_of_loans_paid), rep(FALSE, length(balanceChange) - number_of_loans_paid)))
}
data_loans_rec <- data_loans_rec[TYPE == "loan" & is.na(repaid), repaid := setRepaid(balanceChange, sumPayments), by = list(customerID) ]
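If the loan amounts are all positive and the rows are in chronological order within each customer, the loop inside setRepaid can probably be replaced by a running total. The following is a sketch of that alternative under those assumptions, not the method used above; loans_dt is a hypothetical table name.

```r
library(data.table)

# Example rows from the question (dates omitted; rows are already in order)
loans_dt <- data.table(
  customerID    = c(242105, 242105, 242105, 242111, 242111, 242111, 242111, 242151),
  balanceChange = c(500, 1500, -1000, 500, -500, 500, -500, 500),
  TYPE          = c("loan", "loan", "payment", "loan",
                    "payment", "loan", "payment", "loan")
)

loans_dt[, sumPayments := sum(balanceChange[TYPE == "payment"]), by = customerID]

# FIFO: a loan is covered when the running total of loans up to and including
# it does not exceed the total paid (sumPayments is negative, hence the minus)
loans_dt[TYPE == "loan",
         repaid := cumsum(balanceChange) <= -sumPayments[1],
         by = customerID]

loans_dt[TYPE == "loan", repaid]  # TRUE FALSE TRUE TRUE FALSE
```

Because cumsum is vectorized, this avoids the repeated sum(balanceChange[1:i]) calls, and it handles customers without payments without a special case.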
With this you get the desired result, at least for your example.
customerID balanceChange trxDate TYPE index balance sumPayments repaid
1: 242105 500 20170605 loan 1 1000 -1000 TRUE
2: 242105 1500 20170605 loan 2 1000 -1000 FALSE
3: 242105 -1000 20170607 payment 3 1000 -1000 NA
4: 242111 500 20170605 loan 1 0 -1000 TRUE
5: 242111 -500 20170606 payment 2 0 -1000 NA
6: 242111 500 20170607 loan 3 0 -1000 TRUE
7: 242111 -500 20170609 payment 4 0 -1000 NA
8: 242151 500 20170605 loan 1 500 0 FALSE
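The advantages are: the final loop runs over far fewer customers, some things are already precomputed, and you rely on data.table to actually replace your loop. Hopefully this approach gives you an improvement. I think it is worth a try.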
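The repaid flag does not yet give the timeGap that the question asks for. One possible way to get both numbers without a per-customer loop is a rolling join on cumulative amounts: under FIFO, a loan is fully repaid by the first payment at which the customer's cumulative payments reach the cumulative loan amount. This is a sketch under assumptions, not a tested solution for the real data: the names dt, cumLoan, cumPaid, repaidDate and gapDays are made up, dates are assumed to be yyyymmdd strings, and payments are assumed to be strictly negative.

```r
library(data.table)

dt <- data.table(
  customerID    = c(242105, 242105, 242105, 242111, 242111, 242111, 242111, 242151),
  balanceChange = c(500, 1500, -1000, 500, -500, 500, -500, 500),
  trxDate       = c("20170605", "20170605", "20170607", "20170605",
                    "20170606", "20170607", "20170609", "20170605"),
  TYPE          = c("loan", "loan", "payment", "loan",
                    "payment", "loan", "payment", "loan")
)
dt[, date := as.Date(trxDate, format = "%Y%m%d")]
setorder(dt, customerID, date)

# Cumulative amount lent, per customer, in chronological order
loans <- dt[TYPE == "loan"]
loans[, cumLoan := cumsum(balanceChange), by = customerID]

# Cumulative amount paid, per customer, with the date it was reached
pays <- dt[TYPE == "payment", .(customerID, repaidDate = date, cumPaid = -balanceChange)]
pays[, cumPaid := cumsum(cumPaid), by = customerID]

# roll = -Inf picks, for each loan, the next payment row with cumPaid >= cumLoan;
# loans that are never fully covered get repaidDate = NA
matched <- pays[loans, on = .(customerID, cumPaid = cumLoan), roll = -Inf]
matched[, gapDays := as.numeric(repaidDate - date)]

result <- matched[, .(nLoans      = .N,
                      nRepayments = sum(!is.na(repaidDate)),
                      timeGap     = mean(gapDays, na.rm = TRUE)),
                  by = trxDate][order(trxDate)]
result
```

On the example this gives 2 repayments with a timeGap of 1.5 for 20170605 and 1 repayment with a timeGap of 2 for 20170607. Days on which no loan was issued (20170606 here) do not appear in the result, and a day with loans but no repayments would get a timeGap of NaN, so both cases would need extra handling.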