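I have two datasets, Transaction_Long and Transaction_Short. Transaction_Long contains many policy and price quotes per customer, along with the point of purchase (flagged as true). Transaction_Short contains only the point-of-purchase entries.
My goal is to add a column named Policy_Change_Frequency to the Transaction_Short dataset. For each customer in the short dataset, I want to go through that customer's rows in the long dataset and count how many times the policy changed.
To find policy changes I can use sum(diff(Transaction_Long$Policy) != 0), but I am not sure how to iterate over the two datasets and get the result.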
Details:
Customer_Name : name of customer
Customer_ID: Customer Identifier number
Purchased: Boolean variable (Yes = 1, No = 0)
Policy: Categorical (takes values 1-5)
Price : Price quoted
Data set 1: Transaction_Long
Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,0,1,500
Joe,101,0,1,505
Joe,101,0,2,510
Joe,101,0,2,504
Joe,101,0,2,507
Joe,101,0,1,505
Joe,101,1,3,501
Mary,103,0,1,675
Mary,103,0,3,650
Mary,103,0,2,620
Mary,103,0,2,624
Mary,103,0,2,630
Mary,103,1,2,627
Data set 2: Transaction_Short
Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,1,3,501
Mary,103,1,2,627
I need to add a policy change frequency column to the Transaction_Short dataset, so the final Transaction_Short dataset should look like this:
Customer_Name,Customer_ID,Purchased,Policy,Price,Policy_ChangeFreq
Joe,101,1,3,501,3
Mary,103,1,2,627,2
Answer 0 (score: 0)
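Consider a calculated column that flags a policy change relative to the previous row within each customer, then aggregate the count. Because two aggregations are needed (the last row for each customer and the PolicyChangedFreq count), combine them with merge: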
# Flag rows whose Policy differs from the previous row of the same customer
Transaction_Long$PolicyChangedFreq <- sapply(1:nrow(Transaction_Long),
  function(i)
    if (i > 1) {
      ifelse(Transaction_Long$Policy[i-1] ==
               Transaction_Long$Policy[i], 0,
             ifelse(Transaction_Long$Customer_ID[i-1] !=
                      Transaction_Long$Customer_ID[i], 0, 1))
    } else { 0 }
)
# Last row per customer, merged with the per-customer sum of change flags
Transaction_Final <- merge(aggregate(. ~ Customer_ID + Customer_Name,
                                     Transaction_Long[, c(1:5)], FUN = tail, n = 1),
                           aggregate(. ~ Customer_ID + Customer_Name,
                                     Transaction_Long[, c(1:2, 6)], FUN = sum),
                           by = c('Customer_ID', 'Customer_Name'))
Transaction_Final
#   Customer_ID Customer_Name Purchased Policy Price PolicyChangedFreq
# 1         101           Joe         1      3   501                 3
# 2         103          Mary         1      2   627                 2
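The row-by-row sapply above can probably be replaced by a vectorized comparison of each row with its predecessor. The following is only a minimal sketch, not the answer author's code: it assumes the rows of Transaction_Long are already grouped by customer and in chronological order (as in the sample), and that Policy holds numeric codes rather than a factor. The data frame is rebuilt from the question's sample so the block runs by itself.

```r
# Sample data copied from the question (rows grouped by customer, in time order).
Transaction_Long <- data.frame(
  Customer_Name = rep(c("Joe", "Mary"), times = c(7, 6)),
  Customer_ID   = rep(c(101, 103), times = c(7, 6)),
  Purchased     = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1),
  Policy        = c(1, 1, 2, 2, 2, 1, 3, 1, 3, 2, 2, 2, 2),
  Price         = c(500, 505, 510, 504, 507, 505, 501,
                    675, 650, 620, 624, 630, 627)
)

n <- nrow(Transaction_Long)

# Compare rows 2..n with rows 1..(n-1) in one shot.
same_customer  <- Transaction_Long$Customer_ID[-1] == Transaction_Long$Customer_ID[-n]
policy_differs <- Transaction_Long$Policy[-1] != Transaction_Long$Policy[-n]

# The first row has no predecessor, so its flag is 0.
Transaction_Long$PolicyChangedFreq <- c(0, as.integer(same_customer & policy_differs))

# Per-customer totals of the flag column.
change_totals <- tapply(Transaction_Long$PolicyChangedFreq,
                        Transaction_Long$Customer_ID, sum)
```

Because the flag column has the same name and meaning as before, the merge/aggregate step should work unchanged on top of it.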
Answer 1 (score: 0)
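@Parfait, thank you for your solution. I solved the problem using the sqldf package in R:
library(sqldf)
for (i in seq_len(nrow(Transaction_Short))) {
  # Pull this customer's Policy sequence from the long dataset
  sql <- sprintf("SELECT Policy FROM Transaction_Long WHERE Customer_ID = %s",
                 Transaction_Short$Customer_ID[i])
  df <- sqldf(sql)
  # Count positions where consecutive Policy values differ
  NF <- sum(df$Policy[-1] != df$Policy[-length(df$Policy)])
  Transaction_Short$Policy_Change_Freq[i] <- NF
}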
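Because I have about 500K rows in the long dataset and about 100K in the short dataset, this takes a while. Is there any other solution that does not require a loop? Thanks.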
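One loop-free possibility in base R is to compute the change count once per customer with tapply and then look the counts up by Customer_ID. This is a sketch under assumptions, not a benchmarked solution: it assumes each customer's rows in Transaction_Long are in chronological order, that Policy is numeric, and that every ID in Transaction_Short also appears in Transaction_Long. The helper name count_changes is made up for illustration, and the sample below keeps only the columns the calculation needs.

```r
# Reduced sample from the question: only the columns used here.
Transaction_Long <- data.frame(
  Customer_ID = rep(c(101, 103), times = c(7, 6)),
  Policy      = c(1, 1, 2, 2, 2, 1, 3, 1, 3, 2, 2, 2, 2)
)
Transaction_Short <- data.frame(
  Customer_Name = c("Joe", "Mary"),
  Customer_ID   = c(101, 103)
)

# Hypothetical helper: number of positions where consecutive values differ.
count_changes <- function(p) sum(diff(p) != 0)

# One pass over the long data: a named vector of counts, keyed by Customer_ID.
changes <- tapply(Transaction_Long$Policy, Transaction_Long$Customer_ID, count_changes)

# Look up each short-table customer by name; no explicit loop and no per-row query.
Transaction_Short$Policy_Change_Freq <- as.integer(
  changes[as.character(Transaction_Short$Customer_ID)]
)
```

The main design point is that the long table is grouped once, instead of being queried once per row of the short table.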