In R, iterate over two data sets and return a result without looping

Time: 2016-04-17 02:50:31

Tags: r loops

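I have two data sets, Transaction_Long and Transaction_Short. Transaction_Long holds many policy and price quotes per customer, plus the point of purchase (flagged as true). Transaction_Short holds only the point-of-purchase rows.

My goal is to add a column named Policy_Change_Frequency to the Transaction_Short data set. For each customer in the short data set, I want to go through that customer's rows in the long data set and count the number of times the policy changed.

To find the policy changes I can use sum(diff(Transaction_Long$Policy) != 0), but I am not sure how to iterate over the two data sets and get the result.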
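As a quick check of that expression on a single customer, here is a minimal sketch that applies it to Joe's Policy values from the sample data. The variable names joe_policy, steps and n_changes are illustrative only.

```r
# Joe's Policy values, in row order, copied from the Transaction_Long sample
joe_policy <- c(1, 1, 2, 2, 2, 1, 3)

# diff() gives the step between consecutive rows; a non-zero step is a change
steps <- diff(joe_policy)        # 0 1 0 0 -1 2
n_changes <- sum(steps != 0)     # 3
print(n_changes)
```

Applied to the whole Policy column at once, the same expression would also count the jump from one customer's last row to the next customer's first row, which is why it has to be evaluated per customer.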

Details:

Customer_Name : name of customer
Customer_ID: Customer Identifier number
Purchased: Boolean variable (1 = yes, 0 = no)
Policy: Categorical (takes values 1-5)
Price: Price quoted

Data set 1: Transaction_Long

Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,0,1,500
Joe,101,0,1,505
Joe,101,0,2,510
Joe,101,0,2,504
Joe,101,0,2,507
Joe,101,0,1,505
Joe,101,1,3,501
Mary,103,0,1,675
Mary,103,0,3,650
Mary,103,0,2,620
Mary,103,0,2,624
Mary,103,0,2,630
Mary,103,1,2,627

Data set 2: Transaction_Short

Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,1,3,501
Mary,103,1,2,627
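For anyone who wants to reproduce the problem, the two listings can be loaded roughly as follows. This is a sketch that assumes the data is available as comma-separated text. It also assumes the short data set is exactly the Purchased == 1 rows of the long one, which holds for this sample but may not hold for the real data.

```r
# Assumption: the long listing is available as comma-separated text
long_csv <- "Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,0,1,500
Joe,101,0,1,505
Joe,101,0,2,510
Joe,101,0,2,504
Joe,101,0,2,507
Joe,101,0,1,505
Joe,101,1,3,501
Mary,103,0,1,675
Mary,103,0,3,650
Mary,103,0,2,620
Mary,103,0,2,624
Mary,103,0,2,630
Mary,103,1,2,627"

Transaction_Long <- read.csv(text = long_csv, strip.white = TRUE)

# Assumption: the short data set is just the purchase rows of the long one
Transaction_Short <- Transaction_Long[Transaction_Long$Purchased == 1, ]
rownames(Transaction_Short) <- NULL
print(Transaction_Short)
```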

I need to add a policy change frequency column to the Transaction_Short data set, so the final data set should look like this:

Customer_Name,Customer_ID,Purchased,Policy,Price,Policy_ChangeFreq
Joe,101,1,3,501,3
Mary,103,1,2,627,2

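2 answers:

Answer 0 (score: 0)

Consider a calculated column for policy changes that flags, within each customer, a change from the previous row. Then aggregate the count. Since two aggregations are needed (the last row per customer and the count of policy changes), use a merge: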

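# Flag each row whose Policy differs from the previous row of the same customer.
# Assumes each customer's rows are contiguous and in quote order.
Transaction_Long$PolicyChangedFreq <- sapply(seq_len(nrow(Transaction_Long)),
  function(i) {
    if (i == 1) return(0)    # the first row has no predecessor
    same_customer <- Transaction_Long$Customer_ID[i - 1] == Transaction_Long$Customer_ID[i]
    changed <- Transaction_Long$Policy[i - 1] != Transaction_Long$Policy[i]
    as.numeric(same_customer && changed)
  })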

Transaction_Final  <- merge(aggregate(.~ Customer_ID + Customer_Name, 
                                     Transaction_Long[,c(1:5)], FUN = tail, n = 1),
                            aggregate(.~ Customer_ID + Customer_Name, 
                                     Transaction_Long[,c(1:2,6)], FUN = sum),
                            by = c('Customer_ID', 'Customer_Name'))
Transaction_Final
#  Customer_ID   Customer_Name Purchased Policy   Price PolicyChangedFreq
#1         101             Joe         1      3     501                 3
#2         103            Mary         1      2     627                 2
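The row-by-row flag can also be computed without sapply by comparing each column with a copy of itself shifted by one row. This is a sketch of that variation, not the code from this answer. It assumes each customer's rows are contiguous and in quote order, and the column name PolicyChanged and the helper variables are illustrative.

```r
# Sample data reduced to the two columns the flag depends on
Transaction_Long <- data.frame(
  Customer_ID = c(rep(101, 7), rep(103, 6)),
  Policy      = c(1, 1, 2, 2, 2, 1, 3, 1, 3, 2, 2, 2, 2)
)

n <- nrow(Transaction_Long)

# Compare rows 2..n with rows 1..(n-1) in one vectorised step
same_customer <- Transaction_Long$Customer_ID[-1] == Transaction_Long$Customer_ID[-n]
changed       <- Transaction_Long$Policy[-1] != Transaction_Long$Policy[-n]

# The first row has no predecessor, so its flag is 0
Transaction_Long$PolicyChanged <- c(0, as.integer(same_customer & changed))

# Sum the flags per customer
change_counts <- aggregate(PolicyChanged ~ Customer_ID,
                           data = Transaction_Long, FUN = sum)
print(change_counts)
```

On the sample data this gives 3 for customer 101 and 2 for customer 103, matching the output above.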

Answer 1 (score: 0)

@Parfait, thank you for your solution. I solved this using the sqldf package in R:

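library(sqldf)

# For each customer in the short data set, pull that customer's policies
# from the long data set and count how often consecutive values differ
for (i in seq_len(nrow(Transaction_Short))) {
  sql <- sprintf("SELECT Policy FROM Transaction_Long WHERE Customer_ID = %s",
                 Transaction_Short$Customer_ID[i])
  policy <- sqldf(sql)[[1]]
  Transaction_Short$Policy_Change_Freq[i] <- sum(policy[-1] != policy[-length(policy)])
}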

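Since I have about 500K rows in the long data set and about 100K in the short one, this takes a while. Is there any other solution that does not need a loop? Thanks.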
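One loop-free option in base R is to compute the per-customer count in a single tapply call and then look the counts up by customer ID. This is a sketch under the assumption that the rows of each customer in the long data set are already in quote order and that Customer_ID has the same type in both data sets. The names change_freq and idx are illustrative.

```r
# Sample data from the question, reduced to the columns that matter here
Transaction_Long <- data.frame(
  Customer_ID = c(rep(101, 7), rep(103, 6)),
  Policy      = c(1, 1, 2, 2, 2, 1, 3, 1, 3, 2, 2, 2, 2)
)
Transaction_Short <- data.frame(
  Customer_Name = c("Joe", "Mary"),
  Customer_ID   = c(101, 103),
  Purchased     = c(1, 1),
  Policy        = c(3, 2),
  Price         = c(501, 627)
)

# One pass over the long data: count policy changes within each customer
change_freq <- tapply(Transaction_Long$Policy, Transaction_Long$Customer_ID,
                      function(p) sum(diff(p) != 0))

# tapply names its result by group, so look up each short-table ID by name
idx <- match(as.character(Transaction_Short$Customer_ID), names(change_freq))
Transaction_Short$Policy_Change_Freq <- as.vector(change_freq[idx])
print(Transaction_Short)
```

This replaces one SQL query per customer with a single grouped pass, so it should scale better to 500K rows, although that has not been timed here. Customers that are missing from the long data set would get NA.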