Question

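I have two very large datasets of product demands and returns (about 4 million entries each, though of unequal length). The first dataset gives [1] the demand date, [2] the customer ID and [3] the product ID. The second dataset gives [1] the return date, [2] the customer ID and [3] the product ID.

Now I want to match every demand for a given customer and product with the return for the same customer and product. Customer/product pairs are not unique, because a customer can demand the same product several times. I therefore want to match each product demand with the earliest return in the dataset. It can also happen that some products are never returned, or that some products are returned without ever having been demanded (because the customer returns an item that was demanded before the start date of the dataset).

To do this, I wrote the following code: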

transactionNumber <- seq_len(nrow(demandSet))  # transaction numbers for the demandSet
matchedNumber <- rep(0, nrow(demandSet))       # for each demand, the matched row of returnSet (0 = no match)

for (transaction in transactionNumber) {
  # rows of returnSet with the same customer ID and product ID as this demand
  indices <- which(returnSet[, 2] == demandSet[transaction, 2] &
                   returnSet[, 3] == demandSet[transaction, 3])
  if (length(indices) > 0) {
    # select the index of the return with the minimum date
    matchedNumber[transaction] <- indices[which.min(returnSet[indices, 1])]
  }
}

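However, this takes about a day to compute. Does anyone have a better suggestion? Note that the suggestion from "match two columns with two other columns" does not work here, because match() runs out of memory.

As a working example, consider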

demandDates = c(1,1,1,5,6,6,8,8)
demandCustIds = c(1,1,1,2,3,3,1,1)
demandProdIds = c(1,2,3,4,1,5,2,6)
demandSet = data.frame(demandDates,demandCustIds,demandProdIds)

returnDates = c(1,1,4,4,4)
returnCustIds = c(4,4,1,1,1)
returnProdIds = c(5,7,1,2,3)
returnSet = data.frame(returnDates,returnCustIds,returnProdIds)

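(This does not actually work correctly, because transaction 7 is wrongly matched with return 4, but for the sake of the question let's assume this is what I want... I can fix that later.)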
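To make that caveat concrete, here is a minimal sketch that applies the same matching rule to the example and flags matches whose return date precedes the demand date. The `tooEarly` check is an assumption about why the match is wrong (a return should not predate the demand it belongs to); it is not part of the original code.

```r
demandSet <- data.frame(demandDates   = c(1, 1, 1, 5, 6, 6, 8, 8),
                        demandCustIds = c(1, 1, 1, 2, 3, 3, 1, 1),
                        demandProdIds = c(1, 2, 3, 4, 1, 5, 2, 6))
returnSet <- data.frame(returnDates   = c(1, 1, 4, 4, 4),
                        returnCustIds = c(4, 4, 1, 1, 1),
                        returnProdIds = c(5, 7, 1, 2, 3))

# Same rule as the loop: earliest return with the same customer and product, 0 if none
matchedNumber <- vapply(seq_len(nrow(demandSet)), function(i) {
  idx <- which(returnSet$returnCustIds == demandSet$demandCustIds[i] &
               returnSet$returnProdIds == demandSet$demandProdIds[i])
  if (length(idx) == 0L) 0L else idx[which.min(returnSet$returnDates[idx])]
}, integer(1))
matchedNumber
# 3 4 5 0 0 0 4 0

# Assumption: a match is suspicious when the return is dated before the demand
matched  <- matchedNumber > 0
tooEarly <- rep(FALSE, nrow(demandSet))
tooEarly[matched] <- returnSet$returnDates[matchedNumber[matched]] <
                     demandSet$demandDates[matched]
which(tooEarly)
# 7
```

Transaction 7 (demand date 8) is paired with return 4 (return date 4), which transaction 2 has already claimed.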

Answer 1

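library(data.table)

# key both tables on (customer ID, product ID)
DD <- data.table(demandSet, key = c("demandCustIds", "demandProdIds"))
DR <- data.table(returnSet, key = c("returnCustIds", "returnProdIds"))

# for each row of DR, take the first matching row of DD
DD[DR, mult = "first"]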

   demandCustIds demandProdIds demandDates returnDates
1:             1             1           1           4
2:             1             2           1           4
3:             1             3           1           4
4:             4             5          NA           1
5:             4             7          NA           1

Conditionally matching elements in multiple columns of two large datasets
