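I am trying to find a way to filter a DT on two keys using an alternative (OR) rather than a join. The dplyr solution looks like this:
filter(DF, A == a | B == b)
I tried to do the same in data.table with the keys set on A and B, but no luck so far.
I don't want to use the DT[A == a | B == b] form, because a vector scan is less efficient.
Let's take the following data as an example: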
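Answer 0 (score: 1)
Thanks to @Frank for the answer; it turned out to be the right approach.
Frank suggested mya = DT[A==a,which=TRUE]; myb = DT[B==b,which=TRUE]; DT[union(mya,myb)], because it performs two binary searches. Each which=TRUE call returns the matching row numbers instead of the rows, and union() merges the two sets of row numbers without duplicates.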
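As a rough illustration of the idea, here is a minimal sketch on made-up data. The table contents, the extra column V, and the values of a and b are assumptions chosen for the example, not the asker's data; sort() is added here only so the rows come back in table order, which the original one-liner does not do.

```r
library(data.table)

# Hypothetical toy data: column names A and B follow the question,
# the values and the extra column V are made up for illustration.
DT <- data.table(
  A = c(1L, 1L, 2L, 3L, 3L, 4L),
  B = c("x", "y", "x", "z", "y", "z"),
  V = 1:6
)
setkey(DT, A, B)  # physically sorts DT by A, then B

a <- 3L
b <- "x"

# Row numbers matching each condition on its own.
mya <- DT[A == a, which = TRUE]
myb <- DT[B == b, which = TRUE]

# The union of the row numbers selects rows where A == a OR B == b.
# sort() is an addition so the result follows the table's row order.
res_union <- DT[sort(union(mya, myb))]

# Reference result from the plain vector scan, for comparison.
res_scan <- DT[A == a | B == b]

print(res_union)
stopifnot(identical(res_union$V, res_scan$V))
```

Without sort(), union() lists the rows matched on A first and then the additional rows matched on B, so the row order can differ from the vector-scan result even when the same rows are selected.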
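I ran some benchmarks on a larger dataset (97671 x 13), and this is what it looks like (some of the attempts in question are included as well; a join example is added for comparison):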
> microbenchmark(
+   filter(ref.transactions, TalentID == talent.id | RecurringProfileID == recurring.profile.id),
+   ref.transactions[TalentID == talent.id | RecurringProfileID == recurring.profile.id],
+   unique(rbindlist(list(ref.transactions[.(talent.id)], ref.transactions[.(unique(c(talent.id, NA)), recurring.profile.id)]))),
+   unique(rbind(ref.transactions[.(talent.id)], ref.transactions[.(unique(c(talent.id, NA)), recurring.profile.id)])),
+   ref.transactions[.(talent.id, recurring.profile.id)],
+   {mya = ref.transactions[TalentID==talent.id,which=TRUE]; myb = ref.transactions[RecurringProfileID==recurring.profile.id,which=TRUE]; ref.transactions[union(mya,myb)]}
+ )
Unit: milliseconds
 expr        min         lq       mean     median         uq        max  neval
 (1)   10.039814  11.874223  14.278728  12.560975  13.562596  45.023206    100
 (2)    6.934124   7.838649   9.323780   8.227186   8.822951  40.115687    100
 (3)    9.859269  10.826785  13.546877  11.663016  13.073455  47.173324    100
 (4)    9.910144  11.027810  14.633140  11.663457  12.920559  57.256676    100
 (5)    1.196426   1.316740   1.513665   1.470091   1.574857   2.799963    100
 (6)    1.710616   1.978395   3.085824   2.121029   2.370705  30.513052    100
Expressions:
 (1) filter(ref.transactions, TalentID == talent.id | RecurringProfileID == recurring.profile.id)
 (2) ref.transactions[TalentID == talent.id | RecurringProfileID == recurring.profile.id]
 (3) unique(rbindlist(list(ref.transactions[.(talent.id)], ref.transactions[.(unique(c(talent.id, NA)), recurring.profile.id)])))
 (4) unique(rbind(ref.transactions[.(talent.id)], ref.transactions[.(unique(c(talent.id, NA)), recurring.profile.id)]))
 (5) ref.transactions[.(talent.id, recurring.profile.id)]
 (6) { mya = ref.transactions[TalentID == talent.id, which = TRUE]; myb = ref.transactions[RecurringProfileID == recurring.profile.id, which = TRUE]; ref.transactions[union(mya, myb)] }
> df.res <- filter(ref.transactions, TalentID == talent.id | RecurringProfileID == recurring.profile.id)
> mya = ref.transactions[TalentID==talent.id,which=TRUE]; myb = ref.transactions[RecurringProfileID==recurring.profile.id,which=TRUE]; dt.res <- ref.transactions[union(mya,myb)]
> identical(df.res, dt.res)
[1] TRUE