In R

Time: 2017-03-21 15:56:43

Tags: r machine-learning matching bigdata

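As a toy example, consider the following: we have true data x and a perturbed version of it, y, and the two are combined in z with the rows shuffled: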

x = matrix(1:100, nrow = 100, ncol= 4 , byrow = FALSE)
y = x + matrix( .001 * rnorm(n = 400), nrow = 100, ncol= 4)
z = rbind(x,y)
z = z[sample(nrow(z)),]

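How can we efficiently find or estimate the matching rows of z in R? In the end I want to keep either only the rows that belong to x, or only one row from each x/y pair, but not both. I looked at the package RecordLinkage, but for the purely numeric case maybe there is a more efficient solution. Also, in my setting I have 100K+ rows and 20 columns, and calling compare.dedup on the full data set needs too much memory.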

Edit: I tried the suggested approach:

library(magrittr)  # provides the %>% pipe used below

set.seed(100)
x = matrix(1:100, nrow = 100, ncol = 4, byrow = FALSE)
y = x + matrix(.001 * rnorm(n = 400), nrow = 100, ncol = 4)
z = rbind(x, y)

# z = z[sample(nrow(z)), ]

res = caret::findLinearCombos(t(z))
res$remove %>% sort

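The result is shown below. We see that we get 8.0 as well as its perturbed counterpart 8.000572, and the same happens for 9 and 10. So it works for some rows, but not in general.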

  
    

z[res$remove, 1] %>% sort
  [1] 2.000000 3.000000 4.000000 4.0000952 5.000000 5.001135 6.000000 6.000008 7.000000 7.001225
 [11] 8.000000 8.000572 8.997471 9.000000 10.000000 10.000135 10.999871 11.000000 12.000000 12.000113
 [21] 12.999917 13.000000 13.998705 14.000000 15.000000 15.001787 16.000000 16.002099 17.000000 17.000232
 [31] 18.000000 18.000062 19.000000 19.000354 20.000000 20.000725 21.000000 21.000268 21.999909 22.000000
 [41] 22.999861 23.000000 24.000000 24.001042 26.0000567 27.000000 27.000610
 [51] 27.999102 28.000000 29.000000 29.000263 30.000000 30.001195 31.000000 31.000267 32.000000 32.000999
 [61] 33.000000 33.001137 34.000000 34.000603 35.000000 35.001352 36.000000 36.001945 36.998791 37.000000
 [71] 38.000000 38.003187 38.999596 39.000000 39.997090 40.000000 40.999639 41.000000 42.000000 42.000220
 [81] 43.000000 43.000062 44.000000 44.000170 45.000000 45.000222 45.998763 46.000000 47.000000 47.001132
 [91] 47.999887 48.000000 49.000000 49.002185 50.000000 50.000743 51.000000 51.002065 52.000000 52.001307
[101] 52.998977 53.000000 53.999975 54.000000 54.999356 55.000000 56.000000 56.001569 57.000000 57.000013
[111] 58.000000 58.001158 58.999849 59.000000 59.999147 60.000000 61.000000 61.001045 61.999888 62.000000
[121] 62.998223 63.000000 63.999040 64.000000 64.998698 65.000000 66.000000 66.000069 66.999729 67.000000
[131] 68.000000 68.000566 69.000000 69.000426 69.998899 70.000000 71.000000 71.000105 71.999957 72.000000
[141] 73.000000 73.000644 73.999902 74.000000 74.999892 75.000000 76.000000 76.000321 77.000000 77.000765
[151] 78.000000 78.000649 78.999644 79.000000 79.998975 80.000000 80.998300 81.000000 82.000000 82.001297
[161] 82.998977 83.000000 83.998629 84.000000 84.999534 85.000000 85.998803 86.000000 87.000000 87.001064
[171] 87.999871 88.000000 88.998835 89.000000 89.998987 90.000000 91.000000 91.001467 92.000000 92.001252
[181] 93.000000 93.000839 93.998372 94.000000 94.999120 95.000000 95.999964 96.000000 96.999911 97.000000
[191] 98.000000 98.002148 99.000000 99.000914 100.000000 100.001824

  

1 answer:

Answer 0 (score: 1)

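The caret package has a function findLinearCombos() that can help you identify linear dependencies between the columns of a matrix (it iteratively removes columns and rechecks the rank each time); in your case you would want to transpose the matrix first. I would give it a try.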
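
For comparison, here is a minimal sketch of a different technique that is not part of the answer above: sort the rows on one column and flag a row as a near-duplicate when every column lies within a tolerance of the row just before it. The tolerance `tol = 0.01` and the names `zs`, `gap`, `is_dup` and `deduped` are assumptions made for this toy example (the noise scale there is 0.001); real data would need a tolerance chosen from its own noise level.

```r
# Toy data from the question: x, a perturbed copy y, stacked and shuffled.
set.seed(100)
x <- matrix(1:100, nrow = 100, ncol = 4)
y <- x + matrix(0.001 * rnorm(400), nrow = 100, ncol = 4)
z <- rbind(x, y)
z <- z[sample(nrow(z)), ]

tol <- 0.01  # assumed tolerance, chosen well above the noise scale of 0.001

# Sort rows by the first column so that matching rows end up next to each other.
zs <- z[order(z[, 1]), , drop = FALSE]

# For each pair of consecutive rows, take the largest absolute column difference.
# diff() works column-wise on a matrix; pmax() over the columns gives the row maximum.
gap <- do.call(pmax, as.data.frame(abs(diff(zs))))

# A row counts as a near-duplicate if it is within tol of the row before it.
is_dup <- c(FALSE, gap < tol)

# Keep one row per matched pair.
deduped <- zs[!is_dup, , drop = FALSE]
nrow(deduped)
```

This needs one sort and one pass over the rows, so memory stays linear in the size of z. It relies on matching rows being adjacent after sorting on a single column; if that column has many near-ties between unrelated rows, pairs can be missed, and a sort on several columns or a nearest-neighbour search would be the safer choice.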