Question

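I have a function in R that compares a smaller vector against a larger vector, finds the positions where they match, and uses that information to extract data from a larger data frame.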

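In this function, mass_lst is a list of compound masses, for example:

mass_lst <- c(315, 243, 484, 121)

AB_massLst_numeric is a larger list of masses, for example:

AB_massLst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232)

AB_lst is a larger data frame from which I extract data using the positions vector.

match_df is an empty data frame that I rbind the extracted data to.

The problem is that the function contains a for loop and takes a very long time, even when I call it with

test <- sapply(mass_lst, compare_masses)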

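So my question is: how can I make this function faster, and possibly remove the for loop? My real data is much larger than the example I have provided. I cannot think of a way to make this function work without iterating.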
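
For readers who want something concrete to run, here is a minimal sketch of a function matching the description above. It is a hypothetical reconstruction, not the asker's actual code: the name compare_masses_loop, the 0.02 tolerance (borrowed from the answers), the two-column layout of AB_lst, and the sample values (chosen so that some masses match) are all assumptions.

```r
# Hypothetical reconstruction of the slow, loop-based approach.
# Sample values are illustrative and chosen so that some masses match.
mass_lst <- c(375, 243, 676, 121)
AB_massLst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232, 676)
AB_lst <- data.frame(x = 1, y = AB_massLst_numeric)

compare_masses_loop <- function(mass_lst) {
  match_df <- AB_lst[0, ]                       # empty frame with the same columns
  for (i in seq_along(mass_lst)) {
    # Positions in the larger vector within 0.02 of the current mass.
    positions <- which(abs(AB_massLst_numeric - mass_lst[i]) < 0.02)
    # Growing a data frame with rbind inside a loop is the slow part.
    match_df <- rbind(match_df, AB_lst[positions, ])
  }
  match_df
}

loop_result <- compare_masses_loop(mass_lst)
print(loop_result)
```

Each rbind call copies the accumulated data frame, which is the most likely reason this pattern slows down as the inputs grow.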

Answer 1

Try wrapping it all in lapply and using do.call, so that all of the rbind calls are performed at once rather than one at a time:

match_df <- do.call(rbind.data.frame, lapply(mass_lst, function(x) AB_lst[abs(AB_lst_numeric - x) < 0.02, ]))
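
As a quick check of the one-liner above, the following sketch runs it on a small data set. The sample values and the two-column layout of AB_lst are assumptions chosen for illustration; only the do.call/lapply expression comes from the answer.

```r
# Illustrative data (assumed); AB_lst_numeric follows the naming used in this answer.
mass_lst <- c(375, 243, 676, 121)
AB_lst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232, 676)
AB_lst <- data.frame(x = 1, y = AB_lst_numeric)

# lapply builds one (possibly empty) data frame per target mass;
# do.call hands the whole list to rbind.data.frame in a single call.
pieces <- lapply(mass_lst, function(x) AB_lst[abs(AB_lst_numeric - x) < 0.02, ])
match_df <- do.call(rbind.data.frame, pieces)
print(match_df)
```

Targets with no match contribute zero-row data frames, which rbind.data.frame simply skips.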

In response to a comment about the speed of dplyr::bind_rows compared with do.call, I created an AB_lst_numeric vector with 1k values between 0 and 1000, a corresponding AB_lst data.frame, and a mass_lst vector containing 100 elements. Below are the results of this test using rbenchmark. You can see that the do.call and bind_rows calls are fairly comparable (bind_rows is 36% more efficient than do.call, versus a 110% efficiency gain over the original solution).

benchmark(
  match_df <- compare_masses(mass_lst),
  match_df <- do.call(rbind.data.frame, lapply(mass_lst, function(x) AB_lst[abs(AB_lst_numeric - x) < 0.02, ])),
  match_df <- bind_rows(lapply(mass_lst, function(x) AB_lst[abs(AB_lst_numeric - x) < 0.02, ])))

## 3 match_df <- bind_rows(lapply(mass_lst, function(x) AB_lst[abs(AB_lst_numeric - x) < 0.02, ]))
## 1 match_df <- compare_masses(mass_lst)
## 2 match_df <- do.call(rbind.data.frame, lapply(mass_lst, function(x) AB_lst[abs(AB_lst_numeric - x) < 0.02, ]))
##   replications elapsed relative user.self sys.self user.child sys.child
## 3          100   1.453    1.000     1.387    0.059          0         0
## 1          100   3.050    2.099     2.983    0.051          0         0
## 2          100   1.974    1.359     1.905    0.060          0         0
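
The test data itself is not shown, so the following is only a sketch of how a comparable setup might be built in base R. The seed, the column names, drawing mass_lst from AB_lst_numeric (so every target has at least one match), and the helper names loop_version and bound_version are all assumptions; system.time stands in for rbenchmark to avoid extra packages.

```r
set.seed(42)  # assumed seed, for reproducibility only
AB_lst_numeric <- runif(1000, min = 0, max = 1000)   # 1k values between 0 and 1000
AB_lst <- data.frame(id = seq_along(AB_lst_numeric), mass = AB_lst_numeric)
mass_lst <- sample(AB_lst_numeric, 100)              # 100 targets (assumed to be exact members)

# Hypothetical loop-based baseline: grows the result one rbind at a time.
loop_version <- function(mass_lst) {
  out <- AB_lst[0, ]
  for (m in mass_lst) out <- rbind(out, AB_lst[abs(AB_lst_numeric - m) < 0.02, ])
  out
}

# The do.call approach from this answer: one rbind over the whole list.
bound_version <- function(mass_lst) {
  do.call(rbind.data.frame,
          lapply(mass_lst, function(x) AB_lst[abs(AB_lst_numeric - x) < 0.02, ]))
}

res_loop <- loop_version(mass_lst)
res_bound <- bound_version(mass_lst)

# Timings vary by machine; no particular numbers are implied here.
print(system.time(for (k in 1:20) loop_version(mass_lst)))
print(system.time(for (k in 1:20) bound_version(mass_lst)))
```

Both versions should select the same rows in the same order, so comparing their id columns is a cheap sanity check before trusting any timing.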


Answer 2

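This should be a vectorized solution. Use the compare_masses function as posted; it is significantly faster than the other solutions.

Write an anonymous function and vectorize it. It performs the same comparison you do in the loop.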

pos = Vectorize(FUN = function(y) {abs(AB_massLst_numeric-y) < 0.02}, vectorize.args = "y")

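Find the indices to subset on; this step replaces do.call(rbind, ...) or bind_rows. It should be fast, because it is just a logical comparison over a matrix of size length(AB_massLst_numeric) x length(mass_lst). The step is needed because I could not get the vectorized function to work nicely with which.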

i = unlist(apply(X = pos(mass_lst), MARGIN = 2, FUN = which))

Subset and store

AB_lst[i,]
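
Putting the three steps together, an end-to-end sketch might look like the following. The sample data and the intermediate name hits are assumptions for illustration; pos is the vectorized function defined above.

```r
# Illustrative data (assumed values).
mass_lst <- c(375, 243, 676, 121)
AB_massLst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232, 676)
AB_lst <- data.frame(x = 1, y = AB_massLst_numeric)

# Step 1: vectorize the comparison over the target masses.
pos <- Vectorize(FUN = function(y) abs(AB_massLst_numeric - y) < 0.02,
                 vectorize.args = "y")

# Step 2: one logical column per target mass, one row per candidate mass.
hits <- pos(mass_lst)
i <- unlist(apply(X = hits, MARGIN = 2, FUN = which))

# Step 3: subset once with all collected row indices.
result <- AB_lst[i, ]
print(result)
```

Note that Vectorize is a wrapper around mapply, so the comparison is still evaluated once per target mass; the saving comes mainly from subsetting the data frame a single time.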

Edit: use the compare_masses function as posted. It is significantly faster than the other solutions.

Unit: microseconds
           expr      min       lq      mean   median       uq      max neval  cld
      Vectorize  318.595  327.280  358.9813  355.112  386.892  413.739    10  b  
        do.call 1418.473 1510.853 1569.7161 1578.954 1635.606 1744.173    10    d
      bind_rows  744.570  801.420  813.9346  815.435  836.161  871.297    10   c 
 compare_masses  135.808  138.176  158.0344  158.508  169.365  197.395    10 a

A larger test data set

Unit: nanoseconds
           expr      min       lq         mean   median       uq       max neval cld
      Vectorize   239242   292341   342314.079   324714   359455   3480844  1000 a  
 compare_masses      395     1975     3674.669     3554     4738     19346  1000 a  
        do.call 16570424 18223007 21092022.254 20921183 22194176 159718470  1000   c
      bind_rows 13423572 14869680 17027330.356 17008639 18061341 116983885  1000  b

Answer 3

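Use R's vector recycling. First build a positions vector of length N * m, where N is the number of rows in AB_lst and m is length(mass_lst). Then use this vector to select rows from the data frame.

See the complete runnable example below.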

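compare_masses <- function(mass_lst) {
  # Start from an empty index vector on each call instead of relying on a global.
  positions <- c()
  for (i in seq_along(mass_lst)) {
    # Append the indices of every mass within 0.02 of the current target.
    positions <- c(positions, which(abs(AB_massLst_numeric - mass_lst[i]) < 0.02))
  }
  # Subset the data frame once, after all indices are collected.
  return(AB_lst[positions, ])
}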

mass_lst <- c(375, 243, 676, 121)
AB_massLst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232, 676)

AB_lst <- data.frame(x=1,y=AB_massLst_numeric)
match_df <- AB_lst[c(),]

compare_masses(mass_lst)
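
The listing above still loops over mass_lst. One possible loop-free reading of the N * m idea described in this answer is sketched below; it uses explicit rep() calls rather than implicit recycling, and the names N, m, diffs, hits, and rows are assumptions, not the answerer's code.

```r
mass_lst <- c(375, 243, 676, 121)
AB_massLst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232, 676)
AB_lst <- data.frame(x = 1, y = AB_massLst_numeric)

N <- nrow(AB_lst)        # number of candidate rows
m <- length(mass_lst)    # number of target masses

# Lay the candidates out m times and repeat each target N times,
# giving one length N * m comparison instead of m separate ones.
diffs <- abs(rep(AB_massLst_numeric, times = m) - rep(mass_lst, each = N))
hits <- which(diffs < 0.02)

# Map positions in the long vector back to row numbers of AB_lst.
rows <- (hits - 1) %% N + 1
result <- AB_lst[rows, ]
print(result)
```

This trades memory for speed: the intermediate vector has N * m elements, which may matter when both inputs are large.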

Answer 4

You can loop to find the indices of the rows you want, then select the rows based on those indices:

set.seed(1)
DF <- data.frame(x=runif(1e2), y=sample(letters, 1e2, rep=T))
LIST <- list(0, 0.2, 0.4, 0.5)
DF[unlist(lapply(LIST, function(y) which(abs(DF$x - y) < .02))), ]

For our dummy data, this produces:

            x y
24 0.01017122 b
70 0.01065314 d
5  0.19193779 e
40 0.21181133 l
65 0.21488963 q
80 0.20122201 q
16 0.39572663 e
23 0.41434742 x
30 0.41330587 t
67 0.40899105 p
73 0.40808877 x
78 0.49894035 o
79 0.49745918 o

Note that the values we selected are indeed within 0.02 of the target values.
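
Applied to the names used in the question, the same pattern might look like this sketch. The sample values and the layout of AB_lst are assumptions for illustration.

```r
# Illustrative data (assumed values).
mass_lst <- c(375, 243, 676, 121)
AB_massLst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232, 676)
AB_lst <- data.frame(x = 1, y = AB_massLst_numeric)

# Collect matching row indices per target mass, flatten, then subset once.
idx <- unlist(lapply(mass_lst, function(m) which(abs(AB_massLst_numeric - m) < 0.02)))
match_df <- AB_lst[idx, ]
print(match_df)
```

Collecting integer indices first and subsetting the data frame once avoids building and binding many small data frames.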

Slow function: how to remove a for loop in R

4 answers: