Question

我有两个数据框：

set.seed(123) myData<-data.frame(id=1:10, pos=21:30) refData<-data.frame(id=letters[1:15], pos=sample(10:40,15))

看起来像

我想要一个myData的扩展数据框。对于myData中的每一行，我想检查refData中是否有一个条目，距离小于2个数字，如果是这样，我希望将refData的ID粘贴到myData的新列中。最后，我的新数据框应如下所示：

 id1 pos1     newColumn
   1   21 c, g, i, j, l
   2   22    c, g, i, j
   3   23       c, g, i
   4   24          g, i
   5   25             g
   6   26              
   7   27              
   8   28             k
   9   29          h, k
  10   30       h, k, m

显然，我可以使用以下循环来做到这一点，这很好用：

myData$newColumn<-rep(NA, nrow(myData))
for(i in 1:nrow(myData)){
  ww<-which(abs(refData$pos2 - myData$pos1[i]) <=  2)
  myData$newColumn[i]<-paste(refData[ww,1],collapse=", ")
}

但是，我正在寻找一种非常快速的方法，因为我的真实数据有大约10 ^ 6个条目，而我真正的refData大约有10 ^ 7个条目。

我非常感谢快速方式的任何帮助和想法！

Answer 1

你可以尝试：

myData$newColumn = lapply(myData$pos, 
                 function(x) {paste(refData$id[abs(refData$pos-x)<3],collapse=', ')})

输出：

   id pos     newColumn
1   1  21 c, g, i, j, l
2   2  22    c, g, i, j
3   3  23       c, g, i
4   4  24          g, i
5   5  25             g
6   6  26              
7   7  27              
8   8  28             k
9   9  29          h, k
10 10  30       h, k, m

希望这有帮助！

Answer 2

另一种选择是

myData$newColumn <- sapply(myData$pos, function(x) paste(refData$id[refData$pos >= x-2 & refData$pos <= x+2], collapse = ", "))

n = 1000的基准显示@ Florian的解决方案略微提前：

set.seed(123)
myData<-data.frame(id=1:1000, pos=sample(21:30, 1000, replace = T))
refData<-data.frame(id=sample(letters[1:15], 1000, replace = T), pos=sample(10:40, 1000, replace = T))

myData$newColumn<-rep(NA, nrow(myData))

library(microbenchmark)
microbenchmark(for(i in 1:nrow(myData)){
  ww<-which(abs(refData$pos - myData$pos[i]) <=  2)
  myData$newColumn[i]<-paste(refData[ww, "id"],collapse=", ")
},
myData$newColumn2 <- sapply(myData$pos, function(x) paste(refData$id[refData$pos >= x-2 & refData$pos <= x+2], collapse = ", ")),
myData$newColumn3 <- lapply(myData$pos, function(x) paste(refData$id[abs(refData$pos - x) <  3], collapse = ", ")))

Unit: milliseconds

    expr
 for (i in 1:nrow(myData)) {     ww <- which(abs(refData$pos - myData$pos[i]) <= 2)     myData$newColumn[i] <- paste(refData[ww, "id"], collapse = ", ") }
                 myData$newColumn2 <- sapply(myData$pos, function(x) paste(refData$id[refData$pos >=      x - 2 & refData$pos <= x + 2], collapse = ", "))
                                    myData$newColumn3 <- lapply(myData$pos, function(x) paste(refData$id[abs(refData$pos -      x) < 3], collapse = ", "))
      min       lq     mean   median       uq       max neval cld
 62.97657 64.74155 70.01541 68.81024 71.02023 206.80477   100   c
 46.55872 47.90585 50.75397 50.42333 53.42990  58.01813   100  b 
 36.69362 37.34244 39.70480 38.54905 42.49614  46.27513   100 a

在R

2 个答案: