Question

I'm trying to find out whether the strings in one character vector occur in the strings of another, and I'm looking for a fast way to do this in R.

Specifically, my character alphabet is the amino acids:

aa.LETTERS <- c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T')

And I have a vector of peptide sequences and a vector of protein sequences:

set.seed(1)
peptides.vec <- sapply(1:100,function(p) paste(aa.LETTERS[sample(20,ceiling(runif(1,8,12)),replace=TRUE)],collapse=""))
proteins.vec <- sapply(1:1000,function(p) paste(aa.LETTERS[sample(20,ceiling(runif(1,200,400)),replace=TRUE)],collapse=""))

For each peptide sequence in peptides.vec, I want to see whether it occurs in any of the sequences in proteins.vec.

This is one obvious way of doing it:

mapping.mat <- do.call(rbind,lapply(peptides.vec,function(p){
   grepl(p,proteins.vec)
}))

Another is to use the Biostrings Bioconductor package:

library(Biostrings)
peptides.set <- AAStringSet(x=peptides.vec)
proteins.set <- AAStringSet(x=proteins.vec)
mapping.mat <- vcountPDict(peptides.set,proteins.set)

Both are quite slow for the dimensions I'm working with:

> microbenchmark(do.call(rbind,lapply(peptides.vec,function(p){
   grepl(p,proteins.vec)
 })),times=100)
Unit: milliseconds
                                                                             expr      min       lq     mean   median       uq      max neval
 do.call(rbind, lapply(peptides.vec, function(p) {     grepl(p, proteins.vec) })) 477.2509 478.8714 482.8937 480.4398 484.3076 509.8098   100
> microbenchmark(vcountPDict(peptides.set,proteins.set),times=100)
Unit: milliseconds
                                    expr    min       lq     mean   median       uq      max neval
 vcountPDict(peptides.set, proteins.set) 283.32 284.3334 285.0205 284.7867 285.2467 290.6725   100

Any idea how to do this faster?

Answer 1

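As mentioned in my comment, adding fixed = TRUE to the grepl call should give some performance improvement, and "stringi" is likely to give a good boost as well.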

Here are some tests:

fixed = TRUE

Go with "stringi"!
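
The two suggestions above can be sketched as follows. This is a minimal illustration, not the original test code: it assumes the "stringi" package is installed (that part is skipped otherwise), uses smaller inputs than the question so it runs quickly, and the helper random.seq is a hypothetical name introduced only for this example. Because the peptides contain only uppercase letters, literal (fixed) matching should return the same matrix as the regular-expression version.

```r
# Small reproducible input, built in the same spirit as in the question.
aa.LETTERS <- c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T')
set.seed(1)

# Hypothetical helper: one random sequence with a length between the two bounds.
random.seq <- function(min.len, max.len) {
  len <- ceiling(runif(1, min.len, max.len))
  paste(sample(aa.LETTERS, len, replace = TRUE), collapse = "")
}
peptides.vec <- vapply(1:20, function(i) random.seq(8, 12), character(1))
proteins.vec <- vapply(1:50, function(i) random.seq(200, 400), character(1))

# Plant one known hit so the result is not all FALSE.
proteins.vec[1] <- paste0(proteins.vec[1], peptides.vec[1])

# Baseline from the question: each peptide is treated as a regular expression.
map.regex <- do.call(rbind, lapply(peptides.vec, function(p) {
  grepl(p, proteins.vec)
}))

# Suggestion 1: fixed = TRUE matches the peptide as a literal string.
map.fixed <- do.call(rbind, lapply(peptides.vec, function(p) {
  grepl(p, proteins.vec, fixed = TRUE)
}))

# Suggestion 2: stringi's literal matcher (note the argument order: string, pattern).
if (requireNamespace("stringi", quietly = TRUE)) {
  map.stringi <- do.call(rbind, lapply(peptides.vec, function(p) {
    stringi::stri_detect_fixed(proteins.vec, p)
  }))
}
```

Each result is a logical matrix with one row per peptide and one column per protein. Timings depend on the data and the machine, so it is worth wrapping each variant in microbenchmark on the real inputs before settling on one.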
