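I have a named list representing a set of biological pathways, where the names are the pathway names and each vector in the list holds the proteins that belong to that pathway. A small example: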
ann <- structure(list(`GO:0000010` = c("Q33DR2", "Q9CZQ1", "D6RHT8",
"F6ZCX7", "B8JJX0", "Q33DR3", "F6T4Z4", "E0CYM9"), `GO:0000016` = c("Q5XLR9",
"Q3TZ78", "F8VPT3"), `GO:0000026` = c("Q8BTP0", "Q3TZM9", "A0A077K846",
"F6R220", "A0A077K9W9"), `GO:0000032` = c("Q924M7", "Q3V100",
"F6Q3K8", "Q921Z9"), `GO:0000033` = c("Q9DBE8", "F6RBY3", "Q8BMZ4",
"Q8K2A8", "F6XUH0", "D6RCW8", "Q6P8H8", "Q3URN2")), .Names = c("GO:0000010",
"GO:0000016", "GO:0000026", "GO:0000032", "GO:0000033"))
I am interested in pairs of pathways:
pairs <- t(combn(names(ann), 2))
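For each pair of pathways, I want all possible combinations of proteins in which protein #1 is in pathway #1 and protein #2 is in pathway #2. The desired output is a list of two-column matrices, where column #1 contains proteins from pathway #1 and column #2 contains proteins from pathway #2. So far, I have this: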
protein_pairs <- purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid(ann[[.x]], ann[[.y]])))
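However, because the total number of pairs I am interested in is very large (usually > 1,000), mapping expand.grid over all possible pairs takes a long time, on the order of several hours.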
Is there a faster way to get all possible protein combinations for each pair of biological pathways from this list?
Answer 0 (score: 1)
I think rep.int() runs faster than expand.grid(), as discussed in another question. Try the following:
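expand.grid.jc <- function(seq1, seq2) {
  # Var1 cycles through seq1 once per element of seq2;
  # Var2 repeats each element of seq2 length(seq1) times.
  cbind(Var1 = rep.int(seq1, length(seq2)),
        Var2 = rep.int(seq2, rep.int(length(seq1), length(seq2))))
}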
protein_pairs <- purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid.jc(ann[[.x]], ann[[.y]])))
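As a quick sanity check, here is a minimal sketch that compares the rep.int-based helper with expand.grid on a toy input. The vectors p1 and p2 are made-up placeholders, not proteins from the question, and the helper is repeated only so the snippet runs on its own:

```r
# Hypothetical toy inputs (not taken from the question's data)
p1 <- c("A", "B", "C")
p2 <- c("X", "Y")

# Same helper as above, repeated so this snippet is self-contained
expand.grid.jc <- function(seq1, seq2) {
  cbind(Var1 = rep.int(seq1, length(seq2)),
        Var2 = rep.int(seq2, rep.int(length(seq1), length(seq2))))
}

fast <- expand.grid.jc(p1, p2)
slow <- as.matrix(expand.grid(p1, p2, stringsAsFactors = FALSE))

# Both should be 6 x 2 character matrices with rows in the same order
stopifnot(identical(dim(fast), dim(slow)))
stopifnot(all(fast == slow))
```

Since cbind already returns a character matrix here, the as.matrix() wrapper in the map2 call should not be needed for this helper, though keeping it does no harm.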
Cheers!
Answer 1 (score: 1)
If you are after speed, you can easily implement an Rcpp version:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
CharacterMatrix fast2Expand(CharacterVector x, CharacterVector y) {
    unsigned long int lenX = x.size(), lenY = y.size();
    // Allocate the result without initializing it; every cell is filled below
    CharacterMatrix result = no_init_matrix(lenX * lenY, 2);
    // x varies fastest, matching the row order of expand.grid
    for (std::size_t i = 0, count = 0; i < lenY; ++i) {
        for (std::size_t j = 0; j < lenX; ++j, ++count) {
            result(count, 0) = x[j];
            result(count, 1) = y[i];
        }
    }
    return result;
}
It is roughly 10x faster than the original version and about 20% faster than the rep.int version (for this example):
microbenchmark(OP = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid(ann[[.x]], ann[[.y]]))),
Rcpp = purrr::map2(pairs[, 1], pairs[, 2], ~ fast2Expand(ann[[.x]], ann[[.y]])),
repInt = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid.jc(ann[[.x]], ann[[.y]]))))
Unit: microseconds
expr min lq mean median uq max neval
OP 1104.700 1136.4370 1536.4048 1188.9990 1481.4940 6730.960 100
Rcpp 105.505 126.9975 149.9009 138.1195 150.2015 663.146 100
repInt 133.044 151.0175 223.9815 165.5435 203.5335 1269.194 100
Here is a larger example, constructed from the OP's example purely to compare efficiency:
annBig <- lapply(1:5, function(x) rep(ann[[x]], 100))
names(annBig) <- names(ann)
microbenchmark(OP = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid(annBig[[.x]], annBig[[.y]]))),
Rcpp = purrr::map2(pairs[, 1], pairs[, 2], ~ fast2Expand(annBig[[.x]], annBig[[.y]])),
repInt = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid.jc(annBig[[.x]], annBig[[.y]]))), times = 20)
Unit: milliseconds
expr min lq mean median uq max neval
OP 522.56536 533.39393 562.60750 555.45345 588.4514 640.8584 20
Rcpp 48.12683 56.17155 92.30095 92.23838 125.8065 142.2949 20
repInt 80.28625 107.32329 140.32793 152.13732 160.9656 193.1310 20