Question

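Hoping someone can help. I have a large number of ortholog mappings to do in R, and it is proving very time-consuming. I have posted an example structure below. I have already tried the obvious answers, such as iterating row by row (for i in 1:nrow(df)) with string splitting, or using sapply, and both are very slow. I am therefore hoping for a vectorized option.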

options(stringsAsFactors = FALSE)

# example accession mapping
map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"), 
                  target = c("a b", "c", "d e f", "g", "h i"))

# example protein list
df <- data.frame(sourceIDs = c("1 2", "3", "4", "5", "8 9"))

# now, map df$sourceIDs to map$target


# expected output
> matches
[1] "a b c" "d e f" "g"     ""      "g h i"

I appreciate any help!

Answer 1

In most cases, the best approach to this kind of problem is to create data.frames with one observation per row.

# split every space-separated cell into a character vector
map_split <- lapply(map, strsplit, split = ' ')
# build all source/target pairs for each row of the original map
long_mappings <- mapply(expand.grid, map_split$source, map_split$target, SIMPLIFY = FALSE)
all_map <- do.call(rbind, long_mappings)
names(all_map) <- c('source', 'target')

Now all_map looks like this:

   source target
1       1      a
2       1      b
3       2      c
4       4      c
5       3      d
6       3      e
7       3      f
8       4      g
9       6      g
10      8      g
11      9      h
12      9      i

Do the same for df:

# index records which row of df each individual ID came from
sourceIDs_split <- strsplit(df$sourceIDs, ' ')
df_long <- data.frame(
  index = rep(seq_along(sourceIDs_split), lengths(sourceIDs_split)),
  source = unlist(sourceIDs_split)
)

This gives df_long:

  index source
1     1      1
2     1      2
3     2      3
4     3      4
5     4      5
6     5      8
7     5      9

Now they just need to be merged and collapsed.
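
The answer as preserved here does not include code for the merge-and-collapse step, so what follows is only a minimal sketch of one way it could be done, not necessarily the answerer's own code. It rebuilds the two long tables from the example data and then uses base R's merge and tapply. The names merged, collapsed and matches are my own choices, and sorting and de-duplicating the targets within each row is an assumption made to keep the output deterministic.

```r
# Example data from the question
map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"),
                  target = c("a b", "c", "d e f", "g", "h i"),
                  stringsAsFactors = FALSE)
df <- data.frame(sourceIDs = c("1 2", "3", "4", "5", "8 9"),
                 stringsAsFactors = FALSE)

# Long form of the mapping: one (source, target) pair per row
map_split <- lapply(map, strsplit, split = " ")
long_mappings <- mapply(expand.grid, map_split$source, map_split$target,
                        MoreArgs = list(stringsAsFactors = FALSE),
                        SIMPLIFY = FALSE)
all_map <- do.call(rbind, long_mappings)
names(all_map) <- c("source", "target")

# Long form of the query: one ID per row, tagged with its original row index
sourceIDs_split <- strsplit(df$sourceIDs, " ")
df_long <- data.frame(
  index = rep(seq_along(sourceIDs_split), lengths(sourceIDs_split)),
  source = unlist(sourceIDs_split),
  stringsAsFactors = FALSE
)

# Merge: inner join on the ID, so IDs without a mapping (here "5") drop out
merged <- merge(df_long, all_map, by = "source")

# Collapse: paste the targets back together per original row
# (sorted and de-duplicated; this ordering is an assumption of the sketch)
collapsed <- tapply(merged$target, merged$index,
                    function(x) paste(sort(unique(x)), collapse = " "))

# Rows with no match at all keep the empty string
matches <- character(nrow(df))
matches[as.integer(names(collapsed))] <- as.vector(collapsed)
matches
```

With the sample data, ID 4 appears in two rows of map ("2 4" and "4 6 8"), so this sketch returns "c g" for the third entry instead of the "g" shown in the question's expected output. The other four entries agree with it.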

Optimizing matching in R