R: Subset based on the overlap of multiple files

Asked: 2015-04-27 20:17:37

Tags: r

In R, I would like to do the following:

I have a list, gene.list, containing 5 data frames, each of which looks like this:

col1
name1
name2
name3
...

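First, I want to extract the overlap of these five data frames. The result must be a new data frame: output

I have another list, called coverage.list, containing 11 data frames. Each data frame looks like this: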

col1     col2    col3
name1-a  1       2
name2-c  3       4
name3-d  5       6
name4-e  7       8

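Now, from each data frame in coverage.list, I want to extract the rows whose col1 value starts with a value present in the output data frame created in the previous step. The result should be a new list named coverage.new.list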

For the first step, extracting the overlap of the 5 data frames, I tried using

Reduce(intersect, coverage.list)

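but I get the message 'data frame with 0 columns and 0 rows'. However, when I use the venn function on this list, I get the correct overlap counts.

Can you point me to the right solution?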

1 Answer

Answer 0 (score: 1)

I think this is what you are looking for:

library(dplyr)
library(tidyr)

# Inner join on the gene.list tables: join gene.list[[1]] with gene.list[[2]],
#  then join the result with gene.list[[3]], then with gene.list[[4]],
#  and finally with gene.list[[5]]. Only values present in all five remain.

output <- inner_join(gene.list[[1]], gene.list[[2]]) %>% inner_join(gene.list[[3]]) %>% 
  inner_join(gene.list[[4]]) %>% inner_join(gene.list[[5]])

coverage.list.new <- lapply(coverage.list, function(x) {x %>% mutate(backup=col1) %>%
     separate(col1, c("col1", "col1_2"), sep="-") %>% filter(col1 %in% output$col1) %>%
     mutate(col1=backup) %>% select(-c(backup, col1_2))})
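
For the first step (the overlap of the gene.list data frames), the Reduce(intersect, ...) idea from the question can also be made to work in base R if it is applied to the col1 vectors rather than to whole data frames. The following is only a minimal sketch, assuming every data frame has a column named col1 as in the question; gene.names and common are names introduced here for illustration, and the data mirrors the sample data in this answer.

```r
# Sample gene.list: one data frame plus four identical ones (as in the sample data)
first  <- data.frame(col1 = c("name1", "name2", "name3"), stringsAsFactors = FALSE)
others <- data.frame(col1 = c("name1", "name3", "name4"), stringsAsFactors = FALSE)
gene.list <- c(list(first), rep(list(others), 4))

# Pull col1 out of every data frame as a plain character vector
gene.names <- lapply(gene.list, function(d) as.character(d$col1))

# Intersect the vectors pairwise, left to right
common <- Reduce(intersect, gene.names)

# Wrap the shared names in a data frame, as the question asks for
output <- data.frame(col1 = common, stringsAsFactors = FALSE)
print(output)
```

With this sample data, output holds name1 and name3, the only names present in all five data frames.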

Update

coverage.list.new <- lapply(coverage.list, function(x) {x %>% 
     mutate(backup=col1, col1=sub("-", "@", col1)) %>%
     separate(col1, c("col1", "col1_2"), sep="@") %>% filter(col1 %in% output$col1) %>%
     mutate(col1=backup) %>% select(-c(backup, col1_2))})
# The sub("-", "@", col1) call inside mutate() replaces only the first "-" with "@",
# so that col1 can then be split on "@". If your col1 already contains "@",
# choose a symbol that does not occur in col1 and use it in place of "@"
# in the code above.
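
If you would rather avoid the temporary backup column, the same filter can be sketched in base R by stripping everything from the first "-" onward and comparing what is left against output$col1. This is an alternative under the same assumption as above (the gene name is whatever precedes the first hyphen in col1), not the dplyr code itself; prefix is a name introduced here, and output is hard-coded to the overlap of the sample data.

```r
# Overlap from the first step, hard-coded here so the sketch runs on its own
output <- data.frame(col1 = c("name1", "name3"), stringsAsFactors = FALSE)

coverage.list <- list(
  data.frame(col1 = c("name1-a", "name2-c", "name3-d", "name4-e"),
             col2 = c(1, 3, 5, 7), col3 = c(2, 4, 6, 8),
             stringsAsFactors = FALSE),
  data.frame(col1 = c("name3-a", "name4-c", "name3-d", "name4-e"),
             col2 = c(1, 3, 5, 7), col3 = c(2, 4, 6, 8),
             stringsAsFactors = FALSE)
)

coverage.list.new <- lapply(coverage.list, function(x) {
  # Drop the first "-" and everything after it, leaving the gene name
  prefix <- sub("-.*$", "", as.character(x$col1))
  # Keep only rows whose gene name is in the overlap; col1 itself is untouched
  x[prefix %in% output$col1, , drop = FALSE]
})
print(coverage.list.new)
```

On the sample data this keeps the name1-a and name3-d rows of the first data frame and the name3-a and name3-d rows of the second.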

Sample data

gene.list <- list(data.frame(col1=c("name1", "name2", "name3")),
              data.frame(col1=c("name1", "name3", "name4")),
              data.frame(col1=c("name1", "name3", "name4")),
              data.frame(col1=c("name1", "name3", "name4")),
              data.frame(col1=c("name1", "name3", "name4")))

coverage.list <- list(data.frame(col1=c("name1-a", "name2-c", "name3-d", "name4-e"), 
                             col2=c(1, 3, 5, 7), col3=c(2, 4, 6, 8)),
                  data.frame(col1=c("name3-a", "name4-c", "name3-d", "name4-e"), 
                             col2=c(1, 3, 5, 7), col3=c(2, 4, 6, 8)))