Find the common genes across several datasets and sort them

Time: 2018-01-05 13:36:37

Tags: r dataframe

Original data:

# reproducible example
df1 <- read.table(text = "
Data1   Data2   Data3   Data4
gene6    -  gene1   gene1
-  gene2    -  gene2
gene3    -  gene3   gene5
gene2   gene4   gene2   gene4
gene4   gene1   gene5   -
gene1   gene3   gene6   gene3",
                  header = TRUE, stringsAsFactors = FALSE)
df1
#   Data1 Data2 Data3 Data4
# 1 gene6     - gene1 gene1
# 2     - gene2     - gene2
# 3 gene3     - gene3 gene5
# 4 gene2 gene4 gene2 gene4
# 5 gene4 gene1 gene5     -
# 6 gene1 gene3 gene6 gene3

The expected output should look like this:

    Data1   Data2   Data3   Data4
1   gene1   gene1   gene1   gene1
2   gene2   gene2   gene2   gene2
3   gene3   gene3   gene3   gene3
4   gene4   gene4    -      gene4
5    -      -       gene5   gene5
6   gene6    -      gene6    -

4 answers:

Answer 0 (score: 4)

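Next time you ask a question, please see the link @Sotos provided in the comments; as it stands, it is not clear what you are asking. Still, I liked the puzzle, so I gave it a try anyway. Here is one possible answer: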

# Sample data
df=read.table(text="Data1   Data2   Data3   Data4
gene6    -  gene1   gene1
-  gene2    -  gene2
gene3    -  gene3   gene5
gene2   gene4   gene2   gene4
gene4   gene1   gene5   -
gene1   gene3   gene6   gene3",header=T,stringsAsFactors=F)

# Helper function
completecolumn <- function(x,allgenes)
{
  allgenes[!allgenes %in% x]='-'
  return(allgenes)
}

# apply our function
allgenes=sort(setdiff(unique(unlist(df)), "-"))
df = do.call(cbind,lapply(df,completecolumn,allgenes))

Output:

     Data1   Data2   Data3   Data4  
[1,] "gene1" "gene1" "gene1" "gene1"
[2,] "gene2" "gene2" "gene2" "gene2"
[3,] "gene3" "gene3" "gene3" "gene3"
[4,] "gene4" "gene4" "-"     "gene4"
[5,] "-"     "-"     "gene5" "gene5"
[6,] "gene6" "-"     "gene6" "-"    
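
Note that cbind returns a character matrix, while the expected output in the question looks like a data frame. A minimal sketch of the same idea that keeps a data frame, assuming the sample data above; the name res is only illustrative. Plain sort() orders names alphabetically, so with ten or more genes "gene10" would come before "gene2".

```r
# Same sample data as in the question
df <- read.table(text = "Data1 Data2 Data3 Data4
gene6 - gene1 gene1
- gene2 - gene2
gene3 - gene3 gene5
gene2 gene4 gene2 gene4
gene4 gene1 gene5 -
gene1 gene3 gene6 gene3", header = TRUE, stringsAsFactors = FALSE)

# All distinct genes, without the "-" placeholder, in alphabetical order
allgenes <- sort(setdiff(unique(unlist(df)), "-"))

# For each column, keep a gene where it is present and put "-" elsewhere
res <- as.data.frame(
  lapply(df, function(x) ifelse(allgenes %in% x, allgenes, "-")),
  stringsAsFactors = FALSE
)
res
```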

Hope this helps.

Answer 1 (score: 2)

Here is another idea, using mixedsort from the gtools package:

library(gtools)

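# df is the original data frame, with "-" marking missing genes
# Unique gene names in natural order (gene2 before gene10)
i1 <- mixedsort(unique(df[df != '-']))
# Look up every gene in each column; genes absent from a column become NA
sapply(df, function(i) i[match(i1, i)])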

which gives

     Data1   Data2   Data3   Data4  
[1,] "gene1" "gene1" "gene1" "gene1"
[2,] "gene2" "gene2" "gene2" "gene2"
[3,] "gene3" "gene3" "gene3" "gene3"
[4,] "gene4" "gene4" NA      "gene4"
[5,] NA      NA      "gene5" "gene5"
[6,] "gene6" NA      "gene6" NA     
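
To match the expected output exactly, the NA entries can be replaced with "-". A minimal sketch, assuming df holds the original data from the question. It orders the genes by their numeric suffix in base R instead of calling mixedsort, so that it runs without gtools, and it assumes every name has the form gene<number>.

```r
# Same sample data as in the question
df <- read.table(text = "Data1 Data2 Data3 Data4
gene6 - gene1 gene1
- gene2 - gene2
gene3 - gene3 gene5
gene2 gene4 gene2 gene4
gene4 gene1 gene5 -
gene1 gene3 gene6 gene3", header = TRUE, stringsAsFactors = FALSE)

# Unique gene names, ordered by numeric suffix (assumes "gene<number>")
genes <- unique(df[df != "-"])
i1 <- genes[order(as.integer(sub("^gene", "", genes)))]

# Align every column to the ordered gene list; absent genes become NA
res <- sapply(df, function(i) i[match(i1, i)])

# Replace NA with the "-" placeholder used in the question
res[is.na(res)] <- "-"
res
```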

Answer 2 (score: 2)

Not your desired output, but possibly more useful:

library(tidyr)

# wide-to-long format, then table with margins to see "common" gene counts
addmargins(table(gather(df1)))

#        value
# key      - gene1 gene2 gene3 gene4 gene5 gene6 Sum
#   Data1  1     1     1     1     1     0     1   6
#   Data2  2     1     1     1     1     0     0   6
#   Data3  1     1     1     1     0     1     1   6
#   Data4  1     1     1     1     1     1     0   6
#   Sum    5     4     4     4     3     2     2  24
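
In the table, the genes with a Sum of 4 are the ones present in every dataset. If only that list is needed, a base R sketch could intersect the columns directly. It assumes df1 is the data frame from the question; the name common is only illustrative.

```r
# Same sample data as in the question
df1 <- read.table(text = "Data1 Data2 Data3 Data4
gene6 - gene1 gene1
- gene2 - gene2
gene3 - gene3 gene5
gene2 gene4 gene2 gene4
gene4 gene1 gene5 -
gene1 gene3 gene6 gene3", header = TRUE, stringsAsFactors = FALSE)

# Intersect all columns pairwise, then drop the "-" placeholder and sort
common <- sort(setdiff(Reduce(intersect, df1), "-"))
common
```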

Answer 3 (score: 2)

An alternative solution using a bit of reshaping:

df=read.table(text="Data1   Data2   Data3   Data4
gene6    -  gene1   gene1
-  gene2    -  gene2
gene3    -  gene3   gene5
gene2   gene4   gene2   gene4
gene4   gene1   gene5   -
gene1   gene3   gene6   gene3",header=T,stringsAsFactors=F)

library(tidyverse)

df %>%
  gather() %>%
  filter(value != "-") %>%
  mutate(id = as.integer(substr(value, 5, nchar(value)))) %>%
  spread(key, value) %>%
  select(-id)

#   Data1 Data2 Data3 Data4
# 1 gene1 gene1 gene1 gene1
# 2 gene2 gene2 gene2 gene2
# 3 gene3 gene3 gene3 gene3
# 4 gene4 gene4  <NA> gene4
# 5  <NA>  <NA> gene5 gene5
# 6 gene6  <NA> gene6  <NA>