Question

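I have a matrix of gene names with their expression values in different tissues. However, the analyses were run independently, and not all genes are present in all tissues. The gene lists for each tissue were simply pasted underneath each other. It now looks like this: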

 GeneName   Tissue A   Tissue B
 Gene A     1          -
 Gene B     1          -
 Gene C     2          -
 Gene A     -          3
 Gene D     -          2

I would like to collapse the duplicated gene names so that I get a matrix like this:

 GeneName   Tissue A   Tissue B
 Gene A     1          3
 Gene B     1          -
 Gene C     2          -
 Gene D     -          2

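Edit: Thanks for the answers. However, I failed to mention that the gene names are in their own column and that the row names are just the numbers 1-n. I tried to set the name column as the row names with row.names(mydataframe) <- mydataframe$GeneName, but got the following error message: Error in row.names<-.data.frame(*tmp* ... As far as I understand, I cannot use a column with non-unique values as row names. If I need to name the rows after the gene name column in order to collapse the matrix, that seems to leave me in a catch-22?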

Answer 1

Assuming that the missing values are 'NA' and that the 'Tissue.B' value for 'Gene D' in the output is 2, you could use

 res <- rowsum(m1, row.names(m1), na.rm=TRUE)
 is.na(res) <- res==0
 res
 #       Tissue.A Tissue.B
 #Gene A        1        3
 #Gene B        1       NA
 #Gene C        2       NA
 #Gene D       NA        2
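Regarding the catch-22 raised in the question's edit: `rowsum` only needs a grouping vector, so the duplicated gene names never have to become row names. The following is a minimal sketch, assuming the data frame has the columns `GeneName`, `Tissue.A` and `Tissue.B` as in the example data; the name `res_df` is just illustrative.

```r
# Example data: gene names live in their own column, row names are 1..n
df1 <- data.frame(
  GeneName = c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
  Tissue.A = c(1L, 1L, 2L, NA, NA),
  Tissue.B = c(NA, NA, NA, 3L, 2L)
)

# Sum the numeric columns per gene; the gene names are passed as the
# grouping vector, so duplicated names are not a problem here
res_df <- rowsum(df1[c("Tissue.A", "Tissue.B")],
                 group = df1$GeneName, na.rm = TRUE)

# Groups that contained only NA come back as 0; turn them into NA again
# (note that this also converts genuine zeros, if the data has any)
res_df[res_df == 0] <- NA
res_df
```

The result has one row per gene, with the (now unique) gene names as row names.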

If it is a data.frame with 'GeneName' as a column

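 library(dplyr)
 df1 %>%
    group_by(GeneName) %>%
    # across() (dplyr >= 1.0.0) replaces summarise_each()/funs(),
    # which are deprecated in current dplyr
    summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
 #    GeneName Tissue.A Tissue.B
 #1   Gene A        1        3
 #2   Gene B        1        0
 #3   Gene C        2        0
 #4   Gene D        0        2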

We can replace the 0 with NA as before.
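One way to turn the zeros in that output back into NA is sketched below. To keep the snippet self-contained, the summarised result is typed in by hand as a plain data.frame called `res` (an illustrative name); the replacement is restricted to the numeric columns so that `GeneName` is left alone.

```r
# Summarised result as shown above, with 0 where a gene had no values
res <- data.frame(
  GeneName = c("Gene A", "Gene B", "Gene C", "Gene D"),
  Tissue.A = c(1, 1, 2, 0),
  Tissue.B = c(3, 0, 0, 2)
)

# Pick out the numeric columns and replace every 0 in them with NA
num_cols <- sapply(res, is.numeric)
res[num_cols] <- lapply(res[num_cols], function(x) replace(x, x == 0, NA))
res
```

This treats every 0 as "missing", so it is only appropriate if a true expression value of 0 cannot occur in the data.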

Or using aggregate from base R

  aggregate(.~GeneName, df1, sum, na.rm=TRUE, na.action=NULL)
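If genuine zeros need to stay distinguishable from "not measured", the zero-to-NA step can be avoided by summing with a small helper that returns NA when a group has no non-missing values. This is a sketch, not part of the original answer: `sum_or_na` is a hypothetical helper name, and `na.action = na.pass` is used here so that rows containing NA are kept when the formula is evaluated.

```r
df1 <- data.frame(
  GeneName = c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
  Tissue.A = c(1L, 1L, 2L, NA, NA),
  Tissue.B = c(NA, NA, NA, 3L, 2L)
)

# Hypothetical helper: NA if every value in the group is missing,
# otherwise the sum of the non-missing values
sum_or_na <- function(x) {
  if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
}

# na.pass keeps rows with NA so that each tissue column is summed per gene
res <- aggregate(. ~ GeneName, data = df1, FUN = sum_or_na,
                 na.action = na.pass)
res
```

With the example data this yields one row per gene, with NA rather than 0 wherever a gene was absent from a tissue.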

Data

 m1 <- structure(c(1L, 1L, 2L, NA, NA, NA, NA, NA, 3L, 2L), .Dim = c(5L, 
 2L), .Dimnames = list(c("Gene A", "Gene B", "Gene C", "Gene A", 
"Gene D"), c("Tissue.A", "Tissue.B")))

 df1 <- structure(list(GeneName = c("Gene A", "Gene B", "Gene C",
  "Gene A", 
 "Gene D"), Tissue.A = c(1L, 1L, 2L, NA, NA), Tissue.B = c(NA, 
 NA, NA, 3L, 2L)), .Names = c("GeneName", "Tissue.A", "Tissue.B"
 ), class = "data.frame", row.names = c(NA, -5L))
