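I have a matrix of gene names with their expression values in different tissues. However, the analyses were run independently, and not all genes are present in all tissues. The gene lists for each tissue were simply pasted one below the other. It now looks like this: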
GeneName  Tissue A  Tissue B
Gene A    1         -
Gene B    1         -
Gene C    2         -
Gene A    -         3
Gene D    -         2
I want to collapse the duplicated gene names so that I get a matrix like this:
GeneName  Tissue A  Tissue B
Gene A    1         3
Gene B    1         -
Gene C    2         -
Gene D    -         2
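Edit: Thanks for the answers. However, I forgot to mention that the gene names are in their own column, and the row names are just the numbers 1-n. I tried to set the name column as the row names with
row.names(mydataframe) <- mydataframe$GeneName
but received the following error message:
Error in row.names<-.data.frame(tmp { {1}}
As far as I understand, I cannot use a column with non-unique values as row names. If I need to name the rows after the gene name column in order to collapse the matrix, this seems to leave me in a catch-22?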
Answer 0 (score: 3)
Assuming the missing values are 'NA' and the 'Tissue.B' value for 'Gene D' in the output is 2, you can use
res <- rowsum(m1, row.names(m1), na.rm=TRUE)
is.na(res) <- res==0
res
# Tissue.A Tissue.B
#Gene A 1 3
#Gene B 1 NA
#Gene C 2 NA
#Gene D NA 2
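The row-name problem from the question's edit can be sidestepped, because `rowsum` takes the grouping vector as a separate argument and also has a data.frame method. The following is a minimal sketch, assuming the gene names are in a column called `GeneName` and every other column is numeric. The data frame is rebuilt inline with the same values as the example data, only to keep the block self-contained.

```r
# Example data: gene names in their own column, default row names 1..n
df1 <- data.frame(
  GeneName = c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
  Tissue.A = c(1L, 1L, 2L, NA, NA),
  Tissue.B = c(NA, NA, NA, 3L, 2L)
)

# Sum the numeric columns per gene; the grouping comes from the column,
# so duplicated gene names never have to become row names
res <- rowsum(df1[-1], group = df1$GeneName, na.rm = TRUE)

# Groups with only missing values sum to 0; turn those back into NA
# (caveat: a genuine sum of 0 would also become NA)
res[res == 0] <- NA
res
```

The result has one row per unique gene, with the gene names as row names, which is allowed here because they are now unique.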
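If it is a data.frame with 'GeneName' as a column
library(dplyr)
# across() is the dplyr >= 1.0 replacement for the older summarise_each(funs(...)) form
df1 %>%
  group_by(GeneName) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))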
# GeneName Tissue.A Tissue.B
#1 Gene A 1 3
#2 Gene B 1 0
#3 Gene C 2 0
#4 Gene D 0 2
We can replace the 0 with NA as before.
Or use aggregate from base R
aggregate(.~GeneName, df1, sum, na.rm=TRUE, na.action=NULL)
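Summing with `na.rm=TRUE` turns an all-missing group into 0, which cannot be told apart from a real expression value of 0. One possible way around this is sketched below in base R. The helper name `sum_or_na` is hypothetical (not part of any package), and `na.action = na.pass` is used here so that rows containing NA reach the summary function.

```r
# Example data with the same values as df1 in this answer
df1 <- data.frame(
  GeneName = c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
  Tissue.A = c(1L, 1L, 2L, NA, NA),
  Tissue.B = c(NA, NA, NA, 3L, 2L)
)

# Hypothetical helper: NA if the whole group is missing, otherwise the sum
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)
}

# na.pass keeps rows with NA so the helper sees every value in each group
res <- aggregate(. ~ GeneName, data = df1, FUN = sum_or_na, na.action = na.pass)
res
```

With this variant, no separate step to convert 0 back to NA is needed, and true zeros in the data are preserved.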
# Example data used above
# m1: matrix with (duplicated) gene names as row names
m1 <- structure(c(1L, 1L, 2L, NA, NA, NA, NA, NA, 3L, 2L), .Dim = c(5L, 2L),
  .Dimnames = list(c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
    c("Tissue.A", "Tissue.B")))
# df1: data.frame with gene names in the 'GeneName' column
df1 <- structure(list(
  GeneName = c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
  Tissue.A = c(1L, 1L, 2L, NA, NA),
  Tissue.B = c(NA, NA, NA, 3L, 2L)),
  .Names = c("GeneName", "Tissue.A", "Tissue.B"),
  class = "data.frame", row.names = c(NA, -5L))