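I'm not sure where to start and could use some pointers.
I have several objects, each a character vector of a different length containing gene names. I want to compare all of the objects pairwise and get the number of genes shared between each pair of lists (for example with intersect()). I then want to store all of the pairwise comparisons in a matrix to make a heatmap.
However, I'm not sure how best to run the comparisons or how to store the results. Should I first combine all of the objects into a data frame?
I have 24 objects named names_something: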
> length(names_G63)
[1] 4518
> head(names_G63)
[1] "SARC_00002" "SARC_00004" "SARC_00005" "SARC_00012" "SARC_00022" "SARC_00025"
> length(names_C28)
[1] 9190
> head(names_C28)
[1] "SARC_00001" "SARC_00002" "SARC_00003" "SARC_00004" "SARC_00005" "SARC_00008"
A comparison gives a single number, the count of genes shared between the two lists:
> length(intersect(names_G63, names_C28))
[1] 4097
I would like to store these numbers in a matrix, for example:
      G63  C28 B124
G63     0
C28  4097    0
B124 3000  345    0
Answer 0 (score: 2)
I think you are looking for something like this: a matrix that tells you how many genes are shared between each pair of experiments/sets.
#First, a vector of all genes (1500 random five-letter names)
genes <- unlist(lapply(1:1500, function(x) paste(sample(LETTERS, 5, replace = TRUE), collapse = "")))
#Now five pseudo-experiments, each yielding a set of 100 random genes from the vector above
geneList <- lapply(1:5, function(x) sample(genes, 100))
#Now we turn the list of gene sets into a two-column data frame (values = gene, ind = experiment)
genedf <- stack(setNames(geneList, nm = paste("Expt", seq_along(geneList))))
#Use the cross product of the experiment-by-gene table to count the overlaps
table(genedf[2:1]) %*% t(table(genedf[2:1]))
#         ind
# ind      Expt 1 Expt 2 Expt 3 Expt 4 Expt 5
#   Expt 1    100      8      5      7      7
#   Expt 2      8    100      5      5     10
#   Expt 3      5      5    100      8      4
#   Expt 4      7      5      8    100      8
#   Expt 5      7     10      4      8    100
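The cross product works because table() on the stacked data frame yields a sets-by-genes incidence table, so entry [i, j] of the product sums, over all genes, whether a gene is in both set i and set j. A minimal sketch on toy sets small enough to check by hand; the names sets, inc and shared are made up for illustration, and it assumes no gene is repeated within a set (otherwise wrap each vector in unique() first):

```r
# Hypothetical toy sets (not from the original post)
sets <- list(A = c("g1", "g2", "g3"),
             B = c("g2", "g3", "g4"),
             C = c("g5"))

# stack() gives one row per (gene, set) pair; table() turns that into a
# sets x genes incidence table (0/1 entries when no gene repeats within a set)
inc <- table(stack(sets)[2:1])

# Entry [i, j] sums inc[i, g] * inc[j, g] over genes g, i.e. counts shared genes;
# the diagonal holds the size of each set
shared <- inc %*% t(inc)
shared

# The same count computed directly for one pair
length(intersect(sets$A, sets$B))
```

The off-diagonal entry for A and B should agree with the direct intersect() count, which is a quick sanity check before running this on the full data.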
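#To apply this to your own data, collect every object whose name starts with "names_" into a named list
geneList <- mget(ls(pattern = "^names_"))
#Optionally drop the prefix so the matrix is labelled G63, C28, ...
names(geneList) <- sub("^names_", "", names(geneList))
#Stack into a two-column data frame and take the cross product as before
genedf <- stack(geneList)
table(genedf[2:1]) %*% t(table(genedf[2:1]))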
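For the layout and heatmap asked about in the question, something along these lines might work. It is only a sketch: the gene sets are randomly generated stand-ins for the names_* vectors, gene_sets, shared and lower are illustrative names, and the counts are computed directly with intersect() rather than with the cross product.

```r
# Hypothetical toy gene sets standing in for the real names_* vectors
set.seed(1)
pool <- sprintf("SARC_%05d", 1:200)
gene_sets <- list(G63  = sample(pool, 60),
                  C28  = sample(pool, 120),
                  B124 = sample(pool, 90))

# Pairwise shared-gene counts, computed directly with intersect()
shared <- sapply(gene_sets, function(a)
  sapply(gene_sets, function(b) length(intersect(a, b))))

# Layout from the question: zero diagonal, lower triangle only
lower <- shared
diag(lower) <- 0
lower[upper.tri(lower)] <- NA
print(lower, na.print = "")

# Heatmap of the full symmetric matrix, without clustering or rescaling
heatmap(shared, Rowv = NA, Colv = NA, scale = "none", symm = TRUE)
```

With 24 sets the nested sapply() runs 576 intersect() calls, which is still fast. The heatmap uses the full symmetric matrix because heatmap() expects a complete numeric matrix; the diagonal (set sizes) can dominate the colour scale, so setting it to NA or 0 beforehand may give a more readable plot.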