Reshaping a data frame that has multiple columns per experiment

Time: 2016-01-25 16:14:46

Tags: r dataframe

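I have a number of sequencing experiments, and each experiment has several results for each of a few hundred genes. The output I get from another program is not in a useful format: every experiment and every result is listed across the top, with one row per gene. Below is a sample dataset together with how I currently solve the problem, but because my real dataset is very large I would like a more efficient approach.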

 col1<- c("","", "gene1", "gene2", "gene3", "gene4")
 col2<- c("Experiment1", "Part 1", "a","b","c","d")
 col3<- c("Experiment1", "Part 2", "e", "f", "g", "h")
 col4<- c("Experiment2", "Part 1", "i", "j", "k", "l")
 col5<- c("Experiment2", "Part 2", "m", "n", "o", "p")
 pp<- data.frame(col1,col2,col3,col4,col5)
 one<-data.frame(pp$col1, pp$col2)
 onetwo<- data.frame(pp$col1,pp$col3)
 two<-data.frame(pp$col1, pp$col4)
 twotwo<-data.frame(pp$col1,pp$col5)

 one$V3[3:6]<-as.character(one[2,2])
 one<-one[-2,]
 one<-one[-1,]
 colnames(one)<- c("gene", "Experiment 1", "part")

 onetwo$V3[3:6]<-as.character(onetwo[2,2])
 onetwo<-onetwo[-2,]
 onetwo<-onetwo[-1,]
 colnames(onetwo)<- c("gene", "Experiment 1", "part")

 x1<-rbind(one, onetwo)

 two$V3[3:6]<-as.character(two[2,2])
 two<-two[-2,]
 two<-two[-1,]
 colnames(two)<- c("gene", "Experiment 2", "part")


 twotwo$V3[3:6]<-as.character(twotwo[2,2])
 twotwo<-twotwo[-2,]
 twotwo<-twotwo[-1,]
 colnames(twotwo)<- c("gene", "Experiment 2", "part")

 x2<-rbind(two, twotwo)

 x3<-merge(x1,x2)

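I apologize for the amount of code, but I could not describe the problem in words alone. pp is the sample data frame and x3 is the format I need. Is there a better way?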

1 answer:

Answer 0 (score: 0)

This may be a shorter approach:

pp.new <- as.data.frame(t(pp)[-1,], row.names = 1)
names(pp.new) <- c("experiment", "part", "gene1", "gene2", "gene3", "gene4")

which gives:

> pp.new
   experiment   part gene1 gene2 gene3 gene4
1 Experiment1 Part 1     a     b     c     d
2 Experiment1 Part 2     e     f     g     h
3 Experiment2 Part 1     i     j     k     l
4 Experiment2 Part 2     m     n     o     p
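
With a few hundred genes, typing the gene names by hand is not practical. The following is a minimal sketch of the same transpose idea that reads the gene names from the first column instead. The helper name transpose_experiments is hypothetical (it is not part of the answer above), and it assumes exactly two header rows (experiment, then part) with the gene names in the first column.

```r
# Rebuild the example input from the question:
# two header rows (experiment, part), then one row per gene.
pp <- data.frame(
  col1 = c("", "", "gene1", "gene2", "gene3", "gene4"),
  col2 = c("Experiment1", "Part 1", "a", "b", "c", "d"),
  col3 = c("Experiment1", "Part 2", "e", "f", "g", "h"),
  col4 = c("Experiment2", "Part 1", "i", "j", "k", "l"),
  col5 = c("Experiment2", "Part 2", "m", "n", "o", "p"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: transpose the sheet and take the gene names
# from the first column rather than hard-coding them.
transpose_experiments <- function(df) {
  m <- as.matrix(df)                  # character matrix of the whole sheet
  genes <- m[-(1:2), 1]               # gene names sit below the two header rows
  out <- as.data.frame(t(m[, -1, drop = FALSE]), stringsAsFactors = FALSE)
  names(out) <- c("experiment", "part", genes)
  rownames(out) <- NULL               # replace col2..col5 with 1..n
  out
}

pp.new <- transpose_experiments(pp)
```

The resulting pp.new should have the same layout as the table shown above, so the later steps can be applied to it unchanged.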

However, it is probably better to convert it to long format with the reshape2 package:

library(reshape2)
pp.long <- melt(pp.new, id=c("experiment","part"))

which results in:

> pp.long
    experiment   part variable value
1  Experiment1 Part 1    gene1     a
2  Experiment1 Part 2    gene1     e
3  Experiment2 Part 1    gene1     i
4  Experiment2 Part 2    gene1     m
5  Experiment1 Part 1    gene2     b
6  Experiment1 Part 2    gene2     f
7  Experiment2 Part 1    gene2     j
8  Experiment2 Part 2    gene2     n
9  Experiment1 Part 1    gene3     c
10 Experiment1 Part 2    gene3     g
11 Experiment2 Part 1    gene3     k
12 Experiment2 Part 2    gene3     o
13 Experiment1 Part 1    gene4     d
14 Experiment1 Part 2    gene4     h
15 Experiment2 Part 1    gene4     l
16 Experiment2 Part 2    gene4     p
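
If you would rather not depend on reshape2, the same long layout can be built in base R. This is only a sketch of an alternative, not what melt does internally. The name pp.long.base is made up for illustration, and the code assumes every column other than experiment and part is a gene column.

```r
# The wide table from the previous step, rebuilt here so the example stands alone.
pp.new <- data.frame(
  experiment = c("Experiment1", "Experiment1", "Experiment2", "Experiment2"),
  part       = c("Part 1", "Part 2", "Part 1", "Part 2"),
  gene1 = c("a", "e", "i", "m"),
  gene2 = c("b", "f", "j", "n"),
  gene3 = c("c", "g", "k", "o"),
  gene4 = c("d", "h", "l", "p"),
  stringsAsFactors = FALSE
)

# Assumption: everything that is not an id column is a gene column.
gene_cols <- setdiff(names(pp.new), c("experiment", "part"))

# Repeat the id columns once per gene and stack the gene columns end to end.
pp.long.base <- data.frame(
  experiment = rep(pp.new$experiment, times = length(gene_cols)),
  part       = rep(pp.new$part, times = length(gene_cols)),
  variable   = rep(gene_cols, each = nrow(pp.new)),
  value      = unlist(pp.new[gene_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
```

The rows come out gene by gene, in the same order as the melt output above.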

If you want output comparable to x3, you can use the recast function (also from the reshape2 package):

recast(pp.new, part + variable ~ experiment, id.var=c("experiment","part"), value.var = "value")

which gives:

    part variable Experiment1 Experiment2
1 Part 1    gene1           a           i
2 Part 1    gene2           b           j
3 Part 1    gene3           c           k
4 Part 1    gene4           d           l
5 Part 2    gene1           e           m
6 Part 2    gene2           f           n
7 Part 2    gene3           g           o
8 Part 2    gene4           h           p
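
For comparison, here is a base R sketch that gets from the long table to one column per experiment by splitting on experiment and merging the pieces, which is close in spirit to how x3 was built in the question. The names pieces and wide are made up for illustration, and the code assumes each (part, variable) pair occurs exactly once per experiment.

```r
# The long table from the previous step, rebuilt here so the example stands alone.
pp.long <- data.frame(
  experiment = rep(c("Experiment1", "Experiment1", "Experiment2", "Experiment2"), times = 4),
  part       = rep(c("Part 1", "Part 2"), times = 8),
  variable   = rep(paste0("gene", 1:4), each = 4),
  value      = c("a", "e", "i", "m", "b", "f", "j", "n",
                 "c", "g", "k", "o", "d", "h", "l", "p"),
  stringsAsFactors = FALSE
)

# One small data frame per experiment, with the value column
# renamed to that experiment's name.
pieces <- lapply(split(pp.long, pp.long$experiment), function(d) {
  out <- d[, c("part", "variable", "value")]
  names(out)[3] <- as.character(d$experiment[1])
  out
})

# Merge the pieces pairwise on the shared key columns.
wide <- Reduce(function(a, b) merge(a, b, by = c("part", "variable")), pieces)
```

Because Reduce merges however many pieces split returns, the same code should work with more than two experiments, although on a very large dataset the repeated merges may be slower than recast.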