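I have .csv output for two samples, with a number of "calculator" statistics computed for each sample. Some of the calculators come with lower and upper confidence interval values. Ultimately, I want to draw box plots for all of the calculators, with error bars showing the confidence intervals for the calculators that have them. But first I need to get the data into an R-friendly format.
How do I take this input: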
df <- data.frame(sample = as.factor(c("0.22um", "3um")),
nseqs = c(29445, 30212), coverage = c(0.96, 0.99),
invsimpson = c(20.36, 8.76), invsimpson_lci = c(19.99, 8.59),
invsimpson_hci =c(20.76, 8.95),
shannon = c(3.75, 3.04), shannon_lci = c(3.73, 3.02),
shannon_hci = c(3.77, 3.06))
which looks like this:
sample nseqs coverage invsimpson invsimpson_lci invsimpson_hci shannon shannon_lci shannon_hci
1 0.22um 29445 0.96 20.36 19.99 20.76 3.75 3.73 3.77
2 3um 30212 0.99 8.76 8.59 8.95 3.04 3.02 3.06
and convert it to:
sample calculator value lci hci
1 0.22um nseqs num <NA> <NA>
2 0.22um coverage num <NA> <NA>
3 0.22um invsimpson num num num
4 0.22um shannon num num num
5 3um nseqs num <NA> <NA>
6 3um coverage num <NA> <NA>
7 3um invsimpson num num num
8 3um shannon num num num
where num is the corresponding value from df. The data frame should contain NA wherever the original df has no confidence value for that bound.
temp <- melt(df, id.vars = c("sample"), measure.vars = c("nseqs", "coverage", "invsimpson", "shannon"), variable.name = "calculator")
partial.solution <- temp[order(temp$sample), ]
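gets the values for all of the calculators, but getting lci and hci to line up is a bit trickier.
A general solution would be great. I expect the matrix to have hundreds of samples and a variable number of calculators.
Thanks for your help!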
Answer 0 (score: 3)
I would do it in 2 steps:
library(reshape2)
## put the simple columns (id plus point estimates) in long format using melt
## no need to work with all variables here
xx = melt(df[,c(1,2,3,4,7)])
## use reshape to put the lci and hci columns in long format
## after subsetting, columns 2:3 are the lci pair and 4:5 the hci pair
yy = reshape(df[,c(1,5,8,6,9)],direction='long',
varying=list(c(2,3),c(4,5)),
times=c('invsimpson','shannon'),
sep="_", v.names=c("lci", "hci"))[,c('sample','time','lci','hci')]
Then merge the 2 results:
merge(xx,yy,by=1:2,all.x=TRUE)
sample variable value lci hci
1 0.22um nseqs 29445.00 NA NA
2 0.22um coverage 0.96 NA NA
3 0.22um invsimpson 20.36 19.99 20.76
4 0.22um shannon 3.75 3.73 3.77
5 3um nseqs 30212.00 NA NA
6 3um coverage 0.99 NA NA
7 3um invsimpson 8.76 8.59 8.95
8 3um shannon 3.04 3.02 3.06
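The two-step approach above selects columns by position, which gets awkward with a variable number of calculators. As a rough sketch of a name-based alternative in base R, the function below finds the calculators from the column names and looks up `_lci` / `_hci` columns where they exist. Note that `to_long()` and its `id` argument are hypothetical names introduced here for illustration, and the sketch assumes every bound column is named `<calculator>_lci` or `<calculator>_hci`.

```r
# Input from the question.
df <- data.frame(sample = as.factor(c("0.22um", "3um")),
                 nseqs = c(29445, 30212), coverage = c(0.96, 0.99),
                 invsimpson = c(20.36, 8.76), invsimpson_lci = c(19.99, 8.59),
                 invsimpson_hci = c(20.76, 8.95),
                 shannon = c(3.75, 3.04), shannon_lci = c(3.73, 3.02),
                 shannon_hci = c(3.77, 3.06))

# to_long() is a hypothetical helper, not part of any package.
to_long <- function(df, id = "sample") {
  # Calculators are the columns that are neither the id nor a *_lci / *_hci bound.
  calcs <- setdiff(names(df)[!grepl("_(lci|hci)$", names(df))], id)
  # Return the named bound column, or NA when the calculator has none.
  bound <- function(calc, suffix) {
    col <- paste0(calc, "_", suffix)
    if (col %in% names(df)) df[[col]] else rep(NA_real_, nrow(df))
  }
  # Build one small long-format block per calculator, then stack them.
  pieces <- lapply(calcs, function(calc) {
    data.frame(sample = df[[id]], calculator = calc, value = df[[calc]],
               lci = bound(calc, "lci"), hci = bound(calc, "hci"),
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  # Keep calculators in their original column order within each sample.
  out$calculator <- factor(out$calculator, levels = calcs)
  out <- out[order(out$sample, out$calculator), ]
  rownames(out) <- NULL
  out
}

long <- to_long(df)
```

Because the bounds are matched by name rather than by index, adding or removing calculators should not require changing the function, as long as the naming assumption holds.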
Answer 1 (score: 2)
You can try this:
library(reshape2)
temp <- melt(df)
df2 <- cbind(temp, colsplit(string = temp$variable, pattern = "_",
names = c("calc", "stat")))
df3 <- dcast(df2, sample + calc ~ stat, value.var = "value")
df3
# sample calc Var.3 hci lci
# 1 0.22um coverage 0.96 NA NA
# 2 0.22um invsimpson 20.36 20.76 19.99
# 3 0.22um nseqs 29445.00 NA NA
# 4 0.22um shannon 3.75 3.77 3.73
# 5 3um coverage 0.99 NA NA
# 6 3um invsimpson 8.76 8.95 8.59
# 7 3um nseqs 30212.00 NA NA
# 8 3um shannon 3.04 3.06 3.02
Optionally, rename and reorder the variables:
names(df3) <- c("sample", "calculator", "value", "hci", "lci")
df3[ , c("sample", "calculator", "value", "lci", "hci")]
# sample calculator value lci hci
# 1 0.22um coverage 0.96 NA NA
# 2 0.22um invsimpson 20.36 19.99 20.76
# 3 0.22um nseqs 29445.00 NA NA
# 4 0.22um shannon 3.75 3.73 3.77
# 5 3um coverage 0.99 NA NA
# 6 3um invsimpson 8.76 8.59 8.95
# 7 3um nseqs 30212.00 NA NA
# 8 3um shannon 3.04 3.02 3.06
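Since the end goal is to plot error bars only for the calculators that have them, a possible follow-up is to keep the rows with both bounds and check that each interval brackets its estimate. This is only a sketch: the `long` data frame below is typed in by hand from the printed result above so the snippet runs on its own, and the names `long`, `with_ci`, and `bracketed` are introduced here for illustration.

```r
# Long-format result as printed above, entered by hand for a standalone example.
long <- data.frame(
  sample = rep(c("0.22um", "3um"), each = 4),
  calculator = rep(c("coverage", "invsimpson", "nseqs", "shannon"), times = 2),
  value = c(0.96, 20.36, 29445, 3.75, 0.99, 8.76, 30212, 3.04),
  lci = c(NA, 19.99, NA, 3.73, NA, 8.59, NA, 3.02),
  hci = c(NA, 20.76, NA, 3.77, NA, 8.95, NA, 3.06)
)

# Keep only the calculators that have both a lower and an upper bound.
with_ci <- long[complete.cases(long[, c("lci", "hci")]), ]

# Each interval should bracket its estimate; a failure here would suggest
# that the lci/hci columns were misaligned during reshaping.
bracketed <- with_ci$lci <= with_ci$value & with_ci$value <= with_ci$hci
stopifnot(all(bracketed))
```

A check like this is cheap to run after any reshape and would flag swapped or shifted bound columns before they reach a plot.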