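Stacking triplet columns of values and confidence intervals for multiple samples

Time: 2013-12-27 00:50:56

Tags: r dataframe data-manipulation

I have .csv output for two samples, with several "calculator" statistics computed for each sample. Some calculators come with lower and upper confidence interval values. Eventually I want to draw box plots for all calculators, with error bars for those calculators that have confidence intervals. But first I need to manipulate the data into an R-friendly format.

How can I take this input: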

df <- data.frame(sample = as.factor(c("0.22um", "3um")),
                 nseqs = c(29445, 30212), coverage = c(0.96, 0.99),
                 invsimpson = c(20.36, 8.76), invsimpson_lci = c(19.99, 8.59), 
                 invsimpson_hci =c(20.76, 8.95),
                 shannon = c(3.75, 3.04), shannon_lci = c(3.73, 3.02), 
                 shannon_hci = c(3.77, 3.06))

which looks like this:

  sample nseqs coverage invsimpson invsimpson_lci invsimpson_hci shannon shannon_lci shannon_hci
1 0.22um 29445     0.96      20.36          19.99          20.76    3.75        3.73        3.77
2    3um 30212     0.99       8.76           8.59           8.95    3.04        3.02        3.06

and turn it into:

  sample calculator value  lci  hci
1 0.22um      nseqs   num <NA> <NA>
2 0.22um   coverage   num <NA> <NA>
3 0.22um invsimpson   num  num  num
4 0.22um    shannon   num  num  num
5    3um      nseqs   num <NA> <NA>
6    3um   coverage   num <NA> <NA>
7    3um invsimpson   num  num  num
8    3um    shannon   num  num  num

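where num is the corresponding value from df. The resulting data frame would have NA wherever the original df has no confidence interval value for that calculator.

My attempt so far:

library(reshape2)
temp <- melt(df, id.vars = c("sample"), measure.vars = c("nseqs", "coverage", "invsimpson", "shannon"), variable.name = "calculator")
partial.solution <- temp[with(temp, order(sample)), ]

This gets the values for all calculators, but lining up lci and hci in the same row is a bit tricky.

A general solution would be great. I expect the matrix to have hundreds of samples and a variable number of calculators.

Thanks for your help!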

2 Answers:

Answer 0: (score: 3)

I would do this in 2 steps:

library(reshape2)

## melt the plain value columns (no lci/hci) into long format;
## the interval columns are handled separately below
xx = melt(df[,c(1,2,3,4,7)])

## use reshape to stack the lci and hci columns into long format
## columns of the subset: 1 sample, 2 invsimpson_lci, 3 shannon_lci,
##                        4 invsimpson_hci, 5 shannon_hci
yy = reshape(df[,c(1,5,8,6,9)],direction='long',
        varying=list(c(2,3),c(4,5)),
        times=c('invsimpson','shannon'),
        sep="_", v.names=c("lci", "hci"))[,c('sample','time','lci','hci')]

Then merge the two results:

 merge(xx,yy,by=1:2,all.x=TRUE)

  sample   variable    value   lci   hci
1 0.22um      nseqs 29445.00    NA    NA
2 0.22um   coverage     0.96    NA    NA
3 0.22um invsimpson    20.36 19.99 20.76
4 0.22um    shannon     3.75  3.73  3.77
5    3um      nseqs 30212.00    NA    NA
6    3um   coverage     0.99    NA    NA
7    3um invsimpson     8.76  8.59  8.95
8    3um    shannon     3.04  3.02  3.06
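
Position-based column lists such as the `varying` argument are easy to get wrong, so it may be worth checking that every interval actually brackets its value. The following is only a minimal sketch: `res` is a hypothetical, hand-typed stand-in for a merged result with columns value, lci and hci, and `has_ci` and `bad` are illustrative names.

```r
# Hypothetical stand-in for a merged result (not produced by merge() here).
res <- data.frame(
  sample   = rep(c("0.22um", "3um"), each = 2),
  variable = rep(c("nseqs", "invsimpson"), times = 2),
  value    = c(29445, 20.36, 30212, 8.76),
  lci      = c(NA, 19.99, NA, 8.59),
  hci      = c(NA, 20.76, NA, 8.95)
)

# Only rows that carry both bounds can be checked; the others are skipped.
has_ci <- !is.na(res$lci) & !is.na(res$hci)

# Collect rows whose value falls outside [lci, hci].
bad <- res[has_ci & !(res$lci <= res$value & res$value <= res$hci), ]
nrow(bad)  # 0 when every interval brackets its value
```

Any rows that end up in `bad` suggest that lci and hci were paired with the wrong calculator.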

Answer 1: (score: 2)

You could try this:

library(reshape2)
temp <- melt(df)

df2 <- cbind(temp, colsplit(string = temp$variable, pattern = "_",
                            names = c("calc", "stat")))

df3 <- dcast(df2, sample + calc ~ stat, value.var = "value")
df3

#   sample       calc    Var.3   hci   lci
# 1 0.22um   coverage     0.96    NA    NA
# 2 0.22um invsimpson    20.36 20.76 19.99
# 3 0.22um      nseqs 29445.00    NA    NA
# 4 0.22um    shannon     3.75  3.77  3.73
# 5    3um   coverage     0.99    NA    NA
# 6    3um invsimpson     8.76  8.95  8.59
# 7    3um      nseqs 30212.00    NA    NA
# 8    3um    shannon     3.04  3.06  3.02
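
The key step is splitting each column name at "_" into a calculator part and a statistic part. Names without a suffix have no statistic part, which is presumably where the oddly named Var.3 column comes from. As an illustration only, the same split can be sketched in base R, labelling the missing statistic as "value" up front; `vars`, `calc` and `stat` are hypothetical names, and this emulates the idea rather than reproducing what `colsplit` does internally.

```r
# A few of the melted variable names from the example.
vars <- c("nseqs", "coverage", "invsimpson",
          "invsimpson_lci", "invsimpson_hci")

# Calculator: the name with any _lci/_hci suffix removed.
calc <- sub("_(lci|hci)$", "", vars)

# Statistic: the suffix if present, otherwise "value".
has_suffix <- grepl("_(lci|hci)$", vars)
stat <- ifelse(has_suffix, sub("^.*_", "", vars), "value")

data.frame(vars, calc, stat)
```

Note that this sketch assumes only the suffixes `_lci` and `_hci` occur; a calculator whose own name contains an underscore would still be handled, because only the trailing suffix is stripped.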

Then possibly rename and reorder the variables:

names(df3) <- c("sample", "calculator", "value", "hci",  "lci")
df3[ , c("sample", "calculator", "value", "lci",  "hci")]

#   sample calculator    value   lci   hci
# 1 0.22um   coverage     0.96    NA    NA
# 2 0.22um invsimpson    20.36 19.99 20.76
# 3 0.22um      nseqs 29445.00    NA    NA
# 4 0.22um    shannon     3.75  3.73  3.77
# 5    3um   coverage     0.99    NA    NA
# 6    3um invsimpson     8.76  8.59  8.95
# 7    3um      nseqs 30212.00    NA    NA
# 8    3um    shannon     3.04  3.02  3.06
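
For the general case mentioned in the question (hundreds of samples, a variable number of calculators), the same reshaping can also be sketched in base R without hard-coded column positions. This is a minimal sketch under stated assumptions, not a tested general tool: `stack_calculators` is a hypothetical helper, and it assumes interval columns are always named `<calculator>_lci` and `<calculator>_hci`.

```r
# Rebuild the example input from the question.
df <- data.frame(sample = as.factor(c("0.22um", "3um")),
                 nseqs = c(29445, 30212), coverage = c(0.96, 0.99),
                 invsimpson = c(20.36, 8.76), invsimpson_lci = c(19.99, 8.59),
                 invsimpson_hci = c(20.76, 8.95),
                 shannon = c(3.75, 3.04), shannon_lci = c(3.73, 3.02),
                 shannon_hci = c(3.77, 3.06))

# Hypothetical helper: build one long block per calculator, then stack them.
stack_calculators <- function(d, id = "sample") {
  cols  <- setdiff(names(d), id)
  # Calculators are the columns without an _lci/_hci suffix.
  calcs <- cols[!grepl("_(lci|hci)$", cols)]
  # Return the named column, or NA when that interval column is absent.
  pick <- function(nm) {
    if (nm %in% names(d)) d[[nm]] else rep(NA_real_, nrow(d))
  }
  pieces <- lapply(calcs, function(cc) {
    data.frame(d[id],
               calculator = cc,
               value      = d[[cc]],
               lci        = pick(paste0(cc, "_lci")),
               hci        = pick(paste0(cc, "_hci")),
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  # Group rows by sample; order() keeps the calculator order within ties.
  out <- out[order(out[[id]]), ]
  rownames(out) <- NULL
  out
}

long <- stack_calculators(df)
long
```

Because the calculators are discovered from the column names, adding or removing a calculator in the input needs no change to the code, as long as the naming assumption holds.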