R: aggregate function apparently causes the loss of a column's levels?

Date: 2017-08-16 14:15:01

Tags: r aggregate formula

I just ran into a strange situation in RGui. I have always used the same script to get my data.frame into the right shape for ggplot2. My data looks like this:

      time days treatment nucleic_acid habitat  parallel   disturbance     variable  cellcounts      value
1    1    2   control          dna   water        1         none     Proteobacteria       batch     0.000000000
2    2   22   control          dna   water        1         none     Proteobacteria       batch     0.003586543
3    1    2   treated          dna   water        1         none     Proteobacteria       batch     0.000000000
4    2   22   treated          dna   biofilm      1         none     Proteobacteria       NA        0.000000000

'data.frame':   185648 obs. of  10 variables:
 $ time        : int  5 5 5 5 5 5 6 6 6 6 ...
 $ days        : int  62 62 62 62 62 62 69 69 69 69 ...
 $ treatment   : Factor w/ 2 levels "control","treated": 2 2 2 1 1 1 2 2 2 1 ...
 $ parallel    : int  1 2 3 1 2 3 1 2 3 1 ...
 $ nucleic_acid: Factor w/ 2 levels "cdna","dna": 1 1 1 1 1 1 1 1 1 1 ...
 $ habitat     : Factor w/ 2 levels "biofilm","water": 1 1 1 1 1 1 1 1 1 1 ...
 $ cellcounts  : Factor w/ 4 levels "batch","high",..: NA NA NA NA NA NA NA NA NA NA ...
 $ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
 $ variable    : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value       : num  0 0 0 0 0 0 0 0 0 0 ...

I want aggregate to compute the mean over my (up to 3) parallels:

df_mean<-aggregate(value~time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean)

Afterwards, the level "biofilm" in the column "habitat" is lost.

df_mean<-droplevels(df_mean)

str(df_mean)
'data.frame':   44608 obs. of  9 variables:
 $ time        : int  1 2 1 2 1 2 1 2 1 2 ...
 $ days        : int  2 22 2 22 2 22 2 22 2 22 ...
 $ treatment   : Factor w/ 2 levels "control","treated": 1 1 2 2 1 1 2 2 1 1 ...
 $ nucleic_acid: Factor w/ 2 levels "cdna","dna": 2 2 2 2 2 2 2 2 2 2 ...
 $ habitat     : Factor w/ 1 level "water": 1 1 1 1 1 1 1 1 1 1 ...
 $ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
 $ variable    : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ cellcounts  : Factor w/ 4 levels "batch","high",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value       : num  0 0.00359 0 0 0 ...

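This cost me a lot of time (I have only just realised it, and it now looks as if many of my earlier problems were aggregate-related). I removed the column "cellcounts" and it worked. Interestingly, for "biofilm" the columns "cellcounts" and "habitat" always carry the same, and therefore redundant, information ("biofilm" always goes together with "NA"). Is that the reason? But it always worked before, so I can't make sense of this. Has something changed in the base::aggregate function, or something like that? Do you have an explanation for me? I am using R-3.4.0; the other packages in use are reshape, reshape2 and ggplot2.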

Thanks a lot, a confused crazysantaclaus

1 answer:

Answer 0: (score: 1)

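The problem comes from the NAs. The formula method of aggregate defaults to na.action = na.omit, so every row with an NA in any variable named in the formula (including cellcounts) is dropped before the means are computed. Maybe your files used to be loaded differently, so that these entries were stored as strings rather than as NA values? Here is a way to fix it by setting them to the string "NA":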

levels(df$cellcounts) <- c(levels(df$cellcounts),"NA")
df$cellcounts[is.na(df$cellcounts)] <- "NA"
df_mean <- aggregate(value ~ time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean,na.rm=TRUE)
df_mean<-droplevels(df_mean)
str(df_mean)

'data.frame':   4 obs. of  9 variables:
 $ time        : int  1 2 1 2
 $ days        : int  2 22 2 22
 $ treatment   : Factor w/ 2 levels "control","treated": 1 1 2 2
 $ nucleic_acid: Factor w/ 1 level "dna": 1 1 1 1
 $ habitat     : Factor w/ 2 levels "biofilm","water": 2 2 2 1
 $ disturbance : Factor w/ 1 level "none": 1 1 1 1
 $ variable    : Factor w/ 1 level "Proteobacteria": 1 1 1 1
 $ cellcounts  : Factor w/ 2 levels "batch","NA": 1 1 1 2
 $ value       : num  0 0.00359 0 0
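
To see the effect in isolation, here is a minimal sketch on made-up data (the `toy` data frame and its values are assumptions for illustration, not taken from the question). It first cross-tabulates the two columns to show where the NAs sit, then runs aggregate with its default NA handling, and finally tries `addNA()` as a possible alternative to the "NA" string: it turns NA into a regular factor level, so the rows should no longer count as incomplete.

```r
# Hypothetical toy data: cellcounts is NA exactly where habitat is "biofilm"
toy <- data.frame(
  habitat    = factor(c("water", "water", "biofilm", "biofilm")),
  cellcounts = factor(c("batch", "batch", NA, NA)),
  value      = c(1, 3, 5, 7)
)

# Diagnostic: how do the NAs in cellcounts line up with habitat?
print(table(toy$habitat, toy$cellcounts, useNA = "ifany"))

# Default behaviour of the formula method (na.action = na.omit):
# the biofilm rows are removed before grouping
dropped <- aggregate(value ~ habitat + cellcounts, data = toy, FUN = mean)

# addNA() makes NA an explicit factor level, so the rows are kept
toy$cellcounts <- addNA(toy$cellcounts)
kept <- aggregate(value ~ habitat + cellcounts, data = toy, FUN = mean)

print(dropped)
print(kept)
```

Compared with the "NA" string, `addNA()` keeps the missing values recognisable as NA in the result, but factors with NA as a level can behave unexpectedly elsewhere, so it is probably best used only for the aggregation step.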

Data

# stringsAsFactors = TRUE reproduces the R 3.4 default; since R 4.0 strings are read as character
df <- read.table(text="      time days treatment nucleic_acid habitat  parallel   disturbance     variable  cellcounts      value
1    1    2   control          dna   water        1         none     Proteobacteria       batch     0.000000000
2    2   22   control          dna   water        1         none     Proteobacteria       batch     0.003586543
3    1    2   treated          dna   water        1         none     Proteobacteria       batch     0.000000000
4    2   22   treated          dna   biofilm      1         none     Proteobacteria       NA        0.000000000
", header = TRUE, stringsAsFactors = TRUE)