Question

I just ran into a strange situation in RGui... I was using the same script as always to get my data.frame into the right shape for ggplot2. My data looks like this:

  time days treatment nucleic_acid habitat parallel disturbance       variable cellcounts       value
1    1    2   control          dna   water        1        none Proteobacteria      batch 0.000000000
2    2   22   control          dna   water        1        none Proteobacteria      batch 0.003586543
3    1    2   treated          dna   water        1        none Proteobacteria      batch 0.000000000
4    2   22   treated          dna biofilm        1        none Proteobacteria         NA 0.000000000

'data.frame':   185648 obs. of  10 variables:
 $ time        : int  5 5 5 5 5 5 6 6 6 6 ...
 $ days        : int  62 62 62 62 62 62 69 69 69 69 ...
 $ treatment   : Factor w/ 2 levels "control","treated": 2 2 2 1 1 1 2 2 2 1 ...
 $ parallel    : int  1 2 3 1 2 3 1 2 3 1 ...
 $ nucleic_acid: Factor w/ 2 levels "cdna","dna": 1 1 1 1 1 1 1 1 1 1 ...
 $ habitat     : Factor w/ 2 levels "biofilm","water": 1 1 1 1 1 1 1 1 1 1 ...
 $ cellcounts  : Factor w/ 4 levels "batch","high",..: NA NA NA NA NA NA NA NA NA NA ...
 $ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
 $ variable    : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value       : num  0 0 0 0 0 0 0 0 0 0 ...

I want aggregate to compute the mean of my (up to 3) parallels:

df_mean<-aggregate(value~time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean)

After that, the level "biofilm" in the column "habitat" is lost.

df_mean<-droplevels(df_mean)

str(df_mean)
'data.frame':   44608 obs. of  9 variables:
 $ time        : int  1 2 1 2 1 2 1 2 1 2 ...
 $ days        : int  2 22 2 22 2 22 2 22 2 22 ...
 $ treatment   : Factor w/ 2 levels "control","treated": 1 1 2 2 1 1 2 2 1 1 ...
 $ nucleic_acid: Factor w/ 2 levels "cdna","dna": 2 2 2 2 2 2 2 2 2 2 ...
 $ habitat     : Factor w/ 1 level "water": 1 1 1 1 1 1 1 1 1 1 ...
 $ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
 $ variable    : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ cellcounts  : Factor w/ 4 levels "batch","high",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value       : num  0 0.00359 0 0 0 ...

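So I spent a lot of time on this (I actually only just realized it, and now I see that there are many questions that seem to be aggregate-related). I removed the column "cellcounts" and it worked. Interestingly, the columns "cellcounts" and "habitat" always carry the same, and therefore redundant, information for "biofilm" ("biofilm" always goes with "NA"). Is this the reason? But it always worked before, so I can't make sense of it. Was there a change in the base::aggregate function or something similar? Do you have an explanation for me? I'm using R-3.4.0; the other packages in use are reshape, reshape2 and ggplot2.

Many thanks, a confused crazysantaclaus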

Answer 1

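The problem comes from the NAs: the formula method of aggregate defaults to na.action = na.omit, so every row with an NA in any variable of the formula is dropped before the means are computed. Perhaps your files used to be loaded differently, so that these values were stored as strings rather than as NA values? Here is a way to fix it by setting them to the "NA" string: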

levels(df$cellcounts) <- c(levels(df$cellcounts),"NA")
df$cellcounts[is.na(df$cellcounts)] <- "NA"
df_mean <- aggregate(value ~ time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean,na.rm=TRUE)
df_mean<-droplevels(df_mean)
str(df_mean)

'data.frame':   4 obs. of  9 variables:
 $ time        : int  1 2 1 2
 $ days        : int  2 22 2 22
 $ treatment   : Factor w/ 2 levels "control","treated": 1 1 2 2
 $ nucleic_acid: Factor w/ 1 level "dna": 1 1 1 1
 $ habitat     : Factor w/ 2 levels "biofilm","water": 2 2 2 1
 $ disturbance : Factor w/ 1 level "none": 1 1 1 1
 $ variable    : Factor w/ 1 level "Proteobacteria": 1 1 1 1
 $ cellcounts  : Factor w/ 2 levels "batch","NA": 1 1 1 2
 $ value       : num  0 0.00359 0 0
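To see why the rows disappear in the first place, here is a minimal sketch on a toy data frame (the name toy and its three columns are made up for illustration, and it assumes, as in the question, that cellcounts is NA exactly where habitat is "biofilm"). It compares the default behaviour with the explicit-level workaround used above:

```r
# Toy data mirroring the four example rows (assumption: cellcounts is NA
# exactly where habitat is "biofilm", as described in the question).
toy <- data.frame(
  habitat    = factor(c("water", "water", "water", "biofilm")),
  cellcounts = factor(c("batch", "batch", "batch", NA)),
  value      = c(0, 0.003586543, 0, 0)
)

# Default na.action = na.omit: the biofilm row is removed before grouping.
dropped <- aggregate(value ~ habitat + cellcounts, data = toy, FUN = mean)

# The rows that survive na.omit are exactly the complete cases.
n_kept <- sum(complete.cases(toy))

# Turning NA into an ordinary factor level keeps the row.
fixed <- toy
levels(fixed$cellcounts) <- c(levels(fixed$cellcounts), "NA")
fixed$cellcounts[is.na(fixed$cellcounts)] <- "NA"
kept <- aggregate(value ~ habitat + cellcounts, data = fixed, FUN = mean)

print(dropped)  # only the water/batch group
print(kept)     # water/batch and biofilm/NA
```

Because every biofilm row has an NA in cellcounts, dropping the incomplete rows removes the whole "biofilm" level, which would also explain why removing cellcounts from the formula made the problem go away.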

Data

# stringsAsFactors = TRUE reproduces the default of R 3.4.0 (it changed in R 4.0.0)
df <- read.table(text = "
  time days treatment nucleic_acid habitat parallel disturbance variable cellcounts value
1 1 2 control dna water 1 none Proteobacteria batch 0.000000000
2 2 22 control dna water 1 none Proteobacteria batch 0.003586543
3 1 2 treated dna water 1 none Proteobacteria batch 0.000000000
4 2 22 treated dna biofilm 1 none Proteobacteria NA 0.000000000
", header = TRUE, stringsAsFactors = TRUE)
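A quick way to check on your own data whether the missing values line up with one habitat is a cross table that counts NAs as well. This is only a sketch on a made-up two-column data frame (toy is a hypothetical name, not the asker's df):

```r
# Hypothetical stand-in for the habitat and cellcounts columns.
toy <- data.frame(
  habitat    = factor(c("water", "water", "water", "biofilm")),
  cellcounts = factor(c("batch", "batch", "batch", NA))
)

# useNA = "ifany" adds an <NA> column so missing cellcounts are counted too.
tab <- table(toy$habitat, toy$cellcounts, useNA = "ifany")
print(tab)

# Number of rows per habitat that na.omit would discard.
lost <- tapply(is.na(toy$cellcounts), toy$habitat, sum)
print(lost)
```

If the count in lost equals the total number of rows for a habitat, that level will vanish from the aggregated result.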

R: does the aggregate function apparently cause a loss of levels in a column?
