I just ran into a strange situation in RGui. I am using the same script as always to get my data.frame into the right shape for ggplot2. My data looks like this:
time days treatment nucleic_acid habitat parallel disturbance variable cellcounts value
1 1 2 control dna water 1 none Proteobacteria batch 0.000000000
2 2 22 control dna water 1 none Proteobacteria batch 0.003586543
3 1 2 treated dna water 1 none Proteobacteria batch 0.000000000
4 2 22 treated dna biofilm 1 none Proteobacteria NA 0.000000000
'data.frame': 185648 obs. of 10 variables:
$ time : int 5 5 5 5 5 5 6 6 6 6 ...
$ days : int 62 62 62 62 62 62 69 69 69 69 ...
$ treatment : Factor w/ 2 levels "control","treated": 2 2 2 1 1 1 2 2 2 1 ...
$ parallel : int 1 2 3 1 2 3 1 2 3 1 ...
$ nucleic_acid: Factor w/ 2 levels "cdna","dna": 1 1 1 1 1 1 1 1 1 1 ...
$ habitat : Factor w/ 2 levels "biofilm","water": 1 1 1 1 1 1 1 1 1 1 ...
$ cellcounts : Factor w/ 4 levels "batch","high",..: NA NA NA NA NA NA NA NA NA NA ...
$ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
$ variable : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 0 0 0 0 0 0 0 0 0 ...
I want to use aggregate to compute the mean of my (up to 3) parallels:
df_mean<-aggregate(value~time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean)
Afterwards, the level "biofilm" in the column "habitat" is lost.
df_mean<-droplevels(df_mean)
str(df_mean)
'data.frame': 44608 obs. of 9 variables:
$ time : int 1 2 1 2 1 2 1 2 1 2 ...
$ days : int 2 22 2 22 2 22 2 22 2 22 ...
$ treatment : Factor w/ 2 levels "control","treated": 1 1 2 2 1 1 2 2 1 1 ...
$ nucleic_acid: Factor w/ 2 levels "cdna","dna": 2 2 2 2 2 2 2 2 2 2 ...
$ habitat : Factor w/ 1 level "water": 1 1 1 1 1 1 1 1 1 1 ...
$ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
$ variable : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 2 2 2 2 3 3 ...
$ cellcounts : Factor w/ 4 levels "batch","high",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 0.00359 0 0 0 ...
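I spent a lot of time on this (and only just noticed that there seem to be many aggregate-related questions already). When I removed the column "cellcounts", it worked. Interestingly, for "biofilm" the columns "cellcounts" and "habitat" always carry the same, and therefore redundant, information ("biofilm" always goes together with NA). Is that the reason? It always worked before, so I can't make sense of this. Has the base::aggregate function changed, or something similar? Do you have an explanation for me? I am using R-3.4.0; the other packages in use are reshape, reshape2 and ggplot2.
Thanks a lot, a confused crazysantaclaus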
Answer 0 (score: 1)
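The problem comes from the NA values. Maybe your file used to be loaded differently, so that these were stored as strings rather than as NA values? Here is a way to solve it by setting them to the string "NA":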
levels(df$cellcounts) <- c(levels(df$cellcounts),"NA")
df$cellcounts[is.na(df$cellcounts)] <- "NA"
df_mean <- aggregate(value ~ time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean,na.rm=TRUE)
df_mean<-droplevels(df_mean)
str(df_mean)
'data.frame': 4 obs. of 9 variables:
$ time : int 1 2 1 2
$ days : int 2 22 2 22
$ treatment : Factor w/ 2 levels "control","treated": 1 1 2 2
$ nucleic_acid: Factor w/ 1 level "dna": 1 1 1 1
$ habitat : Factor w/ 2 levels "biofilm","water": 2 2 2 1
$ disturbance : Factor w/ 1 level "none": 1 1 1 1
$ variable : Factor w/ 1 level "Proteobacteria": 1 1 1 1
$ cellcounts : Factor w/ 2 levels "batch","NA": 1 1 1 2
$ value : num 0 0.00359 0 0
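For context, the formula method of `aggregate` uses `na.action = na.omit` by default, so any row with an NA in one of the formula variables, grouping columns included, is dropped before the means are computed. Below is a minimal sketch on a toy data frame (`toy`, `h` and `g` are made-up names, not taken from the question). It uses `addNA()` as a possible alternative to adding the "NA" level by hand:

```r
# Toy data: the grouping factor g is NA exactly where h is "biofilm".
toy <- data.frame(
  h = factor(c("water", "water", "biofilm", "biofilm")),
  g = factor(c("batch", "batch", NA, NA)),
  value = c(1, 3, 5, 7)
)

# Default na.action = na.omit drops the rows where g is NA,
# so "biofilm" never reaches the result.
dropped <- aggregate(value ~ h + g, data = toy, FUN = mean)

# addNA() turns NA into a regular factor level, so those rows survive.
toy$g <- addNA(toy$g)
kept <- aggregate(value ~ h + g, data = toy, FUN = mean)

print(dropped)
print(kept)
```

Unlike the string "NA", the level added by `addNA()` is a real NA level, which may matter if the column is later passed to plotting code that treats missing values specially.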
Data
df <- read.table(text=" time days treatment nucleic_acid habitat parallel disturbance variable cellcounts value
1 1 2 control dna water 1 none Proteobacteria batch 0.000000000
2 2 22 control dna water 1 none Proteobacteria batch 0.003586543
3 1 2 treated dna water 1 none Proteobacteria batch 0.000000000
4 2 22 treated dna biofilm 1 none Proteobacteria NA 0.000000000
", header = TRUE, stringsAsFactors = TRUE)  # stringsAsFactors = TRUE keeps the factor columns on R >= 4.0
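To check whether missing values explain the lost rows in a larger data frame, it can help to cross-tabulate the two columns with NAs shown and to count the rows that `na.omit` would discard. This is a sketch on a small stand-in data frame (`chk` is a hypothetical name; its columns only mimic `habitat` and `cellcounts` from the question):

```r
# Stand-in data: cellcounts is missing for the single biofilm row.
chk <- data.frame(
  habitat = factor(c("water", "water", "water", "biofilm")),
  cellcounts = factor(c("batch", "batch", "batch", NA)),
  value = c(0, 0.003586543, 0, 0)
)

# useNA = "ifany" adds an NA column so missing cellcounts become visible.
tab <- table(chk$habitat, chk$cellcounts, useNA = "ifany")
print(tab)

# Number of rows the formula interface of aggregate() would drop by default.
n_dropped <- sum(!complete.cases(chk[, c("habitat", "cellcounts", "value")]))
print(n_dropped)
```

If every "biofilm" row falls into the NA column of the table, the whole level disappears after aggregation, which matches the behaviour described in the question.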