如何将条形图中的标准偏差条限制为最大值?

时间:2016-11-04 15:01:28

标签: r ggplot2 standard-deviation

我正在使用ggplot2创建标准偏差条的条形图。我的数据框非常大,但这里有一个截断版本,例如:

SampleName  Target.ID   Maj.Allele.Freq SD  AVG.MAF
W15-P2-1    rs1005533   99.74811083 24.98883743 93.70753223
W15-P2-2    rs1005533   100 24.98883743 93.70753223
W15-P2-3    rs1005533   100 24.98883743 93.70753223
W15-P2-4    rs1005533   100 24.98883743 93.70753223
W15-P2-1    rs1005533   99.94819995 24.98883743 93.70753223
W15-P2-2    rs1005533   100 24.98883743 93.70753223
W15-P2-3    rs1005533   100 24.98883743 93.70753223
W15-P2-4    rs1005533   100 24.98883743 93.70753223
W21-P2-1    rs1005533   100 24.98883743 93.70753223
W21-P2-2    rs1005533   100 24.98883743 93.70753223
W21-P2-3    rs1005533   99.90044798 24.98883743 93.70753223
W21-P2-4    rs1005533   99.72375691 24.98883743 93.70753223
W21-P2-1    rs1005533   100 24.98883743 93.70753223
W21-P2-2    rs1005533   100 24.98883743 93.70753223
W21-P2-3    rs1005533   100 24.98883743 93.70753223
W21-P2-4    rs1005533   0   24.98883743 93.70753223
W15-P2-1    rs10092491  52.40641711 1.340954343 51.8604281
W15-P2-2    rs10092491  53.69923603 1.340954343 51.8604281
W15-P2-3    rs10092491  52.56689284 1.340954343 51.8604281
W15-P2-4    rs10092491  50.11764706 1.340954343 51.8604281
W15-P2-1    rs10092491  50.30094583 1.340954343 51.8604281
W15-P2-2    rs10092491  50.96277279 1.340954343 51.8604281
W15-P2-3    rs10092491  50.94102886 1.340954343 51.8604281
W15-P2-4    rs10092491  51.2849162  1.340954343 51.8604281
W21-P2-1    rs10092491  53.56976202 1.340954343 51.8604281
W21-P2-2    rs10092491  50.27861123 1.340954343 51.8604281
W21-P2-3    rs10092491  52.8358209  1.340954343 51.8604281
W21-P2-4    rs10092491  51.42585551 1.340954343 51.8604281
W21-P2-1    rs10092491  52.77890467 1.340954343 51.8604281
W21-P2-2    rs10092491  52.89017341 1.340954343 51.8604281
W21-P2-3    rs10092491  53.70786517 1.340954343 51.8604281
W21-P2-4    rs10092491  50  1.340954343 51.8604281

由于最后一列(AVG.MAF)中的平均值可以产生超过最大值100的标准偏差条,因此该图显示超出y轴限制的条数为100.

Example Standard Deviation bars extend beyond 100.

以下是创建上述情节的代码:

pe1 = ggplot(half1, aes(x=Target.ID, y=AVG.MAF))+
 geom_bar(stat = "identity", position = "dodge", colour = "black", 
 width = 0.5, fill = "yellowgreen")+xlab("")+
 ylab("Average Major Allele Frequency")+
 labs(title="Allele Balance AmpliSeq Identity Sample P2")+
 geom_errorbar(aes(ymin = AVG.MAF-SD, ymax = AVG.MAF+SD), 
 width = 0.4, position = position_dodge(0.9), 
   size = 0.6)+
 theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))

我尝试使用coord_cartesian截断情节,但这种情况使得情节看起来像是隐藏了一些数据:

Here the top of the standard deviation bars is cut off

以下是创建情节的代码,标准偏差条被切断:

pe1 = ggplot(half1, aes(x=Target.ID, y=AVG.MAF))+geom_bar(stat = "identity", position = "dodge", colour = "black", width = 0.5, fill = "yellowgreen")+xlab("")+ylab("Average Major Allele Frequency")+labs(title="Allele Balance AmpliSeq Identity Sample P2")+geom_errorbar(aes(ymin = AVG.MAF-SD, ymax = AVG.MAF+SD), width = 0.4, position = position_dodge(0.9), size = 0.6)+theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))+coord_cartesian(ylim=c(0,100))

似乎必须有一种方法可以将标准偏差条限制为我的预期ymax为100,并且仍然可以在图中看到顶部水平条。有谁知道怎么做?

1 个答案:

答案 0 :(得分:0)

除了人们在评论中提出的问题之外,还有以下几点需要考虑:

  1. 您不需要为数据的每一行添加重复平均值的列。相反,您可以使用Maj.Allele.Freq中的实际数据值计算并绘制ggplot中的平均值。 (事实上​​,通过使用y值的列重复每个Target.ID的平均值,你实际上是在绘制平均条的多个副本,一个在另一个的顶部。 )

    您还可以在ggplot之外汇总数据(即计算平均值和标准偏差),然后使用汇总数据框进行绘图。在更复杂的情况下,这有时是必要的,但你可以在ggplot中完成所有这些工作。

  2. 在我看来,积分比这里的酒吧更好。

  3. 下面的代码同时提供了point和bar版本,还显示了如何添加数据的标准偏差或数据均值的95%置信区间。蓝线表示标准偏差,而红线表示95%置信区间。

    我提供了自举置信区间。要提供经典的正常置信区间,请从mean_cl_boot切换到mean_cl_normal

    如果您希望y轴降至零,请添加coord_cartesian(ylim=c(0,150))或您希望的任何最大y值(正如评论所讨论的那样,为了避免误导图,它应该高于顶部错误栏,无论条形代表SD还是CI)。

    ggplot(half1, aes(x=Target.ID, y=Maj.Allele.Freq)) +
      stat_summary(fun.data=mean_sdl, geom="errorbar", width=0.1, colour="blue") +
      stat_summary(fun.data=mean_sdl, geom="point", colour="blue", size=3) +
      stat_summary(fun.data = mean_cl_boot, colour="red", geom="errorbar", width=0.1) +
      stat_summary(fun.data = mean_cl_boot, colour="red", geom="point") +
      labs(x="", y="Average Major Allele Frequency", 
           title="Allele Balance AmpliSeq\nIdentity Sample P2") +
      theme_bw() +
      theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5)) 
    

    enter image description here

    ggplot(half1, aes(x=Target.ID, y=Maj.Allele.Freq)) +
      stat_summary(fun.y=mean, geom="bar", fill="yellowgreen", colour="black") +
      stat_summary(fun.data=mean_sdl, geom="errorbar", width=0.1, size=1, colour="blue") +
      stat_summary(fun.data = mean_cl_boot, colour="red", geom="errorbar", width=0.1, size=0.7) +
      labs(x="", y="Average Major Allele Frequency", 
           title="Allele Balance AmpliSeq\nIdentity Sample P2") +
      theme_bw() +
      theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))   
    

    enter image description here

    你也可以将SD和95%CI放在同一个地块上:

    pnp = position_nudge(x=0.1)
    pnm = position_nudge(x=-0.1)
    
    ggplot(half1, aes(x=Target.ID, y=Maj.Allele.Freq)) +
      stat_summary(fun.data=mean_sdl, geom="errorbar", width=0.1, position=pnp, aes(colour="SD")) +
      stat_summary(fun.data=mean_sdl, geom="point", position=pnp, aes(colour="SD")) +
      stat_summary(fun.data = mean_cl_boot, geom="errorbar", width=0.1, 
                   position=pnm, aes(colour="95% CI")) +
      stat_summary(fun.data = mean_cl_boot, geom="point", position=pnm, aes(colour="95% CI")) +
      labs(x="", y="Average Major Allele Frequency", colour="",
           title="Allele Balance AmpliSeq\nIdentity Sample P2") +
      theme_bw() +
      theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
    

    enter image description here