如何在ggplot facet plot中绘制多个星号的重要性?

时间:2016-11-14 11:04:01

标签: r plot ggplot2 facet p-value

我在ggplot2中创建了一个facet plot,每个图中都显示了多个数据集作为折线图。根据统计检验,一些数据点是显着的(p≤0.05)。我想在情节上用明显的数据点上方标有星号来表明这一点。I found this example of having asterisks displayed above the significant values

星号的颜色应与图中使用的数据集的颜色相对应。当x轴上有该点的多个重要数据集时,星号应垂直堆叠,这样它们就不会通过重叠而相互模糊。

在我的输入数据中,我有一个带有p值的附加列。任何人都可以指出我这样做ggplot2的方法(如果可能的话)或者帮我处理代码。

我当前的情节(图例从右侧裁剪,以使图中的其余部分更大):

enter image description here

我目前的代码:

ggplot(MyData,aes( x = DAF, y = Mvalue ,group=Species, colour = Species)) + geom_line(size=1.3) + xlab("Frequencies") + ylab("Score")  + theme(axis.text.x=element_text(angle = -45, hjust = 0, size = 6)) + theme(axis.text.y=element_text( size = 6)) + facet_wrap(~Variant) + geom_point()

9个数据集中2个的输入数据示例(其余部分将在下面继续)。对于这个数据,有意义的星号(p≤0​​.05)将用于第6,7,8,10,14和4行。 19基于最终列中的值≤0.05:

1   Species Variant DAF Mvalue  pvalue
2   Tom 5' UTR  0.1-0.19    -1.6026346186   NA
3   Tom 5' UTR  0.2-0.29    1.1646939405    NA
4   Tom 5' UTR  0.3-0.39    0.0003859956    9.84E-01
5   Tom 5' UTR  0.4-0.49    0.0226744644    3.28E-01
6   Tom 5' UTR  0.5-0.59    0.1163627387    3.22E-05
7   Tom 5' UTR  0.6-0.69    0.1614562558    6.33E-06
8   Tom 5' UTR  0.7-0.79    0.221583632 4.29E-06
9   Tom 5' UTR  0.8-0.89    0.1231280752    1.42E-01
10  Tom 5' UTR  0.9-0.99    0.5765076152    9.13E-03
11  Tom 5' UTR  1   5.8105310419    1.87E-13
12  Jerry   5' UTR  0.1-0.19    -0.1371122871   NA
13  Jerry   5' UTR  0.2-0.29    -0.0539638465   4.30E-01
14  Jerry   5' UTR  0.3-0.39    0.1666681074    1.45E-02
15  Jerry   5' UTR  0.4-0.49    0.0081950639    9.19E-01
16  Jerry   5' UTR  0.5-0.59    -0.1204254909   1.82E-01
17  Jerry   5' UTR  0.6-0.69    0.1017622151    3.15E-01
18  Jerry   5' UTR  0.7-0.79    0.1293398031    3.16E-01
19  Jerry   5' UTR  0.8-0.89    0.2944195851    4.52E-02
20  Jerry   5' UTR  0.9-0.99    -0.2956980914   2.12E-01
21  Jerry   5' UTR  1   0.0746902715    7.63E-01

如果它更简单,我可以用0或1替换p值列,指示值是否显着。

我试图展示我以前的工作和一些示例输入数据。如果我可以改进我的问题,请告诉我。 谢谢你的建议。

这是请求的数据子集的dput()输出:

structure(list(Species = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Jerry", 
"Tom"), class = "factor"), Variant = structure(c(2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("3' UTR", 
"5' UTR", "Missense", "Stop gained"), class = "factor"), DAF = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L), .Label = c("0.1-0.19", "0.2-0.29", "0.3-0.39", 
"0.4-0.49", "0.5-0.59", "0.6-0.69", "0.7-0.79", "0.8-0.89", "0.9-0.99", 
"1"), class = "factor"), Mvalue = c(-1.6026346186, 1.1646939405, 
0.0003859956, 0.0226744644, 0.1163627387, 0.1614562558, 0.221583632, 
0.1231280752, 0.5765076152, 5.8105310419, -0.0251257018, -0.022586792, 
0.0089090304, 0.037280128, 0.0745842692, 0.0831538898, 0.0762765259, 
0.1750634419, 0.2095647328, NA, -0.0139837967, -0.0218524964, 
-0.023889027, -0.0042744306, 0.0949525873, 0.087866945, 0.1379730494, 
0.2719542633, 0.4726727792, NA, 0.0201430038, 0.1304518218, -0.0948886785, 
-0.2329137983, -0.0901357588, 0.0504128137, -0.2308377878, 0.4422620731, 
NA, NA, -0.1371122871, -0.0539638465, 0.1666681074, 0.0081950639, 
-0.1204254909, 0.1017622151, 0.1293398031, 0.2944195851, -0.2956980914, 
0.0746902715, -0.005168038, 0.0403712226, -0.0034692714, -0.0049252304, 
-0.0089669044, -0.0604522846, 0.1061225099, 0.0180975445, -0.1843156999, 
-0.1920104157, 0.2228406046, 0.0532141252, 0.0670815638, -0.1197784096, 
-0.235101482, -0.1920644059, -0.2493575855, -0.1564613691, -0.2600385981, 
0.069079018, 0.0503810571, 0.4346052688, 0.1300533982, 0.0662828745, 
-0.4627398332, -1.081459609, -0.7693678877, -0.4865007276, -0.0230373639, 
0.4693415234), pvalue = c(NA, NA, 0.984, 0.328, 3.22e-05, 6.33e-06, 
4.29e-06, 0.142, 0.00913, 1.87e-13, NA, NA, 0.354, NA, 1.93e-07, 
7.29e-06, 0.00288, 2.48e-05, 0.1, 0.791, 0.124, NA, 0.131, 0.824, 
4.11e-05, 0.00354, 0.000711, 3.1e-05, 0.0122, 0.871, 0.73, 0.0963, 
0.367, NA, 0.574, 0.799, 0.442, 0.267, 0.319, 0.98, NA, 0.43, 
0.0145, 0.919, 0.182, 0.315, 0.316, 0.0452, 0.212, 0.763, 0.824, 
0.096, 0.896, 0.868, 0.779, 0.124, 0.0261, 0.761, NA, NA, 6.44e-22, 
0.0407, 0.0162, NA, NA, NA, NA, NA, NA, 0.481, 0.809, 0.0236, 
0.573, 0.801, 0.172, NA, 0.186, 0.449, 0.975, 0.513)), .Names = c("Species", 
"Variant", "DAF", "Mvalue", "pvalue"), class = "data.frame", row.names = c(NA, 
-80L))

2 个答案:

答案 0 :(得分:2)

这是解决方案。它适用于您的示例中的2种物种,但应该适用于更多物种。

library(data.table)
MyData <- data.table(MyData)
MyData$signif <- ifelse(MyData$pvalue < 0.05,1,0)

确定何时在同一点上有超过1个有效值

MyData[, temp:=cumsum(signif), by=c("Variant", "DAF")]

循环根据点“y值”+ 0.5创建星号“y值”。 当存在n个有效点时,它将第n个点“y值”增加n * 0.5。

for (i in 1:length(levels(MyData$Species))) {
  MyData[temp == i , y_ast:=max(Mvalue, na.rm=TRUE)+(i*0.5), by=c("DAF", "Variant")]
}


ggplot(MyData,aes( x = DAF, y = Mvalue ,group=Species, colour = Species)) +
  geom_line(size=1) + xlab("Frequencies") + ylab("Score")  +
  theme(axis.text.x=element_text(angle = -45, hjust = 0, size = 6)) +
  theme(axis.text.y=element_text( size = 6)) +
  facet_wrap(~Variant) +
  geom_point()+
  geom_point(data = MyData[MyData$signif ==1, ],aes(x=DAF, y=y_ast),shape = "*", size=8, show.legend = FALSE)

enter image description here

如果您想按照我的建议在趋势上显示星号,只需在最后一行用y = y=y_ast替换Mvalue

PS:我习惯使用data.table,但y_ast值可以使用R base包或dplyr来计算

答案 1 :(得分:2)

一种可行的方法:
包含星号位置的数据集,即p值显着的点

library(dplyr)
df_asterisk=MyData%>%
  filter(pvalue<0.05)

另一个包含超过1个p值的值的数据帧。添加了一个新列id来区分不同的组。

Id_group=df_asterisk%>%
  group_by(Variant,DAF)%>%
  filter(n()>1)%>%
  mutate(id=data.table::rleid(Mvalue))

Mvalue用于*的位置,我们在用于映射的数据帧中更改它。我使用先前创建的列id的值来区分星号(不重叠)。一种更简单的方法可以是忽略此步骤并向映射添加随机组件(如果不满意则重绘)。 df_asterisk[with(df_asterisk,duplicated(interaction(Variant,DAF))|duplicated(interaction(Variant,DAF),fromLast = T)),]$Mvalue<-(df_asterisk[with(df_asterisk,duplicated(interaction(Variant,DAF))|duplicated(interaction(Variant,DAF),fromLast = T)),]$Mvalue)+Id_group$id/4

简介:

ggplot(MyData,aes( x = DAF, y = Mvalue ,group=Species, colour = Species)) +
  geom_line(size=1.3) +
  xlab("Frequencies") +
  ylab("Score")  +
  theme(axis.text.x=element_text(angle = -45, hjust = 0, size = 6)) +
  theme(axis.text.y=element_text( size = 6)) + 
  geom_point()+
  geom_text(data=df_asterisk,aes(x=DAF,y=Mvalue),label="*",size=5,nudge_y=1)+
  facet_wrap(~Variant)

我更改数据点以在DAF中为5'显示2个重要的p值,以查看它的外观。 enter image description here