小提琴图ggplot2,宽度来自列

时间:2016-08-27 21:31:14

标签: r ggplot2 z-axis violin-plot

我对R很新,我只将它用于可视化,所以我可能会遗漏一些简单的东西。

我想要的是,我有两列,应该是x和y轴。第三列我应该定义图的宽度。即使我从不同的答案中尝试了很多东西,但我还没有对代码进行过深入研究。让我们说代码就是这么远:

ggplot(disM, aes(x=study, y=value)) +
  geom_violin() +
  labs(list(title="Distribution", x="Studies", y="Ranges"))

并没有真正实现任何目标。

我有一张这样的表:

  Col0         study     value
1    30-31 breast cancer    357263
2    32-33 breast cancer    352067
3    34-35 breast cancer    340264
4    36-37 breast cancer    309827
5    38-39 breast cancer    298684
6    40-41 breast cancer    322570
7    42-43 breast cancer    338480
8    44-45 breast cancer    354451
9    46-47 breast cancer    429183
10   48-49 breast cancer    396942
11   50-51 breast cancer    415195
12   52-53 breast cancer    368217
13   54-55 breast cancer    445884
14   56-57 breast cancer    395652
15   58-59 breast cancer    386643
16   60-61 breast cancer    461940
17   62-63 breast cancer    473772
18   64-65 breast cancer    464228
19   66-67 breast cancer    485851
20   68-69 breast cancer    513411
21   70-71 breast cancer    576618
22   72-73 breast cancer    588724
23   74-75 breast cancer    634343
24   76-77 breast cancer    584662
25   78-79 breast cancer    608901
26   80-81 breast cancer    617286
27   82-83 breast cancer    659318
28   84-85 breast cancer    757167
29   86-87 breast cancer   1044465
30   88-89 breast cancer    982901
31   90-91 breast cancer   1114269
32   92-93 breast cancer   1110257
33   94-95 breast cancer   1742966
34   96-97 breast cancer   6379974
35   98-99 breast cancer   3437746
36 100-101 breast cancer 118984063
37   30-31  renal cancer   1055566
38   32-33  renal cancer   1089405
39   34-35  renal cancer   1228087
40   36-37  renal cancer   1265606
41   38-39  renal cancer   1264919
42   40-41  renal cancer   1248949
43   42-43  renal cancer   1391738
44   44-45  renal cancer   1453100
45   46-47  renal cancer   1443915
46   48-49  renal cancer   1429785
47   50-51  renal cancer   1372041
48   52-53  renal cancer   1339706
49   54-55  renal cancer   1418135
50   56-57  renal cancer   1484162
51   58-59  renal cancer   1582617
52   60-61  renal cancer   1571977
53   62-63  renal cancer   1652503
54   64-65  renal cancer   1742230
55   66-67  renal cancer   1859936
56   68-69  renal cancer   1928028
57   70-71  renal cancer   2041783
58   72-73  renal cancer   2108994
59   74-75  renal cancer   2154244
60   76-77  renal cancer   2218430
61   78-79  renal cancer   2333206
62   80-81  renal cancer   2377262
63   82-83  renal cancer   2345651
64   84-85  renal cancer   2402114
65   86-87  renal cancer   2519284
66   88-89  renal cancer   2542761
67   90-91  renal cancer   2587606
68   92-93  renal cancer   2308279
69   94-95  renal cancer   2980927
70   96-97  renal cancer  14108950
71   98-99  renal cancer   2762116
72 100-101  renal cancer 211513230

X轴应为研究栏,y应为Col0,小提琴图的宽度应为值栏。我无法拆分col0,因为我只将数据作为范围。

任何指针要检查什么,如何做到这一点将不胜感激。对不起,如果我错过了类似的问题。

提前致谢

2 个答案:

答案 0 :(得分:3)

我要猜一猜。 (如果我正确,您还可以查找有关pyramid plots的信息。)

重新订购标签,以便" 100-101"真的到了最后:

disM$Col0 <- factor(disM$Col0,levels=unique(disM$Col0))

重新排列以便更容易绘制多边形(我希望有一种更简单的方法可以做到这一点,但无法想到一个):

library(plyr)
disM2 <- ddply(disM,"study",
   function(dd) with(dd,
             data.frame(y=c(as.numeric(Col0),rev(as.numeric(Col0))),
                        x=c(-value/2,rev(value/2)))))


library(ggplot2); theme_set(theme_bw())
ggplot(disM2)+
    geom_polygon(aes(x,y),alpha=0.5)+
    facet_wrap(~study)+
    labs(list(title="Distribution"))+
    scale_y_continuous(breaks=as.numeric(disM$Col0),
                       labels=disM$Col0)+
    scale_x_continuous(labels=NULL)

enter image description here

答案 1 :(得分:0)

采取以下措施的另一种方式。

首先,你应该计算你的类标记(类最大+类最小/ 2)。在您的情况下,它是您的间隔的中点(您可以使用levels(x = my_data$col0)提取)。

my_data$class_mark <- rep(x = seq(from = 30.5, to = 100.5, by = 2), times = 2)

然后你应该分割你的数据:

my_data_br <- my_data[which(my_data$study == "breast cancer"),]
my_data_re <- my_data[which(my_data$study == "renal cancer"),]

您的value列确实有很大的数字,所以您应该将它们除以最小值:

my_data_br$value <- my_data_br$value/min(my_data_br$value)
my_data_re$value <- my_data_re$value/min(my_data_re$value)

之后,您应该重复每个班级标记的次数value

classmark_rep_br <- rep(x = my_data_br[,4], times = my_data_br[,3])
br_rep <- rep("breast cancer", times = length(x = classmark_rep_br))
br_data <- cbind.data.frame(br_rep, classmark_rep_br)
names(br_data) <- c("study", "value")

classmark_rep_re <- rep(x = my_data_re[,4], times = my_data_re[,3])
re_rep <- rep("renal cancer", times = length(x = classmark_rep_re))
re_data <- cbind.data.frame(re_rep, classmark_rep_re)
names(re_data) <- c("study", "value")

最后建立新的数据库:

my_data2 <- rbind.data.frame(br_data, re_data)

现在,您可以根据需要制作图表(例如the following one)并保存:

my_graph <- ggplot(data = my_data2, aes(x = study, y = value, fill = study)) + geom_violin() + 
  theme(legend.position = "none", panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.text = element_text(size = 10, face = "bold"), panel.background = element_blank(), 
        axis.line = element_line(colour = "black")) + 
  labs(x = "", y = "") + scale_fill_brewer(palette="Pastel1") + 
  scale_x_discrete(labels = c("Breast cancer", "Renal cancer")) + 
  scale_y_continuous(breaks = c(30.5, 40.5, 50.5, 60.5, 70.5, 80.5, 90.5, 100.5),
                     labels = c("30-31", "40-41", "50-51", "60-61", "70-71", "80-81", "90-91", "100-101"))

    ggsave(plot = my_graph, filename = "my_graph", path = "output/path/", device = "png", dpi = 200)