I have a data frame:
structure(list(allele_freq = c(8, 11, 14, 7, 7, 1, 1, 1, 1, 1,
1, 10, 1, 45, 48, 1, 16, 1), gene = structure(c(2L, 4L, 2L, 7L,
6L, 12L, 10L, 9L, 11L, 13L, 8L, 5L, 1L, 1L, 2L, 2L, 3L, 14L), .Label = c("E-cadherin",
"intergenic", "CHES-1-like", "Ddr", "mino", "mspo", "ZnT35C",
"CG11984", "CG12301", "CG34356", "DCP2", "Eip63E", "hb", "spri"
), class = "factor")), row.names = c(NA, -18L), class = "data.frame", .Names = c("allele_freq",
"gene"))
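This shows a list of genes and how often each one appears in my data.
Some genes can appear more than once in the data (e.g. intergenic). I am trying to plot the frequency for each gene without summing the allele_freq values of genes that occur more than once.
This is what I have: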
library(dplyr)
library(ggplot2)
bp_data <- bp_data %>%
# ... some other filtering...
mutate(allele_freq = as.numeric(allele_freq)) %>%
transform(gene = reorder(gene, -allele_freq)) %>%
droplevels()
p <- ggplot(bp_data)
p <- p + geom_bar(aes(gene, allele_freq), stat='identity')
p
Here, the allele_freq values of all the intergenic entries are being summed into a single bar. I want each entry to be shown separately in my plot.
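The summing can be reproduced without plotting. With stat='identity', geom_bar stacks rows that share the same x value, so the bar height equals the per-gene total. Below is a minimal base R sketch of that total, assuming a small subset of the rows above (the name bar_heights is only illustrative):

```r
# Assumed subset of the question's data (not all 18 rows)
df <- data.frame(
  allele_freq = c(8, 14, 48, 1, 1, 45, 11),
  gene = c("intergenic", "intergenic", "intergenic", "intergenic",
           "E-cadherin", "E-cadherin", "Ddr")
)

# Rows sharing a gene are stacked by geom_bar(stat = 'identity'),
# so each bar is as tall as the per-gene sum computed here
bar_heights <- tapply(df$allele_freq, df$gene, sum)
bar_heights
```

In this subset the four intergenic rows collapse into one total, which is the behavior I want to avoid.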
Answer 0 (score: 2)
library(dplyr)
library(ggplot2)
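# Number the rows within each gene so repeated genes get distinct labels
df2 <- df %>%
  arrange(gene, -allele_freq) %>%
  group_by(gene) %>%
  mutate(count = seq(n())) %>%
  mutate(gene2 = paste(gene, count, sep = "")) %>%
  transform(gene2 = reorder(gene2, -allele_freq))
# One bar per gene2 label, ordered by decreasing allele_freq
ggplot(df2, aes(x = gene2, y = allele_freq)) + geom_bar(stat = 'identity')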
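The idea is to number the rows within each gene and paste that counter onto the gene name, so every row gets its own x value and nothing is stacked. The same labelling step can be sketched in base R to inspect the labels before plotting. This is only an illustration on an assumed subset of the data, with gene stored as character rather than factor:

```r
# Assumed subset of the question's data (not all 18 rows)
df <- data.frame(
  allele_freq = c(8, 14, 48, 1, 1, 45, 11),
  gene = c("intergenic", "intergenic", "intergenic", "intergenic",
           "E-cadherin", "E-cadherin", "Ddr")
)

# Sort by gene, then by decreasing allele_freq within each gene
df <- df[order(df$gene, -df$allele_freq), ]

# Running counter within each gene: 1, 2, 3, ...
df$count <- ave(seq_along(df$gene), df$gene, FUN = seq_along)

# Unique label per row, e.g. "intergenic1", "intergenic2", ...
df$gene2 <- paste0(df$gene, df$count)
df$gene2
```

Because every gene2 value is distinct, each row ends up as its own bar.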
Answer 1 (score: 0)
Here is a dirty trick, but it works:
df %>%
split(.$gene) %>%
do.call(rbind, .) %>%
mutate(gene = rownames(.))
# allele_freq gene
# 1 1 E-cadherin.13
# 2 45 E-cadherin.14
# 3 8 intergenic.1
# 4 14 intergenic.3
# 5 48 intergenic.15
# 6 1 intergenic.16
# 7 16 CHES-1-like
# 8 11 Ddr
# 9 10 mino
# 10 7 mspo
# 11 7 ZnT35C
# 12 1 CG11984
# 13 1 CG12301
# 14 1 CG34356
# 15 1 DCP2
# 16 1 Eip63E
# 17 1 hb
# 18 1 spri
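I call it dirty because it relies on a side effect of do.call(rbind, ...), which builds row names that distinguish rows with the same factor value, rather than enumerating the values explicitly. (Note that the appended numbers are the original row numbers of the values, and that genes occurring only once get no suffix.)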
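For comparison, a more explicit way to get unique labels is base R's make.unique, which leaves the first occurrence unchanged and appends .1, .2, ... to later duplicates. This is a sketch on an assumed subset of the data, not the approach used above, and the column name gene_label is hypothetical:

```r
# Assumed subset of the question's data (not all 18 rows)
df <- data.frame(
  allele_freq = c(8, 14, 48, 1, 1, 45, 11),
  gene = c("intergenic", "intergenic", "intergenic", "intergenic",
           "E-cadherin", "E-cadherin", "Ddr")
)

# Explicitly de-duplicate the labels instead of relying on rbind row names
df$gene_label <- make.unique(as.character(df$gene))
df$gene_label
```

Here the suffix counts repeats of the same gene instead of echoing the original row number, so the labels do not depend on row order elsewhere in the data.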
With ggplot:
df %>%
split(.$gene) %>%
do.call(rbind, .) %>%
mutate(gene = rownames(.)) %>%
ggplot(., aes(x=gene, y=allele_freq)) + geom_bar(stat='identity')