Splitting factors with the same name in dplyr

Time: 2018-01-12 15:03:18

Tags: r ggplot2

I have a data frame:

    structure(list(allele_freq = c(8, 11, 14, 7, 7, 1, 1, 1, 1, 1, 
1, 10, 1, 45, 48, 1, 16, 1), gene = structure(c(2L, 4L, 2L, 7L, 
6L, 12L, 10L, 9L, 11L, 13L, 8L, 5L, 1L, 1L, 2L, 2L, 3L, 14L), .Label = c("E-cadherin", 
"intergenic", "CHES-1-like", "Ddr", "mino", "mspo", "ZnT35C", 
"CG11984", "CG12301", "CG34356", "DCP2", "Eip63E", "hb", "spri"
), class = "factor")), row.names = c(NA, -18L), class = "data.frame", .Names = c("allele_freq", 
"gene"))

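This shows a list of genes and how frequently each one occurs in my data.

Some genes may appear more than once in the data (e.g. intergenic). I am trying to plot the frequency of each gene without summing the allele_freq values for genes that appear more than once.

This is what I have: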

library(dplyr)
library(ggplot2)

bp_data <- bp_data %>%
    # ... some other filtering...
    mutate(allele_freq = as.numeric(allele_freq)) %>%
    # order the gene levels by decreasing allele_freq
    transform(gene = reorder(gene, -allele_freq)) %>%
    droplevels()

p <- ggplot(bp_data)
p <- p + geom_bar(aes(gene, allele_freq), stat='identity')
p


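Here, the allele_freq values of all the intergenic entries are being summed into a single bar. I would like intergenic to be represented multiple times in my plot, once per entry.

2 answers:

Answer 0: (score: 2)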
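The summing can be checked without drawing anything: with stat='identity', rows that share the same x value are stacked into one bar, so the bar height is the per-gene sum. A minimal sketch in base R, assuming a small hypothetical data frame `toy` (not the real data) with the same two columns:

```r
# Hypothetical toy data mirroring the structure of the data frame in the question
toy <- data.frame(
  allele_freq = c(8, 14, 48, 1, 11),
  gene = factor(c("intergenic", "intergenic", "intergenic", "intergenic", "Ddr"))
)

# Number of rows per gene; any gene with more than one row ends up in a single bar
rows_per_gene <- table(toy$gene)

# Total height of the stacked bar drawn for each gene
stacked_height <- tapply(toy$allele_freq, toy$gene, sum)

print(rows_per_gene)
print(stacked_height)
```

Any gene whose row count is above 1 needs a unique label per row before it can be drawn as separate bars.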

library(dplyr)
library(ggplot2)

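df2 <- df %>% arrange(gene, -allele_freq) %>% group_by(gene) %>%
  # number the rows within each gene: 1, 2, 3, ...
  mutate(count = seq(n())) %>%
  # append that number so every label is unique (e.g. intergenic1, intergenic2)
  mutate(gene2 = paste(gene, count, sep="")) %>%
  # order the bars by decreasing allele_freq
  transform(gene2 = reorder(gene2, -allele_freq))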

ggplot(df2,aes(x=gene2,y=allele_freq)) + geom_bar(stat='identity')

Answer 1: (score: 0)

The following is a dirty trick, but it works:

df %>% 
   split(.$gene) %>% 
   do.call(rbind, .) %>% 
   mutate(gene = rownames(.))

#    allele_freq          gene
# 1            1 E-cadherin.13
# 2           45 E-cadherin.14
# 3            8  intergenic.1
# 4           14  intergenic.3
# 5           48 intergenic.15
# 6            1 intergenic.16
# 7           16   CHES-1-like
# 8           11           Ddr
# 9           10          mino
# 10           7          mspo
# 11           7        ZnT35C
# 12           1       CG11984
# 13           1       CG12301
# 14           1       CG34356
# 15           1          DCP2
# 16           1        Eip63E
# 17           1            hb
# 18           1          spri

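I call it dirty because it relies on a side effect of do.call(rbind, ...), which makes the row names of repeated factor values unique, rather than enumerating the values explicitly. (Note that the appended numbers are the original row numbers of the values.)

Using ggplot: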

df %>% 
  split(.$gene) %>% 
  do.call(rbind, .) %>% 
  mutate(gene = rownames(.)) %>% 
  ggplot(., aes(x=gene, y=allele_freq)) + geom_bar(stat='identity')
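For comparison, the explicit enumeration mentioned above could be done with base R's make.unique(), which appends a numeric suffix to repeated strings. This is only a sketch and not part of the original answer; `toy` and `gene2` are hypothetical names used for illustration:

```r
# Hypothetical toy data; `toy` stands in for the data frame from the question
toy <- data.frame(
  allele_freq = c(8, 11, 14, 48),
  gene = factor(c("intergenic", "Ddr", "intergenic", "intergenic"))
)

# make.unique() leaves the first occurrence unchanged and appends
# ".1", ".2", ... to each later repeat of the same string
toy$gene2 <- make.unique(as.character(toy$gene))

print(toy$gene2)
```

Unlike the rbind approach, the suffix here counts repeats rather than recording the original row number. The new `gene2` column could then be mapped to x in the same geom_bar(stat='identity') call.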