如何按类别对子组进行分类?

时间:2019-05-13 22:52:01

标签: r ggplot2

我有一些代码以堆叠的条形图显示出门的数量和门内的属。我编辑了代码,使所有NA元素都出现在每个小节的顶部,而更丰富的元素出现在底部,但是,这抛弃了我的调色板,该调色板根据门类组以及在该组中按字母顺序分配颜色。例如,细菌杆菌门被分配为蓝色,门内的每个属按字母顺序被分配为蓝色阴影。

我相信我可以更改levs变量以按字母顺序对元素进行排序,并按门进行分组,但是我还没有找到一种方法来实现。但是,目前,levs变量会按元素的丰度对元素进行排序。

#makes color pallete
ColourPalleteMulti <- function(df, group, subgroup){

  # Find how many colour categories to create and the number of colours in each
  categories <- aggregate(as.formula(paste(subgroup, group, sep="~" )), df, function(x) length(unique(x)))
  category.start <- (scales::hue_pal(l = 100)(nrow(categories))) # Set the top of the colour pallete
  category.end  <- (scales::hue_pal(l = 40)(nrow(categories))) # set the bottom

  # Build Colour pallette
  colours <- unlist(lapply(1:nrow(categories),
                           function(i){
                             colorRampPalette(colors = c(category.start[i], category.end[i]))(categories[i,2])}))
  return(colours)
}

library(tidyverse)
library("phyloseq"); packageVersion("phyloseq")
library(ggplot2)
library(scales)
library(RColorBrewer)
data("GlobalPatterns")

#filter phyloseq data
TopNOTUs <- names(sort(taxa_sums(GlobalPatterns), TRUE)[1:100])
gp.ch   <- prune_species(TopNOTUs, GlobalPatterns)

#create dataframe
mdf = psmelt(gp.ch)
mdf$group <- paste0(mdf$Phylum, "-", mdf$Genus, sep = "")

#factor by abundance
levs <- names(sort(tapply(mdf$Abundance, mdf$Genus, sum)))
#load colors
colours <-  ColourPalleteMulti(mdf, "Phylum", "Genus")

#put NA at the top
mdf %>%
  mutate(Genus = fct_explicit_na(Genus, "NA"),
         Genus = factor(Genus, levels = c("NA", levs))) %>%
  #graph
  ggplot(aes(Phylum)) + 
  geom_bar(aes(fill = Genus), colour = "grey", position = "stack") +
  scale_fill_manual("Genus", values=c("#FFFFFF",colours)) +
  ggtitle("Phylum and Genus Frequency") +
  ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5))

运行此代码将显示一个条形图,其颜色位于奇数个位置。理想情况下,图形中的每个条形图都是原色,而每个堆栈都是该颜色的不同阴影。正确创建了调色板,但是由于上述问题,颜色分配不正确。任何帮助表示赞赏!

1 个答案:

答案 0 :(得分:0)

欢迎来到stackoverflow。您正在这里做一些棘手的事情!我认为很难在函数中执行此操作,最大的障碍是将NA放在首位。只需使用tidyverse管道,我就可以将它们放在一起。

这是您的基础设置,为没有phyloseq的人们做了一些准备

# how to install if needed
#source('http://bioconductor.org/biocLite.R')
#biocLite('phyloseq')
library(tidyverse)
library(phyloseq)
library(scales)
library(RColorBrewer)
data("GlobalPatterns")

# filter phyloseq data
TopNOTUs <- names(sort(taxa_sums(GlobalPatterns), TRUE)[1:100])
gp.ch <- prune_species(TopNOTUs, GlobalPatterns)

# create dataframe
mdf <- psmelt(gp.ch)

首先,我将记录折叠为计数n

prep <-
  mdf %>%
  mutate(Genus = fct_explicit_na(Genus, "NA")) %>% 
  # summarizes data
  count(Phylum, Genus) %>% # returns n as a count
  mutate(
    group = paste(Phylum, Genus, sep = "-"),
    Phylum = fct_reorder(Phylum, n, sum),
    has_genus = Genus != "NA"
  ) %>% 
  # this step helps with the factor ordering
  arrange(Phylum, has_genus, n) %>% 
  mutate(group = fct_inorder(group)) %>% 
  # I then find some totals & an rank based on the value of n
  group_by(Phylum) %>% 
  mutate(
    ord = row_number(),
    total = n()
  ) %>% 
  ungroup()

#  Phylum         Genus             n group                      has_genus   ord total
#  <fct>          <fct>         <int> <chr>                      <lgl>     <int> <int>
#  Tenericutes    NA               52 Tenericutes-NA             FALSE         1     2
#  Tenericutes    Clostridium      26 Tenericutes-Clostridium    TRUE          2     2
#  Actinobacteria NA              130 Actinobacteria-NA          FALSE         1     3
#  Actinobacteria Rothia           26 Actinobacteria-Rothia      TRUE          2     3
#  Actinobacteria Bifidobacter~    78 Actinobacteria-Bifidobact~ TRUE          3     3

然后我使用因子值填充hcl()函数(类似于您的hue_pal()

df <-
  prep %>% 
  mutate(
    group = fct_inorder(group), # ordering in the stack
    hue = as.integer(Phylum)*25,
    light_base = 1-(ord)/(total+2),
    light = floor(light_base * 100)
  ) %>% 
  # if the genus is missing, use white, otherwise create a hexcode
  mutate(hex = ifelse(!has_genus, "#ffffff", hcl(h = hue, l = light)))

然后是情节

ggplot(df, aes(Phylum, n)) + 
  geom_col(aes(fill = group), colour = "grey") +
  scale_fill_manual(values = df$hex, breaks = (df$group)) +
  ggtitle("Phylum and Genus Frequency") +
  ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5))

enter image description here

对于第二个问题,请保留上述所有prepdf的代码,然后将它们与原始mdf表连接起来。 df表的目的仅是生成颜色,而prep是辅助表。 genushex之间应为1:1。在sample中包含prep列将返回780行,而不是30行,并且不再是1:1。这就是为什么您没有获得想要的结果的原因。 (我认为是ord列被抛出了)。因此,使用上面的内容,然后添加它。我加入了set.seed()sample_frac(),以使更改更加明显。我还旋转它以提高可读性。

set.seed(1234)
final_df <- 
  mdf %>% 
  sample_frac(0.9) %>% 
  mutate(
    Genus = fct_explicit_na(Genus, "NA"),
    # these 2 lines will sort in descending order by Proteobacteria
    rank = as.integer(Phylum == "Proteobacteria" & Genus != "NA"), # T/F == 1/0
    Sample = fct_reorder(Sample, rank, mean)
  ) %>% 
  count(Phylum, Genus, Sample, rank) %>% 
  left_join(df %>% select(-n))


ggplot(final_df, aes(Sample, n)) + 
  geom_col(aes(fill = group), position="fill") +#
  scale_fill_manual("Genus", values = df$hex, breaks = (df$group)) +
  ggtitle("Phylum and Genus Frequency") +
  ylab("Frequency") +
  scale_y_continuous(labels = percent, expand = expand_scale(0)) +
  coord_flip() +
  theme(plot.title = element_text(hjust = 0.5))

enter image description here