R ggplot2: grouped boxplot of TCGA expression data

Date: 2018-12-23 18:15:26

             Gene1  Gene2  Gene3 ...
Patient_T 1    2      3      1
Patient_T 2    1      5      6 
Patient_N 1    3      6      1
Patient_N 2    3      6      1



1 answer:

Answer 0 (score: 1)



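Reshaping the data from wide to long format (one record per patient, tissue type, and gene) is the key to building a grouped boxplot. In your case, the row names of the data frame carry two pieces of information: the tissue type and the patient ID. After splitting them into two columns, I gather the Gene1, Gene2, and Gene3 columns into two columns: gene and expression_level. That is how the original 4 x 3 data frame becomes a 12 x 4 tidy dataset.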

[Screenshot: grouped box plot]

# load necessary packages ----
# tidyverse attaches tibble, stringr, tidyr, dplyr and ggplot2, all used below
library(tidyverse)

# load necessary data ----
df <-
  data.frame(Gene1 = c(2, 1, 3, 3)
             , Gene2 = c(3, 5, 6, 6)
             , Gene3 = c(1, 6, 1, 1)
             , row.names = c("Patient_T 1"
                             , "Patient_T 2"
                             , "Patient_N 1"
                             , "Patient_N 2"))

# reshape data so that it contains one record per: ----
# - patient
# - gene
# - tissue type
tidy.df <-
  df %>%
  # pid for Patient ID
  rownames_to_column(var = "pid") %>%
  # only keep the suffix in pid
  mutate(pid = str_extract(pid, "(T|N)\\s{1}\\d{1}")) %>%
  # separate pid from tissue type in two dif columns
  separate(col = "pid"
           , into = c("type", "pid")
           , sep = "\\s{1}") %>%
  gather(key = "gene"
         , value = "expression_level"
         , matches("Gene")) %>%
  # remove 'Gene' from gene column
  # and specify the 'type' values
  mutate(gene = str_extract(gene, "\\d{1}")
         , type = case_when(
           type == "N" ~ "Normal"
           , type == "T" ~ "Tumor"
         )) %>%
  # sort the rows by pid
  arrange(pid)

# create a grouped boxplot with ggplot2 ----
# The graph should depict all the gene candidates 
# in the x-axis and the expression level 
# in the y-axis grouped by tumor and normal for each gene.
tidy.df %>%
  ggplot(aes(x = gene, y = expression_level, fill = gene)) +
  geom_boxplot() +
  # visualize the distribution of expression level by gene and tissue type,
  # i.e. one panel of boxplots for normal and one for tumor
  facet_wrap(facets = vars(type)) +
  ylab("Expression level") +
  labs(title = "Gene expression data by tissue type"
       , caption = "Source: TCGA")

# end of script #
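
As a possible variation on the script above, here is a minimal sketch that does the same wide-to-long reshape with pivot_longer() instead of gather(), then draws the tumor and normal boxes side by side within each gene rather than in separate facets. It assumes tidyr 1.0.0 or later; the names tidy_df, id, and p are illustrative and not part of the original answer.

```r
# Sketch only: same toy data as in the answer; requires tidyr >= 1.0.0.
library(tidyverse)

df <- data.frame(
  Gene1 = c(2, 1, 3, 3),
  Gene2 = c(3, 5, 6, 6),
  Gene3 = c(1, 6, 1, 1),
  row.names = c("Patient_T 1", "Patient_T 2", "Patient_N 1", "Patient_N 2")
)

tidy_df <- df %>%
  # move the row names into a regular column (hypothetical name: id)
  rownames_to_column(var = "id") %>%
  # "Patient_T 1" -> "T 1"
  mutate(id = str_remove(id, "^Patient_")) %>%
  # "T 1" -> type = "T", pid = "1"
  separate(col = "id", into = c("type", "pid"), sep = " ") %>%
  # wide -> long: one row per patient, tissue type and gene
  pivot_longer(cols = starts_with("Gene"),
               names_to = "gene",
               values_to = "expression_level") %>%
  mutate(type = if_else(type == "T", "Tumor", "Normal"))

# 4 patients x 3 genes = 12 rows; columns: type, pid, gene, expression_level
dim(tidy_df)

# fill = type makes geom_boxplot() dodge the two tissue types within each gene
p <- ggplot(tidy_df, aes(x = gene, y = expression_level, fill = type)) +
  geom_boxplot() +
  labs(x = "Gene", y = "Expression level", fill = "Tissue type")
p
```

Mapping fill to type puts both tissue types on one shared axis, which can make per-gene comparisons easier to read than two facets. The faceted version in the answer keeps each tissue type in its own panel.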


