Question

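Biologist and ggplot2 beginner here. I have a relatively large dataset of DNA sequences (millions of short DNA fragments) that I first need to filter for quality, sequence by sequence. I would like to use a stacked bar plot in ggplot2 to show how many reads were filtered out.

I have already figured out that ggplot likes data in long format, and I have successfully reshaped mine with the melt function from reshape2.

Here is a subset of the data as it currently stands: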

library sample  filter  value
LIB0    0011a   F1  1272707
LIB0    0018a   F1  1505554
LIB0    0048a   F1  1394718
LIB0    0095a   F1  2239035
LIB0    0011a   F2  250000
LIB0    0018a   F2  10000
LIB0    0048a   F2  10000
LIB0    0095a   F2  10000
LIB0    0011a   P   2118559
LIB0    0018a   P   2490068
LIB0    0048a   P   2371131
LIB0    0095a   P   3446715
LIB1    0007b   F1  19377
LIB1    0010b   F1  79115
LIB1    0011b   F1  2680
LIB1    0007b   F2  10000
LIB1    0010b   F2  10000
LIB1    0011b   F2  10000
LIB1    0007b   P   290891
LIB1    0010b   P   1255638
LIB1    0011b   P   4538

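library and sample are my ID variables (the same sample can be in multiple libraries). 'F1' and 'F2' give the number of reads filtered out at that step, and 'P' is the number of sequence reads remaining after filtering.

I have figured out how to make a basic stacked bar plot, but now I am stuck: I cannot work out how to reorder the factor on the x-axis so that the bars within each facet are sorted in descending order of the sum of F1, F2 and P. As it stands, I think they are sorted alphabetically by sample name within each library.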

library(ggplot2)

testdata <- read.csv('testdata.csv', header = TRUE, sep = '\t')

ggplot(testdata, aes(x=sample, y=value, fill=filter)) + 
  geom_bar(stat='identity') +
  facet_wrap(~library, scales = 'free')

After some googling I found the aggregate function, which gives me the total for each sample in each library:

aggregate(value ~ library+sample, testdata, sum)

  library sample   value
1    LIB1  0007b  320268
2    LIB1  0010b 1344753
3    LIB0  0011a 3641266
4    LIB1  0011b   17218
5    LIB0  0018a 4005622
6    LIB0  0048a 3775849
7    LIB0  0095a 5695750

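While this does give me the totals, I have no idea how to use them to reorder the factor, especially since there are two variables I need to take into account (library and sample).

So I guess my question boils down to: how can I order the samples in my plot by the sum of F1, F2 and P within each library?

Thank you very much for any pointers you can give me!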

Answer 1

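You are almost there. You need to change the factor levels of testdata$sample based on the aggregated data (I assume that no sample name appears in both LIB1 and LIB0):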

df <- aggregate(value ~ library+sample, testdata, sum)

testdata$sample <- factor(testdata$sample, levels = df$sample[order(-df$value)])

ggplot(testdata, aes(x=sample, y=value, fill=filter)) + 
    geom_bar(stat='identity') +
    facet_wrap(~library, scales = 'free')
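
If a sample name does show up in more than one library (the question says this can happen), a single set of levels on sample cannot hold a different order for each facet. One possible workaround, sketched here under assumptions, is to order a combined library/sample key instead and strip the library prefix from the axis labels. The key column, the "_" separator and the shared sample name "0011" are all made up for illustration and are not part of the original data; the values are borrowed from the question's subset.

```r
# Hypothetical mini data set: sample "0011" is deliberately shared by both
# libraries (an assumption for illustration; values come from the question).
testdata <- data.frame(
  library = rep(c("LIB0", "LIB0", "LIB1", "LIB1"), times = 3),
  sample  = rep(c("0011", "0018", "0011", "0007"), times = 3),
  filter  = rep(c("F1", "F2", "P"), each = 4),
  value   = c(1272707, 1505554, 2680, 19377,
              250000, 10000, 10000, 10000,
              2118559, 2490068, 4538, 290891),
  stringsAsFactors = FALSE
)

# Combined key that is unique per library/sample pair ("key" is a made-up name).
testdata$key <- paste(testdata$library, testdata$sample, sep = "_")

# Total reads (F1 + F2 + P) per key, then levels in descending order of total.
totals <- aggregate(value ~ key, testdata, sum)
testdata$key <- factor(testdata$key,
                       levels = totals$key[order(-totals$value)])

# Plot only if ggplot2 is installed; the labels function drops the "LIBx_"
# prefix so the axis shows plain sample names again.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(testdata, aes(x = key, y = value, fill = filter)) +
    geom_bar(stat = "identity") +
    facet_wrap(~library, scales = "free") +
    scale_x_discrete(labels = function(k) sub("^[^_]*_", "", k))
  print(p)
}
```

Because scales = 'free' drops unused levels in each facet, every panel should only show its own samples, in descending order of their totals. If your library names can contain "_", pick a different separator.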
