Question

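My input looks like the data below. I want to create two new columns: one with the number of times each gene name occurs, and one with the sum of its values. Can anyone help?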

Input:

gene1   5
gene1   4
gene2   7
gene3   6
gene3   2
gene3   3

Expected output:

gene1    2    9
gene2    1    7
gene3    3    11

Data:

dd <- read.table(header = FALSE, stringsAsFactors = FALSE, text="gene1   5
gene1   4
gene2   7
gene3   6
gene3   2
gene3   3")

Answer 1

awk 'BEGIN {print "Gene\tCount\tSum"} {a[$1]+=$2;b[$1]++} END {for (i in a) {print i"\t"b[i]"\t"a[i]}}' file

Gene    Count   Sum
gene1   2   9
gene2   1   7
gene3   3   11

Answer 2

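This is what dplyr is for. The pipe operator also makes the syntax easy to read. You must replace "col1" and "col2" in the code below with the corresponding column names: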

library('dplyr')
df %>% group_by(col1) %>%
    summarise(count=n(),
    sum=sum(col2))
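
As a minimal sketch of the same idea applied to the question's `dd` data: since `read.table` is called with `header = FALSE`, the columns get the default names `V1` and `V2`, so those stand in for "col1" and "col2". The name `res` is only an illustrative choice.

```r
library(dplyr)

# The data from the question; unnamed columns become V1 (gene) and V2 (value)
dd <- read.table(header = FALSE, stringsAsFactors = FALSE, text = "gene1   5
gene1   4
gene2   7
gene3   6
gene3   2
gene3   3")

# One row per gene: number of rows in the group and the sum of its values
res <- dd %>%
  group_by(V1) %>%
  summarise(count = n(), sum = sum(V2))

print(res)
```

The result is a tibble with one row per gene and the columns `V1`, `count` and `sum`.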

Answer 3

Please provide actual reproducible code. For details, see this question.

First, we create the test data:

#libraries
library(stringr);library(plyr)

#test data
df = data.frame(gene = str_c("gene", c(1, 1, 2, rep(3, 3))),
                count = c(5, 4, 7, 6, 2, 3))

Then we summarise with ddply from the plyr package:

#ddply
ddply(df, .(gene), summarize,
      gene_count = length(count),
      sum = sum(count)
)

This takes the data.frame, splits it by the values of the gene column, and then summarises each piece in the two desired ways. See Hadley's introduction to the split, apply and combine route.

Result:

   gene gene_count sum
1 gene1          2   9
2 gene2          1   7
3 gene3          3  11

There are many other ways to do this.
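
One such alternative, sketched here in base R without any extra packages, uses `table` for the counts and `tapply` for the sums. This is not part of the original answer. It starts from the question's `dd` data, and the column names `gene` and `value` as well as the variable names are assumptions chosen for illustration.

```r
# The data from the question, with illustrative column names
dd <- read.table(header = FALSE, stringsAsFactors = FALSE, text = "gene1   5
gene1   4
gene2   7
gene3   6
gene3   2
gene3   3")
names(dd) <- c("gene", "value")

# Number of rows per gene
counts <- table(dd$gene)

# Sum of the values per gene
sums <- tapply(dd$value, dd$gene, sum)

# Both results are ordered by the sorted gene names, so they line up
res <- data.frame(gene  = names(sums),
                  count = as.vector(counts),
                  sum   = as.vector(sums))

print(res)
```

Both `table` and `tapply` group by the same sorted set of gene names, which is why the two vectors can be combined column-wise without a merge.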

Count duplicate IDs in a column and sum their values, using awk or R

3 answers: