This is a subset of my large data set:
gene feature reads
A anot 2
A 3ss_A 3
A 3ss_B 5
B 5ss_A 1
B anot 4
C 3ss_A 2
C 3ss_B 8
C anot 3
C 5ss_A 6
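Within each gene, I want to divide the reads of the 3ss and 5ss features by the reads of that gene's "anot" feature. Each gene has multiple features (not all shown here), but only one "anot" feature.
The expected output is: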
gene feature reads ratio
A anot 2 1
A 3ss_A 3 1.5
A 3ss_B 5 2.5
B 5ss_A 1 0.25
B anot 4 1
C 3ss_A 2 0.666666667
C 3ss_B 8 2.666666667
C anot 3 1
C 5ss_A 6 2
How can I do this in R? Thanks.
Answer 0 (score: 9)
Here are several alternatives:
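1) ave. Use ave like this. The function fun is passed a vector of row numbers for one gene and returns the vector of ratios for those rows. No packages are used.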
fun <- function(ix) with(DF[ix, ], reads / reads[feature == "anot"])
transform(DF, ratio = ave(1:nrow(DF), gene, FUN = fun))
giving:
gene feature reads ratio
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
9 C 5ss_A 6 2.0000000
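The alternatives in this answer all rely on the question's statement that every gene has exactly one "anot" row. The following is a minimal sketch of how that assumption could be checked before computing any ratios. The data frame is rebuilt by hand from the question's table, and the names anot_count and bad_genes are hypothetical helpers, not part of the original answer.

```r
# Rebuild the question's example data (assumed to live in a data frame named DF).
DF <- data.frame(
  gene    = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  feature = c("anot", "3ss_A", "3ss_B", "5ss_A", "anot",
              "3ss_A", "3ss_B", "anot", "5ss_A"),
  reads   = c(2, 3, 5, 1, 4, 2, 8, 3, 6)
)

# Count the "anot" rows per gene; every count should be exactly 1.
anot_count <- tapply(DF$feature == "anot", DF$gene, sum)
bad_genes  <- names(anot_count)[anot_count != 1]

# Stop early with a readable message if the assumption does not hold.
if (length(bad_genes) > 0) {
  stop("genes without exactly one 'anot' row: ",
       paste(bad_genes, collapse = ", "))
}
```

On the full data set, a gene with zero or several "anot" rows would otherwise cause an error or a silently recycled ratio, depending on the approach.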
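1a) ave. Here is another approach using ave. It replaces every non-anot reads value with NA and then, within each gene, uses na.omit to pick out the single non-NA value and divides reads by it: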
transform(DF, ratio =
reads / ave(ifelse(feature == "anot", reads, NA), gene, FUN = na.omit))
giving:
gene feature reads ratio
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
9 C 5ss_A 6 2.0000000
1b) ave. Here is another ave variant. It is particularly concise, but it assumes that the reads value for anot is always non-negative (which is the case in the question's example). It creates a vector that equals reads on the anot rows and zero elsewhere, and then takes the maximum within each gene:
transform(DF, ratio = reads / ave((feature == "anot") * reads, gene, FUN = max))
giving:
gene feature reads ratio
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
9 C 5ss_A 6 2.0000000
2) by. Another approach is to use by, which also uses no packages. Here the function funby takes a subset of the rows of DF and returns that subset with the ratio column appended.
funby <- function(x) transform(x, ratio = reads / reads[feature == "anot"])
do.call("rbind", by(DF, DF$gene, funby))
giving:
gene feature reads ratio
A.1 A anot 2 1.0000000
A.2 A 3ss_A 3 1.5000000
A.3 A 3ss_B 5 2.5000000
B.4 B 5ss_A 1 0.2500000
B.5 B anot 4 1.0000000
C.6 C 3ss_A 2 0.6666667
C.7 C 3ss_B 8 2.6666667
C.8 C anot 3 1.0000000
C.9 C 5ss_A 6 2.0000000
3) rep / table. This also uses no packages. It assumes that DF is sorted by gene (which is the case in the question's example). It repeats each anot reads value once for every row of that gene and then divides reads by the result.
transform(DF, ratio = reads / rep(reads[feature == "anot"], table(gene)))
giving:
gene feature reads ratio
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
9 C 5ss_A 6 2.0000000
4) dplyr. Using the dplyr package:
library(dplyr)
DF %>%
group_by(gene) %>%
mutate(ratio = reads / reads[feature == "anot"]) %>%
ungroup()
giving:
Source: local data frame [9 x 4]
gene feature reads ratio
(fctr) (fctr) (int) (dbl)
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
9 C 5ss_A 6 2.0000000
5) data.table. Using the data.table package:
library(data.table)
DT <- as.data.table(DF)
DT[, ratio := reads / reads[feature == "anot"], by = "gene"]
giving:
> DT
gene feature reads ratio
1: A anot 2 1.0000000
2: A 3ss_A 3 1.5000000
3: A 3ss_B 5 2.5000000
4: B 5ss_A 1 0.2500000
5: B anot 4 1.0000000
6: C 3ss_A 2 0.6666667
7: C 3ss_B 8 2.6666667
8: C anot 3 1.0000000
9: C 5ss_A 6 2.0000000
Note: The input, DF, is assumed to be a data frame holding the data shown in the question.
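The definition of DF did not survive in this copy of the answer. A minimal sketch of one way to rebuild it from the question's table is shown below. The use of read.table with a text string (here named Lines) is an assumption for illustration, not necessarily how the answer's author created it.

```r
# The question's data as a whitespace-separated text block.
Lines <- "gene feature reads
A anot 2
A 3ss_A 3
A 3ss_B 5
B 5ss_A 1
B anot 4
C 3ss_A 2
C 3ss_B 8
C anot 3
C 5ss_A 6"

# Parse the text into a data frame; the first line supplies the column names.
DF <- read.table(text = Lines, header = TRUE)
```

The dplyr output above lists gene and feature as factors. In R 4.0 and later, read.table returns character columns by default, so pass stringsAsFactors = TRUE if factor columns are wanted.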
Answer 1 (score: 0)
You can try something like:
anot_reads <- yourdata[yourdata$feature == "anot",]$reads
names(anot_reads) <- yourdata[yourdata$feature == "anot",]$gene
yourdata$ratio <- yourdata$reads / anot_reads[yourdata$gene]
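One caveat with the named-vector lookup: if gene is a factor, R indexes by the factor's integer codes rather than by its labels, so the result is only correct when the anot rows happen to appear in level order. The sketch below guards against that by converting to character first. The data is a hypothetical, deliberately shuffled example, and is_anot is a helper name introduced here.

```r
# Hypothetical data: rows not grouped by gene, and gene stored as a factor.
yourdata <- data.frame(
  gene    = factor(c("B", "A", "C", "A", "B", "C")),
  feature = c("anot", "3ss_A", "anot", "anot", "5ss_A", "3ss_A"),
  reads   = c(4, 3, 3, 2, 1, 2)
)

# Build a lookup vector of anot reads, named by gene label.
is_anot <- yourdata$feature == "anot"
anot_reads <- yourdata$reads[is_anot]
names(anot_reads) <- as.character(yourdata$gene[is_anot])

# Index by character so the lookup goes by name, not by factor code.
yourdata$ratio <- yourdata$reads / unname(anot_reads[as.character(yourdata$gene)])
```

With character gene columns the as.character calls are harmless, so this form should work for either column type.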
Answer 2 (score: 0)
In R you can use:
df$ratio <- unlist(sapply(levels(df$gene),
function(l) with(subset(df, gene==l), reads / reads[feature=="anot"])))
gene feature reads ratio
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
9 C 5ss_A 6 2.0000000
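In words: for each level of gene, subset df and divide reads by the reads value of the row where feature == "anot". Then unlist the results and store them in df$ratio.
There may be a shorter alternative, though.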
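The sapply version concatenates the per-gene results in level order, so it lines up with the rows only when df is already sorted by gene, and levels() requires gene to be a factor. A possible order-independent variant, sketched here with base R's split and unsplit on a hypothetical shuffled data frame, is:

```r
# Hypothetical data: rows deliberately not grouped by gene.
df <- data.frame(
  gene    = c("C", "A", "B", "A", "C", "B"),
  feature = c("anot", "3ss_A", "anot", "anot", "3ss_A", "5ss_A"),
  reads   = c(3, 3, 4, 2, 2, 1)
)

# Split the rows by gene and compute the ratios inside each piece.
pieces <- split(df, df$gene)
ratios <- lapply(pieces, function(d) d$reads / d$reads[d$feature == "anot"])

# Put the per-gene ratios back into the original row order.
df$ratio <- unsplit(ratios, df$gene)
```

Because unsplit reverses the grouping done by split, the ratios return to their original rows regardless of how the data is sorted.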