I am trying to count the occurrences of "Opp" in the iets column for each SNP name (eventually I want to divide the number of "Opp" occurrences by df$MM).
library(data.table)
df <- structure(list(SNP = structure(c(1L, 1L, 1L, 2L, 1L), .Label = c("rs80932150", "rs000001"), class = "factor"), FID = c(116601888L, 116621563L, 117253533L, 118635095L, 118943247L), IID = c(116601888L, 116621563L, 117253533L, 118635095L, 118943247L), NEW = structure(c(16L, 14L, 16L, 14L, 14L), .Label = c("A/A", "A/C", "A/G", "A/T", "C/A", "C/C", "C/G", "C/T", "G/A", "G/C", "G/G", "G/T", "T/A", "T/C", "T/G", "T/T"), class = "factor"), OLD = structure(c(6L, 6L, 6L, 6L, 6L), .Label = c("A/A", "A/C", "A/G", "A/T", "C/A", "C/C", "C/G", "C/T", "G/A", "G/C", "G/G", "G/T", "T/A", "T/C", "T/G", "T/T"), class = "factor"), count = c(1L, 1L, 1L, 1L, 1L), MM = c(4L, 4L, 4L, 1L, 4L), iets = c("Opp", "Het", "Opp", "Het", "Het")), .Names = c("SNP", "FID", "IID", "NEW", "OLD", "count", "MM", "iets"), class = "data.frame", row.names = c(NA, -5L))
setDT(df)
# SNP FID IID NEW OLD count MM iets
#1 rs80932150 116601888 116601888 T/T C/C 1 4 Opp
#2 rs80932150 116621563 116621563 T/C C/C 1 4 Het
#3 rs80932150 117253533 117253533 T/T C/C 1 4 Opp
#4 rs000001 118635095 118635095 T/C C/C 1 1 Het
#5 rs80932150 118943247 118943247 T/C C/C 1 4 Het
My expected result is as follows:
df
# SNP FID IID NEW OLD count MM iets oppcount percentage
#1: rs80932150 116601888 116601888 T/T C/C 1 4 Opp 2 0.5
#2: rs80932150 116621563 116621563 T/C C/C 1 4 Het 2 0.5
#3: rs80932150 117253533 117253533 T/T C/C 1 4 Opp 2 0.5
#4: rs000001 118635095 118635095 T/C C/C 1 1 Het 0 0.0
#5: rs80932150 118943247 118943247 T/C C/C 1 4 Het 2 0.5
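I have been trying something along the lines of the code below, but I can't figure out how to assign the occurrence counts to my oppcount / percentage columns.
First I have to count the number of "Opp" per SNP, and then divide that by MM.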
as.character((sum(df$iets == "Opp")/(df[,.N, by = df$SNP][[2]])))
#[1] "0.5" "2"
How can I count the number of "Opp" occurrences per SNP (category)?
Answer 0 (score: 4)
You can use data.table's := operator to update by reference. With:
df[, `:=` (oppcount = sum(iets=='Opp'), percentage = sum(iets=='Opp')/.N), by = SNP]
you get:
> df
SNP FID IID NEW OLD count MM iets oppcount percentage
1: rs80932150 116601888 116601888 T/T C/C 1 4 Opp 2 0.5
2: rs80932150 116621563 116621563 T/C C/C 1 4 Het 2 0.5
3: rs80932150 117253533 117253533 T/T C/C 1 4 Opp 2 0.5
4: rs000001 118635095 118635095 T/C C/C 1 1 Het 0 0.0
5: rs80932150 118943247 118943247 T/C C/C 1 4 Het 2 0.5
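The call above divides by .N, the number of rows per SNP, which happens to equal MM in the sample data. Since the question asks for division by MM, here is a minimal sketch of that variant. It assumes MM is constant within each SNP; the name dt and the reduced three-column table are only for illustration, with values copied from the question.

```r
library(data.table)

# Toy table: only the columns needed here, values copied from the question's df
dt <- data.table(
  SNP  = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  MM   = c(4L, 4L, 4L, 1L, 4L),
  iets = c("Opp", "Het", "Opp", "Het", "Het")
)

# Step 1: count "Opp" within each SNP; the single value is recycled to every row of the group
dt[, oppcount := sum(iets == "Opp"), by = SNP]

# Step 2: divide by MM row by row (assumption: MM does not vary within a SNP)
dt[, percentage := oppcount / MM]

print(dt)
```

If MM and the group size can differ in the real data, the two versions will give different percentages, so it is worth checking which denominator is intended.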
Or, as suggested by @Frank in the comments, you can also use one of the following two options:
# method 1
df[, c('oppcount', 'percentage') := {s = sum(iets=='Opp'); .(s, s/.N)}, by = SNP]
# method 2
df[df[, {s = sum(iets=='Opp'); .(oppcount = s, percentage = s/.N)}, by = SNP], on = 'SNP']
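Method 2 works differently from method 1: it first builds a one-row-per-SNP summary and then joins it back, returning a new table instead of modifying df in place. A small sketch of the two steps separately, using hypothetical names dt, agg and res and a reduced copy of the question's data:

```r
library(data.table)

# Toy table with values copied from the question (names dt, agg, res are illustrative)
dt <- data.table(
  SNP  = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  iets = c("Opp", "Het", "Opp", "Het", "Het")
)

# Step 1: one row per SNP with the count and the share of "Opp"
agg <- dt[, .(oppcount = sum(iets == "Opp"),
              percentage = mean(iets == "Opp")), by = SNP]

# Step 2: join the summary back onto the rows; dt itself is left unchanged
res <- dt[agg, on = "SNP"]

print(agg)
print(res)
```

Because the join returns a new object, the result has to be assigned (here to res) if it is needed later.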
A base R alternative:
# compare first, so that ave() returns numbers instead of character strings
transform(df,
oppcount = ave(iets == 'Opp', SNP, FUN = sum),
percentage = ave(iets == 'Opp', SNP, FUN = mean))
A proper dplyr alternative is:
library(dplyr)
df %>%
group_by(SNP) %>%
mutate(oppcount = sum(iets=='Opp'),
percentage = oppcount/n())
Answer 1 (score: 0)
# with() makes the columns iets and SNP of df visible inside the expression
rs8.oppcount<-with(df, length(iets[iets=='Opp' & SNP=='rs80932150']))
rs0.oppcount<-with(df, length(iets[iets=='Opp' & SNP=='rs000001']))
This stores the number of Opp occurrences for each SNP category!
Edit:
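library(dplyr)
# group by the bare column name, so the grouping column is called SNP and the merge key exists
df1<-group_by(df, SNP)
df2<-summarise(df1, oppcount = length(iets[iets=='Opp']))
df1<-merge(df1, df2, by = 'SNP')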
Does this help?
Answer 2 (score: 0)
How about using dplyr?
library('dplyr')
df %>% group_by(iets, SNP) %>% summarize(count=sum(count)) %>% filter(iets=='Opp')
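One thing to be aware of with the pipeline above: filtering on iets == 'Opp' drops any SNP that has no "Opp" rows at all, whereas the expected output in the question shows rs000001 with a count of 0. A possible sketch that keeps such SNPs, grouping by SNP only and summing a logical comparison (the names d and res and the reduced data frame are illustrative, with values copied from the question):

```r
library(dplyr)

# Toy data frame with values copied from the question (names d and res are illustrative)
d <- data.frame(
  SNP  = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  iets = c("Opp", "Het", "Opp", "Het", "Het"),
  stringsAsFactors = FALSE
)

# Group by SNP only and count the TRUE values, so SNPs without "Opp" get 0 instead of vanishing
res <- d %>%
  group_by(SNP) %>%
  summarise(oppcount = sum(iets == "Opp"))

print(res)
```

This gives one row per SNP; to attach the count to every original row, mutate() can be used in place of summarise().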