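I have a question similar to the one in this post: Using R, replace all values in a matrix <0.1 with 0?
In my case, though, the dataset is presumably larger and the threshold varies. I need to build a data frame in which every value is picked by a condition that uses the values in the first columns of the same data frame. These values differ from row to row.
Here is an example of the data frame: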
SNP         A1  A2  MAF
rs3094315   G   A   0.172
rs7419119   G   T   0.240
rs13302957  G   A   0.081
rs6696609   T   C   0.393
Here is an example of my code:
seqIndividuals = seq(1:201)
for(i in seqIndividuals) {
  alFrequ[paste("IND",i,"a",sep="")] = ifelse(runif(length(alFrequ$SNP),0.00,1.00) < alFrequ$MAF,
                                              alFrequ$A1, alFrequ$A2)
  alFrequ[paste("IND",i,"b",sep="")] = ifelse(runif(length(alFrequ$SNP),0.00,1.00) < alFrequ$MAF,
                                              alFrequ$A1, alFrequ$A2)
}
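For each "i" in "seqIndividuals" I create two new columns. Each cell takes the value from column "A1" if a random value is below column "MAF", and the value from "A2" otherwise. The code works fine, but the run time grows significantly as the dataset grows in rows and columns (individuals).
Is there a way to avoid ifelse here, since as far as I know it behaves like a loop? I tried generating a matrix of random values and then replacing them, but that takes the same time or even longer.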
mtxAlFrequ = matrix(runif(length(alFrequ$SNP)*(201)),nrow=length(alFrequ$SNP),ncol=201)
mtxAlFrequ[mtxAlFrequ < alFrequ$MAF] = alFrequ$A1
Thanks!
Answer 0 (score: 0)
One option is data.table:
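library(data.table)
# Column names for the two alleles of each individual: INDa1, INDb1, INDa2, ...
nm1 <- paste0("IND", rep(letters[1:2], length(seqIndividuals)),
              rep(seqIndividuals, each = 2))
setDT(alFrequ)  # convert to data.table by reference
for(j in seq_along(nm1)) {
  # Fill the new column with A2, then overwrite with A1 where the draw is below MAF
  alFrequ[, nm1[j] := A2
          ][runif(.N, 0, 1) < MAF , nm1[j] := A1][]
}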
Benchmark on a larger simulated dataset (340,000 rows):
set.seed(24)
alFrequ <- data.frame(SNP = paste0('rs', sample(600000, 340000, replace=FALSE)),
                      A1 = sample(c("G", "T", "A", "C"), 340000, replace=TRUE),
                      A2 = sample(c("G", "T", "A", "C"), 340000, replace=TRUE),
                      MAF = runif(340000, 0, 1), stringsAsFactors=FALSE)
nm1 <- paste0("IND", rep(letters[1:2], length(seqIndividuals)),
              rep(seqIndividuals, each = 2))
system.time({
  setDT(alFrequ)
  for(j in seq_along(nm1)){
    alFrequ[, nm1[j] := A2][runif(.N, 0, 1) < MAF , nm1[j] := A1][]
  }
})
# user system elapsed
# 10.72 1.05 11.76
And using the OP's code on the original dataset:
system.time({
  for(i in seqIndividuals) {
    alFrequ[paste("IND",i,"a",sep="")] = ifelse(runif(length(alFrequ$SNP),0.00,1.00) <
                                                alFrequ$MAF, alFrequ$A1, alFrequ$A2)
    alFrequ[paste("IND",i,"b",sep="")] = ifelse(runif(length(alFrequ$SNP),0.00,1.00) <
                                                alFrequ$MAF, alFrequ$A1, alFrequ$A2)
  }
})
# user system elapsed
# 72.16 6.82 79.33
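As a side note, the matrix idea from the question can probably be made to work in base R as well. In the question's attempt, the assignment `mtxAlFrequ[mtxAlFrequ < alFrequ$MAF] = alFrequ$A1` hands out the A1 values in sequence to the selected cells, so a cell does not necessarily receive the A1 of its own row. The sketch below is one possible fix, not a benchmarked solution: it uses only the four example rows from the question, and the names `nInd`, `draws`, `useA1`, `alleles`, `hit` and `result` are made up for illustration. It has not been timed against the loops above, and at 340,000 rows the character matrix may need a lot of memory.

```r
# Toy data copied from the example rows in the question.
alFrequ <- data.frame(
  SNP = c("rs3094315", "rs7419119", "rs13302957", "rs6696609"),
  A1  = c("G", "G", "G", "T"),
  A2  = c("A", "T", "A", "C"),
  MAF = c(0.172, 0.240, 0.081, 0.393),
  stringsAsFactors = FALSE
)

nInd  <- 201               # number of individuals, as in the question
nSnp  <- nrow(alFrequ)
nCols <- 2 * nInd          # two alleles ("a" and "b") per individual

set.seed(1)                # arbitrary seed, only for reproducibility

# One uniform draw per SNP and allele column. The matrix is stored column by
# column and has nSnp rows, so comparing it with MAF recycles MAF down each
# column: row k is always compared with MAF[k].
draws <- matrix(runif(nSnp * nCols), nrow = nSnp, ncol = nCols)
useA1 <- draws < alFrequ$MAF

# Start with A2 in every column, then overwrite the selected cells with the
# A1 value of the row each cell sits in.
alleles <- matrix(alFrequ$A2, nrow = nSnp, ncol = nCols)
hit <- which(useA1, arr.ind = TRUE)   # row/col position of every selected cell
alleles[useA1] <- alFrequ$A1[hit[, "row"]]

# Same naming scheme as the question: IND1a, IND1b, IND2a, ...
colnames(alleles) <- paste0("IND", rep(seq_len(nInd), each = 2), c("a", "b"))

result <- cbind(alFrequ, as.data.frame(alleles, stringsAsFactors = FALSE))
```

The key point is that both the comparison and the replacement are indexed by row, so each cell only ever sees the MAF, A1 and A2 of its own SNP.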