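I am trying to count the number of polymorphic SNPs between pairs of lines, and I am running into problems with the computational resources needed to answer the question. Conceptually, I know this problem can (and should) be solved with parallel processing, but I am struggling to work out how to write it as a parallel processing problem. I have not found a parallel processing question like this one. Thanks in advance for any suggestions.
Basically, I am trying to compare SNPs between pairs of lines: Line 1 against Lines 2, 3, ..., 7; then Line 2 against Lines 3, 4, ..., 7; and so on. That is n(n-1)/2 comparisons. For each SNP, if the two lines being compared match (both AA, both AB, or both BB), the lines are not polymorphic for that SNP. If either line has 'NC' at a SNP, that SNP is removed from the calculation. So, comparing Line 1 and Line 2: there is 1 matching SNP, 2 'NC' SNPs, and 2 polymorphic SNPs (2 = 5 - (1 + 2)).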
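To make the counting rule concrete, here is a minimal sketch for the Line 1 vs Line 2 comparison only. The variable names (line1, line2, is_nc, is_match) are made up for illustration, and the genotype calls are copied by hand from the toy dataset.

```r
# Genotype calls for Line1 and Line2, copied from the toy dataset
line1 <- c(SNP1 = "AA", SNP2 = "BB", SNP3 = "AA", SNP4 = "NC", SNP5 = "BB")
line2 <- c(SNP1 = "AA", SNP2 = "AA", SNP3 = "NC", SNP4 = "NC", SNP5 = "AA")

# A SNP is excluded when either line has "NC"
is_nc <- line1 == "NC" | line2 == "NC"
# A SNP matches when neither line is "NC" and both calls are identical
is_match <- !is_nc & line1 == line2

n_nc <- sum(is_nc)
n_match <- sum(is_match)
# Polymorphic = total SNPs - matching SNPs - NC SNPs
n_polymorphic <- length(line1) - n_match - n_nc

# Number of pairwise comparisons for 7 lines: n(n-1)/2
n_pairs <- choose(7, 2)
```

For this pair the sketch gives 2 NC SNPs, 1 match, and 2 polymorphic SNPs, in agreement with the hand count above; 7 lines give 21 pairs.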
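I tried using foreach to speed up the for loop, but I must be doing something wrong, because it took longer to finish.
I also tried writing the code as a function and then calling that function, which improved the speed slightly.
Here is a toy dataset with 7 lines and 5 SNPs; in reality, the dataset has 1000 SNPs and hundreds of lines.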
Line SNP1 SNP2 SNP3 SNP4 SNP5
Line1 AA BB AA NC BB
Line2 AA AA NC NC AA
Line3 BB AB NC BB AA
Line4 NC BB AB NC BB
Line5 AA AA BB AB AA
Line6 NC NC AA AA NC
Line7 BB AA AA NC BB
Here is the code so far, written with help from a colleague:
#load in the snps
snps <-read.csv("data.csv", header=T, stringsAsFactors = F)
#create all combinations first
#expand.grid is a built-in function that will spit out every combination. Just give it the line names twice.
test <- expand.grid(lineA = snps$Line, lineB = snps$Line)
#remove combinations with matching lines
test <- test[which(test$lineA!=test$lineB),]
test <- test[order(test$lineA),]
#drop mirrored duplicates, e.g. keep only one of (Line1, Line2) and (Line2, Line1)
test <- test[!duplicated(t(apply(test, 1, sort))),]
#create empty columns to be populated
test["NC"] <- NA
test["match"] <- NA
test["polymorphic"] <- NA
#get the total number of snps so we can count polymorphic loci for each combo
snp_total_count <- ncol(snps)-1
for (i in 1:nrow(test))
{
  #get the lines you are going to compare
  lineA <- which(snps$Line==test$lineA[i])
  lineB <- which(snps$Line==test$lineB[i])
  #find the matches not counting NC's
  test$match[i] <- length(which(snps[lineA,]!="NC" & snps[lineA,]==snps[lineB,]))
  #find the NCs/- cases so paired NC's or single NC's. can't tell polymorphic state or not. count all together
  #1st count positions in which both lineA and lineB are NC,
  #then count positions in which only lineA is "NC" (lineA = NC and does not equal LineB) and
  #then count positions in which only lineB is "NC"(lineB = NC and does not equal LineA)
  #then add all 3 values together
  test$NC[i] <- length(which(snps[lineA,]=="NC" & snps[lineA,]==snps[lineB,])) +
    length(which(snps[lineA,]=="NC" & snps[lineA,]!=snps[lineB,])) +
    length(which(snps[lineB,]=="NC" & snps[lineA,]!=snps[lineB,]))
  #calculate # polymorphic SNPs = total - matching - NC snps
  test$polymorphic[i] <- snp_total_count - (test$NC[i]+ test$match[i])
}
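Answer 0 (score: 0)
If you can use them, data.table + foreach, doMC, and multiple cores may improve the speed significantly. Below is a simple example; you will need to add the specific conditions for how NC values should be handled. Set cores within registerDoMC to the number of cores you have available.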
library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)
dt <- data.table(Line=paste("Line", 1:100, sep=""),
SNP1=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
SNP2=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
SNP3=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
SNP4=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE)
)
Looking at head(dt):
Line SNP1 SNP2 SNP3 SNP4
1: Line1 AC BC AB AB
2: Line2 BC BB AA AC
3: Line3 AB BB AA AC
4: Line4 BC BC AC BC
5: Line5 AB AA BB AA
6: Line6 AB AB CC AC
Moving on...
snpCols <- colnames(dt)[2:length(colnames(dt))]
results <- foreach(index.1 = 1:dim(dt)[1], .combine="rbind") %dopar% {
  row1 <- dt[index.1]
  foreach(index.2 = index.1:dim(dt)[1], .combine="rbind") %do% {
    row2 <- dt[index.2]
    # do operations / return final data.table object that has values containing column values you want
    return(data.table("lineX"=row1$Line,
                      "lineY"=row2$Line,
                      "nMatches"=sum(row1[,snpCols, with=FALSE] == row2[,snpCols, with=FALSE])
                      )
           )
  }
}
This produces the object results:
lineX lineY nMatches
1: Line1 Line1 4
2: Line1 Line2 0
3: Line1 Line3 0
4: Line1 Line4 1
5: Line1 Line5 0
---
5046: Line98 Line99 0
5047: Line98 Line100 0
5048: Line99 Line99 4
5049: Line99 Line100 0
5050: Line100 Line100 4
Note that this also compares each line with itself; you can keep or remove those rows depending on what you need.
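The example above leaves the NC handling open. As a rough sketch of what that per-pair logic might look like, the base R code below counts NC, matching, and polymorphic SNPs for every pair with i < j, so no line is compared with itself. The helper name count_pair and the output column names are hypothetical, and the loop is deliberately sequential (lapply over combn) so the logic can be checked on the toy data before moving it into a foreach body.

```r
# Toy data from the question (7 lines, 5 SNPs)
snps <- data.frame(
  Line = paste0("Line", 1:7),
  SNP1 = c("AA", "AA", "BB", "NC", "AA", "NC", "BB"),
  SNP2 = c("BB", "AA", "AB", "BB", "AA", "NC", "AA"),
  SNP3 = c("AA", "NC", "NC", "AB", "BB", "AA", "AA"),
  SNP4 = c("NC", "NC", "BB", "NC", "AB", "AA", "NC"),
  SNP5 = c("BB", "AA", "AA", "BB", "AA", "NC", "BB"),
  stringsAsFactors = FALSE
)

# Character matrix of genotype calls, one row per line
geno <- as.matrix(snps[, -1])
rownames(geno) <- snps$Line

# count_pair is a hypothetical helper: counts for rows i and j of geno
count_pair <- function(i, j, geno) {
  a <- geno[i, ]
  b <- geno[j, ]
  nc <- a == "NC" | b == "NC"   # either call missing
  match <- !nc & a == b         # same non-NC genotype
  data.frame(
    lineX = rownames(geno)[i],
    lineY = rownames(geno)[j],
    nNC = sum(nc),
    nMatches = sum(match),
    nPolymorphic = length(a) - sum(nc) - sum(match),
    stringsAsFactors = FALSE
  )
}

# All unordered pairs i < j, one pair per column
pairs <- combn(nrow(geno), 2)
results <- do.call(rbind, lapply(seq_len(ncol(pairs)), function(k) {
  count_pair(pairs[1, k], pairs[2, k], geno)
}))
```

Converting the SNP columns to a character matrix once, outside the loop, avoids repeated data frame row subsetting, which is likely to matter more as the number of lines grows.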
Answer 1 (score: 0)
To get the matching SNPs, use d[LineX, ] == d[LineY, ]; to get the NC SNPs, use d[LineX, ] == "NC" | d[LineY, ] == "NC". To run it in parallel, you can use future, which supports foreach parallelization (via doFuture).
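library(doFuture)
registerDoFuture()
# plan(multiprocess) is no longer available in recent versions of future;
# multisession runs the workers in background R sessions
plan(multisession)
N <- nrow(d)
# drop the Line column so only SNP columns are compared
d$Line <- NULL
result <- foreach(i = 1:(N - 1), .combine = rbind) %do% {
  foreach(j = (i + 1):N, .combine = rbind) %dopar% {
    data.frame(
      NC = sum(d[i, ] == "NC" | d[j, ] == "NC"),
      MATCH = sum(d[i, ] == d[j, ] & d[i, ] != "NC"),
      I = i, J = j)
  }
}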
Data (d):
structure(list(Line = c("Line1", "Line2", "Line3", "Line4", "Line5",
"Line6", "Line7"), SNP1 = c("AA", "AA", "BB", "NC", "AA", "NC",
"BB"), SNP2 = c("BB", "AA", "AB", "BB", "AA", "NC", "AA"), SNP3 = c("AA",
"NC", "NC", "AB", "BB", "AA", "AA"), SNP4 = c("NC", "NC", "BB",
"NC", "AB", "AA", "NC"), SNP5 = c("BB", "AA", "AA", "BB", "AA",
"NC", "BB")), .Names = c("Line", "SNP1", "SNP2", "SNP3", "SNP4",
"SNP5"), row.names = c(NA, -7L), class = "data.frame")
Result (result):
NC MATCH I J
1 2 1 1 2
2 2 0 1 3
3 2 2 1 4
4 1 1 1 5
5 4 1 1 6
6 1 2 1 7
7 2 1 2 3
8 3 0 2 4
9 2 3 2 5
10 5 0 2 6
...
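The result above has NC and MATCH but not the polymorphic count, which by the question's definition is the SNP total minus both. As an aside, the same three counts can also be computed without an explicit pair loop by using matrix cross-products on indicator matrices. This is a different technique from the foreach approach in this answer, offered only as a sketch; all names below (called, both_called, match_count, result_vec) are made up for illustration.

```r
# Toy genotype matrix from the question, one row per line
geno <- rbind(
  Line1 = c("AA", "BB", "AA", "NC", "BB"),
  Line2 = c("AA", "AA", "NC", "NC", "AA"),
  Line3 = c("BB", "AB", "NC", "BB", "AA"),
  Line4 = c("NC", "BB", "AB", "NC", "BB"),
  Line5 = c("AA", "AA", "BB", "AB", "AA"),
  Line6 = c("NC", "NC", "AA", "AA", "NC"),
  Line7 = c("BB", "AA", "AA", "NC", "BB")
)
n_lines <- nrow(geno)
n_snps <- ncol(geno)

# called[i, s] is 1 when line i has a usable (non-NC) call at SNP s
called <- (geno != "NC") * 1
# both_called[i, j]: number of SNPs where lines i and j are both non-NC
both_called <- tcrossprod(called)

# match_count[i, j]: SNPs where both lines carry the same non-NC genotype,
# accumulated one genotype at a time
match_count <- matrix(0, n_lines, n_lines)
for (g in setdiff(unique(as.vector(geno)), "NC")) {
  match_count <- match_count + tcrossprod((geno == g) * 1)
}

nc_count <- n_snps - both_called          # at least one line is NC
poly_count <- both_called - match_count   # both called, genotypes differ

# Keep each unordered pair once (i < j), ordered by i then j
idx <- which(upper.tri(match_count), arr.ind = TRUE)
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]
result_vec <- data.frame(
  NC = nc_count[idx],
  MATCH = match_count[idx],
  POLY = poly_count[idx],
  I = unname(idx[, "row"]),
  J = unname(idx[, "col"])
)
```

On the toy data, the NC and MATCH columns of result_vec agree with the first ten rows of result shown above. Whether this is faster than the parallel loop on the real dataset has not been measured here.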