Parallel processing to count polymorphic SNPs in pairwise line comparisons

Time: 2018-04-09 18:33:35

Tags: r

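I am trying to count the number of polymorphic SNPs between pairs of lines, and I am running into limits on the computing resources needed to answer the question. Conceptually I know this problem can (and should) be solved with parallel processing, but I am struggling to work out how to write it as a parallel job. I have not found a parallel processing question like this one. Thanks in advance for any suggestions.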

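Basically, I am trying to compare SNPs between pairs of lines: Line 1 against Lines 2, 3, ..., 7; then Line 2 against Lines 3, 4, ..., 7; and so on. That makes n(n-1)/2 comparisons. For each SNP, if the two lines being compared have the same call (both AA, both AB, or both BB), the pair is not polymorphic at that SNP. If either line has 'NC' at a SNP, that SNP is dropped from the calculation. So, comparing Line 1 and Line 2: there is 1 matching SNP, 2 "NC" SNPs, and 2 polymorphic SNPs (2 = 5 - (1 + 2)).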
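The counting rule above can be checked by hand on Line 1 and Line 2 of the toy data. This is only a minimal sketch in base R; the variable names (`line1`, `line2`, `no_call`, `matched`, `polymorphic`) are made up for illustration.

```r
# Genotype calls for Line1 and Line2, copied from the toy data set
line1 <- c(SNP1 = "AA", SNP2 = "BB", SNP3 = "AA", SNP4 = "NC", SNP5 = "BB")
line2 <- c(SNP1 = "AA", SNP2 = "AA", SNP3 = "NC", SNP4 = "NC", SNP5 = "AA")

# A SNP is dropped when either line has no call
no_call <- line1 == "NC" | line2 == "NC"
# A SNP matches when both lines carry the same call and neither is NC
matched <- !no_call & line1 == line2
# The remaining SNPs are polymorphic between the two lines
polymorphic <- !no_call & line1 != line2

counts <- c(NC = sum(no_call), match = sum(matched), polymorphic = sum(polymorphic))
print(counts)
```

The three counts always add up to the number of SNPs, which is why the polymorphic count can also be obtained as total - match - NC.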

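I tried using foreach to speed up the for loop, but I must be doing something wrong, because it takes longer to finish.

I also tried writing the code as a function and then calling the function, which improved the speed slightly.

Here is a toy data set with 7 lines and 5 SNPs; the real data set has 1000 SNPs and hundreds of lines.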

Line    SNP1    SNP2    SNP3    SNP4    SNP5
Line1   AA  BB  AA  NC  BB
Line2   AA  AA  NC  NC  AA
Line3   BB  AB  NC  BB  AA
Line4   NC  BB  AB  NC  BB
Line5   AA  AA  BB  AB  AA
Line6   NC  NC  AA  AA  NC
Line7   BB  AA  AA  NC  BB
Code so far, written with a colleague's help:

#load in the snps
snps <- read.csv("data.csv", header = TRUE, stringsAsFactors = FALSE)

#create all combinations first
#this is a built-in function that will spit out every combination. Just give it the line names twice.
#remove combinations with matching lines
test <- expand.grid(lineA = snps$Line, lineB = snps$Line) 
test <- test[which(test$lineA!=test$lineB),] 
test <- test[order(test$lineA),]
test <- test[!duplicated(t(apply(test, 1, sort))),]

#create empty columns to be populated
test["NC"]          <- NA
test["match"]       <- NA
test["polymorphic"] <- NA

#get the total number of snps so we can count polymorphic loci for each combo
snp_total_count <- ncol(snps)-1

for (i in 1:nrow(test))   
{
  #get the lines you are going to compare
  lineA <- which(snps$Line==test$lineA[i])
  lineB <- which(snps$Line==test$lineB[i])

  #find the matches not counting NC's 
  test$match[i] <- length(which(snps[lineA,]!="NC" & snps[lineA,]==snps[lineB,]))

  #find the NCs/- cases so paired NC's or single NC's. can't tell polymorphic state or not. count all together 
  #1st count positions in which both lineA and lineB are NC, 
  #then count positions in which only lineA is "NC" (lineA = NC and does not equal LineB) and 
  #then count positions in which only lineB is "NC"(lineB = NC and does not equal LineA) 
  #then add all 3 values together
  test$NC[i] <- length(which(snps[lineA,]=="NC" & snps[lineA,]==snps[lineB,])) + length(which(snps[lineA,]=="NC" & snps[lineA,]!=snps[lineB,])) + length(which(snps[lineB,]=="NC" & snps[lineA,]!=snps[lineB,]))

  #calculate # polymorphic SNPs = total - matching - NC snps 
  test$polymorphic[i] <- snp_total_count - (test$NC[i]+ test$match[i])
}

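2 Answers:

Answer 0 (score: 0)

If data.table + foreach is an option, doMC with multiple cores may give a significant speedup. Below is a simple example; you will need to add the specific conditions for handling NC values. Set cores in registerDoMC to the number of cores you have available.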

library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)

dt <- data.table(Line=paste("Line", 1:100, sep=""), 
                SNP1=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
                SNP2=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
                SNP3=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
                SNP4=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE)
                )

Looking at head(dt):

    Line SNP1 SNP2 SNP3 SNP4
1: Line1   AC   BC   AB   AB
2: Line2   BC   BB   AA   AC
3: Line3   AB   BB   AA   AC
4: Line4   BC   BC   AC   BC
5: Line5   AB   AA   BB   AA
6: Line6   AB   AB   CC   AC

Continuing...

snpCols <- colnames(dt)[2:length(colnames(dt))]

results <- foreach(index.1 = 1:dim(dt)[1], .combine="rbind") %dopar% {
                row1 <- dt[index.1]
                foreach(index.2 = index.1:dim(dt)[1], .combine="rbind") %do% {
                    row2 <- dt[index.2]
                    # do operations / return final data.table object that has values containing column values you want
                    return(data.table("lineX"=row1$Line, 
                                      "lineY"=row2$Line,
                                      "nMatches"=sum(row1[,snpCols, with=FALSE] == row2[,snpCols, with=FALSE])
                                      )
                          )
    }
}

This produces the object results:

        lineX   lineY nMatches
   1:   Line1   Line1        4
   2:   Line1   Line2        0
   3:   Line1   Line3        0
   4:   Line1   Line4        1
   5:   Line1   Line5        0
  ---
5046:  Line98  Line99        0
5047:  Line98 Line100        0
5048:  Line99  Line99        4
5049:  Line99 Line100        0
5050: Line100 Line100        4

Note that this also compares every line with itself; you can keep or drop those rows depending on what you need.
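The example above leaves the NC handling to the reader. One way to fill it in is a small per-pair helper, sketched here in base R. The name `compare_pair` and its return layout are assumptions for illustration, not part of the answer's code.

```r
# Hypothetical helper: a and b are character vectors of genotype calls
# for the same SNPs in two different lines
compare_pair <- function(a, b) {
  no_call <- a == "NC" | b == "NC"   # either line lacks a call
  n_nc    <- sum(no_call)
  n_match <- sum(a == b & !no_call)  # same call, NC excluded
  c(NC = n_nc, match = n_match, polymorphic = length(a) - n_nc - n_match)
}

# Line1 and Line7 from the question's toy data
line1 <- c("AA", "BB", "AA", "NC", "BB")
line7 <- c("BB", "AA", "AA", "NC", "BB")
res <- compare_pair(line1, line7)
print(res)

# Inside the inner foreach body, the call might look like this (untested here):
# compare_pair(unlist(row1[, snpCols, with = FALSE]),
#              unlist(row2[, snpCols, with = FALSE]))
```

To drop the self-comparisons afterwards, filtering the data.table with `results[lineX != lineY]` should be enough.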

Answer 1 (score: 0)

To get the matching SNPs, use d[LineX, ] == d[LineY, ]; to get the NC SNPs, use d[LineX, ] == "NC" | d[LineY, ] == "NC". To run this in parallel you can use future, which supports foreach parallelization through doFuture.

library(doFuture)
registerDoFuture()
plan(multisession)  # multiprocess is deprecated in recent releases of future

N <- nrow(d)
d$Line <- NULL

result <- foreach(i = 1:(N - 1), .combine = rbind) %do% {
    foreach(j = (i + 1):N, .combine = rbind) %dopar% {
        data.frame(
            NC = sum(d[i, ] == "NC" | d[j, ] == "NC"),
            MATCH = sum(d[i, ] == d[j, ] & d[i, ] != "NC"),
            I = i, J = j)
    }
}

Data (d):

structure(list(Line = c("Line1", "Line2", "Line3", "Line4", "Line5", 
"Line6", "Line7"), SNP1 = c("AA", "AA", "BB", "NC", "AA", "NC", 
"BB"), SNP2 = c("BB", "AA", "AB", "BB", "AA", "NC", "AA"), SNP3 = c("AA", 
"NC", "NC", "AB", "BB", "AA", "AA"), SNP4 = c("NC", "NC", "BB", 
"NC", "AB", "AA", "NC"), SNP5 = c("BB", "AA", "AA", "BB", "AA", 
"NC", "BB")), .Names = c("Line", "SNP1", "SNP2", "SNP3", "SNP4", 
"SNP5"), row.names = c(NA, -7L), class = "data.frame")

Result (result):

   NC MATCH I J
1   2     1 1 2
2   2     0 1 3
3   2     2 1 4
4   1     1 1 5
5   4     1 1 6
6   1     2 1 7
7   2     1 2 3
8   3     0 2 4
9   2     3 2 5
10  5     0 2 6
...
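The result holds the NC and MATCH counts but not the polymorphic count the question asks for. It follows by subtraction from the total number of SNPs. A short sketch, using the first six rows printed above; the column name `POLYMORPHIC` and the variable `n_snps` are assumptions added for illustration.

```r
# First six rows of `result` as printed above (toy data with 5 SNPs)
result <- data.frame(
  NC    = c(2, 2, 2, 1, 4, 1),
  MATCH = c(1, 0, 2, 1, 1, 2),
  I     = 1,
  J     = 2:7
)

# In the answer's code this would be ncol(d), since d$Line has been removed
n_snps <- 5

# polymorphic = total - NC - matching
result$POLYMORPHIC <- n_snps - result$NC - result$MATCH
print(result)
```

For Line 1 versus Line 2 this gives 2 polymorphic SNPs, which agrees with the hand count in the question.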