How do I export "methods" to a cluster in R?

Time: 2019-07-11 15:14:49

Tags: r parallel-processing

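I am trying to sort a large set of RNA sequences by how they are spliced, first by donor and then by acceptor. Run serially, my approach works, but when I try to parallelize the process with functions from the parallel package, I get this error: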

Error in checkForRemoteErrors(val) : 
  4 nodes produced errors; first error: no method for coercing this S4 class to a vector

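I understand that the first part of the error means all 4 cores of my laptop hit this error, but beyond that I can only guess that something was not exported to the cluster along with my function. A simplified version of my code is below: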

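library(parallel)    # detectCores, makeCluster, clusterExport, parLapply, parSapply
library(Biostrings)  # DNAString, matchPattern

numcores <- detectCores()
cl <- makeCluster(numcores)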

fasta.file <- list("---cg-a-ct---g--atggtactg",
                   "--cg-a-ct---g--atggcactg-",
                   "cg-a-ct---g--atggtactg-gc",
                   "agt-gt-tcacccggtga--aattg")
#Just a list of DNA sequences with gaps inserted. In reality, I am reading a fasta file
#with 94k+ sequences of length 1000.

filelength <- length(fasta.file)

clusterExport(cl,c("DNAString",
                   "fasta.file",
                   "matchPattern"))
#^Exporting the list of sequences, as well as two functions I will need.

UnknownDonor <- parLapply(cl,
                          1:filelength, 
                          function(i) DNAString(
                                                gsub(
                                                     "-",
                                                     "",
                                                     fasta.file[[i]]
                                                    ),
                                                start = 1,
                                                nchar = nchar(
                                                              gsub(
                                                                   "-", "", fasta.file[[i]]
                                                                   )
                                                              )
                                                )
                         )
#^The sequences I received are aligned, so I am removing gaps, represented by "-". 
#Then I convert each sequence to object class "DNAString", from their first character, 
#after removing gaps, to their last. This is so they can be sorted later. So far no 
#errors.

UnknownDonors <- length(UnknownDonor)

clusterExport(cl,c("UnknownDonor","UnknownDonors"))
#Exporting new variables to be used by the cluster.

D1map <- parSapply(cl,
                   1:UnknownDonors,
                   function(i) length(
                                      nchar(
                                            matchPattern("CGACTG",
                                                         UnknownDonor[[i]], 
                                                         max.mismatch = 0,
                                                         min.mismatch = 0, 
                                                         with.indels = TRUE,
                                                         fixed = TRUE, 
                                                         algorithm = "auto"
                                                        )
                                            )
                                     )
                  )
#The code for D1map is supposed to produce a binary vector map of list UnknownDonor,
#saying which positions in the list have the exact sequence "CGACTG", a sequence known
#to only come immediately before the donor splice site D1. It does this when running
#sapply, but not parSapply.
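
For reference, the serial computation described in the comments above can be sketched roughly as follows. This is a simplified reconstruction rather than my exact code: it assumes Biostrings is installed from Bioconductor, drops the optional matchPattern arguments, and counts matches with length() directly.

```r
library(Biostrings)  # assumption: installed from Bioconductor

fasta.file <- list("---cg-a-ct---g--atggtactg",
                   "--cg-a-ct---g--atggcactg-",
                   "cg-a-ct---g--atggtactg-gc",
                   "agt-gt-tcacccggtga--aattg")

# Strip the alignment gaps and convert each sequence to a DNAString.
UnknownDonor <- lapply(fasta.file, function(s) DNAString(gsub("-", "", s)))

# Count exact occurrences of the D1 marker "CGACTG" in each sequence.
D1map <- sapply(UnknownDonor, function(x) length(matchPattern("CGACTG", x)))
D1map
```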

Given the data I provided, my expected result is that D1map is assigned the vector [1,1,1,0]. Instead, I get the error above. I have tried:

A. clusterExport(cl,"Biostrings::matchPattern")

B.

D1map <- parSapply(cl,
                   1:UnknownDonors,
                   function(i) length(
                                      nchar(
                                            Biostrings::matchPattern("CGACTG",
                                                        UnknownDonor[[i]], 
                                                        max.mismatch = 0,
                                                        min.mismatch = 0, 
                                                        with.indels = TRUE,
                                                        fixed = TRUE, 
                                                        algorithm = "auto"
                                                                        )
                                            )
                                     )
                  ) 

C. importMethodsFrom(Biostrings, matchPattern)

A. resulted in: object "Biostrings::matchPattern" not found

B. resulted in the same error as my original code.

C. resulted in: function importMethodsFrom not found.
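
One approach not covered by the attempts above is to load the package on each worker instead of exporting individual functions. A likely cause of the error is that clusterExport copies the function objects but not the S4 classes and methods registered by Biostrings, so the workers cannot dispatch on DNAString objects. The following is a minimal sketch of that idea, not a confirmed fix: it assumes Biostrings is installed for the R used by every worker, and it uses 2 workers purely for illustration.

```r
library(parallel)

cl <- makeCluster(2)  # worker count chosen only for illustration

# Attach Biostrings on every worker so its S4 classes and methods
# are registered there (assumption: the package is installed on each node).
invisible(clusterEvalQ(cl, library(Biostrings)))

fasta.file <- list("---cg-a-ct---g--atggtactg",
                   "--cg-a-ct---g--atggcactg-",
                   "cg-a-ct---g--atggtactg-gc",
                   "agt-gt-tcacccggtga--aattg")

# Each worker strips the gaps, builds the DNAString, and counts exact
# matches of "CGACTG", so only plain strings and integers cross the wire.
D1map <- parSapply(cl, fasta.file, function(s) {
  x <- DNAString(gsub("-", "", s))
  length(matchPattern("CGACTG", x))
})

stopCluster(cl)
D1map
```

With the package attached on the workers, DNAString and matchPattern no longer need to be passed to clusterExport; only data objects such as fasta.file would still need exporting if the worker function referred to them by name.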

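At this point, running my code serially is not an option. Running 9.4k sequences takes 9.5 minutes on my laptop. Extrapolating linearly, it would take 26.5 days to classify a file of 400,000,000 unaligned sequences. I realize I have just posted a lot of code; I hope my comments help. I tried to make the code as readable as possible. If anything needs clarification, or if you think my question should be reorganized, please let me know.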

0 answers:

No answers