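I am trying to sort a large set of RNA sequences by splicing pattern, first by donor and then by acceptor. Run serially, my approach works, but when I try to parallelize the process with functions from the parallel package, I get this error: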
Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: no method for coercing this S4 class to a vector
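I understand that the first part of the error says all 4 cores of my laptop hit this error, but beyond that I can only guess that something was not exported to the cluster along with my functions. A simplified version of my code is below: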
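library(parallel)    # detectCores, makeCluster, clusterExport, parLapply, parSapply
library(Biostrings)  # DNAString, matchPattern
numcores <- detectCores()
cl <- makeCluster(numcores)
fasta.file <- list("---cg-a-ct---g--atggtactg",
                   "--cg-a-ct---g--atggcactg-",
                   "cg-a-ct---g--atggtactg-gc",
                   "agt-gt-tcacccggtga--aattg")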
#Just a list of DNA sequences with gaps inserted. In reality, I am reading a fasta file
#with 94k+ sequences of length 1000.
filelength <- length(fasta.file)
clusterExport(cl, c("DNAString",
                    "fasta.file",
                    "matchPattern"))
#^Exporting the list of sequences, as well as two functions I will need.
UnknownDonor <- parLapply(cl,
                          1:filelength,
                          function(i) DNAString(
                            gsub("-", "", fasta.file[[i]]),
                            start = 1,
                            nchar = nchar(gsub("-", "", fasta.file[[i]]))
                          )
)
#^The sequences I received are aligned, so I am removing gaps, represented by "-".
#Then I convert each sequence to object class "DNAString", from their first character,
#after removing gaps, to their last. This is so they can be sorted later. So far no
#errors.
UnknownDonors <- length(UnknownDonor)
clusterExport(cl,c("UnknownDonor","UnknownDonors"))
#Exporting new variables to be used by the cluster.
D1map <- parSapply(cl,
                   1:UnknownDonors,
                   function(i) length(
                     nchar(
                       matchPattern("CGACTG",
                                    UnknownDonor[[i]],
                                    max.mismatch = 0,
                                    min.mismatch = 0,
                                    with.indels = TRUE,
                                    fixed = TRUE,
                                    algorithm = "auto"
                       )
                     )
                   )
)
#The code for D1map is supposed to produce a binary vector map of list UnknownDonor,
#saying which positions in the list have the exact sequence "CGACTG", a sequence known
#to only come immediately before the donor splice site D1. It does this when running
#sapply, but not parSapply.
Given the data I provided, my expected result is that D1map is assigned the vector [1,1,1,0]. Instead, I get the error above.
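As a cross-check of that expected vector, here is a minimal base-R sketch that does not use Biostrings at all. It swaps in `grepl` for `matchPattern`, so it only illustrates the intended result and is not the code that fails. The names `ungapped` and `D1map_expected` are made up for this example, and the `toupper` call assumes the lowercase input should be compared against the uppercase motif.

```r
# Same four example sequences as in the simplified code.
fasta.file <- list("---cg-a-ct---g--atggtactg",
                   "--cg-a-ct---g--atggcactg-",
                   "cg-a-ct---g--atggtactg-gc",
                   "agt-gt-tcacccggtga--aattg")

# Remove alignment gaps and uppercase, so the motif can be matched literally.
ungapped <- toupper(gsub("-", "", unlist(fasta.file), fixed = TRUE))

# 1 if the sequence contains the exact motif "CGACTG", 0 otherwise.
D1map_expected <- as.integer(grepl("CGACTG", ungapped, fixed = TRUE))
print(D1map_expected)  # prints: [1] 1 1 1 0
```

Unlike `length(nchar(matchPattern(...)))`, which counts the matches in a sequence, this sketch only records presence or absence. The two agree here because no example sequence contains the motif more than once.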
I have tried:
A. clusterExport(cl, "Biostrings::matchPattern")
B.
D1map <- parSapply(cl,
                   1:UnknownDonors,
                   function(i) length(
                     nchar(
                       Biostrings::matchPattern("CGACTG",
                                                UnknownDonor[[i]],
                                                max.mismatch = 0,
                                                min.mismatch = 0,
                                                with.indels = TRUE,
                                                fixed = TRUE,
                                                algorithm = "auto"
                       )
                     )
                   )
)
C. importMethodsFrom(Biostrings, matchPattern)
A. resulted in: object "Biostrings::matchPattern" not found
B. resulted in the same error as my original code.
C. resulted in: function importMethodsFrom not found.
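One way to test the guess that something is missing on the workers is to ask each worker what it has attached. The sketch below uses only the parallel package. The two-worker cluster `cl_check` and the variable names are hypothetical and separate from the cluster in my code.

```r
library(parallel)

# Small throwaway cluster just for the check (2 workers is an arbitrary choice).
cl_check <- makeCluster(2)

# clusterEvalQ evaluates the expression on every worker and returns one result per worker;
# search() lists the environments and packages attached in that worker's session.
worker_search <- clusterEvalQ(cl_check, search())

# TRUE for a worker only if Biostrings is attached there.
has_biostrings <- vapply(worker_search,
                         function(s) "package:Biostrings" %in% s,
                         logical(1))
print(has_biostrings)

stopCluster(cl_check)
```

If Biostrings turns out not to be attached on the workers, that would be consistent with the coercion error: exporting `DNAString` and `matchPattern` by name does not attach the package there, so a call such as `nchar` on a Biostrings object may fall back to the base function.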
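At this point, running my code serially is not an option. It takes 9.5 minutes to run 9.4k sequences on my laptop. Extrapolating linearly, it would take 26.5 days to classify a file of 400,000,000 unaligned sequences. I realize I have just posted a lot of code. I hope my comments help. I tried to make the code as readable as possible. If any clarification is needed, or if you think my question should be reorganized, please let me know.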