I am looping over a file (a so-called multifasta file), in which every record starts with > and carries a sequence such as ACTG:
>
ACTG
I then pass these records to a function that calls an external shell script. The problem is that the output gets all jumbled up: what I want is the output of g, then the output of a. This works fine when I use a simple loop, but I want to do it for 2 million records * several hundred records, so I wrote a function to parallelize it.
I originally ran into the same problem in the loop:
for record in fasta:
    f = str(record.seq)
    g = str(record.id)
    print(g)  # record id, which should appear immediately before its query output
    a = subprocess.call(["bash", "do.sh", f, g])
    subprocess.call(["/well/bag/users/lipworth/cobs/build/src/cobs", "query",
                     "-i", "out.cobs_compact", "-l", "1", "-t", "0.1", f])
$ cat do.sh
stdbuf -o0 -e0 echo $2
stdbuf -o0 -e0 /well/bag/users/lipworth/cobs/build/src/cobs query -t .1 -i out.cobs_compact -l 1 --load-complete $1
Adding the stdbuf bits fixed the problem in the loop.
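For comparison, here is a minimal sketch of the same fix done on the Python side instead of inside do.sh (run_one is a hypothetical helper, and it assumes Python 3.7+ for capture_output). Capturing the child's stdout and printing the id and the query output together means the two can never interleave, whatever the buffering:

import subprocess

def run_one(seq, seq_id):
    # Hypothetical helper: capture the child's stdout instead of letting it
    # write to the shared terminal, then emit id + output as a single unit.
    proc = subprocess.run(["bash", "do.sh", seq, seq_id],
                          capture_output=True, text=True)
    print(seq_id)
    print(proc.stdout, end="")

run_one("ACTG", "record1")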
Here is my current code:
import subprocess
import multiprocessing

from Bio import SeqIO
from joblib import Parallel, delayed

file = open('BLC.fa', 'r')  # 'rU' is deprecated in Python 3; plain 'r' behaves the same
fasta = SeqIO.parse(file, "fasta")

def cobbler(record):
    f = str(record.seq)
    g = str(record.id)
    print(g)
    a = subprocess.call(["stdbuf", "-o0", "-e0", "bash", "do.sh", f, g])

num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores)(delayed(cobbler)(record) for record in fasta)
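A caveat with this version: all the joblib workers share the parent's stdout, so print(g) in one worker can still interleave with do.sh output from another. A hedged sketch of an alternative, reusing Parallel, delayed, num_cores and fasta from the snippet above: have each worker capture and return its output, and let the parent do all the printing, since Parallel returns results in submission order.

import subprocess

def cobbler_capture(record):
    # Hypothetical variant of cobbler: return the child's output instead of
    # printing inside the worker, so only the parent process writes to stdout.
    proc = subprocess.run(["bash", "do.sh", str(record.seq), str(record.id)],
                          capture_output=True, text=True)
    return str(record.id), proc.stdout

pairs = Parallel(n_jobs=num_cores)(delayed(cobbler_capture)(r) for r in fasta)
for rec_id, out in pairs:  # results come back in the order records were submitted
    print(rec_id)
    print(out, end="")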
EDIT: I now have this:
import subprocess
import re
import sys
import multiprocessing

from Bio import SeqIO
from joblib import Parallel, delayed

file = open('BLC.fa', 'r')
fasta = SeqIO.parse(file, "fasta")
outfile = open('out', 'wb')  # note: this handle is never written to

def cobbler(record):
    outfile = open('out', 'wb')  # note: neither is this one
    f = str(record.seq)
    g = str(record.id)
    a = subprocess.Popen(["bash", "do.sh", f, g],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = a.communicate()
    return out

def mp_handler():
    p = multiprocessing.Pool(4)
    with open('out.txt', 'w') as f:
        for result in p.imap(cobbler, fasta):
            print(result)
            f.write('%s' % result)

if __name__ == '__main__':
    mp_handler()
It behaves as expected, except that nothing gets saved to the out.txt file. Why?
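One thing I noticed while writing this up (a hedged guess, not a diagnosis): communicate() returns bytes, so '%s' % result writes a b'...' repr rather than the text itself, and the file is only guaranteed to be flushed when the with block exits. A minimal sketch of the write loop with explicit decoding and flushing, reusing cobbler and fasta from above and assuming do.sh emits UTF-8:

import multiprocessing

def mp_handler():
    with multiprocessing.Pool(4) as p, open('out.txt', 'w') as f:
        for result in p.imap(cobbler, fasta):
            f.write(result.decode('utf-8'))  # bytes from communicate() -> text
            f.flush()  # make progress visible while the run is still going

if __name__ == '__main__':
    mp_handler()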