Optimizing dictionary lookups against a large dataset using dict.items()

Time: 2019-01-16 02:19:06

Tags: python multithreading dictionary multiprocessing biopython

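I'm a beginner and started coding in Python a few months ago. I have a script that takes a proteome (an 800 KB file containing 2,850 strings) and checks each individual protein (protein_string) against a large dataset (23 million strings from an 8 GB file, held in the code as an ID: protein_string dictionary), then reports the IDs of all identical strings (up to 8,500 IDs can be reported per string). The current script takes 4 hours to run. What can be done in general to speed this up, and how could I convert my script to multiprocessing or multithreading (I'm not sure of the difference)?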

2 answers:

Answer 0 (score: 3)

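At first glance, it seems to me that instead of storing the dictionary as { id : seq }, it might make more sense to store it as { seq : [id_list] }. Since it sounds like each sequence is repeated many times, this saves the time spent collecting all the IDs for a particular sequence. You can do this by reading the data into a defaultdict whose default value is an empty list; as you read each ID and sequence, add it to the dictionary with sequences_dict[record.seq].append(record.description).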
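A minimal sketch of that idea, using a small made-up list of (id, sequence) pairs in place of the records that SeqIO.parse would yield. The IDs and sequences here are hypothetical and only stand in for real FASTA data:

```python
from collections import defaultdict

# Hypothetical (id, sequence) pairs standing in for parsed FASTA records.
records = [
    ("id1", "MKTAYIAK"),
    ("id2", "MKTAYIAK"),
    ("id3", "GSHMLEDP"),
]

# Map each sequence to the list of IDs that share it.
sequences_dict = defaultdict(list)
for record_id, seq in records:
    sequences_dict[seq].append(record_id)

# All IDs for a sequence are now one hash lookup away.
# .get() is used so that a miss does not insert an empty list into the defaultdict.
hits = sequences_dict.get("MKTAYIAK", [])
print(hits)
```

The trade-off is that the dictionary is built once up front, and every query afterwards is a single key lookup instead of a pass over all entries.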

Let me know if this helps, and if there is anything else I can do.

Answer 1 (score: 2)

Building on Sam Hollenbach's suggestion, I would make the following (4) changes to your code.

import sys
import time
from collections import defaultdict

from Bio import SeqIO

start_time = time.time()


databasefile = sys.argv[1]
queryfile = sys.argv[2]

file_hits = "./" + sys.argv[2].split("_protein")[0] + "_ZeNovo_hits_v1.txt"
file_report = "./" + sys.argv[2].split("_protein")[0] + "_ZeNovo_report_v1.txt"
_format = "fasta" #(change 1)
output_file = open(file_hits, 'w')
output_file_2 = open(file_report,'w')
sequences_dict = defaultdict(list)

output_file.write("{}\t{}\n".format("protein_query", "hits"))
for record in SeqIO.parse(databasefile, _format):
    sequences_dict[record.seq].append(record.description) #(change 2)
    #sequences_dict[record.description] = str(record.seq)
print("processed database in --- {:.3f} seconds ---".format(time.time() - start_time))

processed_counter = 0
for record in SeqIO.parse(queryfile, _format):
    query_seq = record.seq #(change 3)
    count = 0
    output_file.write("{}\t".format(record.description))
    if query_seq in sequences_dict: #(change 4)
        count = len(sequences_dict[query_seq])
        output_file.write('\t'.join(sequences_dict[query_seq]))
    output_file.write("\n")  # always end the row, even when there are no hits
    processed_counter += 1
    print("processed protein", processed_counter)
    output_file_2.write(record.description+'\t'+str(count)+
                        '\t'+str(len(record.seq))+'\t'+str(record.seq)+'\n')
output_file.close()
output_file_2.close()
print("Done in --- {:.3f} seconds ---".format(time.time() - start_time))

Change 1: Renamed the format variable to _format (to avoid shadowing Python's built-in format) and updated it everywhere it is used.

Change 2: Use record.seq as the dictionary key and append record.description to the list stored as its value.

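Change 3: No need to cast record.seq to str here. Strictly speaking it is a Biopython Seq object rather than a plain string, but it can be used directly as a dictionary key; if your Biopython version warns about hashing or comparing Seq objects, use str(record.seq) both when building the dictionary and when querying it.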

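Change 4: These 3 lines locate any matching records much faster than iterating over the whole dictionary as the original code did, because a membership test on a dictionary key is a hash lookup rather than a scan of every entry.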
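To illustrate the difference, here is a rough sketch with made-up data. find_by_scan mimics looping over dict.items() for every query, while find_by_key uses the inverted dictionary. Both function names and the toy data are assumptions for illustration, not the original code:

```python
from collections import defaultdict

# Hypothetical database in the original {id: sequence} layout.
id_to_seq = {"id1": "MKTAYIAK", "id2": "GSHMLEDP", "id3": "MKTAYIAK"}

def find_by_scan(query, id_to_seq):
    """Check every entry: the work grows with the size of the database."""
    return [seq_id for seq_id, seq in id_to_seq.items() if seq == query]

# Invert once into {sequence: [ids]}.
seq_to_ids = defaultdict(list)
for seq_id, seq in id_to_seq.items():
    seq_to_ids[seq].append(seq_id)

def find_by_key(query, seq_to_ids):
    """Single hash lookup: roughly constant time per query on average."""
    return seq_to_ids.get(query, [])

scan_hits = find_by_scan("MKTAYIAK", id_to_seq)
key_hits = find_by_key("MKTAYIAK", seq_to_ids)
print(scan_hits == key_hits)
```

With the sizes quoted in the question, the scanning approach performs roughly 2,850 × 23,000,000, or about 65 billion, string comparisons, whereas the keyed approach does one pass to build the dictionary followed by 2,850 lookups.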

I'm not sure how the output_file.write("{}\t".format(record.description)) line should be handled.

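Also, I can't say I have found every change needed for a complete working program. If you have any questions after trying the suggested changes, let me know.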