Question

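I am a beginner and started coding in Python a few months ago. I have a script that takes a proteome (an 800 KB file of 2,850 strings) and checks each individual protein (protein_string) against a large dataset (23 million strings from an 8 GB file, held in the code as an ID: protein_string dictionary), reporting the IDs of all identical strings (up to 8,500 IDs can be reported per string). The current script takes 4 hours to run. In general, what can be done to speed up the process, and how could I convert my script to multiprocessing or multithreading (I am not sure of the difference)?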

Answer 1

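At first glance, it seems to me that it may make more sense to store your dictionary as { seq : [id_list] } rather than as { id : seq }. Since it sounds like each sequence has many duplicates, this would save the time spent collecting all the IDs for a given sequence. You can do this by reading the data into a defaultdict whose default value is an empty list; as you read each ID and sequence, add it to the dictionary with sequences_dict[record.seq].append(record.description).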
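A minimal sketch of this inversion, using plain (id, sequence) tuples in place of Biopython records so that it runs without the FASTA files. The sample IDs and sequences are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (id, sequence) pairs standing in for parsed FASTA records.
records = [
    ("id1", "MKTAYIAK"),
    ("id2", "MKTAYIAK"),
    ("id3", "GSHMLEDP"),
]

# Map each sequence to the list of IDs that carry it.
sequences_dict = defaultdict(list)
for record_id, seq in records:
    sequences_dict[seq].append(record_id)

# .get avoids inserting an empty list for sequences that are absent.
hits = sequences_dict.get("MKTAYIAK", [])
misses = sequences_dict.get("AAAA", [])
```

Reading with .get (or testing with `in`) rather than indexing matters here: indexing a defaultdict with a missing key would silently create a new empty entry for it.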

Let me know if this helps and whether I can be of any further help.

Answer 2

Following Sam Hollenbach's suggestion, I would make the following four changes to your code.

import sys
import time
from collections import defaultdict

from Bio import SeqIO

start_time = time.time()


databasefile = sys.argv[1]
queryfile = sys.argv[2]

file_hits = "./" + sys.argv[2].split("_protein")[0] + "_ZeNovo_hits_v1.txt"
file_report = "./" + sys.argv[2].split("_protein")[0] + "_ZeNovo_report_v1.txt"
_format = "fasta" #(change 1)
output_file = open(file_hits, 'w')
output_file_2 = open(file_report,'w')
sequences_dict = defaultdict(list)

output_file.write("{}\t{}\n".format("protein_query", "hits"))
for record in SeqIO.parse(databasefile, _format):
    sequences_dict[record.seq].append(record.description) #(change 2)
    #sequences_dict[record.description] = str(record.seq)
print("processed database in --- {:.3f} seconds ---".format(time.time() - start_time))

processed_counter = 0
for record in SeqIO.parse(queryfile, _format):
    query_seq = record.seq #(change 3)
    count = 0
    output_file.write("{}\t".format(record.description))
    if query_seq in sequences_dict: #(change 4)
        count = len(sequences_dict[query_seq])
        output_file.write('\t'.join(sequences_dict[query_seq]) + "\n")
    else:
        output_file.write("\n") # no hits: still end this query's line
    processed_counter += 1
    print("processed protein", processed_counter)
    output_file_2.write(record.description+'\t'+str(count)+
                        '\t'+str(len(record.seq))+'\t'+str(record.seq)+'\n')
output_file.close()
output_file_2.close()
print("Done in --- {:.3f} seconds ---".format(time.time() - start_time))

Change 1: Renamed the format variable to _format (to avoid shadowing Python's built-in format) and updated the places where it is used.

Change 2: Use record.seq as the dictionary key and append record.description to the list stored as its value.

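Change 3: record.seq is a Biopython Seq object rather than a plain str, but recent Biopython releases hash and compare Seq objects by their sequence content, so it can be used as a dictionary key directly without casting it to str.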

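Change 4: This membership test and direct lookup locate any matching records much faster than looping over the whole dictionary, as the original code does.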
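To illustrate why change 4 helps, here is a small sketch contrasting a scan over dict.items() with a direct key lookup. The data and the function names find_ids_by_scan and find_ids_by_lookup are hypothetical and are not taken from the original script.

```python
from collections import defaultdict

# Hypothetical data in the original layout: ID -> sequence.
id_to_seq = {"id1": "MKTAYIAK", "id2": "MKTAYIAK", "id3": "GSHMLEDP"}


def find_ids_by_scan(query, id_to_seq):
    """Scan-style search: visits every entry, O(n) per query."""
    return [seq_id for seq_id, seq in id_to_seq.items() if seq == query]


# Build the inverted index once: sequence -> list of IDs.
seq_to_ids = defaultdict(list)
for seq_id, seq in id_to_seq.items():
    seq_to_ids[seq].append(seq_id)


def find_ids_by_lookup(query, seq_to_ids):
    """Inverted-index search: a single hash lookup, O(1) on average."""
    return seq_to_ids.get(query, [])


scan_result = find_ids_by_scan("MKTAYIAK", id_to_seq)
lookup_result = find_ids_by_lookup("MKTAYIAK", seq_to_ids)
```

With the sizes given in the question, scanning means roughly 2,850 × 23 million ≈ 65 billion string comparisons, whereas the inverted index needs one pass over the database followed by 2,850 hash lookups.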

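I am not sure what the intended handling of output_file.write("{}\t".format(record.description)) is: it is written for every query, whether or not that query has any hits.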

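Also, I cannot say that I have found every change needed for a complete working program. If you have any questions after trying the suggested changes, let me know.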

Optimizing dictionary lookup on a large dataset using dict.items()
