I have a Python script in which I initialize a dictionary containing about 4.9 million keys. Each key holds a list of 24 elements, which I initialize to zeros. I need to parse a text file of about 9.7 million lines (20 columns each) and, based on a specific match with a dictionary key, increment the appropriate integer in that key's list.
The problem is that the parsing is very slow and my job gets killed (the cluster has a 24-hour wall-time limit). The dictionary to be initialized is about 200 MB, and after some timing checks I found that parsing 10,000 lines takes about 16 minutes, so parsing all 9.7 million lines would take about 242 hours.
In short, I just need to count and increment the appropriate values for the dictionary keys. Is there an alternative data structure to a Python dictionary that could optimize this script and make it run in a reasonable time?
import sys

def count_dict_init(file):
    gff_file = open(file, 'r')
    pos_list = []
    for line in gff_file:
        line_list = line.strip().split('\t')
        if line.startswith('chr') and line[0:5] != 'chrmt':
            if line_list[2] == 'CDS':
                leftpos = int(line_list[3])
                rightpos = int(line_list[4])
                # Collect every position within 100 bases of a CDS feature
                for position in range(leftpos - 100, rightpos + 101):
                    pos_list.append(position)
    uniq_list = set(pos_list)
    sorted_list = list(uniq_list)
    sorted_list.sort()
    pos_dict = {}
    for pos in sorted_list:
        # 22 integer counters plus two flag slots per position
        pos_dict[pos] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, '', '']
    print 'Size of count dictionary is ', sys.getsizeof(pos_dict)
    return pos_dict
def sam_parser(sam_file, count):
    dict_count = count
    parsed_file = open('Sam_parsed_dict.tab', 'w')
    non_cds_file = open('Non_Cds_file', 'w')
    for line in sam_file:
        if line[0] != '@':
            fields = line.split('\t')
            if len(fields) > 19:
                multi_flag = fields[19].strip()
                # If the read has more than one alignment then report it as multiple mapping
                if multi_flag != 'NH:i:1':
                    multi_align = 'Y'
                else:
                    multi_align = 'N'
            else:
                multi_align = 'N'
            non_cds = False
            sam_flag = int(fields[1])
            chr_num = fields[2]
            read_length = len(fields[9])
            pos_in_value = (read_length - 27) * 2  # Determines which list position to update
            if 27 <= read_length <= 37:
                if sam_flag == 0:  # Primary alignment on forward strand
                    five_prime = int(fields[3])
                    if five_prime in dict_count.keys():
                        dict_count[five_prime][pos_in_value] += 1
                        aligner_cis = dict_count[five_prime][22]
                        if aligner_cis == 'Y':
                            continue
                        else:
                            dict_count[five_prime][22] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 16:  # On reverse strand
                    five_prime = int(fields[3]) + read_length - 1
                    if five_prime in dict_count.keys():
                        dict_count[five_prime][pos_in_value + 1] += 1
                        aligner_trans = dict_count[five_prime][23]
                        if aligner_trans == 'Y':
                            continue
                        else:
                            dict_count[five_prime][23] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 256:  # Not primary alignment
                    five_prime = int(fields[3])
                    if five_prime in dict_count.keys():
                        aligner_cis = dict_count[five_prime][22]
                        if aligner_cis == 'Y':
                            continue
                        else:
                            dict_count[five_prime][22] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 272:  # Not primary alignment and on reverse strand
                    five_prime = int(fields[3]) + read_length - 1
                    if five_prime in dict_count.keys():
                        aligner_trans = dict_count[five_prime][23]
                        if aligner_trans == 'Y':
                            continue
                        else:
                            dict_count[five_prime][23] = multi_align
                    else:
                        non_cds = True
            if non_cds:
                non_cds_file.write(str(chr_num) + '\t' + str(fields[3]) + '\n')
    for pos, counts in dict_count.iteritems():
        parsed_file.write(str(pos) + '\t' + '\t'.join(map(str, counts)) + '\n')
    parsed_file.close()
    non_cds_file.close()
if __name__ == "__main__":
    # Parse arguments from the command line
    arguments = parse_arguments()
    GFF = arguments.gfffile
    chrnum = arguments.chrnum
    initial_count_dict = count_dict_init(GFF)
    SAM = open(arguments.inputPath)
    sam_parser(SAM, initial_count_dict)
Answer (score: 12)
I think your problem is this expression: if five_prime in dict_count.keys():
That creates a new list containing every key in the dictionary (all 4.9 million of them) and then walks through it linearly until it finds the key (or walks the entire list if the key is not there).
Since looking a key up in a dictionary takes one operation while finding it in that list takes up to 4.9 million, you want this instead: if five_prime in dict_count:
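If you want to confirm this on your own machine, a rough micro-benchmark along these lines should show the gap (a minimal sketch; the one-million-key dictionary is illustrative, not your real 4.9M-key one, and exact timings will vary):

import timeit

# Illustrative dictionary; smaller than the real 4.9M keys but large
# enough to show the effect.
d = dict((i, 0) for i in xrange(1000000))

# 'in d.keys()' first materializes a list of all keys (Python 2
# behaviour), then scans that list linearly.
print timeit.timeit('999999 in d.keys()', 'from __main__ import d', number=10)

# 'in d' is a single hash-table lookup.
print timeit.timeit('999999 in d', 'from __main__ import d', number=10)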
The other thing is that you are doing several times more lookups than you need. If dictionary lookups are a bottleneck at all, you can minimize them by doing only one lookup per iteration. Here is some sample code:
five_prime = int(fields[3])
record = dict_count.get(five_prime)
if record is not None:
    record[pos_in_value] += 1
    aligner_cis = record[22]
    if aligner_cis == 'Y':
        continue
    else:
        record[22] = multi_align
else:
    non_cds = True
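With dict.get, the key is hashed once per read instead of once for the membership test and again for every index into dict_count[five_prime], and a miss simply returns None, so the non-CDS case falls out of the same single call. The same pattern applies to the other three sam_flag branches.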