Optimal alternative data structure to improve run time given a huge dictionary size?

Asked: 2015-02-10 23:57:05

Tags: python optimization dictionary data-structures

I have a Python script in which I initialize a dictionary with about 4.9 million keys. Each key holds a list of 24 elements that I initialize to zero. I need to parse a text file of about 9.7 million lines (20 columns each) and, depending on a specific match against the dictionary keys, increment the appropriate list integer for that key.

The problem is that the parsing is very slow and my job keeps getting killed (the maximum wall time on the cluster is 24 hours). The dictionary to be initialized is about 200 MB in size, and after some timing checks I found that parsing 10,000 lines takes about 16 minutes, so parsing all 9.7 million lines would take roughly 242 hours.

In short, I just need to count and increment the appropriate values for dictionary keys. Is there an alternative data structure to a Python dictionary that can optimize this script and make it run in a reasonable amount of time?

import sys

def count_dict_init(file):
    gff_file = open(file, 'r')
    pos_list = []
    for line in gff_file:
        line_list = line.strip().split('\t')
        if line.startswith('chr') and line[0:5] != 'chrmt':
            if line_list[2] == 'CDS':
                leftpos = int(line_list[3])
                rightpos = int(line_list[4])
                # Pad the CDS interval by 100 positions on each side
                for position in range(leftpos - 100, rightpos + 101):
                    pos_list.append(position)
    gff_file.close()

    uniq_list = set(pos_list)
    sorted_list = list(uniq_list)
    sorted_list.sort()
    pos_dict = {}
    for pos in sorted_list:
        # 22 integer counters plus two string flag slots (indices 22 and 23)
        pos_dict[pos] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, '', '']

    # Note: sys.getsizeof reports the shallow size of the dict itself, not its contents
    print 'Size of count dictionary is ', sys.getsizeof(pos_dict)
    return pos_dict
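
As a side note on the initializer: pos_list accumulates every position, duplicates included, before the set() call, and the sort buys nothing because Python 2 dicts are unordered anyway. A minimal sketch of the same collection step that deduplicates as it goes (assuming gff_file is open as above; same parsing logic, lower peak memory):

pos_set = set()
for line in gff_file:
    line_list = line.strip().split('\t')
    if line.startswith('chr') and line[0:5] != 'chrmt' and line_list[2] == 'CDS':
        # update() adds the whole padded interval; the set drops duplicates on the fly
        pos_set.update(xrange(int(line_list[3]) - 100, int(line_list[4]) + 101))

pos_dict = dict((pos, [0] * 22 + ['', '']) for pos in pos_set)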

def sam_parser(sam_file, count):
    dict_count = count
    parsed_file = open('Sam_parsed_dict.tab', 'w')
    non_cds_file = open('Non_Cds_file', 'w')
    for line in sam_file:
        if line[0] != '@':  # Skip SAM header lines
            fields = line.split('\t')
            if len(fields) > 19:
                multi_flag = fields[19].strip()
                # If the read has more than one alignment then report it as multiple mapping
                if multi_flag != 'NH:i:1':
                    multi_align = 'Y'
                else:
                    multi_align = 'N'
            else:
                multi_align = 'N'
            non_cds = False
            sam_flag = int(fields[1])
            chr_num = fields[2]
            read_length = len(fields[9])
            pos_in_value = (read_length - 27) * 2  # Determines which list position to update
            if 27 <= read_length <= 37:
                if sam_flag == 0:  # Primary alignment on forward strand
                    five_prime = int(fields[3])
                    if five_prime in dict_count.keys():
                        dict_count[five_prime][pos_in_value] += 1
                        aligner_cis = dict_count[five_prime][22]
                        if aligner_cis == 'Y':
                            continue
                        else:
                            dict_count[five_prime][22] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 16:  # On reverse strand
                    five_prime = int(fields[3]) + read_length - 1
                    if five_prime in dict_count.keys():
                        dict_count[five_prime][pos_in_value + 1] += 1
                        aligner_trans = dict_count[five_prime][23]
                        if aligner_trans == 'Y':
                            continue
                        else:
                            dict_count[five_prime][23] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 256:  # Not primary alignment
                    five_prime = int(fields[3])
                    if five_prime in dict_count.keys():
                        aligner_cis = dict_count[five_prime][22]
                        if aligner_cis == 'Y':
                            continue
                        else:
                            dict_count[five_prime][22] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 272:  # Not primary alignment and on reverse strand
                    five_prime = int(fields[3]) + read_length - 1
                    if five_prime in dict_count.keys():
                        aligner_trans = dict_count[five_prime][23]
                        if aligner_trans == 'Y':
                            continue
                        else:
                            dict_count[five_prime][23] = multi_align
                    else:
                        non_cds = True
                if non_cds:
                    non_cds_file.write(str(chr_num) + '\t' + str(fields[3]) + '\n')

    for pos, counts in dict_count.iteritems():
        parsed_file.write(str(pos) + '\t' + '\t'.join(map(str, counts)) + '\n')

    parsed_file.close()
    non_cds_file.close()

if __name__ == "__main__":
    # Parse arguments from the command line
    arguments = parse_arguments()
    GFF = arguments.gfffile
    chrnum = arguments.chrnum
    initial_count_dict = count_dict_init(GFF)
    SAM = open(arguments.inputPath)
    sam_parser(SAM, initial_count_dict)

1 Answer:

Answer 0 (score: 12):

I think your problem is this expression: if five_prime in dict_count.keys():

In Python 2, that creates a brand-new list containing every key in the dictionary (all 4.9 million of them) and then walks it linearly until the key is found (and walks the entire list when the key is not there).

Since looking a key up in the dictionary itself is a single hash operation while finding it in that list takes up to 4.9 million operations, you want to use this instead: if five_prime in dict_count:
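
To make the difference concrete, here is a small measurement sketch (Python 2, where keys() materializes a list; the dictionary size and the probed key are made-up stand-ins for your data):

import timeit

d = dict.fromkeys(xrange(4900000), 0)  # stand-in for the 4.9M-key dict

# Membership test via .keys(): builds a 4.9M-element list, then scans it
print timeit.timeit('4899999 in d.keys()', 'from __main__ import d', number=10)

# Membership test on the dict itself: one hash probe per test
print timeit.timeit('4899999 in d', 'from __main__ import d', number=10)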

The other thing is that you are doing several times more lookups than you need to. If dictionary lookups are a bottleneck in any way, you can minimize them by doing only one lookup per iteration. Here is some sample code:

            five_prime = int(fields[3])
            record = dict_count.get(five_prime)
            if record is not None:
                record[pos_in_value] += 1
                aligner_cis = record[22]
                if aligner_cis == 'Y':
                    continue
                else:
                    record[22] = multi_align
            else:
                non_cds = True
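
The same single-lookup pattern applies to the other three sam_flag branches. For instance, the body of the sam_flag == 16 case could become (a sketch mirroring the code above):

            five_prime = int(fields[3]) + read_length - 1
            record = dict_count.get(five_prime)
            if record is not None:
                record[pos_in_value + 1] += 1
                aligner_trans = record[23]
                if aligner_trans == 'Y':
                    continue
                else:
                    record[23] = multi_align
            else:
                non_cds = True

That way each line costs at most one dictionary probe per branch, instead of the membership test plus two item lookups in the original.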