I am trying to read a very large input CSV file (~2 GB).
Input.csv -
symbol,line_id,tfp_seq_nbr,lfnet_seq_nbr,capture_time,exch_time,time_gap,lrt_index
XPP=P,242,1423455,1,1467099363540804,1467099363482921,57883,12529312792643
YINN=P,242,1423455,2,1467099363540804,1467099363483013,57791,12540190457924
XTH=P,242,1423455,3,1467099363540804,1467099363483029,57775,12532854751287
XWEB=P,242,1423455,4,1467099363540804,1467099363483041,57763,12534562422840
TNA=P,239,1423455,1,1467099363540804,1467099363482811,57993,12271505440835
UMDD=P,239,1423455,2,1467099363540804,1467099363483057,57747,12343774478404
UBR=P,239,1423455,3,1467099363540804,1467099363483077,57727,12323714760771
UMX=P,239,1423456,4,1467099363552436,1467099363483094,69342,12344826593347
SAA=P,237,1423456,1,1467099363552436,1467099363482487,69949,12067058221123
SBND=P,237,1423456,2,1467099363552436,1467099363482756,69680,12074232840260
SDYL=P,237,1423456,3,1467099363552436,1467099363482779,69657,12098695594052
WBII=P,241,1423456,1,1467099363552436,1467099363483070,69366,12463264235588
A_CTS13,205,1423470,2,1467099363758563,1467099363718247,40316,11138697396278
Z_CTS13,205,1423470,2,1467099363758563,1467099363718247,40316,12566955032630
Task -
My goal is to create a separate file per line_id, named line_id.csv (e.g. 242.csv). Each such file contains all the records for that line_id, and likewise for every other line_id.csv. The columns of a line_id.csv file are capture_time, time_gap and time_gap_count. While reading the input CSV I want to compute time_gap_count, i.e. the number of times the same time_gap value repeats within one unique capture_time.
242.csv -
capture_time,time_gap,time_gap_count
1467099363540804,57791,1
1467099363540804,57775,1
1467099363540804,57883,1
1467099363540804,57763,1
All the other line_id.csv files should be generated the same way.
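To make the counting rule concrete, here is a minimal sketch of how time_gap_count can be tallied per (capture_time, time_gap) pair with collections.Counter; the rows list is a hypothetical in-memory sample of the line_id 242 records above, not part of the actual program:

```python
from collections import Counter

# Hypothetical in-memory sample: (capture_time, time_gap) pairs
# taken from the line_id 242 rows of the input above.
rows = [
    ('1467099363540804', '57883'),
    ('1467099363540804', '57791'),
    ('1467099363540804', '57775'),
    ('1467099363540804', '57763'),
]

# Key = (capture_time, time_gap) pair; value = time_gap_count,
# i.e. how often that time_gap repeats within that capture_time.
counts = Counter(rows)

for (capture_time, time_gap), time_gap_count in counts.items():
    print('%s,%s,%s' % (capture_time, time_gap, time_gap_count))
```

Each of the four sample pairs is unique here, so every time_gap_count is 1, matching the 242.csv example above.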
For this task I wrote the program below, but it takes a long time and is not very efficient.
Program code -
import time,datetime
import sys, getopt, csv
import os
from collections import defaultdict
main_dict = defaultdict(lambda: defaultdict(list))
fieldname = ['capture_time','time_gap','time_gap_count']
op_directory = 'HeatMap_Data'
def make_file():
    try:
        with open(inputfile, 'rb') as f_obj:
            reader = csv.reader(f_obj, delimiter=',')
            next(reader, None)  # skip the header row
            start_time = time.time()
            for line in reader:
                # skip blank or malformed rows (the original test let blank rows through)
                if not line or len(line) != 8:
                    continue
                main_dict[line[1]][line[4]].append(line[6])
            end_time = time.time()
            print 'Normal read-time elapsed:', end_time - start_time
        if not os.path.exists(outputfile + op_directory):
            os.makedirs(outputfile + op_directory)
        start_time = time.time()
        for key, value in main_dict.iteritems():
            f1 = open(outputfile + op_directory + '/' + key + '.csv', 'w')
            writer1 = csv.DictWriter(f1, delimiter=',', fieldnames=fieldname)
            writer1.writeheader()
            for k, v in value.iteritems():
                for se in set(v):
                    writer1.writerow({'capture_time': k, 'time_gap': se, 'time_gap_count': v.count(se)})
            f1.close()  # the original never closed the output files
        end_time = time.time()
        print 'Normal write-time elapsed:', end_time - start_time
    except IOError as e:
        print 'RUN AS : --->>> test.py -i <inputfile path> -o <outputfile path>\n', e
    except OSError as e:
        print 'RUN AS : --->>> test.py -i <inputfile path> -o <outputfile path with end "/">\n', e
if __name__ == "__main__":
    start_time = time.time()
    argv = sys.argv[1:]
    inputfile = ''
    outputfile = ''
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print 'test.py -i <inputfile> -o <outputfile>'
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print 'test.py -i <inputfile> -o <outputfile>'
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    make_file()
    end_time = time.time()
    print 'Normal time elapsed:', end_time - start_time
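One hotspot I suspect is the inner v.count(se) call, which rescans the whole list once per distinct value, while a collections.Counter produces the same tally in a single pass. A standalone sketch of that substitution (the v list here is hypothetical sample data, not wired into the script above):

```python
from collections import Counter

# A hypothetical list of time_gap values collected for one capture_time.
v = ['57883', '57791', '57883', '57763', '57883']

# Current approach: one full list scan per distinct value -> O(n * distinct).
slow = {se: v.count(se) for se in set(v)}

# Counter tallies everything in a single pass over the list -> O(n).
fast = Counter(v)

print(slow == dict(fast))  # prints True: the two tallies agree
```
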
This code takes about 12 minutes. I want to cut that down and make it efficient, so it takes less time to execute. Please suggest any other tools better suited to this kind of reading and writing, and also how I can reduce the execution time of this code.
Thanks in advance!