I am trying to read a very large input CSV file (~2 GB).
Input.csv -
symbol,line_id,tfp_seq_nbr,lfnet_seq_nbr,capture_time,exch_time,time_gap,lrt_index
XPP=P,242,1423455,1,1467099363540804,1467099363482921,57883,12529312792643
YINN=P,242,1423455,2,1467099363540804,1467099363483013,57791,12540190457924
XTH=P,242,1423455,3,1467099363540804,1467099363483029,57775,12532854751287
XWEB=P,242,1423455,4,1467099363540804,1467099363483041,57763,12534562422840
TNA=P,239,1423455,1,1467099363540804,1467099363482811,57993,12271505440835
UMDD=P,239,1423455,2,1467099363540804,1467099363483057,57747,12343774478404
UBR=P,239,1423455,3,1467099363540804,1467099363483077,57727,12323714760771
UMX=P,239,1423456,4,1467099363552436,1467099363483094,69342,12344826593347
SAA=P,237,1423456,1,1467099363552436,1467099363482487,69949,12067058221123
SBND=P,237,1423456,2,1467099363552436,1467099363482756,69680,12074232840260
SDYL=P,237,1423456,3,1467099363552436,1467099363482779,69657,12098695594052
WBII=P,241,1423456,1,1467099363552436,1467099363483070,69366,12463264235588
A_CTS13,205,1423470,2,1467099363758563,1467099363718247,40316,11138697396278
Z_CTS13,205,1423470,2,1467099363758563,1467099363718247,40316,12566955032630
Task -
My goal is to create a separate file per line_id, named line_id.csv (e.g. 242.csv). Each such file contains all the records for that line_id, and likewise for every other line_id.csv. The columns of a line_id.csv file are capture_time, time_gap and time_gap_count. While reading the input CSV I want to compute time_gap_count, i.e. the number of times the same time_gap value repeats within one unique capture_time.
242.csv -
capture_time,time_gap,time_gap_count
1467099363540804,57791,1
1467099363540804,57775,1
1467099363540804,57883,1
1467099363540804,57763,1
All the other line_id.csv files should be generated the same way.
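To make the counting rule concrete, here is a minimal sketch of how time_gap_count can be tallied per (capture_time, time_gap) pair with collections.Counter; the rows list is a hypothetical in-memory sample of the line_id 242 records above, not part of the actual program:

```python
from collections import Counter

# Hypothetical in-memory sample: (capture_time, time_gap) pairs
# taken from the line_id 242 rows of the input above.
rows = [
    ('1467099363540804', '57883'),
    ('1467099363540804', '57791'),
    ('1467099363540804', '57775'),
    ('1467099363540804', '57763'),
]

# Key = (capture_time, time_gap) pair; value = time_gap_count,
# i.e. how often that time_gap repeats within that capture_time.
counts = Counter(rows)

for (capture_time, time_gap), time_gap_count in counts.items():
    print('%s,%s,%s' % (capture_time, time_gap, time_gap_count))
```

Each of the four sample pairs is unique here, so every time_gap_count is 1, matching the 242.csv example above.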
For this task I wrote the program below, but it takes a long time and is not very efficient.
Program code -
import time,datetime
import sys, getopt, csv
import os
from collections import defaultdict
main_dict = defaultdict(lambda: defaultdict(list))
fieldname = ['capture_time','time_gap','time_gap_count']
op_directory = 'HeatMap_Data'
def make_file():
    try:
        with open(inputfile, 'rb') as f_obj:
            reader = csv.reader(f_obj, delimiter=',')
            next(reader, None)  # skip the header row
            start_time = time.time()
            for line in reader:
                # skip blank or malformed rows (the original test let blank rows through)
                if not line or len(line) != 8:
                    continue
                main_dict[line[1]][line[4]].append(line[6])
            end_time = time.time()
            print 'Normal read-time elapsed:', end_time - start_time
        if not os.path.exists(outputfile + op_directory):
            os.makedirs(outputfile + op_directory)
        start_time = time.time()
        for key, value in main_dict.iteritems():
            f1 = open(outputfile + op_directory + '/' + key + '.csv', 'w')
            writer1 = csv.DictWriter(f1, delimiter=',', fieldnames=fieldname)
            writer1.writeheader()
            for k, v in value.iteritems():
                for se in set(v):
                    writer1.writerow({'capture_time': k, 'time_gap': se, 'time_gap_count': v.count(se)})
            f1.close()  # the original never closed the output files
        end_time = time.time()
        print 'Normal write-time elapsed:', end_time - start_time
    except IOError as e:
        print 'RUN AS : --->>> test.py -i <inputfile path> -o <outputfile path>\n', e
    except OSError as e:
        print 'RUN AS : --->>> test.py -i <inputfile path> -o <outputfile path with end "/">\n', e
if __name__ == "__main__":
    start_time = time.time()
    argv = sys.argv[1:]
    inputfile = ''
    outputfile = ''
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print 'test.py -i <inputfile> -o <outputfile>'
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print 'test.py -i <inputfile> -o <outputfile>'
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    make_file()
    end_time = time.time()
    print 'Normal time elapsed:', end_time - start_time
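One hotspot I suspect is the inner v.count(se) call, which rescans the whole list once per distinct value, while a collections.Counter produces the same tally in a single pass. A standalone sketch of that substitution (the v list here is hypothetical sample data, not wired into the script above):

```python
from collections import Counter

# A hypothetical list of time_gap values collected for one capture_time.
v = ['57883', '57791', '57883', '57763', '57883']

# Current approach: one full list scan per distinct value -> O(n * distinct).
slow = {se: v.count(se) for se in set(v)}

# Counter tallies everything in a single pass over the list -> O(n).
fast = Counter(v)

print(slow == dict(fast))  # prints True: the two tallies agree
```
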
This code takes about 12 minutes. I want to cut that down and make it efficient, so it takes less time to execute. Please suggest any other tools better suited to this kind of reading and writing, and also how I can reduce the execution time of this code.
Thanks in advance!