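I have 2 csv files. I need to test every row of the first file against all rows of the other file, and write the matching rows to a new csv file. Here is the file structure.
csv 1 (readings are 20 s apart; the first row is always a header):
Date Time Col1 Col2
6/27/16 17:28:21 21 3244
6/27/16 17:28:41 21 3278
6/27/16 17:29:01 21 3299
csv 2 (readings are 1 s apart; no header):
6/27/16 17:28:21 3245
6/27/16 17:28:22 3266
6/27/16 17:28:23 3277
I compare the timestamps from csv1 and csv2 and, when they match, I create an output row consisting of the csv1 row plus the second column read from csv2. A sample of the output csv file would be:
Date Time Col1 Col2 Col3
6/27/16 17:28:21 21 3244 3245
Here is my Python code for doing this: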
import csv

with open("file1.csv", 'r') as csv1:
    with open("out.csv", 'w') as myoutput:
        writer = csv.writer(myoutput)
        row_count = 0
        headerSet = 0
        for row in csv.reader(csv1):
            # file2.csv is reopened and scanned for every row of file1.csv
            with open("file2.csv", 'r') as csv2:
                in2 = csv.reader(csv2)
                for mrow in in2:
                    if row_count == 0 and headerSet == 0:
                        # Generate Header Row for the output csv file
                        writer.writerow(row + ["Col3"])
                        headerSet = 1
                    else:
                        # Code to fetch timestamp from csv1 and csv2
                        if csv1_ts == csv2_ts:
                            # Fetch 2nd column value from csv2
                            val = mrow[1]
                            writer.writerow(row + [val])
                            break
                        else:
                            continue
            row_count += 1
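The code seems to take a very long time to generate the output csv file. What can I do to improve its performance and speed it up?
Answer 0 (score: 1)
Since the rows appear to be sorted by time, you can start by reading one row from each file. If the timestamps of the two rows match, write the row to the output and advance to the next row in both files. If the timestamps differ, read the next row from whichever file currently has the smaller timestamp. This way each file is read only once, instead of rescanning the second file for every row of the first. Here is a simple implementation: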
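import csv

def get_key(row):
    # row[0] is the date as M/D/YY and row[1] is the time.
    # Order the key as (year, month, day) so comparisons are chronological.
    month, day, year = [int(x) for x in row[0].split('/')]
    # The time is compared as a string, which assumes zero-padded HH:MM:SS.
    return [year, month, day], row[1]

with open('file1.csv') as csv1, open('file2.csv') as csv2, open('out.csv', 'w') as out:
    csv1 = csv.reader(csv1)
    csv2 = csv.reader(csv2)
    out = csv.writer(out)

    # Header
    out.writerow(next(csv1) + ['Col3'])

    row1 = next(csv1, None)
    row2 = next(csv2, None)

    # Stop as soon as either file is exhausted
    while row1 and row2:
        key1 = get_key(row1)
        key2 = get_key(row2)
        if key1 < key2:
            # csv1 is behind: advance csv1
            row1 = next(csv1, None)
        elif key1 > key2:
            # csv2 is behind: advance csv2
            row2 = next(csv2, None)
        else:
            # Timestamps match: append the last column of csv2 to the csv1 row
            out.writerow(row1 + row2[-1:])
            row1 = next(csv1, None)
            row2 = next(csv2, None)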
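
If the files cannot be relied on to be sorted, a dictionary lookup is another option. Note that this is a different technique from the merge above, not the method of this answer. The following is only a minimal sketch under some assumptions: the date and time are the first two columns in both files (as in get_key above), they are formatted identically in both files, and the second file fits in memory. The function name join_on_timestamp is hypothetical, and the in-memory io.StringIO objects stand in for the real files.

```python
import csv
import io


def join_on_timestamp(csv1_file, csv2_file, out_file):
    """Hypothetical helper: append the csv2 value to each matching csv1 row."""
    # Read csv2 once and index it by (date, time) -> last column.
    lookup = {}
    for row in csv.reader(csv2_file):
        if row:  # skip blank lines
            lookup[(row[0], row[1])] = row[-1]

    reader = csv.reader(csv1_file)
    writer = csv.writer(out_file, lineterminator="\n")

    # The first row of csv1 is assumed to be the header.
    header = next(reader, None)
    if header is None:
        return
    writer.writerow(header + ["Col3"])

    # One dictionary lookup per csv1 row instead of a scan of csv2.
    for row in reader:
        if not row:
            continue
        key = (row[0], row[1])
        if key in lookup:
            writer.writerow(row + [lookup[key]])


# Small in-memory demonstration (comma-separated sample data is an assumption).
csv1 = io.StringIO(
    "Date,Time,Col1,Col2\n"
    "6/27/16,17:28:21,21,3244\n"
    "6/27/16,17:28:41,21,3278\n"
)
csv2 = io.StringIO(
    "6/27/16,17:28:22,3266\n"
    "6/27/16,17:28:21,3245\n"
)
out = io.StringIO()
join_on_timestamp(csv1, csv2, out)
print(out.getvalue())
```

In this demonstration only the 17:28:21 row is written, because the 17:28:41 row has no counterpart in the second file. The trade-off compared with the merge is memory: the whole second file is held in a dictionary, but the row order of the input files no longer matters.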