Improving performance when comparing two huge CSV files [python]

Time: 2016-06-29 03:41:03

Tags: python loops csv

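I have two CSV files. I need to test every row of the first file against all rows of the second file and write the matching rows to a new CSV file. Here is the file structure.

csv 1 (readings are 20 s apart; the first row is always the header):

Datetime Col1 Col2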
6/27/16 17:28:21 21 3244
6/27/16 17:28:41 21 3278
6/27/16 17:29:01 21 3299

csv 2 (readings are 1 s apart; no header):

6/27/16 17:28:21 3245
6/27/16 17:28:22 3266
6/27/16 17:28:23 3277

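I compare the timestamps from csv1 and csv2, and on a match I create an output row containing the csv1 row plus the second column read from csv2. A sample row in the output CSV file would be:

Datetime Col1 Col2 Col3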
6/27/16 17:28:21 21 3244 3245

Here is my Python code for doing this:

    import csv

    with open("file1.csv", 'r') as csv1, open("out.csv", 'w', newline='') as myoutput:
        writer = csv.writer(myoutput)
        headerSet = 0
        for row in csv.reader(csv1):
            if headerSet == 0:
                # Generate Header Row for the output csv file
                writer.writerow(row + ["Col3"])
                headerSet = 1
                continue
            # file2.csv is reopened and scanned from the top for every csv1 row
            with open("file2.csv", 'r') as csv2:
                for mrow in csv.reader(csv2):
                    # Code to fetch timestamp from csv1 and csv2
                    # (assumed here to be the first column of each file)
                    csv1_ts, csv2_ts = row[0], mrow[0]
                    if csv1_ts == csv2_ts:
                        # Fetch 2nd column value from csv2
                        val = mrow[1]
                        writer.writerow(row + [val])
                        break

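The code seems to take a very long time to generate the output CSV file. What can I do to improve its performance and speed it up?

1 answer:

Answer 0: (score: 1)

Since the rows appear to be sorted by time, you can start by reading one row from each file. If the timestamps match, write the row to the output and advance to the next row in both files. If they differ, read the next row from the file whose current timestamp is smaller. Each file is then read only once, instead of rescanning the second file for every row of the first. Here is a simple implementation: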

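    import csv
    from datetime import datetime

    def get_key(row):
        # Parse the timestamp so that rows compare chronologically.
        # Assumes it is the first column, e.g. "6/27/16 17:28:21" (month/day/year).
        return datetime.strptime(row[0], '%m/%d/%y %H:%M:%S')

    with open('file1.csv', newline='') as csv1, \
            open('file2.csv', newline='') as csv2, \
            open('out.csv', 'w', newline='') as out:
        csv1 = csv.reader(csv1)
        csv2 = csv.reader(csv2)
        out = csv.writer(out)

        # Header (only file1.csv has one)
        out.writerow(next(csv1) + ['Col3'])
        row1 = next(csv1, None)
        row2 = next(csv2, None)

        # Stop as soon as either file is exhausted
        while row1 and row2:
            key1 = get_key(row1)
            key2 = get_key(row2)
            if key1 < key2:
                # file1 is behind: advance it
                row1 = next(csv1, None)
            elif key1 > key2:
                # file2 is behind: advance it
                row2 = next(csv2, None)
            else:
                # Timestamps match: append the last column of the file2 row
                out.writerow(row1 + row2[-1:])
                row1 = next(csv1, None)
                row2 = next(csv2, None)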
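
If the rows cannot be relied on to be sorted by time, the merge above no longer applies. One possible alternative, not part of the original answer, is to index the second file in a dictionary once and then look each row of the first file up in it. The sketch below assumes the timestamp is the first column in both files and has the form `6/27/16 17:28:21`. The function name `join_by_timestamp`, the constant `TS_FORMAT`, and the in-memory sample data are made up for illustration.

```python
import csv
import io
from datetime import datetime

# Assumed timestamp layout (month/day/year), e.g. "6/27/16 17:28:21".
TS_FORMAT = "%m/%d/%y %H:%M:%S"


def join_by_timestamp(file1, file2, out):
    """Append the matching file2 value to every file1 row via a dict lookup."""
    # Index file2 once: parsed timestamp -> value in its last column.
    lookup = {}
    for row in csv.reader(file2):
        if row:  # skip blank lines
            lookup[datetime.strptime(row[0], TS_FORMAT)] = row[-1]

    reader = csv.reader(file1)
    writer = csv.writer(out)
    writer.writerow(next(reader) + ["Col3"])  # file1 starts with a header row
    for row in reader:
        if not row:
            continue
        value = lookup.get(datetime.strptime(row[0], TS_FORMAT))
        if value is not None:
            writer.writerow(row + [value])


# Hypothetical in-memory stand-ins for file1.csv and file2.csv,
# shaped like the samples in the question.
file1 = io.StringIO(
    "Datetime,Col1,Col2\n"
    "6/27/16 17:28:21,21,3244\n"
    "6/27/16 17:28:41,21,3278\n"
)
file2 = io.StringIO(
    "6/27/16 17:28:21,3245\n"
    "6/27/16 17:28:22,3266\n"
    "6/27/16 17:28:41,3290\n"
)
out = io.StringIO()
join_by_timestamp(file1, file2, out)
result = out.getvalue()
print(result)
```

The trade-off is memory: the dictionary holds one entry per row of the second file, so for files too large to fit in memory the sorted merge is probably the better choice.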