Python: split a CSV into two files based on a repeated cell

Date: 2014-11-30 05:32:08

Tags: python python-2.7 csv

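Whenever a row ID (oddly placed in the 8th column, i.e. row[7]) repeats after its first occurrence, I want to write those rows to a second file. The code I've tried is very slow: the input is a 40-column CSV with about a million rows. Here's what I have: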

import csv

def in_out_repsplit(inf, outf1, outf2):
    outf1 = csv.writer(open(outf1, 'wb'), delimiter=',', lineterminator='\n')
    outf2 = csv.writer(open(outf2, 'wb'), delimiter=',', lineterminator='\n')
    inf1 = csv.reader(open(inf, 'rbU'), delimiter=',')
    inf1.next()  # skip the header row
    checklist = []  # IDs seen so far
    for row in inf1:
        id_num = str(row[7])
        if id_num not in checklist:  # scans the whole list on every row
            outf1.writerow(row)
            checklist.append(id_num)
        else:
            outf2.writerow(row)

1 Answer:

Answer 0 (score: 1)

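The in operator performs a linear search on a Python list. Since you only need membership tests, a Python set is the more appropriate structure: its membership test takes constant time on average. For a CSV with a million rows, this small change should speed things up considerably.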

import csv

def in_out_repsplit(inf, outf1, outf2):
    outf1 = csv.writer(open(outf1, 'wb'), delimiter=',', lineterminator='\n')
    outf2 = csv.writer(open(outf2, 'wb'), delimiter=',', lineterminator='\n')
    inf1 = csv.reader(open(inf, 'rbU'), delimiter=',')
    inf1.next()  # skip the header row
    checklist = set()  # IDs seen so far; hash lookup instead of a list scan
    for row in inf1:
        id_num = str(row[7])
        if id_num not in checklist:
            outf1.writerow(row)  # first occurrence of this ID
            checklist.add(id_num)
        else:
            outf2.writerow(row)  # repeat of an ID already seen
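
To get a rough feel for the difference between the two lookups, a small timing sketch like the following can help. It is not part of the original answer: the sample size `n`, the probe value and the repeat count are arbitrary choices for illustration, and the absolute numbers will vary from machine to machine.

```python
import timeit

# Build the same IDs as a list and as a set (sizes are arbitrary).
n = 20000
ids = [str(i) for i in range(n)]
as_list = list(ids)
as_set = set(ids)

# Probe for the last ID: the worst case for a list, which is scanned front to back.
probe = ids[-1]

list_time = timeit.timeit(lambda: probe in as_list, number=200)
set_time = timeit.timeit(lambda: probe in as_set, number=200)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

In the original loop the list also grows as rows are read, so the per-row cost of the list lookup rises with the number of distinct IDs, while the set lookup stays roughly flat.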

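If id_num is an integer, use int instead of str. If id_num falls in the range [0 ... N] (where N is reasonably close to the million-row count), you can use a list of booleans and get even faster lookups.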

    ...
    # MAXID is the largest ID value; it has to be known before the loop starts
    checklist = [False] * (MAXID + 1)
    for row in inf1:
        id_num = int(row[7])
        if not checklist[id_num]:
            outf1.writerow(row)
            checklist[id_num] = True
        else:
            outf2.writerow(row)
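
As a self-contained way to try the boolean-list idea without any files, here is a minimal sketch that works on rows already held in memory. The function name `split_with_flags` and the parameters `max_id` and `key_index` are hypothetical, introduced only for this illustration, and it assumes every ID is a non-negative integer no larger than `max_id`.

```python
def split_with_flags(rows, max_id, key_index=7):
    """Split rows into (first occurrences, repeats) by an integer ID column.

    Assumes every ID satisfies 0 <= id <= max_id.
    """
    seen = [False] * (max_id + 1)  # one flag per possible ID
    firsts, repeats = [], []
    for row in rows:
        id_num = int(row[key_index])
        if seen[id_num]:
            repeats.append(row)
        else:
            seen[id_num] = True
            firsts.append(row)
    return firsts, repeats


# Tiny sample with the ID in column 1 instead of column 7.
sample = [
    ["a", "0"],
    ["b", "3"],
    ["c", "0"],
    ["d", "2"],
    ["e", "3"],
]
firsts, repeats = split_with_flags(sample, max_id=3, key_index=1)
```

An ID larger than `max_id` raises IndexError here, and a negative ID would silently index from the end of the list, so the set version is the safer choice when the ID range is not known for certain.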