Whenever a row ID (oddly located in the 8th column, i.e. row[7]) repeats after its first occurrence, I want to write those rows to a second file. The code I've tried is very slow - it's a 40-column CSV with roughly a million rows. Here's what I have:
import csv

def in_out_repsplit(inf, outf1, outf2):
    outf1 = csv.writer(open(outf1, 'wb'), delimiter=',', lineterminator='\n')
    outf2 = csv.writer(open(outf2, 'wb'), delimiter=',', lineterminator='\n')
    inf1 = csv.reader(open(inf, 'rbU'), delimiter=',')
    inf1.next()  # skip the header row
    checklist = []
    for row in inf1:
        id_num = str(row[7])
        if id_num not in checklist:  # linear scan of the list; slows down as checklist grows
            outf1.writerow(row)
            checklist.append(id_num)
        else:
            outf2.writerow(row)
Answer (score: 1)
The in operator performs a linear search over a Python list(). Since all you need is a membership test, a Python set() is a better fit: it offers average constant-time membership tests. For a CSV with about a million rows, this small change should make things considerably faster.
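As a rough, hypothetical illustration of that difference (the sizes and values below are made up, not taken from the question's data), timeit shows the worst-case gap between a list and a set membership test:

import timeit

# Hypothetical benchmark: membership test against 100,000 string IDs.
setup = "ids = [str(i) for i in range(100000)]; id_list = list(ids); id_set = set(ids)"

# Worst case for the list: the probed value is absent, so every element is scanned.
print(timeit.timeit("'missing' in id_list", setup=setup, number=1000))
print(timeit.timeit("'missing' in id_set", setup=setup, number=1000))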
def in_out_repsplit(inf, outf1, outf2):
    outf1 = csv.writer(open(outf1, 'wb'), delimiter=',', lineterminator='\n')
    outf2 = csv.writer(open(outf2, 'wb'), delimiter=',', lineterminator='\n')
    inf1 = csv.reader(open(inf, 'rbU'), delimiter=',')
    inf1.next()
    checklist = set()
    for row in inf1:
        id_num = str(row[7])
        if id_num not in checklist:  # average constant-time lookup in a set
            outf1.writerow(row)
            checklist.add(id_num)
        else:
            outf2.writerow(row)
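Note that the snippet above is Python 2 (inf1.next(), the 'wb' and 'rbU' file modes). As a minimal sketch only, assuming Python 3 and the same column layout, the set-based version could be written like this:

import csv

def in_out_repsplit(inf, outf1, outf2):
    # Python 3: open text files with newline='' and advance the reader with next().
    with open(inf, newline='') as fin, \
         open(outf1, 'w', newline='') as f1, \
         open(outf2, 'w', newline='') as f2:
        reader = csv.reader(fin, delimiter=',')
        writer1 = csv.writer(f1, delimiter=',', lineterminator='\n')
        writer2 = csv.writer(f2, delimiter=',', lineterminator='\n')
        next(reader)  # skip the header row
        checklist = set()
        for row in reader:
            id_num = row[7]
            if id_num not in checklist:
                writer1.writerow(row)  # first occurrence of this ID
                checklist.add(id_num)
            else:
                writer2.writerow(row)  # repeated ID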
If id_num is an integer, use int instead of str. And if id_num falls in the range [0...N], where N is reasonably close to the number of rows (about a million), you can use a list of booleans and get an even faster lookup:
...
checklist = [False] * (MAXID + 1)
for row in inf1:
    id_num = int(row[7])
    if not checklist[id_num]:
        outf1.writerow(row)
        checklist[id_num] = True
    else:
        outf2.writerow(row)
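For completeness, a hypothetical call (the file names below are made up, and MAXID must be at least as large as the largest ID that ever appears in row[7]):

MAXID = 2000000  # hypothetical upper bound on the IDs in column 8
in_out_repsplit('input.csv', 'first_occurrences.csv', 'repeats.csv')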