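So I'd like Python to use the csv reader/writer to combine all the CSVs in a directory into one merged file, while filtering out any row whose second-column value duplicates that of a row already seen.
Here is the script I'm working with: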
import csv
import glob

with open('merged.csv','a') as out:
    seen = set()
    output = []
    out_writer = csv.writer(out)
    csv_files = [f for f in glob.glob('*.csv') if 'merged' not in f]
    #csv_files = glob.glob('*.csv')
    # I'd like to use all files including the output so that I don't
    # have to rename it when reusing the script - it should dupe-filter itself!
    for filename in csv_files:
        with open(filename, 'rb') as ifile:
            read = csv.reader(ifile, delimiter=',')
            for row in read:
                if row[1] not in seen:
                    seen.add(row[1])
                    if row: #was getting extra rows
                        output.append(row)
    out_writer.writerows(output)
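I feel like I must be missing something simple. My files are about 100MB each, and I eventually want to automate this so that different computers can share a merged file for dupe checking.
For extra credit, how would I change this to filter out rows that match on both row[1] and row[2]? (Once the dupe-filter and self-inclusion are working, of course...)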
Answer 0 (score: 2)
I'd suggest using pandas rather than the csv writer. I'd rewrite your code as:
import glob
import pandas as pd

# COLNAME_LIST is a placeholder: the column names that identify a duplicate
frames = [pd.read_csv(f) for f in glob.glob("*.csv")]
data = pd.concat(frames).drop_duplicates(subset=COLNAME_LIST)
data.to_csv('merged.csv', index=False)
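In full disclosure, I haven't tested this code, since I don't have a bunch of csv files lying around, but I've written similar things with success before.
Answer 1 (score: 1)
This takes more than the few lines pandas might need, since it uses only Python's standard library, but on the other hand it's relatively simple, filters on multiple column values, and handles re-reading its own previous results. It uses the fileinput module, which lets it treat its multiple input files as a single continuous stream of lines of data.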
import csv
import fileinput
import glob
import os

merged_csv = 'merged.csv'
columns = (1, 2)  # columns used for filtering
pathname = '*.csv'
tmpext = os.extsep + "tmp"

csv_files = glob.glob(pathname)
if merged_csv not in csv_files:
    prev_merged = None
else:
    prev_merged = merged_csv + tmpext
    os.rename(merged_csv, prev_merged)
    csv_files[csv_files.index(merged_csv)] = prev_merged

with open(merged_csv, 'wb') as ofile:
    csv_writer = csv.writer(ofile)
    written = set()  # unique combinations of column values written
    csv_stream = fileinput.input(csv_files, mode='rb')
    for row in csv.reader(csv_stream, delimiter=','):
        combination = tuple(row[col] for col in columns)
        if combination not in written:
            csv_writer.writerow(row)
            written.add(combination)

if prev_merged:
    os.unlink(prev_merged)  # clean up

print '{!r} file {}written'.format(merged_csv, 're' if prev_merged else '')
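
The script above is written for Python 2 (binary file modes for csv, the print statement). For anyone on Python 3, here is a rough sketch of the same idea; it is an adaptation rather than the answer's own code. The function name `merge_unique` and its parameters are made up for illustration. It opens each file in turn instead of going through fileinput, reads the previous merged file first so that rows already merged take priority, and skips rows too short to contain the filter columns, which is my guess at what caused the "extra rows" mentioned in the question.

```python
import csv
import glob
import os


def merge_unique(directory, merged_name="merged.csv", columns=(1, 2)):
    """Merge every CSV in `directory` into `merged_name`, keeping only the
    first row seen for each combination of values in `columns`.

    Returns the number of rows written. (Hypothetical helper, not part of
    any library.)
    """
    merged_path = os.path.join(directory, merged_name)
    prev_path = merged_path + os.extsep + "tmp"
    sources = sorted(glob.glob(os.path.join(directory, "*.csv")))

    # Set the previous output aside and read it first, so rows that were
    # already merged win over duplicates found in the other files.
    had_previous = merged_path in sources
    if had_previous:
        sources.remove(merged_path)
        os.replace(merged_path, prev_path)
        sources.insert(0, prev_path)

    written = set()  # unique combinations of column values written so far
    # Python 3's csv module expects text-mode files opened with newline=''.
    with open(merged_path, "w", newline="") as ofile:
        writer = csv.writer(ofile)
        for path in sources:
            with open(path, newline="") as ifile:
                for row in csv.reader(ifile):
                    if len(row) <= max(columns):
                        continue  # blank or short row: no key to compare
                    key = tuple(row[col] for col in columns)
                    if key not in written:
                        written.add(key)
                        writer.writerow(row)

    if had_previous:
        os.remove(prev_path)  # clean up the renamed previous output
    return len(written)
```

Calling `merge_unique(".")` in the directory holding the CSVs should behave like the script above, and running it a second time re-reads merged.csv and rewrites it without adding duplicates. Rows are streamed straight to the output, so only the set of key tuples is held in memory, not the rows themselves.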