我需要匹配来自多个CSV文件的数据。我编写了一个可以处理简单数据的脚本,但分析4000行的速度很慢。
我已经尝试使用set(a) & set(b)
,但无法从每个文件返回匹配的数据。
输出文件必须包含所有文件中的匹配数据。
脚本:
for file_1 in files:
with open(file_1, 'rt') as f1,open(saved_file, 'w') as f3:
reader1 = csv.reader(f1, delimiter = ';')
writer = csv.writer(f3, delimiter = ';', lineterminator = '\n')
for row1 in reader1:
for row1 in reader1:
for file_2 in files:
with open(file_2, 'rt') as f2:
reader2 = csv.reader(f2, delimiter = ';')
if row1 in reader2:
writer.writerow(row1)
我想要匹配的数据如下所示:
File_1:
May 22, 2017;12,615.50;12,650.50;12,665.00;12,567.00;-;-0.18%
May 19, 2017;12,638.69;12,612.30;12,658.55;12,596.72;121.95M;0.39%
May 18, 2017;12,590.06;12,608.19;12,634.26;12,489.95;123.48M;-0.33%
May 17, 2017;12,631.61;12,700.12;12,786.89;12,587.45;108.95M;-1.35%
May 15, 2017;12,807.04;12,824.05;12,832.29;12,729.49;87.08M;0.29%
File_2:
May 22, 2017;1.1238;1.1200;1.1265;1.1160;0.28%
May 19, 2017;1.1207;1.1100;1.1214;1.1094;0.94%
May 17, 2017;1.1159;1.1082;1.1163;1.1078;0.69%
May 16, 2017;1.1082;1.0975;1.1098;1.0971;0.97%
May 15, 2017;1.0975;1.0924;1.0991;1.0920;0.40%
输出: saved_file_1:
May 22, 2017;12,615.50;12,650.50;12,665.00;12,567.00;-;-0.18%
May 19, 2017;12,638.69;12,612.30;12,658.55;12,596.72;121.95M;0.39%
May 17, 2017;12,631.61;12,700.12;12,786.89;12,587.45;108.95M;-1.35%
May 15, 2017;12,807.04;12,824.05;12,832.29;12,729.49;87.08M;0.29%
saved_file_2:
May 22, 2017;1.1238;1.1200;1.1265;1.1160;0.28%
May 19, 2017;1.1207;1.1100;1.1214;1.1094;0.94%
May 17, 2017;1.1159;1.1082;1.1163;1.1078;0.69%
May 15, 2017;1.0975;1.0924;1.0991;1.0920;0.40%
答案 0 :(得分:0)
性能问题是因为您正在使用for循环中的多个文件读取器/写入器。
我建议您首先使用Pandas将数据从File_1和File_2导入数据框架。你可以这样做:
import pandas as pd
df1=pd.read_csv("file_1.csv")
df2=pd.read_csv("file_2.csv")
然后您可以对导入的数据应用计算,然后可以将它们再次保存为CSV:
dfOut.to_csv(file_name, sep='\t')
您需要在此处理正确的CSV分隔符。
答案 1 :(得分:0)
如果不诉诸熊猫,你可以这样做,这可能与你想到的一致。
首先浏览每个文件,只是在单独的列表中收集日期。然后找到这些列表的交集,视为集合。现在再次浏览每个文件,写出日期在交集中的每条记录。
def get_dates(one_file):
one_file_dates = []
with open(one_file) as the_file:
for line in the_file.readlines():
the_date = line[:line.find(';')]
if not the_date in one_file_dates:
one_file_dates.append(the_date)
return one_file_dates
common_dates = set(get_dates('file_1.csv')).intersection(set(get_dates('file_2.csv')))
print ('*** processing file_1')
with open('file_1.csv') as the_file:
for line in the_file.readlines():
if line[:line.find(';')] in common_dates:
print(line.strip())
print ('*** processing file_2')
with open('file_2.csv') as the_file:
for line in the_file.readlines():
if line[:line.find(';')] in common_dates:
print(line.strip())
结果:
*** processing file_1
May 22, 2017;12,615.50;12,650.50;12,665.00;12,567.00;-;-0.18%
May 19, 2017;12,638.69;12,612.30;12,658.55;12,596.72;121.95M;0.39%
May 17, 2017;12,631.61;12,700.12;12,786.89;12,587.45;108.95M;-1.35%
May 15, 2017;12,807.04;12,824.05;12,832.29;12,729.49;87.08M;0.29%
*** processing file_2
May 22, 2017;1.1238;1.1200;1.1265;1.1160;0.28%
May 19, 2017;1.1207;1.1100;1.1214;1.1094;0.94%
May 17, 2017;1.1159;1.1082;1.1163;1.1078;0.69%
May 15, 2017;1.0975;1.0924;1.0991;1.0920;0.40%
编辑:响应评论的新代码。
def get_dates(one_file):
one_file_dates = []
with open(one_file) as the_file:
for line in the_file.readlines():
the_date = line[:line.find(';')]
if not the_date in one_file_dates:
one_file_dates.append(the_date)
return one_file_dates
file_list = ['file_1.csv', 'file_2.csv'] # add more file names here
common_dates = set(get_dates(file_list[0]))
for file in file_list[1:]:
common_dates = common_dates.intersection(set(get_dates(file)))
for file in file_list:
print ('*** processing ', file)
with open(file) as the_file:
for line in the_file.readlines():
if line[:line.find(';')] in common_dates:
print(line.strip())