我是python的初学者,我正在尝试比较两个csv文件中的两个字段(时间戳),如果匹配则将它们合并到第三个文件中。输入文件如下所示:
文件1:
name,time,operation
Cassandra,2015-10-06T15:07:22.333662984Z,INSERT
Cassandra,2015-10-06T15:07:24.334536781Z,INSERT
Cassandra,2015-10-06T15:07:27.339662984Z,READ
file2的:
name,time,host,instance,type,type_instance,value
cpu_value,2015-10-06T15:07:22.333662984Z,vm1-VirtualBox,0,cpu,user,24874
cpu_value,2015-10-06T15:07:24.334536781Z,vm1-VirtualBox,0,cpu,nice,592
cpu_value,2015-10-06T15:07:27.339662984Z,vm1-VirtualBox,0,cpu,system,2932
这是我到目前为止所尝试的:
import csv
with open('f1.csv', 'rb') as f1, open ('f2.csv', 'rb') as f2:
next(f1) #skip line 1
next(f2) #skip line 1
reader1 = csv.reader(f1)
reader2 = csv.reader(f2)
for row1 in reader1:
for row2 in reader2:
if row1[1] == row2[1]:
data = [row1[0],row2[0]]
print data
然后,我收到了这个错误:
['cpu_value', 'Cassandra']
Traceback (most recent call last):
File "merger.py", line 10, in <module>
if row1[1] == row2[1]:
IndexError: list index out of range
更新
预期产出:
Cassandra,2015-10-06T15:07:22.333662984Z,INSERT,cpu_value,vm1-VirtualBox,0,cpu,user,24874
Cassandra,2015-10-06T15:07:24.334536781Z,INSERT,cpu_value,vm1-VirtualBox,0,cpu,nice,592
Cassandra,2015-10-06T15:07:27.339662984Z,READ,cpu_value,vm1-VirtualBox,0,cpu,nice,592
您可以通过此link访问这些文件。如果您有任何想法,请告诉我。谢谢。
答案 0 :(得分:2)
如果我们可以假设所有时间戳都是唯一的并且它们将完全匹配(也就是说,不需要插值),那么我们可以从将第一个文件读入字典开始,其中键是时间戳和值为CSV行。
然后,我们读取第二个文件,并为每个CSV行查找字典。如果有匹配的行,我们可以打印,或存储,或其他任何东西。此外,我们使用pop
删除匹配的行,以便我们可以在最后轻松找到不匹配的行!
以下代码应为2 * 3行的测试数据提供所需的输出。
import csv
matches = []
unmatched1 = []
unmatched2 = []
f1_dict = {}
with open('f1.csv', 'r') as f1:
next(f1) # skip line 1
reader1 = csv.reader(f1)
for row1 in reader1:
f1_dict[row1[1]] = row1
with open('f2.csv', 'r') as f2:
next(f2) # skip line 1
reader2 = csv.reader(f2)
for row2 in reader2:
row1 = f1_dict.pop(row2[1], None)
if row1 is None:
unmatched2.append(row2)
else:
matches.append((row1, row2))
unmatched1 = list(f1_dict.values())
for row1, row2 in matches:
output_row = row1 + [row2[0]] + row2[2:]
print(','.join(output_row)) # or use csv.writer
答案 1 :(得分:1)
您还可以使用pandas DataFrame:pandas package
import csv
import pandas as pd
L1 = pd.read_csv('f1.csv')
L2 = pd.read_csv('f2.csv')
result = pd.merge(L1, L2, on='time')
for row in result.values:
print row
输出:
['Cassandra' '2015-10-06T15:07:22.333662984Z' 'INSERT' 'cpu_value' 'vm1-VirtualBox' 0L 'cpu' 'user' 24874L]
['Cassandra' '2015-10-06T15:07:24.334536781Z' 'INSERT' 'cpu_value' 'vm1-VirtualBox' 0L 'cpu' 'nice' 592L]
['Cassandra' '2015-10-06T15:07:27.339662984Z' 'READ' 'cpu_value' 'vm1-VirtualBox' 0L 'cpu' 'system' 2932L]
更多信息:http://pandas.pydata.org/pandas-docs/stable/merging.html
答案 2 :(得分:0)
import csv
L1 = []
L2 = []
with open('f1.csv', 'rb') as f1, open ('f2.csv', 'rb') as f2:
next(f1) #skip line 1
next(f2) #skip line 1
reader1 = csv.reader(f1)
reader2 = csv.reader(f2)
for row in reader1:
L1.append(row)
for row in reader2:
L2.append(row)
for row1 in L1:
for row2 in L2:
if row1[1] == row2[1]:
data = row1+[row2[0]]+row[2:]
print data
输出:
['Cassandra', '2015-10-06T15:07:22.333662984Z', 'INSERT', 'cpu_value', 'vm1-VirtualBox', '0', 'cpu', 'system', '2932']
['Cassandra', '2015-10-06T15:07:24.334536781Z', 'INSERT', 'cpu_value', 'vm1-VirtualBox', '0', 'cpu', 'system', '2932']
['Cassandra', '2015-10-06T15:07:27.339662984Z', 'READ', 'cpu_value', 'vm1-VirtualBox', '0', 'cpu', 'system', '2932']
您也可以尝试pandas DataFrame:https://stackoverflow.com/a/33244071/1924666