Compare a column in two CSV files and write the matching rows to a third CSV file

Date: 2013-12-14 02:03:46

Tags: python csv

I want to compare a column across two CSV files and extract the matching values, as follows:

file1.csv

115.06603,5.9

114.74721,5.4

114.85107,6.2

111.17744,5.5

192.77787,3.2

189.70226,5

0.46762,3.7

2.21539,3.5

2.96667,3.6

and file2.csv

115.06603

115.06603

114.74721

114.74721

114.74721

114.74721

114.85107

114.85107

114.85107

114.85107

114.85107

111.17744

111.17744

The output file file3.csv should be

115.06603,5.9

115.06603,5.9

114.74721,5.4

114.74721,5.4

114.74721,5.4

114.74721,5.4

114.85107,6.2

114.85107,6.2

114.85107,6.2

114.85107,6.2

114.85107,6.2

111.17744,5.5

111.17744,5.5

I used the following code, but the output file only contains the first column instead of both columns. Can you help me fix this?

>>> with open("file1.csv", "rb") as in_file1, open("file2.csv", "rb") as in_file2,    open("file3.csv", "wb") as out_file:
...   reader1 = csv.reader(in_file1)
...   reader2 = csv.reader(in_file2)
...   writer = csv.writer(out_file)
...   for row2 in reader2:
...     for row1 in reader1:
...       if row2[0] == row1[0]:
...         row2[1] = row1[1]
...     writer.writerow(row2)

Edit: I used your code, but the first part gives the following error:

data1 = {}
with open("file1.csv", "rb") as in_file1:
...   reader1 = csv.reader(in_file1)
...   for row1 in reader1:
...     data1[row1[0]] = row1[1]
... 
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
IndexError: list index out of range

Since the delimiter in file1.csv is ; rather than , I added delimiter=';' as follows:

data1 = {}
with open("file1.csv", "rb") as in_file1:
...   reader1 = csv.reader(in_file1, delimiter=';')
...   for row1 in reader1:
...     data1[row1[0]] = row1[1]
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
IndexError: list index out of range

As you can see, the error is the same.

I then added in_file1.seek(0), as shown below:

data1 = {}
with open("file1.csv", "rb") as in_file1:
...   reader1 = csv.reader(in_file1)
...   for row1 in reader1:
...     in_file1.seek(0)
...     data1[row1[0]] = row1[1]
... 
Traceback (most recent call last):
File "<stdin>", line 5, in <module>
IndexError: list index out of range

Same error. What is the problem? I am frustrated.

Edit

The code I used to remove the empty lines:

with open("file2.csv", "r") as in_file2, open("out.csv", "w") as out_file:
...  reader2 = csv.reader(in_file2)
...  writer = csv.writer(out_file)
...  for row in reader2:
...    if any(field.strip() for field in row):
...      writer.writerow(row)

2 answers:

Answer 0 (score: 1)

This line

row2[1] = row1[1]

does not work, because row2[1] does not exist yet.

You should use

row2.append(row1[1])

instead.
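A minimal sketch of the difference, using a hypothetical one-element list in place of a row read from file2.csv (the values are taken from the sample data in the question):

```python
# A row from file2.csv has a single field, so index 1 does not exist yet.
row2 = ["115.06603"]

try:
    row2[1] = "5.9"   # assigning to a missing list index raises IndexError
    assigned = True
except IndexError:
    assigned = False

row2.append("5.9")    # append grows the list by one element instead
print(assigned, row2)
```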

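Edit: Also, the inner for loop only runs to completion the first time, because a file can only be iterated once. You should do the following instead: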

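import csv

# The "rb"/"wb" modes below are the Python 2 convention used in the question.
# On Python 3, open the files in text mode with newline="" instead.

# Load file1.csv once into a dict: first column -> second column.
data1 = {}
with open("file1.csv", "rb") as in_file1:
    reader1 = csv.reader(in_file1)
    for row1 in reader1:
        data1[row1[0]] = row1[1]

# Stream file2.csv and append the matching value to each row.
with open("file2.csv", "rb") as in_file2, open("file3.csv", "wb") as out_file:
    reader2 = csv.reader(in_file2)
    writer = csv.writer(out_file)
    for row2 in reader2:
        if row2[0] in data1:
            row2.append(data1[row2[0]])
        writer.writerow(row2)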

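Note that this effectively loads all of file1 into memory. If that is a problem, you can instead fix the read-only-once issue in your original code by adding in_file1.seek(0) after each pass over reader1 (or whatever the equivalent is for rewinding a file that is being used with a csv reader). That approach will be slower than the one I gave above.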
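The rewind variant could look roughly like the sketch below. It is written for Python 3 and uses in-memory io.StringIO objects as stand-ins for the two files; the function name join_with_rewind is hypothetical. Two things here are assumptions that go beyond the answer: rows that csv.reader returns as empty (blank lines, which the question's later edit suggests are present) are skipped, and rows of file2 without a match are dropped rather than written unchanged.

```python
import csv
import io


def join_with_rewind(in_file1, in_file2):
    """Nested-loop join that rewinds in_file1 for every row of in_file2."""
    joined = []
    for row2 in csv.reader(in_file2):
        if not row2:          # csv.reader yields [] for a blank line
            continue
        in_file1.seek(0)      # rewind so the inner loop starts from the top
        for row1 in csv.reader(in_file1):
            # Skip blank or one-column rows instead of raising IndexError.
            if len(row1) >= 2 and row1[0] == row2[0]:
                joined.append([row2[0], row1[1]])
                break         # the first match is enough
    return joined


# In-memory stand-ins for file1.csv and file2.csv, including blank lines.
file1 = io.StringIO("115.06603,5.9\n\n114.74721,5.4\n")
file2 = io.StringIO("115.06603\n115.06603\n\n114.74721\n")
result = join_with_rewind(file1, file2)
print(result)
```

With real files, the same function should work on handles opened with open(path, newline=""). Every row of file2 triggers a fresh scan of file1, which is why this is slower than the dictionary lookup.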

Answer 1 (score: 0)

Awk can do this for you quite easily:

awk -F, 'FNR==NR{a[$1]=$0;next}{print a[$1]}' file1.csv file2.csv

Result:

115.06603, 5.9
115.06603, 5.9
114.74721, 5.4
114.74721, 5.4
114.74721, 5.4
114.74721, 5.4
114.85107, 6.2
114.85107, 6.2
114.85107, 6.2
114.85107, 6.2
114.85107, 6.2
111.17744, 5.5
111.17744, 5.5

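Here is how it works. First, "-F," sets the field separator to a comma. Next, the "FNR==NR" condition tells awk to run the first set of braces ({}) only while it is reading the first file (file1.csv): it stores the whole line ($0) in an array named "a", at the index given by the first column of file1.csv, and "next" then moves on to the next line. The second set of braces ({}) applies while file2.csv is being processed: it prints the entry of "a" at the index given by the first column of the current line of file2.csv.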
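For readers more at home in Python, the same two-pass logic can be sketched as below. This is an illustration rather than a drop-in replacement for awk: the function name awk_like_join is hypothetical, and it splits lines on the separator directly instead of using the csv module, so it assumes fields contain no quoted commas.

```python
def awk_like_join(file1_lines, file2_lines, sep=","):
    """Mimic: awk -F, 'FNR==NR{a[$1]=$0;next}{print a[$1]}' file1 file2."""
    a = {}
    # First pass (FNR==NR): store each whole line under its first field.
    for line in file1_lines:
        line = line.rstrip("\n")
        a[line.split(sep)[0]] = line
    # Second pass: look up the first field of each line of the second file.
    # Like awk, a missing key produces an empty string.
    return [a.get(line.rstrip("\n").split(sep)[0], "") for line in file2_lines]


lines = awk_like_join(
    ["115.06603,5.9\n", "114.74721,5.4\n"],
    ["115.06603\n", "114.74721\n", "9.9\n"],
)
print(lines)
```

The arguments can be any iterables of lines, including open file objects.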