Merging two CSV files using Python

Asked: 2013-04-28 17:48:15

Tags: python csv dictionary merge key

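OK, I have read several threads here on Stack Overflow. I thought this would be fairly easy, but I find that I still don't have a good grasp of Python. I tried the example at "How to combine 2 csv files with common column value, but both files have different number of lines", which was helpful, but I still don't get the result I'm after.

Essentially I have two CSV files that share a common first column, and I would like to merge the two, i.e.: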

filea.csv

title,stage,jan,feb
darn,3.001,0.421,0.532
ok,2.829,1.036,0.751
three,1.115,1.146,2.921

fileb.csv

title,mar,apr,may,jun,
darn,0.631,1.321,0.951,1.751
ok,1.001,0.247,2.456,0.3216
three,0.285,1.283,0.924,956

output.csv (what I want, not what I get)

title,stage,jan,feb,mar,apr,may,jun
darn,3.001,0.421,0.532,0.631,1.321,0.951,1.751
ok,2.829,1.036,0.751,1.001,0.247,2.456,0.3216
three,1.115,1.146,2.921,0.285,1.283,0.924,956

output.csv (the output I actually get)

title,feb,may
ok,0.751,2.456
three,2.921,0.924
darn,0.532,0.951

The code I am trying:

'''
testing merging of 2 csv files
'''
import csv
import array
import os

with open('Z:\\Desktop\\test\\filea.csv') as f:
    r = csv.reader(f, delimiter=',')
    dict1 = {row[0]: row[3] for row in r}

with open('Z:\\Desktop\\test\\fileb.csv') as f:
    r = csv.reader(f, delimiter=',')
    #dict2 = {row[0]: row[3] for row in r}
    dict2 = {row[0:3] for row in r}

print str(dict1)
print str(dict2)

keys = set(dict1.keys() + dict2.keys())
with open('Z:\\Desktop\\test\\output.csv', 'wb') as f:
    w = csv.writer(f, delimiter=',')
    w.writerows([[key, dict1.get(key, "''"), dict2.get(key, "''")] for key in keys])

Any help is greatly appreciated.

2 answers:

Answer 0 (score: 59)

When I'm working with CSV files, I often use the pandas library. It makes things like this very easy. For example:

import pandas as pd

a = pd.read_csv("filea.csv")
b = pd.read_csv("fileb.csv")
b = b.dropna(axis=1)
merged = a.merge(b, on='title')
merged.to_csv("output.csv", index=False)

Some explanation follows. First, we read in the CSV files:

>>> a = pd.read_csv("filea.csv")
>>> b = pd.read_csv("fileb.csv")
>>> a
   title  stage    jan    feb
0   darn  3.001  0.421  0.532
1     ok  2.829  1.036  0.751
2  three  1.115  1.146  2.921
>>> b
   title    mar    apr    may       jun  Unnamed: 5
0   darn  0.631  1.321  0.951    1.7510         NaN
1     ok  1.001  0.247  2.456    0.3216         NaN
2  three  0.285  1.283  0.924  956.0000         NaN

We see there's an extra column in the data (note that the first line of fileb.csv, title,mar,apr,may,jun, has an extra comma at the end). We can get rid of it easily:

>>> b = b.dropna(axis=1)
>>> b
   title    mar    apr    may       jun
0   darn  0.631  1.321  0.951    1.7510
1     ok  1.001  0.247  2.456    0.3216
2  three  0.285  1.283  0.924  956.0000
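One caveat: dropna(axis=1) removes every column that contains at least one NaN, so it would also drop a genuine data column that happens to have a missing value. Here is a narrower sketch, assuming the stray column always gets a pandas-generated header beginning with "Unnamed". The helper name drop_unnamed_columns is made up for illustration, and the CSV text is an in-memory copy of fileb.csv:

```python
import io

import pandas as pd


def drop_unnamed_columns(df):
    """Return df without the columns pandas auto-named 'Unnamed: N'."""
    keep = [col for col in df.columns if not str(col).startswith("Unnamed")]
    return df[keep]


# In-memory stand-in for fileb.csv, including the trailing comma in the header.
FILEB = """title,mar,apr,may,jun,
darn,0.631,1.321,0.951,1.751
ok,1.001,0.247,2.456,0.3216
three,0.285,1.283,0.924,956
"""

b = drop_unnamed_columns(pd.read_csv(io.StringIO(FILEB)))
print(list(b.columns))
```

This only removes columns by name, so real columns with gaps in them survive.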

Now we can merge a and b on the title column:

>>> merged = a.merge(b, on='title')
>>> merged
   title  stage    jan    feb    mar    apr    may       jun
0   darn  3.001  0.421  0.532  0.631  1.321  0.951    1.7510
1     ok  2.829  1.036  0.751  1.001  0.247  2.456    0.3216
2  three  1.115  1.146  2.921  0.285  1.283  0.924  956.0000
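By default merge performs an inner join, so a title that appears in only one file is dropped. If the two files can have different sets of rows, passing how="outer" keeps every title. A small sketch with made-up data (the titles "extra" and "new" are hypothetical and not part of the question's files):

```python
import io

import pandas as pd

# Hypothetical inputs: "extra" exists only in the first file, "new" only in the second.
A = """title,stage,jan
darn,3.001,0.421
extra,1.5,0.9
"""
B = """title,mar
darn,0.631
new,0.285
"""

a = pd.read_csv(io.StringIO(A))
b = pd.read_csv(io.StringIO(B))

# Default how="inner": only titles present in both frames are kept.
inner = a.merge(b, on="title")

# how="outer": every title is kept; missing values are filled with NaN.
outer = a.merge(b, on="title", how="outer")

print(inner["title"].tolist())
print(sorted(outer["title"].tolist()))
```

With the question's files both joins give the same result, since all three titles appear in both.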

Finally, we write it out:

>>> merged.to_csv("output.csv", index=False)
which produces:

title,stage,jan,feb,mar,apr,may,jun
darn,3.001,0.421,0.532,0.631,1.321,0.951,1.751
ok,2.829,1.036,0.751,1.001,0.247,2.456,0.3216
three,1.115,1.146,2.921,0.285,1.283,0.924,956.0

Answer 1 (score: 1)

You need to store all of the remaining fields of each row in your dictionaries, not just one of them:

dict1 = {row[0]: row[1:] for row in r}
...
dict2 = {row[0]: row[1:] for row in r}

Then, since the values in the dictionaries are lists, you need to concatenate the lists together:

w.writerows([[key] + dict1.get(key, []) + dict2.get(key, []) for key in keys])
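Putting the two changes together, a self-contained version might look like the sketch below. It is written for Python 3 (the question's code is Python 2), reads from in-memory copies of the two files rather than the Z:\ paths, and the helper names read_rows and merge_rows are made up for illustration. It also strips trailing empty fields, which is an assumption added here to cope with the extra comma in fileb.csv's header:

```python
import csv
import io


def read_rows(f):
    """Map each row's first field to a list of its remaining fields."""
    rows = {}
    for row in csv.reader(f):
        # Drop trailing empty fields, e.g. from the trailing comma in a header.
        while row and row[-1] == "":
            row.pop()
        if row:  # skip blank lines
            rows[row[0]] = row[1:]
    return rows


def merge_rows(rows_a, rows_b):
    """Join on the key: first file's order, then keys found only in the second."""
    keys = list(rows_a) + [k for k in rows_b if k not in rows_a]
    return [[k] + rows_a.get(k, []) + rows_b.get(k, []) for k in keys]


FILEA = """title,stage,jan,feb
darn,3.001,0.421,0.532
ok,2.829,1.036,0.751
three,1.115,1.146,2.921
"""
FILEB = """title,mar,apr,may,jun,
darn,0.631,1.321,0.951,1.751
ok,1.001,0.247,2.456,0.3216
three,0.285,1.283,0.924,956
"""

merged = merge_rows(read_rows(io.StringIO(FILEA)), read_rows(io.StringIO(FILEB)))

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(merged)
print(out.getvalue())
```

Dictionaries keep insertion order in Python 3.7+, so the header row (keyed by "title") stays first and the rows keep the order of filea.csv. With real files, open them with newline="" as the csv module documentation recommends. Note that a key missing from one file produces a shorter row here; padding would be needed to keep the columns aligned in that case.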