Question

I'm relatively new to Python and am trying to use it to merge two sorted files, each containing 4 columns:

File 1:

x-coordinate, y-coordinate, data 1, data 2  
1, 10, 20, 0  
5, 15, 1, 2  
...

File 2:

x-coordinate, y-coordinate, data 3, data 4  
1, 10, 7, 8  
3, 25, 1, 2  
...

into a single sorted file with 6 columns, containing one row for each unique (x, y) coordinate pair:

x-coordinate, y-coordinate, data 1, data 2, data 3, data 4  
1, 10, 20, 0, 7, 8  
3, 25, 0, 0, 1, 2  
5, 15, 1, 2, 0, 0

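If the order of the output file didn't matter, I think this task would be trivial with a dictionary. Since my input files are 100 lines long, I've been trying to come up with an efficient, "pythonic" way to handle the duplicate case (i.e. the same (x, y) coordinate appearing in both files), but so far I'm stumped.

Any and all help is appreciated. Thanks in advance!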

Answer 1

I would probably use a defaultdict for something like this:

from collections import defaultdict
from itertools import chain   

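# each (x, y) key maps to [data 1, data 2, data 3, data 4], defaulting to zeros
d = defaultdict(lambda: [0, 0, 0, 0])
with open('file1') as f1, open('file2') as f2:
    next(f1)  # get rid of header info
    next(f2)
    # read the files one after the other: zip(f1, f2) would stop at the
    # end of the shorter file and silently drop the remaining lines
    for line in f1:
        data = [int(x) for x in line.split(',')]
        d[tuple(data[:2])][:2] = data[2:]  # data 1, data 2
    for line in f2:
        data = [int(x) for x in line.split(',')]
        d[tuple(data[:2])][2:] = data[2:]  # data 3, data 4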

#now sort the items and write them out:
#This puts them in stdout, but you could easily use `file.write` here.
for k,v in sorted(d.items()):
    print(', '.join(str(x) for x in chain(k,v)))
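Since both inputs are already sorted, a streaming variant is also possible that never holds more than one coordinate group in memory. The following is only a sketch of that alternative, not part of the approach above: it assumes each file is sorted by (x, y), has exactly two data columns, and starts with one header line. The names `parse` and `merge_sorted_rows` and the sample lists are hypothetical.

```python
import heapq
from itertools import groupby


def parse(lines, offset):
    """Yield ((x, y), offset, [values]) per data line, skipping the header."""
    it = iter(lines)
    next(it)  # skip the header line
    for line in it:
        if not line.strip():
            continue  # ignore blank lines
        nums = [int(v) for v in line.split(',')]
        yield tuple(nums[:2]), offset, nums[2:]


def merge_sorted_rows(lines1, lines2):
    """Lazily merge two (x, y)-sorted inputs into 6-column rows."""
    # heapq.merge keeps the combined stream sorted by the (x, y) key
    merged = heapq.merge(parse(lines1, 0), parse(lines2, 2),
                         key=lambda item: item[0])
    # consecutive items with the same (x, y) belong to one output row
    for key, group in groupby(merged, key=lambda item: item[0]):
        row = [0, 0, 0, 0]  # missing columns stay zero
        for _, offset, values in group:
            row[offset:offset + 2] = values
        yield list(key) + row


# sample data taken from the question
file1 = ["x-coordinate, y-coordinate, data 1, data 2",
         "1, 10, 20, 0", "5, 15, 1, 2"]
file2 = ["x-coordinate, y-coordinate, data 3, data 4",
         "1, 10, 7, 8", "3, 25, 1, 2"]

for row in merge_sorted_rows(file1, file2):
    print(', '.join(str(v) for v in row))
```

Open file objects can be passed in place of the sample lists, since the function only iterates over lines. If the inputs are not actually sorted by (x, y), the dictionary approach above is the safer choice.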

Answer 2

Using pandas:

import pandas as pd

df1 = pd.read_csv("coord1.csv")
df2 = pd.read_csv("coord2.csv")
combined = df1.merge(df2, how='outer').fillna(0)
# DataFrame.sort was removed from pandas; sort_values is its replacement
combined.sort_values(list(combined.columns[:2]), inplace=True)
combined.to_csv("coord_merged.csv",index=False)

First, read in the raw data:

>>> import pandas as pd
>>> df1 = pd.read_csv("coord1.csv")
>>> df2 = pd.read_csv("coord2.csv")
>>> df1
   x-coordinate   y-coordinate   data 1   data 2
0             1             10       20        0
1             5             15        1        2
>>> df2
   x-coordinate   y-coordinate   data 3   data 4  
0             1             10        7          8
1             3             25        1          2

Merge them, and fill in the missing data with zeros:

>>> combined = df1.merge(df2, how='outer')
>>> combined
   x-coordinate   y-coordinate   data 1   data 2   data 3   data 4  
0             1             10       20        0        7          8
1             5             15        1        2      NaN        NaN
2             3             25      NaN      NaN        1          2
>>> combined = df1.merge(df2, how='outer').fillna(0)
>>> combined
   x-coordinate   y-coordinate   data 1   data 2   data 3   data 4  
0             1             10       20        0        7          8
1             5             15        1        2        0          0
2             3             25        0        0        1          2

Sort by the two coordinate columns:

>>> combined.sort_values(list(combined.columns[:2]), inplace=True)
>>> combined
   x-coordinate   y-coordinate   data 1   data 2   data 3   data 4  
0             1             10       20        0        7          8
2             3             25        0        0        1          2
1             5             15        1        2        0          0

Finally, write it out:

>>> combined.to_csv("coord_merged.csv",index=False)
>>> !cat coord_merged.csv
x-coordinate, y-coordinate, data 1, data 2, data 3, data 4  
1.0,10.0,20.0,0.0,7.0,8.0
3.0,25.0,0.0,0.0,1.0,2.0
5.0,15.0,1.0,2.0,0.0,0.0

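The values come out as floats because the missing entries were NaN before being filled. If preserving the integer format is important, then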

>>> combined.astype(int).to_csv("coord_merged.csv",index=False)
>>> !cat coord_merged.csv
x-coordinate, y-coordinate, data 1, data 2, data 3, data 4  
1,10,20,0,7,8
3,25,0,0,1,2
5,15,1,2,0,0

will do it.

Efficient way to merge 2 sorted files with duplicate coordinates in Python