I'm relatively new to Python and am trying to use it to merge two sorted files that each contain 4 columns:
File 1:
x-coordinate, y-coordinate, data 1, data 2
1, 10, 20, 0
5, 15, 1, 2
...
File 2:
x-coordinate, y-coordinate, data 3, data 4
1, 10, 7, 8
3, 25, 1, 2
...
into a single sorted file with 6 columns, containing one row for each unique (x, y) coordinate pair:
x-coordinate, y-coordinate, data 1, data 2, data 3, data 4
1, 10, 20, 0, 7, 8
3, 25, 0, 0, 1, 2
5, 15, 1, 2, 0, 0
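If the order of the output file didn't matter, I think this task would be trivial with a dictionary. Since my input files are as long as 100 lines, I've been trying to come up with an efficient, "Pythonic" way to handle duplicates (i.e., the same (x, y) coordinate appearing in both files), but so far I'm having a hard time.
Any and all help is appreciated. Thanks in advance!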
Answer 0: (score: 3)
I would probably use a defaultdict for something like this:
from collections import defaultdict
from itertools import chain

# Each (x, y) key maps to [data 1, data 2, data 3, data 4], defaulting to zeros.
d = defaultdict(lambda: [0, 0, 0, 0])
with open('file1') as f1, open('file2') as f2:
    next(f1)  # get rid of header info
    next(f2)
    # Read the files one after the other: zip(f1, f2) would stop at the
    # end of the shorter file and silently drop the remaining rows.
    for line in f1:
        data = [int(x) for x in line.split(',')]
        d[tuple(data[:2])][:2] = data[2:]
    for line in f2:
        data = [int(x) for x in line.split(',')]
        d[tuple(data[:2])][2:] = data[2:]

# Now sort the items and write them out.
# This puts them in stdout, but you could easily use `file.write` here.
for k, v in sorted(d.items()):
    print(', '.join(str(x) for x in chain(k, v)))
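As a rough sketch of the `file.write` variant mentioned in the comment above, the same defaultdict idea can be wrapped in a function that writes to any file-like object. The function name `merge_rows`, the blank-line handling, and the in-memory `io.StringIO` inputs are illustrative assumptions, not part of the original answer:

```python
import io
from collections import defaultdict


def merge_rows(lines1, lines2, out):
    """Merge two iterables of 'x, y, a, b' lines (header first) into `out`."""
    merged = defaultdict(lambda: [0, 0, 0, 0])
    # Pair each source with the slice of the 4-value list that it fills.
    for lines, target in ((lines1, slice(0, 2)), (lines2, slice(2, 4))):
        rows = iter(lines)
        next(rows)  # skip the header line
        for line in rows:
            if not line.strip():
                continue  # assumption: blank lines are simply ignored
            x, y, a, b = (int(field) for field in line.split(','))
            merged[(x, y)][target] = [a, b]
    out.write('x-coordinate, y-coordinate, data 1, data 2, data 3, data 4\n')
    # Tuples sort by x first, then y.
    for key, values in sorted(merged.items()):
        out.write(', '.join(str(n) for n in (*key, *values)) + '\n')


# In-memory stand-ins for the two input files from the question.
file1 = io.StringIO('x-coordinate, y-coordinate, data 1, data 2\n1, 10, 20, 0\n5, 15, 1, 2\n')
file2 = io.StringIO('x-coordinate, y-coordinate, data 3, data 4\n1, 10, 7, 8\n3, 25, 1, 2\n')
result = io.StringIO()
merge_rows(file1, file2, result)
print(result.getvalue(), end='')
```

With real files, the three `io.StringIO` objects would be replaced by handles from `open(...)`, with the output file opened in write mode.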
Answer 1: (score: 2)
Using pandas:
import pandas as pd
df1 = pd.read_csv("coord1.csv")
df2 = pd.read_csv("coord2.csv")
combined = df1.merge(df2, how='outer').fillna(0)
combined.sort_values(list(combined.columns[:2]), inplace=True)  # DataFrame.sort was removed in pandas 0.20
combined.to_csv("coord_merged.csv",index=False)
First, read in the raw data:
>>> import pandas as pd
>>> df1 = pd.read_csv("coord1.csv")
>>> df2 = pd.read_csv("coord2.csv")
>>> df1
   x-coordinate  y-coordinate  data 1  data 2
0             1            10      20       0
1             5            15       1       2
>>> df2
   x-coordinate  y-coordinate  data 3  data 4
0             1            10       7       8
1             3            25       1       2
Merge them, and fill the missing data with zeros:
>>> combined = df1.merge(df2, how='outer')
>>> combined
   x-coordinate  y-coordinate  data 1  data 2  data 3  data 4
0             1            10      20       0       7       8
1             5            15       1       2     NaN     NaN
2             3            25     NaN     NaN       1       2
>>> combined = df1.merge(df2, how='outer').fillna(0)
>>> combined
   x-coordinate  y-coordinate  data 1  data 2  data 3  data 4
0             1            10      20       0       7       8
1             5            15       1       2       0       0
2             3            25       0       0       1       2
Sort (in current pandas the method is sort_values; DataFrame.sort was removed in 0.20):
>>> combined.sort_values(list(combined.columns[:2]), inplace=True)
>>> combined
   x-coordinate  y-coordinate  data 1  data 2  data 3  data 4
0             1            10      20       0       7       8
2             3            25       0       0       1       2
1             5            15       1       2       0       0
Finally, write it out:
>>> combined.to_csv("coord_merged.csv",index=False)
>>> !cat coord_merged.csv
x-coordinate, y-coordinate, data 1, data 2, data 3, data 4
1.0,10.0,20.0,0.0,7.0,8.0
3.0,25.0,0.0,0.0,1.0,2.0
5.0,15.0,1.0,2.0,0.0,0.0
If keeping the integer format is important, then
>>> combined.astype(int).to_csv("coord_merged.csv",index=False)
>>> !cat coord_merged.csv
x-coordinate, y-coordinate, data 1, data 2, data 3, data 4
1,10,20,0,7,8
3,25,0,0,1,2
5,15,1,2,0,0
will do that.
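For reference, the steps in this answer can be put together in one self-contained sketch that runs without any files on disk. A few things here are my own assumptions rather than part of the answer: the in-memory strings stand in for coord1.csv and coord2.csv, `skipinitialspace=True` is added so the spaces after the commas do not end up in the column names, and the join keys are passed explicitly via `on=` instead of relying on the shared column names being detected.

```python
import io

import pandas as pd

# In-memory stand-ins for coord1.csv and coord2.csv (assumed contents).
FILE1 = """x-coordinate, y-coordinate, data 1, data 2
1, 10, 20, 0
5, 15, 1, 2
"""
FILE2 = """x-coordinate, y-coordinate, data 3, data 4
1, 10, 7, 8
3, 25, 1, 2
"""

keys = ["x-coordinate", "y-coordinate"]

# skipinitialspace=True drops the blank that follows each comma.
df1 = pd.read_csv(io.StringIO(FILE1), skipinitialspace=True)
df2 = pd.read_csv(io.StringIO(FILE2), skipinitialspace=True)

combined = (
    df1.merge(df2, on=keys, how="outer")  # keep coordinates from either file
    .fillna(0)                            # missing data columns become 0
    .sort_values(keys)                    # order by x, then y
    .astype(int)                          # undo the float upcast caused by NaN
    .reset_index(drop=True)
)

print(combined.to_csv(index=False), end="")
```

Chaining the calls avoids `inplace=True`; whether that reads better than the step-by-step version is a matter of taste.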