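I have 2 DataFrames and need to join them on the pair of key columns (idA, idB), then compute the sum of their Distance column. Note that (idA, idB) is equivalent to (idB, idA), so the distances for both orderings must be added together.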
In [1]: df1 = pd.DataFrame({'idA': ['1', '2', '3', '2'],
...: 'idB': ['1', '4', '8', '1'],
...: 'Distance': ['0.727273', '0.827273', '0.127273', '0.927273']},
...: index=[0, 1, 2, 3])
...:
In [2]: df2 = pd.DataFrame({'idA': ['1', '5', '2', '5'],
...: 'idB': ['2', '1', '4', '7'],
...: 'Distance': ['0.11', '0.1', '3.0', '0.8']},
...: index=[4, 5, 6, 7])
The output should look like this:
Sum_Distance idA idB
0 0.727273 1 1
1 3.827273 2 4 <-- 2,4 = 3.0 + 2,4 = 0.827273
2 0.127273 3 8
3 1.037273 2 1 <-- 2,1 = 0.927273 + 1,2 = 0.11
4 0.1 5 1
5 0.8 5 7
Any help finding a way to do this with Pandas / Spark is appreciated.
Answer 0 (score: 2)
First convert the Distance column of both frames to numeric and sort each (idA, idB) pair row-wise, then use add with set_index so that matching pairs align:
import numpy as np
df1['Distance'] = df1['Distance'].astype(float)
df2['Distance'] = df2['Distance'].astype(float)
# if some values cannot be parsed, coerce them to NaN instead
#df1['Distance'] = pd.to_numeric(df1['Distance'], errors='coerce')
#df2['Distance'] = pd.to_numeric(df2['Distance'], errors='coerce')
df1[['idA','idB']] = np.sort(df1[['idA','idB']], axis=1)
df2[['idA','idB']] = np.sort(df2[['idA','idB']], axis=1)
print (df1)
Distance idA idB
0 0.727273 1 1
1 0.827273 2 4
2 0.127273 3 8
3 0.927273 1 2
print (df2)
Distance idA idB
4 0.11 1 2
5 0.10 1 5
6 3.00 2 4
7 0.80 5 7
df3=df1.set_index(['idA','idB']).add(df2.set_index(['idA','idB']),fill_value=0).reset_index()
print (df3)
idA idB Distance
0 1 1 0.727273
1 1 2 1.037273
2 1 5 0.100000
3 2 4 3.827273
4 3 8 0.127273
5 5 7 0.800000
Or use concat with groupby and aggregate with sum:
df3 = pd.concat([df1, df2]).groupby(['idA','idB'], as_index=False)['Distance'].sum()
print (df3)
idA idB Distance
0 1 1 0.727273
1 1 2 1.037273
2 1 5 0.100000
3 2 4 3.827273
4 3 8 0.127273
5 5 7 0.800000
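The steps above can be collected into one self-contained sketch that leaves df1 and df2 unmodified. The helper name `sum_pair_distances` is hypothetical (it is not a pandas function), and the sketch assumes string ids and string distances exactly as in the question:

```python
import numpy as np
import pandas as pd


def sum_pair_distances(*frames):
    """Sum Distance over unordered (idA, idB) pairs across all frames.

    Hypothetical helper for illustration; not part of pandas.
    """
    combined = pd.concat(frames, ignore_index=True)
    # Distances arrive as strings in the question, so parse them first.
    combined["Distance"] = pd.to_numeric(combined["Distance"])
    # Sort each row's id pair so (2, 1) and (1, 2) become the same key.
    combined[["idA", "idB"]] = np.sort(
        combined[["idA", "idB"]].to_numpy(), axis=1
    )
    return combined.groupby(["idA", "idB"], as_index=False)["Distance"].sum()


df1 = pd.DataFrame({"idA": ["1", "2", "3", "2"],
                    "idB": ["1", "4", "8", "1"],
                    "Distance": ["0.727273", "0.827273", "0.127273", "0.927273"]})
df2 = pd.DataFrame({"idA": ["1", "5", "2", "5"],
                    "idB": ["2", "1", "4", "7"],
                    "Distance": ["0.11", "0.1", "3.0", "0.8"]})

result = sum_pair_distances(df1, df2)
print(result)
```

Because the sort happens on the concatenated copy, the original frames keep their row order and id orientation. Since the ids are strings here, the row-wise sort is lexicographic; any consistent ordering is enough to make both orientations of a pair share one key.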
Answer 1 (score: 2)
df1.Distance=pd.to_numeric(df1.Distance)
df2.Distance=pd.to_numeric(df2.Distance)
df=pd.concat([df1.assign(key=df1.idA+df1.idB),df2.assign(key=df2.idA+df2.idB)]).\
groupby('key').agg({'Distance':'sum','idA':'first','idB':'first'})
df
Out[672]:
Distance idA idB
key
2 0.727273 1 1
3 1.037273 2 1
6 3.927273 2 4
11 0.127273 3 8
12 0.800000 5 7
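In the output above, key 6 totals 3.927273 and no row for the (5, 1) pair appears. A small standard-library sketch shows what the summed key does to this data. It assumes integer ids, as the integer keys in the output suggest, and the variable names are illustrative only:

```python
from collections import defaultdict

# (idA, idB, Distance) rows from both frames in the question, with integer ids.
rows = [
    (1, 1, 0.727273), (2, 4, 0.827273), (3, 8, 0.127273), (2, 1, 0.927273),
    (1, 2, 0.11), (5, 1, 0.1), (2, 4, 3.0), (5, 7, 0.8),
]

by_sum = defaultdict(set)            # summed key -> distinct unordered pairs it absorbs
by_sorted_pair = defaultdict(float)  # sorted pair -> total distance

for id_a, id_b, distance in rows:
    pair = tuple(sorted((id_a, id_b)))
    by_sum[id_a + id_b].add(pair)
    by_sorted_pair[pair] += distance

# Summed keys that absorbed more than one distinct pair.
collisions = {key: sorted(pairs) for key, pairs in by_sum.items() if len(pairs) > 1}
print(collisions)
print(len(by_sorted_pair))
```

Here the summed key 6 covers both (2, 4) and (1, 5), which accounts for the extra 0.1 in that row, while keying on the sorted pair keeps all six pairs separate.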
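Update: a summed key can merge unrelated pairs (here 2 + 4 and 5 + 1 both give 6), so sort each id pair row-wise and group on both columns instead: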
df1[['idA','idB']]=np.sort(df1[['idA','idB']].values)
df2[['idA','idB']]=np.sort(df2[['idA','idB']].values)
pd.concat([df1,df2]).groupby(['idA','idB'],as_index=False).Distance.sum()
Out[678]:
idA idB Distance
0 1 1 0.727273
1 1 2 1.037273
2 1 5 0.100000
3 2 4 3.827273
4 3 8 0.127273
5 5 7 0.800000