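I would like to know how to merge/combine two dataframes that have similar columns and some missing values, while still keeping all of the information. My dataframes look similar to this: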
df1
Item ID  Order ID  Name  Location
21       1         John  IL
22       2         John  LA
24       2         Ron   LA
25       3         Ron   LA
29       5         John  IL
df2
Item ID  Order ID  Name  Location  Type
20       1         John  IL        cable
22       2         Ron   LA        cable
23       2         Ron   LA        Box
26       3         Ron   LA        cable
27       N/A       N/A   IL        Box
29       5         John  IL        Box
This is what I would like my dataframe to look like:
Item ID  Order ID  Name      Location  Type
20       1         John      IL        Cable
21       4         John      IL        N/A
22       2         John/Ron  LA        Cable
23       2         Ron       LA        Box
24       2         Ron       LA        N/A
25       3         Ron       LA        N/A
26       3         Ron       LA        Cable
27       N/A       N/A       IL        Box
28       N/A       N/A       N/A       N/A
29       5         John      IL        Box
I tried doing something like this:
from functools import reduce
import pandas as pd

dataframes = [df1, df2]
merged = reduce(lambda left, right: pd.merge(left, right, on='Item ID', how='outer'), dataframes)
However, the ordering is wrong, or some information gets dropped, and the missing row (Item ID: 28) is not filled in.
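To make the problem concrete, here is a minimal sketch that rebuilds the two frames from the tables in the question and runs the same outer merge. Representing N/A as None is an assumption, and the variable names are only illustrative. It shows two of the symptoms described: the columns shared by both frames come back twice with suffixes instead of being combined, and an outer merge can only return keys that exist in at least one input, so Item ID 28 never appears.

```python
import pandas as pd

# Sample frames copied from the question; N/A is modelled as None (assumption).
df1 = pd.DataFrame({
    "Item ID": [21, 22, 24, 25, 29],
    "Order ID": [1, 2, 2, 3, 5],
    "Name": ["John", "John", "Ron", "Ron", "John"],
    "Location": ["IL", "LA", "LA", "LA", "IL"],
})
df2 = pd.DataFrame({
    "Item ID": [20, 22, 23, 26, 27, 29],
    "Order ID": [1, 2, 2, 3, None, 5],
    "Name": ["John", "Ron", "Ron", "Ron", None, "John"],
    "Location": ["IL", "LA", "LA", "LA", "IL", "IL"],
    "Type": ["cable", "cable", "Box", "cable", "Box", "Box"],
})

merged = pd.merge(df1, df2, on="Item ID", how="outer")

# Shared non-key columns are duplicated with the default _x / _y suffixes.
print(merged.columns.tolist())

# Only keys present in df1 or df2 are returned, so there is no row for 28.
print(sorted(merged["Item ID"]))
```

Combining the suffixed pairs and adding the missing key are therefore separate steps that have to happen after (or instead of) the merge.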
Answer 0 (score: 1)
This might work (the column names here use underscores, e.g. Item_ID, as in the output below):
pd.concat([df1, df2]).sort_values('Item_ID').drop_duplicates(['Item_ID'], keep='last')
   Item_ID Location  Name  Order_ID   Type
0       20       IL  John       1.0  cable
0       21       IL  John       1.0    NaN
1       22       LA   Ron       2.0  cable
2       23       LA   Ron       2.0    Box
2       24       LA   Ron       2.0    NaN
3       25       LA   Ron       3.0    NaN
3       26       LA   Ron       3.0  cable
4       27       IL   NaN       NaN    Box
5       29       IL  John       5.0    Box
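The same idea can be sketched against the question's original column names (with spaces). The sample frames are rebuilt from the question, with N/A modelled as None as an assumption. One detail worth making explicit: keep='last' only favours df2 reliably if rows with the same Item ID stay in concatenation order after sorting, so this sketch requests a stable sort. That choice is mine, not part of the answer.

```python
import pandas as pd

# Sample frames copied from the question; N/A is modelled as None (assumption).
df1 = pd.DataFrame({
    "Item ID": [21, 22, 24, 25, 29],
    "Order ID": [1, 2, 2, 3, 5],
    "Name": ["John", "John", "Ron", "Ron", "John"],
    "Location": ["IL", "LA", "LA", "LA", "IL"],
})
df2 = pd.DataFrame({
    "Item ID": [20, 22, 23, 26, 27, 29],
    "Order ID": [1, 2, 2, 3, None, 5],
    "Name": ["John", "Ron", "Ron", "Ron", None, "John"],
    "Location": ["IL", "LA", "LA", "LA", "IL", "IL"],
    "Type": ["cable", "cable", "Box", "cable", "Box", "Box"],
})

combined = (
    pd.concat([df1, df2], ignore_index=True)   # stack rows; df1 rows get Type = NaN
    .sort_values("Item ID", kind="mergesort")  # stable: ties keep df1-before-df2 order
    .drop_duplicates("Item ID", keep="last")   # for duplicated IDs, keep the df2 row
    .reset_index(drop=True)
)
print(combined)
```

Note that this keeps one whole row per Item ID rather than combining values, so for item 22 the name from df1 is discarded, and no row is created for Item ID 28.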
Answer 1 (score: 1)
If you also want to fill in the missing values, another way is to use reindex and combine_first:
l = pd.concat((df1['Item ID'], df2['Item ID']))
final = (df1.set_index('Item ID').reindex(range(l.min(), l.max() + 1))
         .combine_first(df2.set_index('Item ID')).reset_index().reindex(columns=df2.columns))
   Item ID  Order ID  Name Location   Type
0       20       1.0  John       IL  cable
1       21       1.0  John       IL    NaN
2       22       2.0   Ron       LA  cable
3       23       2.0   Ron       LA    Box
4       24       2.0   Ron       LA    NaN
5       25       3.0   Ron       LA    NaN
6       26       3.0   Ron       LA  cable
7       27       NaN   NaN       IL    Box
8       28       NaN   NaN      NaN    NaN
9       29       5.0  John       IL    Box
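Broken into steps, the reindex / combine_first approach can be sketched as follows. The sample frames are rebuilt from the question (N/A modelled as None is an assumption), and the intermediate names ids and full_range are only illustrative. combine_first keeps the caller's non-null values and only falls back to the other frame where the caller has a gap, so with this sample data item 22 keeps the name from df1.

```python
import pandas as pd

# Sample frames copied from the question; N/A is modelled as None (assumption).
df1 = pd.DataFrame({
    "Item ID": [21, 22, 24, 25, 29],
    "Order ID": [1, 2, 2, 3, 5],
    "Name": ["John", "John", "Ron", "Ron", "John"],
    "Location": ["IL", "LA", "LA", "LA", "IL"],
})
df2 = pd.DataFrame({
    "Item ID": [20, 22, 23, 26, 27, 29],
    "Order ID": [1, 2, 2, 3, None, 5],
    "Name": ["John", "Ron", "Ron", "Ron", None, "John"],
    "Location": ["IL", "LA", "LA", "LA", "IL", "IL"],
    "Type": ["cable", "cable", "Box", "cable", "Box", "Box"],
})

# Every ID between the smallest and largest one seen in either frame.
ids = pd.concat([df1["Item ID"], df2["Item ID"]])
full_range = range(int(ids.min()), int(ids.max()) + 1)

final = (
    df1.set_index("Item ID")
    .reindex(full_range)                      # adds all-NaN rows for IDs df1 lacks
    .combine_first(df2.set_index("Item ID"))  # fill df1's gaps from df2
    .reset_index()
    .reindex(columns=df2.columns)             # restore df2's column order
)
print(final)
```

Because the range is contiguous, Item ID 28 gets a row even though neither input contains it, and every column in that row is missing.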
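Answer 2 (score: 0)
I found this in another thread and, after a small change, it did what I wanted. I am also posting the function version for anyone who needs it.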
import pandas as pd

def _as_text(value):
    # Missing values become an empty string; anything else is stripped text.
    # (Checking with pd.isna avoids deleting "nan" inside real values.)
    return '' if pd.isna(value) else str(value).strip()

# Outer-merge a list of dataframes on keyCol; where the same column holds
# different values, join them with "~"
def merge_dfs(dfs):
    left = dfs[0]
    right = dfs[1]
    keyCol = 'Request ID'
    # Index.intersection / Index.union: the & and | operators no longer
    # perform set operations on an Index in recent pandas versions
    commonCols = list(left.columns.intersection(right.columns))
    finalCols = list(left.columns.union(right.columns))
    #print('Common = ' + str(commonCols) + ', Final = ' + str(finalCols))
    mergeDf = left.merge(right, on=keyCol, how='outer', suffixes=('_left', '_right'))
    # combine the common columns
    for col in commonCols:
        if col != keyCol:
            for i, row in mergeDf.iterrows():
                leftVal = _as_text(row[col + '_left'])
                rightVal = _as_text(row[col + '_right'])
                #print(leftVal + ',' + rightVal)
                if leftVal == rightVal:
                    mergeDf.loc[i, col] = leftVal
                else:
                    mergeDf.loc[i, col] = leftVal + "~" + rightVal
    # only use the finalCols
    mergeDf = mergeDf[finalCols]
    # fold each remaining dataframe into the running result
    for df in dfs[2:]:
        left = mergeDf
        right = df
        commonCols = list(left.columns.intersection(right.columns))
        finalCols = list(left.columns.union(right.columns))
        #print('Common = ' + str(commonCols) + ', Final = ' + str(finalCols))
        mergeDf = left.merge(right, on=keyCol, how='outer', suffixes=('_left', '_right'))
        # combine the common columns
        for col in commonCols:
            if col != keyCol:
                for i, row in mergeDf.iterrows():
                    leftVal = _as_text(row[col + '_left'])
                    rightVal = _as_text(row[col + '_right'])
                    #print(leftVal + ',' + rightVal)
                    # leftVal may already hold several "~"-joined values
                    leftValWords = leftVal.split('~')
                    #print(leftValWords)
                    if rightVal in leftValWords:
                        mergeDf.loc[i, col] = leftVal
                    else:
                        mergeDf.loc[i, col] = leftVal + '~' + rightVal
        # only use the finalCols
        mergeDf = mergeDf[finalCols]
    return mergeDf
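For comparison, a shorter way to get the "John/Ron" style of result asked for in the question might look like the sketch below. It is not the method from this answer: it swaps the merge plus row-by-row loop for concat plus groupby, uses "/" as the separator and 'Item ID' as the key, and the helper name join_distinct is hypothetical. The sample frames are rebuilt from the question, with N/A modelled as None as an assumption.

```python
import numpy as np
import pandas as pd

# Sample frames copied from the question; N/A is modelled as None (assumption).
df1 = pd.DataFrame({
    "Item ID": [21, 22, 24, 25, 29],
    "Order ID": [1, 2, 2, 3, 5],
    "Name": ["John", "John", "Ron", "Ron", "John"],
    "Location": ["IL", "LA", "LA", "LA", "IL"],
})
df2 = pd.DataFrame({
    "Item ID": [20, 22, 23, 26, 27, 29],
    "Order ID": [1, 2, 2, 3, None, 5],
    "Name": ["John", "Ron", "Ron", "Ron", None, "John"],
    "Location": ["IL", "LA", "LA", "LA", "IL", "IL"],
    "Type": ["cable", "cable", "Box", "cable", "Box", "Box"],
})


def join_distinct(values):
    """Collapse one group's values: drop missing, keep distinct ones in first-seen order."""
    distinct = list(dict.fromkeys(values.dropna()))
    if not distinct:
        return np.nan                 # nothing known for this item/column
    if len(distinct) == 1:
        return distinct[0]            # all sources agree: keep the value as-is
    return "/".join(str(v) for v in distinct)  # sources disagree: join them


stacked = pd.concat([df1, df2], ignore_index=True)
value_cols = [c for c in stacked.columns if c != "Item ID"]

combined = (
    stacked.groupby("Item ID")
    .agg({c: join_distinct for c in value_cols})  # one aggregation per column
    .reset_index()
)
print(combined)
```

Because grouping works on any number of stacked frames, the same call covers more than two inputs without a separate loop. Like a plain merge, it only returns IDs that occur in the inputs, so a row for Item ID 28 would still have to be added separately, for example with reindex.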