Question

我想知道如何合并/合并两个具有相似列和一些缺失值的数据框，同时仍然能够保留所有信息。我的数据框看起来与此类似：

df1

Item ID       Order ID      Name         Location
21            1             John         IL
22            2             John          LA
24            2             Ron          LA
25            3             Ron          LA
29            5             John         IL

df2

Item ID       Order ID      Name         Location    Type
20            1             John         IL          cable
22            2             Ron          LA          cable
23            2             Ron          LA          Box
26            3             Ron          LA          cable
27            N/A           N/A          IL          Box
29            5             John         IL          Box

我希望我的数据框看起来像什么

Item ID       Order ID      Name         Location     Type
20            1             John         IL           Cable
21            4             John         IL           N/A
22            2             John/Ron     LA           Cable
23            2             Ron          LA           Box
24            2             Ron          LA           N/A
25            3             Ron          LA           N/A
26            3             Ron          LA           Cable
27            N/A           N/A          IL           Box
28            N/A           N/A          N/A          N/A
29            5             John         IL           Box

我试图做类似的事情

dataframes = [df1, df2]
merged = reduce(lambda left,right: pd.merge(left,right,on='Item ID', how='outer'), dataframes)

但是排序是错误的，或者它遗漏了一些信息，并且没有填写缺失的值（项目ID：28）。

Answer 1

这可能有效

pd.concat([df1, df2]).sort_values('Item_ID').drop_duplicates(['Item_ID'], keep='last')

   Item_ID Location  Name  Order_ID   Type
0       20       IL  John       1.0  cable
0       21       IL  John       1.0    NaN
1       22       LA   Ron       2.0  cable
2       23       LA   Ron       2.0    Box
2       24       LA   Ron       2.0    NaN
3       25       LA   Ron       3.0    NaN
3       26       LA   Ron       3.0  cable
4       27       IL   NaN       NaN    Box
5       29       IL  John       5.0    Box

Answer 2

如果要填充缺失值的另一种方法是使用reindex和combine_first：

l=pd.concat((df1['Item ID'],df2['Item ID']))
final=(df1.set_index('Item ID').reindex(range(l.min(),l.max()+1))
    .combine_first(df2.set_index('Item ID')).reset_index().reindex(columns=df2.columns))

   Item ID  Order ID  Name Location   Type
0       20       1.0  John       IL  cable
1       21       1.0  John       IL    NaN
2       22       2.0   Ron       LA  cable
3       23       2.0   Ron       LA    Box
4       24       2.0   Ron       LA    NaN
5       25       3.0   Ron       LA    NaN
6       26       3.0   Ron       LA  cable
7       27       NaN   NaN       IL    Box
8       28       NaN   NaN      NaN    NaN
9       29       5.0  John       IL    Box

Answer 3

我在另一个帖子上找到了这个，并做了一点改动，就实现了我想要的。也会为需要它的人发布定义版本。

# combine the common columns
def merge_dfs(dfs):
df1 = dfs[0]
df2= dfs[1]

left= df1
right = df2

keyCol = 'Request ID'
commonCols = list(set(left.columns & right.columns))
finalCols = list(set(left.columns | right.columns))
#print('Common = ' + str(commonCols) + ', Final = ' + str(finalCols))

mergeDf = left.merge(right, on=keyCol, how='outer', suffixes=('_left', '_right'))


   # combine the common columns
for col in commonCols:
    if col != keyCol:
        for i, row in mergeDf.iterrows():
            leftVal = str(row[col + '_left']).replace('nan', "").strip()
            rightVal = str(row[col + '_right']).replace('nan', "").strip()
            #print(leftVal + ',' + rightVal)
            if leftVal == rightVal:
                mergeDf.loc[i, col] = leftVal
            else:
                mergeDf.loc[i, col] = leftVal + "~" + rightVal

# only use the finalCols
mergeDf = mergeDf[finalCols]
for df in dfs[2:]:
    df1 = mergeDf
    df2= df

    left= df1
    right = df2

    keyCol ='Request ID'
    commonCols = list(set(left.columns & right.columns))
    finalCols = list(set(left.columns | right.columns))
    #print('Common = ' + str(commonCols) + ', Final = ' + str(finalCols))

    mergeDf = left.merge(right, on=keyCol, how='outer', suffixes=('_left', '_right'))


       # combine the common columns
    for col in commonCols:
        if col != keyCol:
            for i, row in mergeDf.iterrows():
                leftVal = str(row[col + '_left']).replace('nan', "").strip()
                rightVal = str(row[col + '_right']).replace('nan', "").strip()
                #print(leftVal + ',' + rightVal)
                leftValWords = leftVal.split('~')
                #print(leftValWords)
                if rightVal in leftValWords:
                    mergeDf.loc[i, col] = leftVal
                else:
                    mergeDf.loc[i, col] = leftVal + '~' + rightVal

# only use the finalCols
    mergeDf = mergeDf[finalCols]
    mergeDf = mergeDf
return mergeDf

合并两个具有相似列的数据框

3 个答案: