Merging the same column from two pandas dataframes

Time: 2021-05-24 18:50:27

Tags: python pandas dataframe

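I want to merge the same Result column from two pandas dataframes. The Result column should not be filled with conflicting values. Both dataframes use the two columns id and sub_id together as a unique identifier.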

The first dataframe looks like this:

   id sub_id           Result
0  G1     00              
1  G1     F1  under-reporting
2  G2     N1  under-reporting             

The second dataframe looks like this:

   id sub_id           Result
0  G3     W1   over-reporting         
1  G3     00   over-reporting          
2  G4     K5              

If a record is not filled with under-reporting or over-reporting, I want to fill it with the string Pass. As a result, I want the output to look like this:

   id sub_id           Result
0  G1     00             Pass   
1  G1     F1  under-reporting
2  G2     N1  under-reporting
3  G3     W1   over-reporting          
4  G3     00   over-reporting            
5  G4     K5             Pass         

Below is the code I am currently using:

# Use a joint mask to filter reportable deals
reportable_deals = df[joint_logic_of_reportable_deals]
under_reporting_df = reportable_deals[['id', 'sub_id']].copy()

# Use a left merge to identify under-reporting deals (i.e., reportable deals not in the trade_state_df)
under_reporting_df = under_reporting_df.merge(trade_state_df, how='left', on=['id', 'sub_id'], indicator='Result')

under_reporting_df['Result'] = under_reporting_df['Result'].map({
    'both': np.nan,
    'left_only': 'under-reporting',
    'right_only': np.nan
})

#Obtain not-reportable deals using the inverse of the jointed mask
not_reportable_deals = df_data_store[~joint_logic_of_reportable_deals]
over_reporting_df = not_reportable_deals[['id', 'sub_id']].copy()

over_reporting_df['sub_id'] = over_reporting_df['sub_id'].astype(str).str.zfill(2)

# Use a left merge to identify over-reporting deals (i.e., not-reportable but exists in the trade_state_df)
over_reporting_df = over_reporting_df.merge(trade_state_df, how='left', on=['id', 'sub_id'], indicator='Result')

over_reporting_df['Result'] = (over_reporting_df['Result'] == 'both')
over_reporting_df['Result'] = np.where(over_reporting_df['Result'], 'over-reporting', np.nan)

output_df = pd.concat([under_reporting_df, over_reporting_df])
output_df = output_df.reset_index(drop=True)
header = ['id', 'sub_id', 'Result']
output_df.to_csv("Eligibility Result.csv", columns = header)

However, the problem is that after the concat, output_df now has 7 more deals than the original df. Any help is much appreciated.
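
One thing that might explain the extra rows: a left merge returns one row per matching right-hand row, so any (id, sub_id) pair that appears more than once in trade_state_df multiplies the corresponding deal. The sketch below is a minimal illustration under that assumption. The frames and the state column are hypothetical and not taken from the question's data.

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames.
deals = pd.DataFrame({"id": ["G1", "G2"], "sub_id": ["00", "N1"]})
# Assumed: (G1, 00) appears twice on the right-hand side.
trade_state = pd.DataFrame({
    "id": ["G1", "G1"],
    "sub_id": ["00", "00"],
    "state": ["new", "amended"],
})

# The duplicated key yields two output rows for the single G1/00 deal.
merged = deals.merge(trade_state, how="left", on=["id", "sub_id"], indicator="Result")
print(len(deals), len(merged))

# Reducing the right side to unique keys keeps one row per deal;
# validate= raises a MergeError if the right-hand keys are not unique.
unique_keys = trade_state[["id", "sub_id"]].drop_duplicates()
checked = deals.merge(
    unique_keys, how="left", on=["id", "sub_id"],
    indicator="Result", validate="many_to_one",
)
print(len(checked))
```

Comparing len() before and after each merge, or passing validate="many_to_one" to the original merge calls, would show whether this is what happens with the real data.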

1 Answer:

Answer 0: (score: 0)

Assuming the missing values are NaN, you can try fillna:

(df1.set_index(["id", 'sub_id'])
    .fillna(df2.set_index(["id", 'sub_id']))
    .fillna("pass")
    .reset_index())

Result

   id sub_id           Result
0  G1     00   over-reporting
1  G1     F1  under-reporting
2  G2     N1  under-reporting
3  G3     W1             pass
4  G3     00             pass
5  G4     K5   over-reporting
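
If the two frames share no (id, sub_id) pairs, as in the sample data from the question, fillna on aligned indexes has nothing to fill from the other frame. In that case stacking the frames and labelling the blanks may be closer to the desired output. A minimal sketch, assuming the blank cells are either NaN or empty strings (the question does not say which):

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question; blanks are assumed to be NaN.
df1 = pd.DataFrame({
    "id": ["G1", "G1", "G2"],
    "sub_id": ["00", "F1", "N1"],
    "Result": [np.nan, "under-reporting", "under-reporting"],
})
df2 = pd.DataFrame({
    "id": ["G3", "G3", "G4"],
    "sub_id": ["W1", "00", "K5"],
    "Result": ["over-reporting", "over-reporting", np.nan],
})

# Stack the rows and renumber the index from 0.
combined = pd.concat([df1, df2], ignore_index=True)

# Keep existing labels; anything missing or empty becomes "Pass".
result = combined["Result"]
combined["Result"] = result.where(result.notna() & result.ne(""), "Pass")
print(combined)
```

This only stacks rows, so it does not resolve conflicts. If the same (id, sub_id) pair could appear in both frames with different labels, that case would need its own rule.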