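I have been searching for a solution to this problem, and none of the answers I found seem to work, so I decided to ask for help with this specific use case. I am merging two CSVs that have different dimensions but share two identical columns. I first load the CSVs into pandas DataFrames, like so:
df_td and df_ld:
>>> df_td.head(3)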
trans_id store_num cust_id bus_date type
0 0000001 104 111111 10/5/2017 12:00:00 AM Payment
1 0000002 104 111111 10/5/2017 12:00:00 AM Payment
2 0000003 104 111111 10/5/2017 12:00:00 AM Received
>>> df_ld.head(2)
cust_id nxt_date store_num amt_received type_rec
0 111111 11/5/2017 104 10.00 NaN
1 111112 11/6/2017 104 10.00 NaN
After running this code:
merged = pd.merge(df_td, df_ld, how='inner', on=['cust_id','store_num']).fillna(0)
I get this merged DataFrame:
>>> merged.head(3)
trans_id store_num cust_id bus_date type nxt_date amt_received type_rec
0 0000001 104 111111 10/5/2017 12:00:00 AM Payment 11/5/2017 10.00 NaN
1 0000002 104 111111 10/5/2017 12:00:00 AM Payment 11/5/2017 10.00 NaN
2 0000003 104 111111 10/5/2017 12:00:00 AM Received 11/5/2017 10.00 NaN
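As you can see, the values from the df_ld columns are repeated on every row, even though cust_id 111111 appears only once in that DataFrame. If I query this result and sum amt_received, it reports 30.00 for that customer on that date instead of the correct 10.00. I have tried outer, left, and right merges, as well as the concat and join functions, but I either get the same output or something completely wrong.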
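The repetition is what a one-to-many merge does: every matching df_td row receives its own copy of the single df_ld row. Here is a minimal sketch that reproduces it. The values are copied from the samples in the question; building the frames inline instead of reading them from CSV is an assumption made only for illustration.

```python
import pandas as pd

# Hand-built stand-ins for the two CSVs (values taken from the question).
df_td = pd.DataFrame({
    "trans_id": ["0000001", "0000002", "0000003"],
    "store_num": [104, 104, 104],
    "cust_id": [111111, 111111, 111111],
    "bus_date": ["10/5/2017 12:00:00 AM"] * 3,
    "type": ["Payment", "Payment", "Received"],
})
df_ld = pd.DataFrame({
    "cust_id": [111111, 111112],
    "nxt_date": ["11/5/2017", "11/6/2017"],
    "store_num": [104, 104],
    "amt_received": [10.00, 10.00],
    "type_rec": [float("nan"), float("nan")],
})

merged = pd.merge(df_td, df_ld, how="inner", on=["cust_id", "store_num"])

# The one df_ld row for customer 111111 is copied onto all three transactions.
row_count = len(merged)
naive_total = merged["amt_received"].sum()

# Summing the source frame directly avoids the triple counting.
true_total = df_ld.loc[df_ld["cust_id"] == 111111, "amt_received"].sum()

print(row_count, naive_total, true_total)
```

Changing `how` to outer, left, or right does not help here, because the key pair (111111, 104) still matches three rows on the df_td side. If it is useful to be warned about this, `pd.merge` accepts a `validate` argument (for example `validate="one_to_one"`) that raises an error instead of silently repeating rows.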
What I want is:
trans_id store_num cust_id bus_date type nxt_date amt_received type_rec
0 0000001 104 111111 10/5/2017 12:00:00 AM Payment 11/5/2017 0 NaN
1 0000002 104 111111 10/5/2017 12:00:00 AM Payment 11/5/2017 0 NaN
2 0000003 104 111111 10/5/2017 12:00:00 AM Received 11/5/2017 10.00 NaN
Is there a way to do this with merge / join / concat? Thanks!
Answer 0 (score: 1)
After merging, couldn't you just set amt_received to 0 on every row where it does not apply?
merged = pd.merge(df_td, df_ld, how='inner', on=['cust_id','store_num'])
merged.loc[merged.type_y != 'Received','amt_received'] = 0
cust_id nxt_date store_num amt_received type_x trans_id bus_date type_y
0 111111 11/5/2017 104 0.0 1 10/5/2017 Payment
1 111111 11/5/2017 104 0.0 2 10/5/2017 Payment
2 111111 11/5/2017 104 10.0 3 10/5/2017 Received
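The type_x / type_y columns in the output above suggest the answer was tested on frames that both had a column named type. With the column names shown in the question (type in df_td, type_rec in df_ld) pandas adds no suffix, so the filter would presumably use the plain type column. A sketch under that assumption, with the frames again built inline from the question's sample values:

```python
import pandas as pd

# Hand-built stand-ins for the two CSVs (values taken from the question).
df_td = pd.DataFrame({
    "trans_id": ["0000001", "0000002", "0000003"],
    "store_num": [104, 104, 104],
    "cust_id": [111111, 111111, 111111],
    "bus_date": ["10/5/2017 12:00:00 AM"] * 3,
    "type": ["Payment", "Payment", "Received"],
})
df_ld = pd.DataFrame({
    "cust_id": [111111, 111112],
    "nxt_date": ["11/5/2017", "11/6/2017"],
    "store_num": [104, 104],
    "amt_received": [10.00, 10.00],
    "type_rec": [float("nan"), float("nan")],
})

merged = pd.merge(df_td, df_ld, how="inner", on=["cust_id", "store_num"])

# Only df_td has a column literally named "type", so it keeps its name.
# Zero the amount on every row that is not a "Received" transaction.
merged.loc[merged["type"] != "Received", "amt_received"] = 0

amounts = merged["amt_received"].tolist()
total = merged["amt_received"].sum()

print(merged[["trans_id", "type", "amt_received"]])
print(total)
```

This relies on each customer and store having exactly one Received row. If a key can have several, the amount would still be counted more than once and a different rule for choosing the row would be needed.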