Pandas: removing duplicate rows when merging two CSVs of different dimensions

Time: 2017-10-19 13:24:06

Tags: python pandas csv merge

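I have been looking for a solution to this problem, and none of the answers I found seemed to work, so I decided to ask for help with this specific use case. I am merging two CSVs that have different dimensions but share two identical columns. I first load the CSVs into pandas DataFrames, as follows: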

df_td and df_ld:

>>> df_td.head(3)
   trans_id  store_num  cust_id               bus_date          type
0   0000001        104   111111  10/5/2017 12:00:00 AM       Payment
1   0000002        104   111111  10/5/2017 12:00:00 AM       Payment 
2   0000003        104   111111  10/5/2017 12:00:00 AM       Received



>>> df_ld.head(2)
   cust_id  nxt_date  store_num   amt_received           type_rec 
0   111111  11/5/2017       104          10.00            NaN
1   111112  11/6/2017       104          10.00            NaN
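
For readers who want to reproduce the two frames, here is a minimal sketch of the loading step. The CSV text is a hypothetical stand-in built from the rows printed above (the real files are not shown in the question), and the `dtype` argument is an assumption added so that the leading zeros in trans_id survive.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the two real files.
td_csv = io.StringIO(
    "trans_id,store_num,cust_id,bus_date,type\n"
    "0000001,104,111111,10/5/2017 12:00:00 AM,Payment\n"
    "0000002,104,111111,10/5/2017 12:00:00 AM,Payment\n"
    "0000003,104,111111,10/5/2017 12:00:00 AM,Received\n"
)
ld_csv = io.StringIO(
    "cust_id,nxt_date,store_num,amt_received,type_rec\n"
    "111111,11/5/2017,104,10.00,\n"
    "111112,11/6/2017,104,10.00,\n"
)

# Reading trans_id as a string keeps the leading zeros.
df_td = pd.read_csv(td_csv, dtype={"trans_id": str})
# The empty type_rec fields are parsed as NaN.
df_ld = pd.read_csv(ld_csv)
```

With real files, the `io.StringIO` objects would simply be replaced by the file paths.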

After running this code:

merged = pd.merge(df_td, df_ld, how='inner', on=['cust_id','store_num']).fillna(0)

I get this merged data frame:

>>> merged.head(3)
   trans_id  store_num  cust_id               bus_date          type    nxt_date    amt_received    type_rec
0   0000001        104   111111  10/5/2017 12:00:00 AM       Payment    11/5/2017          10.00     NaN
1   0000002        104   111111  10/5/2017 12:00:00 AM       Payment    11/5/2017          10.00     NaN
2   0000003        104   111111  10/5/2017 12:00:00 AM       Received   11/5/2017          10.00     NaN

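As you can see, the values in the df_ld columns are duplicated, even though cust_id 111111 appears only once in that data frame. If I query the result like this and sum that column, it reports 30.00 for that customer on that date instead of the correct 10.00. I have tried outer, left, and right merges as well as the concat and join functions, but I either get the same output or something completely wrong.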
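
The duplication described above can be reproduced with a small sketch. The frames are rebuilt by hand from the rows shown in the question, with the date and type_rec columns left out for brevity. Because the merge is many-to-one, the single df_ld row for the customer is copied onto every matching df_td row.

```python
import pandas as pd

# Frames rebuilt by hand from the rows shown in the question.
df_td = pd.DataFrame({
    "trans_id": ["0000001", "0000002", "0000003"],
    "store_num": [104, 104, 104],
    "cust_id": [111111, 111111, 111111],
    "type": ["Payment", "Payment", "Received"],
})
df_ld = pd.DataFrame({
    "cust_id": [111111, 111112],
    "nxt_date": ["11/5/2017", "11/6/2017"],
    "store_num": [104, 104],
    "amt_received": [10.00, 10.00],
})

merged = pd.merge(df_td, df_ld, how="inner", on=["cust_id", "store_num"])

# The one df_ld row for customer 111111 now appears on all three transactions,
# so summing amt_received counts the same 10.00 three times.
total = merged.loc[merged["cust_id"] == 111111, "amt_received"].sum()
print(len(merged), total)  # 3 30.0
```

Changing `how` does not help here, since every join type still matches that one df_ld row to all three transactions.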

What I want is:

   trans_id  store_num  cust_id               bus_date          type    nxt_date    amt_received    type_rec
0   0000001        104   111111  10/5/2017 12:00:00 AM       Payment    11/5/2017              0     NaN
1   0000002        104   111111  10/5/2017 12:00:00 AM       Payment    11/5/2017              0     NaN
2   0000003        104   111111  10/5/2017 12:00:00 AM       Received   11/5/2017          10.00     NaN

Is there a way to do this using merge / join / concat? Thanks!

1 Answer:

Answer 0: (score: 1)

After the merge, couldn't you simply set amt_received to 0 on every row where it should not apply?

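# df_ld is the left frame here, so its `type` column becomes type_x and df_td's becomes type_y
# (this assumes df_ld's last column is named `type`, as in the output below)
merged = pd.merge(df_ld, df_td, how='inner', on=['cust_id','store_num'])
merged.loc[merged.type_y != 'Received','amt_received'] = 0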

    cust_id nxt_date    store_num   amt_received    type_x  trans_id    bus_date    type_y
0   111111  11/5/2017   104         0.0                     1       10/5/2017   Payment
1   111111  11/5/2017   104         0.0                     2      10/5/2017    Payment
2   111111  11/5/2017   104         10.0                    3      10/5/2017    Received
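
The same idea can be sketched with the column names exactly as they appear in the question. There df_ld has type_rec rather than type, so the merge adds no _x/_y suffixes and the transaction type stays in the `type` column. The frames are rebuilt by hand from the question's rows; this is one possible way to apply the approach, not necessarily what the answerer ran.

```python
import pandas as pd

# Frames rebuilt by hand from the rows shown in the question.
df_td = pd.DataFrame({
    "trans_id": ["0000001", "0000002", "0000003"],
    "store_num": [104, 104, 104],
    "cust_id": [111111, 111111, 111111],
    "bus_date": ["10/5/2017 12:00:00 AM"] * 3,
    "type": ["Payment", "Payment", "Received"],
})
df_ld = pd.DataFrame({
    "cust_id": [111111, 111112],
    "nxt_date": ["11/5/2017", "11/6/2017"],
    "store_num": [104, 104],
    "amt_received": [10.00, 10.00],
    "type_rec": [float("nan"), float("nan")],
})

merged = pd.merge(df_td, df_ld, how="inner", on=["cust_id", "store_num"])

# Zero out the amount on every row that is not a 'Received' transaction.
merged.loc[merged["type"] != "Received", "amt_received"] = 0

print(merged["amt_received"].tolist())  # [0.0, 0.0, 10.0]
```

This relies on each customer and store pair having exactly one 'Received' transaction. If there were several, the amount would still be counted once per 'Received' row.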