Python: merge two DataFrames and group by

Time: 2018-11-27 08:18:23

Tags: python pandas dataframe merge pandas-groupby

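I need to arrive at the expected df3 (shown below) by merging df and df1, and I also need the following statistics.

Note for the merge: if there is no value in "Desc1", the value should be taken from "Desc2".

  1. A crosstab of card name by category showing the percentage of the amount spent in each category, i.e. (sum of Amount per category) / (sum of Amount per card name).
  2. The top 2 categories for each card name, based on amount spent (all of this grouped by card). Could you help? Also, can you suggest any further statistics that could be inferred from df3?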

I have two DataFrames, as follows:

df = pd.DataFrame({"Customer_no": ['1', '1', '1', '2', '2', '6', '7','8','9','10'],
      "Card_no": ['111', '222', '333', '444', '555', '666', '777','888','999','000'],
      "Card_name":['AAA','AAA','BBB','CCC','AAA','DDD','EEE','BBB','CCC','CCC'],
      "Group_code":['123','123','456','678','123','434','678','365','678','987'],
      "Amount":['100','240','450','212','432','123','543','567','232','453']})

The second DataFrame:

df1 = pd.DataFrame({"Group_code": ['123', '123','456', '678','678', '434', '987','421'],
                 "Desc1": ['Electrical', 'Electrical','Hardware', 'House', 'House', 'Car','','Toy'],
                "Desc2":['Electricals111','Electricals123','Hardware112','House232','House112',
                        'Car','Bike','Toy']})

The expected DataFrame:

df3 = pd.DataFrame({"Customer_no": ['1', '1', '1', '2', '2', '6', '7','8','9','10'],
      "Card_no": ['111', '222', '333', '444', '555', '666', '777','888','999','000'],
      "Card_name":['AAA','AAA','BBB','CCC','AAA','DDD','EEE','BBB','CCC','CCC'],
      "Group_code":['123','123','456','678','123','434','678','365','678','987'],
      "Amount":['100','240','450','212','432','123','543','567','232','453'],
      "Category" :['Electrical','Electrical','Hardware','House','Electrical','Car','House','','House','Bike']})

2 answers:

Answer 0 (score: 0)

You can do a left join first and then use where to combine the two description columns:

import numpy as np

df3 = df.merge(df1, how='left', on='Group_code')  # left join on the shared key
df3 = df3.rename(columns={"Desc1": "Category"})
df3 = df3.replace("", np.nan)  # treat empty strings as missing
# if Category is NaN, take the value from Desc2 instead
df3["Category"] = df3["Category"].where(df3["Category"].notna(), df3["Desc2"])
# drop Desc2, then drop the duplicate rows created by repeated Group_codes in df1
df3 = df3.drop("Desc2", axis=1).drop_duplicates()

   Customer_no Card_no Card_name Group_code Amount    Category
0            1     111       AAA        123    100  Electrical
2            1     222       AAA        123    240  Electrical
4            1     333       BBB        456    450    Hardware
5            2     444       CCC        678    212       House
7            2     555       AAA        123    432  Electrical
9            6     666       DDD        434    123         Car
10           7     777       EEE        678    543       House
12           8     888       BBB        365    567         NaN
13           9     999       CCC        678    232       House
15          10     000       CCC        987    453        Bike
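
The question also asks for the share of spending per category within each card name, which neither answer covers. A minimal sketch of one way to get it is below. It assumes the merged frame looks like the expected df3 from the question, that Amount must first be converted from strings to numbers, and that empty categories should be shown under an "Unknown" label (that label is my own choice, not part of the question).

```python
import pandas as pd

# Expected df3 from the question (only the columns needed for this statistic).
df3 = pd.DataFrame({
    "Card_name": ['AAA', 'AAA', 'BBB', 'CCC', 'AAA', 'DDD', 'EEE', 'BBB', 'CCC', 'CCC'],
    "Amount": ['100', '240', '450', '212', '432', '123', '543', '567', '232', '453'],
    "Category": ['Electrical', 'Electrical', 'Hardware', 'House', 'Electrical',
                 'Car', 'House', '', 'House', 'Bike'],
})

# Amount is stored as strings in the question, so convert it before summing.
df3["Amount"] = pd.to_numeric(df3["Amount"])
# Assumption: label empty categories so they stay visible in the table.
df3["Category"] = df3["Category"].replace("", "Unknown")

# Sum of Amount per (Card_name, Category), laid out as a card-by-category table.
totals = df3.groupby(["Card_name", "Category"])["Amount"].sum().unstack(fill_value=0)

# Divide each row by that card's total to get percentages per card name.
share = totals.div(totals.sum(axis=1), axis=0) * 100
print(share.round(1))
```

Each row of share sums to 100, so a card with a single category (AAA here) shows 100 in that column and 0 elsewhere.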

Answer 1 (score: 0)

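# Build one Category per Group_code: Desc1, falling back to Desc2 when Desc1 is empty
cats = df1.assign(Category=df1['Desc1'].where(df1['Desc1'] != '', df1['Desc2']))
cats = cats[['Group_code', 'Category']].drop_duplicates()
df4 = pd.merge(df, cats, how='left', on=['Group_code'])
df4 = df4[['Amount','Card_name','Card_no','Category','Customer_no','Group_code']]  # reorder columns
df4 = df4.fillna({'Category': ''})  # unmatched Group_code (e.g. 365) gets an empty Category
df4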
  Amount Card_name Card_no    Category Customer_no Group_code
0    100       AAA     111  Electrical           1        123
1    240       AAA     222  Electrical           1        123
2    450       BBB     333    Hardware           1        456
3    212       CCC     444       House           2        678
4    432       AAA     555  Electrical           2        123
5    123       DDD     666         Car           6        434
6    543       EEE     777       House           7        678
7    567       BBB     888                       8        365
8    232       CCC     999       House           9        678
9    453       CCC     000        Bike          10        987
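
For the second statistic in the question (the top 2 categories per card name by amount spent), one possible approach is to sum per card and category, sort, and keep the first two rows of each card. This is only a sketch: it rebuilds the merged result inline so it runs on its own, and the names per_cat and top2 are illustrative.

```python
import pandas as pd

# Merged result as shown above (only the columns needed for this statistic).
df4 = pd.DataFrame({
    "Card_name": ['AAA', 'AAA', 'BBB', 'CCC', 'AAA', 'DDD', 'EEE', 'BBB', 'CCC', 'CCC'],
    "Amount": ['100', '240', '450', '212', '432', '123', '543', '567', '232', '453'],
    "Category": ['Electrical', 'Electrical', 'Hardware', 'House', 'Electrical',
                 'Car', 'House', '', 'House', 'Bike'],
})

# Amount is stored as strings in the question, so convert it before summing.
df4["Amount"] = pd.to_numeric(df4["Amount"])

# Total spent per (Card_name, Category).
per_cat = df4.groupby(["Card_name", "Category"], as_index=False)["Amount"].sum()

# Sort so the largest categories come first within each card, then keep two per card.
top2 = (per_cat.sort_values(["Card_name", "Amount"], ascending=[True, False])
               .groupby("Card_name")
               .head(2)
               .reset_index(drop=True))
print(top2)
```

Cards with fewer than two categories simply return the rows they have. Note that the empty category for card BBB is kept as its own group; filter it out first if it should not be ranked.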