Complex groupby on a DataFrame with operations and new derived columns

Time: 2019-09-11 22:38:07

Tags: python-3.x pandas pandas-groupby

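I have a problem and have not been able to find a good answer that I can apply. It seems to be more complicated than I expected:

This is my current DataFrame, df =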

[customerid, visit_number, date,        purchase_amount]
[1,          38,           01-01-2019,  40             ]
[1,          39,           01-03-2019,  20             ]
[2,          10,           01-02-2019,  60             ]
[2,          14,           01-05-2019,  0              ]
[3,          10,           01-01-2019,  5              ]

What I am looking for is a summary of this table with one row per customer, plus a few derived columns computed from the original data, like this:

df_new =

[customerid, visits,      days,              purchase_amount]
[1,          2,           3,                 60             ]
[2,          5,           4,                 60             ]
[3,          1,           1,                 5              ]

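Note that if a customer has no other date or visit to compare against, these metrics will always be 1 (see customerid = 3).

As I said, I have been looking around for a few days but could not find much help. I hope someone can point me in the right direction. Many thanks.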

1 Answer:

Answer 0: (score: 0)

You can use groupby.agg:

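import pandas as pd
# Parse the date strings so that they can be subtracted from each other
df['date']=pd.to_datetime(df['date'])
g=df.groupby('customerid')
# Use customerid as the index
df.index=df['customerid']
# Total purchases per customer, plus row-to-row differences in visit_number and date
df_new=g.agg({'purchase_amount':'sum','visit_number':'diff','date':'diff'})
# NaT sorts last, so drop_duplicates keeps each customer's non-null difference when one exists
df_new=df_new.reset_index().sort_values('date').drop_duplicates('customerid').reset_index(drop=True)
# Add 1 so that both ends of each span are counted
df_new['visit_number']=df_new['visit_number']+1
df_new['date']=df_new['date']+pd.Timedelta('1 days')
df_new=df_new.rename(columns={'visit_number':'visits','date':'days'}).reindex(columns=['customerid','visits','days','purchase_amount'])
# Customers with a single row have nothing to compare against: default both metrics to 1
df_new['visits']=df_new['visits'].fillna(1)
df_new['days']=df_new['days'].fillna(pd.Timedelta('1 days'))
print(df_new)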


     customerid  visits   days  purchase_amount
0           1     2.0   3 days               60
1           2     5.0   4 days               60
2           3     1.0   1 days                5

Alternative solution:

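import pandas as pd
# Parse the date strings so that they can be subtracted from each other
df['date']=pd.to_datetime(df['date'])
g=df.groupby('customerid')
# Use customerid as the index
df.index=df['customerid']
# Row-to-row differences in visit_number and date within each customer
df2=g.agg({'visit_number':'diff','date':'diff'})
# Drop the rows that have no previous row to compare against
df2=df2.loc[df2['visit_number'].notnull()]
# Add 1 so that both ends of each span are counted
df2['visit_number']=df2['visit_number']+1
df2['date']=df2['date']+pd.Timedelta('1 days')
# Total purchases per customer, computed separately and joined on customerid
df3=g.agg({'purchase_amount':'sum'})
df_new=pd.concat([df2,df3],sort=False,axis=1).rename(columns={'visit_number':'visits','date':'days'}).reset_index()
# Customers with a single row have nothing to compare against: default both metrics to 1
df_new['visits']=df_new['visits'].fillna(1)
df_new['days']=df_new['days'].fillna(pd.Timedelta('1 days'))
print(df_new)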
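
Both snippets above pass the non-reducing 'diff' to agg, and how that behaves may depend on the pandas version. As a hedged sketch only, the same table can also be built with reducing aggregations alone. It assumes that visits and days are inclusive spans (last minus first, plus 1), which is inferred from the expected output in the question rather than stated there, and that the dates are month-day-year. The intermediate column names (first_visit, last_visit, first_date, last_date) are made up for this illustration.

```python
import pandas as pd

# Sample data copied from the question; dates are assumed to be month-day-year.
df = pd.DataFrame({
    "customerid": [1, 1, 2, 2, 3],
    "visit_number": [38, 39, 10, 14, 10],
    "date": ["01-01-2019", "01-03-2019", "01-02-2019", "01-05-2019", "01-01-2019"],
    "purchase_amount": [40, 20, 60, 0, 5],
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")

# One reducing aggregation per intermediate column (named aggregation, pandas >= 0.25).
summary = df.groupby("customerid").agg(
    first_visit=("visit_number", "min"),
    last_visit=("visit_number", "max"),
    first_date=("date", "min"),
    last_date=("date", "max"),
    purchase_amount=("purchase_amount", "sum"),
)

# Assumption: inclusive spans, so a customer with a single row gets 1 for both metrics.
summary["visits"] = summary["last_visit"] - summary["first_visit"] + 1
summary["days"] = (summary["last_date"] - summary["first_date"]).dt.days + 1

df_new = summary[["visits", "days", "purchase_amount"]].reset_index()
print(df_new)
```

On the sample data this gives visits 2, 5, 1 and days 3, 4, 1 as plain integers, so no fillna step is needed for customerid = 3.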