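I have a problem that I haven't been able to find a good, applicable answer for. It seems more complicated than I expected:
This is my current dataframe, df =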
[customerid, visit_number, date, purchase_amount]
[1, 38, 01-01-2019, 40 ]
[1, 39, 01-03-2019, 20 ]
[2, 10, 01-02-2019, 60 ]
[2, 14, 01-05-2019, 0 ]
[3, 10, 01-01-2019, 5 ]
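For anyone who wants to reproduce this, the table can be built roughly as follows. This is only a sketch: it assumes the dates are MM-DD-YYYY strings, which is what the expected `days` values further down imply.

```python
import pandas as pd

# Sample rows copied from the table above.
df = pd.DataFrame({
    "customerid": [1, 1, 2, 2, 3],
    "visit_number": [38, 39, 10, 14, 10],
    "date": ["01-01-2019", "01-03-2019", "01-02-2019", "01-05-2019", "01-01-2019"],
    "purchase_amount": [40, 20, 60, 0, 5],
})

# Assumption: month comes first (01-03-2019 is January 3rd).
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")
print(df.dtypes)
```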
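What I'm looking for is to aggregate this table so that I end up with one row per customer, along with other columns derived from the original data, like this: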
df_new =
[customerid, visits, days, purchase_amount]
[1, 2, 3, 60 ]
[2, 5, 4, 60 ]
[3, 1, 1, 5 ]
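Note that if a user has no other date or visit to compare against, these metrics will always be 1 (see customerid = 3).
As I said, I've been looking around for a few days but couldn't find much help. I hope someone can point me in the right direction. Thank you very much.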
Answer 0: (score: 0)
You can use groupby.agg:
import pandas as pd
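# Parse the date strings so they can be subtracted
df['date'] = pd.to_datetime(df['date'])
g = df.groupby('customerid')
# Index by customerid so the row-wise diff results stay labelled by customer
df.index = df['customerid']
df_new = g.agg({'purchase_amount': 'sum', 'visit_number': 'diff', 'date': 'diff'})
# Sorting puts the null diffs last, so drop_duplicates keeps a non-null row per customer when one exists
df_new = df_new.reset_index().sort_values('date').drop_duplicates('customerid').reset_index(drop=True)
# Make both spans inclusive
df_new['visit_number'] = df_new['visit_number'] + 1
df_new['date'] = df_new['date'] + pd.Timedelta('1 days')
df_new = df_new.rename(columns={'visit_number': 'visits', 'date': 'days'}).reindex(columns=['customerid', 'visits', 'days', 'purchase_amount'])
# Customers with a single row have nothing to compare against, so default to 1
df_new['visits'] = df_new['visits'].fillna(1)
df_new['days'] = df_new['days'].fillna(pd.Timedelta('1 days'))
print(df_new)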
   customerid  visits   days  purchase_amount
0           1     2.0 3 days               60
1           2     5.0 4 days               60
2           3     1.0 1 days                5
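The same result can probably be reached without `diff` by reducing each customer to its first and last visit and date. The following is a sketch under assumptions, not the code above: it relies on named aggregation (available in pandas 0.25 and later), reads the dates as MM-DD-YYYY, and the intermediate column names (`first_visit`, `last_visit`, `first_date`, `last_date`) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customerid": [1, 1, 2, 2, 3],
    "visit_number": [38, 39, 10, 14, 10],
    "date": ["01-01-2019", "01-03-2019", "01-02-2019", "01-05-2019", "01-01-2019"],
    "purchase_amount": [40, 20, 60, 0, 5],
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")

# One row per customer: first/last visit number, first/last date, total spend.
spans = df.groupby("customerid").agg(
    first_visit=("visit_number", "min"),
    last_visit=("visit_number", "max"),
    first_date=("date", "min"),
    last_date=("date", "max"),
    purchase_amount=("purchase_amount", "sum"),
)

# Inclusive spans: a customer with a single row gets 1 for both metrics.
spans["visits"] = spans["last_visit"] - spans["first_visit"] + 1
spans["days"] = (spans["last_date"] - spans["first_date"]).dt.days + 1

df_new = spans[["visits", "days", "purchase_amount"]].reset_index()
print(df_new)
```

On the sample data this gives visits of 2, 5, 1 and days of 3, 4, 1, with integer columns rather than floats and Timedeltas, and no fillna step is needed for single-row customers.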
Alternative solution:
import pandas as pd
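df['date'] = pd.to_datetime(df['date'])
g = df.groupby('customerid')
df.index = df['customerid']
# Row-wise differences within each customer; the first row of each group is null
df2 = g.agg({'visit_number': 'diff', 'date': 'diff'})
# Drop the null rows, which removes single-row customers for now
df2 = df2.loc[df2['visit_number'].notnull()]
# Make both spans inclusive
df2['visit_number'] = df2['visit_number'] + 1
df2['date'] = df2['date'] + pd.Timedelta('1 days')
# Total spend per customer
df3 = g.agg({'purchase_amount': 'sum'})
# Align on customerid; customers missing from df2 come out null here
df_new = pd.concat([df2, df3], sort=False, axis=1).rename(columns={'visit_number': 'visits', 'date': 'days'}).reset_index()
# Customers with a single row default to 1 visit and 1 day
df_new['visits'] = df_new['visits'].fillna(1)
df_new['days'] = df_new['days'].fillna(pd.Timedelta('1 days'))
print(df_new)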
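To see why both versions need the notnull/fillna handling, it may help to look at what a grouped `diff` returns on its own. The sketch below rebuilds the sample data (dates assumed to be MM-DD-YYYY) and then sums the gaps per customer, which is one possible way to avoid the fillna step; the names `visit_gap` and `day_gap` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "customerid": [1, 1, 2, 2, 3],
    "visit_number": [38, 39, 10, 14, 10],
    "date": pd.to_datetime(
        ["01-01-2019", "01-03-2019", "01-02-2019", "01-05-2019", "01-01-2019"],
        format="%m-%d-%Y",
    ),
    "purchase_amount": [40, 20, 60, 0, 5],
})

# diff is row-wise within each group, so it depends on row order: sort first.
df = df.sort_values(["customerid", "date"])

# The first row of every customer is NaN, so a customer with a single row
# (customerid 3) has no non-null value at all.
visit_gap = df.groupby("customerid")["visit_number"].diff()
day_gap = df.groupby("customerid")["date"].diff().dt.days
print(visit_gap.tolist())  # [nan, 1.0, nan, 4.0, nan]

# Summing the gaps per customer skips NaN, so single-row customers sum to 0
# and the inclusive "+ 1" yields 1 without a separate fillna.
visits = visit_gap.groupby(df["customerid"]).sum().astype(int) + 1
days = day_gap.groupby(df["customerid"]).sum().astype(int) + 1
print(visits.tolist(), days.tolist())
```

Because consecutive gaps add up to last minus first, this should also hold for customers with more than two rows, whereas keeping a single diff row per customer only captures one consecutive gap, as far as I can tell.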