I have a table (a pandas DataFrame) containing order numbers:
>>> ord = pd.DataFrame([[241147,'01.01.2016'], [241148,'01.01.2016']], columns=['order_id','created'])
>>> ord
   order_id     created
0    241147  01.01.2016
1    241148  01.01.2016
There is also a history of order status changes:
>>> ord_status_history[['ord_id','osh_id','osh_id_created','osh_status_id','osh_status_reason']]
   ord_id  osh_id  osh_id_created  osh_status_id  osh_status_reason
0  241147  124632      01.01.2016              1               None
1  241147  124682      02.01.2016              2               None
2  241147  124719      03.01.2016             10               None
7  241148  124633      01.01.2016              1               None
8  241148  126181      06.01.2016              5        Test_reason
I want to add to the orders table each order's last status and its second-to-last status (the order of the statuses is determined by the 'osh_id_created' field):
   order_id     created  Last_status_id  Last_status_date  Prev_status_id  Prev_status_date       reason
0    241147  01.01.2016              10        03.01.2016               2        02.01.2016          NaN
1    241148  01.01.2016               5        06.01.2016               1        01.01.2016  Test_reason
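But I don't understand how to do this with np.where or a loc condition, because ord_status_history has several rows per order and I need to pick only one of them for each order.
I tried the following (but it is really bad):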
for i in range(ord_stat['order_id'].count()-1):
    if (ord_stat.loc[i,'order_id']==ord_stat.loc[i+1,'order_id']):
        if (ord_stat.loc[i,'osh_id_created']<=ord_stat.loc[i+1,'osh_id_created']):
            if (ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=='NAN'):
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord_stat.loc[i,'osh_id_created']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i+1,'osh_id_created']
            else:
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i+1,'osh_id_created']
        else:
            if (ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=='NAN'):
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord_stat.loc[i+1,'osh_id_created']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i,'osh_id_created']
            else:
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i,'osh_id_created']
I have read about nlargest, but I don't understand how to get the status_id when I apply nlargest to 'osh_id_created':
ord_stat.groupby('order_id')['osh_id_created'].nlargest(2)
Answer 0 (score: 1)
Suppose we have the following DataFrames:
In [291]: ord
Out[291]:
   order_id     created
0    241147  2016-01-01
1    241148  2016-01-01
In [292]: hst
Out[292]:
   ord_id  osh_id  osh_id_created  osh_status_id  osh_status_reason
0  241147  124632      2016-01-01              1               None
1  241147  124682      2016-02-01              2               None
2  241147  124719      2016-03-01             10               None
7  241148  124633      2016-01-01              1               None
8  241148  126181      2016-06-01              5        Test_reason
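To reproduce the two frames above, something like the following should work. It is only a sketch: the values are copied from the output shown, and it assumes the date columns have already been parsed to datetime64 (the ISO strings are taken as displayed, not re-derived from the question's dotted dates).

```python
import pandas as pd

# Orders table. The name `ord` matches the one used in this answer;
# note that it shadows Python's built-in ord() function.
ord = pd.DataFrame(
    {
        "order_id": [241147, 241148],
        "created": pd.to_datetime(["2016-01-01", "2016-01-01"]),
    }
)

# Status history, keeping the index labels (0, 1, 2, 7, 8) shown above.
hst = pd.DataFrame(
    {
        "ord_id": [241147, 241147, 241147, 241148, 241148],
        "osh_id": [124632, 124682, 124719, 124633, 126181],
        "osh_id_created": pd.to_datetime(
            ["2016-01-01", "2016-02-01", "2016-03-01", "2016-01-01", "2016-06-01"]
        ),
        "osh_status_id": [1, 2, 10, 1, 5],
        "osh_status_reason": [None, None, None, None, "Test_reason"],
    },
    index=[0, 1, 2, 7, 8],
)

print(ord)
print(hst)
```

Having real datetime values matters for the next step, because the history is sorted by 'osh_id_created' before taking the last rows; sorting dotted date strings would compare them as text.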
We can aggregate it as follows (using named aggregation, available in pandas 0.25 and later):
In [293]: prev = lambda x: x.shift().iloc[-1]  # value just before the last one in each group
   ...:
In [294]: x = (hst.sort_values('osh_id_created')
   ...:         .groupby('ord_id')
   ...:         .agg(Last_status_id=('osh_status_id', 'last'),
   ...:              Prev_status_id=('osh_status_id', prev),
   ...:              Last_status_date=('osh_id_created', 'last'),
   ...:              Prev_status_date=('osh_id_created', prev))
   ...:      )
   ...:
which results in:
In [295]: x
Out[295]:
        Last_status_id  Prev_status_id  Last_status_date  Prev_status_date
ord_id
241147              10               2        2016-03-01        2016-02-01
241148               5               1        2016-06-01        2016-01-01
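The question also asks for a reason column, which the result above does not include. A possible extension is sketched below, assuming pandas 0.25 or later for named aggregation. The helper name `second_to_last` is made up for this illustration, and taking the reason from the last row is my reading of the desired output, not something the question states explicitly.

```python
import pandas as pd

hst = pd.DataFrame(
    {
        "ord_id": [241147, 241147, 241147, 241148, 241148],
        "osh_id_created": pd.to_datetime(
            ["2016-01-01", "2016-02-01", "2016-03-01", "2016-01-01", "2016-06-01"]
        ),
        "osh_status_id": [1, 2, 10, 1, 5],
        "osh_status_reason": [None, None, None, None, "Test_reason"],
    }
)


def second_to_last(s):
    """Hypothetical helper: the value before the last one, or None if the group has one row."""
    return s.iloc[-2] if len(s) > 1 else None


x = (
    hst.sort_values("osh_id_created")  # so that "last" means "most recent"
    .groupby("ord_id")
    .agg(
        Last_status_id=("osh_status_id", "last"),
        Last_status_date=("osh_id_created", "last"),
        Prev_status_id=("osh_status_id", second_to_last),
        Prev_status_date=("osh_id_created", second_to_last),
        # 'last' skips missing values, so take the final row explicitly
        # to get the reason that belongs to the last status.
        reason=("osh_status_reason", lambda s: s.iloc[-1]),
    )
)

print(x)
```

The explicit `s.iloc[-1]` for the reason is a deliberate choice: the built-in 'last' aggregation returns the last non-null entry, which could belong to an earlier status.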
Now we can join it back to the original ord DataFrame:
In [296]: ord.set_index('order_id').join(x).reset_index()
Out[296]:
   order_id     created  Last_status_id  Prev_status_id  Last_status_date  Prev_status_date
0    241147  2016-01-01              10               2        2016-03-01        2016-02-01
1    241148  2016-01-01               5               1        2016-06-01        2016-01-01
or by using the merge() method:
In [297]: pd.merge(ord, x, left_on='order_id', right_index=True)
Out[297]:
   order_id     created  Last_status_id  Prev_status_id  Last_status_date  Prev_status_date
0    241147  2016-01-01              10               2        2016-03-01        2016-02-01
1    241148  2016-01-01               5               1        2016-06-01        2016-01-01
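The two variants give the same result here because every order has history rows. If some orders had none, they would differ: join() keeps all rows of the left frame by default, while merge() defaults to an inner join. The sketch below illustrates this with an invented third order (241149) and a cut-down version of x; both are assumptions made only for the demonstration.

```python
import pandas as pd

# 241149 is a made-up order with no status history.
ord = pd.DataFrame(
    {
        "order_id": [241147, 241148, 241149],
        "created": pd.to_datetime(["2016-01-01"] * 3),
    }
)

# Reduced stand-in for the aggregated frame x, indexed by ord_id.
x = pd.DataFrame(
    {"Last_status_id": [10, 5], "Prev_status_id": [2, 1]},
    index=pd.Index([241147, 241148], name="ord_id"),
)

# Default how="inner": the order without history is dropped.
inner = pd.merge(ord, x, left_on="order_id", right_index=True)

# how="left": every order is kept, missing statuses become NaN.
left = pd.merge(ord, x, left_on="order_id", right_index=True, how="left")

# DataFrame.join is a left join by default.
joined = ord.set_index("order_id").join(x).reset_index()

print(len(inner), len(left), len(joined))  # prints: 2 3 3
```

Passing how='left' to merge() is probably the safer default when the goal is to annotate the orders table without losing rows.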