Pandas: complex condition involving the last and previous values

Time: 2017-03-17 13:48:04

Tags: python pandas numpy conditional-statements

I have a table (pandas DataFrame) of order numbers:

>>> ord = pd.DataFrame([[241147,'01.01.2016'], [241148,'01.01.2016']], columns=['order_id','created'])
>>> ord
   order_id     created
0    241147  01.01.2016
1    241148  01.01.2016

and a history of order status changes:

>>> ord_status_history[['ord_id','osh_id','osh_id_created','osh_status_id','osh_status_reason']]

   ord_id  osh_id osh_id_created  osh_status_id osh_status_reason
0  241147  124632     01.01.2016              1              None
1  241147  124682     02.01.2016              2              None
2  241147  124719     03.01.2016             10              None
7  241148  124633     01.01.2016              1              None
8  241148  126181     06.01.2016              5       Test_reason

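I want to add to the orders table the last status of each order and its second-to-last status (the ordering of statuses is determined by the field 'osh_id_created'):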

   order_id     created  Last_status_id Last_status_date  Prev_status_id Prev_status_date       reason
0    241147  01.01.2016              10       03.01.2016               9       02.01.2016          NaN
1    241148  01.01.2016               5       06.01.2016               1       01.01.2016  Test Reason

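But I don't understand how to do this with np.where or loc conditions, because one order has several rows in ord_status_history, while I need to select only one for each order.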

I tried the following (but it is very bad):

for i in range(ord_stat['order_id'].count()-1):
    if (ord_stat.loc[i,'order_id']==ord_stat.loc[i+1,'order_id']):
        if (ord_stat.loc[i,'osh_id_created']<=ord_stat.loc[i+1,'osh_id_created']):
            if (ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=='NAN'):
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord_stat.loc[i,'osh_id_created']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i+1,'osh_id_created']
            else: 
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i+1,'osh_id_created']
        else:
            if (ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=='NAN'):
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord_stat.loc[i+1,'osh_id_created']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i,'osh_id_created']
            else: 
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i,'osh_id_created']

I have read about nlargest, but I don't understand how to get the status_id if I apply nlargest to 'osh_id_created':

ord_stat.groupby('order_id')['osh_id_created'].nlargest(2)

1 Answer:

Answer 0 (score: 1)

Assume we have the following DataFrames:

In [291]: ord
Out[291]:
   order_id    created
0    241147 2016-01-01
1    241148 2016-01-01

In [292]: hst
Out[292]:
   ord_id  osh_id osh_id_created  osh_status_id osh_status_reason
0  241147  124632     2016-01-01              1              None
1  241147  124682     2016-02-01              2              None
2  241147  124719     2016-03-01             10              None
7  241148  124633     2016-01-01              1              None
8  241148  126181     2016-06-01              5       Test_reason
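
For reference, frames like the ones above could be built from the question's data with the sketch below. It assumes the dates in the question are day-first (dd.mm.yyyy), which the question does not state; under that assumption '02.01.2016' becomes 2 January, whereas the output above shows the same strings read month-first. The relative order of the status rows is the same either way. The name `ord_df` is a hypothetical replacement for `ord`, chosen only to avoid shadowing Python's built-in `ord()`.

```python
import pandas as pd

# Orders table from the question (named ord_df here instead of ord).
ord_df = pd.DataFrame(
    [[241147, '01.01.2016'], [241148, '01.01.2016']],
    columns=['order_id', 'created'])

# Status history from the question, keeping the original row labels.
hst = pd.DataFrame(
    [[241147, 124632, '01.01.2016', 1, None],
     [241147, 124682, '02.01.2016', 2, None],
     [241147, 124719, '03.01.2016', 10, None],
     [241148, 124633, '01.01.2016', 1, None],
     [241148, 126181, '06.01.2016', 5, 'Test_reason']],
    columns=['ord_id', 'osh_id', 'osh_id_created',
             'osh_status_id', 'osh_status_reason'],
    index=[0, 1, 2, 7, 8])

# Assumption: dates are day-first. An explicit format avoids ambiguity.
ord_df['created'] = pd.to_datetime(ord_df['created'], format='%d.%m.%Y')
hst['osh_id_created'] = pd.to_datetime(hst['osh_id_created'], format='%d.%m.%Y')
```

Converting the strings to real datetimes matters because sorting 'dd.mm.yyyy' strings lexicographically does not, in general, give chronological order.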

We can aggregate it as follows:

In [293]: prev = lambda s: s.shift().iloc[-1]  # value before the last one in each group

In [294]: x = (hst.sort_values('osh_id_created')
     ...:         .groupby('ord_id')
     ...:         # named aggregation (pandas >= 0.25): new_column=(source_column, function)
     ...:         .agg(Last_status_id=('osh_status_id', 'last'),
     ...:              Prev_status_id=('osh_status_id', prev),
     ...:              Last_status_date=('osh_id_created', 'last'),
     ...:              Prev_status_date=('osh_id_created', prev))
     ...: )
     ...:

which results in:

In [295]: x
Out[295]:
        Last_status_id  Prev_status_id Last_status_date Prev_status_date
ord_id
241147              10               2       2016-03-01       2016-02-01
241148               5               1       2016-06-01       2016-01-01
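
The question also asks for a reason column and mentions not knowing how to keep the status id when picking the latest dates. One possible alternative, given here only as a sketch and not as part of the original answer, is to number the rows of each order from the end with `cumcount(ascending=False)` and then select whole rows, so every column of the last and second-to-last row stays available. The frame is rebuilt inline so the block runs on its own; the day-first date format and the names `rank_from_end`, `last`, `prev` and `summary` are assumptions made for illustration.

```python
import pandas as pd

hst = pd.DataFrame({
    'ord_id': [241147, 241147, 241147, 241148, 241148],
    'osh_id_created': pd.to_datetime(
        ['01.01.2016', '02.01.2016', '03.01.2016', '01.01.2016', '06.01.2016'],
        format='%d.%m.%Y'),  # assumption: day-first dates
    'osh_status_id': [1, 2, 10, 1, 5],
    'osh_status_reason': [None, None, None, None, 'Test_reason'],
})

hst_sorted = hst.sort_values(['ord_id', 'osh_id_created'])

# 0 marks the newest row of each order, 1 the row before it, and so on.
rank_from_end = hst_sorted.groupby('ord_id').cumcount(ascending=False)

last = hst_sorted[rank_from_end == 0].set_index('ord_id')
prev = hst_sorted[rank_from_end == 1].set_index('ord_id')

# Series are aligned on ord_id; orders with a single status row
# would get missing values in the Prev_* columns.
summary = pd.DataFrame({
    'Last_status_id': last['osh_status_id'],
    'Last_status_date': last['osh_id_created'],
    'Prev_status_id': prev['osh_status_id'],
    'Prev_status_date': prev['osh_id_created'],
    'reason': last['osh_status_reason'],
})
```

Because complete rows are selected, adding another column from the history (for example `osh_id`) is a one-line change.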

Now we can join it back to the original ord DF:

In [296]: ord.set_index('order_id').join(x).reset_index()
Out[296]:
   order_id    created  Last_status_id  Prev_status_id Last_status_date Prev_status_date
0    241147 2016-01-01              10               2       2016-03-01       2016-02-01
1    241148 2016-01-01               5               1       2016-06-01       2016-01-01

or use the merge() method:

In [297]: pd.merge(ord, x, left_on='order_id', right_index=True)
Out[297]:
   order_id    created  Last_status_id  Prev_status_id Last_status_date Prev_status_date
0    241147 2016-01-01              10               2       2016-03-01       2016-02-01
1    241148 2016-01-01               5               1       2016-06-01       2016-01-01
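
The two variants give the same result here, but they differ when an order has no history rows: `DataFrame.join` defaults to a left join, while `pd.merge` defaults to an inner join and would drop such an order. The end-to-end sketch below illustrates this. Order 241149 is a hypothetical extra order with no status history, the dates are assumed to be day-first, `ord_df` and `second_to_last` are illustrative names, and the aggregation uses named aggregation, which requires pandas 0.25 or later.

```python
import pandas as pd

# 241149 is a made-up order without any history rows.
ord_df = pd.DataFrame({
    'order_id': [241147, 241148, 241149],
    'created': pd.to_datetime(['01.01.2016'] * 3, format='%d.%m.%Y'),
})

hst = pd.DataFrame({
    'ord_id': [241147, 241147, 241147, 241148, 241148],
    'osh_status_id': [1, 2, 10, 1, 5],
    'osh_id_created': pd.to_datetime(
        ['01.01.2016', '02.01.2016', '03.01.2016', '01.01.2016', '06.01.2016'],
        format='%d.%m.%Y'),
})


def second_to_last(s):
    """Value before the last one; missing if the group has a single row."""
    return s.shift().iloc[-1]


x = (hst.sort_values('osh_id_created')
        .groupby('ord_id')
        .agg(Last_status_id=('osh_status_id', 'last'),
             Prev_status_id=('osh_status_id', second_to_last),
             Last_status_date=('osh_id_created', 'last'),
             Prev_status_date=('osh_id_created', second_to_last)))

# join() keeps every order (left join by default).
joined = ord_df.set_index('order_id').join(x).reset_index()

# merge() defaults to an inner join and drops order 241149 ...
merged_inner = pd.merge(ord_df, x, left_on='order_id', right_index=True)

# ... unless how='left' is requested explicitly.
merged_left = pd.merge(ord_df, x, left_on='order_id', right_index=True,
                       how='left')
```

With `how='left'`, the status columns of order 241149 are filled with missing values instead of the row disappearing, which is usually what is wanted when enriching an orders table.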