GroupBy - Datetime diff() combined with other conditions

Date: 2016-05-07 19:35:10

Tags: python python-3.x pandas

I have a dataframe that looks like this:

In [265]: df_2
Out[265]: 
        A          ID            DATETIME ORDER_FAILED
0   B-028  b76cd912ff 2014-10-08 13:43:27         True
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False
2   B-076  1a682034f8 2014-10-08 14:29:01        False
3   B-023  b76cd912ff 2014-10-08 18:39:34         True
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True
5   B-025  b76cd912ff 2014-10-08 18:42:02         True
6   B-026  b76cd912ff 2014-10-08 18:42:41        False
7   B-033  b76cd912ff 2014-10-08 18:44:30         True
8   B-032  b76cd912ff 2014-10-08 18:46:00         True
9   B-037  b76cd912ff 2014-10-08 18:52:15         True
10  B-046  db959faf02 2014-10-08 18:59:59        False
11  B-053  b76cd912ff 2014-10-08 19:17:48         True
12  B-065  b76cd912ff 2014-10-08 19:21:38        False

I need to discard all duplicate 'failed orders' - except the last one - in any sequence of failed orders.

A 'sequence' is a series of failed orders that meet the following criteria:

  1. placed by the same user, identified by 'ID'
  2. 'ORDER_FAILED' == True
  3. no two consecutive orders are more than 5 minutes apart

I was hoping I could do something like this:

In [298]: df_2[df_2.ORDER_FAILED == True].sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff().dt.total_seconds()
Out[298]: 
0         NaN
3     17767.0
4         NaN
5       148.0
7       148.0
8        90.0
9       375.0
11     1533.0
Name: DATETIME, dtype: float64

and then use a join (with the result above stored as df_tmp) to achieve this:

In [302]: df_2 = df_2.join(df_tmp); df_2
Out[302]: 
        A          ID            DATETIME ORDER_FAILED     diff
0   B-028  b76cd912ff 2014-10-08 13:43:27         True      NaN
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False      NaN
2   B-076  1a682034f8 2014-10-08 14:29:01        False      NaN
3   B-023  b76cd912ff 2014-10-08 18:39:34         True  17767.0
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True      NaN
5   B-025  b76cd912ff 2014-10-08 18:42:02         True    148.0
6   B-026  b76cd912ff 2014-10-08 18:42:41        False      NaN
7   B-033  b76cd912ff 2014-10-08 18:44:30         True    148.0
8   B-032  b76cd912ff 2014-10-08 18:46:00         True     90.0
9   B-037  b76cd912ff 2014-10-08 18:52:15         True    375.0
10  B-046  db959faf02 2014-10-08 18:59:59        False      NaN
11  B-053  b76cd912ff 2014-10-08 19:17:48         True   1533.0
12  B-065  b76cd912ff 2014-10-08 19:21:38        False      NaN

Unfortunately, however, this is not correct. Order 7 should have diff == NaN, since it is the first order in a new sequence of failed orders, coming after a successful order by this user (order 6).

I realize that the way I calculated the diff above is wrong, and I have not yet managed to find a way to reset the 'counter' after each successful order.
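
One possible sketch of such a reset (purely illustrative and unverified; FAIL_BLOCK is a made-up column name) would be to number each user's runs of consecutive failed orders by cumulatively counting that user's successful orders, and then diff only within those runs:

tmp = df_2.sort_values(by='DATETIME')
## every successful order starts a new block of failed orders for that user
tmp['FAIL_BLOCK'] = (~tmp['ORDER_FAILED']).astype(int).groupby(tmp['ID']).cumsum()
## diff only within a user's block, so the first failed order after a success gets NaN
tmp[tmp['ORDER_FAILED']].groupby(['ID', 'FAIL_BLOCK'])['DATETIME'].diff().dt.total_seconds()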

The desired, correct result would be:

In [303]: df_2
Out[303]: 
        A          ID            DATETIME ORDER_FAILED     diff
0   B-028  b76cd912ff 2014-10-08 13:43:27         True      NaN
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False      NaN
2   B-076  1a682034f8 2014-10-08 14:29:01        False      NaN
3   B-023  b76cd912ff 2014-10-08 18:39:34         True  17767.0
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True      NaN
5   B-025  b76cd912ff 2014-10-08 18:42:02         True    148.0
6   B-026  b76cd912ff 2014-10-08 18:42:41        False      NaN ## <- successful order
7   B-033  b76cd912ff 2014-10-08 18:44:30         True      NaN ## <- since this is the first failed order in this sequence of failed orders
8   B-032  b76cd912ff 2014-10-08 18:46:00         True     90.0
9   B-037  b76cd912ff 2014-10-08 18:52:15         True    375.0
10  B-046  db959faf02 2014-10-08 18:59:59        False      NaN
11  B-053  b76cd912ff 2014-10-08 19:17:48         True   1533.0
12  B-065  b76cd912ff 2014-10-08 19:21:38        False      NaN

After this, I would then simply mark the orders with diff > 300 like this:

>> df_2.loc[df_2['diff'] > 300, 'remove_flag'] = 1
>> df_2['remove_flag'] = df_2.groupby('ID')['remove_flag'].shift(-1) ## <- adjust flag to mark the previous order in the sequence
>> df_2 = df_2[df_2.remove_flag != 1]

This means that, in the end, the orders that should stay or be discarded are as follows:

>> df_2 
        A          ID            DATETIME ORDER_FAILED     diff
0   B-028  b76cd912ff 2014-10-08 13:43:27         True      NaN ## STAYS - Failed, but gap to next failed by same user is greater than 5 minutes
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False      NaN ## STAYS - successful order
2   B-076  1a682034f8 2014-10-08 14:29:01        False      NaN ## STAYS - successful order
3   B-023  b76cd912ff 2014-10-08 18:39:34         True  17767.0 ## DISCARD - The next failed order by the same user is only 148 seconds away (less than 5 minutes)
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True      NaN ## STAYS - successful order
5   B-025  b76cd912ff 2014-10-08 18:42:02         True    148.0 ## STAYS - last in this sequence of failed orders by this user
6   B-026  b76cd912ff 2014-10-08 18:42:41        False      NaN ## STAYS - successful order
7   B-033  b76cd912ff 2014-10-08 18:44:30         True      NaN ## DISCARD - The next failed order by the same user is only 90 seconds away (less than 5 minutes)
8   B-032  b76cd912ff 2014-10-08 18:46:00         True     90.0 ## STAYS - next failed order by the same user is more than 5 minutes away
9   B-037  b76cd912ff 2014-10-08 18:52:15         True    375.0 ## STAYS - More than 5 minutes away from previous failed order by the same user
10  B-046  db959faf02 2014-10-08 18:59:59        False      NaN ## STAYS - Successful order
11  B-053  b76cd912ff 2014-10-08 19:17:48         True   1533.0 ## STAYS - too long since last failed order by this same user
12  B-065  b76cd912ff 2014-10-08 19:21:38        False      NaN ## STAYS - Successful order

Any help is much appreciated, thanks!

1 answer:

Answer 0: (score: 1)

I would start by sorting by ID and DATETIME (in ascending order):

df1 = df.sort_values(by = ['ID','DATETIME'])

Now, if I understand correctly, we need to remove all orders that satisfy the following conditions (by "next" I mean "in the next row"):

  • the order failed

  • the next order failed

  • the time difference between the order and the next order is at most 300 seconds

  • (additionally) the ID is the same as the next ID (otherwise it would be that user's last order)

My idea is simple: add appropriate columns, so that every row contains all the data needed to evaluate these conditions.

This one adds the "next ID" and "next ORDER_FAILED" fields:

df1[['Next_ID','Next_ORDER_FAILED']] = df1[['ID','ORDER_FAILED']].shift(-1)

and this one takes care of the time difference to the next order:

df1['diff'] = -df1['DATETIME'].diff(-1).dt.total_seconds()

(a diff with periods=-1 gives negative values, hence the minus sign).
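
For illustration, a tiny example with two made-up timestamps (assuming pandas is imported as pd):

s = pd.Series(pd.to_datetime(['2014-10-08 18:42:02', '2014-10-08 18:44:30']))
s.diff(-1).dt.total_seconds()     ## 0   -148.0, 1   NaN
(-s.diff(-1)).dt.total_seconds()  ## 0    148.0, 1   NaN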

I believe the rest is pretty straightforward from here.
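
For completeness, a minimal sketch of that remaining step could look like this (using the Next_ID, Next_ORDER_FAILED and diff columns added above; untested):

## remove an order if it failed, the same user's next order also failed,
## and that next order came within 300 seconds
to_remove = (
    (df1['ORDER_FAILED'] == True)
    & (df1['Next_ORDER_FAILED'] == True)
    & (df1['Next_ID'] == df1['ID'])
    & (df1['diff'] <= 300)
)
df1[~to_remove].drop(['Next_ID', 'Next_ORDER_FAILED', 'diff'], axis=1)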

UPDATE: By the way, even without adding new columns to the dataframe, we could create a boolean mask:

mask = (df1['ORDER_FAILED'] == True) & (df1['ORDER_FAILED'].shift(-1) == True) & ...

UPDATE

There is actually no need to sort by ID at all, and the overall solution is much cleaner if groupby() is used properly. Following the suggestions above, here is how it can finally be done.

In [478]: df_3
Out[478]: 
        A          ID            DATETIME ORDER_FAILED
0   B-028  b76cd912ff 2014-10-08 13:43:27         True
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False
2   B-076  1a682034f8 2014-10-08 14:29:01        False
3   B-023  b76cd912ff 2014-10-08 18:39:34         True
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True
5   B-025  b76cd912ff 2014-10-08 18:42:02         True
6   B-026  b76cd912ff 2014-10-08 18:42:41        False
7   B-033  b76cd912ff 2014-10-08 18:44:30         True
8   B-032  b76cd912ff 2014-10-08 18:46:00         True
9   B-037  b76cd912ff 2014-10-08 18:52:15         True
10  B-046  db959faf02 2014-10-08 18:59:59        False
11  B-053  b76cd912ff 2014-10-08 19:17:48         True
12  B-065  b76cd912ff 2014-10-08 19:21:38        False

In [479]: df_3['NEXT_FAILED'] = df_3.sort_values(by='DATETIME').groupby('ID')['ORDER_FAILED'].shift(-1)

In [480]: df_3['SECONDS_TO_NEXT_ORDER'] = -df_3.sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff(-1).dt.total_seconds()

In [481]: condition = (df_3.NEXT_FAILED == True) & (df_3.ORDER_FAILED == True) & (df_3.SECONDS_TO_NEXT_ORDER <= 300)

In [482]: df_3[~condition].drop(['NEXT_FAILED','SECONDS_TO_NEXT_ORDER'], axis=1)
Out[482]: 
        A          ID            DATETIME ORDER_FAILED
0   B-028  b76cd912ff 2014-10-08 13:43:27         True
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False
2   B-076  1a682034f8 2014-10-08 14:29:01        False
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True
5   B-025  b76cd912ff 2014-10-08 18:42:02         True
6   B-026  b76cd912ff 2014-10-08 18:42:41        False
8   B-032  b76cd912ff 2014-10-08 18:46:00         True
9   B-037  b76cd912ff 2014-10-08 18:52:15         True
10  B-046  db959faf02 2014-10-08 18:59:59        False
11  B-053  b76cd912ff 2014-10-08 19:17:48         True
12  B-065  b76cd912ff 2014-10-08 19:21:38        False

The right orders - exactly as described by the OP - have indeed been removed!