Create a flag based on cumsum and timediff

Time: 2017-09-07 12:48:18

Tags: python pandas numpy

Consider the following data frame:

import pandas as pd
import numpy as np

np.random.seed(666)
dd=pd.DataFrame({'v1': np.random.choice(range(30), 20),
                 'v2': np.random.choice(pd.date_range(
                       '5/3/2016', periods=365, freq='D'),
                     20, replace=False)
                 })
dd=dd.sort_values('v2')

#    v1         v2
#5    4 2016-05-03
#11  14 2016-05-26
#19  12 2016-06-26
#15   8 2016-07-06
#7   27 2016-08-04
#4    9 2016-08-28
#17   5 2016-09-08
#13  16 2016-10-04
#14  14 2016-10-10
#18  18 2016-11-25
#3    6 2016-12-03
#8   19 2016-12-04
#12   1 2016-12-12
#10  28 2017-01-14
#1    2 2017-02-12
#0   12 2017-02-15
#9   28 2017-03-11
#6   29 2017-03-18
#16   7 2017-03-21
#2   13 2017-04-29

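I want to create groups based on the following two conditions:

  1. The cumulative sum of v1 is <= 40
  2. The time difference in v2 is <= 61 days

In other words, each group must hold either a v1 total of 40 or a span of 2 months. So if 61 days pass before 40 is reached, the group is closed anyway. If 40 is reached within 1 day, the group is closed as well.

The resulting flag is: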

    dd['expected_flag']=[1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9]
    

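I asked a very similar question for R here, but now that there is a new requirement (the date) I can't get my head around it.

Note that I will run this on a huge dataset, so the more efficient the better.

EDIT: I found this question, which basically deals with the first condition but not with the date condition.

EDIT 2: The 61-day time difference only illustrates the time constraint. In reality, the constraint will be in minutes.

EDIT 3: Using the function provided by @Maarten, I get the following (first 40 rows), where group 1 should also include the first 2 rows of group 2 (i.e. v1 = 6 and v1 = 6).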

    Out[330]: 
        index                  v2  v1  max_limit       group
    0       2 2017-04-01 00:00:02  14      335.0        1
    1       3 2017-04-01 00:00:03   8      335.0        1
    2      13 2017-04-01 00:00:13  11      335.0        1
    3      14 2017-04-01 00:00:14  11      335.0        1
    4      29 2017-04-01 00:00:29   4      335.0        1
    5      44 2017-04-01 00:00:44  16      335.0        1
    6      52 2017-04-01 00:00:52  10      335.0        1
    7      58 2017-04-01 00:00:58  11      335.0        1
    8      65 2017-04-01 00:01:05  15      335.0        1
    9      68 2017-04-01 00:01:08   8      335.0        1
    10     81 2017-04-01 00:01:21  12      335.0        1
    11     98 2017-04-01 00:01:38   9      335.0        1
    12    102 2017-04-01 00:01:42   7      335.0        1
    13    107 2017-04-01 00:01:47  12      335.0        1
    14    113 2017-04-01 00:01:53   6      335.0        1
    15    116 2017-04-01 00:01:56   6      335.0        1
    16    121 2017-04-01 00:02:01   4      335.0        1
    17    128 2017-04-01 00:02:08  16      335.0        1
    18    143 2017-04-01 00:02:23   7      335.0        1
    19    149 2017-04-01 00:02:29  11      335.0        1
    20    163 2017-04-01 00:02:43   4      335.0        1
    21    185 2017-04-01 00:03:05   9      335.0        1
    22    239 2017-04-01 00:03:59   6      335.0        1
    23    242 2017-04-01 00:04:02  13      335.0        1
    24    272 2017-04-01 00:04:32   4      335.0        1
    25    293 2017-04-01 00:04:53   8      335.0        1
    26    301 2017-04-01 00:05:01  10      335.0        1
    27    302 2017-04-01 00:05:02   7      335.0        1
    28    305 2017-04-01 00:05:05  12      335.0        1
    29    323 2017-04-01 00:05:23   5      335.0        1
    30    326 2017-04-01 00:05:26  13      335.0        1
    31    329 2017-04-01 00:05:29  10      335.0        1
    32    365 2017-04-01 00:06:05  10      335.0        1
    33    368 2017-04-01 00:06:08  11      335.0        1
    34    411 2017-04-01 00:06:51   6      335.0        2
    35    439 2017-04-01 00:07:19   6      335.0        2
    36    440 2017-04-01 00:07:20   8      335.0        2
    37    466 2017-04-01 00:07:46   7      335.0        2
    38    475 2017-04-01 00:07:55   4      335.0        2
    39    489 2017-04-01 00:08:09   4      335.0        2 
    

So to be clear, when I sum v1 and compute the resulting time difference per group:

    dd.groupby('group', as_index=False).agg({'v1': 'sum', 'v2': lambda x: max(x)-min(x)})
    Out[332]: 
    #      group   v1       v2
    #0         1  320 00:06:06
    #1         2  326 00:07:34
    #2         3  330 00:06:53
    #...
    

2 Answers:

Answer 0: (score: 3)

Setup:

dd['days'] = dd['v2'].diff().dt.days.fillna(0).astype(int)
dd = dd[['v1', 'v2', 'days']]  # the order of the columns matters

Initialization:

increment = pd.Series(False, index=dd.index)
v1_cum = 0
days_cum = 0

Loop:

for row in dd.itertuples(name=None):  # faster than iterrows
    v1_cum += row[1]
    days_cum += row[3]
    if v1_cum > 40 or days_cum > 61:
        increment[row[0]] = True  # first element of tuple is index
        # notice the different re-initialization
        v1_cum = row[1]
        days_cum = 0

Assignment:

dd['flag'] = increment.cumsum() + 1

Output:

[1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9]
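
Since the question asks for efficiency, the loop above can also be wrapped in a function that walks over plain NumPy arrays instead of row tuples. This is only a minimal sketch, assuming the frame is already sorted by v2; the name flag_groups and its parameters v1_limit and day_limit are made up for illustration, and whether it is actually faster on a large frame has not been measured here.

```python
import numpy as np
import pandas as pd


def flag_groups(df, v1_limit=40, day_limit=61):
    """Hypothetical helper: the same rule as the loop above, run over NumPy arrays.

    Assumes df is sorted by 'v2' and that 'v2' holds datetimes.
    """
    v1 = df['v1'].to_numpy()
    # Whole days elapsed since the previous row (0 for the first row)
    days = df['v2'].diff().dt.days.fillna(0).astype(int).to_numpy()
    flags = np.empty(len(df), dtype=np.int64)
    group, v1_cum, days_cum = 1, 0, 0
    for i in range(len(df)):
        v1_cum += v1[i]
        days_cum += days[i]
        if v1_cum > v1_limit or days_cum > day_limit:
            # This row opens a new group; v1 restarts from the current value
            group += 1
            v1_cum = v1[i]
            days_cum = 0
        flags[i] = group
    return pd.Series(flags, index=df.index)


# The 20 rows printed in the question, already sorted by date
sample = pd.DataFrame({
    'v1': [4, 14, 12, 8, 27, 9, 5, 16, 14, 18,
           6, 19, 1, 28, 2, 12, 28, 29, 7, 13],
    'v2': pd.to_datetime([
        '2016-05-03', '2016-05-26', '2016-06-26', '2016-07-06', '2016-08-04',
        '2016-08-28', '2016-09-08', '2016-10-04', '2016-10-10', '2016-11-25',
        '2016-12-03', '2016-12-04', '2016-12-12', '2017-01-14', '2017-02-12',
        '2017-02-15', '2017-03-11', '2017-03-18', '2017-03-21', '2017-04-29',
    ]),
})
flags = flag_groups(sample)
```

On the sample rows from the question this gives the same flags as the expected_flag column. For a constraint in minutes, the days array would have to be replaced by elapsed minutes.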

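Answer 1: (score: 2)

A slightly different approach from @IanS's. I don't know which one will be faster. This one actually computes the difference in months.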

import itertools
import operator

def diff_in_months(date1, date2):
    # Order the two dates so the result is never negative
    x, y = max(date1, date2), min(date1, date2)
    # Mixed-radix weights for (year, month, day, hour, minute, second)
    coefficients = 12, 100, 24, 100, 100, 1
    coefficients = list(reversed(list(itertools.accumulate(reversed(coefficients), operator.mul))))

    # Encode both dates as integers, then floor-divide by the weight of one month
    return (sum(i * j for i, j in zip(coefficients, x.timetuple())) - sum(i * j for i, j in zip(coefficients, y.timetuple()))) // coefficients[1]

This can be sped up by computing the coefficients only once (using a global variable) rather than on every call.

def my_grouping(df):
    i = 1
    v1 = 0
    v2 = df['v2'].iloc[0]
    for row in df.itertuples():
#         print(row)
        if diff_in_months(v2, row.v2) >= 2 or (v1 + row.v1 >= 41):
            i += 1
            v1 = row.v1
            v2 = row.v2
        else:
            v1 += row.v1
        yield i

flag_series = pd.Series(my_grouping(dd), index=dd.index)
dd.assign(flag=flag_series, expected_flag = [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9])
    v1  v2  expected_flag   flag
5   4   2016-05-03  1   1
11  14  2016-05-26  1   1
19  12  2016-06-26  1   1
15  8   2016-07-06  2   2
7   27  2016-08-04  2   2
4   9   2016-08-28  3   3
17  5   2016-09-08  3   3
13  16  2016-10-04  3   3
14  14  2016-10-10  4   4
18  18  2016-11-25  4   4
3   6   2016-12-03  4   4
8   19  2016-12-04  5   5
12  1   2016-12-12  5   5
10  28  2017-01-14  6   6
1   2   2017-02-12  6   6
0   12  2017-02-15  7   7
9   28  2017-03-11  7   7
6   29  2017-03-18  8   8
16  7   2017-03-21  8   8
2   13  2017-04-29  9   9

Arbitrary interval

def my_grouping_arbitrary_interval(df, diff_v1 = 41, interval = pd.Timedelta(61, 'D')):
    i = 1
    v1 = 0
    v2 = df['v2'].iloc[0]
    for row in df.itertuples():
#         print(row)
        if max(v2, row.v2) - min(v2, row.v2) >= interval or (v1 + row.v1 >= diff_v1):
            i += 1
            v1 = row.v1
            v2 = row.v2
        else:
            v1 += row.v1
        yield i

The catch is that pd.Timedelta only accepts unit : string, [D,h,m,s,ms,us,ns] as input, so there are no months or years. For those you will have to adapt my diff_in_months.
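
One possible way around the missing month unit, offered as an assumption rather than as part of the original answer, is to compare against pd.DateOffset, which does understand calendar months. The function name my_grouping_calendar_interval and its offset parameter are hypothetical. The frame is assumed to be sorted by v2, and results can differ from diff_in_months near month ends, because DateOffset clips to the last valid day of the month.

```python
import pandas as pd


def my_grouping_calendar_interval(df, diff_v1=41, offset=pd.DateOffset(months=2)):
    """Hypothetical variant: close a group once `offset` has elapsed
    since the group's first row, or once the v1 total reaches diff_v1."""
    i = 1
    v1 = 0
    v2 = df['v2'].iloc[0]
    for row in df.itertuples():
        if row.v2 >= v2 + offset or v1 + row.v1 >= diff_v1:
            # Start a new group at this row
            i += 1
            v1 = row.v1
            v2 = row.v2
        else:
            v1 += row.v1
        yield i


# The 20 rows printed in the question, already sorted by date
sample = pd.DataFrame({
    'v1': [4, 14, 12, 8, 27, 9, 5, 16, 14, 18,
           6, 19, 1, 28, 2, 12, 28, 29, 7, 13],
    'v2': pd.to_datetime([
        '2016-05-03', '2016-05-26', '2016-06-26', '2016-07-06', '2016-08-04',
        '2016-08-28', '2016-09-08', '2016-10-04', '2016-10-10', '2016-11-25',
        '2016-12-03', '2016-12-04', '2016-12-12', '2017-01-14', '2017-02-12',
        '2017-02-15', '2017-03-11', '2017-03-18', '2017-03-21', '2017-04-29',
    ]),
})
calendar_flags = list(my_grouping_calendar_interval(sample))
```

On the sample rows this yields the same flags as the expected_flag column.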