Question

I have data sorted by ID and timestamp:

1   2016-05-13 22:21:02.625+00  Deposit 
1   2016-05-13 22:29:26.402+00  Deposit 
1   2016-05-16 00:51:22.835+00  Withdrawal  
1   2019-03-21 21:01:28.528+00  Withdrawal  
1   2019-03-21 21:02:10.509+00  Withdrawal  
1   2019-12-06 23:34:15.194+00  Deposit 
2   2019-12-03 23:21:33.465+00  Withdrawal  
2   2019-12-05 00:51:01.136+00  Deposit 
2   2019-12-06 20:07:11.122+00  Deposit

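The desired output below groups rows by Deposit/Withdrawal only, but a group ends each time the type changes from Deposit to Withdrawal or vice versa: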

1   2016-05-13 22:29:26.402+00  Deposit 2
1   2019-03-21 21:02:10.509+00  Withdrawal 3
1   2019-12-06 23:34:15.194+00  Deposit 1
2   2019-12-03 23:21:33.465+00  Withdrawal 1
2   2019-12-06 20:07:11.122+00  Deposit 2

Is there a clean way to do this?

Answer 1

You can use itertools.groupby (doc) to group the data by ID and type.
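The reason this fits the question is that itertools.groupby only merges consecutive items with the same key, so a run ends as soon as the key changes. A minimal sketch of that behavior, using a made-up list of types (not the data from the question):

```python
from itertools import groupby

# Hypothetical toy input: only the transaction type, already in time order.
types = ["Deposit", "Deposit", "Withdrawal", "Deposit"]

# groupby starts a new group whenever the key differs from the previous item,
# so the trailing "Deposit" is not merged with the first two.
runs = [(key, len(list(group))) for key, group in groupby(types)]
print(runs)  # [('Deposit', 2), ('Withdrawal', 1), ('Deposit', 1)]
```

This is also why the input has to be sorted by (ID, Time) first: groupby never looks back at earlier groups.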

For example:

data = [
['1', '2016-05-13 22:21:02.625+00', 'Deposit'],
['1', '2016-05-13 22:29:26.402+00', 'Deposit'],
['1', '2016-05-16 00:51:22.835+00', 'Withdrawal'],
['1', '2019-03-21 21:01:28.528+00', 'Withdrawal'],
['1', '2019-03-21 21:02:10.509+00', 'Withdrawal'],
['1', '2019-12-06 23:34:15.194+00', 'Deposit'],
['2', '2019-12-03 23:21:33.465+00', 'Withdrawal'],
['2', '2019-12-05 00:51:01.136+00', 'Deposit'],
['2', '2019-12-06 20:07:11.122+00', 'Deposit']
]

from itertools import groupby

data = sorted(data, key=lambda k: (int(k[0]), k[1]))    # <-- if your data is sorted by (ID, Time), you may skip this

for v, g in groupby(data, lambda k: (k[0], k[2])):
    g = list(g)
    print(v[0], g[-1][1], g[-1][2], len(g))

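Prints:

1 2016-05-13 22:29:26.402+00 Deposit 2
1 2019-03-21 21:02:10.509+00 Withdrawal 3
1 2019-12-06 23:34:15.194+00 Deposit 1
2 2019-12-03 23:21:33.465+00 Withdrawal 1
2 2019-12-06 20:07:11.122+00 Deposit 2

Note: You can sort the timestamps as strings. The format allows it: it is fixed-width, ordered from year down to milliseconds, and every value uses the same UTC offset.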
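If you want to convince yourself of the note about string sorting, here is a small check that compares lexicographic order with true chronological order for a few of the timestamps above. The parse helper is hypothetical and assumes every value ends in "+00"; with mixed offsets or a different layout, string sorting may no longer be safe.

```python
from datetime import datetime

stamps = [
    "2019-12-06 23:34:15.194+00",
    "2016-05-13 22:29:26.402+00",
    "2019-03-21 21:01:28.528+00",
    "2016-05-13 22:21:02.625+00",
]

def parse(ts):
    # Hypothetical helper: pad the "+00" offset to "+0000" so %z can read it.
    return datetime.strptime(ts + "00", "%Y-%m-%d %H:%M:%S.%f%z")

by_string = sorted(stamps)           # plain lexicographic sort
by_time = sorted(stamps, key=parse)  # sort by the parsed datetime
print(by_string == by_time)  # True
```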

Rolling groupby function based on a column in Python
