I have data sorted by ID and timestamp:
1 2016-05-13 22:21:02.625+00 Deposit
1 2016-05-13 22:29:26.402+00 Deposit
1 2016-05-16 00:51:22.835+00 Withdrawal
1 2019-03-21 21:01:28.528+00 Withdrawal
1 2019-03-21 21:02:10.509+00 Withdrawal
1 2019-12-06 23:34:15.194+00 Deposit
2 2019-12-03 23:21:33.465+00 Withdrawal
2 2019-12-05 00:51:01.136+00 Deposit
2 2019-12-06 20:07:11.122+00 Deposit
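The desired output below groups consecutive rows by Deposit/Withdrawal only, starting a new group every time the type changes from Deposit to Withdrawal or vice versa. Each group keeps its last timestamp and the number of rows in it: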
1 2016-05-13 22:29:26.402+00 Deposit 2
1 2019-03-21 21:02:10.509+00 Withdrawal 3
1 2019-12-06 23:34:15.194+00 Deposit 1
2 2019-12-03 23:21:33.465+00 Withdrawal 1
2 2019-12-06 20:07:11.122+00 Deposit 2
Is there a clean way to do this?
Answer 0 (score: 0)
You can use itertools.groupby (doc) to group the data by ID and type.
For example:
data = [
['1', '2016-05-13 22:21:02.625+00', 'Deposit'],
['1', '2016-05-13 22:29:26.402+00', 'Deposit'],
['1', '2016-05-16 00:51:22.835+00', 'Withdrawal'],
['1', '2019-03-21 21:01:28.528+00', 'Withdrawal'],
['1', '2019-03-21 21:02:10.509+00', 'Withdrawal'],
['1', '2019-12-06 23:34:15.194+00', 'Deposit'],
['2', '2019-12-03 23:21:33.465+00', 'Withdrawal'],
['2', '2019-12-05 00:51:01.136+00', 'Deposit'],
['2', '2019-12-06 20:07:11.122+00', 'Deposit']
]
from itertools import groupby
data = sorted(data, key=lambda k: (int(k[0]), k[1])) # <-- if your data is sorted by (ID, Time), you may skip this
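for v, g in groupby(data, lambda k: (k[0], k[2])):
    g = list(g)  # materialize the group so we can take its last row and its length
    print(v[0], g[-1][1], g[-1][2], len(g))
Prints:
1 2016-05-13 22:29:26.402+00 Deposit 2
1 2019-03-21 21:02:10.509+00 Withdrawal 3
1 2019-12-06 23:34:15.194+00 Deposit 1
2 2019-12-03 23:21:33.465+00 Withdrawal 1
2 2019-12-06 20:07:11.122+00 Deposit 2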
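The reason this gives "islands" rather than one bucket per type is that itertools.groupby only merges consecutive items that share a key, so a new group starts whenever the key changes. A minimal sketch on a made-up sequence (the letters D and W are only stand-ins for Deposit and Withdrawal, not part of the original data):

```python
from itertools import groupby

# Hypothetical toy sequence: D = Deposit, W = Withdrawal
events = "DDWWWD"

# groupby starts a new group every time the key changes, so the
# trailing "D" is counted separately from the first run of "D"
runs = [(kind, len(list(group))) for kind, group in groupby(events)]
print(runs)  # [('D', 2), ('W', 3), ('D', 1)]
```

This is also why the input has to be sorted by (ID, time) and not by type: sorting by type would merge all the runs of one type into a single group.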
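Note: you can sort the timestamps as plain strings. The format allows it, because every value is fixed-width, ordered from year down to milliseconds, and carries the same +00 offset.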
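If you want to convince yourself of that, here is a small check that compares string order with real datetime order. It is only a sketch: the helper name parse is mine, and it assumes every timestamp has the same "+00" offset and millisecond precision as in the question.

```python
from datetime import datetime

# A few timestamps from the question, deliberately shuffled
timestamps = [
    "2019-12-06 23:34:15.194+00",
    "2016-05-13 22:29:26.402+00",
    "2019-03-21 21:02:10.509+00",
    "2016-05-13 22:21:02.625+00",
]

def parse(ts):
    # strptime's %z needs at least +HHMM, so pad the "+00" offset to "+0000"
    return datetime.strptime(ts + "00", "%Y-%m-%d %H:%M:%S.%f%z")

by_string = sorted(timestamps)
by_datetime = sorted(timestamps, key=parse)
print(by_string == by_datetime)  # True for this sample
```

If your offsets differ between rows, string order is no longer guaranteed to match chronological order, and sorting on the parsed value is the safer choice.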