I have the following DataFrame:
NUMBER TIMESTAMP
0 67 2020-03-06 12:06:25
1 67 2019-09-20 16:21:45
2 67 2019-09-17 17:54:40
3 67 2019-08-21 19:59:30
4 67 2019-08-13 19:40:26
5 67 2019-06-19 19:45:12
6 67 2019-04-30 20:46:03
7 67 2019-04-29 20:46:03
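I am trying to group the NUMBER column by timestamp, so that rows whose time difference is less than 30 days (the threshold could be any number) end up in the same group.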
My expected output is:
NUMBER TIMESTAMP
0 67_4 2020-03-06 12:06:25
1 67_3 2019-09-20 16:21:45
2 67_3 2019-09-17 17:54:40
3 67_3 2019-08-21 19:59:30
4 67_3 2019-08-13 19:40:26
5 67_2 2019-06-19 19:45:12
6 67_1 2019-04-30 20:46:03
7 67_1 2019-04-29 20:46:03
A simple comparison on the timestamps like the one below, or even applying ceil to the series, does not help:
#df['diff'] = df.groupby('NUMBER')['TIMESTAMP'].diff(-1).dt.days
#mask = df['diff'] > 30
#
#df.loc[mask, 'NUMBER'] = df\
# .loc[mask, 'NUMBER']\
# .astype(str) \
# + '_' \
# + df[mask]\
# .groupby('NUMBER')\
# .cumcount().add(1).astype(str)
If we do a simple comparison, this is the result:
NUMBER TIMESTAMP diff
0 67_1 2020-03-06 12:06:25 167.0
1 67 2019-09-20 16:21:45 2.0
2 67 2019-09-17 17:54:40 26.0
3 67 2019-08-21 19:59:30 8.0
4 67_2 2019-08-13 19:40:26 54.0
5 67_3 2019-06-19 19:45:12 49.0
6 67 2019-04-30 20:46:03 1.0
7 67 2019-04-29 20:46:03 NaN
Any help would be greatly appreciated.
Answer 0 (score: 1)
Let's compare the differences between consecutive timestamps against the threshold and take the cumsum of the result:
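import pandas as pd

# Assumes TIMESTAMP is already datetime64; if not, convert it first:
# df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])
thresh = 30
dt_thresh = pd.to_timedelta(thresh, unit='D')
df['NUMBER'] = df['NUMBER'].astype(str) + '_' + \
    (df.TIMESTAMP.sort_values()      # oldest first
       .groupby(df.NUMBER)           # added groupby here
       .diff().gt(dt_thresh)         # True where the gap to the previous row exceeds the threshold
       .groupby(df.NUMBER)           # and groupby here
       .cumsum().add(1).astype(str)  # running count of large gaps, starting at 1
    )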
Output:
NUMBER TIMESTAMP
0 67_4 2020-03-06 12:06:25
1 67_3 2019-09-20 16:21:45
2 67_3 2019-09-17 17:54:40
3 67_3 2019-08-21 19:59:30
4 67_3 2019-08-13 19:40:26
5 67_2 2019-06-19 19:45:12
6 67_1 2019-04-30 20:46:03
7 67_1 2019-04-29 20:46:03
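For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same diff-then-cumsum idea. It rebuilds the sample frame from the question and wraps the logic in a helper. The function name `label_time_clusters`, its `thresh_days` parameter and the `LABEL` column are made up for illustration and are not part of pandas or of the answer above. The sketch assumes the frame has a unique index.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "NUMBER": [67] * 8,
    "TIMESTAMP": pd.to_datetime([
        "2020-03-06 12:06:25",
        "2019-09-20 16:21:45",
        "2019-09-17 17:54:40",
        "2019-08-21 19:59:30",
        "2019-08-13 19:40:26",
        "2019-06-19 19:45:12",
        "2019-04-30 20:46:03",
        "2019-04-29 20:46:03",
    ]),
})


def label_time_clusters(frame, thresh_days=30):
    """Hypothetical helper: suffix NUMBER with a per-NUMBER cluster index."""
    # Oldest first within each NUMBER, so diff() compares against the earlier row.
    ordered = frame.sort_values(["NUMBER", "TIMESTAMP"])
    # Gap to the previous row within the same NUMBER; NaT for the first row.
    gap = ordered.groupby("NUMBER")["TIMESTAMP"].diff()
    # A gap above the threshold starts a new cluster (NaT compares as False).
    starts_new = gap.gt(pd.Timedelta(days=thresh_days)).astype(int)
    # Running count of cluster starts per NUMBER, shifted to begin at 1.
    cluster = starts_new.groupby(ordered["NUMBER"]).cumsum().add(1)
    labels = ordered["NUMBER"].astype(str) + "_" + cluster.astype(str)
    # Put the labels back in the caller's original row order.
    return labels.reindex(frame.index)


df["LABEL"] = label_time_clusters(df)
print(df)
```

Writing the result to a separate column keeps the original NUMBER intact, which makes it easier to re-run with a different threshold, for example `label_time_clusters(df, thresh_days=60)`.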