Given the following pandas DataFrame:
timestamp
0 2018-10-05 23:07:02
1 2018-10-05 23:07:13
2 2018-10-05 23:07:23
3 2018-10-05 23:07:36
4 2018-10-05 23:08:02
5 2018-10-05 23:09:16
6 2018-10-05 23:09:21
7 2018-10-05 23:09:39
8 2018-10-05 23:09:47
9 2018-10-05 23:10:01
10 2018-10-05 23:10:11
11 2018-10-05 23:10:23
12 2018-10-05 23:10:59
13 2018-10-05 23:11:03
14 2018-10-08 03:35:32
15 2018-10-08 03:35:58
16 2018-10-08 03:37:16
17 2018-10-08 03:38:04
18 2018-10-08 03:38:30
19 2018-10-08 03:38:36
20 2018-10-08 03:38:42
21 2018-10-08 03:38:52
22 2018-10-08 03:38:57
23 2018-10-08 03:39:10
24 2018-10-08 03:39:27
25 2018-10-08 03:40:47
26 2018-10-08 03:40:54
27 2018-10-08 03:41:02
28 2018-10-08 03:41:12
29 2018-10-08 03:41:32
How can I label each row with the ten-minute period it falls into? For example:
timestamp 10min_period
0 2018-10-05 23:07:02 period_1
1 2018-10-05 23:07:13 period_1
2 2018-10-05 23:07:23 period_1
3 2018-10-05 23:07:36 period_1
4 2018-10-05 23:08:02 period_1
5 2018-10-05 23:09:16 period_1
6 2018-10-05 23:09:21 period_1
7 2018-10-05 23:09:39 period_1
8 2018-10-05 23:09:47 period_1
9 2018-10-05 23:10:01 period_1
10 2018-10-05 23:10:11 period_1
11 2018-10-05 23:10:23 period_1
12 2018-10-05 23:10:59 period_1
13 2018-10-05 23:11:03 period_1
14 2018-10-08 03:35:32 period_2
15 2018-10-08 03:35:58 period_2
16 2018-10-08 03:37:16 period_2
17 2018-10-08 03:38:04 period_2
18 2018-10-08 03:38:30 period_2
19 2018-10-08 03:38:36 period_2
20 2018-10-08 03:38:42 period_2
21 2018-10-08 03:38:52 period_2
22 2018-10-08 03:38:57 period_2
23 2018-10-08 03:39:10 period_2
24 2018-10-08 03:39:27 period_2
25 2018-10-08 03:40:47 period_2
26 2018-10-08 04:40:54 period_3
27 2018-10-08 04:41:02 period_3
28 2018-10-08 04:41:12 period_3
29 2018-10-08 04:41:32 period_3
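As the expected output above shows, each period_n label is created by counting off a 10-minute window: once the datetime series passes the 10-minute threshold, a new label is created. I tried dt.floor('10Min'), but it does not work properly because it cannot keep track of where each 10-minute window starts and ends. I also tried: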
a = df['timestamp'].offsets.DateOffset(minutes=10)
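However, this does not work. Any idea how to segment the DataFrame into 10-minute periods? This question differs from others because I am not specifying any particular time to start counting from. That is, I start counting at the first datetime row and measure ten-minute intervals from there.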
Update:
After converting the column to datetime objects, I also tried:
df['timestamp'].groupby(pd.TimeGrouper(freq='10Min'))
However, I got:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Answer 0: (score: 3)
With a few vectorized operations, this should be possible (and performant):
import numpy as np
import pandas as pd

# Convert to datetime if not already.
# df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
u = (df.assign(timestamp=df['timestamp'].dt.floor('20min'))
       .groupby(pd.Grouper(key='timestamp', freq='10min'))
       .ngroup())
# Renumber the group ids consecutively from 1 and turn them into labels.
df['10min_period'] = np.char.add('period_', (pd.factorize(u)[0] + 1).astype(str))
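Unfortunately, the drawback is that although this produces the expected output for your sample data, there is no easy way to handle contiguous 10-minute intervals. pd.Grouper builds its bins from your column, so dt.floor('20min') is needed as a first step, and this will inadvertently place some or most of the rows that belong to "period_{i+1}" under "period_{i}".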
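If the windows should instead be anchored at the first row of each period, as the question describes, a plain loop is one option. This is only a minimal sketch: it assumes the timestamps are sorted in ascending order and that a row exactly 10 minutes after the window start opens a new window. The function name label_anchored_periods and the sample rows are illustrative and not part of the original post.

```python
import pandas as pd


def label_anchored_periods(timestamps, width="10min"):
    """Label timestamps with windows anchored at the first row of each window.

    Assumes `timestamps` is sorted ascending (hypothetical helper).
    """
    width = pd.Timedelta(width)
    labels = []
    window_start = None
    period = 0
    for ts in timestamps:
        # Open a new window for the first row, or once ts is at least
        # `width` past the start of the current window.
        if window_start is None or ts - window_start >= width:
            window_start = ts
            period += 1
        labels.append(f"period_{period}")
    return labels


# Illustrative rows only; the last one in the first block is more than
# 10 minutes after 23:07:02, so it opens a second period.
df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2018-10-05 23:07:02",
    "2018-10-05 23:11:03",
    "2018-10-05 23:17:05",
    "2018-10-08 03:35:32",
    "2018-10-08 03:40:47",
])})
df["10min_period"] = label_anchored_periods(df["timestamp"])
print(df)
```

The loop is not vectorized, so it trades speed for the ability to restart the clock at an arbitrary row, which fixed-frequency bins cannot do.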
Answer 1: (score: 1)
To reproduce your problem, I did this:
import datetime
import pandas as pd
index = pd.date_range(datetime.datetime.now().date() - datetime.timedelta(10), periods=100, freq='min')
which gives me this DataFrame:
a = pd.DataFrame(index)
a
0
0 2018-10-28 00:00:00
1 2018-10-28 00:01:00
2 2018-10-28 00:02:00
3 2018-10-28 00:03:00
4 2018-10-28 00:04:00
5 2018-10-28 00:05:00
6 2018-10-28 00:06:00
7 2018-10-28 00:07:00
8 2018-10-28 00:08:00
9 2018-10-28 00:09:00
10 2018-10-28 00:10:00
...
[100 rows x 1 columns]
Then I did this:
a['period'] = a.apply(lambda x: "period_%d" % (int(x[0].minute / 10) + 1), axis=1)
And I got this result:
0 period
0 2018-10-28 00:00:00 period_1
1 2018-10-28 00:01:00 period_1
2 2018-10-28 00:02:00 period_1
3 2018-10-28 00:03:00 period_1
4 2018-10-28 00:04:00 period_1
5 2018-10-28 00:05:00 period_1
6 2018-10-28 00:06:00 period_1
7 2018-10-28 00:07:00 period_1
8 2018-10-28 00:08:00 period_1
9 2018-10-28 00:09:00 period_1
10 2018-10-28 00:10:00 period_2
11 2018-10-28 00:11:00 period_2
12 2018-10-28 00:12:00 period_2
13 2018-10-28 00:13:00 period_2
14 2018-10-28 00:14:00 period_2
15 2018-10-28 00:15:00 period_2
...
I hope this helps.
Answer 2: (score: 1)
I saved your DataFrame in Notepad and named the file timestamp.txt.
Then I wrote this simple code:
import pandas as pd
timestamp = pd.read_csv("C:\\...path_of_your_file...\\timestamp.txt") # read file
timestamp['10_Minute_Period'] = 0 # add column and initialize it to zero
numb_groups = int((timestamp.shape[0])/10) # calculate number of groups
groups = 1 # initialize number of groups to one
while groups <= numb_groups+1:
    for idx, _ in timestamp.iterrows(): # iterate over row indexes
        # check if current row is below the group and the value is equal to 0
        if idx < groups*10 and timestamp.at[idx,'10_Minute_Period'] == 0:
            # in this case, write corresponding Period
            timestamp.loc[idx,'10_Minute_Period'] = ('Period' + str(groups))
    groups += 1 # increment groups and check while condition
print(timestamp) # print the final modified timestamp
Hope that helps!
Answer 3: (score: -1)
df['timestamp'] = pd.to_datetime(df['timestamp'])
diffs = df['timestamp'] - df['timestamp'].shift()
laps = diffs > pd.Timedelta('10 min')
periods = laps.cumsum().apply(lambda x: 'period_{}'.format(x+1))
df['10min_period'] = periods
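The snippet above starts a new label whenever the gap between two consecutive rows exceeds 10 minutes. That is not the same as counting 10 minutes from the first row of each period. A small sketch with made-up timestamps (an assumption for illustration, not data from the question) shows how it behaves:

```python
import pandas as pd

# Illustrative timestamps: rows are 6 minutes apart, then a 28-minute gap.
ts = pd.Series(pd.to_datetime([
    "2018-10-05 23:00:00",
    "2018-10-05 23:06:00",
    "2018-10-05 23:12:00",
    "2018-10-05 23:40:00",
]))

gaps = ts.diff()                           # time since the previous row (NaT for the first)
new_period = gaps > pd.Timedelta("10min")  # True where a gap exceeds 10 minutes
gap_labels = ("period_" + (new_period.cumsum() + 1).astype(str)).tolist()
print(gap_labels)
```

Here the third row is 12 minutes after the first but stays in period_1, because no single gap before it exceeds 10 minutes; only the 28-minute gap opens period_2.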