Pandas: split data into 15-minute periods

Time: 2017-07-05 15:25:05

Tags: pandas timedelta

My dataframe with dates looks like this:

event_time
2017-01-17 00:12:50      
2016-12-05 01:00:21      
2016-12-04 01:14:36     
2016-12-04 01:04:03     
2016-12-04 02:28:23     
2016-12-04 02:46:49      
2016-12-04 01:58:04

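I need to get a column period, where the 15-minute periods start at 00:00:00 and the day, month and year do not matter. Time 00:00:00 - 00:15:00 is period 1, 00:15:01 - 00:30:00 is period 2, and so on. If I use df = df.groupby(pd.TimeGrouper(freq='15Min')) I get the wrong result, because it also takes the date into account. But I need only the time.

Desired output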

event_time            period
2017-01-17 00:12:50   1      
2016-12-05 01:00:21   4    
2016-12-04 01:14:36   4 
2016-12-04 01:04:03   4 
2016-12-04 02:28:23   10  
2016-12-04 02:46:49   12 
2016-12-04 01:58:04   8

How can I do that?

2 answers:

Answer 0: (score: 1)

import pandas as pd

df = pd.DataFrame(pd.to_datetime([
    "2017-01-17 00:12:50",    
    "2016-12-05 01:00:21",      
    "2016-12-04 01:14:36",     
    "2016-12-04 01:04:03",     
    "2016-12-04 02:28:23",     
    "2016-12-04 02:46:49",      
    "2016-12-04 01:58:04"]),
    columns=['timestamp']
    )

Then compute period:

df['period'] = df.timestamp.apply(lambda ts: 1 + ts.hour * 4 + ts.minute // 15)

This gives the following output:

            timestamp  period
0 2017-01-17 00:12:50       1
1 2016-12-05 01:00:21       5
2 2016-12-04 01:14:36       5
3 2016-12-04 01:04:03       5
4 2016-12-04 02:28:23      10
5 2016-12-04 02:46:49      12
6 2016-12-04 01:58:04       8

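There is a small difference between your output and mine in rows 1, 2 and 3: for example, 01:00:21 should be 5, because the first hour contains four periods and the fifth one starts with the second hour.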
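
To check the arithmetic by hand, here is a minimal sketch of the same formula using only the standard library. The helper name period_of and the extra 23:59:59 sample are my own additions for illustration, not part of the answer above.

```python
from datetime import datetime

def period_of(ts: datetime) -> int:
    """Return the 1-based 15-minute period of the day that ts falls in."""
    # Four periods per full hour, plus the quarter inside the current hour,
    # plus 1 because the numbering starts at 1.
    return 1 + ts.hour * 4 + ts.minute // 15

samples = [
    "2017-01-17 00:12:50",
    "2016-12-05 01:00:21",
    "2016-12-04 02:46:49",
    "2016-12-04 23:59:59",  # illustrative extra sample: last period of the day
]
periods = [period_of(datetime.strptime(s, "%Y-%m-%d %H:%M:%S")) for s in samples]
print(periods)  # [1, 5, 12, 96]
```

The date part never enters the calculation, so the same time of day always maps to the same period, and a full day has 24 * 4 = 96 periods.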

Answer 1: (score: 1)

A new solution with dt.hour and dt.minute:

df['label'] = df['event_time'].dt.hour * 4 + df['event_time'].dt.minute // 15 + 1
print (df)
           event_time  label
0 2017-01-17 00:12:50      1
1 2016-12-05 01:00:21      5
2 2016-12-04 01:14:36      5
3 2016-12-04 01:04:03      5
4 2016-12-04 02:28:23     10
5 2016-12-04 02:46:49     12
6 2016-12-04 01:58:04      8

Timings:

rng = pd.date_range('2017-04-03', periods=100000, freq='27T')
df = pd.DataFrame({'timestamp': rng})

df['label'] = df['timestamp'].dt.hour * 4 + df['timestamp'].dt.minute // 15 + 1
df['period'] = df.timestamp.apply(lambda ts: 1 + ts.hour * 4 + ts.minute // 15)
print (df)

In [172]: %timeit df['timestamp'].dt.hour * 4 + df['timestamp'].dt.minute // 15 + 1
10 loops, best of 3: 20.2 ms per loop

In [173]: %timeit df.timestamp.apply(lambda ts: 1 + ts.hour * 4 + ts.minute // 15)
1 loop, best of 3: 301 ms per loop

Old solution (works, but is a bit complicated):

You can first convert the datetimes to strings with strftime, then to timedeltas with to_timedelta, and finally to seconds with total_seconds.

Then use cut or searchsorted:

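# parentheses are needed so the chained calls can span several lines
df['tot'] = (pd.to_timedelta(df['event_time'].dt.strftime('%H:%M:%S'))
               .dt.total_seconds()
               .astype(int))

# one extra right edge is needed to close the last bin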
bins = np.concatenate([np.arange(24 * 4) * 900, np.array([100000])])
labels = np.arange(1, 24 * 4 + 1)
df['label'] = pd.cut(df['tot'], bins=bins, labels=labels)
df = df.assign(label1=np.searchsorted(bins, df['tot']))
print (df)
           event_time    tot label  label1
0 2017-01-17 00:12:50    770     1       1
1 2016-12-05 01:00:21   3621     5       5
2 2016-12-04 01:14:36   4476     5       5
3 2016-12-04 01:04:03   3843     5       5
4 2016-12-04 02:28:23   8903    10      10
5 2016-12-04 02:46:49  10009    12      12
6 2016-12-04 01:58:04   7084     8       8
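
One thing worth checking is how these approaches treat timestamps that fall exactly on a quarter-hour boundary. The sketch below is not part of the original answer: the three sample times are chosen purely for illustration, and the seconds are computed directly from dt.hour, dt.minute and dt.second instead of going through strftime.

```python
import numpy as np
import pandas as pd

# Illustrative boundary cases: midnight, exactly 00:15:00, and one second later.
times = pd.Series(pd.to_datetime([
    "2016-12-04 00:00:00",
    "2016-12-04 00:15:00",
    "2016-12-04 00:15:01",
]))

# Seconds since midnight: 0, 900, 901
tot = times.dt.hour * 3600 + times.dt.minute * 60 + times.dt.second

# Same bins and labels as in the answer above.
bins = np.concatenate([np.arange(24 * 4) * 900, np.array([100000])])
labels = np.arange(1, 24 * 4 + 1)

by_cut = pd.cut(tot, bins=bins, labels=labels)         # NaN, 1, 2
by_search = np.searchsorted(bins, tot.to_numpy())      # 0, 1, 2
by_formula = times.dt.hour * 4 + times.dt.minute // 15 + 1  # 1, 2, 2

print(pd.DataFrame({
    "time": times.dt.strftime("%H:%M:%S"),
    "cut": by_cut,
    "searchsorted": by_search,
    "formula": by_formula,
}))
```

With these bins, 00:15:00 lands in period 1, which is what the question asks for, while the floor-division formula puts it in period 2. Exactly 00:00:00 falls outside the first right-closed interval of cut (giving NaN) and gets index 0 from searchsorted; passing include_lowest=True to cut should pull midnight into period 1 if that matters for your data.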

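A similar solution that keeps tot as a standalone Series instead of adding it as a column:

# parentheses are needed so the chained calls can span several lines
tot = (pd.to_timedelta(df['event_time'].dt.strftime('%H:%M:%S'))
         .dt.total_seconds()
         .astype(int))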

bins = np.concatenate([np.arange(24 * 4) * 900, np.array([100000])])
labels = np.arange(1, 24 * 4 + 1)
df['label'] = pd.cut(tot, bins=bins, labels=labels)

df = df.assign(label1=np.searchsorted(bins, tot))
print (df)
           event_time label  label1
0 2017-01-17 00:12:50     1       1
1 2016-12-05 01:00:21     5       5
2 2016-12-04 01:14:36     5       5
3 2016-12-04 01:04:03     5       5
4 2016-12-04 02:28:23    10      10
5 2016-12-04 02:46:49    12      12
6 2016-12-04 01:58:04     8       8
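
If you prefer to work in seconds but want to skip the round trip through strings, one possible variant (my own sketch, not taken from either answer) is to subtract midnight of the same day with dt.normalize and floor-divide by 900. It reproduces the floor-division labels from the new solution, so exact boundaries such as 00:15:00 go to the later period.

```python
import pandas as pd

df = pd.DataFrame({"event_time": pd.to_datetime([
    "2017-01-17 00:12:50",
    "2016-12-05 01:00:21",
    "2016-12-04 01:14:36",
    "2016-12-04 01:04:03",
    "2016-12-04 02:28:23",
    "2016-12-04 02:46:49",
    "2016-12-04 01:58:04",
])})

# Timedelta since midnight of the same day; the date itself drops out here.
since_midnight = df["event_time"] - df["event_time"].dt.normalize()

# 900 seconds per period; add 1 for 1-based numbering.
df["period"] = (since_midnight.dt.total_seconds() // 900).astype(int) + 1
print(df)
```

For the sample data this gives 1, 5, 5, 5, 10, 12, 8, the same labels as in the tables above.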