Pandas groupby over time ranges

Time: 2017-04-21 17:40:16

Tags: python pandas dataframe group-by binning

I have a pandas DataFrame (stored in a .csv file) in the following format.

val,date,time
0.001,01JAN90,0:00:00
0.002,01JAN90,0:01:00
0.005,01JAN90,0:02:00
0.056,01JAN90,0:03:00
...
0.067,31DEC90,23:55:00
0.007,31DEC90,23:56:00
0.006,31DEC90,23:57:00
0.004,31DEC90,23:58:00
0.003,31DEC90,23:59:00

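That is, a single float (val column) for every minute (time column) of every day (date column) of one year. I need to group the val elements, across the whole year, that fall within given hour ranges. I define the 15 one-hour ranges as: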

t_range = [['5:30:00', '6:30:00'], ['6:30:00', '7:30:00'], ...,
['19:30:00', '20:30:00']]

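The answer given in "Pandas Groupby Range of Values" handles ranges defined as floats, but my ranges are defined as strings.

My idea is that I first need to convert all HH:MM:SS values in time to floats and then apply the groupby solution from that answer. Is this the right approach? If not, how should I do this with pandas?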
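For illustration only, the conversion described above could be sketched as follows. The sample values are hypothetical, and `pd.to_timedelta` is just one possible way to turn HH:MM:SS strings into numbers (here, minutes since midnight).

```python
import pandas as pd

# Hypothetical sample of the 'time' column from the question.
times = pd.Series(['0:00:00', '5:30:00', '23:59:00'])

# Parse each HH:MM:SS string as a duration since midnight,
# then express that duration in minutes as a float.
minutes = pd.to_timedelta(times).dt.total_seconds() / 60
print(minutes.tolist())
```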

1 Answer:

Answer 0: (score: 2)

IIUC you can do it this way:

import numpy as np
import pandas as pd

# range boundaries and bin width, in minutes since midnight
start = 5*60+30
end = 20*60+30
step = 60

df['ts'] = pd.to_datetime(df.date + ' ' + df.time, format='%d%b%y %H:%M:%S')
df['mins'] = df.ts.dt.hour*60 + df.ts.dt.minute

# filter out all "non-interesting" entries
x = df.query("@start <= mins <= @end")

bins = np.arange(start-step, end+step, step)
# label each bin "(HH:MM:00, HH:MM:00]" from its left and right edge
labels = ['({0[0]:02d}:{0[1]:02d}:00, {1[0]:02d}:{1[1]:02d}:00]'.format(divmod(b, 60),
                                                                        divmod(b+step, 60))
          for b in bins[:-1]]
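
# on recent pandas versions empty bins sum to 0 instead of NaN;
# passing observed=True to groupby keeps only the non-empty bins
x.groupby(pd.cut(x['mins'], bins=bins, labels=labels))['val'].sum().dropna()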

Result:

In [164]: x.groupby(pd.cut(x['mins'], bins=bins, labels=labels))['val'].sum().dropna()
Out[164]:
mins
(05:30:00, 06:30:00]    0.006
(06:30:00, 07:30:00]    0.004
(07:30:00, 08:30:00]    0.003
(08:30:00, 09:30:00]    0.111
(09:30:00, 10:30:00]    0.001
(10:30:00, 11:30:00]    0.002
(11:30:00, 12:30:00]    0.005
(12:30:00, 13:30:00]    0.056
Name: val, dtype: float64

Source DF:

In [166]: df
Out[166]:
     val     date      time
0  0.067  01DEC90  04:00:00
1  0.007  01DEC90  05:00:00
2  0.006  01DEC90  06:00:00
3  0.004  01DEC90  07:00:00
4  0.003  01DEC90  08:00:00
5  0.111  01DEC90  09:00:00
6  0.001  01JAN90  10:00:00
7  0.002  01JAN90  11:00:00
8  0.005  01JAN90  12:00:00
9  0.056  01JAN90  13:00:00
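To try this without a CSV file, the source DF above can be built inline. The following is a minimal sketch, not the exact code from the answer: it leaves out the string labels, uses `Series.between` in place of `query`, and passes `observed=True` (available in newer pandas versions) so that only non-empty bins are returned.

```python
import numpy as np
import pandas as pd

# Rebuild the ten-row sample frame shown above.
df = pd.DataFrame({
    'val': [0.067, 0.007, 0.006, 0.004, 0.003, 0.111, 0.001, 0.002, 0.005, 0.056],
    'date': ['01DEC90'] * 6 + ['01JAN90'] * 4,
    'time': ['%02d:00:00' % h for h in range(4, 14)],
})

# Range boundaries and bin width, in minutes since midnight.
start, end, step = 5 * 60 + 30, 20 * 60 + 30, 60

# Parse date and time, then reduce each timestamp to minutes since midnight.
ts = pd.to_datetime(df['date'] + ' ' + df['time'], format='%d%b%y %H:%M:%S')
df['mins'] = ts.dt.hour * 60 + ts.dt.minute

# Keep only rows inside [start, end], then bin them hour by hour.
x = df[df['mins'].between(start, end)]
bins = np.arange(start - step, end + step, step)
res = x.groupby(pd.cut(x['mins'], bins=bins), observed=True)['val'].sum()
print(res)
```

Without `labels`, the index consists of numeric intervals in minutes rather than the HH:MM:SS strings shown in the result.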

Explanation:

bins (in minutes):

In [181]: bins
Out[181]: array([ 270,  330,  390,  450,  510,  570,  630,  690,  750,  810,  870,  930,  990, 1050, 1110, 1170, 1230])
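By default `pd.cut` builds right-closed intervals, so a value exactly on an edge belongs to the bin that ends there. That is presumably why the bins start one step before `start`: a row at exactly 05:30 (330 minutes) passes the filter and needs a bin to land in. A small sketch with hypothetical probe values:

```python
import numpy as np
import pandas as pd

start, end, step = 5 * 60 + 30, 20 * 60 + 30, 60
bins = np.arange(start - step, end + step, step)

# Hypothetical probe values in minutes: exactly 05:30, 05:31 and exactly 20:30.
probe = pd.Series([330, 331, 1230])
assigned = pd.cut(probe, bins=bins)

# Show which right-closed interval each probe value falls into.
for minute, interval in zip(probe, assigned):
    print(minute, '->', interval)
```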

labels:

In [182]: labels
Out[182]:
['(04:30:00, 05:30:00]',
 '(05:30:00, 06:30:00]',
 '(06:30:00, 07:30:00]',
 '(07:30:00, 08:30:00]',
 '(08:30:00, 09:30:00]',
 '(09:30:00, 10:30:00]',
 '(10:30:00, 11:30:00]',
 '(11:30:00, 12:30:00]',
 '(12:30:00, 13:30:00]',
 '(13:30:00, 14:30:00]',
 '(14:30:00, 15:30:00]',
 '(15:30:00, 16:30:00]',
 '(16:30:00, 17:30:00]',
 '(17:30:00, 18:30:00]',
 '(18:30:00, 19:30:00]',
 '(19:30:00, 20:30:00]']
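If the ranges are only available as strings, as in the question's `t_range`, the bin edges and labels can also be derived from those strings. The sketch below is an assumption-laden illustration: it uses only the first start and last end given in the question and assumes the elided middle ranges are contiguous one-hour steps. `to_minutes` and `fmt` are hypothetical helper names.

```python
import pandas as pd

# First start and last end from the question's t_range; the elided ranges
# in between are ASSUMED to be contiguous one-hour steps.
first_start, last_end = '5:30:00', '20:30:00'

def to_minutes(hms):
    """Convert an 'H:MM:SS' string to whole minutes since midnight."""
    return int(pd.to_timedelta(hms).total_seconds() // 60)

def fmt(minutes):
    """Format minutes since midnight as 'HH:MM:00'."""
    return '{:02d}:{:02d}:00'.format(*divmod(minutes, 60))

# 16 edges give 15 one-hour bins.
edges = list(range(to_minutes(first_start), to_minutes(last_end) + 1, 60))
labels = ['({}, {}]'.format(fmt(a), fmt(b)) for a, b in zip(edges[:-1], edges[1:])]
print(labels)
```

With these edges the first bin starts at 05:30 itself, so a value at exactly 05:30 is left out unless `include_lowest=True` is passed to `pd.cut`.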