Creating bins for a timestamp column

Time: 2019-12-08 07:24:21

Tags: python python-3.x pandas data-science bins

I am trying to create suitable bins for a timestamp interval column, using code like this:

df['Bin'] = pd.cut(df['interval_length'], bins=pd.to_timedelta(['00:00:00','00:10:00','00:20:00','00:30:00','00:40:00','00:50:00','00:60:00']))

The resulting df looks like this:

time_interval  |           bin
  00:17:00        (0 days 00:10:00, 0 days 00:20:00]
  01:42:00                NaN
  00:15:00        (0 days 00:10:00, 0 days 00:20:00]
  00:00:00                NaN
  00:06:00        (0 days 00:00:00, 0 days 00:10:00]

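This is a bit off. I want the bins to show time values only, without the days component, and I also want a final interval that runs from 60 minutes up to inf (or beyond). The rows with 00:00:00 and 01:42:00 should fall into a bin instead of being NaN.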

Desired output:

time_interval  |           bin
      00:17:00        (00:10:00,00:20:00]
      01:42:00        (00:60:00,inf]
      00:15:00        (00:10:00,00:20:00]
      00:00:00        (00:00:00,00:10:00]
      00:06:00        (00:00:00,00:10:00]

Thanks for looking!

2 answers:

Answer 0: (score: 1)

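pandas has no inf for timedeltas, so the maximum value (pd.Timedelta.max) is used as the last bin edge instead. If you want the bins to be filled with timedeltas, also pass include_lowest=True so that the lowest value (00:00:00) is included in the first bin: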

b = pd.to_timedelta(['00:00:00','00:10:00','00:20:00',
                     '00:30:00','00:40:00',
                     '00:50:00','00:60:00'])
b = b.append(pd.Index([pd.Timedelta.max]))
df['Bin'] = pd.cut(df['time_interval'],  include_lowest=True, bins=b)
print (df)
  time_interval                                             Bin
0      00:17:00              (0 days 00:10:00, 0 days 00:20:00]
1      01:42:00  (0 days 01:00:00, 106751 days 23:47:16.854775]
2      00:15:00              (0 days 00:10:00, 0 days 00:20:00]
3      00:00:00     (-1 days +23:59:59.999999, 0 days 00:10:00]
4      00:06:00     (-1 days +23:59:59.999999, 0 days 00:10:00]
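
To see what include_lowest=True changes, here is a small self-contained sketch. The sample frame is reconstructed from the output shown in the question, the names edges, without_lowest and with_lowest are made up for illustration, and the last regular edge is written as '01:00:00' rather than '00:60:00':

```python
import pandas as pd

# Sample data reconstructed from the question's output (an assumption).
df = pd.DataFrame({
    "time_interval": pd.to_timedelta(
        ["00:17:00", "01:42:00", "00:15:00", "00:00:00", "00:06:00"]
    )
})

# Ten-minute edges up to one hour, plus the largest representable
# Timedelta as a stand-in for "inf".
edges = pd.to_timedelta(
    ["00:00:00", "00:10:00", "00:20:00", "00:30:00",
     "00:40:00", "00:50:00", "01:00:00"]
)
edges = edges.append(pd.Index([pd.Timedelta.max]))

# Bins are right-closed by default, so a value equal to the first edge
# (00:00:00) is outside (00:00:00, 00:10:00] unless include_lowest=True.
without_lowest = pd.cut(df["time_interval"], bins=edges)
with_lowest = pd.cut(df["time_interval"], bins=edges, include_lowest=True)

print(without_lowest.isna().tolist())  # [False, False, False, True, False]
print(with_lowest.isna().tolist())     # [False, False, False, False, False]
```

The side effect is visible in the output above: pandas moves the left edge of the first bin slightly below zero, which is why it is printed as -1 days +23:59:59.999999.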

If you want strings instead of timedeltas, use zip to build the labels, with 'inf' appended as the final upper bound:

vals = ['00:00:00','00:10:00','00:20:00',
        '00:30:00','00:40:00', '00:50:00','00:60:00']

b = pd.to_timedelta(vals).append(pd.Index([pd.Timedelta.max]))

vals.append('inf')
labels = ['{}-{}'.format(i, j) for i, j in zip(vals[:-1], vals[1:])] 

df['Bin'] = pd.cut(df['time_interval'],  include_lowest=True, bins=b, labels=labels)
print (df)
  time_interval                Bin
0      00:17:00  00:10:00-00:20:00
1      01:42:00       00:60:00-inf
2      00:15:00  00:10:00-00:20:00
3      00:00:00  00:00:00-00:10:00
4      00:06:00  00:00:00-00:10:00
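
If the labels should match the "(lower,upper]" format from the question exactly, one possible variation is to derive them from the edges themselves. This is only a sketch: format_timedelta is a hypothetical helper, not a pandas function, the sample frame is reconstructed from the question, and the one-hour edge is rendered as 01:00:00 rather than 00:60:00:

```python
import pandas as pd

def format_timedelta(td):
    """Render a Timedelta as HH:MM:SS (hypothetical helper, hours may exceed 23)."""
    total_seconds = int(td.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

# Regular ten-minute edges; the open-ended last bin is added below.
edges = pd.to_timedelta(
    ["00:00:00", "00:10:00", "00:20:00", "00:30:00",
     "00:40:00", "00:50:00", "01:00:00"]
)

# One label per bin: pair each edge name with the next one, ending in 'inf'.
names = [format_timedelta(e) for e in edges] + ["inf"]
labels = [f"({lo},{hi}]" for lo, hi in zip(names[:-1], names[1:])]

# pd.Timedelta.max stands in for infinity as the final edge.
bins = edges.append(pd.Index([pd.Timedelta.max]))

# Sample data reconstructed from the question (an assumption).
df = pd.DataFrame({
    "time_interval": pd.to_timedelta(
        ["00:17:00", "01:42:00", "00:15:00", "00:00:00", "00:06:00"]
    )
})
df["Bin"] = pd.cut(df["time_interval"], bins=bins, labels=labels,
                   include_lowest=True)
print(df)
```

Generating the labels from the edges keeps the two lists in sync when the bin width changes, at the cost of a small formatting helper.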

Answer 1: (score: 1)

You can solve this using labels alone:

df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00','00:10:00','00:20:00','00:30:00',
                                         '00:40:00','00:50:00','00:60:00', '24:00:00']),
                   labels=['(00:00:00,00:10:00]', '(00:10:00,00:20:00]',
                           '(00:20:00,00:30:00]', '(00:30:00,00:40:00]',
                           '(00:40:00,00:50:00]', '(00:50:00,00:60:00]',
                           '(00:60:00,inf]'])
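
One caveat with a fixed 24-hour upper edge: the last bin is labelled inf, but anything longer than 24 hours falls outside every bin and comes back as NaN. The sketch below compares that edge with pd.Timedelta.max. The three sample durations and the names capped and open_ended are invented for illustration:

```python
import pandas as pd

labels = ['(00:00:00,00:10:00]', '(00:10:00,00:20:00]',
          '(00:20:00,00:30:00]', '(00:30:00,00:40:00]',
          '(00:40:00,00:50:00]', '(00:50:00,00:60:00]',
          '(00:60:00,inf]']

inner = pd.to_timedelta(
    ["00:00:00", "00:10:00", "00:20:00", "00:30:00",
     "00:40:00", "00:50:00", "01:00:00"]
)
# Upper edge capped at 24 hours versus the largest representable Timedelta.
capped = inner.append(pd.Index([pd.Timedelta(hours=24)]))
open_ended = inner.append(pd.Index([pd.Timedelta.max]))

# Invented sample durations: 6 minutes, 1 h 42 min, and 30 hours.
s = pd.Series([pd.Timedelta(minutes=6),
               pd.Timedelta(minutes=102),
               pd.Timedelta(hours=30)])

capped_result = pd.cut(s, bins=capped, labels=labels)
open_result = pd.cut(s, bins=open_ended, labels=labels)

print(capped_result.isna().tolist())  # [False, False, True]
print(open_result.tolist())
```

If durations above 24 hours cannot occur in the data, the capped version is fine as written. A value of exactly 00:00:00 would still need include_lowest=True to land in the first bin.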