来自Pandas数据的直方图并指定groupby time

时间:2016-06-03 16:55:41

标签: python pandas

如何制作直方图并从以下条形输出中找到90%和95%的值:

bars = ticks.Volume.resample('1s', how=sum)
bars = bars.dropna()
bars

    Timestamp
2015-12-27 23:00:25     1.0
2015-12-27 23:01:11    10.0
2015-12-27 23:02:03     1.0
2015-12-27 23:02:14     2.0
2015-12-27 23:07:27     1.0
2015-12-27 23:14:58     2.0
2015-12-27 23:17:45     1.0
2015-12-27 23:21:38     1.0
2015-12-27 23:37:29     2.0
2015-12-27 23:37:32     1.0
2015-12-27 23:47:35     2.0
2015-12-27 23:47:38    12.0
2015-12-28 00:18:48     1.0
2015-12-28 00:26:19     1.0
2015-12-28 00:42:52     4.0
2015-12-28 01:25:52     1.0
2015-12-28 01:38:52     4.0
2015-12-28 02:03:47     4.0
2015-12-28 02:04:25     4.0
2015-12-28 02:39:15     3.0
2015-12-28 02:54:11     5.0
2015-12-28 03:07:43     1.0
2015-12-28 03:20:04     1.0
2015-12-28 03:30:00     6.0
2015-12-28 03:42:16     1.0
2015-12-28 04:11:03     6.0
2015-12-28 05:13:37     1.0
2015-12-28 05:15:20     1.0
2015-12-28 05:45:51     2.0
2015-12-28 05:48:14    29.0

另外,我如何将此限制为仅限09:30 - 16:15?我应该使用groupby吗?如果是,请告诉我们如何做到这一点?

感谢

1 个答案:

答案 0 :(得分:2)

使用Numpy的histogrampercentile方法可以轻松完成这些任务。

但首先我们通过首先将索引转换为datetime对象来按时间过滤。在下面的示例中,我更改了目标时间,以在示例数据框中包含观察值。

import numpy as np
import pandas as pd

#EDIT: added code to rename a column
##
# Rename column
##
bars.columns  # check the original column names
>>>Index([u'Unnamed: 1'], dtype='object')

# rename the 'Unnamed: 1' column
bars.rename(columns={'Unnamed: 1': 'Value'}, inplace=True)

bars.columns
>>>Index([u'Value'], dtype='object')

##
# Filter by time of day
##

# Convert to a datetime. 
# WARNING this is operation is very expensive. For very large dataframes, it is much faster
# to use keep the indices as text and use a different filtering function.
bars.index = bars.index.to_datetime()

# Changed the target times to include values in the sample df
start = (2, 30)
end = (5, 15)

# Filter to only keeps times of day that fall within the desired times
idx = pd.Series(bars.index).apply(lambda x: x.replace(hour=start[0], minute=start[1]) < x < x.replace(hour=end[0], minute=end[1])).values

bars_filtered = bars[idx]

bars_filtered 
                     Value
2015-12-28 02:03:47    4.0
2015-12-28 02:04:25    4.0
2015-12-28 02:39:15    3.0
2015-12-28 02:54:11    5.0
2015-12-28 03:07:43    1.0
2015-12-28 03:20:04    1.0
2015-12-28 03:30:00    6.0
2015-12-28 03:42:16    1.0
2015-12-28 04:11:03    6.0

计算直方图和百分位非常容易。

##
# Histograms and Percentiles
##

# Get the histograms
num_bins = 10
hist, edges = np.histogram(bars.Value, bins=num_bins)

hist
array([20,  7,  0,  2,  0,  0,  0,  0,  0,  1])

# Edges defining the histogram bins
edges
array([  1. ,   3.8,   6.6,   9.4,  12.2,  15. ,  17.8,  20.6,  23.4,
    26.2,  29. ])

# Calculate the percentiles
p_90 = np.percentile(bars_filtered.Value, q=90)
p_95 = np.percentile(bars_filtered.Value, q=95)

p_90
6.0

p_95
6.0