I'm getting strange behavior from pandas. I want to resample my minute data to hourly data (using the mean). My data looks like this:
Data.head()
AAA BBB
Time
2009-02-10 09:31:00 86.34 101.00
2009-02-10 09:36:00 86.57 100.50
2009-02-10 09:38:00 86.58 99.78
2009-02-10 09:40:00 86.63 99.75
2009-02-10 09:41:00 86.52 99.66
Data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 961276 entries, 2009-02-10 09:31:00 to 2016-02-29 19:59:00
Data columns (total 2 columns):
AAA 961276 non-null float64
BBB 961276 non-null float64
dtypes: float64(2)
memory usage: 22.0 MB
Data.index
Out[25]:
DatetimeIndex(['2009-02-10 09:31:00', '2009-02-10 09:36:00',
'2009-02-10 09:38:00', '2009-02-10 09:40:00',
'2009-02-10 09:41:00', '2009-02-10 09:44:00',
'2009-02-10 09:45:00', '2009-02-10 09:46:00',
'2009-02-10 09:47:00', '2009-02-10 09:48:00',
...
'2016-02-29 19:41:00', '2016-02-29 19:42:00',
'2016-02-29 19:43:00', '2016-02-29 19:50:00',
'2016-02-29 19:52:00', '2016-02-29 19:53:00',
'2016-02-29 19:56:00', '2016-02-29 19:57:00',
'2016-02-29 19:58:00', '2016-02-29 19:59:00'],
dtype='datetime64[ns]', name='Time', length=961276, freq=None)
To resample the data, I do the following:
tframe = '60T'
hr_mean = Data.resample(tframe).mean()
As output, I get a pandas Series with only two numbers:
In[26]: hr_mean
Out[26]:
AAA 156.535198
BBB 30.197029
dtype: float64
I get the same behavior if I choose a different time frame or a different resampling function.
Answer 0 (score: 5)
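The behavior you show is the expected behavior for older pandas versions (pandas < 0.18). Newer pandas versions have a changed resample API, and you have run into one of its tricky cases here.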
Before v0.18, resample used the how keyword to specify how to resample, and it returned the resampled frame/series directly:
In [5]: data = pd.DataFrame(np.random.randn(180, 2), columns=['AAA', 'BBB'], index=pd.date_range("2016-06-01", periods=180, freq='1T'))
# how='mean' is the default, so this is the same as data.resample('60T')
In [6]: data.resample('60T', how='mean')
Out[6]:
AAA BBB
2016-06-01 00:00:00 0.100026 0.210722
2016-06-01 01:00:00 0.093662 -0.078066
2016-06-01 02:00:00 -0.114801 0.002615
# calling .mean() now calculates the mean of each column, resulting in the following series:
In [7]: data.resample('60T', how='mean').mean()
Out[7]:
AAA 0.026296
BBB 0.045090
dtype: float64
In [8]: pd.__version__
Out[8]: u'0.17.1'
Starting with 0.18.0, resample itself is a deferred operation, which means you first have to call a method (in this case mean()) to perform the actual resampling:
In [4]: data.resample('60T')
Out[4]: DatetimeIndexResampler [freq=<60 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]
In [5]: data.resample('60T').mean()
Out[5]:
AAA BBB
2016-06-01 00:00:00 -0.059038 0.102275
2016-06-01 01:00:00 -0.141429 -0.021342
2016-06-01 02:00:00 -0.073341 -0.150091
In [6]: data.resample('60T').mean().mean()
Out[6]:
AAA -0.091270
BBB -0.023052
dtype: float64
In [7]: pd.__version__
Out[7]: '0.18.1'
For an explanation of the API changes, see http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#resample-api.
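As a rough illustration of the deferred API described above, here is a minimal sketch that assumes pandas 0.18 or later. The data is synthetic (random values, not the asker's data), the variable names are only illustrative, and the frequency alias 'min' is used in place of 'T' because recent pandas releases deprecate 'T'.

```python
import numpy as np
import pandas as pd

# Synthetic minute data: 180 rows, i.e. exactly three full hours (illustrative only).
index = pd.date_range("2016-06-01", periods=180, freq="min")
data = pd.DataFrame(np.random.randn(180, 2), columns=["AAA", "BBB"], index=index)

# resample() alone only builds a lazy Resampler object; nothing is aggregated yet.
resampler = data.resample("60min")

# Calling an aggregation once performs the resampling: one row per hour.
hr_mean = resampler.mean()
print(hr_mean.shape)  # (3, 2)

# A second .mean() collapses the hourly frame to one value per column,
# which is the two-number Series shown in the question.
collapsed = hr_mean.mean()
print(type(collapsed).__name__, len(collapsed))  # Series 2
```

If the hourly frame is what you want, stop after the first mean() call. On a pre-0.18 installation, resample('60T') already returns the hourly means, so the extra .mean() in the question is what reduces the result to two numbers.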