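I want to compare two data sets. One is measured meteorological values, recorded roughly every 15 minutes but not at consistent times within each hour (i.e. 12:03, 1:05, 2:01, etc.). The other data set is modeled data located exactly on the hour. I would like to pull the value from the measured data that is closest to each hour mark so that I can join it with the modeled data.
I currently have both sets as DataFrames and have created an hourly time series to use as an index. Does anyone know a simple way to align these without looping through all of the data?
Thanks.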
Using the df.resample('H', how='ohlc') approach, I get the following error:
Traceback (most recent call last):
File "<pyshell#81>", line 1, in <module>
df.resample('H', how='ohlc')
File "C:\Python33\lib\site-packages\pandas\core\generic.py", line 290, in resample
return sampler.resample(self)
File "C:\Python33\lib\site-packages\pandas\tseries\resample.py", line 83, in resample
rs = self._resample_timestamps(obj)
File "C:\Python33\lib\site-packages\pandas\tseries\resample.py", line 226, in _resample_timestamps
result = grouped.aggregate(self._agg_method)
File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1695, in aggregate
return getattr(self, arg)(*args, **kwargs)
File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 427, in ohlc
return self._cython_agg_general('ohlc')
File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1618, in _cython_agg_general
new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1656, in _cython_agg_blocks
result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 818, in aggregate
raise NotImplementedError
NotImplementedError
A sample of my DataFrame looks like this:
D
2008-01-01 00:01:00 274.261108
2008-01-01 00:11:00 273.705566
2008-01-01 00:31:00 273.705566
2008-01-01 00:41:00 273.705566
2008-01-01 01:01:00 273.705566
2008-01-01 01:11:00 273.705566
2008-01-01 01:31:00 273.705566
2008-01-01 01:41:00 273.705566
2008-01-01 02:01:00 273.705566
2008-01-01 02:11:00 273.149994
Edit: This may be an error specific to Python 3.3. Can anyone confirm this?
Answer 0 (score: 2)
I think pandas.DataFrame.resample() is what you need. You can look up the method of resampling you want; for example, choosing 'ohlc':
>>> df = pd.DataFrame({'data':[1,4,3,2,7,3]}, index=pd.DatetimeIndex(['2013-11-05 12:03', '2013-11-05 12:14','2013-11-05 12:29','2013-11-05 12:46','2013-11-05 13:01','2013-11-05 13:16']))
>>> df.resample('H', how='ohlc')
data
open high low close
2013-11-05 12:00:00 1 4 1 2
2013-11-05 13:00:00 7 7 3 3
After that, you can simply use pandas.DataFrame.join().
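If what you want is literally the single measurement nearest each hour mark (rather than open/high/low/close for the whole hour), something like the following might work. This is only a sketch: the `modeled` frame and its `model` column are made-up placeholders for your modeled data, the 10-minute tolerance is an assumption, and `reindex(..., method='nearest')` expects the measured index to be sorted and free of duplicates.

```python
import pandas as pd

# Measured values at irregular times (sample data from the question).
measured = pd.DataFrame(
    {"D": [274.261108, 273.705566, 273.705566, 273.705566, 273.705566,
           273.705566, 273.705566, 273.705566, 273.705566, 273.149994]},
    index=pd.DatetimeIndex([
        "2008-01-01 00:01", "2008-01-01 00:11", "2008-01-01 00:31",
        "2008-01-01 00:41", "2008-01-01 01:01", "2008-01-01 01:11",
        "2008-01-01 01:31", "2008-01-01 01:41", "2008-01-01 02:01",
        "2008-01-01 02:11",
    ]),
)

# Hypothetical modeled data, exactly on the hour (values are placeholders).
modeled = pd.DataFrame(
    {"model": [274.0, 273.8, 273.5]},
    index=pd.date_range("2008-01-01 00:00", periods=3, freq="60min"),
)

# For each hourly timestamp, take the measured row closest in time.
# Hours with no measurement within the tolerance become NaN.
nearest = measured.reindex(
    modeled.index, method="nearest", tolerance=pd.Timedelta(minutes=10)
)

# Both frames now share the hourly index, so a plain join lines them up.
combined = modeled.join(nearest)
print(combined)
```

The tolerance is optional; without it, an hour with a long gap in the measurements would silently pick up a value from far away.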
Update: That's strange. I tried it on your DataFrame:
>>> df = pd.DataFrame({'D':[274.261108,273.705566,273.705566,273.705566,273.705566,273.705566,273.705566,273.705566,273.705566,273.149994]})
>>> df.index = pd.DatetimeIndex(['2008.01.01 00:01:00','2008.01.01 00:11:00','2008.01.01 00:31:00','2008.01.01 00:41:00','2008.01.01 01:01:00','2008.01.01 01:11:00','2008.01.01 01:31:00','2008.01.01 01:41:00','2008.01.01 02:01:00','2008.01.01 02:11:00'])
>>> df.resample('H', how='ohlc')
D
open high low close
2008-01-01 00:00:00 274.261108 274.261108 273.705566 273.705566
2008-01-01 01:00:00 273.705566 273.705566 273.705566 273.705566
2008-01-01 02:00:00 273.705566 273.705566 273.149994 273.149994
It works fine.
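A note for readers on newer pandas: as far as I know, the `how=` keyword was deprecated and later removed from `resample()`, so the same call is now written with method chaining. A minimal sketch on the question's sample data follows. I resample a single column so the result has flat open/high/low/close columns, and I use '60min' instead of 'H' because the spelling of the hourly alias has changed between releases.

```python
import pandas as pd

df = pd.DataFrame(
    {"D": [274.261108, 273.705566, 273.705566, 273.705566, 273.705566,
           273.705566, 273.705566, 273.705566, 273.705566, 273.149994]},
    index=pd.DatetimeIndex([
        "2008-01-01 00:01", "2008-01-01 00:11", "2008-01-01 00:31",
        "2008-01-01 00:41", "2008-01-01 01:01", "2008-01-01 01:11",
        "2008-01-01 01:31", "2008-01-01 01:41", "2008-01-01 02:01",
        "2008-01-01 02:11",
    ]),
)

# Group the measurements into hourly bins and compute
# first/max/min/last (open/high/low/close) for each bin.
hourly = df["D"].resample("60min").ohlc()
print(hourly)
```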