Aligning two time series datasets on the hour (Python, pandas)

Time: 2013-11-05 18:28:59

Tags: python pandas dataframe

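I want to compare two datasets. One contains measured meteorological values, recorded roughly every 15 minutes but not at consistent times past the hour (e.g. 12:03, 1:05, 2:01). The other contains modeled data that falls exactly on the hour. I want to extract the value nearest each hour mark from the measured data so that I can join it with the modeled data.

I currently have both sets as DataFrames and have created an hourly time series to use as the index. Does anyone know a simple way to align them without looping over all the data?

Thanks.

Using df.resample('H', how='ohlc'), I get the following error: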

Traceback (most recent call last):
  File "<pyshell#81>", line 1, in <module>
    df.resample('H', how='ohlc')
  File "C:\Python33\lib\site-packages\pandas\core\generic.py", line 290, in resample
    return sampler.resample(self)
  File "C:\Python33\lib\site-packages\pandas\tseries\resample.py", line 83, in resample
    rs = self._resample_timestamps(obj)
  File "C:\Python33\lib\site-packages\pandas\tseries\resample.py", line 226, in _resample_timestamps
    result = grouped.aggregate(self._agg_method)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1695, in aggregate
    return getattr(self, arg)(*args, **kwargs)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 427, in ohlc
    return self._cython_agg_general('ohlc')
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1618, in _cython_agg_general
    new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1656, in _cython_agg_blocks
    result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 818, in aggregate
    raise NotImplementedError
NotImplementedError

A sample of my DataFrame looks like this:

                              D
2008-01-01 00:01:00  274.261108
2008-01-01 00:11:00  273.705566
2008-01-01 00:31:00  273.705566
2008-01-01 00:41:00  273.705566
2008-01-01 01:01:00  273.705566
2008-01-01 01:11:00  273.705566
2008-01-01 01:31:00  273.705566
2008-01-01 01:41:00  273.705566
2008-01-01 02:01:00  273.705566
2008-01-01 02:11:00  273.149994

Edit: This may be a problem specific to Python 3.3. Can anyone confirm this?

1 answer:

Answer 0 (score: 2)

I think pandas.DataFrame.resample() is what you need. You can choose the method of resampling you want, for example 'ohlc':

>>> df = pd.DataFrame({'data':[1,4,3,2,7,3]}, index=pd.DatetimeIndex(['2013-11-05 12:03', '2013-11-05 12:14','2013-11-05 12:29','2013-11-05 12:46','2013-11-05 13:01','2013-11-05 13:16']))
>>> df.resample('H', how='ohlc')
                     data                  
                     open  high  low  close
2013-11-05 12:00:00     1     4    1      2
2013-11-05 13:00:00     7     7    3      3

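After that, you can simply use pandas.DataFrame.join() to combine the resampled result with the modeled data.

Update: That's strange. I tried it on your DataFrame: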
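In current pandas releases the how= argument to resample() is no longer accepted; the aggregation is called as a method on the resampler instead. The following is a minimal sketch of the resample-then-join step under that newer spelling, not a definitive implementation. The `modeled` frame and its `model` column are made up for illustration, the frequency is written as '60min' because the spelling of the hourly alias differs between pandas versions, and ohlc() is applied to the single column `D` so that the result has flat column names that join cleanly.

```python
import pandas as pd

# Measured values at irregular times (the sample rows from the question).
measured = pd.DataFrame(
    {"D": [274.261108, 273.705566, 273.705566, 273.705566, 273.705566,
           273.705566, 273.705566, 273.705566, 273.705566, 273.149994]},
    index=pd.DatetimeIndex([
        "2008-01-01 00:01", "2008-01-01 00:11", "2008-01-01 00:31",
        "2008-01-01 00:41", "2008-01-01 01:01", "2008-01-01 01:11",
        "2008-01-01 01:31", "2008-01-01 01:41", "2008-01-01 02:01",
        "2008-01-01 02:11",
    ]),
)

# Hypothetical modeled values exactly on the hour (invented for this sketch).
modeled = pd.DataFrame(
    {"model": [274.0, 273.8, 273.5]},
    index=pd.date_range("2008-01-01 00:00", periods=3, freq="60min"),
)

# Aggregate each hour of measurements into open/high/low/close columns.
hourly = measured["D"].resample("60min").ohlc()

# Both frames now share an on-the-hour DatetimeIndex, so join() aligns on it.
joined = hourly.join(modeled)
print(joined)
```

Here 'open' is the first reading in each hour and 'close' is the last, so neither is guaranteed to be the reading nearest the hour mark.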

>>> df = pd.DataFrame({'D':[274.261108,273.705566,273.705566,273.705566,273.705566,273.705566,273.705566,273.705566,273.705566,273.149994]})
>>> df.index = pd.DatetimeIndex(['2008.01.01 00:01:00','2008.01.01 00:11:00','2008.01.01 00:31:00','2008.01.01 00:41:00','2008.01.01 01:01:00','2008.01.01 01:11:00','2008.01.01 01:31:00','2008.01.01 01:41:00','2008.01.01 02:01:00','2008.01.01 02:11:00'])
>>> df.resample('H', how='ohlc')
                              D                                    
                           open        high         low       close
2008-01-01 00:00:00  274.261108  274.261108  273.705566  273.705566
2008-01-01 01:00:00  273.705566  273.705566  273.705566  273.705566
2008-01-01 02:00:00  273.705566  273.705566  273.149994  273.149994

It works fine.
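Since the question asks for the measurement nearest each hour mark rather than an hourly aggregate, another option is to reindex the measured data onto an hourly index with nearest-neighbour matching. This is only a sketch under assumptions: it uses DataFrame.reindex() with method='nearest' as found in recent pandas (not the 2013 release used above), it requires the measured index to be sorted and free of duplicates, and the 30-minute tolerance is an arbitrary choice for illustration.

```python
import pandas as pd

# Measured values at irregular times (the sample rows from the question).
measured = pd.DataFrame(
    {"D": [274.261108, 273.705566, 273.705566, 273.705566, 273.705566,
           273.705566, 273.705566, 273.705566, 273.705566, 273.149994]},
    index=pd.DatetimeIndex([
        "2008-01-01 00:01", "2008-01-01 00:11", "2008-01-01 00:31",
        "2008-01-01 00:41", "2008-01-01 01:01", "2008-01-01 01:11",
        "2008-01-01 01:31", "2008-01-01 01:41", "2008-01-01 02:01",
        "2008-01-01 02:11",
    ]),
)

# The on-the-hour timestamps that the modeled data uses.
hours = pd.date_range("2008-01-01 00:00", periods=3, freq="60min")

# For each hour, take the measured row whose timestamp is closest;
# hours with no reading within 30 minutes would become NaN.
nearest = measured.reindex(hours, method="nearest",
                           tolerance=pd.Timedelta("30min"))
print(nearest)
```

The result is indexed exactly on the hour, so it can be joined to the modeled DataFrame with join() in the same way as the resampled output.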