I'm playing with some financial time series data in pandas, and I'm trying to resample some timestamped data. Here is the starting data:
start_data
Out[12]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 749880 entries, 2012-07-06 03:00:00 to 2013-09-11 23:59:00
Data columns (total 1 columns):
TickMean 749880 non-null values
dtypes: float64(1)
start_data.TickMean
Out[18]:
2012-07-06 03:00:00 1.541194
2012-07-06 03:01:00 1.541216
2012-07-06 03:02:00 1.541201
2012-07-06 03:03:00 1.541088
2012-07-06 03:04:00 1.540999
2012-07-06 03:05:00 1.541011
2012-07-06 03:06:00 1.541090
2012-07-06 03:07:00 1.541256
2012-07-06 03:08:00 1.541341
2012-07-06 03:09:00 1.541386
2012-07-06 03:10:00 1.541511
2012-07-06 03:11:00 1.541469
2012-07-06 03:12:00 1.541506
2012-07-06 03:13:00 1.541584
2012-07-06 03:14:00 1.541453
...
2013-09-11 23:45:00 1.602015
2013-09-11 23:46:00 1.602015
2013-09-11 23:47:00 1.602015
2013-09-11 23:48:00 1.602015
2013-09-11 23:49:00 1.602015
2013-09-11 23:50:00 1.602015
2013-09-11 23:51:00 1.602015
2013-09-11 23:52:00 1.602015
2013-09-11 23:53:00 1.602015
2013-09-11 23:54:00 1.602015
2013-09-11 23:55:00 1.602015
2013-09-11 23:56:00 1.602015
2013-09-11 23:57:00 1.602015
2013-09-11 23:58:00 1.602015
2013-09-11 23:59:00 1.602015
Name: TickMean, Length: 749880, dtype: float64
When I try a 40-minute resample, the time range expands:
start_data = start_data.resample('40min')
start_data
Out[14]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 25344 entries, 2012-01-07 00:00:00 to 2013-12-10 23:20:00
Freq: 40T
Data columns (total 1 columns):
TickMean 18749 non-null values
dtypes: float64(1)
start_data.TickMean
Out[15]:
2012-01-07 00:00:00 1.5706
2012-01-07 00:40:00 1.5706
2012-01-07 01:20:00 1.5706
2012-01-07 02:00:00 1.5706
2012-01-07 02:40:00 1.5706
2012-01-07 03:20:00 1.5706
2012-01-07 04:00:00 1.5706
2012-01-07 04:40:00 1.5706
2012-01-07 05:20:00 1.5706
2012-01-07 06:00:00 1.5706
2012-01-07 06:40:00 1.5706
2012-01-07 07:20:00 1.5706
2012-01-07 08:00:00 1.5706
2012-01-07 08:40:00 1.5706
2012-01-07 09:20:00 1.5706
...
2013-12-10 14:00:00 1.594563
2013-12-10 14:40:00 1.594796
2013-12-10 15:20:00 1.594766
2013-12-10 16:00:00 1.593523
2013-12-10 16:40:00 1.593171
2013-12-10 17:20:00 1.593702
2013-12-10 18:00:00 1.595145
2013-12-10 18:40:00 1.595796
2013-12-10 19:20:00 1.595527
2013-12-10 20:00:00 1.595099
2013-12-10 20:40:00 1.595060
2013-12-10 21:20:00 1.595575
2013-12-10 22:00:00 1.595575
2013-12-10 22:40:00 1.595575
2013-12-10 23:20:00 1.595575
Freq: 40T, Name: TickMean, Length: 25344, dtype: float64
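I feel like I'm missing something obvious. Why does it do this?
Quick edit: I know a 40-minute frequency is odd, but other frequencies have the same effect.
Edit 2: Yes, it was something silly. I had assumed the index was sorted, but it is not: the first and last entries shown in the summary are not the actual minimum and maximum of the index.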
start_data
Out[23]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 749880 entries, 2012-07-06 03:00:00 to 2013-09-11 23:59:00
Data columns (total 1 columns):
TickMean 749880 non-null values
dtypes: float64(1)
start_data.index.min()
Out[24]: Timestamp('2012-01-07 00:00:00', tz=None)
start_data.index.max()
Out[25]: Timestamp('2013-12-10 23:59:00', tz=None)
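To illustrate why the range appears to grow, here is a minimal sketch with made-up data (the timestamps and values are hypothetical, chosen only to mirror the output above). It assumes the summary line reports the first and last index entries in row order, whereas a resample has to cover everything between the true minimum and maximum.

```python
import pandas as pd

# Hypothetical minute-level data whose rows are NOT in chronological order.
idx = pd.to_datetime([
    "2012-07-06 03:00",
    "2012-07-06 03:01",
    "2012-01-07 00:00",  # earliest timestamp, buried in the middle
    "2013-12-10 23:59",  # latest timestamp, also in the middle
    "2013-09-11 23:59",
])
s = pd.Series([1.5412, 1.5412, 1.5706, 1.5956, 1.6020],
              index=idx, name="TickMean")

# What a "first to last" summary is based on: position, not value.
first_shown, last_shown = s.index[0], s.index[-1]

# What a resample has to span: the true extremes of the index.
true_min, true_max = s.index.min(), s.index.max()

print(first_shown, "to", last_shown)
print(true_min, "to", true_max)
print("sorted:", s.index.is_monotonic_increasing)
```

If the two printed ranges differ, the index is out of order and the resampled output will start earlier and end later than the summary suggests.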
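Edit 3: As a bonus for anyone who runs into a similarly odd problem: my date data was day-first rather than month-first, which threw everything off as well. This was solved with the dayfirst=True option.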
ask_data.index = pd.to_datetime(ask_data.index, dayfirst=True)
ask_data
Out[34]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 749880 entries, 2012-06-07 03:00:00 to 2013-11-09 23:59:00
Data columns (total 5 columns):
Open 749880 non-null values
High 749880 non-null values
Low 749880 non-null values
Close 749880 non-null values
Volume 749880 non-null values
dtypes: float64(5)
ask_data.index.min()
Out[35]: Timestamp('2012-06-07 03:00:00', tz=None)
ask_data.index.max()
Out[36]: Timestamp('2013-11-09 23:59:00', tz=None)
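As a small sketch of what dayfirst changes, assuming the raw index values were ambiguous strings of the form shown below (the actual raw format is not given in the question, so these strings are hypothetical):

```python
import pandas as pd

# Hypothetical raw strings; both are ambiguous between D/M/Y and M/D/Y.
raw = ["07/06/2012 03:00", "08/06/2012 03:00"]

# Default parsing treats the first number as the month.
month_first = pd.to_datetime(raw)

# dayfirst=True treats the first number as the day.
day_first = pd.to_datetime(raw, dayfirst=True)

print(month_first[0])  # July 6, 2012
print(day_first[0])    # June 7, 2012
```

Note that dayfirst only matters while strings are being parsed; it has no effect on values that are already datetimes.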
Answer 0 (score: 1)
Are you sure your index is sorted? You can check with:
print(start_data.index.min(), start_data.index.max(), start_data.index.is_monotonic_increasing)
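If the check reports an unsorted index, one possible follow-up is to sort before resampling. The sketch below uses made-up data; the frame `df` and its values are hypothetical stand-ins for start_data. It also assumes a recent pandas, where resample needs an explicit aggregation such as .mean() (older versions, like the one in the question, applied the mean by default).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for start_data: two hours of minute data, shuffled.
rng = pd.date_range("2012-07-06 03:00", periods=120, freq="min")
df = pd.DataFrame({"TickMean": np.linspace(1.54, 1.55, len(rng))}, index=rng)
df = df.sample(frac=1, random_state=0)  # scramble the row order

# Same check as in the answer, stored so it can be inspected later.
was_sorted = df.index.is_monotonic_increasing
print(df.index.min(), df.index.max(), was_sorted)

# Put the rows in chronological order, then resample to 40-minute means.
if not was_sorted:
    df = df.sort_index()
resampled = df.resample("40min").mean()

print(resampled)
```

After sorting, the first and last entries in the summary match index.min() and index.max(), so the resampled range no longer looks surprising.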