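How to clean and forward-fill a multi-day 1-minute time series with pandas?

Asked: 2013-10-09 09:26:04

Tags: python pandas

I have a CSV file containing 1-minute stock data spanning multiple days. Each trading day runs from 9:30 to 16:00.

Some minutes are missing from the time series (here 2013-09-16 09:32:00 and 2013-09-17 09:31:00 are missing):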

2013-09-16 09:30:00,461.01,461.49,461,461,183507
2013-09-16 09:31:00,460.82,461.6099,460.39,461.07,212774
2013-09-16 09:33:00,460.0799,460.88,458.97,459.2401,207880
2013-09-16 09:34:00,458.97,460.08,458.8,460.04,148121
...
2013-09-16 15:59:00,449.72,450.0774,449.59,449.95,146399
2013-09-16 16:00:00,450.12,450.12,449.65,449.65,444594
2013-09-17 09:30:00,448,448,447.5,447.96,173624
2013-09-17 09:32:00,450.6177,450.9,449.05,449.2701,268715
2013-09-17 09:33:00,451.39,451.96,450.58,450.7061,197019
...
...

Using pandas, how do I forward-fill the series so that every minute is present? The result should look like this:

2013-09-16 09:30:00,461.01,461.49,461,461,183507
2013-09-16 09:31:00,460.82,461.6099,460.39,461.07,212774
2013-09-16 09:32:00,460.82,461.6099,460.39,461.07,212774 <-- forward filled
2013-09-16 09:33:00,460.0799,460.88,458.97,459.2401,207880
2013-09-16 09:34:00,458.97,460.08,458.8,460.04,148121
...
2013-09-16 15:59:00,449.72,450.0774,449.59,449.95,146399
2013-09-16 16:00:00,450.12,450.12,449.65,449.65,444594
2013-09-17 09:30:00,448,448,447.5,447.96,173624
2013-09-17 09:31:00,448,448,447.5,447.96,173624 <-- forward filled
2013-09-17 09:32:00,450.6177,450.9,449.05,449.2701,268715
2013-09-17 09:33:00,451.39,451.96,450.58,450.7061,197019
...

Runs of several consecutive missing minutes also need to be handled...

2 Answers:

Answer 0 (score: 3)

So I copied the first 4 rows into a DataFrame:

Out[49]:
                    0         1         2       3         4       5
0 2013-09-16 09:30:00  461.0100  461.4900  461.00  461.0000  183507
1 2013-09-16 09:31:00  460.8200  461.6099  460.39  461.0700  212774
2 2013-09-16 09:33:00  460.0799  460.8800  458.97  459.2401  207880
3 2013-09-16 09:34:00  458.9700  460.0800  458.80  460.0400  148121
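
The answer does not show how the frame above was loaded. One way it could be built, as a minimal sketch: the `raw` string is a hypothetical stand-in for the CSV file, and `header=None` assumes the file has no header row, as the sample in the question suggests.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: the first four rows from the question.
raw = """\
2013-09-16 09:30:00,461.01,461.49,461,461,183507
2013-09-16 09:31:00,460.82,461.6099,460.39,461.07,212774
2013-09-16 09:33:00,460.0799,460.88,458.97,459.2401,207880
2013-09-16 09:34:00,458.97,460.08,458.8,460.04,148121
"""

# header=None because the sample has no header row; parse_dates=[0] converts
# column 0 to datetime64 values instead of leaving it as plain strings.
df = pd.read_csv(io.StringIO(raw), header=None, parse_dates=[0])
```

Parsing column 0 as datetimes at load time matters because the resampling step only works on a datetime index.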

Then

df1 = df.set_index(keys=[0]).resample('1min').ffill()
df1

Out[52]:
                            1         2       3         4       5
0                                                                
2013-09-16 09:30:00  461.0100  461.4900  461.00  461.0000  183507
2013-09-16 09:31:00  460.8200  461.6099  460.39  461.0700  212774
2013-09-16 09:32:00  460.8200  461.6099  460.39  461.0700  212774
2013-09-16 09:33:00  460.0799  460.8800  458.97  459.2401  207880
2013-09-16 09:34:00  458.9700  460.0800  458.80  460.0400  148121

This also handles multiple consecutive missing values and forward-fills them.

So if I have data like this:

2013-09-17 09:30:00,448,448,447.5,447.96,173624
2013-09-17 09:33:00,451.39,451.96,450.58,450.7061,197019

and do the same thing:

Out[55]:
                          1       2       3         4       5
0                                                            
2013-09-17 09:30:00  448.00  448.00  447.50  447.9600  173624
2013-09-17 09:31:00  448.00  448.00  447.50  447.9600  173624
2013-09-17 09:32:00  448.00  448.00  447.50  447.9600  173624
2013-09-17 09:33:00  451.39  451.96  450.58  450.7061  197019

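The key here is that you must have a DatetimeIndex. If you also want to keep the timestamps as a column, pass drop=False to set_index.

Answer 1 (score: 1)

This may work slightly better for you, because it takes the separate days into account, so you don't have to fill each day individually:

Just create the DataFrame: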
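
A small sketch of the drop=False variant, using two rows from the example above. The reduced column set and the names `indexed` and `filled` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Two rows from the example above, with a two-minute gap between them.
df = pd.DataFrame({
    0: pd.to_datetime(["2013-09-17 09:30:00", "2013-09-17 09:33:00"]),
    1: [448.0, 451.39],
    5: [173624, 197019],
})

# drop=False keeps column 0 in the frame while also using it as the index.
indexed = df.set_index(0, drop=False)

# resample needs a DatetimeIndex; ffill copies the last seen row into each gap.
filled = indexed.resample("1min").ffill()
```

Because column 0 is forward-filled along with everything else, rows where it differs from the index are the ones that were filled in, which gives a cheap way to flag synthetic bars.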

import datetime

import pandas as pd

# Sample rows; the values are kept as strings, as in the original example.
list1 = [["2013-09-16 09:29:00", "461.01", "461.49", "461", "461", "183507"],
         ["2013-09-16 09:31:00", "460.82", "461.6099", "460.39", "461.07", "212774"],
         ["2013-09-16 09:34:00", "460.0799", "460.88", "458.97", "459.2401", "207880"],
         ["2013-09-17 09:35:00", "458.97", "460.08", "458.8", "460.04", "148121"]]

cols = ['date', 'price1', 'price2', 'price3', 'price4', 'price5']

df = pd.DataFrame(list1, columns=cols)

Set the date column as the index:

df['date'] = pd.to_datetime(df['date'])

df.set_index('date', inplace=True)

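Reindex to fill the gaps, forward-fill the resulting NaN values, and then keep only the times between 9:30 AM and 4:00 PM:

df2 = df.reindex(pd.date_range(df.index[0], df.index[-1], freq='min')).ffill().between_time(datetime.datetime(year=1, month=1, day=1, hour=9, minute=30).time(), datetime.time(16))

These statements can be split into separate steps:

First, reindex the DataFrame so that the index runs from your start date/time to your end date/time at a 1-minute frequency:

df2 = df.reindex(pd.date_range(df.index[0], df.index[-1], freq='min'))

This creates many NaN values where the new index does not align with the old one. We fill them with ffill (forward fill), although other options exist: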

df2.ffill(inplace=True)
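
To illustrate the "other options", here is a hedged sketch on a hypothetical five-minute series with three missing bars. The series `s` and the result names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical 1-minute series with three consecutive missing values.
idx = pd.date_range("2013-09-16 09:30", periods=5, freq="min")
s = pd.Series([461.0, np.nan, np.nan, np.nan, 460.04], index=idx)

forward = s.ffill()                    # carry the last observation forward
capped = s.ffill(limit=2)              # fill at most 2 consecutive gaps
backward = s.bfill()                   # pull the next observation backward
linear = s.interpolate(method="time")  # linear in elapsed time between points
```

For price data, forward fill is usually the safest of these, since bfill and interpolate both use values that were not yet known at the missing timestamp.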

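Finally, drop the times outside your 9:30 AM to 4:00 PM window:

df_final = df2.between_time(datetime.datetime(year=1, month=1, day=1, hour=9, minute=30).time(), datetime.time(16))

Because .time() will not accept 9.5 and the documentation is a bit sparse, I created a datetime object with its time set to 9:30 AM and then called .time() to extract it. There is surely a better way; datetime.time(9, 30) builds the same value directly.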
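
Putting the steps together, a minimal end-to-end sketch might look like the following. The sample frame with its single `close` column and the names `open_time`, `close_time` and `full_range` are illustrative assumptions. It builds the session times with `datetime.time` and filters with `DataFrame.between_time`, which includes both endpoints by default.

```python
import datetime

import pandas as pd

# Hypothetical sample: two sessions with gaps, including a missing 09:30 bar on day 2.
df = pd.DataFrame(
    {"close": [461.0, 461.07, 449.65, 449.27]},
    index=pd.to_datetime([
        "2013-09-16 09:30:00",
        "2013-09-16 09:31:00",
        "2013-09-16 16:00:00",
        "2013-09-17 09:32:00",
    ]),
)

open_time = datetime.time(9, 30)
close_time = datetime.time(16, 0)

# Every minute from the first to the last timestamp, overnight included.
full_range = pd.date_range(df.index[0], df.index[-1], freq="min")

# Reindex, forward-fill, then keep only the rows inside trading hours.
result = df.reindex(full_range).ffill().between_time(open_time, close_time)
```

One thing to be aware of with this approach: because the fill runs across the overnight gap before the filter is applied, a missing opening bar (09:30 on 2013-09-17 here) receives the previous day's 16:00 values. Whether that is acceptable depends on how the data will be used.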