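pandas - Inconsistent behavior when applying time-series operations to a MultiIndex DataFrame

Time: 2015-05-04 16:28:10

Tags: python pandas

This may be a bug: grouped time-series operations fail silently on a MultiIndex DataFrame.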

import pandas as pd
import pandas.io.data as web

# Get some market data
df = web.DataReader(['AAPL', 'GOOG'], 'yahoo', pd.Timestamp('2013'), pd.Timestamp('2014')).to_frame()
df.index.names = ('dt', 'symbol')

In [21]: df.head()
Out[21]: 
                        Open       High        Low      Close     Volume  \
dt         symbol                                                          
2013-01-02 AAPL    553.82001  555.00000  541.62994  549.03003  140129500   
2013-01-03 AAPL    547.88000  549.67004  541.00000  542.10004   88241300   
2013-01-04 AAPL    536.96997  538.63000  525.82996  527.00000  148583400   
2013-01-07 AAPL    522.00000  529.30005  515.20001  523.90002  121039100   
2013-01-08 AAPL    529.21002  531.89001  521.25000  525.31000  114676800   

                   Adj Close  
dt         symbol             
2013-01-02 AAPL     74.63931  
2013-01-03 AAPL     73.69719  
2013-01-04 AAPL     71.64438  
2013-01-07 AAPL     71.22294  
2013-01-08 AAPL     71.41463  

We want to resample this to monthly data. The following fails and returns an empty DataFrame:

df_M = df.groupby(level='symbol').resample('M', how='mean')
In [23]: df_M
Out[23]: 
Empty DataFrame
Columns: []
Index: []

This, however, works, but it requires a seemingly unnecessary re-indexing:

df_M = df.reset_index().set_index('dt').groupby('symbol').resample('M', how='mean')
In [26]: df_M.head()
Out[26]: 
                   Adj Close       Close        High         Low        Open  \
symbol dt                                                                      
AAPL   2013-01-31  67.677750  497.822382  504.407623  492.969997  500.083329   
       2013-02-28  62.388477  456.808942  463.231056  452.106325  458.503692   
       2013-03-31  60.417287  441.841000  446.803495  437.337996  442.011512   
       2013-04-30  57.398619  419.765001  425.553183  414.722271  419.766820   
       2013-05-31  61.340151  446.452734  451.658190  441.495455  446.400919   

                         Volume  
symbol dt                        
AAPL   2013-01-31  1.562312e+08  
       2013-02-28  1.229478e+08  
       2013-03-31  1.147110e+08  
       2013-04-30  1.245851e+08  
       2013-05-31  1.073583e+08  

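Having to call reset_index().set_index('dt') and then groupby('symbol'), instead of simply groupby(level='symbol'), seems to defeat the purpose of a MultiIndex! What gives?

I also realize that data like this may be better suited to a Panel than a DataFrame, but the 3D Panel structure runs into performance and memory problems when dealing with very large amounts of (often sparse) data.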

1 answer:

Answer 0: (score: 0)

This is indeed a bug, and it has been fixed: https://github.com/pydata/pandas/issues/10063
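
For readers who want to group on the index levels directly, here is a minimal sketch of one approach that avoids the reset_index() round trip. It is not taken from the linked issue. It assumes a pandas version in which pd.Grouper accepts both level and freq, and it uses synthetic data in place of the Yahoo download. The 'MS' (month-start) frequency is also an assumption, chosen because the month-end alias differs between pandas versions, so the result is labeled by the first day of each month rather than the last.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the market data (values are made up for illustration).
dates = pd.date_range('2013-01-01', '2013-03-31', freq='D')
index = pd.MultiIndex.from_product([dates, ['AAPL', 'GOOG']],
                                   names=['dt', 'symbol'])
df = pd.DataFrame({'Close': np.arange(len(index), dtype=float)}, index=index)

# Group by the 'symbol' level and by calendar month of the 'dt' level,
# both taken straight from the MultiIndex, then average within each group.
monthly = df.groupby([
    pd.Grouper(level='symbol'),
    pd.Grouper(level='dt', freq='MS'),
]).mean()

print(monthly)
```

The result is indexed by (symbol, dt), the same layout as the output of the workaround in the question.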