这可能是一个潜在的错误:分组时间序列操作在多索引DataFrame上无声地失败。
import pandas as pd
import pandas.io.data as web
# Get some market data
df = web.DataReader(['AAPL', 'GOOG'], 'yahoo', pd.Timestamp('2013'), pd.Timestamp('2014')).to_frame()
df.index.names = ('dt', 'symbol')
In [21]: df.head()
Out[21]:
Open High Low Close Volume \
dt symbol
2013-01-02 AAPL 553.82001 555.00000 541.62994 549.03003 140129500
2013-01-03 AAPL 547.88000 549.67004 541.00000 542.10004 88241300
2013-01-04 AAPL 536.96997 538.63000 525.82996 527.00000 148583400
2013-01-07 AAPL 522.00000 529.30005 515.20001 523.90002 121039100
2013-01-08 AAPL 529.21002 531.89001 521.25000 525.31000 114676800
Adj Close
dt symbol
2013-01-02 AAPL 74.63931
2013-01-03 AAPL 73.69719
2013-01-04 AAPL 71.64438
2013-01-07 AAPL 71.22294
2013-01-08 AAPL 71.41463
我们想要将其重新采样为月度数据。这会失败并返回一个空的DataFrame:
df_M = df.groupby(level='symbol').resample('M', how='mean')
In [23]: df_M
Out[23]:
Empty DataFrame
Columns: []
Index: []
然而,这可行,但需要看似不必要的重新索引:
df_M = df.reset_index().set_index('dt').groupby('symbol').resample('M', how='mean')
In [26]: df_M.head()
Out[26]:
Adj Close Close High Low Open \
symbol dt
AAPL 2013-01-31 67.677750 497.822382 504.407623 492.969997 500.083329
2013-02-28 62.388477 456.808942 463.231056 452.106325 458.503692
2013-03-31 60.417287 441.841000 446.803495 437.337996 442.011512
2013-04-30 57.398619 419.765001 425.553183 414.722271 419.766820
2013-05-31 61.340151 446.452734 451.658190 441.495455 446.400919
Volume
symbol dt
AAPL 2013-01-31 1.562312e+08
2013-02-28 1.229478e+08
2013-03-31 1.147110e+08
2013-04-30 1.245851e+08
2013-05-31 1.073583e+08
您需要执行reset_index().set_index('dt')
然后groupby('symbol')
而不是groupby(level='symbol')
这一事实似乎打败了多索引的目的!是什么给了什么?
我也意识到像这样的数据可能更适合Panel而不是DataFrame,但是当处理非常大量(通常是稀疏的)数据时,3D Panel结构会出现性能和内存问题。数据帧。