On pandas version 0.19.2, I have the following Series with a MultiIndex:
import pandas as pd
import numpy as np
arrays = [[2001, 2001, 2002, 2002, 2002, 2003, 2004, 2004],
['A', 'B', 'A', 'C', 'D', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index, name='signal')
It looks like this:
first  second
2001   A        -2.48
       B         0.95
2002   A         0.55
       C         0.65
       D        -1.32
2003   B        -0.25
2004   C         0.86
       D        -0.31
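I would like to get a summary contingency data frame in which the columns are the unique values of "second" and the index is the "first" level, like this:
          A      B     C      D
2001  -2.48   0.95   NaN    NaN
2002   0.55    NaN  0.65  -1.32
2003    NaN  -0.25   NaN    NaN
2004    NaN    NaN  0.86  -0.31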
Any idea how to do this? I played around with groupby() as shown below, but could not get anywhere:
s.groupby(level=1).apply(lambda x: "to do")
Related question: Python Pandas - how to do group by on a multiindex
Answer 0 (score: 1)
If the (first, second) pairs in the MultiIndex are unique, use unstack:
df = s.unstack()
print (df)
second         A         B         C         D
first
2001    1.752237  0.348548       NaN       NaN
2002   -0.022903       NaN -0.961702  0.079236
2003         NaN -0.393272       NaN       NaN
2004         NaN       NaN -0.600994 -0.594842
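To compare the result directly with the table in the question, here is a small sketch that rebuilds the Series with the rounded values shown there instead of random numbers and then unstacks it. The names `fixed` and `table` are only illustrative and do not come from the original post.

```python
import pandas as pd

# Same index as in the question, with the rounded values printed there.
index = pd.MultiIndex.from_tuples(
    [(2001, 'A'), (2001, 'B'), (2002, 'A'), (2002, 'C'),
     (2002, 'D'), (2003, 'B'), (2004, 'C'), (2004, 'D')],
    names=['first', 'second'])
fixed = pd.Series([-2.48, 0.95, 0.55, 0.65, -1.32, -0.25, 0.86, -0.31],
                  index=index, name='signal')

# unstack() moves the innermost index level ('second') into the columns;
# (first, second) pairs that do not exist in the Series become NaN.
table = fixed.unstack()
print(table)
```

With no argument, unstack() pivots the last index level, so `fixed.unstack('second')` should give the same layout here.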
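But if you get the following error with your real data:
ValueError: Index contains duplicate entries, cannot reshape
it means that there are duplicates in the MultiIndex: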
print (s)
first  second
2001   A         0.478052  <- 2001, A
       A         0.485261  <- 2001, A
2002   A        -0.474997
       C        -1.165866
       D        -0.755630
2003   B         0.588104
2004   C        -1.439245
       D        -0.461221
Name: signal, dtype: float64
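Before reshaping, it can help to confirm which index pairs are repeated. The following is a minimal sketch on a shortened, hypothetical version of the data above (the name `dup` is made up for illustration); it relies on `Index.is_unique` and `Index.duplicated`.

```python
import pandas as pd

# Shortened example: the pair (2001, 'A') appears twice.
index = pd.MultiIndex.from_tuples(
    [(2001, 'A'), (2001, 'A'), (2002, 'A'), (2003, 'B')],
    names=['first', 'second'])
dup = pd.Series([0.478052, 0.485261, -0.474997, 0.588104],
                index=index, name='signal')

# False as soon as any (first, second) pair occurs more than once.
print(dup.index.is_unique)

# keep=False marks every member of a duplicated group, not only the repeats.
mask = dup.index.duplicated(keep=False)
print(dup[mask])
```

If `is_unique` is True, unstack() should work directly; otherwise the rows selected by `mask` are the ones that need to be aggregated or dropped first.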
Then a possible solution is to aggregate the values first, for example with the mean:
print (s.groupby(level=[0,1]).mean())
first  second
2001   A         0.481657
2002   A        -0.474997
       C        -1.165866
       D        -0.755630
2003   B         0.588104
2004   C        -1.439245
       D        -0.461221
Name: signal, dtype: float64
df = s.groupby(level=[0,1]).mean().unstack()
print (df)
second         A         B         C         D
first
2001    0.481657       NaN       NaN       NaN
2002   -0.474997       NaN -1.165866 -0.755630
2003         NaN  0.588104       NaN       NaN
2004         NaN       NaN -1.439245 -0.461221
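As an alternative that is not part of the original answer, the aggregation and the reshape can probably be combined in one step with `pivot_table`. The sketch below uses small made-up values and the illustrative names `dup` and `wide`, so the result is easy to verify by hand.

```python
import pandas as pd

# Made-up values; the pair (2001, 'A') is duplicated on purpose.
index = pd.MultiIndex.from_tuples(
    [(2001, 'A'), (2001, 'A'), (2002, 'A'), (2002, 'C'), (2003, 'B')],
    names=['first', 'second'])
dup = pd.Series([1.0, 3.0, 0.5, -1.0, 2.0], index=index, name='signal')

# reset_index() turns the index levels into ordinary columns so that
# pivot_table can aggregate duplicates (mean here) while reshaping.
wide = dup.reset_index().pivot_table(index='first', columns='second',
                                     values='signal', aggfunc='mean')
print(wide)
```

Whether to use the mean, sum, first value or something else for duplicated pairs depends on what the duplicates represent in the real data; `aggfunc` can be changed accordingly.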