Pandas: add columns at any column index level to a MultiIndex

Time: 2019-08-01 14:07:19

Tags: python pandas

I want to add columns for the missing level values (index = 1) to every parent level (index = 0) of a dataframe. For a simple dataframe, this works well.


Dataframe:

import numpy as np
import pandas as pd

index = [['A', 'B', 'C', 'D'], ['a', 'b', 'a', 'b']]
cols = [['AC', 'AC', 'BC', 'DC', 'CC'], ['ac', 'aac', 'bc', 'ac', 'bc']]
data = np.random.random((4, 5))
df = pd.DataFrame(data=data, index=index, columns=cols)
df.columns.names = ['col_name_0', 'col_name_1']

Original df:

col_name_0        AC                  BC        DC        CC
col_name_1        ac       aac        bc        ac        bc
A a         0.169402  0.899434  0.644941  0.330402  0.805702
B b         0.933743  0.994497  0.060507  0.609129  0.545999
C a         0.064937  0.686350  0.740594  0.985218  0.717699
D b         0.151031  0.932294  0.948751  0.538251  0.085700    

Processing steps:

feature_index = [index for index, item in enumerate(df.columns.names) if item == 'col_name_1'][0]
all_features = df.columns.levels[feature_index].to_list()

for idx, item in df.groupby(level=0, axis=1):
    features = item.columns.get_level_values(1).to_list()
    missing = list(set(all_features) - set(features))
    for m_item in missing:
        df[idx, m_item] = np.nan * np.ones(df.shape[0])

Processed df:

col_name_0        AC                BC      ...  CC            DC
col_name_1       aac        ac  bc aac  ac  ...  ac        bc aac        ac  bc
A a         0.561247  0.353270 NaN NaN NaN  ... NaN  0.733714 NaN  0.343174 NaN
B b         0.699053  0.696892 NaN NaN NaN  ... NaN  0.144768 NaN  0.267141 NaN
C a         0.624581  0.064629 NaN NaN NaN  ... NaN  0.856559 NaN  0.772735 NaN
D b         0.563903  0.192823 NaN NaN NaN  ... NaN  0.071497 NaN  0.000361 NaN

But for a dataframe with more column levels (such as the one below), the approach fails.

Dataframe:

index = [['A', 'B', 'C', 'D'], ['a', 'b', 'a', 'b']]
cols = [['AC', 'AC', 'BC', 'DC', 'CC'], ['ac', 'aac', 'bc', 'ac', 'bc'], ['Xc', 'Xc', 'Xc', 'Xc', 'Xc']]
data = np.random.random((4, 5))
df = pd.DataFrame(data=data, index=index, columns=cols)
df.columns.names = ['col_name_0', 'col_name_1', 'col_name_2']

Original df:

col_name_0        AC                  BC        DC        CC
col_name_1        ac       aac        bc        ac        bc
col_name_2        Xc        Xc        Xc        Xc        Xc
A a         0.317022  0.700635  0.305712  0.934382  0.315501
B b         0.601277  0.726890  0.737907  0.571935  0.716260
C a         0.679046  0.314987  0.846560  0.962516  0.770071
D b         0.124029  0.626421  0.967531  0.193875  0.395897

Processing steps (the same code as above, which fails on this dataframe):

feature_index = [index for index, item in enumerate(df.columns.names) if item == 'col_name_1'][0]
all_features = df.columns.levels[feature_index].to_list()

for idx, item in df.groupby(level=0, axis=1):
    features = item.columns.get_level_values(1).to_list()
    missing = list(set(all_features) - set(features))
    for m_item in missing:
        df[idx, m_item] = np.nan * np.ones(df.shape[0])

Any ideas on how to make my approach more generic, so that it accepts any number of column levels?

2 answers:

Answer 0: (score: 1)

So you can just use stack and unstack:

out = df.stack(level = 1).unstack().swaplevel(1, 2, axis = 1)
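
To see what each step does, here is a minimal sketch of the one-liner applied to the three-level dataframe from the question. The fixed random seed and the intermediate names `stacked` and `unstacked` are my additions for illustration, not part of the answer. `stack(level=1)` moves `col_name_1` into the row index, `unstack()` moves it back as the innermost column level (filling combinations that did not exist with NaN), and `swaplevel` restores the original level order.

```python
import numpy as np
import pandas as pd

# Three-level dataframe from the question; the seed is an assumption
# added only so the run is reproducible.
index = [['A', 'B', 'C', 'D'], ['a', 'b', 'a', 'b']]
cols = [['AC', 'AC', 'BC', 'DC', 'CC'],
        ['ac', 'aac', 'bc', 'ac', 'bc'],
        ['Xc', 'Xc', 'Xc', 'Xc', 'Xc']]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 5)), index=index, columns=cols)
df.columns.names = ['col_name_0', 'col_name_1', 'col_name_2']

# Step 1: move col_name_1 from the columns into the row index.
stacked = df.stack(level=1)

# Step 2: move it back to the columns; it becomes the innermost level and
# every remaining column now gets every col_name_1 value.
unstacked = stacked.unstack()

# Step 3: swap the last two column levels to restore the original order.
out = unstacked.swaplevel(1, 2, axis=1)

print(out.columns.names)
print(out.shape)
```

Depending on the pandas version, `stack` may emit a FutureWarning about its new implementation; as far as I can tell this does not change the result here.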

Answer 1: (score: 1)

IIUC, you can use reindex:

full_cols = pd.MultiIndex.from_product(df.columns.levels,
                                       names=df.columns.names)
df.reindex(full_cols, axis=1)
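
As a rough end-to-end sketch, this is the same idea run on the three-level dataframe from the question. The fixed seed is an assumption for reproducibility, and the `remove_unused_levels()` call is an optional extra of mine (not in the answer) that guards against stale level values if the dataframe was sliced beforehand. Note that `reindex` returns a new dataframe, so the result has to be assigned.

```python
import numpy as np
import pandas as pd

# Three-level dataframe from the question; the seed is an assumption
# added only so the run is reproducible.
index = [['A', 'B', 'C', 'D'], ['a', 'b', 'a', 'b']]
cols = [['AC', 'AC', 'BC', 'DC', 'CC'],
        ['ac', 'aac', 'bc', 'ac', 'bc'],
        ['Xc', 'Xc', 'Xc', 'Xc', 'Xc']]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 5)), index=index, columns=cols)
df.columns.names = ['col_name_0', 'col_name_1', 'col_name_2']

# Optional extra step (not in the answer): drop level values that no
# column actually uses any more.
levels = df.columns.remove_unused_levels().levels

# Build every combination of the level values, whatever the number of levels.
full_cols = pd.MultiIndex.from_product(levels, names=df.columns.names)

# Columns missing from df are created and filled with NaN.
out = df.reindex(full_cols, axis=1)

print(out.shape)
```

Because `from_product` combines all levels, this also creates every combination of the deeper levels (here only `Xc`), which may be more columns than wanted when those levels have many values.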