How can I preserve column order when calling groupby and shift in pandas?

Asked: 2018-02-07 09:47:34

Tags: python pandas

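When calling pandas.DataFrame.groupby().shift(), the columns of the result appear to be reordered by column label. The sort parameter of groupby only applies to rows.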
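
As a small illustration of what sort does control, the sketch below uses a hypothetical frame `demo` (not part of the question) to show that `sort=False` only changes the order of the group rows in the result and has nothing to do with columns.

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({'key': ['b', 'b', 'a', 'a'],
                     'val': [1, 2, 3, 4]})

# sort=True (the default) orders the groups by key
sorted_groups = demo.groupby('key', sort=True)['val'].sum()
# sort=False keeps the groups in order of first appearance
unsorted_groups = demo.groupby('key', sort=False)['val'].sum()

print(list(sorted_groups.index))
print(list(unsorted_groups.index))
```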

Here is an example of the problem:

import pandas as pd
df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'E': ['a','b','c','d','e','f'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102],
                   'D': [1,2,3,4,5,6]
                  })

df.set_index('A',inplace=True)
df = df[['E','C','D','B']]
df

#         E     C   D    B
#     A            
#group1   a   100   1   10
#group1   b   102   2   12
#group2   c   100   3   10
#group2   d   250   4   25
#group3   e   100   5   10
#group3   f   102   6   12

From here, I would like to get:

#         E     C   D    B    C_s     D_s   B_s
#     A                     
#group1   a   100   1   10   102.0    2.0  12.0     
#group1   b   102   2   12     NaN    NaN   NaN     
#group2   c   100   3   10   250.0    4.0  25.0     
#group2   d   250   4   25     NaN    NaN   NaN     
#group3   e   100   5   10   102.0    6.0  12.0     
#group3   f   102   6   12     NaN    NaN   NaN

However,

df[['C_s','D_s','B_s']]= df.groupby(level='A')[['C','D','B']].shift(-1)

results in:

#         E     C   D    B    C_s     D_s   B_s
#     A                     
#group1   a   100   1   10   12.0   102.0   2.0
#group1   b   102   2   12    NaN     NaN   NaN
#group2   c   100   3   10   25.0   250.0   4.0
#group2   d   250   4   25    NaN     NaN   NaN
#group3   e   100   5   10   12.0   102.0   6.0
#group3   f   102   6   12    NaN     NaN   NaN
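
One way to see what is happening is to inspect the column labels of the shifted frame before assigning it. The sketch below rebuilds the frame from the example; the names `cols` and `shifted` are introduced here for illustration only. The printed order may depend on the installed pandas version, so no output is shown.

```python
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'E': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102],
                   'D': [1, 2, 3, 4, 5, 6]}).set_index('A')
df = df[['E', 'C', 'D', 'B']]

cols = ['C', 'D', 'B']  # the order requested in the selection
shifted = df.groupby(level='A')[cols].shift(-1)

# Compare the order that came back with the order that was requested
print(shifted.columns.tolist())
print(shifted.columns.tolist() == cols)  # False would indicate reordering
```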

Sorting the columns artificially keeps each shifted column paired with the column it came from:

df = df.sort_index(axis=1)
df[['B_s','C_s','D_s']]= df.groupby(level='A')[['B','C','D']].shift(-1).sort_index(axis=1)
df
#         B    C  D  E   B_s   C_s   D_s
#     A              
#group1  10  100  1  a  12.0  102.0  2.0
#group1  12  102  2  b   NaN   NaN   NaN
#group2  10  100  3  c  25.0  250.0  4.0
#group2  25  250  4  d   NaN   NaN   NaN
#group3  10  100  5  e  12.0  102.0  6.0
#group3  12  102  6  f   NaN   NaN   NaN 
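
Another option that avoids sorting, sketched under the assumption that assigning a DataFrame to a list of new column names pairs the columns by position: assign each shifted column by its label instead, so the order returned by shift no longer matters. The names `cols` and `shifted` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'E': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102],
                   'D': [1, 2, 3, 4, 5, 6]}).set_index('A')
df = df[['E', 'C', 'D', 'B']]

cols = ['C', 'D', 'B']
shifted = df.groupby(level='A')[cols].shift(-1)

# Pair each new column with its source by label, not by position
for col in cols:
    df[col + '_s'] = shifted[col]

print(df)
```

This keeps the original column order of df and appends the shifted columns in the order listed in `cols`.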

Why are the columns reordered in the first place?

1 Answer:

Answer 0: (score: 3)

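In my opinion this is a bug.

A workaround is to go through apply with a custom function instead of calling shift on the groupby directly: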

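# Select the columns with a list ([[...]]); group_keys=False keeps the original index on newer pandas
df[['C_s','D_s','B_s']] = (df.groupby(level='A', group_keys=False)[['C','D','B']]
                             .apply(pd.DataFrame.shift, periods=-1))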
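
The same idea can also be written with an explicit lambda. The following is a self-contained sketch rather than the answer's exact code: `group_keys=False` is an assumption added so that the result keeps the original index on recent pandas releases, and the names `cols` and `shifted` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'E': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102],
                   'D': [1, 2, 3, 4, 5, 6]}).set_index('A')
df = df[['E', 'C', 'D', 'B']]

cols = ['C', 'D', 'B']

# Shift each group's sub-frame; apply hands the lambda the columns in the selected order
shifted = (df.groupby(level='A', group_keys=False)[cols]
             .apply(lambda g: g.shift(-1)))

df[['C_s', 'D_s', 'B_s']] = shifted
print(df)
```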

Thanks to @cᴏʟᴅsᴘᴇᴇᴅ for suggesting another solution.
