Rearranging columns after get_dummies

Time: 2017-10-27 03:44:21

Tags: python-3.x pandas one-hot-encoding

       A            B            C               D              E
0   165349.20   136897.80    471784.10        New York      192261.83
1   162597.70   151377.59    443898.53        California    191792.06
2   153441.51   101145.55    407934.54        Florida       191050.39
3   144372.41   118671.85    383199.62        New York      182901.99
4   142107.34   91391.77     366168.42        Florida       166187.94

After running df = pd.get_dummies(df, columns=['D']), the result is:
        A            B              C           E      D_California  D_Florida        D_New York
0   165349.20    136897.80      471784.10   192261.83      0             0                1
1   162597.70    151377.59      443898.53   191792.06      1             0                0
2   153441.51    101145.55      407934.54   191050.39      0             1                0
3   144372.41    118671.85      383199.62   182901.99      0             0                1
4   142107.34    91391.77       366168.42   166187.94      0             1                0
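
To reproduce the behavior described above, here is a minimal sketch that rebuilds the sample frame from the question. The `dtype=int` argument is an assumption added so that newer pandas versions, which default to boolean dummies, still show 0/1 values; the variable name `encoded` is also illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "A": [165349.20, 162597.70, 153441.51, 144372.41, 142107.34],
    "B": [136897.80, 151377.59, 101145.55, 118671.85, 91391.77],
    "C": [471784.10, 443898.53, 407934.54, 383199.62, 366168.42],
    "D": ["New York", "California", "Florida", "New York", "Florida"],
    "E": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})

# dtype=int is an assumption: it keeps the 0/1 display on newer pandas versions
encoded = pd.get_dummies(df, columns=["D"], dtype=int)

# The dummy columns are appended after E instead of taking D's place
print(list(encoded.columns))
```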

Is there a way to get output like the following without spelling out df[['A','B','C','D_California','D_New York','D_Florida','E']]?

        A            B          C      D_California  D_Florida        D_New York    E
0   165349.20   136897.80   471784.10       0               0          1    192261.83
1   162597.70   151377.59   443898.53       1               0          0    191792.06
2   153441.51   101145.55   407934.54       0               1          0    191050.39
3   144372.41   118671.85   383199.62       0               0          1    182901.99
4   142107.34   91391.77    366168.42       0               1          0    166187.94

3 Answers:

Answer 0: (score: 2)

Use sort_index to sort the columns by label:

df.sort_index(axis=1)
Out[813]: 
           A          B          C  D_California  D_Florida  D_NewYork  \
0  165349.20  136897.80  471784.10             0          0          1   
1  162597.70  151377.59  443898.53             1          0          0   
2  153441.51  101145.55  407934.54             0          1          0   
3  144372.41  118671.85  383199.62             0          0          1   
4  142107.34   91391.77  366168.42             0          1          0   
           E  
0  192261.83  
1  191792.06  
2  191050.39  
3  182901.99  
4  166187.94  
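
One caveat worth illustrating: sort_index(axis=1) orders columns by label, so it restores the layout only when the original column names already sort into the wanted order. A minimal sketch with hypothetical column names (`Z`, `State`, `M`) that are not alphabetical; `dtype=int` is again an assumption to keep 0/1 values:

```python
import pandas as pd

# Hypothetical frame whose original column order is not alphabetical
df = pd.DataFrame({
    "Z": [1.0, 2.0, 3.0],
    "State": ["New York", "California", "Florida"],
    "M": [4.0, 5.0, 6.0],
})

encoded = pd.get_dummies(df, columns=["State"], dtype=int)

# Sorting by label moves M to the front and Z to the end,
# which differs from the original Z, State, M layout
by_label = encoded.sort_index(axis=1)
print(list(by_label.columns))
```

In that situation an approach based on the original column positions is needed.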

Edit: sort the column list using a dict and a lambda

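# Map each original column name to its position in the original df
A = dict(zip(df.columns, range(df.shape[1])))
# One-hot encode 'State'; the dummy columns are appended at the end
df1 = pd.get_dummies(df, columns=['State'])
# Column names of the encoded frame, not yet in the desired order
youroder = list(df1)
# Sort by the position of the source column (the part before the first '_');
# this assumes the original column names contain no '_'
youroder.sort(key=lambda val: A[val.split(sep='_')[0]])
# Select the columns in the sorted order
df1[youroder]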

Out[842]: 
   R&D Spend  Administration  Marketing Spend  State_California  \
0  165349.20       136897.80        471784.10                 0   
1  162597.70       151377.59        443898.53                 1   
2  153441.51       101145.55        407934.54                 0   
3  144372.41       118671.85        383199.62                 0   
4  142107.34        91391.77        366168.42                 0   
   State_Florida  State_NewYork  Profit(E)  
0              0              1  192261.83  
1              0              0  191792.06  
2              1              0  191050.39  
3              0              1  182901.99  
4              1              0  166187.94  
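
The dict-and-key idea above relies on splitting at '_', which breaks when an original column name itself contains an underscore. A possible variant, sketched here under assumptions, matches dummy columns by their "<column>_" prefix instead. The names `encode_cols`, `position`, and `source_position`, and the column `R&D_Spend`, are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Reduced, hypothetical sample; "R&D_Spend" contains an underscore on purpose
df = pd.DataFrame({
    "R&D_Spend": [165349.20, 162597.70, 153441.51],
    "State": ["New York", "California", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39],
})
encode_cols = ["State"]

# Position of every column in the original frame
position = {name: i for i, name in enumerate(df.columns)}

def source_position(name):
    """Return the original position of the column that `name` came from."""
    if name in position:            # column that was not encoded
        return position[name]
    for col in encode_cols:         # dummy column: match on its "<col>_" prefix
        if name.startswith(col + "_"):
            return position[col]
    raise KeyError(name)

encoded = pd.get_dummies(df, columns=encode_cols, dtype=int)

# sorted() is stable, so the dummy columns keep their relative order
ordered = sorted(encoded.columns, key=source_position)
result = encoded[ordered]
print(list(result.columns))
```

This sketch assumes that no two encoded columns share a prefix (for example `State` and `State_x`); such cases would need a stricter matching rule.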

Answer 1: (score: 2)

A general solution for columns that may not be in sorted order:
find the position of the column to dummify, then concat accordingly.

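# Position of the column to encode
j = df.columns.get_loc('D')

# Columns left of D, the dummies for D, and the columns right of D
left = df.iloc[:, :j]
dumb = pd.get_dummies(df[['D']])
rite = df.iloc[:, j+1:]

# Stitch the three pieces back together side by side
pd.concat([left, dumb, rite], axis=1)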

           A          B          C  D_California  D_Florida  D_New York          E
0  165349.20  136897.80  471784.10             0          0           1  192261.83
1  162597.70  151377.59  443898.53             1          0           0  191792.06
2  153441.51  101145.55  407934.54             0          1           0  191050.39
3  144372.41  118671.85  383199.62             0          0           1  182901.99
4  142107.34   91391.77  366168.42             0          1           0  166187.94
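
The split-and-concat steps above could be wrapped in a small function and applied once per categorical column. This is only a sketch: `dummify_in_place` is a hypothetical helper name, the extra column `G` is invented for the example, and `dtype=int` is an assumption to keep 0/1 values on newer pandas versions.

```python
import pandas as pd

def dummify_in_place(df, col):
    """Replace `col` with its dummy columns at the same position (hypothetical helper)."""
    j = df.columns.get_loc(col)           # integer position; assumes a unique label
    left = df.iloc[:, :j]
    dummies = pd.get_dummies(df[[col]], dtype=int)
    right = df.iloc[:, j + 1:]
    return pd.concat([left, dummies, right], axis=1)

# Reduced sample with a second, made-up categorical column G
df = pd.DataFrame({
    "A": [165349.20, 162597.70, 153441.51],
    "D": ["New York", "California", "Florida"],
    "G": ["x", "y", "x"],
    "E": [192261.83, 191792.06, 191050.39],
})

# Apply the helper once per categorical column
result = df
for col in ["D", "G"]:
    result = dummify_in_place(result, col)
print(list(result.columns))
```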

Answer 2: (score: 0)

Not sure whether there is a better way, but this will work:

col = ['R&D Spend', 'Administration', 'Marketing Spend', 'State_California', 'State_New York', 'State_Florida', 'Profit(E)']

df=df.loc[:, col]
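
If typing the full list by hand is the concern, the same list could be assembled programmatically. A minimal sketch, assuming a reduced version of the frame (only `Marketing Spend`, `State`, and `Profit`) and that the dummies should land right after `Marketing Spend`; the names `dummy_cols`, `other_cols`, and `at` are illustrative.

```python
import pandas as pd

# Reduced sample using column names from this answer
df = pd.DataFrame({
    "Marketing Spend": [471784.10, 443898.53, 407934.54],
    "State": ["New York", "California", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39],
})
encoded = pd.get_dummies(df, columns=["State"], dtype=int)

# Split the encoded columns into the dummy columns and everything else
dummy_cols = [c for c in encoded.columns if c.startswith("State_")]
other_cols = [c for c in encoded.columns if c not in dummy_cols]

# Insert the dummy columns right after "Marketing Spend"
at = other_cols.index("Marketing Spend") + 1
col = other_cols[:at] + dummy_cols + other_cols[at:]

encoded = encoded.loc[:, col]
print(list(encoded.columns))
```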