How do I convert a pandas DataFrame into a binary matrix?

Time: 2019-05-14 13:02:56

Tags: python pandas

I have a DataFrame that looks like this:

df.head()

 Name                         Application                  time
Administrator                   Excel                        1
Reception                       Word                         1
Manager                         Internet                     1
Administrator                   Excel                        2
Reception                       Email                        5

I am trying to build a binary matrix with one column per distinct application, plus a column holding the total time each user spent in that application:

Name             Email   Email_time   Excel    Excel_time   Internet  Internet_time   Word    Word_time    
Administrator      0         0           1           3         0               0        0       0
Manager            0         0           0           0         1               1        0       0
Reception          1         5           0           0         0               0        1       1

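2 Answers:

Answer 0 (score: 1)

Use DataFrame.pivot_table to sum the times, compare the result against 0 with DataFrame.ne to get the binary flags, then convert them to integers with astype: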

df2 = df.pivot_table(index='Name',
                    columns='Application',
                    values='time',
                    aggfunc='sum',
                    fill_value=0)

df = df2.ne(0).astype(int).join(df2.add_suffix('_time')).sort_index(axis=1)
print (df)
Application    Email  Email_time  Excel  Excel_time  Internet  Internet_time  \
Name                                                                           
Administrator      0           0      1           3         0              0   
Manager            0           0      0           0         1              1   
Reception          1           5      0           0         0              0   

Application    Word  Word_time  
Name                            
Administrator     0          0  
Manager           0          0  
Reception         1          1  

Finally, if necessary, turn the index back into a column:

df = df.reset_index().rename_axis(None, axis=1)
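
For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the five rows shown in the question, and the names `wide`, `flags` and `result` are only illustrative.

```python
import pandas as pd

# Sample data copied from the five rows shown in the question
df = pd.DataFrame({
    'Name': ['Administrator', 'Reception', 'Manager', 'Administrator', 'Reception'],
    'Application': ['Excel', 'Word', 'Internet', 'Excel', 'Email'],
    'time': [1, 1, 1, 2, 5],
})

# Total time per (Name, Application); missing combinations become 0
wide = df.pivot_table(index='Name', columns='Application',
                      values='time', aggfunc='sum', fill_value=0)

# 1 where the total is non-zero, 0 otherwise
flags = wide.ne(0).astype(int)

# Put each flag next to its *_time column, then move Name out of the index
result = (flags.join(wide.add_suffix('_time'))
               .sort_index(axis=1)
               .reset_index()
               .rename_axis(None, axis=1))
print(result)
```

Keeping the summed table (`wide`) separate from the flags makes it easy to swap in a different rule for the flags later without touching the totals.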

Edit:

If negative values are possible, so that a sum can come out as 0, here is an alternative that uses get_dummies with max:

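# max(level=0) was removed in pandas 2.0; group on the index level instead.
# dtype=int keeps 0/1 output, since newer pandas returns booleans by default.
df1 = (pd.get_dummies(df.set_index('Name')['Application'], dtype=int)
         .groupby(level=0, sort=False).max())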
df2 = df.pivot_table(index='Name',
                    columns='Application',
                    values='time',
                    aggfunc='sum',
                    fill_value=0)

df = df1.join(df2.add_suffix('_time'))
print (df)
               Email  Excel  Internet  Word  Email_time  Excel_time  \
Name                                                                  
Administrator      0      1         0     0           0           3   
Reception          1      0         0     1           5           0   
Manager            0      0         1     0           0           0   

               Internet_time  Word_time  
Name                                     
Administrator              0          0  
Reception                  0          1  
Manager                    1          0  
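
To see why the edit matters, consider a hypothetical frame (not from the question) in which one user's times for an application cancel out. A flag derived from the summed table then reports 0 even though the application was used, while the `get_dummies` route still reports 1. This is only a sketch: it uses `groupby(level=0).max()`, which works on current pandas, and passes `dtype=int` because newer versions return booleans by default.

```python
import pandas as pd

# Hypothetical data: Administrator's Excel times sum to 0
df = pd.DataFrame({
    'Name': ['Administrator', 'Administrator', 'Reception'],
    'Application': ['Excel', 'Excel', 'Word'],
    'time': [2, -2, 1],
})

totals = df.pivot_table(index='Name', columns='Application',
                        values='time', aggfunc='sum', fill_value=0)

# Flag derived from the totals: misses Excel because 2 + (-2) == 0
from_totals = totals.ne(0).astype(int)

# Flag derived from the presence of rows: independent of the time values
from_rows = (pd.get_dummies(df.set_index('Name')['Application'], dtype=int)
               .groupby(level=0)
               .max())

print(from_totals.loc['Administrator', 'Excel'])  # 0
print(from_rows.loc['Administrator', 'Excel'])    # 1
```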

Answer 1 (score: 0)

Using groupby and agg:

a = df.groupby(['Name', 'Application']).time.agg(['count', 'sum'])
c = a['count'].unstack(fill_value=0)
s = a['sum'].unstack(fill_value=0).add_suffix('_time')
c.join(s).sort_index(axis=1)

Application    Email  Email_time  Excel  Excel_time  Internet  Internet_time  Word  Word_time
Name                                                                                         
Administrator      0           0      2           3         0              0     0          0
Manager            0           0      0           0         1              1     0          0
Reception          1           5      0           0         0              0     1          1
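
Note that the application columns in the output above hold usage counts (Administrator/Excel is 2), not 0/1 flags. If a strict binary matrix is wanted, one possible tweak is to cap the counts at 1. The sketch below rebuilds the question's sample data by hand; the `.clip(upper=1)` step and the name `out` are additions to the answer's code, not part of it.

```python
import pandas as pd

# Sample data copied from the five rows shown in the question
df = pd.DataFrame({
    'Name': ['Administrator', 'Reception', 'Manager', 'Administrator', 'Reception'],
    'Application': ['Excel', 'Word', 'Internet', 'Excel', 'Email'],
    'time': [1, 1, 1, 2, 5],
})

# Row count and total time for every (Name, Application) pair
a = df.groupby(['Name', 'Application'])['time'].agg(['count', 'sum'])

# Cap counts at 1 so the application columns become 0/1 flags
c = a['count'].unstack(fill_value=0).clip(upper=1)
s = a['sum'].unstack(fill_value=0).add_suffix('_time')

out = c.join(s).sort_index(axis=1)
print(out)
```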