I have a DataFrame that looks like this:
df.head()
Name           Application  time
Administrator  Excel           1
Reception      Word            1
Manager        Internet        1
Administrator  Excel           2
Reception      Email           5
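For anyone who wants to run the answers below, the sample frame can be rebuilt roughly as follows. This is only a sketch: the values are copied from the `df.head()` output above, and the integer dtype of `time` is an assumption.

```python
import pandas as pd

# Sample rows copied from the df.head() output in the question.
# Assumption: "time" is an integer column.
df = pd.DataFrame({
    "Name": ["Administrator", "Reception", "Manager", "Administrator", "Reception"],
    "Application": ["Excel", "Word", "Internet", "Excel", "Email"],
    "time": [1, 1, 1, 2, 5],
})
print(df)
```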
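I am trying to build a binary matrix with every distinct application as a column name and, for each distinct user, the summed usage time of each application: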
Name           Email  Email_time  Excel  Excel_time  Internet  Internet_time  Word  Word_time
Administrator      0           0      1           3         0              0     0          0
Manager            0           0      0           0         1              1     0          0
Reception          1           5      0           0         0              0     1          1
Answer 0 (score: 1)
Use DataFrame.pivot_table to sum the times, compare the result with 0 using DataFrame.ne to get the binary indicators, then convert them to integers with astype:
df2 = df.pivot_table(index='Name',
                     columns='Application',
                     values='time',
                     aggfunc='sum',
                     fill_value=0)
df = df2.ne(0).astype(int).join(df2.add_suffix('_time')).sort_index(axis=1)
print(df)
Application    Email  Email_time  Excel  Excel_time  Internet  Internet_time  \
Name
Administrator      0           0      1           3         0              0
Manager            0           0      0           0         1              1
Reception          1           5      0           0         0              0

Application    Word  Word_time
Name
Administrator     0          0
Manager           0          0
Reception         1          1
Finally, if necessary, convert the index back into a column:
df = df.reset_index().rename_axis(None, axis=1)
EDIT:
If some values may be negative, so that the sum can be 0 for an application that was actually used, an alternative is get_dummies with max:
# One row per Name: 1 if the user has any row for that application
df1 = (pd.get_dummies(df.set_index('Name')['Application'], dtype=int)
         .groupby(level=0, sort=False)
         .max())
df2 = df.pivot_table(index='Name',
                     columns='Application',
                     values='time',
                     aggfunc='sum',
                     fill_value=0)
df = df1.join(df2.add_suffix('_time'))
print(df)
               Email  Excel  Internet  Word  Email_time  Excel_time  \
Name
Administrator      0      1         0     0           0           3
Reception          1      0         0     1           5           0
Manager            0      0         1     0           0           0

               Internet_time  Word_time
Name
Administrator              0          0
Reception                  0          1
Manager                    1          0
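To see why the edit matters, here is a small sketch with made-up data (the frame `df_neg` and its values are hypothetical, not from the question). Two Excel rows cancel out to a sum of 0, so a flag derived from the sum misses the application, while a flag derived from the rows themselves does not.

```python
import pandas as pd

# Hypothetical data: Administrator's two Excel rows sum to 0.
df_neg = pd.DataFrame({
    "Name": ["Administrator", "Administrator", "Reception"],
    "Application": ["Excel", "Excel", "Word"],
    "time": [2, -2, 1],
})

totals = df_neg.pivot_table(index="Name",
                            columns="Application",
                            values="time",
                            aggfunc="sum",
                            fill_value=0)

# Flag derived from the summed time: a sum of 0 looks like "not used".
flags_from_sum = totals.ne(0).astype(int)

# Flag derived from the presence of a row, independent of the time values.
flags_from_rows = (
    pd.get_dummies(df_neg.set_index("Name")["Application"], dtype=int)
    .groupby(level=0, sort=False)
    .max()
)

print(flags_from_sum.loc["Administrator", "Excel"])   # 0
print(flags_from_rows.loc["Administrator", "Excel"])  # 1
```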
Answer 1 (score: 0)
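Use groupby with agg. Note that count gives the number of rows per Name/Application pair, so Administrator/Excel shows 2 rather than 1: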
a = df.groupby(['Name', 'Application']).time.agg(['count', 'sum'])
c = a['count'].unstack(fill_value=0)
s = a['sum'].unstack(fill_value=0).add_suffix('_time')
c.join(s).sort_index(axis=1)
Application    Email  Email_time  Excel  Excel_time  Internet  Internet_time  Word  Word_time
Name
Administrator      0           0      2           3         0              0     0          0
Manager            0           0      0           0         1              1     0          0
Reception          1           5      0           0         0              0     1          1
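If a strict 0/1 flag is wanted from this approach, one possible tweak (a sketch, not part of the original answer) is to cap the counts at 1 with `clip`. The sample frame is rebuilt here from the question's data so the snippet runs on its own.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Name": ["Administrator", "Reception", "Manager", "Administrator", "Reception"],
    "Application": ["Excel", "Word", "Internet", "Excel", "Email"],
    "time": [1, 1, 1, 2, 5],
})

a = df.groupby(["Name", "Application"])["time"].agg(["count", "sum"])

# Cap the row counts at 1 so they become binary indicators.
flags = a["count"].unstack(fill_value=0).clip(upper=1)
totals = a["sum"].unstack(fill_value=0).add_suffix("_time")

result = flags.join(totals).sort_index(axis=1)
print(result)
```

Because the flag comes from the row count rather than the summed time, it also stays at 1 when negative times cancel out.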